Implicit image and video compression using machine learning systems
Implicit neural models using INRs address the challenges of high-quality image and video compression by minimizing data size and computational complexity, offering efficient and privacy-focused solutions adaptable to diverse data types and domains.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- QUALCOMM INC
- Filing Date
- 2022-03-31
- Publication Date
- 2026-06-12
Abstract
Description
【Technical Field】
【0001】 This disclosure generally relates to data compression. For example, aspects of this disclosure include using a machine learning system to compress image and / or video content.
【Background Art】
【0002】 Many devices and systems enable media data (e.g., image data, video data, audio data, etc.) to be processed and output for consumption. Media data includes large amounts of data to meet increasing demands for image / video / audio quality, performance, and features. For example, consumers of video data generally desire high-quality videos with high fidelity, resolution, frame rate, etc. The large amounts of video data often required to meet these demands place a significant burden on communication networks and devices that process and store video data. Video coding techniques can be used to compress video data. One exemplary goal of video coding is to compress video data into a form that uses a lower bitrate while avoiding or minimizing degradation of video quality. As video services that are constantly evolving become available and the demand for large amounts of video data increases, coding techniques with better performance and efficiency are needed.
【Summary of the Invention】 【0003】
[0003] In some examples, systems and techniques for data compression and / or decompression using one or more machine learning systems are described. In some examples, a machine learning system (e.g., using one or more neural network systems) is provided for compressing and / or decompressing media data (e.g., video data, image data, audio data, etc.). According to at least one illustrative example, a method for processing image data is provided. This method may include receiving a plurality of images for compression by a neural network compression system, determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images, generating a first bitstream having a compressed version of the first plurality of weight values, and outputting the first bitstream for transmission to a receiver. 【0004】
[0004] In another example, a device for processing media data is provided, comprising at least one memory and at least one processor (for example, implemented in circuitry) communicatively coupled to the at least one memory. The at least one processor may be configured to receive a plurality of images for compression by a neural network compression system, determine a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images, generate a first bitstream having a compressed version of the first plurality of weight values, and output the first bitstream for transmission to a receiver. 【0005】
[0005] In another example, a non-transitory computer-readable medium is provided which includes at least one instruction stored thereon. The at least one instruction, when executed by one or more processors, can cause the one or more processors to receive a plurality of images for compression by a neural network compression system, determine a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images, generate a first bitstream having a compressed version of the first plurality of weight values, and output the first bitstream for transmission to a receiver. 【0006】
[0006] In another example, a device for processing image data is provided. The device may include means for receiving input data for compression by a neural network compression system; means for receiving a plurality of images for compression by a neural network compression system; means for determining a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images; means for generating a first bitstream having a compressed version of the first plurality of weight values; and means for outputting the first bitstream for transmission to a receiver. 【0007】
[0007] In another example, a method for processing media data is provided. This method may include receiving a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images, decompressing a first plurality of neural network weight values, and processing the first plurality of neural network weight values to produce a first image using a first neural network model. 【0008】
[0008] In another example, an apparatus for processing image data is provided, comprising at least one memory and at least one processor (for example, implemented in circuitry) communicatively coupled to the at least one memory. The at least one processor may be configured to receive a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images, to decompress the first plurality of neural network weight values, and to process the first plurality of neural network weight values to produce a first image using a first neural network model. 【0009】
[0009] In another example, a non-transitory computer-readable medium is provided which includes at least one instruction stored thereon. The at least one instruction, when executed by one or more processors, can cause the one or more processors to receive a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images, decompress the first plurality of neural network weight values, and process the first plurality of neural network weight values using a first neural network model to produce a first image. 【0010】
[0010] In another example, an apparatus for processing image data is provided. The apparatus may include means for receiving a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images, means for decompressing the first plurality of neural network weight values, and means for processing the first plurality of neural network weight values to produce a first image using a first neural network model. 【0011】
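The encoder-side and decoder-side steps summarized in the examples above (generate a bitstream holding a compressed version of the weight values, then decompress the weight values at the receiver) can be sketched as a simple round trip. This is only an illustration under stated assumptions: the weights are assumed to be already quantized to small integers, `zlib` stands in for whatever entropy coder an actual system would use, and the function names `encode_weights` and `decode_weights` are hypothetical.

```python
import struct
import zlib

def encode_weights(symbols):
    """Pack quantized weight symbols (16-bit ints, an assumption) into a compressed bitstream."""
    # Header: symbol count as an unsigned 32-bit int, followed by the symbols themselves.
    payload = struct.pack(f"<I{len(symbols)}h", len(symbols), *symbols)
    # zlib is only a stand-in for a learned or prior-based entropy coder.
    return zlib.compress(payload)

def decode_weights(bitstream):
    """Recover the quantized weight symbols from the bitstream."""
    payload = zlib.decompress(bitstream)
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{count}h", payload, 4))

symbols = [1, -2, 0, 0, 3]
bitstream = encode_weights(symbols)
restored = decode_weights(bitstream)
```

The receiver would then load the restored values into its copy of the model and evaluate the model to produce the image.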
[0011] In some embodiments, the device may be or be part of a camera (e.g., an IP camera), a mobile device (e.g., a mobile phone or so-called “smartphone”, or other mobile device), a smart wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, a 3D scanner, a multi-camera system, or other device. In some embodiments, the device includes one or more cameras for capturing one or more images. In some embodiments, the device further includes a display for displaying one or more images, notifications, and / or other displayable data. In some embodiments, the device described above may include one or more sensors. 【0012】
[0012] The summary of the present invention is not intended to identify the main or essential features of the claimed subject matter, nor is it intended to be used independently to determine the scope of the claimed subject matter. The subject matter should be understood by referring to the entire specification of this patent, any or all of the drawings, and the appropriate portion of each claim. 【0013】
[0013] The above will become clearer with reference to the following specification, claims, and accompanying drawings, along with other features and embodiments. 【0014】
[0014] Exemplary embodiments of this application will be described in detail below with reference to the following drawings.
[Brief Description of the Drawings]
【0015】 [Figure 1] A diagram illustrating an example of an image processing system, according to some examples of the present disclosure.
【0016】 [Figure 2A] A diagram illustrating an example of a fully-connected neural network, according to some examples of the present disclosure.
【0017】 [Figure 2B] A diagram illustrating an example of a locally-connected neural network, according to some examples of the present disclosure.
【0018】 [Figure 2C] A diagram illustrating an example of a convolutional neural network, according to some examples of the present disclosure.
【0019】 [Figure 2D] A diagram illustrating an example of a deep convolutional network (DCN) for recognizing visual features from an image, according to some examples of the present disclosure.
【0020】 [Figure 3] A block diagram illustrating an exemplary deep convolutional network (DCN), according to some examples of the present disclosure.
【0021】 [Figure 4] A diagram illustrating an example of a system including a transmitting device for compressing video content and a receiving device for decompressing a received bitstream into video content, according to some examples of the present disclosure.
【0022】 [Figure 5A] A diagram illustrating exemplary rate-distortion autoencoder systems, according to some examples of the present disclosure.
[Figure 5B] A diagram illustrating exemplary rate-distortion autoencoder systems, according to some examples of the present disclosure.
【0023】 [Figure 6] A diagram illustrating an exemplary inference process implemented by an exemplary neural network compression system that is fine-tuned using a model prior, according to some examples of the present disclosure.
【0024】 [Figure 7A] A diagram illustrating an exemplary image compression codec based on implicit neural representations, according to some examples of the present disclosure.
【0025】 [Figure 7B] A diagram illustrating another exemplary image compression codec based on implicit neural representations, according to some examples of the present disclosure.
【0026】 [Figure 8A] A diagram illustrating an example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure.
【0027】 [Figure 8B] A diagram illustrating another example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure.
【0028】 [Figure 8C] A diagram illustrating another example of a compression pipeline for a group of pictures using implicit neural representations, according to some examples of the present disclosure.
【0029】 [Figure 9] A diagram illustrating a video frame encoding order, according to some examples of the present disclosure.
【0030】 [Figure 10] A diagram illustrating an exemplary process for performing implicit neural compression, according to some examples of the present disclosure.
【0031】 [Figure 11] A flowchart illustrating an example of a process for compressing image data based on implicit neural representations, according to some examples of the present disclosure.
【0032】 [Figure 12] A flowchart illustrating another example of a process for compressing image data based on implicit neural representations, according to some examples of the present disclosure.
【0033】 [Figure 13] A flowchart illustrating an example of a process for decompressing image data based on implicit neural representations, according to some examples of the present disclosure.
【0034】 [Figure 14] A flowchart illustrating an example of a process for compressing image data based on implicit neural representations, according to some examples of the present disclosure.
【0035】 [Figure 15] A flowchart illustrating an example of a process for decompressing image data based on implicit neural representations, according to some examples of the present disclosure.
【0036】 [Figure 16] A diagram illustrating an exemplary computing system, according to some examples of the present disclosure.
[Modes for Carrying Out the Invention]
[0016] 【0037】 Several aspects and embodiments of this disclosure are provided below. As will be apparent to those skilled in the art, some of these aspects and embodiments may be applied independently, and some may be applied in combination. For illustrative purposes, specific details are provided in the following description to provide a complete understanding of the embodiments of this application. However, it will be apparent that various embodiments may be carried out without these specific details. The figures and description are not limiting.
[0017] 【0038】 The following description provides only exemplary embodiments and does not limit the scope, applicability, or configuration of the disclosure. Rather, the following description of exemplary embodiments provides a description that enables the implementation of the exemplary embodiments to those skilled in the art. It should be understood that various modifications can be made to the function and configuration of the elements without departing from the spirit and scope of this application, as described in the appended claims.
[0018] 【0039】As mentioned above, media data (e.g., image data, video data, and / or audio data) can contain large amounts of data, especially as the demand for high-quality video data continues to grow. For example, consumers of image, audio, and video data generally desire increasingly higher levels of quality, such as high fidelity, resolution, and frame rate. However, the large amounts of data required to meet such demands can place a considerable burden on communication networks and devices that process and store video data, including high bandwidth and network resource requirements. Therefore, compression algorithms (also called coding algorithms or tools) are advantageous for reducing the amount of data required for the storage and / or transmission of image and video data.
[0019] 【0040】 Various techniques can be used to compress media data. Image data has been compressed using algorithms such as Joint Photographic Experts Group (JPEG) and Better Portable Graphics (BPG), among others. In recent years, neural network-based compression methods have shown considerable promise for compressing image data. Video coding can be performed according to specific video coding standards. Exemplary video coding standards include High Efficiency Video Coding (HEVC), Essential Video Coding (EVC), Advanced Video Coding (AVC), Moving Picture Experts Group (MPEG) coding, and Versatile Video Coding (VVC). However, such conventional image and video coding techniques can produce artifacts in the reconstructed image after decoding has been performed.
[0020] 【0041】 In some embodiments, systems, apparatus, processes (also called methods), and computer-readable media (collectively referred to herein as “systems and techniques”) are described for performing compression and decompression (collectively referred to as coding, also called encoding and decoding) of data (e.g., images, video, audio, etc.) using one or more machine learning systems. For example, the systems and techniques may be implemented using implicit neural models. Implicit neural models may be based on implicit neural representations (INRs). As described herein, an implicit neural model may take coordinate positions (e.g., coordinates within an image or video frame) as input and output pixel values (e.g., color values for an image or video frame, such as a color value for each coordinate position or pixel). In some cases, implicit neural models may also be based on IPB frame schemes. In some cases, implicit neural models can modify the input data to model optical flow.
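The coordinate-in, pixel-value-out behavior described above can be illustrated with a very small fully connected network. The following is a minimal sketch, not the architecture of the disclosure: the layer sizes, the sine hidden activation, the sigmoid output, and the helper names `make_mlp`, `inr_forward`, and `render` are all assumptions chosen for illustration, and the weights are random rather than fitted.

```python
import math
import random

def make_mlp(sizes, seed=0):
    """Create random weights for a small fully connected network (hypothetical helper)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.uniform(-1, 1) / math.sqrt(n_in) for _ in range(n_in)]
             for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers

def inr_forward(layers, coord):
    """Map one (x, y) coordinate to an (r, g, b) value in [0, 1]."""
    h = list(coord)
    for i, (w, b) in enumerate(layers):
        z = [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]
        is_last = i == len(layers) - 1
        # Sine hidden activations are an assumption; the sigmoid keeps outputs in [0, 1].
        h = [1 / (1 + math.exp(-v)) for v in z] if is_last else [math.sin(v) for v in z]
    return tuple(h)

def render(layers, width, height):
    """Decode an image by querying the model at every pixel coordinate in [-1, 1]."""
    image = []
    for row in range(height):
        y = 2 * row / max(height - 1, 1) - 1
        image.append([inr_forward(layers, (2 * col / max(width - 1, 1) - 1, y))
                      for col in range(width)])
    return image

layers = make_mlp([2, 8, 3])
image = render(layers, width=4, height=3)
```

In this picture, the image is carried entirely by the weights in `layers`; decoding is nothing more than evaluating the model on a coordinate grid.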
[0021] 【0042】 In some cases, implicit neural models can model optical flow using implicit neural representations where local transformations may be element-wise additions. In some cases, implicit models can model optical flow by adjusting input coordinate positions to produce corresponding output pixel values. For example, element-wise addition of inputs may lead to local transformations in the output, which can eliminate the need for pixel movement and the associated computational complexity.
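The idea that adding a displacement to the input coordinates produces motion in the output, with no explicit pixel movement, can be shown with a toy example. In this sketch the functions `base_frame` and `flow` are hypothetical stand-ins for a fitted implicit model and a flow model; neither is taken from the disclosure.

```python
def base_frame(x, y):
    """Stand-in for a fitted implicit model: a bright vertical stripe at x == 2."""
    return 1.0 if x == 2 else 0.0

def flow(x, y):
    """Stand-in for a flow model: every position samples one pixel to its left."""
    return (-1, 0)

def next_frame(x, y):
    """Produce the next frame by shifting the query coordinates, not the pixels."""
    dx, dy = flow(x, y)
    # Element-wise addition on the input coordinates; no warping step is needed.
    return base_frame(x + dx, y + dy)

row_prev = [base_frame(x, 0) for x in range(5)]
row_next = [next_frame(x, 0) for x in range(5)]
```

Here `row_prev` has the stripe at index 2 and `row_next` has it at index 3, so the content appears to move one pixel to the right even though only the inputs were changed.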
[0022] 【0043】One or more machine learning systems, trained as described herein, may be used to perform data compression and / or decompression, including image, video, and / or audio compression and decompression. The machine learning systems described herein may be trained to perform compression / decompression techniques that produce high-quality data output. The systems and techniques described herein can perform compression and / or decompression of any type of data. For example, in some cases, the systems and techniques described herein can perform compression and / or decompression of image data. As another example, in some cases, the systems and techniques described herein can perform compression and / or decompression of video data. The terms “image” and “frame” as used herein are used interchangeably and refer to a standalone image or frame (e.g., a photograph), or a group or sequence of images or frames (e.g., a video, or another sequence of images / frames). As another example, in some cases, the systems and techniques described herein can perform compression and / or decompression of audio data. For simplicity, illustrative and explanatory purposes, the systems and techniques described herein will be described in relation to the compression and / or decompression of image data (e.g., images or frames, video, etc.). However, as stated above, the concepts described herein may also apply to other modalities, such as audio data and any other types of data.
[0023] 【0044】 The compression models used by encoders and / or decoders may be generalizable to different types of data. Furthermore, by utilizing implicit neural models with the various properties described herein, machine learning systems can improve the compression and / or decompression performance, bitrate, quality, and / or efficiency for specific sets of data. For example, implicit neural model-based machine learning systems can eliminate the need to store a pre-trained neural network on the receiver side (and, in some cases, on the transmitter side). The neural networks on the transmitter and receiver sides can be implemented using lightweight frameworks. Another advantage of such machine learning systems is the absence of flow operations in the actual machine learning system (e.g., the neural network), which can be difficult to implement in some cases (e.g., in hardware). In addition, the decoding function may be faster than that in standard machine learning-based coders / decoders (codecs). In some cases, the implicit neural model-based machine learning systems described herein do not require separate training datasets, as they can be implicitly trained using the data to be encoded (e.g., coordinate grids, and current instances such as images, video frames, and videos). The implicit neural model configurations described herein can also help avoid potential privacy issues. Such systems can also work well with data from different domains, including those for which suitable training data is unavailable.
[0024] 【0045】 In some examples, a machine learning system can include one or more neural networks. Machine learning (ML) is a subset of artificial intelligence (AI). ML systems include algorithms and statistical models that computer systems can use to perform various tasks by relying on patterns and inference, without using explicit instructions. An example of an ML system is a neural network (also called an artificial neural network), which may consist of interconnected groups of artificial neurons (e.g., neuron models). Neural networks can be used for a variety of applications and / or devices, such as image analysis and / or computer vision applications, Internet Protocol (IP) cameras, Internet of Things (IoT) devices, autonomous vehicles, and service robots, among others.
[0025] 【0046】 Individual nodes in a neural network can emulate biological neurons by taking input data and performing simple operations on that data. The results of the simple operations performed on the input data are selectively passed to other neurons. Weight values are associated with each vector and node in the network, and these values constrain how the input data relates to the output data. For example, the input data of each node may be multiplied by the corresponding weight value, and these products may be summed. The sum of the products may be adjusted by an optional bias, and an activation function may be applied to the result, yielding the node's output signal or "output activation" (sometimes called an activation map or feature map). The weight values may initially be determined by an iterative flow of training data through the network (for example, the weight values are established during a training phase, in which the network learns how to identify certain classes based on typical input data characteristics of those classes).
[0026] 【0047】 Different types of neural networks exist, such as deep generative neural network models (e.g., generative adversarial networks (GANs)), recurrent neural network (RNN) models, multilayer perceptron (MLP) neural network models, convolutional neural network (CNN) models, and autoencoders (AEs), among others. For example, a GAN is a form of generative neural network that can learn patterns in input data so that the neural network model can produce new synthetic outputs that could plausibly have come from the original dataset. A GAN can include two neural networks working together. One of the neural networks (called the generative neural network or generator, denoted as G(z)) produces the synthetic output, and the other neural network (called the discriminative neural network or discriminator, denoted as D(X)) evaluates the output for authenticity (whether the output is from the original dataset, such as the training dataset, or was generated by the generator). Training inputs and outputs can include images as illustrative examples. The generator is trained to deceive the discriminator into deciding that the synthesized images it generates are real images from the dataset. As training continues, the generator becomes better at producing synthesized images that look like real images. The discriminator continues to find defects in the synthesized images, and the generator learns what the discriminator is looking at to identify those defects. Once the network is trained, the generator is capable of producing realistic-looking images that the discriminator cannot distinguish from real images.
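The per-node computation described above (multiply inputs by weights, sum the products, add an optional bias, apply an activation function) can be written in a few lines. This is a generic sketch; the function name `node_output` and the default `tanh` activation are assumptions for illustration.

```python
import math

def node_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Compute one node's output activation from its inputs."""
    # Multiply each input by its weight and sum the products.
    total = sum(x * w for x, w in zip(inputs, weights))
    # Adjust by the optional bias, then apply the activation function.
    return activation(total + bias)

def relu(v):
    """Rectified linear activation, shown as one alternative choice."""
    return max(0.0, v)

linear_out = node_output([1.0, 2.0], [0.5, -0.25], activation=lambda v: v)
relu_out = node_output([1.0, 2.0], [0.5, 0.5], bias=-2.0, activation=relu)
```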
[0027] 【0048】RNNs operate on the principle of saving the output of a layer and feeding this output back into the input to help predict the results of the layer. In MLP neural networks, data may be fed into an input layer, and one or more hidden layers provide a level of abstraction to the data. Predictions can then be made on the output layer based on the abstracted data. MLPs may be particularly well-suited to classification prediction problems where the input is assigned a class or label. Convolutional neural networks (CNNs) are a type of feedforward artificial neural network. A CNN may consist of a set of artificial neurons that collectively tile the input space, each having a receptive field (e.g., a spatially localized region of the input space). CNNs have numerous applications, including pattern recognition and classification.
[0028] 【0049】 In a layered neural network architecture (called a deep neural network when multiple hidden layers exist), the output of the first layer of artificial neurons becomes the input to the second layer of artificial neurons, the output of the second layer of artificial neurons becomes the input to the third layer of artificial neurons, and so on. Convolutional neural networks can be trained to recognize a hierarchy of features. Computation in a convolutional neural network architecture can be distributed across a set of processing nodes that may consist of one or more computational chains. These multilayer architectures can be trained one layer at a time and fine-tuned using backpropagation.
[0029] 【0050】 Autoencoders (AEs) can learn efficient data coding in an unsupervised manner. In some examples, an AE can learn a representation (e.g., data coding) for a set of data by training the network to ignore signal noise. An AE can include an encoder and a decoder. The encoder can map input data to a code, and the decoder can map the code to a reconstruction of the input data. In some examples, a rate-distortion autoencoder (RD-AE) can be trained to minimize the average rate-distortion loss across a dataset of data points, such as image and / or video data points. In some cases, an RD-AE can perform a forward pass at inference time to encode new data points.
[0030] 【0051】 In some examples, a machine learning system for data compression and / or decompression may include a neural network that is implicitly trained (for example, using the image data to be compressed). In some cases, data compression and / or decompression based on implicit neural representations (INRs) may be implemented using a convolution-based architecture. In some embodiments, encoding image data may include selecting a neural network architecture and overfitting the network weights to the image data. In some examples, the decoder may include the neural network architecture and receive the network weights from the encoder. In other examples, the decoder may receive the neural network architecture from the encoder.
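A rate-distortion loss trades reconstruction error against the number of bits spent. The sketch below uses one common form, distortion plus a weighted rate term; the mean-squared-error distortion, the trade-off weight `beta`, and the function names are assumptions, since the text above does not fix a particular formula.

```python
def mse(original, reconstruction):
    """Mean squared error between two equally sized sequences of sample values."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)

def rd_loss(original, reconstruction, rate_bits, beta):
    """Rate-distortion loss in the assumed form D + beta * R."""
    return mse(original, reconstruction) + beta * rate_bits

def average_rd_loss(examples, beta):
    """Average the loss over (original, reconstruction, rate_bits) triples."""
    return sum(rd_loss(o, r, bits, beta) for o, r, bits in examples) / len(examples)

single = rd_loss([0.0, 1.0], [0.0, 0.0], rate_bits=10, beta=0.1)
```

A larger `beta` penalizes bits more heavily and pushes training toward smaller bitstreams at the cost of higher distortion.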
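Overfitting a model to the data being encoded can be illustrated on a single row of pixels. The sketch below is deliberately simplified and is not the disclosed architecture: it fits one linear layer on fixed sinusoidal features of the coordinate using plain gradient descent, and the names `features`, `fit_to_signal`, and `decode`, the feature choice, and the hyperparameters are all assumptions.

```python
import math

def features(x, n_freq=4):
    """Fixed sinusoidal features of a coordinate x in [0, 1] (an assumed encoding)."""
    feats = [1.0]
    for k in range(1, n_freq + 1):
        feats += [math.sin(math.pi * k * x), math.cos(math.pi * k * x)]
    return feats

def fit_to_signal(pixels, steps=500, lr=0.05):
    """Overfit one linear layer to a single row of pixels with gradient descent."""
    n = len(pixels)
    feats = [features(i / (n - 1)) for i in range(n)]
    weights = [0.0] * len(feats[0])
    for _ in range(steps):
        grad = [0.0] * len(weights)
        for f, target in zip(feats, pixels):
            err = sum(w * v for w, v in zip(weights, f)) - target
            for j, v in enumerate(f):
                grad[j] += 2 * err * v / n  # gradient of the mean squared error
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights

def decode(weights, n):
    """Reconstruct the row by evaluating the fitted model at every coordinate."""
    return [sum(w * v for w, v in zip(weights, features(i / (n - 1)))) for i in range(n)]

pixels = [0.1, 0.4, 0.9, 0.4, 0.1, 0.3, 0.7, 0.2]
weights = fit_to_signal(pixels)
reconstruction = decode(weights, len(pixels))
```

Only `weights` would need to reach the decoder in this sketch, provided the decoder already knows the model structure (here, the `features` function and the row length).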
[0031] 【0052】In some cases, neural network weights can be large, which can increase the bitrate and / or computational overhead required to send the weights to the decoder. In some examples, weights can be quantized to reduce their overall size. In some embodiments, quantized weights can be compressed using a weight prior. A weight prior can reduce the amount of data sent to the decoder. In some cases, a weight prior can be designed to reduce the cost of sending model weights. For example, a weight prior can be used to reduce and / or limit the bitrate overhead of the weights.
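Quantizing the weights before transmission can be sketched with uniform scalar quantization. The step size and the function names `quantize` and `dequantize` are assumptions for illustration; the text above does not prescribe a specific quantizer.

```python
def quantize(weights, step):
    """Map each weight to the index of its nearest multiple of `step`."""
    return [round(w / step) for w in weights]

def dequantize(symbols, step):
    """Recover approximate weights from the integer symbols."""
    return [s * step for s in symbols]

weights = [0.26, -0.49, 0.0, 0.9]
symbols = quantize(weights, step=0.25)
approx = dequantize(symbols, step=0.25)
max_error = max(abs(w - a) for w, a in zip(weights, approx))
```

A coarser `step` yields fewer distinct symbols, which are cheaper to encode, at the cost of a larger reconstruction error of up to half a step per weight.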
[0032] 【0053】 In some cases, the design of the weight prior can be improved as described further herein. In some illustrative examples, the weight prior design may include an independent Gaussian weight prior. In other illustrative examples, the weight prior design may include an independent Laplace weight prior. In other illustrative examples, the weight prior design may include an independent spike-and-slab prior. In some illustrative examples, the weight prior may include complex dependencies learned by a neural network.
[0033] 【0054】 Figure 1 shows an example of an image processing system 100, according to some examples of the present disclosure. In some cases, the image processing system 100 may include a central processing unit (CPU) 102 or a multi-core CPU configured to perform one or more of the functions described herein. Variables (e.g., neural signals and synaptic weights), system parameters associated with a computing device (e.g., a neural network with weights), delays, frequency bin information, task information, and other information may be stored in a memory block associated with a neural processing unit (NPU) 108, a memory block associated with the CPU 102, a memory block associated with a graphics processing unit (GPU) 104, a memory block associated with a digital signal processor (DSP) 106, a memory block 118, or may be distributed across multiple blocks. Instructions executed on the CPU 102 may be loaded from program memory associated with the CPU 102 and / or from the memory block 118.
[0034] 【0055】 The image processing system 100 may include additional processing blocks adapted to specific functions, such as a GPU 104; a DSP 106; a connectivity block 110, which may include fifth-generation (5G) connectivity, fourth-generation long-term evolution (4G LTE®) connectivity, Wi-Fi® connectivity, USB connectivity, Bluetooth® connectivity, and the like; and / or a multimedia processor 112 that may, for example, detect and recognize features. In one implementation, the NPU 108 is implemented in the CPU 102, DSP 106, and / or GPU 104. The image processing system 100 may also include a sensor processor 114, one or more image signal processors (ISPs) 116, and / or storage 120. In some examples, the image processing system 100 may be based on the ARM instruction set.
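The effect of an independent Gaussian or Laplace weight prior on bitrate can be estimated by computing how many bits each quantized weight would cost under that prior. The sketch below assumes zero-mean priors, uniform quantization bins of width `step`, and an ideal entropy coder that spends -log2(p) bits per symbol; the function names and parameter values are hypothetical.

```python
import math

def gaussian_cdf(x, sigma):
    """CDF of a zero-mean Gaussian with standard deviation sigma."""
    return 0.5 * (1 + math.erf(x / (sigma * math.sqrt(2))))

def laplace_cdf(x, scale):
    """CDF of a zero-mean Laplace distribution with the given scale."""
    return 0.5 * math.exp(x / scale) if x < 0 else 1 - 0.5 * math.exp(-x / scale)

def bits_for_symbols(symbols, step, cdf):
    """Ideal code length, in bits, of quantized weights under an independent prior."""
    total = 0.0
    for q in symbols:
        # Probability mass the prior assigns to the quantization bin around q * step.
        p = cdf((q + 0.5) * step) - cdf((q - 0.5) * step)
        total += -math.log2(p)
    return total

symbols = [0, 0, 1, -1, 0, 3]
gauss_bits = bits_for_symbols(symbols, 0.25, lambda x: gaussian_cdf(x, 0.5))
laplace_bits = bits_for_symbols(symbols, 0.25, lambda x: laplace_cdf(x, 0.5))
```

Under either prior, symbols near zero are cheap and outliers are expensive, so a prior that matches the actual weight distribution reduces the cost of sending the model.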
[0035] 【0056】The image processing system 100 may be part of one or more computing devices. In some examples, the image processing system 100 may be part of one or more electronic devices, such as a camera system (e.g., a digital camera, IP camera, video camera, security camera, etc.), a telephone system (e.g., a smartphone, cellular phone, conferencing system, etc.), a desktop computer, an XR device (e.g., a head-mounted display, etc.), a smart wearable device (e.g., a smartwatch, smart glasses, etc.), a laptop or notebook computer, a tablet computer, a set-top box, a television, a display device, a digital media player, a gaming console, a video streaming device, a drone, a computer in a car, a system-on-a-chip (SOC), an Internet of Things (IoT) device, or any other suitable electronic device.
[0036] 【0057】 While the image processing system 100 is shown to include several components, those skilled in the art will understand that the image processing system 100 may include more or fewer components than those shown in Figure 1. For example, in some cases, the image processing system 100 may also include one or more memory devices (e.g., RAM, ROM, cache, etc.), one or more networking interfaces (e.g., wired and / or wireless communication interfaces, etc.), one or more display devices, and / or other hardware or processing devices not shown in Figure 1. Illustrative examples of computing devices and hardware components that may be implemented with the image processing system 100 are described below with respect to Figure 16.
[0037] 【0058】Image processing system 100 and / or its components may be configured to perform compression and / or decompression (collectively referred to as image coding, also called encoding and / or decoding) using machine learning systems and techniques described herein. In some cases, image processing system 100 and / or its components may be configured to perform image or video compression and / or decompression using techniques described herein. In some examples, the machine learning system may leverage a deep learning neural network architecture to perform compression and / or decompression of image, video, and / or audio data. By using a deep learning neural network architecture, the machine learning system can increase the efficiency and speed of compression and / or decompression of content on a device. For example, a device using the compression and / or decompression techniques described may efficiently compress one or more images using machine learning-based techniques, send the compressed one or more images to a receiving device, and the receiving device may efficiently decompress one or more compressed images using the machine learning-based techniques described herein. Images as used herein may refer to still images and / or video frames associated with sequences of frames (e.g., video).
[0038] 【0059】As mentioned above, a neural network is an example of a machine learning system. A neural network can include an input layer, one or more hidden layers, and an output layer. Data is provided from the input nodes of the input layer, processing is carried out by the hidden nodes of one or more hidden layers, and the output is produced through the output nodes of the output layer. Deep learning networks generally include multiple hidden layers. Each layer of a neural network can include a feature map or activation map, which can include artificial neurons (or nodes). Feature maps can include filters, kernels, etc. Nodes can include one or more weights used to indicate the importance of one or more nodes in the layer. In some cases, a deep learning network can have a series of many hidden layers, with early layers used to determine simple, low-level characteristics of the input, and later layers building a hierarchy of more complex and abstract characteristics.
[0039] 【0060】 Deep learning architectures can learn feature hierarchies. For example, given visual data, the first layer may learn to recognize relatively simple features in the input stream, such as edges. In another example, given auditory data, the first layer may learn to recognize spectral power at a specific frequency. A second layer, taking the output of the first layer as input, may learn to recognize combinations of features, such as simple shapes in the case of visual data, or combinations of sounds in the case of auditory data. For example, higher layers may learn to represent complex shapes in visual data, or words in auditory data. Even higher layers may learn to recognize common visual objects or speech phrases.
[0040] 【0061】 Deep learning architectures can work particularly well when applied to problems with natural hierarchical structures. For example, classifying motorized vehicles can benefit from initial learning to recognize wheels, windshields, and other features. These features can then be combined in different ways in higher layers to recognize cars, trucks, and airplanes.
[0041] 【0062】 Neural networks can be designed using various connectivity patterns. In a feedforward network, information is passed from lower layers to higher layers, and each neuron in a given layer communicates with neurons in higher layers. As described above, a hierarchical representation can be built up in the successive layers of a feedforward network. Neural networks can also have recurrent or feedback (also called top-down) connections. In a recurrent connection, the output from a neuron in a given layer can be communicated to another neuron in the same layer. Recurrent architectures can be useful for recognizing patterns that span two or more chunks of input data delivered sequentially to the neural network. A connection from a neuron in a given layer to a neuron in a lower layer is called a feedback (or top-down) connection. Networks with many feedback connections can be useful when the recognition of a higher-level concept can help discriminate specific lower-level features of the input.
[0042] 【0063】The connections between layers of a neural network can be fully connected or locally connected. Figure 2A shows an example of a fully connected neural network 202. In the fully connected neural network 202, a neuron in the first layer can communicate its output to every neuron in the second layer, such that each neuron in the second layer receives input from every neuron in the first layer. Figure 2B shows an example of a locally connected neural network 204. In the locally connected neural network 204, neurons in the first layer can be connected to a limited number of neurons in the second layer. More generally, the locally connected layers of the locally connected neural network 204 can be configured such that each neuron in the layer has the same or similar connectivity pattern, but with different connectivity strengths (e.g., 210, 212, 214, and 216). The connectivity pattern of local connections can create spatially distinct receptive fields in the upper layers, as higher-layer neurons in a given region can receive inputs that are tuned through training to the properties of a limited portion of the total input to the network.
[0043] 【0064】 An example of a locally connected neural network is a convolutional neural network. Figure 2C shows an example of a convolutional neural network 206. The convolutional neural network 206 may be configured such that the connection strength associated with the input for each neuron in the second layer is shared (e.g., 208). Convolutional neural networks may be suitable for problems where the spatial location of the input is meaningful. The convolutional neural network 206 may be used to perform one or more embodiments of video compression and / or decompression according to embodiments of this disclosure.
[0044] 【0065】One type of convolutional neural network is the deep convolutional network (DCN). Figure 2D shows a detailed example of DCN200 designed to recognize visual features from an image 226 input from an image capture device 230, such as an in-vehicle camera. In this example, DCN200 can be trained to identify traffic signs and the numbers provided on them. Of course, DCN200 can be trained for other tasks, such as identifying lane markings or traffic signals.
[0045] 【0066】 DCN200 can be trained using supervised learning. During training, DCN200 may be presented with images, such as image 226 of a speed limit sign, and then a forward pass may be computed to produce output 222. DCN200 may include a feature extraction section and a classification section. Upon receiving image 226, a convolutional layer 232 may apply a convolutional kernel (not shown) to image 226 to generate a first set of feature maps 218. As an example, the convolutional kernel for convolutional layer 232 may be a 5x5 kernel that generates a 28x28 feature map. In this example, four different feature maps are generated in the first set of feature maps 218, so four different convolutional kernels were applied to image 226 in convolutional layer 232. Convolutional kernels are sometimes called filters or convolutional filters.
[0046] 【0067】A first set of feature maps 218 may be subsampled by a max pooling layer (not shown) to generate a second set of feature maps 220. The max pooling layer reduces the size of the first set of feature maps 218; that is, the size of the second set of feature maps 220, such as 14×14, is smaller than the size of the first set of feature maps 218, such as 28×28. The reduced size provides similar information to subsequent layers while reducing memory consumption. The second set of feature maps 220 may be further convolved through one or more subsequent convolutional layers (not shown) to generate one or more subsequent sets of feature maps (not shown).
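The feature-map sizes in this example follow from simple shape arithmetic. The sketch below is a minimal NumPy illustration, not the network of Figure 2D. It assumes a 32x32 single-channel input, no padding, a stride of 1, random kernel values, and 2x2 max pooling with a stride of 2, none of which is stated in the text. The function names `conv2d_valid` and `max_pool_2x2` are hypothetical.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid (no padding), stride-1 2-D cross-correlation."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(0)
image = rng.random((32, 32))               # assumed input size
kernels = rng.standard_normal((4, 5, 5))   # four 5x5 kernels, random values

# Four kernels give four feature maps; 32 - 5 + 1 = 28 per side.
first_maps = np.stack([conv2d_valid(image, k) for k in kernels])
# Pooling halves each spatial dimension: 28 -> 14.
second_maps = np.stack([max_pool_2x2(m) for m in first_maps])
print(first_maps.shape, second_maps.shape)
```

Under these assumptions the first set of feature maps has shape (4, 28, 28) and the second set has shape (4, 14, 14), which matches the sizes given in the text.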
[0047] 【0068】 In the example in Figure 2D, a second set of feature maps 220 is convolved to generate a first feature vector 224. Furthermore, the first feature vector 224 is further convolved to generate a second feature vector 228. Each feature in the second feature vector 228 may contain a number corresponding to a possible feature of image 226, such as "label", "60", and "100". A softmax function (not shown) can convert the numbers in the second feature vector 228 into probabilities. Thus, the output 222 of DCN200 is the probability that image 226 contains one or more features.
[0048] 【0069】 In this example, the probabilities in output 222 for "label" and "60" are higher than the probabilities for other outputs in output 222, such as "30", "40", "50", "70", "80", "90", and "100". Before training, the outputs 222 generated by DCN200 may be inaccurate. Therefore, an error can be calculated between output 222 and the target output. The target output is the ground truth of image 226 (e.g., "label" and "60"). The weights of DCN200 can then be adjusted so that the output 222 of DCN200 is more closely matched to the target output.
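The conversion from feature-vector numbers to probabilities can be sketched as follows. This is a minimal illustration of a softmax function only. The score values are made up for the example and do not come from DCN200.

```python
import numpy as np

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shifted = scores - np.max(scores)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Feature names follow the example in the text; the scores are hypothetical.
features = ["label", "30", "40", "50", "60", "70", "80", "90", "100"]
scores = np.array([4.0, 0.1, 0.2, 0.3, 3.5, 0.2, 0.1, 0.0, 0.1])

probs = softmax(scores)
top_two = [features[i] for i in np.argsort(probs)[-2:]]
print(dict(zip(features, np.round(probs, 3))), top_two)
```

With these scores, "label" and "60" receive the two highest probabilities, as in the trained case described in the text.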
[0049] 【0070】To adjust the weights, the learning algorithm may compute a gradient vector for the weights. The gradient may indicate the amount by which the error increases or decreases when the weights are adjusted. In the top layer, the gradient may directly correspond to the weight values connecting the activated neurons in the second-to-last layer to the neurons in the output layer. In lower layers, the gradient may depend on the weight values and the computed error gradients from the upper layers. The weights can then be adjusted to reduce the error. This method of adjusting weights is sometimes called "backpropagation" because it involves a "backward pass" through the neural network.
[0050] 【0071】 In practice, the error gradient of the weights can be calculated over a small number of examples so that the calculated gradient approximates the true error gradient. This approximation method is sometimes called stochastic gradient descent. Stochastic gradient descent can be repeated until the achievable error rate for the entire system no longer decreases, or until the error rate reaches a target level. After training, the DCN may be presented with new images, and the forward pass through the network may produce an output 222 that can be considered the DCN's inference or prediction.
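The idea of estimating the gradient from a small number of examples can be illustrated on a toy problem. The sketch below fits a single weight and bias to synthetic data with mini-batch stochastic gradient descent. The data, learning rate, batch size, and epoch count are assumptions chosen for illustration and are unrelated to DCN training in this disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 3*x + 1 plus a little noise.
x = rng.uniform(-1.0, 1.0, size=512)
y = 3.0 * x + 1.0 + 0.01 * rng.standard_normal(512)

w, b = 0.0, 0.0
learning_rate, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(len(x))
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        err = (w * x[idx] + b) - y[idx]
        # Gradient of the mean squared error, estimated on the mini-batch only.
        grad_w = 2.0 * np.mean(err * x[idx])
        grad_b = 2.0 * np.mean(err)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))
```

Each update uses 32 of the 512 examples, so the gradient is an approximation of the full-dataset gradient, yet the parameters still approach the values used to generate the data.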
[0051] 【0072】A deep belief network (DBN) is a probabilistic model with multiple layers of hidden nodes. DBNs can be used to extract hierarchical representations of training datasets. DBNs can be obtained by stacking layers of restricted Boltzmann machines (RBMs). RBMs are a type of artificial neural network that can learn probability distributions across a set of inputs. Because RBMs can learn probability distributions in the absence of information about the class to which each input should be categorized, RBMs are often used in unsupervised learning. Using hybrid unsupervised and supervised paradigms, the lower RBM of a DBN can be trained in an unsupervised manner and can act as a feature extractor, while the upper RBM can be trained in a supervised manner (on a joint distribution of inputs from previous layers and target classes) and can act as a classifier.
[0052] 【0073】 A deep convolutional network (DCN) is a network of convolutional networks configured with additional pooling and normalization layers. DCNs have achieved state-of-the-art performance for many tasks. DCNs can be trained using supervised learning, where both the input and output targets are known for many samples and the network weights are modified using gradient descent methods.
[0053] 【0074】 A DCN can be a feedforward network. Furthermore, as described above, the connections from neurons in the first layer of a DCN to groups of neurons in the next higher layer are shared across the neurons in the first layer. The feedforward and shared connections of a DCN can be leveraged for high-speed processing. The computational burden of a DCN can be much less than that of a similarly sized neural network with recurrent or feedback connections, for example.
[0054] 【0075】The processing of each layer of a convolutional network can be considered as a spatially invariant template or basis projection. If the input is initially decomposed into multiple channels, such as the red, green, and blue channels of a color image, the convolutional network trained on that input can be considered three-dimensional, with two spatial dimensions along the image axes and a third dimension capturing the color information. The output of the convolutional connections can be thought of as forming a feature map in subsequent layers, where each element of the feature map (e.g., 220) may receive input from various neurons in the previous layer (e.g., feature map 218) and from each of the multiple channels. The values in the feature map can be further processed with a non-linearity, such as a rectification, max(0, x). Values from adjacent neurons can be further pooled, which corresponds to downsampling and may provide further local invariance and dimensionality reduction.
[0055] 【0076】 Figure 3 is a block diagram showing an example of a deep convolutional network 350. The deep convolutional network 350 may include several different types of layers based on connectivity and weight sharing. As shown in Figure 3, the deep convolutional network 350 includes convolutional blocks 354A and 354B. Each of the convolutional blocks 354A and 354B may consist of a convolutional layer (CONV) 356, a normalization layer (LNorm) 358, and a maximum pooling layer (MAX POOL) 360.
[0056] 【0077】A convolutional layer 356 may include one or more convolutional filters that can be applied to the input data 352 to generate a feature map. Although only two convolutional blocks 354A and 354B are shown, this disclosure is not limited in that way, and instead, any number of convolutional blocks (e.g., blocks 354A and 354B) may be included in the deep convolutional network 350 according to design preferences. A normalization layer 358 may normalize the output of the convolutional filters. For example, the normalization layer 358 may perform whitening or lateral inhibition. A maximum pooling layer 360 may perform downsampling aggregation across space for local invariance and dimensionality reduction.
[0057] 【0078】 For example, the parallel filter bank of the deep convolutional network may be loaded onto the CPU 102 or GPU 104 of the image processing system 100 to achieve high performance and low power consumption. In an alternative embodiment, the parallel filter bank may be loaded onto the DSP 106 or ISP 116 of the image processing system 100. Furthermore, the deep convolutional network 350 may have access to other processing blocks that may be present on the image processing system 100, such as the sensor processor 114.
[0058] 【0079】The deep convolutional network 350 may also include one or more fully connected layers, such as layer 362A (labeled "FC1") and layer 362B (labeled "FC2"). The deep convolutional network 350 may further include a logistic regression (LR) layer 364. Between each layer 356, 358, 360, 362, 364 of the deep convolutional network 350 are weights (not shown) to be updated. The output of each layer (e.g., 356, 358, 360, 362, 364) may serve as input to subsequent layers among the layers (e.g., 356, 358, 360, 362, 364) in the deep convolutional network 350 to learn a hierarchical feature representation from the input data 352 (e.g., image, audio, video, sensor data, and / or other input data) supplied in the first of the convolutional blocks 354A. The output of the deep convolutional network 350 is a classification score 366 for the input data 352. The classification score 366 can be a set of probabilities, where each probability is the probability that the input data contains a feature from the set of features.
[0059] 【0080】Image, audio, and video content can be stored and / or shared between devices. For example, image, audio, and video content can be uploaded to media hosting services and sharing platforms, and transmitted to various devices. Recording uncompressed image, audio, and video content generally results in large file sizes, which increase significantly as the resolution of the image, audio, and video content increases. For example, uncompressed 16-bit per channel video recorded at 1080p / 24 (for example, a resolution of 1920 pixels wide and 1080 pixels high, captured at 24 frames per second) may occupy 12.4 megabytes per frame, or 297.6 megabytes per second. Uncompressed 16-bit per channel video recorded at 4K resolution at 24 frames per second may occupy 49.8 megabytes per frame, or 1195.2 megabytes per second.
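The per-frame figures above can be reproduced with a short calculation. The sketch below assumes three color channels, decimal megabytes (10^6 bytes), and a 4K frame of 3840x2160 pixels, none of which is stated explicitly in the text. The helper names are hypothetical.

```python
def raw_frame_bytes(width, height, channels=3, bits_per_channel=16):
    """Size in bytes of one uncompressed frame (assumed 3 channels)."""
    return width * height * channels * bits_per_channel // 8

def describe(name, width, height, fps=24):
    frame_mb = raw_frame_bytes(width, height) / 1e6   # decimal megabytes
    return f"{name}: {frame_mb:.1f} MB per frame, {frame_mb * fps:.1f} MB per second"

print(describe("1080p", 1920, 1080))
print(describe("4K", 3840, 2160))
```

Under these assumptions the per-frame sizes come out to 12.4 MB and 49.8 MB, matching the text. The per-second values computed from the unrounded frame sizes are 298.6 MB and 1194.4 MB. The figures of 297.6 MB and 1195.2 MB quoted in the text equal the rounded per-frame sizes multiplied by 24.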
[0060] 【0081】 Since uncompressed image, audio, and video content can result in large files that require considerable memory for physical storage and substantial bandwidth for transmission, techniques for compressing such video content may be used. For example, various compression algorithms can be applied to image, audio, and video content to reduce the size of image content, and therefore the amount of storage space involved in storing image content and the amount of bandwidth involved in delivering video content.
[0061] 【0082】In some cases, image content can be compressed using a priori defined compression algorithms, such as Joint Photographic Experts Group (JPEG) and Better Portable Graphics (BPG), among others. JPEG, for example, is a lossy (irreversible) form of compression based on the Discrete Cosine Transform (DCT). For example, a device performing JPEG compression on an image can convert the image to a suitable color space (e.g., the YCbCr color space, which includes luminance (Y), chrominance blue (Cb), and chrominance red (Cr) components), downsample the chrominance components by averaging groups of pixels together, and apply the DCT function to blocks of pixels to remove redundant image data and thus compress the image data. Compression is based on identifying similar regions within an image and converting those regions to the same color code (based on the DCT function). Video content can also be compressed using a priori defined compression algorithms, such as the Moving Picture Experts Group (MPEG) algorithms, H.264, or the High Efficiency Video Coding (HEVC) algorithm.
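The three steps named above (color conversion, chrominance downsampling, and a block DCT) can be sketched in NumPy. This is an illustration of the ideas only and is not a JPEG codec. It uses the JFIF-style full-range YCbCr conversion, 2x2 chroma averaging, an 8x8 orthonormal DCT-II, and a single uniform quantization step. The quantization step, the random stand-in image, and all function names are assumptions. Real JPEG uses quantization tables, zigzag ordering, and entropy coding, which are omitted here.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JFIF-style (BT.601, full-range) conversion for float values in [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def subsample_2x2(plane):
    """Downsample by averaging each 2x2 group of pixels."""
    h, w = plane.shape
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def block_dct(block):
    """2-D DCT of a square block."""
    d = dct_matrix(block.shape[0])
    return d @ block @ d.T

rng = np.random.default_rng(0)
rgb = rng.uniform(0, 255, size=(16, 16, 3))    # stand-in image
y, cb, cr = rgb_to_ycbcr(rgb)
cb_small, cr_small = subsample_2x2(cb), subsample_2x2(cr)   # 16x16 -> 8x8

step = 16.0                                    # hypothetical uniform quantization step
coeffs = block_dct(y[:8, :8] - 128.0)
quantized = np.round(coeffs / step)            # integers that would be entropy coded
d = dct_matrix(8)
restored = d.T @ (quantized * step) @ d + 128.0
max_error = np.abs(restored - y[:8, :8]).max()
print(cb_small.shape, max_error)
```

The rounding step is where information is discarded, which is why the restored block only approximates the original.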
[0062] 【0083】 These a priori-defined compression algorithms may be capable of preserving most of the information in raw image and video content and can be defined a priori based on signal processing and information theory concepts. However, while these predefined compression algorithms may generally be applicable (for example, to any type of image / video content), they may not take into account the similarity of the content, new resolutions or frame rates for video capture and distribution, unnatural images (for example, radar images, or other images captured through various sensors), etc.
[0063] 【0084】The a priori defined compression algorithms described above can be considered lossy (irreversible) compression algorithms. In lossy compression of an input image (or video frame), the input image cannot be coded and then decoded / reconstructed such that the exact input image is recovered. Rather, lossy compression produces an approximate version of the input image after decoding / reconstruction of the compressed input image. Lossy compression reduces the bitrate at the cost of distortion, which produces artifacts in the reconstructed image. Thus, there is a rate-distortion trade-off in lossy compression systems. For some compression methods (e.g., JPEG and BPG, among others), distortion-based artifacts can take the form of blocking or other artifacts. In some cases, neural network-based compression can be used, which can produce high-quality compression of image and video data. In some cases, blur and color shift are examples of artifacts.
[0064] 【0085】 Whenever the bitrate falls below the true entropy of the input data, it may be difficult or impossible to reconstruct the exact input data. However, the fact that data compression / decompression introduces distortion / loss does not mean that the reconstructed image or frame must have artifacts. In fact, it may be possible to reconstruct a compressed image into another similar, but different, image with high visual quality.
[0065] 【0086】In some cases, compression and decompression can be performed using one or more machine learning (ML) systems. In some examples, such ML-based systems can provide image and / or video compression that produces high-quality visual output. In some examples, such systems can perform compression and decompression of content (e.g., image content, video content, audio content, etc.) using one or more deep neural networks, such as rate-distortion autoencoders (RD-AEs). Deep neural networks can include autoencoders (AEs) that map images to a latent code space (e.g., containing a set of codes z). The latent code space can include a code space used by encoders and decoders, where content is encoded into codes z. The codes (e.g., code z) are sometimes called latents, latent variables, or latent representations. Deep neural networks can include probabilistic models (also called priors or code models) that can losslessly compress the codes z from the latent code space. The probabilistic model can generate a probability distribution across a set of codes z that can represent the encoded data, based on the input data. In some cases, the probability distribution may be denoted as P(z).
[0066] 【0087】In some examples, a deep neural network may include an arithmetic coder that generates a bitstream containing compressed data to be output, based on a probability distribution P(z) and / or a set of codes z. The bitstream containing the compressed data may be stored and / or transmitted to a receiving device. The receiving device may perform the reverse process to decode or decompress the bitstream using, for example, an arithmetic decoder, a probability (or code) model, and an AE decoder. The device that generated the bitstream containing the compressed data may also perform a similar decoding / decompression process when retrieving the compressed data from storage. Similar techniques may be employed to compress / encode and decompress / decode updated model parameters.
[0067] 【0088】 In some cases, an RD-AE can be trained and operated to function as a multi-rate AE (including high-rate and low-rate operations). For example, the latent code space generated by the encoder of a multi-rate AE may be divided into two or more chunks (e.g., code z is divided into chunk z1 and chunk z2). In high-rate operation, the multi-rate AE can send out a bitstream based on the entire latent space (e.g., code z including z1, z2, etc.), which can be used by the receiving device to decompress the data, similar to the operation described above for an RD-AE. In low-rate operation, the bitstream sent to the receiving device is based on a subset of the latent space (e.g., chunk z1 but not chunk z2). The receiving device can infer the rest of the latent space based on the sent subset and use the subset of the latent space and the inferred rest of the latent space to generate reconstructed data.
[0068] 【0089】By compressing (and decompressing) content using an RD-AE or a multi-rate AE, encoding and decoding mechanisms can be adapted to a variety of use cases. Machine learning-based compression techniques can produce compressed content with high quality and / or reduced bitrate. In some examples, an RD-AE can be trained to minimize the average rate-distortion loss across a dataset of data points, such as image and / or video data points. In some cases, an RD-AE can also be fine-tuned for specific data points that are sent to a receiver and decoded by the receiver. In some examples, by fine-tuning the RD-AE on data points, the RD-AE can achieve high compression (rate / distortion) performance. The encoder associated with the RD-AE can send an AE model or a portion of an AE model to a receiver (e.g., a decoder) for decoding the bitstream.
[0069] 【0090】 In some cases, a neural network compression system can reconstruct an input instance (e.g., an input image, video, audio, etc.) from a (quantized) latent representation. The neural network compression system can also use a prior to losslessly compress the latent representation. In some cases, the test-time data distribution is known and has relatively low entropy (e.g., a camera viewing a static scene, a dashcam in an autonomous vehicle, etc.), and the neural network compression system can be fine-tuned or adapted to such a distribution. Fine-tuning or adaptation can lead to improved rate / distortion (RD) performance. In some examples, the model of the neural network compression system can be adapted to a single input instance to be compressed. In some examples, the neural network compression system can provide model updates that, along with the latent representation, can be quantized and compressed using a parameter-space prior.
[0070] 【0091】Fine-tuning can take into account the effects of model quantization and the additional cost incurred by sending model updates. In some examples, the neural network compression system may be fine-tuned using the RD loss together with an additional model rate term M that measures the number of bits required to send a model update under a model prior, resulting in a combined RDM loss.
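The trade-off captured by a combined RDM loss can be illustrated with a small calculation. The text at this point does not give the form of the loss, so the additive form, the trade-off weight `beta`, the function name `rdm_loss`, and all numbers below are assumptions made purely for illustration.

```python
def rdm_loss(distortion, latent_bits, model_update_bits, beta=0.01):
    """Combined rate-distortion-model loss (assumed additive form).

    distortion         -- reconstruction error D
    latent_bits        -- rate R: bits spent on the latent representation
    model_update_bits  -- model rate M: bits spent on sending the model update
    beta               -- hypothetical weight trading rate against distortion
    """
    return distortion + beta * (latent_bits + model_update_bits)

# Hypothetical numbers for a single image.
baseline = rdm_loss(distortion=4.0, latent_bits=1000, model_update_bits=0)
fine_tuned = rdm_loss(distortion=2.5, latent_bits=900, model_update_bits=150)
print(baseline, fine_tuned)
```

In this made-up case the fine-tuned model pays 150 bits for its update but still achieves the lower combined loss, because the savings in distortion and latent rate outweigh the update cost. If the update were more expensive, the comparison could go the other way.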
[0071] 【0092】 Figure 4 shows a system 400 including a transmitting device 410 and a receiving device 420, according to some examples of the present disclosure. The transmitting device 410 and the receiving device 420 may each be referred to as RD-AE in some cases. The transmitting device 410 can compress image content, store the compressed image content, and / or transmit the compressed image content to the receiving device 420 for decompression. The receiving device 420 can decompress the compressed image content, output the decompressed image content on the receiving device 420 (for example, for viewing, editing, etc.), and / or output the decompressed image content to another device connected to the receiving device 420 (for example, a television, mobile device, or other device). In some cases, the receiving device 420 can become a transmitting device by compressing image content (using an encoder 422), storing the compressed image content, and / or transmitting it to another device such as the transmitting device 410 (in which case the transmitting device 410 becomes the receiving device). While System 400 is described herein in relation to image compression and decompression, those skilled in the art will understand that System 400 can also be used to compress and decompress video content using the techniques described herein.
[0072] 【0093】As shown in Figure 4, the transmitting device 410 includes an image compression pipeline, and the receiving device 420 includes an image bitstream decompression pipeline. The image compression pipeline in the transmitting device 410 and the bitstream decompression pipeline in the receiving device 420 generally use one or more artificial neural networks to compress image content and / or decompress the received bitstream into image content, according to aspects of this disclosure. The image compression pipeline in the transmitting device 410 includes an autoencoder 401, a code model 404, and an arithmetic coder 406. In some implementations, the arithmetic coder 406 is optional and may be omitted in some cases. The image decompression pipeline in the receiving device 420 includes an autoencoder 421, a code model 424, and an arithmetic decoder 426. In some implementations, the arithmetic decoder 426 is optional and may be omitted in some cases. The autoencoder 401 and code model 404 of the transmitting device 410 are shown in Figure 4 as a previously trained machine learning system and are therefore configured to perform actions during inference or operation of the trained machine learning system. The autoencoder 421 and code model 424 are also shown as a previously trained machine learning system.
[0073] 【0094】The autoencoder 401 includes an encoder 402 and a decoder 403. The encoder 402 can perform lossy compression on received uncompressed image content by mapping pixels in one or more images of the uncompressed image content to a latent code space (containing code z). Generally, the encoder 402 may be configured such that the code z representing the compressed (or encoded) image is discrete or binary. These codes may be generated based on stochastic perturbation techniques, soft vector quantization, or other techniques that can generate distinct codes. In some embodiments, the autoencoder 401 may map uncompressed images to codes having a compressible (low-entropy) distribution. These codes may be close to a predefined or learned prior distribution in cross-entropy.
[0074] 【0095】In some examples, the autoencoder 401 may be implemented using a convolutional architecture. For example, in some cases, the autoencoder 401 may be configured as a two-dimensional convolutional neural network (CNN) so that the autoencoder 401 learns spatial filters for mapping image content into latent code space. In an example where system 400 is used to code video data, the autoencoder 401 may be configured as a three-dimensional CNN so that the autoencoder 401 learns spatiotemporal filters for mapping video into latent code space. In such a network, the autoencoder 401 may encode video with respect to keyframes (e.g., an initial frame that marks the beginning of a sequence of frames; subsequent frames in the sequence are described as differences in the sequence with respect to the initial frame), warping (or differences) between keyframes and other frames in the video, and residual factors. In other embodiments, the autoencoder 401 may be implemented as a two-dimensional neural network conditioned on the previous frame, the residual factor between frames, and conditioning by stacking channels or including a recurrent layer.
[0075] 【0096】The encoder 402 of the autoencoder 401 can receive a first image (specified as image x in Figure 4) as input and can map the first image x to a code z in the latent code space. As described above, the encoder 402 can be implemented as a two-dimensional convolutional network such that the latent code space has a vector at each (x,y) position that describes a block of image x centered at that position. The x coordinate can represent a horizontal pixel location in a block of image x, and the y coordinate can represent a vertical pixel location in a block of image x. When coding video data, the latent code space can have a variable t or position, where the t variable represents a timestamp in a block of video data (in addition to the spatial x and y coordinates). By using the two dimensions of horizontal and vertical pixel position, the vector can describe an image patch in image x.
[0076] 【0097】 Next, the decoder 403 of the autoencoder 401 can decompress the code z to obtain a reconstruction $\hat{x}$ of the first image x. Generally, the reconstruction $\hat{x}$ can be an approximation of the uncompressed first image x and does not need to be an exact copy of the first image x. In some cases, the reconstructed image $\hat{x}$ can be output as a compressed image file for storage on the transmitting device.
[0083] 【0098】 The code model 404 receives a code z representing an encoded image or a portion thereof and generates a probability distribution P(z) over a set of compressed codewords that can be used to represent the code z. In some examples, the code model 404 may include a probabilistic autoregressive generative model. In some cases, the codes from which the probability distribution can be generated include a learned distribution that controls bit assignments based on the arithmetic coder 406. For example, using the arithmetic coder 406, a compressed code for a first code z may be predicted independently, a compressed code for a second code z may be predicted based on the compressed code for the first code z, a compressed code for a third code z may be predicted based on the compressed code for the first code z and the compressed code for the second code z, and so on. The compressed codes generally represent different spatiotemporal chunks of a given image to be compressed.
[0084] 【0099】 In some embodiments, z can be represented as a three-dimensional tensor. The three dimensions of the tensor may include a feature channel dimension and height and width spatial dimensions (for example, denoted as code $z_{c,w,h}$). Each code $z_{c,w,h}$ (representing a code indexed by the channel and the horizontal and vertical positions) can be predicted based on the previous codes, where the ordering of the codes can be fixed and is theoretically arbitrary. In some examples, the codes can be generated by analyzing a given image file from start to finish and by analyzing each block in the image in raster scan order.
[0085] 【0100】 Code model 404 can learn a probability distribution for input code z using a probabilistic autoregressive model. The probability distribution may be conditioned on its previous values (as described above). In some examples, the probability distribution may be expressed by the following formula:
[0086] $$P(z) = \prod_{c=0}^{C} \prod_{w=0}^{W} \prod_{h=0}^{H} p\left(z_{c,w,h} \mid z_{0:c,\,0:w,\,0:h}\right)$$
[0087] Here, c is the channel index for all image channels C (e.g., R, G, and B channels, Y, Cb, and Cr channels, or other channels), w is the width index for the total image frame width W, and h is the height index for the total image frame height H.
[0088] 【0101】 In some cases, the probability distribution P(z) can be predicted by a fully convolutional neural network with causal convolutions. In some embodiments, the kernels of each layer of the convolutional neural network can be masked so that, when calculating the probability distribution, the network is aware of the previous values $z_{0:c,0:w,0:h}$ but not of other values. In some embodiments, the final layer of the convolutional network may include a softmax function that determines the probability that a code in latent space is applicable across input values (e.g., the likelihood that a given code can be used to compress a given input).
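One common way to make a convolution causal is to zero out the kernel weights that would look at positions not yet coded. The sketch below builds such a mask for a square kernel under a raster-scan ordering, in the style of masked convolutions used by PixelCNN-type models. The mask layout is an assumption, since the text does not specify it, and the sketch covers the two spatial dimensions only, not the channel ordering. The function name `causal_mask` is hypothetical.

```python
import numpy as np

def causal_mask(kernel_size, include_center=False):
    """Mask for a k x k kernel under raster-scan order.

    Entries are 1 for positions that precede the center (all rows above it,
    and the columns to its left in the same row) and 0 elsewhere.
    """
    mask = np.zeros((kernel_size, kernel_size))
    center = kernel_size // 2
    mask[:center, :] = 1.0          # rows above the center
    mask[center, :center] = 1.0     # same row, left of the center
    if include_center:
        mask[center, center] = 1.0
    return mask

kernel = np.ones((3, 3))                    # stand-in for learned weights
masked_kernel = kernel * causal_mask(3)     # weights on later positions become 0
print(masked_kernel)
```

For a 3x3 kernel the mask keeps the top row and the left neighbor, so the prediction at a position depends only on codes that come earlier in raster-scan order.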
[0089] 【0102】 The arithmetic coder 406 uses the probability distribution P(z) generated by the code model 404 to generate a bitstream 415 (shown as "0010011..." in Figure 4) corresponding to the prediction of code z. The prediction of code z can be represented as the code with the highest probability score in the probability distribution P(z) generated over the set of possible codes. In some embodiments, the arithmetic coder 406 can output a bitstream of variable length based on the accuracy of the prediction of code z and the actual code z generated by the autoencoder 401. For example, bitstream 415 may correspond to short codewords if the prediction is accurate, but bitstream 415 may correspond to longer codewords as the magnitude of the difference between code z and the prediction of code z increases.
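The relationship between prediction accuracy and codeword length can be made concrete with a short sketch. It assumes the standard information-theoretic result that an ideal arithmetic coder spends about -log2(p) bits on a symbol to which the model assigned probability p; the function name and the toy distribution are hypothetical.

```python
import math


def ideal_code_length_bits(probabilities, actual_code):
    """Ideal number of bits an arithmetic coder spends on `actual_code`
    when the code model assigned it probability probabilities[actual_code]."""
    return -math.log2(probabilities[actual_code])


# Toy distribution P(z) over three possible codes (illustrative values).
predicted = {"a": 0.9, "b": 0.05, "c": 0.05}

bits_hit = ideal_code_length_bits(predicted, "a")   # prediction was accurate
bits_miss = ideal_code_length_bits(predicted, "b")  # prediction was wrong

print(f"accurate prediction: {bits_hit:.3f} bits")
print(f"inaccurate prediction: {bits_miss:.3f} bits")
```

When the actual code is the one the model considered most likely, the cost is a fraction of a bit; when the model was wrong, the cost rises to several bits.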
[0090] 【0103】In some cases, bitstream 415 may be output by arithmetic coder 406 for storage in a compressed image file. Bitstream 415 may also be output for transmission to a requesting device (for example, receiving device 420 as shown in Figure 4). Generally, bitstream 415 output by arithmetic coder 406 may reversibly encode z so that z can be accurately restored during a decompression process applied to a compressed image file.
[0091] 【0104】 The bitstream 415 generated by the arithmetic coder 406 and transmitted from the transmitting device 410 may be received by the receiving device 420. Transmission between the transmitting device 410 and the receiving device 420 may be performed using any of a variety of suitable wired or wireless communication techniques. Communication between the transmitting device 410 and the receiving device 420 may be direct or may be carried out through one or more network infrastructure components (e.g., base stations, relay stations, mobile stations, network hubs, routers, and / or other network infrastructure components).
[0092] 【0105】 As shown in the figure, the receiving device 420 may include an arithmetic decoder 426, a code model 424, and an autoencoder 421. The autoencoder 421 includes an encoder 422 and a decoder 423. The decoder 423 can produce the same or similar output as the decoder 403 for a given input. Although the autoencoder 421 is shown as including an encoder 422, the encoder 422 does not need to be used during the decoding process to obtain $\hat{x}$ (e.g., an approximation of the original image x compressed in the transmitting device 410) from the code z received from the transmitting device 410.
[0095] 【0106】 The received bitstream 415 may be input to an arithmetic decoder 426 to obtain one or more codes z from the bitstream. The arithmetic decoder 426 may extract the decompressed codes z based on a probability distribution P(z) generated by the code model 424 over a set of possible codes and information relating each generated code z to the bitstream. Given a received portion of the bitstream and a probability prediction of the next code z, the arithmetic decoder 426 may generate a new code z, just as it was encoded by the arithmetic coder 406 in the transmitting device 410. Using the new code z, the arithmetic decoder 426 may make probability predictions for successive codes z, read additional portions of the bitstream, and decode successive codes z until the entire received bitstream is decoded. The decompressed codes z may be provided to a decoder 423 in the autoencoder 421. The decoder 423 decompresses the codes z and outputs an approximation $\hat{x}$ of the image content x (sometimes called the reconstructed or decoded image). In some cases, the approximation $\hat{x}$ of the content x can be stored for later retrieval. In some cases, the approximation $\hat{x}$ of the content x can be recovered by the receiving device 420 and displayed on a screen that is communicatively coupled to or integrated with the receiving device 420.
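The sequential decode loop described above (predict a distribution for the next code, read the bitstream, recover the code, repeat) can be sketched with a toy arithmetic coder. This is a minimal illustration using exact fractions rather than a practical bit-level coder; the three-symbol alphabet and the `model` function (a stand-in for code models 404 and 424, which must be identical on both sides) are assumptions.

```python
from fractions import Fraction


def model(prev):
    """Hypothetical autoregressive code model over the alphabet {0, 1, 2}:
    uniform for the first code, then biased toward repeating the previous one."""
    if prev is None:
        return [Fraction(1, 3)] * 3
    probs = [Fraction(1, 4)] * 3
    probs[prev] = Fraction(1, 2)
    return probs


def encode(codes):
    """Narrow the interval [low, low + width) once per code."""
    low, width, prev = Fraction(0), Fraction(1), None
    for s in codes:
        probs = model(prev)
        low += width * sum(probs[:s])   # skip the sub-intervals of smaller symbols
        width *= probs[s]
        prev = s
    return low, width                   # any value in the interval identifies `codes`


def decode(value, count):
    """Replay the same model to find which sub-interval contains `value`."""
    low, width, prev, out = Fraction(0), Fraction(1), None, []
    for _ in range(count):
        probs = model(prev)
        cum = Fraction(0)
        for s, p in enumerate(probs):
            if low + width * cum <= value < low + width * (cum + p):
                break
            cum += p
        low += width * cum
        width *= p
        out.append(s)
        prev = s
    return out


codes = [0, 0, 1, 1, 2]
value, width = encode(codes)
print(decode(value, len(codes)) == codes)
```

The final interval width equals the product of the conditional probabilities of the coded sequence, so a better-matched model leaves a wider interval that can be identified with fewer bits.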
[0102] 【0107】 As described above, the autoencoder 401 and code model 404 of the transmitting device 410 are shown in Figure 4 as a previously trained machine learning system. In some embodiments, the autoencoder 401 and code model 404 can be trained together using image data. For example, the encoder 402 of the autoencoder 401 can receive a first training image n as input and map the first training image n to a code z in latent code space. The code model 404 can learn a probability distribution P(z) for code z using a probabilistic autoregressive model (similar to the techniques described above). The arithmetic coder 406 can use the probability distribution P(z) generated by the code model 404 to generate an image bitstream. Using the bitstream and probability distribution P(z) from the code model 404, the arithmetic coder 406 can generate code z and output code z to the decoder 403 of the autoencoder 401. The decoder 403 can then decompress code z to obtain a reconstruction $\hat{n}$ of the first training image n (where the reconstruction $\hat{n}$ is an approximation of the uncompressed first training image n).
[0107] 【0108】 In some cases, a backpropagation engine used during training of the transmitting device 410 may perform a backpropagation process to tune the parameters (e.g., weights, biases, etc.) of the neural network of the autoencoder 401 and the code model 404 based on one or more loss functions. In some cases, the backpropagation process may be based on stochastic gradient descent techniques. Backpropagation may include a forward pass, one or more loss functions, a backward pass, and updates to the weights (and / or one or more other parameters). The forward pass, loss function, backward pass, and parameter updates may be performed for one training iteration. The process may be repeated for a certain number of iterations for each set of training data until the weights and / or other parameters of the neural network are accurately tuned.
[0108] 【0109】 For example, the autoencoder 401 can compare n and $\hat{n}$ to determine the loss (e.g., represented by a distance vector or other difference value) between the first training image n and the reconstructed first training image $\hat{n}$. The loss function can be used to analyze the error in the output. In some examples, the loss may be based on maximum likelihood. In one illustrative example using an uncompressed image n as the input and a reconstructed image $\hat{n}$ as the output, the loss function Loss = D + β * R can be used to train the neural network system of the autoencoder 401 and the code model 404, where R is the rate, D is the distortion, * indicates multiplication, and β is a trade-off parameter set to a value that defines the bitrate. In another example, the loss function 【number】 can be used to train the neural network system of the autoencoder 401 and the code model 404. In some cases, other loss functions may be used, such as when other training data is used. An example of another loss function is the mean squared error (MSE), defined as $\mathrm{MSE} = \sum \frac{1}{2}(\text{target} - \text{output})^{2}$. The MSE is calculated as the sum of one-half times the square of the actual response minus the predicted (output) response. 【0119】
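As a small worked example of the two loss functions named above, the following sketch computes the half-sum-of-squares error and the combined loss Loss = D + β * R. The pixel values, the rate of 8 bits, and β = 0.01 are illustrative assumptions, and the function names are hypothetical.

```python
def half_squared_error(target, output):
    """Sum of one-half times the squared difference between target and output."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))


def rate_distortion_loss(distortion, rate_bits, beta):
    """Loss = D + beta * R, with beta trading rate against distortion."""
    return distortion + beta * rate_bits


n = [0.0, 0.5, 1.0]        # uncompressed training image (toy values)
n_hat = [0.0, 0.25, 1.0]   # reconstruction produced by the decoder

d = half_squared_error(n, n_hat)
loss = rate_distortion_loss(d, rate_bits=8.0, beta=0.01)
print(d, loss)
```

A larger β penalizes the rate term more heavily, which pushes training toward shorter bitstreams at the expense of higher distortion.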
[0110] Based on the determined loss (e.g., a distance vector or other difference value), and using a backpropagation process, the parameters of the neural network system of the autoencoder 401 and the code model 404 (e.g., weights, biases, etc.) can be adjusted to reduce the loss between the input uncompressed image and the compressed image content produced as output by the autoencoder 401 (effectively adjusting the mapping between the received image content and the latent code space). 【0120】
[0111] Since the actual output (the reconstructed image) can differ significantly from the input image, the loss (or error) can be high for the first training image. The goal of training is to minimize the amount of loss for the predicted output. The neural network can perform a backward pass by determining which nodes of the neural network (with corresponding weights) contributed the most to the loss, and can adjust the weights (and / or other parameters) so that the loss decreases and is eventually minimized. To determine the weights that contributed the most to the loss, the derivative of the loss with respect to the weights (expressed as dL / dW, where W is the weights in a particular layer) can be calculated. For example, the weights can be updated so that they change in the opposite direction of the gradient. A weight update can be expressed as $w = w_i - \eta \frac{dL}{dW}$, where w represents the weight, $w_i$ represents the initial weight, and η represents the learning rate. The learning rate can be set to any suitable value; a higher learning rate produces larger weight updates, and a lower value produces smaller weight updates. 【0123】
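The weight update described above (moving each weight against the gradient, scaled by the learning rate η) can be sketched on a one-parameter problem. The quadratic loss, its minimum at 3.0, the learning rate of 0.1, and the function names are assumptions chosen only so that the behaviour is easy to follow.

```python
def sgd_step(w_i, grad, eta):
    """One weight update: move against the gradient, scaled by learning rate eta."""
    return w_i - eta * grad


def loss(w):
    """Toy loss with its minimum at w = 3.0."""
    return (w - 3.0) ** 2


def dloss_dw(w):
    """Derivative dL/dW of the toy loss."""
    return 2.0 * (w - 3.0)


w = 0.0
for _ in range(50):
    w = sgd_step(w, dloss_dw(w), eta=0.1)

print(w, loss(w))
```

With η = 0.1 each step removes 20% of the remaining error, so the weight approaches 3.0 smoothly; a much larger η would overshoot, and a much smaller one would need more iterations.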
[0112] The neural network system of the autoencoder 401 and the code model 404 can continue to be trained in that manner until a desired output is achieved. For example, the autoencoder 401 and the code model 404 can repeat the backpropagation process to minimize or otherwise reduce the difference between the input image n and the reconstructed image $\hat{n}$ resulting from decompression of the generated code z. 【0126】
[0113] The autoencoder 421 and the code model 424 may be trained using techniques similar to those described above for training the autoencoder 401 and the code model 404 of the transmitting device 410. In some cases, the autoencoder 421 and the code model 424 may be trained using the same or different training datasets used to train the autoencoder 401 and the code model 404 of the transmitting device 410. 【0127】
[0114] In the example shown in Figure 4, the rate-distortion autoencoder (transmitting device 410 and receiving device 420) is trained and run at inference according to a bitrate. In some implementations, the rate-distortion autoencoder may be trained at multiple bitrates to enable the generation and output of high-quality reconstructed images or video frames (e.g., with no or only limited artifacts due to distortion relative to the input image) when varying amounts of information are provided in the latent code z. 【0128】
[0115] In some implementations, the latent code z can be divided into at least two chunks z1 and z2. When the RD-AE model is used in a high-rate setting, both chunks are sent to the device for decoding. When the rate-distortion autoencoder model is used in a low-rate setting, only chunk z1 is sent, and chunk z2 is inferred from z1 on the decoder side. The inference of z2 from z1 can be performed using various techniques, as will be described in more detail below. 【0129】
[0116] In some implementations, a set of continuous latents (which can convey a large amount of information, for example) and corresponding quantized discrete latents (which contain less information, for example) may be used. An auxiliary inverse quantization model may be trained after the RD-AE model has been trained. In some cases, when using the RD-AE, only the discrete latents are transmitted, and the auxiliary inverse quantization model is used on the decoder side to infer the continuous latents from the discrete latents. 【0130】
[0117] Although System 400 is shown to include several components, those skilled in the art will understand that System 400 may include more or fewer components than those shown in Figure 4. For example, the transmitting device 410 and / or receiving device 420 of System 400 may also, in some cases, include one or more memory devices (e.g., RAM, ROM, cache, etc.), one or more networking interfaces (e.g., wired and / or wireless communication interfaces, etc.), one or more display devices, and / or other hardware or processing devices not shown in Figure 4. The components shown in Figure 4 and / or other components of System 400 may be implemented using one or more computing or processing components. One or more computing components may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), and / or an image signal processor (ISP). Exemplary examples of computing devices and hardware components that may be implemented with System 1600 are described below with respect to Figure 16. 【0131】
[0118] System 400 may be part of or implemented by a single computing device or multiple computing devices. In some examples, the transmitting device 410 may be part of a first device and the receiving device 420 may be part of a second computing device. In some examples, the transmitting device 410 and / or the receiving device 420 may be included as part of (one or more) electronic devices, such as a telephone system (e.g., a smartphone, cellular phone, conference system, etc.), a desktop computer, a laptop or notebook computer, a tablet computer, a set-top box, a smart television, a display device, a gaming console, a video streaming device, a SOC, an IoT (Internet of Things) device, a smart wearable device (e.g., a head-mounted display (HMD), smart glasses, etc.), a camera system (e.g., a digital camera, an IP camera, a video camera, a security camera, etc.), or (one or more) any other suitable electronic devices. In some cases, System 400 may be implemented by the image processing system 100 shown in Figure 1. In other cases, System 400 may be implemented by one or more other systems or devices. 【0132】
[0119] Figure 5A shows an exemplary neural network compression system 500. In some examples, the neural network compression system 500 may include an RD-AE system. In Figure 5A, the neural network compression system 500 includes an encoder 502, an arithmetic encoder 508, an arithmetic decoder 512, and a decoder 514. In some cases, the encoder 502 and / or decoder 514 may be the same as the encoder 402 and / or decoder 403, respectively. In other cases, the encoder 502 and / or decoder 514 may be different from the encoder 402 and / or decoder 403, respectively. 【0133】
[0120] The encoder 502 can receive the image 501 (image $x_i$) as input and can map and / or convert the image 501 (image $x_i$) to the latent code 504 (latent $z_i$) in the latent code space. The image 501 can represent a still image and / or a video frame associated with a sequence of frames (e.g., video). In some cases, the encoder 502 can perform a forward pass to generate the latent code 504. In some examples, the encoder 502 can implement a learnable function. In some cases, the encoder 502 can implement a learnable function parameterized by φ. For example, the encoder 502 can implement the function $q_\phi(z|x)$. In some examples, the learnable function need not be shared with or known to the decoder 514. 【0138】
[0121] The arithmetic encoder 508 can generate the bitstream 510 based on the latent code 504 (latent $z_i$) and the latent prior 506. In some examples, the latent prior 506 can implement a learnable function. In some cases, the latent prior 506 can implement a learnable function parameterized by ψ. For example, the latent prior 506 can implement the function $p_\psi(z)$. The latent prior 506 may be used to convert the latent code 504 (latent $z_i$) to the bitstream 510 using lossless compression. The latent prior 506 may be shared and / or made available on both the sender side (e.g., encoder 502 and / or arithmetic encoder 508) and the receiver side (e.g., arithmetic decoder 512 and / or decoder 514). 【0139】
[0122] The arithmetic decoder 512 can receive the encoded bitstream 510 from the arithmetic encoder 508 and use the latent prior 506 to decode the latent code 504 (latent $z_i$) in the encoded bitstream 510. The decoder 514 can decode the latent code 504 (latent $z_i$) into an approximate reconstructed image 516 (reconstruction $\hat{x}_i$). In some cases, the decoder 514 can implement a learnable function parameterized by θ. For example, the decoder 514 can implement the function $p_\theta(x|z)$. The learnable function implemented by decoder 514 can be shared and / or made available on both the transmitter side (e.g., encoder 502 and / or arithmetic encoder 508) and the receiver side (e.g., arithmetic decoder 512 and / or decoder 514). 【0142】
[0123] The neural network compression system 500 can be trained to minimize the rate-distortion loss. In some examples, the rate reflects the length of bitstream 510 (bitstream b), and the distortion reflects the distortion between image 501 (image $x_i$) and reconstructed image 516 (reconstruction $\hat{x}_i$). The parameter β can be used to train a model for a particular rate-distortion ratio. In some examples, the parameter β can be used to define and / or implement a trade-off between rate and distortion. 【0145】
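[0124] In some cases, the loss may be expressed as follows: $L_{RD}(x; \phi, \psi, \theta) = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z) - \beta \log p_\psi(z)\right]$. Here, the function $\mathbb{E}$ is the expected value, taken over the latents z that the encoder $q_\phi(z|x)$ produces for the input x. The distortion D(x|z;θ) can be determined based on a loss function, such as the mean squared error (MSE). In some examples, the term $-\log p_\theta(x|z)$ represents and / or expresses the distortion D(x|z;θ). 【0148】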
[0125] The rate for sending the latents can be denoted as $R_z(z;\psi)$. In some examples, the term $-\log p_\psi(z)$ can indicate and / or represent the rate $R_z(z;\psi)$. In some cases, the loss can be minimized over the full dataset D, as follows: 【number】 【0150】
[0126] Figure 5B shows an exemplary neural network compression system 530 for implementing the inference process. As shown in the figure, the encoder 502 can convert the image 501 into latent code 504. In some examples, the image 501 can represent a still image and / or a video frame associated with a sequence of frames (e.g., video). 【0151】
[0127] In some examples, the encoder 502 can encode the image 501 using a single forward pass: 【number】 The arithmetic encoder 508 can then perform arithmetic coding of the latent code 504 (latent $z_i$) under the latent prior 506 to generate the bitstream 520 【number】. In some examples, the arithmetic encoder 508 can generate the bitstream 520 as follows: 【number】 【0157】
[0128] The arithmetic decoder 512 can receive the bitstream 520 from the arithmetic encoder 508 and perform arithmetic decoding of the latent code 504 (latent $z_i$) under the latent prior 506. In some examples, the arithmetic decoder 512 can decode the latent code 504 from the bitstream 520 as follows: 【number】 The decoder 514 can decode the latent code 504 (latent $z_i$) and generate the reconstructed image 516 (reconstruction $\hat{x}_i$). In some examples, the decoder 514 can decode the latent code 504 (latent $z_i$) using a single forward pass as follows: 【number】 【0163】
[0129] In some cases, the RD-AE system may be trained using a set of training data and further fine-tuned for the data points (e.g., image data, video data, audio data) that are sent to a receiver (e.g., a decoder) and decoded by it. For example, at inference time, the RD-AE system may be fine-tuned on the image data being sent to the receiver. Since compression models are generally large, sending the parameters associated with the model to the receiver can be extremely costly in terms of resources, such as network (e.g., bandwidth), storage, and computational resources. In some cases, the RD-AE system may be fine-tuned on a single data point that is compressed and sent to the receiver for decompression. This can limit the amount of information (and associated costs) sent to the receiver while maintaining and / or increasing compression / decompression efficiency, performance, and / or quality. 【0164】
[0130] Figure 6 shows an exemplary inference process implemented by an exemplary neural network compression system 600 that is fine-tuned using a model prior. In some examples, the neural network compression system 600 may include an RD-AE system that is fine-tuned using an RDM-AE model prior. In some cases, the neural network compression system 600 may include an AE model that is fine-tuned using a model prior. 【0165】
[0131] In this illustrative example, the neural network compression system 600 includes an encoder 602, an arithmetic encoder 608, an arithmetic decoder 612, a decoder 614, a model prior 616, and a latent prior 606. In some cases, encoder 602 may be the same as or different from encoder 402 or encoder 502, and decoder 614 may be the same as or different from decoder 403 or decoder 514. Arithmetic encoder 608 may be the same as or different from arithmetic coder 406 or arithmetic encoder 508, and arithmetic decoder 612 may be the same as or different from arithmetic decoder 426 or arithmetic decoder 512. 【0166】
[0132] The neural network compression system 600 can generate the latent code 604 (latent $z_i$) for the image 601. The neural network compression system 600 can use the latent code 604 and the latent prior 606 to encode the image 601 (image $x_i$) and generate a bitstream 610 that the receiver can use to produce the reconstructed image 620 (reconstruction $\hat{x}_i$). In some examples, the image 601 may represent a still image and / or a video frame associated with a sequence of frames (e.g., a video). 【0169】
[0133] In some cases, the neural network compression system 600 can be fine-tuned using an RDM-AE loss. The neural network compression system 600 can be trained by minimizing the rate-distortion-model rate (RDM) loss. In some cases, on the encoder side, the AE model can be fine-tuned on the image 601 (image $x_i$) using the RDM loss as follows: 【number】 【0171】
[0134] The fine-tuned encoder 602 can encode the image 601 (image $x_i$) to generate the latent code 604. In some cases, the fine-tuned encoder 602 can encode the image 601 (image $x_i$) using a single forward pass as follows: 【number】 The arithmetic encoder 608 can use the latent prior 606 to convert the latent code 604 into a bitstream 610 for the arithmetic decoder 612. Under the model prior 616, the arithmetic encoder 608 can entropy code the parameters of the fine-tuned decoder 614 and the fine-tuned latent prior 606 to generate a bitstream 611 containing the compressed parameters of the fine-tuned decoder 614 and the fine-tuned latent prior 606. In some examples, the bitstream 611 may contain updated parameters of the fine-tuned decoder 614 and the fine-tuned latent prior 606. The updated parameters may include, for example, parameter updates relative to a baseline decoder and latent prior, such as the decoder 614 and latent prior 606 before fine-tuning. 【0174】
[0135] In some cases, the fine-tuned latent prior 606 can be entropy coded under the model prior 616 as follows: 【number】 The fine-tuned decoder 614 can be entropy coded under the model prior 616 as follows: 【number】 The latent code 604 (latent $z_i$) can be entropy coded under the fine-tuned latent prior 606 as follows: 【number】 In some cases, on the decoder side, the fine-tuned latent prior 606 can be entropy decoded under the model prior 616 as follows: 【number】 The fine-tuned decoder 614 can be entropy decoded under the model prior 616 as follows: 【number】 The latent code 604 (latent $z_i$) can be entropy decoded under the fine-tuned latent prior 606 as follows: 【number】 【0186】
[0136] The decoder 614 can decode the latent code 604 (latent $z_i$) into an approximate reconstructed image 620 (reconstruction $\hat{x}_i$). In some examples, the decoder 614 can decode the latent code 604 using a single forward pass of the fine-tuned decoder, as follows: 【number】 【0190】
[0137] As described previously, the neural network compression system 600 can be trained by minimizing the RDM loss. In some cases, the rate can reflect the length of the bitstream b (e.g., bitstreams 610 and / or 611), the distortion can reflect the distortion between the input image 601 (image $x_i$) and the reconstructed image 620 (reconstruction $\hat{x}_i$), and the model rate can reflect the length of the bitstream used and / or required to send model updates (e.g., updated parameters) to the receiver (e.g., to decoder 614). The parameter β can be used to train the model for a specific rate-distortion ratio. 【0193】
[0138] In some examples, the loss for the data point x can be minimized at inference time as follows: 【number】 In some examples, the RDM loss can be expressed as follows: $L_{RDM}(x; \phi, \psi, \theta, \omega) = \mathbb{E}_{q_\phi(z|x)}\left[-\log p_\theta(x|z) - \beta \log p_\psi(z) - \beta \log p_\omega(\psi,\theta)\right]$. In some cases, the distortion D(x|z;θ) can be determined based on a loss function, such as, for example, the mean squared error (MSE). 【0198】
[0139] The term $-\log p_\theta(x|z)$ can indicate and / or represent the distortion D(x|z;θ). The term $-\beta \log p_\psi(z)$ can indicate and / or represent the rate $R_z(z;\psi)$ for sending the latents, and the term $-\beta \log p_\omega(\psi,\theta)$ can indicate and / or represent the rate $R_{\psi,\theta}(\psi,\theta;\omega)$ for sending the fine-tuned model updates. 【0199】
[0140] In some cases, the model prior 616 may reflect the bitrate overhead for sending model updates. In some examples, the bitrate for sending model updates may be described as follows: $R_{\psi,\theta}(\psi,\theta;\omega) = -\log p_\omega(\psi,\theta)$. In some cases, the model prior can be chosen so that sending the model without updates is inexpensive, i.e., the bit length (model rate loss) is small: 【number】 【0204】
[0141] In some cases, using the RDM loss function, the neural network compression system 600 may add bits to the bitstream for the model updates $R_{\psi,\theta}(\psi,\theta;\omega)$ only if the latent rate or distortion decreases by at least as many bits. This can boost rate-distortion (R / D) performance. For example, the neural network compression system 600 may increase the number of bits in the bitstream 611 for sending model updates if it can also reduce the rate or distortion by at least the same number of bits. In other cases, the neural network compression system 600 may add bits to the bitstream for the model updates $R_{\psi,\theta}(\psi,\theta;\omega)$ even if the latent rate or distortion does not decrease by at least the same number of bits. 【0209】
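The trade-off described above can be sketched as a simple comparison of RDM losses with and without a model update. The sketch assumes the loss takes the form distortion + β × (latent bits + model-update bits), in line with the three terms listed earlier; the function names, the dictionary layout, and all numeric values are hypothetical.

```python
def rdm_loss(distortion, latent_bits, model_bits, beta):
    """Rate-distortion-model loss: D + beta * (latent rate + model-update rate)."""
    return distortion + beta * (latent_bits + model_bits)


def should_send_update(baseline, finetuned, beta):
    """Send the model update only if it lowers the overall RDM loss."""
    return rdm_loss(**finetuned, beta=beta) < rdm_loss(**baseline, beta=beta)


beta = 0.001
baseline = {"distortion": 0.10, "latent_bits": 1000, "model_bits": 0}

# Update costs 150 bits but saves 200 latent bits: worth sending.
worthwhile = {"distortion": 0.10, "latent_bits": 800, "model_bits": 150}

# Update costs 150 bits but saves only 50 latent bits: not worth sending.
too_costly = {"distortion": 0.10, "latent_bits": 950, "model_bits": 150}

print(should_send_update(baseline, worthwhile, beta))
print(should_send_update(baseline, too_costly, beta))
```

In an actual fine-tuning loop this decision is not made explicitly; minimizing the RDM loss by gradient descent has the same effect, because any parameter change that costs more bits than it saves raises the loss.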
[0142] The neural network compression system 600 can be trained end-to-end. In some cases, the RDM loss can be minimized end-to-end at inference time. In some examples, a certain amount of compute can be spent once (e.g., fine-tuning the model), and a high compression ratio can be obtained thereafter without any extra cost to the receiver. For example, a content provider may spend a high amount of compute to extensively train and fine-tune the neural network compression system 600 for videos to be delivered to a large number of receivers. A highly trained and fine-tuned neural network compression system 600 can provide high compression performance for those videos. Having spent a high amount of compute, the video provider can store the updated parameters under the model prior and efficiently deliver the compressed video to each receiver for decompression. The video provider can achieve significant gains in compression on each transmission of video (as well as reductions in network and computing resources), which may significantly outweigh the initial compute costs of training and fine-tuning the model. 【0210】
[0143] Because video and images (e.g., high-resolution images) contain a large number of pixels, the training / learning and fine-tuning methods described above can be extremely beneficial for video compression and / or high-resolution images. In some cases, complexity and / or decoder computation may be used as additional considerations for overall system design and / or implementation. For example, a very small network that performs inference quickly can be fine-tuned. As another example, a cost term for receiver complexity may be added, which can force and / or allow the model to remove one or more layers. In some examples, machine learning can be used to learn a more complex model prior to achieve a greater gain. 【0211】
[0144] Model prior design can include various attributes. In some examples, the implemented model prior can include a model prior that assigns a high probability 【number】 to sending the model without updates, and thus a low bitrate 【number】. In some cases, the model prior can include a model prior that assigns non-zero probabilities to values around 【number】 and 【number】 so that, in practice, different instances of the fine-tuned model can be encoded. In some cases, the model prior can include a model prior that can be quantized at inference time and used to perform entropy coding. 【0220】
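The model prior attributes listed above (cheap to send no update, non-zero probability for nearby values, usable for entropy coding after quantization) can be illustrated with a toy discrete prior over quantized parameter deltas. The specific distribution below, a spike at zero with geometrically decaying tails, is an assumption made for illustration and is not taken from the source; the function names and the value `p_zero=0.9` are likewise hypothetical.

```python
import math


def delta_probability(k, p_zero=0.9):
    """Toy prior over an integer-quantized parameter delta k.

    Probability p_zero is placed on k == 0 (no update); the remaining mass
    is spread symmetrically with a geometric decay, so every delta keeps a
    non-zero probability and the distribution sums to 1.
    """
    if k == 0:
        return p_zero
    return (1.0 - p_zero) * 0.5 * 0.5 ** abs(k)


def model_update_bits(deltas):
    """Ideal entropy-coded size of the quantized deltas under the toy prior."""
    return sum(-math.log2(delta_probability(k)) for k in deltas)


no_update = model_update_bits([0] * 100)              # all parameters unchanged
sparse_update = model_update_bits([0] * 98 + [1, -2])  # two parameters changed

print(f"no update: {no_update:.1f} bits")
print(f"sparse update: {sparse_update:.1f} bits")
```

Under this toy prior, leaving all 100 parameters unchanged costs roughly 15 bits in total, while each changed parameter adds several bits, which is the behaviour the paragraph above asks of a model prior.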
[0145] Despite accelerated research and development, deep learning-based compression codecs (referred to as "neural codecs") have not yet been deployed in commercial or consumer applications. One reason for this is that neural codecs are not yet more robust than legacy codecs in terms of rate distortion. Furthermore, existing neural-based codecs present additional implementation challenges. For example, a neural-based codec requires a trained neural network at every receiver. Therefore, every user across different platforms must store an identical copy of such a neural network in order to implement the decoding function. Storing such a neural network consumes a considerable amount of memory, is difficult to maintain, and is susceptible to corruption. 【0221】
[0146] As described above, systems and techniques including implicit neural compression codecs that can address the problems described above are described herein. For example, aspects of the present disclosure include video compression codecs based on implicit neural representations (INRs), sometimes referred to as implicit neural models. As described herein, an implicit neural model can take coordinate positions (e.g., coordinates in an image or video frame) as input and output pixel values (e.g., color values for an image or video frame, such as red-green-blue (RGB) values for each coordinate position or pixel). In some cases, the implicit neural model may also be based on an IPB frame scheme. In some examples, the implicit neural model can modify the input data to model an optical flow, which is called an implicit neural optical flow (INOF). 【0222】
[0147] For example, an implicit neural model can model optical flow using an implicit neural representation in which local transformations may be element-wise additions. In some cases, optical flow can correspond to local transformations (e.g., pixel movement as a function of position). In some embodiments, optical flow can be modeled across video frames to improve compression performance. In some cases, an implicit model can model optical flow by adjusting the input coordinate positions to produce corresponding output pixel values. For example, element-wise addition of inputs may lead to local transformations in the output, which can eliminate the need for pixel movement and the associated computational complexity. In one exemplary example, transitions from a first frame with three pixels (e.g., P1|P2|P3) and a second frame with three pixels (e.g., P0|P1|P2) can be modeled by an implicit neural model by modifying the inputs (e.g., without the need to shift the pixel positions), such as by performing element-wise subtraction or addition. The following figure illustrates this example. 【0223】
[0148] 1|2|3→P1|P2|P3
[0149] 0|1|2→P0|P1|P2
[0150] As described above, an implicit neural model can take the coordinate positions of an image or video frame as input and can output pixel values for the image or video frame. In this case, the inputs (1|2|3 and 0|1|2) represent the input to the implicit neural model and contain coordinates in the image. The outputs (P1|P2|P3) and (P0|P1|P2) represent the output of the implicit neural model and can contain RGB values. Each of the two lines above (1|2|3→P1|P2|P3 and 0|1|2→P0|P1|P2) corresponds to the same model where the input changes by a value of "1", resulting in a corresponding shift in the output. In traditional optical flow, the machine learning model itself must shift the pixel positions from one frame to the next. Since implicit machine learning models take coordinates as input, the input may be preprocessed (before being processed by the codec) by subtracting 1 from every input value, in which case the output is shifted and thus effectively models optical flow. In some cases (for example, when an object in a frame moves in a particular direction), element-wise addition may be performed, where a value (for example, a value of 1) is added to the input value. 【0224】
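The coordinate-shift example above can be reproduced with a few lines of code. The lookup table standing in for a trained implicit network, and the names `implicit_model` and `render`, are assumptions for illustration; the point is only that the model stays fixed while the input coordinates are offset.

```python
# Stand-in for a trained implicit neural model: coordinate -> pixel value.
# A real model would be a neural network; a lookup table keeps the sketch short.
PIXELS = {0: "P0", 1: "P1", 2: "P2", 3: "P3"}


def implicit_model(coordinate):
    """Return the pixel value the (hypothetical) implicit model outputs."""
    return PIXELS[coordinate]


def render(coordinates, offset=0):
    """Evaluate the same model at each coordinate after adding `offset`."""
    return [implicit_model(c + offset) for c in coordinates]


frame_1 = render([1, 2, 3])             # inputs 1|2|3
frame_2 = render([1, 2, 3], offset=-1)  # inputs become 0|1|2

print(frame_1)
print(frame_2)
```

Subtracting 1 from every input coordinate shifts the rendered content by one position without moving any pixel data, which is how the element-wise modification of inputs models optical flow.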
[0151] In some examples, residuals can be modeled across frames along with weight updates of an implicit neural model. In some cases, this technique can be used to reduce the bitrate required to compress inter-predicted frames (e.g., unidirectionally predicted frames (P-frames) and / or bidirectionally predicted frames (B-frames)). In some examples, a convolution-based architecture can be used to process intra-coded frames (e.g., I-frames). A convolution-based architecture can be used to eliminate the decoding computation bottleneck of the implicit model, resulting in a model that is fast to encode and decode. In some embodiments, converting data to a bitstream can be done using post-training quantization for I-frames and quantization-aware training for P-frames and B-frames. 【0225】
[0152] In some cases, the model may be quantized and / or encoded to form a complete neural compression codec. In some examples, the model may be sent to the receiver. In some cases, the model may be fine-tuned on the P-frames and B-frames, and the converged updates may be sent to the receiver. In some embodiments, the model may be fine-tuned using sparsity-inducing priors and / or quantization-aware procedures that can minimize the bitrate for the P-frames and B-frames. Compared to existing neural compression codecs, implicit neural model-based neural compression codecs eliminate the requirement for a pre-trained network on the receiver side (and, in some cases, on the transmitter side). The performance of this technique is superior to that of previous INR-based neural codecs, with improved performance compared to older neural-based codecs for both image and video datasets. 【0226】
[0153] In some embodiments, implicit neural representation (INR) methods / models may be used for video and image compression. A video or image may be represented as a function which can be implemented as a neural network. In some examples, encoding an image or video may involve selecting an architecture and overfitting network weights to a single image or video. In some examples, decoding may involve a neural network forward pass. One challenge with implicit neural models used for compression is the computational efficiency of decoding. Most existing implicit neural models require one forward pass for each pixel in the input data. In some embodiments, the technique includes a convolutional architecture as a generalization of implicit neural representation models which can reduce the computational overhead associated with decoding high-resolution video or images and thus reduce decoding time and memory requirements. 【0227】
[0154] In some examples, the bitrate may be determined by the size of the stored model weights. In some cases, the model size may be reduced to improve the bitrate in order to improve the performance of the implicit neural methods disclosed herein. In some configurations, reducing the model size may be done by quantizing the weights and fitting a weight prior which can be used to losslessly compress the quantized network weights. 【0228】
[0155] In some cases, the technique can match the compression performance of state-of-the-art neural image and video codecs. One exemplary advantage of the codecs disclosed herein is that they can be implemented using a lightweight framework, eliminating the need for the receiver to store the neural network. Another advantage (compared to, for example, a neural codec such as scale-space flow (SSF)) is the absence of flow operations, which can be difficult to implement in hardware. Furthermore, the decoding function can be faster than that in standard neural codecs. In addition, the technique does not require a separate training dataset, as it can be implicitly trained using the data to be encoded (e.g., the current instance of an image, video frame, video, etc.). The construction of the implicit neural model described herein can help avoid potential privacy issues and works well for data from different domains, including those for which suitable training data is not available. 【0229】
[0156] In one example relating to neural compression codecs, neural video compression can be implemented using a variational or compressive autoencoder framework. Such a model is configured to optimize a rate-distortion (RD) loss as follows: 【0230】 【number】 【0231】
[0157] In this example, the encoder q_φ maps each instance x to a latent z, and the decoder p reconstructs the instance from the latent. Assuming a trained decoder is available on the receiver side, the transmitted bitstream contains the encoded latent z. Examples of this type of configuration include 3D convolutional architectures and IP-frame flow architectures, which condition each P-frame on the previous frame. Another example involves instance-adaptive fine-tuning, where the model is fine-tuned for each test instance, and the model is transmitted along with the latent. This method may offer advantages over the previous approach (e.g., robustness against domain shifts and reduced model size), but it still requires a pre-trained global decoder to be available on the receiver side. 【0232】
[0158] In another example relating to neural compression codecs, models may be used to compress images through their implicit representation as neural network weights. This configuration implements sinusoidal representation network (SIREN) based models with different numbers of layers and channels, quantizing them to 16-bit precision. The implicit neural codecs described herein differ from other systems that may use SIREN models for image compression tasks. For example, the implicit neural codecs described herein may include convolutional architectures with positional encoding, implement more advanced compression schemes including quantization and entropy coding, and perform video compression. 【0233】
[0159] In one example relating to implicit neural representations, implicit representations have been used to learn 3D structures and light-field views. In some cases, these representations allow a neural network to be trained on a single scene such that the scene is encoded by the network weights. New views of the scene can then be generated through the network's forward pass. In some embodiments, these methods can be more efficient than discrete equivalents because there is high redundancy in discrete representations when object data lies on a low-dimensional manifold in a high-dimensional coordinate frame, where each set of coordinates is associated with a value. In some examples, implicit neural representations can leverage such redundancy and thereby learn more efficient representations. 【0234】
[0160] Implicit representations can be applied to data with lower-dimensional coordinates, such as images and videos, but their relative efficiency compared to discrete or latent representations has not yet been determined. Furthermore, the performance of existing configurations using implicit representations has yet to match or exceed the performance of configurations using discrete representations or established compression codecs. 【0235】
[0161] Regardless of the dimensionality of the input data, it is important to select the correct class of representation. In some examples, Fourier domain features can help implicit neural models learn the structure of realistic scenes. For example, Fourier domain features have been implemented for natural language processing, where Fourier positional encoding of the words in a sentence has been shown to enable state-of-the-art language modeling using a fully attention-based architecture. Furthermore, with respect to implicit neural modeling of vision tasks, a configuration can use randomly sampled Fourier frequencies as an encoding before passing the input to the MLP model. Furthermore, in some configurations all MLP activations can be sinusoidal, provided that the weights are carefully initialized. 【0236】
[0162] In some examples, neural network quantization can be used to reduce model size in order to enable the model to run more efficiently on resource-constrained devices. Examples of neural network quantization include vector quantization, in which the quantized tensor can be represented using a codebook, and fixed-point quantization, which can represent the tensor with a fixed-point number including an integer tensor and a scaling factor. In fixed-point quantization, the quantization function may be defined as follows: 【0237】 【number】 【0238】
[0163] Here, θ_int is an integer tensor with b bits, and s is the scaling factor (or vector) in floating-point. In some embodiments, the symbol τ=(s,b) can be used to refer to the entire set of quantization parameters. 【0239】
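In one illustrative sketch, fixed-point quantization with parameters τ=(s,b) can be written as follows. The exact quantization function is not reproduced in this text, so the symmetric signed integer range and the round-then-clip order used here are assumptions rather than the disclosed formula, and the function name `quantize` is hypothetical.

```python
def quantize(weights, scale, bits):
    """Return (integer tensor, dequantized values) for a list of weights.

    Assumes a symmetric signed range [-2^(b-1), 2^(b-1) - 1]. Python's
    round() rounds exact halves to the nearest even integer.
    """
    q_min = -(2 ** (bits - 1))
    q_max = 2 ** (bits - 1) - 1
    ints = [max(q_min, min(q_max, round(w / scale))) for w in weights]
    return ints, [scale * q for q in ints]


# 1.9 / 0.25 = 7.6 rounds to 8 and is clipped to the 3-bit maximum of 3.
ints, dequant = quantize([0.26, -0.49, 1.9, -0.04], scale=0.25, bits=3)
```

Only the integer tensor and the scale need to be stored; the receiver recovers the approximate weights by multiplying the two.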
[0164] In some cases, low-bit quantization of weight tensors (e.g., all weight tensors) in a neural network can lead to considerable quantization noise. In quantization-aware training, neural networks can adapt to quantization noise by training them end-to-end using quantization operations. Since the rounding operation in Equation 2 is not differentiable, a straight-through estimator (STE) is usually used to approximate its gradient. In some cases, in addition to learning the scaling factor along with the network, it may also be possible to learn the per-tensor bit width for each layer. In some embodiments, the technique can formulate the quantization bit width as a rate loss, minimizing the RD loss to implicitly learn the best trade-off between bit rate and distortion in pixel space. 【0240】
[0165] Figure 7A shows an exemplary codec based on the implicit neural network compression system 700. In some embodiments, the implicit neural compression system 700 may include a pipeline for training an implicit compression model configured to optimize distortion and / or bitrate. In some examples, distortion can be minimized by training the weights w 706 of the implicit model Ψ(w) 704 against a distortion target. In some embodiments, the rate can be minimized by quantizing the weights 706 using the quantization function Q_τ(w) and by fitting the weight prior 【0241】 【number】 【0242】 712 over the quantized weights 708. In some cases, these components can be combined to form a single target that reflects the rate-distortion loss, as follows: 【0243】 【number】 【0244】
[0166] In some examples, the first step of "encoding" a data point x (corresponding to input image data 702, which may include one or more images) is to find the minimum of the loss in equation (3) for the data point (e.g., input image data 702). In some cases, the minimum loss can be obtained using a search and / or training algorithm. For example, as shown in Figure 7A, a coordinate grid 703 is input to the implicit neural model 704 in order to train the implicit neural model 704 on the transmitter side. Before training, the weights of the implicit model 704 are initialized to initial values. The initial values of the weights are used to process the coordinate grid 703 and to generate reconstructed output values for the input image data 702, which are represented as Ψ(Q_τ(w)) in equation (3). The actual input image data 702 being compressed can be used as a known output (or label), which is represented as data point x in equation (3). Using the reconstructed output values Ψ(Q_τ(w)) and the known output (data point x, which is input image data 702 in Figure 7A), the loss L_NIC(Ψ, w, τ, ω) can be determined. Based on the loss, the weights of the implicit model 704 can be adjusted (for example, using a backpropagation training technique). Such a process may be carried out for a number of iterations until the weights are adjusted so that a certain loss value (for example, a minimized loss value) is obtained. Once the implicit model 704 is trained, the weights w 706 from the implicit model 704 may be output, as shown in Figure 7A. On the receiver side, the coordinate grid 703 may be processed using the implicit model 704 configured with the decoded weights w after inverse quantization (or with the quantized weights 708). In some cases, the architecture parameters (Ψ(w)) of the implicit model 704 may be determined based on the decoding of the bitstream 720 by the architecture decoder 726. 【0245】
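In one simplified, non-limiting sketch, "encoding" by fitting weights to a single data point can be illustrated as follows. The model here is a toy linear model over fixed sinusoidal features of the coordinate, not the architecture of the implicit model 704, and the quantization and rate terms of the loss are omitted; the signal values, feature set, learning rate, and function names are all assumptions chosen for illustration.

```python
import math

# One row of pixel intensities plays the role of the data point x.
signal = [0.1, 0.5, 0.9, 0.7, 0.3, 0.2, 0.4, 0.6]
coords = [i / len(signal) for i in range(len(signal))]  # coordinate grid


def features(c):
    # Fixed sinusoidal features of the coordinate (assumed for illustration).
    return [1.0,
            math.sin(2 * math.pi * c), math.cos(2 * math.pi * c),
            math.sin(4 * math.pi * c), math.cos(4 * math.pi * c)]


def model(weights, c):
    return sum(w * f for w, f in zip(weights, features(c)))


def distortion(weights):
    # Mean squared error between the reconstruction and the known output.
    errors = [(model(weights, c) - y) ** 2 for c, y in zip(coords, signal)]
    return sum(errors) / len(signal)


def train(steps=200, lr=0.5):
    weights = [0.0] * 5  # initial weight values
    for _ in range(steps):
        grads = [0.0] * len(weights)
        for c, y in zip(coords, signal):
            err = model(weights, c) - y
            for j, f in enumerate(features(c)):
                grads[j] += 2 * err * f / len(signal)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


initial_loss = distortion([0.0] * 5)
weights = train()                       # transmitter side: fit to this signal only
final_loss = distortion(weights)
reconstruction = [model(weights, c) for c in coords]  # receiver side: forward pass
```

The trained `weights` are the only thing that would need to be transmitted; the receiver regenerates the coordinate grid and evaluates the model on it.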
[0167] In some embodiments, the first step may include determining the best implicit model 704 to use to compress the input image data 702 (from among the group of available implicit models) by exploring across network architectures Ψ(·) and training weights w 706 for each model by minimizing the distortion loss D without quantization. In some examples, this process may be used to select an implicit model 704. 【0246】
[0168] In some cases, the quantizer may be implemented to achieve an optimal distortion D based on the quantizer hyperparameter τ. In some embodiments, the implicit model Ψ(w) 704 may be fine-tuned based on the quantized weights 708. 【0247】
[0169] In some examples, the weight prior 712 may be fitted while keeping the quantizer parameters and implicit model weights (e.g., quantized weights 708 or weights 706) fixed. In some embodiments, the weight prior 712 may be used to determine the optimal setting (including weights w 706) that minimizes the rate loss R. 【0248】
[0170] In some embodiments, the implicit neural network compression system 700 can be used as an image or video codec, which can be configured to encode the weight prior parameters ω 712 in the bitstream 722 (using the prior encoder 714) and to encode the quantized weights 【0249】 【number】 【0250】 708 in the bitstream 724 using entropy coding (by an arithmetic encoder (AE) 710) under the weight prior 【0251】 【number】 【0252】 712. In some examples, decoding can be implemented in reverse. For example, on the receiver / decoder side, an arithmetic decoder (AD) 730 can decode the bitstream 724 by performing entropy decoding using the decoded weight prior (decoded by the prior decoder 728) to generate weights (e.g., weights 706 or quantized weights 708). Using the weights and a neural network model architecture (e.g., Ψ(w)), the implicit model 704 can generate output image data 732. In one example, once Ψ(·) and 【0253】 【number】 【0254】 are decoded, the reconstruction 【0255】 【number】 【0256】 can be obtained using a forward pass: 【0257】 【number】 【0258】
[0171] As described above, the implicit model 704 may include one or more neural network architectures that can be selected by training the weights w 706 and determining the minimum distortion. For example, the implicit model 704 may include a multilayer perceptron (MLP) that takes coordinates in an image as input and returns RGB values (or other color values) as follows: 【0259】 【number】 【0260】
[0172] In some embodiments, the implicit model 704 can implement a SIREN architecture in which a periodic activation function can be used to ensure that fine details in images and videos can be accurately represented. In some examples, decoding an image can involve evaluating the MLP at any pixel location (x,y) of interest. In some cases, since the representation is continuous, the representation can be trained or evaluated at different resolution settings or on any type of pixel grid (e.g., an irregular grid). 【0261】
[0173] In some cases, the implicit model 704 may include a convolutional network that can be used to improve the computational efficiency of the codec (e.g., especially on the receiver side). In some cases, an MLP-based implicit neural model may require a forward pass for each input pixel coordinate, which can result in many (e.g., about 2 million) forward passes to decode each frame of a 1080p resolution video. 【0262】
[0174] In some embodiments, an MLP-based implicit neural model can be considered a convolution operation using a 1x1 kernel. In some examples, the techniques described herein can be generalized to a convolution architecture for the implicit model. 【0263】
[0175] Unlike MLPs that process one coordinate at a time, this technique can lay out all coordinates at once, with the coordinate values in the channel axis. In some embodiments, this technique can use a 3x3 kernel for transposed convolution blocks and a stride value of 2 (for example, indicating that the convolution kernel or filter is moved 2 positions after each convolution operation), which can reduce the number of forward passes required to reconstruct the image by a factor of 2^(2L), where L is the number of convolutional layers. 【0264】
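As a worked illustration of this reduction, the following sketch counts the spatial positions evaluated at the network input for a per-pixel MLP and for a stack of stride-2 transposed convolutions. The 1920×1080 frame size and the choice of three layers are assumptions made only for the arithmetic, and the function name is hypothetical.

```python
def forward_passes(height, width, num_layers, stride=2):
    """Input positions needed to produce a height x width frame.

    Each stride-2 layer doubles both spatial dimensions, so L layers reduce
    the count by stride^(2L). Assumes the frame size divides evenly.
    """
    reduction = stride ** (2 * num_layers)
    return (height * width) // reduction


per_pixel = forward_passes(1080, 1920, num_layers=0)  # MLP: one pass per pixel
conv_3 = forward_passes(1080, 1920, num_layers=3)     # 2^(2*3) = 64x fewer
```

With these assumed values the per-pixel model needs 2,073,600 evaluations per frame, while three stride-2 layers need 32,400.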
[0176] In some examples, random Fourier encoding and the SIREN architecture can be generalized in this way. For example, the first layer in a convolutional architecture may include a positional encoding of the coordinates as follows: 【0265】 【number】 【0266】
[0177] Here, c and i are indices along the channel and spatial dimensions, and 【0267】 【number】 【0268】 are N_ω frequency samples from a Gaussian distribution. The standard deviation and the number of frequencies are hyperparameters. This positional encoding can be followed by alternating transposed convolutions and ReLU activations. 【0269】
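In one illustrative sketch, a random Fourier positional encoding of a coordinate can be computed as follows. The exact encoding formula is not reproduced in this text, so the common cosine/sine form over Gaussian-sampled frequencies is assumed here; the function names, the seed, and the hyperparameter values are hypothetical.

```python
import math
import random


def sample_frequencies(num_freqs, dim, std, seed=0):
    """Draw num_freqs frequency vectors from a zero-mean Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_freqs)]


def encode(coord, freqs):
    """Map one coordinate vector to 2 * len(freqs) channels (cos, then sin)."""
    phases = [sum(w * x for w, x in zip(freq, coord)) for freq in freqs]
    return [math.cos(p) for p in phases] + [math.sin(p) for p in phases]


# The number of frequencies and the standard deviation are hyperparameters.
freqs = sample_frequencies(num_freqs=4, dim=2, std=10.0)
channels = encode((0.25, 0.5), freqs)
origin = encode((0.0, 0.0), freqs)
```

Applying `encode` at every spatial position yields a feature map whose channel axis holds the encoded coordinates, which is the form a convolutional first layer would consume.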
[0178] In some embodiments, the convolutional models from this technique can easily process high-resolution images using an arbitrarily low number of forward passes, and thus increase the speed of both encoding and decoding. Such models can also be much more memory efficient at high bitrates. In some examples, training a 3x3 convolutional kernel at ultra-low bitrates can be implemented using different convolutional kernels (e.g., 1x1 and / or 3x3 convolutions in the pipeline). 【0270】
[0179] As described above, the input to the neural network compression system 700 may include image data 702 (for example, to train an implicit model), which may include video data. In some examples, the video data may have strong redundancy between subsequent frames. Existing video codecs often compress groups of pictures (GoPs) in such a way that each frame depends on the previous frame. In detail, the prediction of a new frame may be formulated as the sum of a warping of the previous frame and a residual. The technique can implement similar configurations for use with implicit neural compression schemes. In some cases, the implicit model has been shown to accurately represent the warping. In some embodiments, the technique can implicitly exploit temporal redundancy and share weights across frames. In some embodiments, the fully implicit method (disclosed herein) may have the advantages of conceptual simplicity and architectural freedom. 【0271】
[0180] In some cases, implicit video representation can be implemented using groups of pictures. For example, video may be divided into groups of N frames (or pictures), and each group may be compressed with a separate network. In some cases, this implementation reduces the expressiveness required of the implicit representation. In some cases, this implementation can enable buffered streaming, as only one small network needs to be sent before the next N frames can be decoded. 【0272】
[0181] In some embodiments, the implicit video representation can be implemented using a 3D MLP. For example, the MLP representation can be easily extended to video data by adding a third input representing the frame number (or time component) t. In some examples, the SIREN architecture can be used with sine activation. 【0273】
[0182] In some cases, implicit video representation can be implemented using a 3D convolutional network. As described above, a 3D MLP can be considered a 1×1×1 convolution operation. As with the 2D case, the technique can implement the 3D MLP as a convolution operation with a 3D kernel. To keep the number of parameters to a minimum, the technique can use a spatial kernel of size k×k×1, followed by a frame-wise kernel of shape 1×1×k'. 【0274】
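The parameter saving from this factorization can be illustrated with a short calculation. The channel counts and kernel sizes below are assumed values, bias terms are ignored, and the function names are hypothetical.

```python
def full_3d_params(c_in, c_out, k, k_t):
    """Weights in a single k x k x k' 3D convolution (biases ignored)."""
    return c_in * c_out * k * k * k_t


def factorized_params(c_in, c_out, k, k_t):
    """Weights in a k x k x 1 spatial kernel followed by a 1 x 1 x k' kernel."""
    spatial = c_in * c_out * k * k
    temporal = c_out * c_out * k_t
    return spatial + temporal


full = full_3d_params(c_in=16, c_out=16, k=3, k_t=3)
factorized = factorized_params(c_in=16, c_out=16, k=3, k_t=3)
```

For these assumed sizes the full kernel has 6,912 weights and the factorized pair has 3,072, which matters because every stored weight contributes to the bitrate.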
[0183] Regarding the Fourier encoding in Equation 5, the additional coordinate can be accounted for by setting x_i to [t, x, y] and introducing the corresponding extra frequencies. Since the temporal correlation scale and the spatial correlation scale are likely to be quite different, this technique allows the variance of the temporal frequencies to be a separate hyperparameter. A sequence of 3D transposed convolutions alternating with ReLU activations can process the position-encoded features into a video sequence. In some embodiments, implicit video representations can be implemented using time-modulated networks, which correspond to implicit representations that can adapt the representation to work on a set of data rather than a single instance. In some examples, such methods may include the use of hypernetworks, as well as latent-based methods. In some cases, the technique can use time-modulated networks to generalize the instance model to frames in a video (instead of a set of data points). In some examples, the technique can implement a synthesis-modulator composite network architecture for its conceptual simplicity and parameter-sharing efficiency. While previous implementations found that the SIREN MLP could not perform high-quality reconstruction at high resolutions and therefore split the image into overlapping spatial tiles for weight-sharing purposes, the technique implements a convolutional SIREN architecture that can generate high-resolution frames. In some cases, the technique can reserve modulation for the frame axis only. In this technique, the input to the model is still just the spatial coordinates (x,y), except that the kth layer of this network is given by: 【0275】 【number】 【0276】
[0184] Here, σ(·) is the activation function, F is a neural network layer that includes either a 3x3 convolution or a 1x1 convolution, z_t is a learnable latent vector for each frame, and g_k(·) represents the output of the kth layer of the modulation MLP. Element-wise multiplicative interactions allow for the modeling of complex temporal dependencies. 【0277】
[0185] In some examples, the implicit video representation can be implemented using IPB frame breakdown and / or configurations based on IP frame breakdown. Referring to Figure 9, a group of consecutive frames 902 can be encoded (for example, using IPB frame breakdown) by first compressing the middle frame as an I-frame. Then, starting with the trained I-frame implicit model, the model can be fine-tuned on the first and last frames as P-frames. In some examples, fine-tuning on the first and last frames can include using sparsity-inducing priors and quantization-aware fine-tuning to minimize the bitrate. In some embodiments, the remaining frames can be encoded as B-frames. In some examples, IPB frame breakdown can be implemented by initializing the model weights for a B-frame as an interpolation of the model weights on either side of the frame. In some cases, the overall bitstream can include the quantized parameters of the I-frame model encoded with a fitted model prior and quantized updates for the P-frames and B-frames encoded with a sparsity-inducing prior. In some examples, implicit video representation can be implemented using IP frame breakdown, as shown by frame 904 in Figure 9. 【0278】
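In one illustrative sketch, initializing a B-frame model from the models on either side of it can be written as follows. The text states only that an interpolation is used, so linear interpolation is an assumption here, and the function name, the flat weight lists, and the numeric values are hypothetical.

```python
def interpolate_weights(weights_before, weights_after, position=0.5):
    """Linearly interpolate two weight lists.

    position is the relative location of the B-frame between its two
    neighbouring frames (0.0 = earlier frame, 1.0 = later frame).
    """
    return [(1 - position) * a + position * b
            for a, b in zip(weights_before, weights_after)]


# Weights of the already-encoded frames on either side of the B-frame.
b_frame_init = interpolate_weights([0.0, 2.0], [1.0, 4.0])
```

Fine-tuning would then start from `b_frame_init`, so that only a small update relative to this starting point has to be encoded.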
[0186] Returning to Figure 7A, the neural network compression system 700 can implement a quantization algorithm that can be used to quantize the weights 706 to produce the quantized weights 708. In some embodiments, network quantization can be used to reduce the model size by quantizing every weight tensor w^(i) ∈ w using a fixed-point representation. In some cases, the quantization parameters and the bit width can be learned jointly, for example by learning the scale s and the clipping threshold q_max. The bit width b is then implicitly defined as b(s, q_max) = log2(q_max + 1). This parameterization is preferable to learning the bit width directly because it does not suffer from an unbounded gradient norm. 【0279】
[0187] In some cases, encoding the bitstream can include encoding all quantization parameters 【0280】 【number】 【0281】 and all integer tensors 【0282】 【number】 【0283】. Every scale s^(i) is encoded as a 32-bit floating-point variable, every bit width b^(i) is encoded as INT4, and every integer tensor 【0284】 【number】 【0285】 is encoded at its respective bit width b^(i). 【0286】
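As a worked illustration, the size of such a bitstream before entropy coding can be estimated as follows. The sketch assumes one scalar scale per tensor; the tensor sizes, bit widths, and function name are hypothetical.

```python
def raw_bits(tensor_sizes, bit_widths, scale_bits=32, width_bits=4):
    """Total bits for scales, bit widths, and integer tensors (no entropy coding)."""
    total = 0
    for num_weights, bits in zip(tensor_sizes, bit_widths):
        total += scale_bits            # scale s(i) as a 32-bit float
        total += width_bits            # bit width b(i) as INT4
        total += num_weights * bits    # integer tensor at b(i) bits per weight
    return total


# Two tensors: 100 weights at 8 bits and 50 weights at 4 bits.
size = raw_bits(tensor_sizes=[100, 50], bit_widths=[8, 4])
```

The per-tensor overhead of 36 bits is small next to the weight payload, which is why lowering the learned bit widths is the main lever on model size.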
[0188] In some embodiments, the neural network compression system 700 can implement entropy coding. For example, the final training stage may include the arithmetic encoder (AE) 710 fitting the prior over the weights (e.g., weights 706 or quantized weights 708) to generate a bitstream 724. As described above, on the receiver / decoder side, the arithmetic decoder (AD) 730 can perform entropy decoding using the decoded weight prior (decoded by the prior decoder 728) to decode the bitstream 724 and generate weights (e.g., weights 706 or quantized weights 708). Using the weights and the neural network model architecture, the implicit model 704 can generate output image data 732. In some cases, the weights may be approximately Gaussian-distributed around 0 for most tensors. In some examples, the scale of the weights may differ from tensor to tensor, but since the weight range is part of the (transmitted) quantization parameters 【0287】 【number】 【0288】, the weights can be normalized. In some cases, the network compression system 700 can then fit a Gaussian to the normalized weights and use it for entropy coding (for example, to produce bitstream 724). 【0289】
[0189] In some examples, some weights (e.g., weights 706 or quantized weights 708) are sparsely distributed. In the case of sparsely distributed weights, the neural network compression system 700 can transmit a binary mask which can be used to redistribute the probability mass to only the bins that have content. In some cases, a signal bit may be included to encode whether the mask is being transmitted. 【0290】
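In one illustrative sketch, the cost of entropy coding quantized weights under a fitted Gaussian can be estimated as follows. The sketch treats the integer tensor as the normalized weights, assigns each integer the Gaussian probability mass of a unit-width bin around it, and reports the ideal code length that an arithmetic coder would approach; these modeling choices and all names are assumptions.

```python
import math


def gaussian_cdf(x, mean, std):
    return 0.5 * (1 + math.erf((x - mean) / (std * math.sqrt(2))))


def code_length_bits(int_weights):
    """Ideal code length of integer weights under a Gaussian fitted to them."""
    n = len(int_weights)
    mean = sum(int_weights) / n
    variance = sum((w - mean) ** 2 for w in int_weights) / n
    std = max(math.sqrt(variance), 1e-6)   # guard against a constant tensor
    bits = 0.0
    for w in int_weights:
        p = gaussian_cdf(w + 0.5, mean, std) - gaussian_cdf(w - 0.5, mean, std)
        bits += -math.log2(max(p, 1e-12))  # guard against underflow to zero
    return bits


narrow = [0, 0, 1, -1, 0, 0, 1, 0, -1, 0]  # tightly clustered around zero
wide = [5, -7, 3, 0, -4, 8, -2, 6, -6, 1]  # spread over many bins
narrow_bits = code_length_bits(narrow)
wide_bits = code_length_bits(wide)
```

Weights concentrated near zero cost fewer bits than widely spread weights, which is why a well-fitted prior lowers the bitrate.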
[0190] Figure 7B shows an exemplary codec based on the implicit neural network compression system 700. In some embodiments, the implicit neural compression system 700 may include a pipeline for training implicit compression models configured to optimize distortion and / or bitrate. As described above with respect to Figure 7A, the first step may include determining the best implicit model 704 to use to compress the input image data 702 (from among the group of available implicit models) by exploring across network architectures Ψ(·) and training weights w706 for each model by minimizing distortion loss without quantization. In some examples, this process may be used to select the implicit model 704. In some examples, the implicit model 704 may be associated with one or more model characteristics, which may include model width, model depth, resolution, convolution kernel size, input dimension, and / or any other preferred model parameters or characteristics. 【0291】
[0191] In some embodiments, the receiver (e.g., decoder) has no prior knowledge of the network architecture Ψ(·) used to encode the input image data 702. In some cases, the implicit neural network compression system 700 may be configured to encode the model architecture Ψ(·) 718 in the bitstream 720 (using the architecture encoder 716). 【0292】
[0192] Figure 8A shows an example of pipeline 800 for a group of pictures using implicit neural representation. In some embodiments, pipeline 800 may be implemented by a video compression codec that can process images using a neural network that can map coordinates associated with input images (e.g., I-frame 802 and / or P1-frame 808) to pixel values (e.g., RGB values). In some examples, the output of pipeline 800 may include a compressed file with a header (e.g., used to identify the network architecture) and / or the weights of the neural network for the corresponding input frames. 【0293】
[0193] In some examples, pipeline 800 may include a base model 804 (for example, base model f_θ) that can be used to compress one or more image frames from a frame group associated with a video input. In some cases, the base model 804 may include an I-frame model that is trained using a first frame from the frame group. In some embodiments, training the base model 804 may include compressing a first frame (e.g., an I-frame) from the frame group by mapping input coordinate positions to pixel values (e.g., using equation (4)). 【0294】
[0194] In some embodiments, the size of the base model 804 can be reduced by quantizing one or more of the weight tensors associated with the base model 804. In some examples, the weight tensors can be quantized using a fixed-point quantization function, such as the function from equation (2). For example, equation (2) can be used to quantize the base model 804 to produce a quantized base model 806 (e.g., quantized base model f_Q(θ)). In some embodiments, the quantized base model 806 can be compressed (for example, using an arithmetic encoder) and sent to a receiver. 【0295】
[0195] In some examples, pipeline 800 may include a flow model 810 (e.g., flow model h_φ) that can be used to determine the optical flow field between two image frames (e.g., I frame 802 and P1 frame 808). For example, the flow model 810 may be configured to determine an optical flow field or motion vectors (e.g., a displacement vector field) between consecutive image frames from a video. In some embodiments, the flow model 810 may be trained using a second frame from the frame group (e.g., P1 frame 808). In some cases, the displacement vector field determined by the flow model 810 may be applied to the previous frame to model the current frame. In some embodiments, the displacement from the optical flow field can be expressed as h_φ(x,y)=(Δx,Δy). In some cases, the displacement from the optical flow field can be applied by adding the displacement vector to the input variables according to the following: 【0296】 【number】 【0297】
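In one illustrative sketch, composing a flow model with a base model by adding the displacement to the input coordinates can be written as follows. Both functions are hypothetical analytic stand-ins for trained networks, and the position-dependent flow is chosen only to show that different regions can move differently.

```python
def base_model(x, y):
    # Hypothetical stand-in for the base model: a simple analytic "image".
    return x + 10 * y


def flow_model(x, y):
    # Hypothetical stand-in for the flow model: returns a displacement
    # (dx, dy). Here only positions with x >= 2 are displaced.
    return (-1, 0) if x >= 2 else (0, 0)


def predict(x, y):
    """Evaluate the base model at the displaced coordinates."""
    dx, dy = flow_model(x, y)
    return base_model(x + dx, y + dy)


i_frame_row = [base_model(x, 0) for x in range(4)]
p_frame_row = [predict(x, 0) for x in range(4)]
```

The base model is evaluated unchanged; the predicted frame differs from the I-frame only because the flow model altered the coordinates fed into it.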
[0196] In some embodiments, the size of the flow model 810 can be reduced by quantizing one or more of the weight tensors associated with the flow model 810. In some examples, the weight tensors can be quantized using a fixed-point quantization function, such as the function from equation (2). For example, equation (2) can be used to quantize the flow model 810 to produce a quantized flow model 812 (e.g., quantized flow model h_Q(φ)). In some embodiments, the quantized flow model 812 can be compressed (for example, using an arithmetic encoder) and sent to a receiver. 【0298】
[0197] Figure 8B shows an example of pipeline 840 for a group of pictures using implicit neural representation. In some embodiments, pipeline 840 can represent a second pipeline phase that can follow pipeline 800. For example, pipeline 840 may be used to process and compress frames using a trained base model (e.g., base model 844) and a trained flow model (e.g., flow model 846). 【0299】
[0198] In some examples, pipeline 840 may be used to encode additional frames from a frame group by determining quantized updates of the parameters of the composite model. For example, pipeline 840 may be used to iterate over subsequent P frames (e.g., P1 frame 842) in sequence to learn base model weight updates δθ and flow model weight updates δφ relative to the previous frame. In some cases, the updates of the base model weights θ and flow model weights φ may be determined as follows: 【0300】 【number】 【0301】
[0199] In some embodiments, updated weights for base model 844 and flow model 846 can be sent to a receiver. In some cases, the weight updates δθ and δφ can be quantized on a fixed grid of n equal-sized bins of width t centered at δθ=0. In some examples, the weight updates can be entropy coded under a spike-and-slab prior, which is a mixture model of narrow and wide Gaussian distributions given by: 【0302】 【number】 【0303】
[0200] In some embodiments, the "slab" component used in equation (9), which has a wide variance 【0304】 【number】 【0305】, can minimize the bitrate required to send the updated weights to the receiver. In some cases, the "spike" component, which has a narrow standard deviation σ_spike << σ_slab, can minimize the cost associated with zero updates. In some examples, similar subsequent frames may have a sparse update δθ, which is associated with a relatively low bitrate cost. In some embodiments, the quantization grid parameters n and t, the prior standard deviations σ_spike and σ_slab, and the spike-to-slab ratio α are hyperparameters. As shown in Figure 8B, the receiver outputs the reconstructed P1 frame 850 (shown as frame 【0306】 【number】 【0307】) and the reconstructed I-frame 848 (shown as frame 【0308】 【number】 【0309】). 【0310】
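In one illustrative sketch, the cost of coding quantized weight updates under a spike-and-slab prior can be estimated as follows. The mixture formula is not reproduced in this text, so the normalized form (slab + α·spike) / (1 + α) is an assumption, as are the hyperparameter values, the bin-index representation of the updates, and all names.

```python
import math


def gaussian_cdf(x, std):
    return 0.5 * (1 + math.erf(x / (std * math.sqrt(2))))


def bin_probability(k, t, std_spike, std_slab, alpha):
    """Prior mass of the k-th bin of width t, with bin 0 centred on zero."""
    lo, hi = (k - 0.5) * t, (k + 0.5) * t
    spike = gaussian_cdf(hi, std_spike) - gaussian_cdf(lo, std_spike)
    slab = gaussian_cdf(hi, std_slab) - gaussian_cdf(lo, std_slab)
    return (slab + alpha * spike) / (1 + alpha)  # assumed mixture form


def update_bits(bins, **prior):
    """Ideal code length, in bits, of a list of quantized update bin indices."""
    return sum(-math.log2(bin_probability(k, **prior)) for k in bins)


prior = dict(t=0.01, std_spike=0.001, std_slab=0.1, alpha=100.0)
sparse_update = [0] * 9 + [3]                    # mostly unchanged weights
dense_update = [3, -2, 5, 1, -4, 2, -1, 3, -3, 4]
sparse_bits = update_bits(sparse_update, **prior)
dense_bits = update_bits(dense_update, **prior)
```

Under these assumed settings a zero update costs a small fraction of a bit while a nonzero update costs several bits, so frames that differ little from their predecessor are cheap to transmit.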
[0201] Figure 8C shows an example of pipeline 860 for a group of pictures using implicit neural representations. In some examples, pipeline 860 may include multiple stages configured to process each frame from the frame group. For example, pipeline 860 may include a first stage 866 that processes the first frame (e.g., I-frame 802) to produce a reconstructed first frame 【number】 872. 【0313】
[0202] In some embodiments, pipeline 860 may include a second stage 868 that processes the second frame (e.g., P1 frame 862) to produce a reconstructed second frame 【number】 874. In some examples, pipeline 860 may include a third stage 870 that processes a third frame (e.g., P2 frame 864) to produce a reconstructed third frame 【number】 876. Those skilled in the art will recognize that, according to this technology, pipeline 860 can be configured to have any number of stages. 【0318】
[0203] In some examples, each stage of pipeline 860 may include a base model (e.g., base model 804) and a flow model (e.g., flow model 810). In some embodiments, the input to the base model may be the element-wise sum of the input coordinates and the outputs of the current flow model and of previous versions of the flow model. In some cases, the additive flow models may be implemented as additional layers that can be added using skip connections. 【0319】
[0204] Figure 10 shows an exemplary process 1000 for performing implicit neural compression. In one embodiment, each block of process 1000 may be associated with equation 1002, which can be implemented in a neural network compression system (e.g., system 700) to minimize rate distortion. In some examples, equation 1002 may have the following form: 【0320】 【number】 【0321】
[0205] Referring to equation 1002, d can correspond to a distortion function (e.g., MSE, MS-SSIM), Ψ can correspond to an implicit model class (e.g., network type and architecture), Q_τ can correspond to a weight quantizer, w can correspond to implicit model weights, I can correspond to an input image or video, β can correspond to a trade-off parameter, and p_ω can correspond to a weight prior. 【0322】
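Equation 1002 is not reproduced in this text. As a hedged sketch, assuming it combines the distortion between the input and the output of the model with quantized weights with β times the code length of those quantized weights under the prior, the loss could be evaluated as below. The toy model, quantizer, and prior are hypothetical stand-ins chosen only to make the example runnable.

```python
import math

def mse(target, recon):
    """Mean squared error, one possible distortion function d."""
    return sum((a - b) ** 2 for a, b in zip(target, recon)) / len(target)

def rate_distortion_loss(image, weights, model, quantizer, log2_prior, beta):
    """Assumed form: d(I, model(Q(w))) + beta * (-sum of log2 p(Q(w)))."""
    q_weights = [quantizer(w) for w in weights]
    distortion = mse(image, model(q_weights))
    rate_bits = -sum(log2_prior(w) for w in q_weights)
    return distortion + beta * rate_bits

STEP = 0.25  # hypothetical quantization step

def toy_quantizer(w):
    """Round a weight to the nearest multiple of STEP."""
    return STEP * round(w / STEP)

def toy_log2_prior(w):
    """log2 of p(k) = (1/3) * 2**-|k| over grid indices k, which sums to 1."""
    return -math.log2(3) - abs(round(w / STEP))

def toy_model(q_weights):
    """Four-pixel signal produced by a linear ramp: offset + slope * x."""
    offset, slope = q_weights
    return [offset + slope * x for x in range(4)]
```

In this sketch a larger β penalizes weights that are expensive to code, trading reconstruction quality for a smaller bitstream.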
[0206] Referring to process 1000, in block 1004, the process includes finding the optimal function class or model architecture. In some embodiments, finding the optimal implicit model may include searching across network architectures and training the weights of each model by minimizing the distortion loss (e.g., without weight quantization). In some examples, the optimal model is selected based on the minimized distortion loss. In some cases, the search may include neural architecture search or Bayesian optimization techniques. 【0323】
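A minimal sketch of the selection step in block 1004 follows, assuming each candidate architecture is fitted without quantization and the one with the lowest distortion is kept. The piecewise-constant "model" is a hypothetical stand-in for training a neural network, and its `width` field is only an illustrative model characteristic.

```python
def mse(target, recon):
    """Mean squared error used as the distortion measure."""
    return sum((a - b) ** 2 for a, b in zip(target, recon)) / len(target)

def select_architecture(signal, candidates, fit_and_reconstruct, distortion):
    """Return the candidate whose unquantized fit has the lowest distortion."""
    best, best_d = None, float("inf")
    for arch in candidates:
        recon = fit_and_reconstruct(arch, signal)
        d = distortion(signal, recon)
        if d < best_d:
            best, best_d = arch, d
    return best, best_d

def piecewise_constant_fit(arch, signal):
    """Toy stand-in for training: average the signal over arch['width'] segments."""
    n, width = len(signal), arch["width"]
    recon = []
    for s in range(width):
        lo, hi = s * n // width, (s + 1) * n // width
        seg = signal[lo:hi]
        mean = sum(seg) / len(seg) if seg else 0.0
        recon.extend([mean] * len(seg))
    return recon
```

A practical search would likely also account for model size, since selecting on distortion alone tends to favor the largest candidate.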
[0207] In block 1006, process 1000 includes finding the optimal function parameters and / or weights. In some examples, finding the optimal weights may include using gradient descent or stochastic gradient descent to find the optimal weights. 【0324】
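For illustration, block 1006 can be sketched with plain gradient descent on a very small coordinate-to-value model. The three-term positional encoding and the linear model below are assumptions standing in for an implicit neural network; only the optimization pattern is the point.

```python
import math

def features(x):
    """Positional encoding of a scalar coordinate x in [0, 1)."""
    return [1.0, math.sin(2 * math.pi * x), math.cos(2 * math.pi * x)]

def predict(weights, x):
    """Value of the toy implicit model at coordinate x."""
    return sum(w * f for w, f in zip(weights, features(x)))

def fit_gradient_descent(coords, values, steps=200, lr=0.5):
    """Minimize the MSE between predict(weights, x) and the target values."""
    weights = [0.0, 0.0, 0.0]
    n = len(coords)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for x, y in zip(coords, values):
            err = predict(weights, x) - y
            for j, f in enumerate(features(x)):
                grad[j] += 2.0 * err * f / n  # d(MSE)/d(weight j)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights
```

Stochastic gradient descent would follow the same pattern but compute each gradient from a random subset of the coordinates.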
[0208] In block 1008, process 1000 includes finding the optimal quantization setting. In some embodiments, finding the optimal quantization setting may be done using a trainable quantizer (e.g., trained using a machine learning algorithm). In some examples, the quantization setting may be determined using codebook quantization, learned fixed-point quantization, and / or any other suitable quantization technique. 【0325】
[0209] In block 1010, process 1000 includes finding the optimal weight prior. In some cases, the optimal weight prior can be found by exploring different distribution types (e.g., Gaussian, Beta, Laplace, etc.). In some embodiments, finding the optimal weight prior may include fitting parameters of the weight distribution (e.g., mean and / or standard deviation) to minimize rate loss. In some examples, a binary mask may be included for transmission to a decoder that can provide instructions for weightless binary. 【0326】
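One way to picture block 1010, offered as a sketch rather than the described implementation, is to fit each candidate distribution to the weights by maximum likelihood and keep the one with the shortest code length. Only Gaussian and Laplace candidates are shown, and the code lengths are differential (they omit the per-weight term that depends on the quantization bin width, which is the same for both candidates).

```python
import math
import statistics

LOG2_E = math.log2(math.e)

def gaussian_bits(weights):
    """Code length in bits under a Gaussian fitted by maximum likelihood."""
    mu = statistics.fmean(weights)
    sigma = max(statistics.pstdev(weights, mu), 1e-12)
    return sum(
        0.5 * math.log2(2 * math.pi * sigma ** 2)
        + LOG2_E * (w - mu) ** 2 / (2 * sigma ** 2)
        for w in weights
    )

def laplace_bits(weights):
    """Code length in bits under a Laplace distribution fitted by maximum likelihood."""
    m = statistics.median(weights)
    b = max(statistics.fmean([abs(w - m) for w in weights]), 1e-12)
    return sum(math.log2(2 * b) + LOG2_E * abs(w - m) / b for w in weights)

def select_prior(weights):
    """Return the name and cost of the candidate prior with the lowest rate."""
    costs = {"gaussian": gaussian_bits(weights), "laplace": laplace_bits(weights)}
    name = min(costs, key=costs.get)
    return name, costs[name]
```

Weights that are mostly zero with a few large values tend to favor the Laplace fit in this sketch, while evenly spread weights tend to favor the Gaussian fit.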
[0210] In some examples, the steps in process 1000 may be executed sequentially or, where applicable, in parallel. In some embodiments, one or more parameters may permit backpropagation, which can allow one or more steps to be combined (for example, blocks 1006 and 1008 may be optimized jointly using gradient descent when a learnable quantizer is used). 【0327】
[0211] Figure 11 shows an exemplary process 1100 for performing implicit neural compression. In block 1102, process 1100 may include receiving input video data for compression by a neural network compression system. In some examples, the neural network compression system may be configured to perform video and image compression using implicit frame flow (IFF) based on implicit neural representations. For example, a full-resolution video sequence may be compressed by representing each frame using a neural network that maps coordinate positions to pixel values. In some embodiments, a separate implicit network may be used to modulate the coordinate input to enable motion compensation between frames (e.g., optical flow warping). In some examples, the IFF may be implemented such that the receiver does not need to have access to a pre-trained neural network. In some cases, the IFF may be implemented without the need for a separate training dataset (e.g., the network may be trained using input frames). 【0328】
[0212] In block 1104, process 1100 includes dividing the input video into frame groups (also called “groups of pictures” or “GoPs”). In some examples, a frame group may contain five or more frames. In some embodiments, the first frame in a frame group may be compressed as a standalone image (e.g., an I-frame), and the other frames in the frame group may be compressed using information available from other frames. For example, the other frames in a frame group may be compressed as P-frames that depend on the previous frame. In some embodiments, a frame may be compressed as a B-frame that depends on both the preceding and the succeeding frame. 【0329】
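The division into frame groups can be sketched as follows, assuming a fixed group size and treating every non-initial frame as a P-frame. The function name and the tuple layout are illustrative, and B-frames are omitted for brevity.

```python
def split_into_gops(num_frames, gop_size):
    """Return a list of GoPs; each GoP is a list of (frame_index, frame_type)."""
    gops = []
    for start in range(0, num_frames, gop_size):
        end = min(start + gop_size, num_frames)
        # The first frame of each group is coded standalone; the rest depend on it.
        gop = [(i, "I" if i == start else "P") for i in range(start, end)]
        gops.append(gop)
    return gops
```

With seven frames and a group size of five, this yields one full group and one shorter trailing group, each starting with its own I-frame.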
[0213] In block 1106, process 1100 includes training a base model (e.g., base model 【number】) on the I-frame. In some examples, training the base model on the I-frame may include minimizing distortion. In some embodiments, training the base model on the I-frame may be based on the following relation: 【0332】 【number】 【0333】
[0214] In equation (11), t can correspond to the frame index, x and y can correspond to coordinates within the video frame, I_{t,x,y} can correspond to the ground-truth RGB values at coordinate (x, y), f_{θ_t}(x, y) can correspond to the implicit neural network with weights θ_t evaluated at coordinate (x, y), Q_τ can correspond to a quantization function with parameter τ, and p_ω can correspond to a prior used to compress the quantized weights. 【0334】
[0215] In block 1108, process 1100 includes quantizing and entropy coding the I-frame weights θ_0 and writing them to a bitstream. In some embodiments, in order to reduce the model size of the implicit model representing the I-frame, each weight tensor θ^(l) ∈ θ can be quantized using a fixed-point representation (for example, using equation (2)). In some examples, the bit width can be implicitly defined as b(s, θ_max) = log2(θ_max + 1), where s can correspond to a scale and θ_max can correspond to a clipping threshold. In some examples, channel-by-channel quantization may be performed to obtain distinct ranges and bit widths for each row in the matrix. In one embodiment, a channel-by-channel mixed-precision quantization function may be defined as follows: 【0335】 【number】 【0336】
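Equations (2) and (12) are not reproduced in this text, so the following is only a sketch of per-channel fixed-point quantization under stated assumptions: each row is divided by its own scale, rounded, and clipped to an integer threshold of 2**bits - 1, so that the bit width equals log2(threshold + 1). Treating the threshold as an integer magnitude and leaving the sign out of the bit width are assumptions made here.

```python
def quantize_channel(row, scale, bits):
    """Quantize one channel (row) to integers clipped to +/- (2**bits - 1)."""
    int_max = 2 ** bits - 1  # so that bits == log2(int_max + 1)
    return [max(-int_max, min(int_max, round(v / scale))) for v in row]

def dequantize_channel(ints, scale):
    """Map the integer tensor back to fixed-point values."""
    return [scale * k for k in ints]

def quantize_per_channel(matrix, scales, bit_widths):
    """Apply a separate scale and bit width to each row of the weight matrix."""
    return [quantize_channel(r, s, b) for r, s, b in zip(matrix, scales, bit_widths)]
```

Giving each row its own scale and bit width lets rows with a small dynamic range use fewer bits than rows with a large one.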
[0216] In some embodiments, the quantization parameters 【number】 and the integer tensors 【number】 can be encoded into the video bitstream. For example, s^(l) can be encoded as a 32-bit floating-point vector, the bit widths b^(l) can be encoded as a 5-bit integer vector, and the integer tensor 【number】 can be encoded at its respective per-channel bit width 【number】. 【0345】
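The storage layout described above implies a simple size calculation before entropy coding, sketched here under the assumption that each channel carries one 32-bit scale and one 5-bit bit width, and that every integer in a channel uses that channel's bit width. Sign bits are not counted in this sketch, and the function name is illustrative.

```python
def quantized_tensor_bits(num_columns, bit_widths):
    """Raw size in bits of one quantized tensor, before entropy coding."""
    header = len(bit_widths) * (32 + 5)  # per-channel scale and bit width
    payload = sum(b * num_columns for b in bit_widths)  # integer entries
    return header + payload
```

For example, a two-channel tensor with three columns and bit widths 3 and 1 takes 2 × 37 header bits plus 9 + 3 payload bits, or 86 bits in total.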
[0217] In block 1110, process 1100 includes training a flow model (e.g., model 【number】) on the P-frame. In some embodiments, a P-frame can correspond to the next consecutive frame in a frame group (e.g., the first P-frame after an I-frame). As described above, optical flow can be implicitly modeled by leveraging the continuity between implicit representations. Using IFF, a frame can be represented as a network that takes image coordinates as input and returns pixel values, as follows: (x, y) → f_θ(x, y) = (r, g, b). In some embodiments, a displacement from the optical flow field h_φ(x, y) = (Δx, Δy) can be applied by adding the displacement vector to the input variables (for example, equation (7)). In some embodiments, training the flow model on P-frames can be based on the relation in equation (11). 【0348】
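The warping described here, in which the flow model's displacement is added to the input coordinates of the base model, can be sketched as follows. The two lambda models in the usage note are hypothetical placeholders for trained networks.

```python
def warped_pixel(base_model, flow_model, x, y):
    """Evaluate the base model at coordinates displaced by the flow model."""
    dx, dy = flow_model(x, y)
    return base_model(x + dx, y + dy)

def render(base_model, flow_model, width, height):
    """Reconstruct a frame by querying every pixel coordinate."""
    return [
        [warped_pixel(base_model, flow_model, x, y) for x in range(width)]
        for y in range(height)
    ]
```

For instance, with `base = lambda x, y: (x, y, 0)` and `flow = lambda x, y: (1.0, 0.0)`, the rendered frame is the base image shifted by one pixel along x, with no change to the base model's weights.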
[0218] In block 1112, process 1100 includes quantizing and entropy coding the P-frame weights φ_0 and writing them to a bitstream. In some embodiments, the P-frame weights φ_0 may be quantized and entropy coded using the methods described above with respect to the I-frame weights θ_0. For example, the P-frame weights φ_0 may be quantized using a fixed-point representation. In some cases, channel-by-channel quantization may be performed according to equation (12). In some embodiments, the quantization parameters and integer tensors may be written to or encoded in a bitstream and sent to a receiver. In some embodiments, learnable quantization parameters ω may also be encoded and written to the bitstream. 【0349】
[0219] In block 1114, process 1100 includes loading existing model parameters for processing the current frame P_t. In some embodiments, the current frame P_t can correspond to the next frame in a frame group. For example, the current frame can correspond to the frame following the I-frame and P-frame used to train the base model and flow model, respectively. In some embodiments, the existing model parameters can be expressed as the base model weights for the previous frame (e.g., θ_{t-1}) and the flow model weights for the previous frame (e.g., φ_{t-1}). 【0350】
[0220] In block 1116, process 1100 includes training the base model and the flow model with respect to the current frame. In some embodiments, training the base model and the flow model with respect to the current frame includes learning weight updates δθ and δφ with respect to the previous frame, as follows: 【0351】 【number】 【0352】
[0221] In some cases, the update to the base model can correspond to modeling a residual. In some cases, modeling updates can avoid retransmitting previously computed flow information (for example, the optical flow between consecutive frames is likely to be similar). In some embodiments, the implicit representation of P-frame T can be given by the following equation: 【0353】 【number】 【0354】
[0222] In some examples, as shown by equation (15), the cumulative effect of all previous flow models is stored in a single tensor that is the sum of the local displacements. In some cases, this tensor may be maintained by both the sender and the receiver. In some embodiments, using a single tensor can avoid the need to store previous versions of the flow network and to perform a forward pass through each of those networks for every frame. 【0355】
[0223] In some cases, the training for frame P_T can be expressed according to the following relation: 【0356】 【number】 【0357】
[0224] In equation (16), D_T can denote the distortion associated with frame P_T, and R(δθ, δφ) can denote the rate cost of the updates. 【0358】
[0225] In block 1118, process 1100 can include quantizing and entropy coding the weight updates δθ and δφ into a bitstream. In some examples, the updates δθ and δφ can be quantized on a fixed grid of n equally sized bins of width t centered at δθ = 0. In some aspects, the quantized weight updates can be entropy coded under a spike-and-slab prior as described with respect to equation (9). As described above, in some aspects, the "slab" component having variance 【number】 can minimize the bitrate for sending the updated weights to the receiver. In some cases, the "spike" component, having a narrow standard deviation σ_spike ≪ σ_slab, can minimize the processing cost associated with zero updates. 【0361】
[0226] In block 1120, process 1100 includes updating the model parameters for the base model and the flow model. In some aspects, the update of the model parameters can be expressed as θ_t ← θ_{t-1} + δθ and φ_t ← φ_{t-1} + δφ. In some cases, the update of the model parameters can be sent to the receiver. 【0362】
[0227] In block 1122, process 1100 includes updating a displacement tensor. In some aspects, the update of the displacement tensor can be expressed as Δ_t ← Δ_{t-1} + h_{φ_t}. 【0363】
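The parameter and displacement updates of blocks 1120 and 1122 can be sketched as a single step that both sender and receiver could apply. Storing the displacement tensor as a dictionary keyed by pixel coordinate, and evaluating the flow model at the original pixel coordinates, are assumptions of this sketch; `translation_flow` is a hypothetical toy flow model.

```python
def apply_frame_update(theta_prev, phi_prev, disp_prev, d_theta, d_phi, flow_model):
    """One P-frame step: update both weight sets, then accumulate displacement."""
    theta = [w + d for w, d in zip(theta_prev, d_theta)]  # theta_t = theta_{t-1} + d_theta
    phi = [w + d for w, d in zip(phi_prev, d_phi)]        # phi_t = phi_{t-1} + d_phi
    disp = {}
    for (x, y), (dx, dy) in disp_prev.items():
        hx, hy = flow_model(phi, x, y)                    # local displacement from h_{phi_t}
        disp[(x, y)] = (dx + hx, dy + hy)                 # running sum of displacements
    return theta, phi, disp

def translation_flow(phi, x, y):
    """Toy flow model: a constant shift (phi[0], phi[1]) at every coordinate."""
    return phi[0], phi[1]
```

Because only the running sum is kept, earlier flow weights do not need to be stored or re-evaluated when a later frame is reconstructed.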
[0228] In block 1124, process 1100 can determine if there are additional frames in the frame group (e.g., GoP). If there are additional frames to process (e.g., additional P frames), process 1100 can repeat the operations described with respect to blocks 1114-1122. Once the network compression system has finished processing the frame group, process 1100 can proceed to block 1126 and determine if there are any further frame groups associated with the video input. If there are additional frame groups to process, the method can return to block 1106 and begin training the base model with new I frames corresponding to the next frame group. If there are no additional frame groups, process 1100 can return to block 1102 and receive new input data for compression. 【0364】
[0229] Figure 12 is a flowchart of an exemplary process 1200 for processing media data. In block 1202, process 1200 may include receiving multiple images for compression by a neural network compression system. For example, an implicit neural network compression system 700 may receive image data 702. In some embodiments, the implicit neural network compression system 700 may be implemented using a pipeline 800, and the multiple images may include I-frames 802 and P1-frames 808. 【0365】
[0230] In block 1204, process 1200 may include determining a first set of weight values associated with a first model of a neural network compression system based on a first image from a set of images. For example, base model 804 may determine a first set of weight values (e.g., weight w706) based on I-frame 802. In some embodiments, at least one layer of the first model may include a positional encoding of a plurality of coordinates associated with the first image. For example, at least one layer of base model 804 may include a positional encoding of coordinates associated with I-frame 802. 【0366】
[0231] In some cases, the first model may be configured to determine one or more pixel values corresponding to a plurality of coordinates associated with the first image. For example, the base model 804 may be configured to determine one or more pixel values (e.g., RGB values) corresponding to a plurality of coordinates associated with the I-frame 802. 【0367】
[0232] In block 1206, process 1200 may include generating a first bitstream having compressed versions of a first plurality of weight values. For example, the arithmetic encoder 710 may generate a bitstream 724 which may contain compressed versions of a plurality of weight values (e.g., weight w706). In block 1208, process 1200 may include outputting the first bitstream for transmission to a receiver. For example, bitstream 724 may be output by the arithmetic encoder 710 for transmission to a receiver (e.g., arithmetic decoder 730). 【0368】
[0233] In some embodiments, process 1200 may include quantizing a first plurality of weight values under a weight prior to produce a plurality of quantized weight values. In some cases, the bitstream may include compressed versions of the plurality of quantized weight values. For example, weight w706 may be quantized under a weight prior 712 to produce quantized weight 708. In some examples, quantized weight 708 may be encoded in bitstream 724 by arithmetic encoder 710. In some embodiments, process 1200 may include entropy coding the first plurality of weight values using a weight prior. For example, arithmetic encoder 710 may encode quantized weight 708 in bitstream 724 using entropy coding under weight prior 712. 【0369】
[0234] In some cases, the weight prior may be selected to minimize the rate loss associated with sending the first bitstream to the receiver. For example, the weight prior 712 may be selected or configured to minimize the rate loss associated with sending the bitstream 724 to the receiver. In some examples, the first set of weight values may be quantized using fixed-point quantization. In some embodiments, fixed-point quantization may be implemented using machine learning algorithms. For example, the weight w706 may be quantized using fixed-point quantization, which allows the weight tensor to be represented as a fixed-point number including an integer tensor and a scaling factor. In some cases, the implicit neural network compression system 700 may implement the fixed-point quantization of the weight w706 using machine learning algorithms. 【0370】
[0235] In some embodiments, process 1200 may include determining a second plurality of weight values for use by a second model associated with a neural network compression system based on a second image from a plurality of images. For example, pipeline 800 may determine a second set of weight values for use by flow model 810 based on P1 frame 808. In some cases, process 1200 may include generating a second bitstream containing a compressed version of the second plurality of weight values and outputting the second bitstream for transmission to a receiver. For example, an arithmetic encoder (e.g., arithmetic encoder 710) may generate a bitstream that may contain a compressed version of the weight tensor used by flow model 810. 【0371】
[0236] In some examples, a second model may be configured to determine the optical flow between the first and second images. For example, a flow model 810 (e.g., flow model h_φ) may be used to determine the optical flow field between I-frame 802 and P1 frame 808. In some embodiments, process 1200 may include determining at least one updated weight value from a first set of weight values based on optical flow. For example, flow model 810 may determine the updated weight value from the weight values used by base model 804 based on optical flow. 【0372】
[0237] In some embodiments, the process 1200 may include selecting a model architecture corresponding to a first model based on a first image. In some cases, selecting a model architecture may include adjusting a plurality of weight values associated with one or more model architectures based on the first image, where each of the one or more model architectures is associated with one or more model characteristics. For example, the implicit neural compression system 700 may adjust the weights w706 for each model architecture based on the image data 702. In some examples, the one or more model characteristics may include at least one of width, depth, resolution, size of the convolutional kernel, and input dimension. 【0373】
[0238] In some cases, the process 1200 may include determining at least one distortion between the first image and the reconstructed data output corresponding to each of one or more model architectures. For example, the implicit neural compression system 700 may adjust the weights w706 associated with each model to minimize distortion loss without quantization. In some embodiments, the process 1200 may include selecting a model architecture from one or more model architectures based on the at least one distortion. For example, the implicit neural compression system 700 may select a model architecture based on the lowest distortion value. 【0374】
[0239] In some examples, process 1200 may include generating a second bitstream containing a compressed version of the model architecture and outputting the second bitstream for transmission to a receiver. For example, the architecture encoder 716 may encode the model architecture Ψ(·)718 in bitstream 720 and output bitstream 720 for transmission to a receiver (e.g., architecture decoder 726). 【0375】
[0240] Figure 13 is a flowchart of an exemplary process 1300 for processing media data. In block 1302, process 1300 may include receiving a compressed version of a first set of neural network weight values associated with a first image from a set of images. For example, the arithmetic decoder 730 may receive a bitstream 724 which may include a set of weight values (e.g., weight w706) associated with the image data 702. 【0376】
[0241] In block 1304, process 1300 may include decompressing the first set of neural network weight values. For example, an arithmetic decoder may decompress weights w706 from bitstream 724. In block 1306, process 1300 may include processing the first set of neural network weight values using a first neural network model to produce the first image. For example, the implicit neural compression system 700 may include a pipeline 800 having a quantized base model 806 which can be used to process the weight tensor to produce a reconstructed version of I-frame 802. 【0377】
[0242] In some embodiments, process 1300 may include receiving a compressed version of a second plurality of neural network weight values associated with a second image from a plurality of images. In some cases, process 1300 may include decompressing the second plurality of neural network weight values and processing the second plurality of neural network weight values using a second neural network model to determine the optical flow between the first and second images. For example, the implicit neural compression system 600 may include a pipeline 800 having a quantized flow model which can be used to process weight tensors associated with a flow model 810 to determine the optical flow between I-frame 802 and P1-frame 808. 【0378】
[0243] In some cases, process 1300 may include determining at least one updated weight value from a first set of neural network weight values associated with a first neural network model based on optical flow. For example, flow model 810 may determine updated weight values from the weights associated with flow model 810. In some embodiments, process 1300 may include using a first neural network model to process at least one updated weight value to produce a reconstructed version of a second image. For example, quantized base model 806 may use updated weights (e.g., based on optical flow) to produce a reconstructed version of P1 frame 808. 【0379】
[0244] In some examples, the first plurality of neural network weight values may be quantized under a weight prior. For example, the weights received by the quantized base model 806 may be quantized under a weight prior (e.g., weight prior 712). In some embodiments, a compressed version of the first plurality of network weight values is received in an entropy-encoded bitstream. For example, the arithmetic encoder 710 may perform entropy encoding of the weights (e.g., weight w706) or quantized weights (e.g., quantized weight 708) and output a bitstream 724. 【0380】
[0245] In some cases, process 1300 may include receiving a compressed version of the neural network architecture corresponding to the first neural network model. For example, the architecture encoder 716 may encode the model architecture Ψ(·)718 in the bitstream 720 and send it to the architecture decoder 726. 【0381】
[0246] Figure 14 is a flowchart of an exemplary process 1400 for compressing image data based on an implicit neural representation. In block 1402, process 1400 may include receiving input data for compression by a neural network compression system. In some embodiments, the input data may correspond to media data (e.g., video data, picture data, audio data, etc.). In some examples, the input data may include a plurality of coordinates corresponding to image data used to train the neural network compression system. 【0382】
[0247] In block 1404, process 1400 may include selecting, based on the input data, a model architecture for use by a neural network compression system to compress the input data. In some embodiments, selecting a model architecture may include adjusting a plurality of weight values associated with one or more model architectures based on the input data, where each of the one or more model architectures is associated with one or more model characteristics. In some examples, selecting a model architecture may also include determining at least one distortion between the input data and the reconstructed data output corresponding to each of the one or more model architectures. In some cases, selecting a model architecture from the one or more model architectures may be based on the at least one distortion. In some embodiments, the one or more model characteristics may include at least one of width, depth, resolution, size of the convolutional kernel, and input dimension. 【0383】
[0248] In block 1406, process 1400 may use the input data to determine multiple weight values corresponding to multiple layers associated with the model architecture. In block 1408, process 1400 may generate a first bitstream containing a compressed version of a weight prior. In some examples, generating the first bitstream may involve encoding the weight prior using the Open Neural Network Exchange (ONNX) format. In block 1410, process 1400 may generate a second bitstream containing a compressed version of the multiple weight values under the weight prior. In some embodiments, generating the second bitstream may involve entropy encoding the multiple weight values using the weight prior. In some examples, the weight prior may be selected to minimize the rate loss associated with sending the second bitstream to the receiver. 【0384】
[0249] In block 1412, process 1400 may include outputting the first bitstream and the second bitstream for transmission to a receiver. In some examples, the process may include generating a third bitstream having a compressed version of the model architecture and outputting the third bitstream for transmission to a receiver. In some embodiments, at least one layer of the model architecture includes a positional encoding of a plurality of coordinates associated with the input data. 【0385】
[0250] In some examples, the process may involve quantizing the multiple weight values to produce multiple quantized weight values, where the second bitstream comprises a compressed version of the multiple quantized weight values under the weight prior. In some embodiments, the multiple weight values may be quantized using learned fixed-point quantization. In some cases, the learned fixed-point quantization may be implemented using a machine learning algorithm. In some examples, the second bitstream may include multiple encoded quantization parameters used to quantize the multiple weight values. 【0386】
[0251] Figure 15 is a flowchart of an example of a process 1500 for decompressing image data based on an implicit neural representation. In block 1502, process 1500 may include receiving a compressed version of a weight prior and a compressed version of multiple weight values under the weight prior. In some embodiments, the multiple weight values under the weight prior may be received in an entropy-encoded bitstream. In block 1504, process 1500 may include decompressing the weight prior and the compressed version of the multiple weight values under the weight prior. 【0387】
[0252] In block 1506, process 1500 may include determining a plurality of neural network weights based on the weight prior and the plurality of weight values under the weight prior. In block 1508, process 1500 may include processing the plurality of neural network weights using a neural network architecture to produce reconstructed image content. In some embodiments, the plurality of weight values under the weight prior may correspond to a plurality of quantized weights under the weight prior. In some examples, the process may include receiving a plurality of encoded quantization parameters used to quantize the plurality of quantized weights under the weight prior. 【0388】
[0253] In some embodiments, the process may include receiving a compressed version of the neural network architecture and decompressing the compressed version of the neural network architecture. In some examples, the process may include redistributing the multiple weights under the weight prior based on a binary mask. 【0389】
[0254] In some examples, the processes described herein (for example, process 1100, process 1200, process 1300, process 1400, process 1500, and / or other processes described herein) may be carried out by a computing device or apparatus. In one example, processes 1100, 1200, 1300, 1400, and / or 1500 may be carried out by a computing device, such as the system 400 shown in Figure 4 or the computing system 1600 shown in Figure 16. 【0390】
[0255] The computing device may include any suitable device, such as a mobile device (e.g., a mobile phone), a desktop computing device, a tablet computing device, a wearable device (e.g., a VR headset, an AR headset, AR glasses, a network-connected watch or smartwatch, or other wearable device), a server computer, an autonomous vehicle or computing device for an autonomous vehicle, a robotic device, a television, and / or any other computing device having the resource capacity to perform the processes described herein, including process 1100, process 1200, process 1300, process 1400, process 1500, and / or other processes described herein. In some cases, the computing device or apparatus may include a variety of components, such as one or more input devices, one or more output devices, one or more processors, one or more microprocessors, one or more microcomputers, one or more cameras, one or more sensors, and / or one or more other components configured to perform the steps of the processes described herein. In some examples, the computing device may include a display, a network interface configured to transmit and / or receive data, any combination thereof, and / or one or more other components. The network interface can be configured to transmit and / or receive Internet Protocol (IP)-based data or other types of data. 【0391】
[0256] Components of a computing device may be implemented in a circuit. For example, a component may include and / or be implemented using one or more programmable electronic circuits (e.g., a microprocessor, a graphics processing unit (GPU), a digital signal processor (DSP), a central processing unit (CPU), and / or other suitable electronic circuits) to perform the various operations described herein, and / or may include and / or be implemented using computer software, firmware, or any combination thereof. 【0392】
[0257] Processes 1100, 1200, 1300, 1400, and 1500 are shown as logical flowcharts, and their operations represent sequences of operations that can be performed in hardware, computer instructions, or combinations thereof. In the context of computer instructions, an operation represents a computer executable instruction stored in one or more computer-readable storage media that, when executed by one or more processors, performs the described operation. Generally, computer executable instructions include routines, programs, objects, components, data structures, etc., that perform a particular function or implement a particular data type. The order in which operations are described is not to be interpreted as limiting, and any number of described operations can be combined in any order and / or in parallel to implement a process. 【0393】
[0258] Furthermore, processes 1100, 1200, 1300, 1400, 1500, and / or other processes described herein may be carried out under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) that is executed collectively on one or more processors, by hardware, or by a combination thereof. As described above, the code may be stored in a computer-readable or machine-readable storage medium in the form of a computer program comprising, for example, a number of instructions executable by one or more processors. The computer-readable or machine-readable storage medium may be non-transitory. 【0394】
[0259] Figure 16 shows an example of a system for implementing some aspects of the present technology. In particular, Figure 16 shows an example of a computing system 1600, which may be any computing device, remote computing system, camera, or any component thereof that constitutes an internal computing system, and the components of the system communicate with each other using connections 1605. Connections 1605 may be a physical connection to the processor 1610 using a bus, or a direct connection to the processor 1610 in a chipset architecture, etc. Connections 1605 may also be a virtual connection, a networked connection, or a logical connection. 【0395】
[0260] In some embodiments, the computing system 1600 is a distributed system in which the functions described herein may be distributed across a data center, multiple data centers, a peer network, etc. In some embodiments, one or more of the system components described represent many such components, each of which performs some or all of the functions described herein. In some embodiments, the components may be physical devices or virtual devices. 【0396】
[0261] An exemplary system 1600 includes at least one processing unit (CPU or processor) 1610 and the connection 1605, which couples various system components, including system memory 1615 such as read-only memory (ROM) 1620 and random access memory (RAM) 1625, to the processor 1610. The computing system 1600 may include a high-speed memory cache 1612 that is directly connected to the processor 1610, located in close proximity to the processor 1610, or integrated as part of the processor 1610. 【0397】
[0262] The processor 1610 may include any general-purpose processor and a hardware service or software service, such as services 1632, 1634, and 1636 stored in the storage device 1630, configured to control the processor 1610, as well as a special-purpose processor in which software instructions are incorporated into the actual processor design. The processor 1610 may be a fully self-contained computing system that includes multiple cores or processors, buses, memory controllers, caches, etc. A multicore processor may be symmetric or asymmetric. 【0398】
[0263] To enable user interaction, the computing system 1600 includes an input device 1645, which can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, or motion input. The computing system 1600 may also include an output device 1635, which may be one or more of several output mechanisms. In some cases, a multimodal system may allow a user to provide multiple types of inputs and outputs for communication with the computing system 1600. The computing system 1600 may include a communication interface 1640 that can generally control and manage user inputs and system outputs. 【0399】
[0264] The communication interface 1640 may perform or enable the reception and / or transmission of wired or wireless communications using wired and / or wireless transceivers, including those that use an audio jack / plug, microphone jack / plug, Universal Serial Bus (USB) port / plug, Apple® Lightning® port / plug, Ethernet® port / plug, fiber optic port / plug, proprietary wired port / plug, Bluetooth® wireless signal transmission, Bluetooth® Low Energy (BLE) wireless signal transmission, IBEACON® wireless signal transmission, Radio Frequency Identification (RFID) wireless signal transmission, Near Field Communication (NFC) wireless signal transmission, Dedicated Short Range Communication (DSRC) wireless signal transmission, 802.11 Wi-Fi wireless signal transmission, wireless local area network (WLAN) signal transmission, visible light communication (VLC), worldwide interoperability for microwave access (WiMAX®), infrared (IR) communication wireless signaling, public switched telephone network (PSTN) signaling, integrated services digital network (ISDN) signaling, 3G / 4G / 5G / LTE cellular data network wireless signaling, ad hoc network signaling, radio signaling, microwave signaling, infrared signaling, visible light signaling, ultraviolet light signaling, wireless signaling along the electromagnetic spectrum, or any combination thereof. 【0400】
[0265] The communication interface 1640 may also include one or more GNSS receivers or transceivers used to determine the location of the computing system 1600 based on the reception of one or more signals from one or more satellites associated with one or more global navigation satellite systems (GNSS). The GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russian-based Global Navigation Satellite System (GLONASS), the China-based Beidou Navigation Satellite System (BDS), and the European-based Galileo GNSS. There are no restrictions on operating on any particular hardware configuration, and therefore the basic features described herein can be easily substituted with improved hardware or firmware configurations as they are developed. 【0401】
[0266] The storage device 1630 may be a non-volatile and / or non-transitory and / or computer-readable memory device, and may be a hard disk or other type of computer-readable medium capable of storing computer-accessible data, such as a magnetic cassette, flash memory card, solid-state memory device, digital versatile disk, cartridge, floppy disk, flexible disk, hard disk, magnetic tape, magnetic strip / stripe, any other magnetic storage medium, flash memory, memory storage, any other solid-state memory, compact disc read-only memory (CD-ROM) optical disc, rewritable compact disc (CD) optical disc, digital video disc (DVD) optical disc, Blu-ray® disc (BDD) optical disc, holographic optical disc, another optical medium, Secure Digital (SD) card, microSecure Digital (microSD) card, Memory Stick® card, smart card chip, EMV chip, subscriber identification module (SIM) card, mini / micro / nano / pico SIM card, other integrated circuit (IC) chip / card, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM®), flash EPROM (FLASHEPROM), cache memory (L1 / L2 / L3 / L4 / L5 / L#), resistive random access memory (RRAM® / ReRAM), phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), other memory chips or cartridges, and / or combinations thereof. 【0402】
[0267] The storage device 1630 may include software services, servers, services, etc., that cause the system to perform a function when the code defining such software is executed by the processor 1610. In some embodiments, a hardware service that performs a particular function may include a software component stored on a computer-readable medium in connection with the hardware components needed to carry out that function, such as the processor 1610, connection 1605, output device 1635, etc. The term “computer-readable medium” includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying (one or more) instructions and / or data. The computer-readable medium may include non-transitory media on which data can be stored, and it does not include carrier waves and / or transitory electronic signals that propagate wirelessly or via wired connections. Examples of non-transitory media include, but are not limited to, magnetic disks or tapes, optical storage media such as compact discs (CDs) or digital versatile discs (DVDs), flash memory, memory, or memory devices. Computer-readable media may store code and / or machine-executable instructions thereon, which may represent procedures, functions, subprograms, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. Code segments may be coupled to other code segments or hardware circuits by passing and / or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., may be passed, forwarded, or transmitted via any suitable means, including memory sharing, message passing, token passing, network transmission, etc. 【0403】
[0268] In some embodiments, computer-readable storage devices, media, and memory may include a cable signal or wireless signal containing a bitstream or the like. However, when so stated, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se. 【0404】
[0269] Specific details are provided in the above description in order to provide a thorough understanding of the embodiments and examples provided herein. However, those skilled in the art will understand that embodiments may be carried out without these specific details. For the sake of clarity of description, in some cases the present technology may be presented as including individual functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software. Additional components other than those shown in the figures and / or described herein may be used. For example, circuits, systems, networks, processes, and other components may be shown as components in the form of block diagrams in order not to obscure the embodiments with unnecessary details. In other cases, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary details in order to avoid obscuring the embodiments. 【0405】
[0270] Individual embodiments may be described above as processes or methods shown as flowcharts, flow diagrams, data flow diagrams, structural diagrams, or block diagrams. While flowcharts may describe operations as sequential processes, many operations may be performed in parallel or simultaneously. Furthermore, the order of operations may be rearranged. When the operations of a process are completed, the process terminates, but it may have additional steps not shown in the diagram. A process may correspond to a method, function, procedure, subroutine, subprogram, etc. When a process corresponds to a function, its termination may correspond to the function returning to a calling function or main function. 【0406】
[0271] The processes and methods described above may be implemented using computer-executable instructions that are stored on, or otherwise available from, a computer-readable medium. Such instructions may include, for example, instructions and data that cause a general-purpose computer, a special-purpose computer, or a processing device to perform a certain function or group of functions, or that otherwise configure it to do so. Portions of the computer resources used may be accessible over a network. Computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and / or information created during the methods described above include magnetic or optical disks, flash memory, USB devices with non-volatile memory, and networked storage devices. 【0407】
[0272] Devices implementing the processes and methods described herein may include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented in software, firmware, middleware, or microcode, program code or code segments (e.g., computer program products) for performing the required tasks may be stored in computer-readable or machine-readable media. One or more processors may perform the required tasks. Typical examples of form factors include laptops, smartphones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rack-mount devices, and standalone devices. The functions described herein may also be embodied in peripherals or add-in cards. Such functions may also, as a further example, be implemented on a circuit board between different chips or different processes running in a single device. 【0408】
[0273] Instructions, a medium for transmitting such instructions, computing resources for executing them, and other structures for supporting such computing resources are exemplary means for providing the functionality described herein. 【0409】
[0274] While the above description has described aspects of this application with reference to specific embodiments thereof, those skilled in the art will recognize that this application is not limited thereto. Accordingly, although exemplary embodiments of this application are described in detail herein, it should be understood that, except as limited by the prior art, the inventive concept may be embodied and employed in various ways, and the appended claims should be interpreted to include such variations. The various features and aspects of the applications described above may be used individually or together. Furthermore, the embodiments may be used in any number of environments and applications other than those described herein without departing from the broader spirit and scope of this specification. Accordingly, this specification and the drawings should be considered illustrative and not restrictive. For illustrative purposes, the methods have been described in a specific order. It should be understood that in alternative embodiments, the methods may be carried out in an order different from that described. 【0410】
[0275] Those skilled in the art will understand that the symbols or terms used herein for less than ("<") and greater than (">") may be replaced, without departing from the scope of this specification, with the symbols for less than or equal to ("≦") and greater than or equal to ("≧"). 【0411】
[0276] When a component is described as “configured to” perform some operations, such configuration can be achieved, for example, by designing an electronic circuit or other hardware to perform the operations, by programming a programmable electronic circuit (e.g., a microprocessor or other suitable electronic circuit) to perform the operations, or by any combination thereof. 【0412】
[0277] The phrase “coupled to” refers to any component that is physically connected to another component, either directly or indirectly, and / or any component that communicates with another component, either directly or indirectly (for example, connected to the other component via a wired or wireless connection and / or other suitable communication interface). 【0413】
[0278] Claim language or other language that states “at least one of” a set and / or “one or more” of a set indicates that one member of a set or multiple members of a set (in any combination) satisfy the claim. For example, claim language that states “at least one of A and B” or “at least one of A or B” means A, B, or A and B. In another example, claim language that states “at least one of A, B, and C” or “at least one of A, B, or C” means A, B, C, or A and B, or A and C, or B and C, or A and B and C. The language “at least one of” a set and / or “one or more” of a set does not limit the set to the items listed in the set. For example, claim language stating "at least one of A and B" or "at least one of A or B" could mean A, B, or A and B, and could also include items not listed in the set of A and B. 【0414】
[0279] Various exemplary logic blocks, modules, circuits, and algorithmic steps described in relation to the examples disclosed herein may be implemented as electronic hardware, computer software, firmware, or a combination thereof. To clearly illustrate this interchangeability of hardware and software, various exemplary components, blocks, modules, circuits, and steps have been described above in general terms with respect to their function. Whether such function is implemented as hardware or as software depends on the specific application and the design constraints imposed on the overall system. A person skilled in the art may implement the described function in various ways for each specific application, but such implementation decisions should not be construed as causing a departure from the scope of this application. 【0415】
[0280] The techniques described herein may also be implemented in electronic hardware, computer software, firmware, or any combination thereof. Such techniques may be implemented in any of a variety of devices, such as general-purpose computers, wireless communication device handsets, or integrated circuit devices having multiple applications, including applications in wireless communication device handsets and other devices. Features described as modules or components may be implemented together in an integrated logic device, or separately as individual but interoperable logic devices. When implemented in software, the techniques may be at least partially realized by a computer-readable data storage medium having program code that, when executed, performs one or more of the methods, algorithms, and / or operations described above. The computer-readable data storage medium may form part of a computer program product, which may include packaging materials. Computer-readable media may comprise memory or data storage media such as random access memory (RAM) including synchronous dynamic random access memory (SDRAM), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), flash memory, and magnetic or optical data storage media. The technique may be at least partially implemented by a computer-readable communication medium, such as propagating signals or radio waves, that carries or communicates program code in the form of instructions or data structures and can be accessed, read, and / or executed by a computer. 【0416】
[0281] The program code may be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated circuits or discrete logic circuits. Such a processor may be configured to implement any of the techniques described herein. A general-purpose processor may be a microprocessor, but alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors working with a DSP core, or any other such configuration. Accordingly, the term “processor” as used herein may refer to any of the above structures, any combination thereof, or any other structure or device suitable for implementing the techniques described herein. 【0417】
[0282] Illustrative examples of this disclosure include: 【0418】
[0283] Embodiment 1: An apparatus comprising at least one memory and at least one processor coupled to the at least one memory, wherein the at least one processor is configured to receive a plurality of images for compression by a neural network compression system, determine a first plurality of weight values associated with a first model of the neural network compression system based on a first image from the plurality of images, generate a first bitstream having a compressed version of the first plurality of weight values, and output the first bitstream for transmission to a receiver. 【0419】
[0284] Embodiment 2: The apparatus according to Embodiment 1, wherein at least one layer of the first model includes a positional encoding of a plurality of coordinates associated with the first image. 【0420】
[0285] Embodiment 3: The apparatus according to Embodiment 2, wherein the first model is configured to determine one or more pixel values corresponding to a plurality of coordinates associated with the first image. 【0421】
[0286] Embodiment 4: The apparatus according to any one of embodiments 1 to 3, wherein at least one processor is further configured to determine a second set of weight values for use by a second model associated with a neural network compression system based on a second image from a set of images, generate a second bitstream having a compressed version of the second set of weight values, and output the second bitstream for transmission to a receiver. 【0422】
[0287] Embodiment 5: The apparatus according to Embodiment 4, wherein a second model is configured to determine the optical flow between a first image and a second image. 【0423】
[0288] Embodiment 6: The apparatus according to Embodiment 5, wherein at least one processor is further configured to determine at least one updated weight value from a first plurality of weight values based on optical flow. 【0424】
[0289] Embodiment 7: The apparatus according to any one of Embodiments 1 to 6, wherein the at least one processor is further configured to quantize the first plurality of weight values under a weight prior to produce a plurality of quantized weight values, wherein the first bitstream comprises a compressed version of the plurality of quantized weight values. 【0425】
[0290] Embodiment 8: The apparatus according to Embodiment 7, wherein the weight prior is selected to minimize a rate loss associated with sending the first bitstream to the receiver. 【0426】
[0291] Embodiment 9: The apparatus according to any one of Embodiments 7 to 8, wherein the at least one processor is further configured to entropy encode the first plurality of weight values using the weight prior in order to generate the first bitstream. 【0427】
[0292] Embodiment 10: The apparatus according to any one of Embodiments 7 to 9, wherein a first plurality of weight values are quantized using fixed-point quantization. 【0428】
[0293] Embodiment 11: The apparatus according to Embodiment 10, wherein fixed-point quantization is implemented using a machine learning algorithm. 【0429】
[0294] Embodiment 12: The apparatus according to any one of embodiments 1 to 11, wherein at least one processor is further configured to select a model architecture corresponding to a first model based on a first image. 【0430】
[0295] Embodiment 13: The apparatus according to Embodiment 12, wherein at least one processor is further configured to generate a second bitstream having a compressed version of the model architecture and to output the second bitstream for transmission to a receiver. 【0431】
[0296] Embodiment 14: The apparatus according to any one of Embodiments 12 to 13, wherein the at least one processor is further configured to adjust a plurality of weight values associated with one or more model architectures based on the first image, each of the one or more model architectures being associated with one or more model characteristics, to determine at least one distortion between the first image and a reconstructed data output corresponding to each of the one or more model architectures, and to select the model architecture from the one or more model architectures based on the at least one distortion. 【0432】
[0297] Embodiment 15: The apparatus according to Embodiment 14, wherein one or more model characteristics include at least one of width, depth, resolution, size of the convolutional kernel, and input dimension. 【0433】
[0298] Embodiment 16: A method for performing any of the operations described in Embodiments 1 to 15. 【0434】
[0299] Embodiment 17: A computer-readable storage medium that stores instructions that, when executed, cause one or more processors to perform any of the operations described in Embodiments 1 to 15. 【0435】
[0300] Embodiment 18: An apparatus comprising means for performing any of the operations described in Embodiments 1 to 15. 【0436】
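As a non-limiting illustration of Embodiments 2 and 3, the following minimal Python sketch shows how a coordinate-based model might map an encoded pixel coordinate to a pixel value. Everything specific in it is an assumption made for illustration and is not taken from this disclosure: the sinusoidal form of the coordinate encoding, the number of frequencies, the layer sizes, the ReLU and sigmoid activations, and the function names `positional_encoding`, `predict_pixel`, and `render`. The weights are random rather than fitted to an image, so the rendered output only demonstrates the data flow.

```python
import math
import random

def positional_encoding(x, y, num_freqs=4):
    """Map a 2-D coordinate in [-1, 1] to sin/cos features at octave frequencies."""
    feats = []
    for k in range(num_freqs):
        freq = (2.0 ** k) * math.pi
        for v in (x, y):
            feats.append(math.sin(freq * v))
            feats.append(math.cos(freq * v))
    return feats  # length is 4 * num_freqs

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

def dense(layer, inputs):
    """Apply one fully connected layer: outputs = W * inputs + b."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, inputs)) + bi for row, bi in zip(w, b)]

def predict_pixel(layers, x, y, num_freqs=4):
    """Evaluate the coordinate network at (x, y) and return an RGB triple."""
    h = positional_encoding(x, y, num_freqs)
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in dense(layer, h)]  # ReLU on hidden layers
    out = dense(layers[-1], h)
    return [1.0 / (1.0 + math.exp(-v)) for v in out]  # sigmoid output

def render(layers, height, width, num_freqs=4):
    """Query the network at every pixel centre to rebuild an image."""
    image = []
    for r in range(height):
        y = 2.0 * (r + 0.5) / height - 1.0
        row = []
        for c in range(width):
            x = 2.0 * (c + 0.5) / width - 1.0
            row.append(predict_pixel(layers, x, y, num_freqs))
        image.append(row)
    return image

# Hypothetical sizes: 16 encoded inputs, two hidden layers of 16 units, 3 outputs.
rng = random.Random(0)
NUM_FREQS = 4
layers = [
    init_layer(4 * NUM_FREQS, 16, rng),
    init_layer(16, 16, rng),
    init_layer(16, 3, rng),
]
image = render(layers, 4, 6, NUM_FREQS)
```

In such a scheme the weights held in `layers` are the quantity that would be fitted to a given image and then compressed into the bitstream, while the coordinate grid itself never needs to be transmitted because the receiver can regenerate it from the image dimensions.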
[0301] Embodiment 19: An apparatus comprising at least one memory and at least one processor coupled to the at least one memory, wherein the at least one processor is configured to receive a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images, decompress the first plurality of neural network weight values, and process the first plurality of neural network weight values to produce a first image using a first neural network model. 【0437】
[0302] Embodiment 20: The apparatus according to Embodiment 19, further configured to receive a compressed version of a second plurality of neural network weight values associated with a second image from a plurality of images, decompress the second plurality of neural network weight values, and process the second plurality of neural network weight values using a second neural network model to determine the optical flow between the first image and the second image. 【0438】
[0303] Embodiment 21: The apparatus according to Embodiment 20, wherein at least one processor is further configured to determine at least one updated weight value from a first set of neural network weight values associated with a first neural network model, based on optical flow. 【0439】
[0304] Embodiment 22: The apparatus according to Embodiment 21, wherein at least one processor is further configured to process at least one updated weight value to produce a reconstructed version of a second image using a first neural network model. 【0440】
[0305] Embodiment 23: The apparatus according to any one of Embodiments 19 to 22, wherein the first plurality of neural network weight values are quantized under a weight prior. 【0441】
[0306] Embodiment 24: The apparatus according to any one of embodiments 19 to 23, wherein a compressed version of a first plurality of neural network weight values is received in an entropy-encoded bitstream. 【0442】
[0307] Embodiment 25: The apparatus according to any one of embodiments 19 to 24, wherein at least one processor is further configured to receive a compressed version of a neural network architecture corresponding to a first neural network model. 【0443】
[0308] Embodiment 26: A method for performing any of the operations described in Embodiments 19 to 25. 【0444】
[0309] Embodiment 27: A computer-readable storage medium that stores instructions that, when executed, cause one or more processors to perform any of the operations described in Embodiments 19 to 25. 【0445】
[0310] Embodiment 28: An apparatus comprising means for performing any of the operations described in Embodiments 19 to 25. 【0446】
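The receiver-side operations of Embodiments 19, 23, and 24 (receive compressed weight values, decompress them, and process them with a model to produce an image) can be sketched as a round trip in Python. This is an illustration under stated assumptions rather than the disclosed codec: `zlib` stands in for the entropy coder, the 8-bit fixed-point format with `FRAC_BITS` fractional bits is hypothetical, the two-byte count header is invented for the example, and the one-layer grayscale model in `reconstruct` is a deliberately tiny placeholder for the first neural network model.

```python
import math
import struct
import zlib

FRAC_BITS = 6  # assumed fixed-point format: value = integer / 2**FRAC_BITS

def encode_weights(weights):
    """Sender side: quantize to 8-bit fixed point, then compress the bytes."""
    scale = 1 << FRAC_BITS
    q = [max(-128, min(127, round(w * scale))) for w in weights]
    payload = struct.pack(f"<H{len(q)}b", len(q), *q)  # count header + int8 values
    return zlib.compress(payload)  # zlib is only a stand-in for an entropy coder

def decode_weights(bitstream):
    """Receiver side: decompress, unpack the integers, and dequantize."""
    payload = zlib.decompress(bitstream)
    (count,) = struct.unpack_from("<H", payload, 0)
    q = struct.unpack_from(f"<{count}b", payload, 2)
    return [v / (1 << FRAC_BITS) for v in q]

def features(x, y):
    """Hypothetical coordinate features plus a constant bias input."""
    return [math.sin(math.pi * x), math.cos(math.pi * x),
            math.sin(math.pi * y), math.cos(math.pi * y), 1.0]

def reconstruct(weights, height, width):
    """Evaluate a one-layer coordinate model at every pixel centre."""
    image = []
    for r in range(height):
        y = 2.0 * (r + 0.5) / height - 1.0
        row = []
        for c in range(width):
            x = 2.0 * (c + 0.5) / width - 1.0
            z = sum(w * f for w, f in zip(weights, features(x, y)))
            row.append(1.0 / (1.0 + math.exp(-z)))  # grayscale intensity
        image.append(row)
    return image

sent = [0.75, -0.5, 0.3, 1.2, -0.1]   # illustrative weight values
bitstream = encode_weights(sent)
received = decode_weights(bitstream)  # equal to `sent` up to quantization error
image = reconstruct(received, 4, 4)
```

With this format each recovered weight differs from the original by at most half a quantization step, that is, 2**-(FRAC_BITS + 1), provided the original value lies inside the representable 8-bit range.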
[0311] Embodiment 29: An apparatus comprising a memory and one or more processors coupled to the memory, wherein the one or more processors are configured to receive input data for compression by a neural network compression system, select, based on the input data, a model architecture for use by the neural network compression system for compressing the input data, determine, using the input data, a plurality of weight values corresponding to a plurality of layers associated with the model architecture, generate a first bitstream having a compressed version of a weight prior, generate a second bitstream having a compressed version of the plurality of weight values under the weight prior, and output the first bitstream and the second bitstream for transmission to a receiver. 【0447】
[0312] Embodiment 30: The apparatus according to Embodiment 29, wherein, to select the model architecture for use by the neural network compression system, the one or more processors are configured to adjust a plurality of weight values associated with one or more model architectures based on the input data, each of the one or more model architectures being associated with one or more model characteristics, to determine at least one distortion between the input data and a reconstructed data output corresponding to each of the one or more model architectures, and to select the model architecture from the one or more model architectures based on the at least one distortion. 【0448】
[0313] Embodiment 31: The apparatus according to Embodiment 30, wherein one or more model characteristics include at least one of width, depth, resolution, size of the convolutional kernel, and input dimension. 【0449】
[0314] Embodiment 32: The apparatus according to any one of Embodiments 29 to 31, wherein the one or more processors are further configured to quantize the plurality of weight values to produce a plurality of quantized weight values, wherein the second bitstream comprises a compressed version of the plurality of quantized weight values under the weight prior. 【0450】
[0315] Embodiment 33: The apparatus according to Embodiment 32, wherein multiple weight values are quantized using learned fixed-point quantization. 【0451】
[0316] Embodiment 34: The apparatus according to Embodiment 32, wherein fixed-point quantization is implemented using a machine learning algorithm. 【0452】
[0317] Embodiment 35: The apparatus according to Embodiment 32, wherein the second bitstream comprises a plurality of encoded quantization parameters used to quantize a plurality of weight values. 【0453】
[0318] Embodiment 36: The apparatus according to any one of embodiments 29 to 35, wherein one or more processors are further configured to generate a third bitstream having a compressed version of the model architecture and to output the third bitstream for transmission to a receiver. 【0454】
[0319] Embodiment 37: The apparatus according to any one of Embodiments 29 to 36, wherein at least one layer of the model architecture comprises a positional encoding of a plurality of coordinates associated with the input data. 【0455】
[0320] Embodiment 38: The apparatus according to any one of Embodiments 29 to 37, wherein the one or more processors are configured to encode the weight prior using an Open Neural Network Exchange (ONNX) format in order to generate the first bitstream. 【0456】
[0321] Embodiment 39: The apparatus according to any one of Embodiments 29 to 38, wherein the one or more processors are configured to entropy encode the plurality of weight values using the weight prior in order to generate the second bitstream. 【0457】
[0322] Embodiment 40: The apparatus according to any one of Embodiments 29 to 39, wherein the weight prior is selected to minimize a rate loss associated with sending the second bitstream to the receiver. 【0458】
[0323] Embodiment 41: The apparatus according to any one of Embodiments 29 to 40, wherein the input data includes a plurality of coordinates corresponding to image data used to train a neural network compression system. 【0459】
[0324] Embodiment 42: A method for performing any of the operations described in Embodiments 29 to 41. 【0460】
[0325] Embodiment 43: A computer-readable storage medium that stores instructions that, when executed, cause one or more processors to perform any of the operations described in Embodiments 29 to 41. 【0461】
[0326] Embodiment 44: An apparatus comprising means for performing any of the operations described in Embodiments 29 to 41. 【0462】
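To make the relationship among quantization, the prior over the weights, and the rate recited in Embodiments 32 to 40 more concrete, the following Python sketch quantizes a handful of weights to a fixed-point grid and estimates how many bits an ideal entropy coder would spend on them under a candidate prior. The assumptions are labeled here and in the comments: the prior is taken to be a zero-mean Gaussian discretized onto the quantization grid, the prior is chosen by a simple search over a hypothetical list of scales, and the names `quantize`, `rate_bits`, and `select_prior` are illustrative. The disclosure does not state that the prior has this form or is selected this way.

```python
import math

def quantize(weights, frac_bits):
    """Fixed-point quantization: integers on a grid of step 2**-frac_bits."""
    scale = 1 << frac_bits
    return [round(w * scale) for w in weights]

def gaussian_cdf(x, sigma):
    """Cumulative distribution of a zero-mean Gaussian with scale sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def bin_probability(q, step, sigma):
    """Prior mass of the quantization bin centred on q * step (assumed Gaussian)."""
    lo = (q - 0.5) * step
    hi = (q + 0.5) * step
    # The floor avoids log2(0) when a bin lies far out in the tail.
    return max(gaussian_cdf(hi, sigma) - gaussian_cdf(lo, sigma), 1e-12)

def rate_bits(q_weights, step, sigma):
    """Ideal entropy-coded length, in bits, of the quantized weights under the prior."""
    return sum(-math.log2(bin_probability(q, step, sigma)) for q in q_weights)

def select_prior(q_weights, step, candidate_sigmas):
    """Pick the candidate prior scale that yields the shortest code length."""
    return min(candidate_sigmas, key=lambda s: rate_bits(q_weights, step, s))

weights = [0.02, -0.11, 0.40, -0.35, 0.07, 0.00, 0.18, -0.22]  # illustrative values
FRAC_BITS = 5                                # hypothetical fixed-point precision
step = 1.0 / (1 << FRAC_BITS)
q_weights = quantize(weights, FRAC_BITS)
dequantized = [q * step for q in q_weights]  # what the receiver would recover
CANDIDATES = [0.05, 0.1, 0.2, 0.4, 0.8]      # hypothetical prior scales
sigma = select_prior(q_weights, step, CANDIDATES)
bits = rate_bits(q_weights, step, sigma)
```

In this sketch the prior is described by a single number, `sigma`, so sending the prior to the receiver would cost very little. A richer prior could lower the rate of the weights at the cost of more bits for the prior itself, which is one reason to treat the prior and the weights as two separately compressed bitstreams.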
[0327] Embodiment 45: An apparatus comprising a memory and one or more processors coupled to the memory, wherein the one or more processors are configured to receive a compressed version of a weight prior and a compressed version of a plurality of weight values under the weight prior, to decompress the compressed version of the weight prior and the compressed version of the plurality of weight values under the weight prior, to determine a plurality of neural network weights based on the weight prior and the plurality of weight values under the weight prior, and to process the plurality of neural network weights using a neural network architecture to produce reconstructed image content. 【0463】
[0328] Embodiment 46: The apparatus according to Embodiment 45, wherein one or more processors are further configured to receive a compressed version of a neural network architecture and to decompress a compressed version of a neural network architecture. 【0464】
[0329] Embodiment 47: The apparatus according to any one of Embodiments 45 to 46, wherein the plurality of weight values under the weight prior correspond to a plurality of quantized weights under the weight prior. 【0465】
[0330] Embodiment 48: The apparatus according to Embodiment 47, wherein the one or more processors are further configured to receive a plurality of encoded quantization parameters used to quantize the plurality of quantized weights under the weight prior. 【0466】
[0331] Embodiment 49: The apparatus according to any one of Embodiments 45 to 48, wherein the compressed version of the plurality of weight values under the weight prior is received in an entropy-encoded bitstream. 【0467】
[0332] Embodiment 50: The apparatus according to any one of Embodiments 45 to 49, wherein the one or more processors are further configured to redistribute the plurality of weight values under the weight prior based on a binary mask. 【0468】
[0333] Embodiment 51: A method for performing any of the operations described in Embodiments 45 to 50. 【0469】
[0334] Embodiment 52: A computer-readable storage medium that stores instructions that, when executed, cause one or more processors to perform any of the operations described in Embodiments 45 to 50. 【0470】
[0335] Embodiment 53: An apparatus comprising means for performing any of the operations described in Embodiments 45 to 50.
The inventions recited in the claims of this application as originally filed are listed below.
[C1] A method for processing media data, the method comprising: receiving a plurality of images for compression by a neural network compression system; determining, based on a first image from the plurality of images, a first plurality of weight values associated with a first model of the neural network compression system; generating a first bitstream comprising a compressed version of the first plurality of weight values; and outputting the first bitstream for transmission to a receiver.
[C2] The method according to C1, wherein at least one layer of the first model includes a positional encoding of a plurality of coordinates associated with the first image.
[C3] The method according to C2, wherein the first model is configured to determine one or more pixel values corresponding to the plurality of coordinates associated with the first image.
[C4] The method according to C1, further comprising: determining, based on a second image from the plurality of images, a second plurality of weight values for use by a second model associated with the neural network compression system; generating a second bitstream comprising a compressed version of the second plurality of weight values; and outputting the second bitstream for transmission to the receiver.
[C5] The method according to C4, wherein the second model is configured to determine an optical flow between the first image and the second image.
[C6] The method according to C5, further comprising determining, based on the optical flow, at least one updated weight value from the first plurality of weight values.
[C7] The method according to C1, further comprising quantizing, under a weight prior, the first plurality of weight values to produce a plurality of quantized weight values, wherein the first bitstream comprises a compressed version of the plurality of quantized weight values.
[C8] The method according to C7, wherein the weight prior is selected to minimize a rate loss associated with sending the first bitstream to the receiver.
[C9] The method according to C7, wherein generating the first bitstream comprises entropy encoding the first plurality of weight values using the weight prior.
[C10] The method according to C7, wherein the first plurality of weight values is quantized using fixed-point quantization.
[C11] The method according to C10, wherein the fixed-point quantization is implemented using a machine learning algorithm.
[C12] The method according to C1, further comprising selecting, based on the first image, a model architecture corresponding to the first model.
[C13] The method according to C12, further comprising: generating a second bitstream comprising a compressed version of the model architecture; and outputting the second bitstream for transmission to the receiver.
[C14] The method according to C12, wherein selecting the model architecture comprises: adjusting, based on the first image, a plurality of weight values associated with one or more model architectures, wherein each of the one or more model architectures is associated with one or more model characteristics; determining at least one distortion between the first image and a reconstructed data output corresponding to each of the one or more model architectures; and selecting the model architecture from the one or more model architectures based on the at least one distortion.
[C15] The method according to C14, wherein the one or more model characteristics include at least one of a width, a depth, a resolution, a size of a convolutional kernel, and an input dimension.
[C16] An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: receive a plurality of images for compression by a neural network compression system; determine, based on a first image from the plurality of images, a first plurality of weight values associated with a first model of the neural network compression system; generate a first bitstream comprising a compressed version of the first plurality of weight values; and output the first bitstream for transmission to a receiver.
[C17] The apparatus according to C16, wherein at least one layer of the first model includes a positional encoding of a plurality of coordinates associated with the first image.
[C18] The apparatus according to C17, wherein the first model is configured to determine one or more pixel values corresponding to the plurality of coordinates associated with the first image.
[C19] The apparatus according to C16, wherein the at least one processor is further configured to: determine, based on a second image from the plurality of images, a second plurality of weight values for use by a second model associated with the neural network compression system; generate a second bitstream comprising a compressed version of the second plurality of weight values; and output the second bitstream for transmission to the receiver.
[C20] The apparatus according to C19, wherein the second model is configured to determine an optical flow between the first image and the second image.
[C21] The apparatus according to C20, wherein the at least one processor is further configured to determine, based on the optical flow, at least one updated weight value from the first plurality of weight values.
[C22] The apparatus according to C16, wherein the at least one processor is further configured to quantize, under a weight prior, the first plurality of weight values to produce a plurality of quantized weight values, wherein the first bitstream comprises a compressed version of the plurality of quantized weight values.
[C23] The apparatus according to C22, wherein the weight prior is selected to minimize a rate loss associated with sending the first bitstream to the receiver.
[C24] The apparatus according to C22, wherein, to generate the first bitstream, the at least one processor is further configured to entropy encode the first plurality of weight values using the weight prior.
[C25] The apparatus according to C22, wherein the first plurality of weight values is quantized using fixed-point quantization.
[C26] The apparatus according to C25, wherein the fixed-point quantization is implemented using a machine learning algorithm.
[C27] The apparatus according to C16, wherein the at least one processor is further configured to select, based on the first image, a model architecture corresponding to the first model.
[C28] The apparatus according to C27, wherein the at least one processor is further configured to: generate a second bitstream comprising a compressed version of the model architecture; and output the second bitstream for transmission to the receiver.
[C29] The apparatus according to C27, wherein, to select the model architecture, the at least one processor is further configured to: adjust, based on the first image, a plurality of weight values associated with one or more model architectures, wherein each of the one or more model architectures is associated with one or more model characteristics; determine at least one distortion between the first image and a reconstructed data output corresponding to each of the one or more model architectures; and select the model architecture from the one or more model architectures based on the at least one distortion.
[C30] The apparatus according to C29, wherein the one or more model characteristics include at least one of a width, a depth, a resolution, a size of a convolutional kernel, and an input dimension.
[C31] A method for processing media data, the method comprising: receiving a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images; decompressing the first plurality of neural network weight values; and processing, using a first neural network model, the first plurality of neural network weight values to produce the first image.
[C32] The method according to C31, further comprising: receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images; decompressing the second plurality of neural network weight values; and processing, using a second neural network model, the second plurality of neural network weight values to determine an optical flow between the first image and the second image.
[C33] The method according to C32, further comprising determining, based on the optical flow, at least one updated weight value from the first plurality of neural network weight values associated with the first neural network model.
[C34] The method according to C33, further comprising processing, using the first neural network model, the at least one updated weight value to produce a reconstructed version of the second image.
[C35] The method according to C31, wherein the first plurality of neural network weight values is quantized under a weight prior.
[C36] The method according to C31, wherein the compressed version of the first plurality of neural network weight values is received in an entropy-encoded bitstream.
[C37] The method according to C31, further comprising receiving a compressed version of a neural network architecture corresponding to the first neural network model.
[C38] An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: receive a compressed version of a first plurality of neural network weight values associated with a first image from a plurality of images; decompress the first plurality of neural network weight values; and process, using a first neural network model, the first plurality of neural network weight values to produce the first image.
[C39] The apparatus according to C38, wherein the at least one processor is further configured to: receive a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images; decompress the second plurality of neural network weight values; and process, using a second neural network model, the second plurality of neural network weight values to determine an optical flow between the first image and the second image.
[C40] The apparatus according to C39, wherein the at least one processor is further configured to determine, based on the optical flow, at least one updated weight value from the first plurality of neural network weight values associated with the first neural network model.
[C41] The apparatus according to C40, wherein the at least one processor is further configured to process, using the first neural network model, the at least one updated weight value to produce a reconstructed version of the second image.
[C42] The apparatus according to C38, wherein the first plurality of neural network weight values is quantized under a weight prior.
[C43] The apparatus according to C38, wherein the compressed version of the first plurality of neural network weight values is received in an entropy-encoded bitstream.
[C44] The apparatus according to C38, wherein the at least one processor is further configured to receive a compressed version of a neural network architecture corresponding to the first neural network model.
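As a non-limiting illustration of the encoder-side operations recited above (fitting model weights to one image over a coordinate grid, fixed-point quantization, and a rate estimate under a weight prior), the following Python sketch may be considered. It is a minimal sketch under stated assumptions and is not the implementation of this disclosure: the one-hidden-layer network, plain gradient descent, the quantization step size, and the zero-mean Gaussian form of the weight prior are assumptions chosen for brevity, and all function names are hypothetical.

```python
import math
import numpy as np

def positional_encoding(coords, num_freqs=4):
    """Map (N, 2) coordinates in [-1, 1] to sin/cos features (assumed encoding)."""
    feats = [coords]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * np.pi * coords))
        feats.append(np.cos((2.0 ** k) * np.pi * coords))
    return np.concatenate(feats, axis=1)

def coordinate_grid(height, width):
    """Return one (y, x) coordinate per pixel, shape (height * width, 2)."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def forward(weights, feats):
    """Evaluate a one-hidden-layer ReLU network (assumed architecture)."""
    hidden = np.maximum(feats @ weights["W1"] + weights["b1"], 0.0)
    return hidden @ weights["W2"] + weights["b2"], hidden

def fit_model(image, hidden_units=32, steps=300, lr=0.01, seed=0):
    """Adjust initialized weights by backpropagation to reduce the MSE loss."""
    height, width = image.shape[:2]
    feats = positional_encoding(coordinate_grid(height, width))
    target = image.reshape(height * width, -1)
    rng = np.random.default_rng(seed)
    n_in, n_out = feats.shape[1], target.shape[1]
    weights = {
        "W1": rng.normal(0.0, np.sqrt(2.0 / n_in), (n_in, hidden_units)),
        "b1": np.zeros(hidden_units),
        "W2": rng.normal(0.0, np.sqrt(1.0 / hidden_units), (hidden_units, n_out)),
        "b2": np.zeros(n_out),
    }
    losses = []
    for _ in range(steps):
        pred, hidden = forward(weights, feats)
        err = pred - target
        losses.append(float(np.mean(err ** 2)))
        g_pred = 2.0 * err / err.size                      # d(loss)/d(pred)
        g_hidden = (g_pred @ weights["W2"].T) * (hidden > 0.0)
        grads = {"W1": feats.T @ g_hidden, "b1": g_hidden.sum(axis=0),
                 "W2": hidden.T @ g_pred, "b2": g_pred.sum(axis=0)}
        for name in weights:
            weights[name] -= lr * grads[name]
    return weights, losses

def quantize(weights, step):
    """Fixed-point quantization: round each weight to an integer multiple of step."""
    return {k: np.round(v / step).astype(np.int64) for k, v in weights.items()}

def rate_in_bits(symbols, step, sigma):
    """Estimated code length of the symbols under a zero-mean Gaussian prior (assumed)."""
    cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))
    total = 0.0
    for q in symbols.values():
        p = cdf((q + 0.5) * step / sigma) - cdf((q - 0.5) * step / sigma)
        total += float(-np.log2(np.maximum(p, 1e-12)).sum())
    return total

# Usage on a small synthetic 8x8 grayscale ramp (illustrative data only).
image = np.linspace(0.0, 1.0, 64).reshape(8, 8)
step = 1.0 / 64
weights, losses = fit_model(image)
symbols = quantize(weights, step)
bits = rate_in_bits(symbols, step, sigma=0.5)
```

In this sketch the integer symbols stand in for the content of the first bitstream, and `rate_in_bits` approximates the length an entropy coder driven by the same prior would produce; an actual entropy coder is omitted.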
Claims
[Claim 1] A method for processing media data, the method comprising: receiving a plurality of images for compression by a neural network compression system; processing a coordinate grid, using initialized weight values for weights of a first model of the neural network compression system, to generate a reconstructed output value for a first image from the plurality of images; comparing values of the first image with the reconstructed output value for the first image to determine a loss for the reconstructed output value; adjusting the weights of the first model, using backpropagation to reduce the determined loss, to generate a first plurality of adjusted weight values for the weights of the first model for reconstructing the first image; generating a first bitstream comprising a compressed version of the first plurality of adjusted weight values; outputting the first bitstream for transmission to a receiver for reconstruction of the first image; and determining, based on a second image from the plurality of images, a second plurality of adjusted weight values for weights of a second model associated with the neural network compression system, wherein the second model is configured to determine an optical flow between the first image and the second image, and wherein at least one updated weight value is determined from the first plurality of adjusted weight values based on the optical flow.
[Claim 2] The method according to claim 1, wherein at least one layer of the first model includes a positional encoding of a plurality of coordinates associated with the first image, and wherein the reconstructed output value includes one or more pixel values corresponding to the plurality of coordinates associated with the first image.
[Claim 3] The method according to claim 1, further comprising: generating a second bitstream comprising a compressed version of the second plurality of adjusted weight values; and outputting the second bitstream for transmission to the receiver.
[Claim 4] The method according to claim 3, further comprising determining, based on the optical flow, the at least one updated weight value from the first plurality of adjusted weight values.
[Claim 5] The method according to claim 1, further comprising quantizing, under a weight prior, the first plurality of adjusted weight values to produce a plurality of quantized weight values, wherein the first bitstream comprises a compressed version of the plurality of quantized weight values.
[Claim 6] The method according to claim 5, wherein the weight prior is selected to minimize a rate loss associated with sending the first bitstream to the receiver.
[Claim 7] The method according to claim 5, wherein generating the first bitstream comprises entropy encoding the first plurality of adjusted weight values using the weight prior.
[Claim 8] The method according to claim 1, further comprising selecting, based on the first image, a model architecture corresponding to the first model.
[Claim 9] The method according to claim 8, further comprising: generating a second bitstream comprising a compressed version of the model architecture; and outputting the second bitstream for transmission to the receiver.
[Claim 10] The method according to claim 9, wherein selecting the model architecture comprises: determining, based on the first image, a respective plurality of adjusted weight values associated with each of a plurality of model architectures, wherein each of the plurality of model architectures is associated with one or more model characteristics; determining at least one distortion between the first image and a reconstructed data output corresponding to each of the plurality of model architectures; and selecting the model architecture from the plurality of model architectures based on the at least one distortion, wherein each of the one or more model characteristics includes at least one of a width, a depth, a resolution, a size of a convolutional kernel, and an input dimension.
[Claim 11] An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: receive a plurality of images for compression by a neural network compression system; process a coordinate grid, using initialized weight values for weights of a first model of the neural network compression system, to generate a reconstructed output value for a first image from the plurality of images; compare values of the first image with the reconstructed output value for the first image to determine a loss for the reconstructed output value; adjust the weights of the first model, using backpropagation to reduce the determined loss, to generate a first plurality of adjusted weight values for the weights of the first model for reconstructing the first image; generate a first bitstream comprising a compressed version of the first plurality of adjusted weight values; output the first bitstream for transmission to a receiver for reconstruction of the first image; and determine, based on a second image from the plurality of images, a second plurality of adjusted weight values for weights of a second model associated with the neural network compression system, wherein the second model is configured to determine an optical flow between the first image and the second image, and wherein at least one updated weight value is determined from the first plurality of adjusted weight values based on the optical flow.
[Claim 12] The apparatus according to claim 11, wherein the at least one processor is further configured to: generate a second bitstream comprising a compressed version of the second plurality of adjusted weight values; output the second bitstream for transmission to the receiver; and determine, based on the optical flow, the at least one updated weight value from the first plurality of adjusted weight values.
[Claim 13] The apparatus according to claim 11, wherein the at least one processor is further configured to select, based on the first image, a model architecture corresponding to the first model, and wherein, to select the model architecture, the at least one processor is further configured to: determine, based on the first image, a respective plurality of adjusted weight values associated with each of a plurality of model architectures, wherein each of the plurality of model architectures is associated with one or more model characteristics; determine at least one distortion between the first image and a reconstructed data output corresponding to each of the plurality of model architectures; and select the model architecture from the plurality of model architectures based on the at least one distortion, wherein each of the one or more model characteristics includes at least one of a width, a depth, a resolution, a size of a convolutional kernel, and an input dimension.
[Claim 14] A method for processing media data, the method comprising: receiving a compressed version of a model architecture corresponding to a first neural network model; receiving a compressed version of a first plurality of neural network weight values associated with the first neural network model for reconstructing a first image from a plurality of images; decompressing the compressed version of the first plurality of neural network weight values to generate a reconstructed version of the first plurality of neural network weight values; processing, using the first neural network model, the reconstructed version of the first plurality of neural network weight values to generate a reconstructed version of the first image; receiving a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images; decompressing the compressed version of the second plurality of neural network weight values to generate a reconstructed version of the second plurality of neural network weight values; processing, using a second neural network model, the reconstructed version of the second plurality of neural network weight values to determine an optical flow between the first image and the second image; and determining, based on the optical flow, at least one updated weight value from the reconstructed version of the first plurality of neural network weight values associated with the reconstructed version of the first neural network model.
[Claim 15] An apparatus comprising: at least one memory; and at least one processor coupled to the at least one memory, the at least one processor being configured to: receive a compressed version of a model architecture corresponding to a first neural network model; receive a compressed version of a first plurality of neural network weight values associated with the first neural network model for reconstructing a first image from a plurality of images; decompress the compressed version of the first plurality of neural network weight values to generate a reconstructed version of the first plurality of neural network weight values; process, using the first neural network model, the reconstructed version of the first plurality of neural network weight values to generate a reconstructed version of the first image; receive a compressed version of a second plurality of neural network weight values associated with a second image from the plurality of images; decompress the compressed version of the second plurality of neural network weight values to generate a reconstructed version of the second plurality of neural network weight values; process, using a second neural network model, the reconstructed version of the second plurality of neural network weight values to determine an optical flow between the first image and the second image; and determine, based on the optical flow, at least one updated weight value from the reconstructed version of the first plurality of neural network weight values associated with the reconstructed version of the first neural network model.
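For illustration only, the receiver-side operations of the kind recited in claims 14 and 15 (reconstructing weight values and processing them with a neural network model to reconstruct an image) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of this disclosure: entropy decoding is assumed to have already produced integer symbols, the one-hidden-layer ReLU network and the sin/cos positional encoding are assumed architecture choices, the optical-flow model is omitted, and all function names are hypothetical.

```python
import numpy as np

def positional_encoding(coords, num_freqs=4):
    """Map (N, 2) coordinates in [-1, 1] to sin/cos features (assumed encoding)."""
    feats = [coords]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * np.pi * coords))
        feats.append(np.cos((2.0 ** k) * np.pi * coords))
    return np.concatenate(feats, axis=1)

def coordinate_grid(height, width):
    """Return one (y, x) coordinate per pixel, shape (height * width, 2)."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    return np.stack([ys.ravel(), xs.ravel()], axis=1)

def dequantize(symbols, step):
    """Recover fixed-point weight values from integer symbols."""
    return {k: v.astype(np.float64) * step for k, v in symbols.items()}

def reconstruct_image(symbols, step, height, width):
    """Evaluate the model on the full coordinate grid to reconstruct the image."""
    w = dequantize(symbols, step)
    feats = positional_encoding(coordinate_grid(height, width))
    hidden = np.maximum(feats @ w["W1"] + w["b1"], 0.0)
    pixels = hidden @ w["W2"] + w["b2"]
    return pixels.reshape(height, width, -1)

# Usage with hand-set symbols: zero weights and an output bias of 32 * (1/64) = 0.5.
n_in, n_hidden = 2 + 4 * 4, 8          # 18 input features for num_freqs=4
symbols = {
    "W1": np.zeros((n_in, n_hidden), dtype=np.int64),
    "b1": np.zeros(n_hidden, dtype=np.int64),
    "W2": np.zeros((n_hidden, 1), dtype=np.int64),
    "b2": np.array([32], dtype=np.int64),
}
recon = reconstruct_image(symbols, step=1.0 / 64, height=4, width=5)
```

Because the receiver only needs the weight symbols, the quantization step, and the architecture, the image resolution at which the grid is evaluated is a decoder-side parameter in this sketch.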