Method and data processing system for lossy image or video encoding, transmission and decoding
Adaptive neural networks with mini GOPs and tailored weights for frame types and quality levels improve lossy image and video compression efficiency and quality, addressing the limitations of existing techniques.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- INTERDIGITAL VC HOLDINGS INC
- Filing Date
- 2025-12-19
- Publication Date
- 2026-06-25
AI Technical Summary
Existing lossy image and video compression techniques face challenges in efficiently reducing data transmission demands while maintaining visual quality, particularly due to limitations in removing redundant spatial and temporal correlations, and AI-based methods may result in poor compression outcomes.
The method uses mini Groups of Pictures (GOPs) and neural networks with adaptable weights and architectures for encoding and decoding, tailored to frame types, positions, and quality levels, to produce latent representations for efficient transmission and approximation of images or videos.
This approach enhances compression efficiency by optimizing data reduction and visual quality through adaptive neural networks, minimizing errors and improving the overall performance of lossy image and video encoding and decoding.
Description
[0001] L1T2 / W0
[0002] METHOD AND DATA PROCESSING SYSTEM FOR LOSSY IMAGE OR VIDEO ENCODING, TRANSMISSION AND DECODING
[0003] CROSS REFERENCE TO RELATED APPLICATIONS
[0004] This application claims priority to United Kingdom Patent Application No. GB 2418783.3, filed December 20, 2024, and United Kingdom Patent Application No. GB 2508271.0, filed May 28, 2025, each of which is incorporated herein by reference in its entirety.
[0005] BACKGROUND
[0006] This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
[0007] There is increasing demand from users of communications networks for images and video content. Demand is increasing not just for the number of images viewed, and for the playing time of video; demand is also increasing for higher resolution content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted.
[0008] To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by removing information whose absence results in changes to the decompressed image or video that are not particularly noticeable to the human visual system. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and / or video files.
[0009] In general terms, known lossy image compression techniques use the spatial correlations between pixels in images to remove redundant information during compression. For example, in an image of a blue sky, if a given pixel is blue, there is a high likelihood that the neighbouring pixels, and their neighbouring pixels, and so on, are also blue.
[0010] There is accordingly no need to retain all the raw pixel data. Instead, we can retain only a subset of the pixels, which takes up fewer bits, and infer the values of the other pixels using information derived from spatial correlations.
[0011] A similar approach is applied in known lossy video compression techniques. That is, spatial correlations between pixels allow the removal of redundant information during compression. However, in video compression, there is further information redundancy in the form of temporal correlations. For example, in a video of an aircraft flying across a blue-sky background, most of the pixels of the blue sky do not change at all between frames of the video. Most of the blue-sky pixel data for the frame at position t = 0 in the video is identical to that at position t = 10. Storing this identical, temporally correlated information is inefficient. Instead, only the blue-sky pixel data for a subset of the frames is stored and the rest is inferred from information derived from temporal correlations.
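As a toy illustration of temporal redundancy (not the method of this application, and lossless rather than lossy), the sketch below stores one frame in full and, for the next frame, only the pixels that changed. The flat-list frame layout and the function names are assumptions made for the example.

```python
def encode_delta(reference, frame):
    """Return only the (index, value) pairs where frame differs from reference."""
    return [(i, v) for i, (r, v) in enumerate(zip(reference, frame)) if r != v]


def decode_delta(reference, delta):
    """Rebuild a frame from the reference frame and the stored changes."""
    frame = list(reference)
    for i, v in delta:
        frame[i] = v
    return frame


# Toy 1-D "frames": 0 is blue sky, 9 is the aircraft moving one pixel right.
frame_t0 = [0, 9, 0, 0, 0, 0, 0, 0]
frame_t1 = [0, 0, 9, 0, 0, 0, 0, 0]

delta = encode_delta(frame_t0, frame_t1)      # two entries instead of eight pixels
restored = decode_delta(frame_t0, delta)
```

Only the two pixels touched by the aircraft are stored for the second frame; the unchanged sky pixels are recovered from the reference frame.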
[0012] In the realm of lossy video compression in particular, the removal of redundant temporally correlated information in a video sequence is known as inter-frame redundancy.
[0013] One technique using inter-frame redundancy that is widely used in standard video compression algorithms involves the categorization of video frames into three types: I-frames, P-frames, and B-frames. Each frame type carries distinct properties concerning its encoding and decoding process, playing a different role in achieving high compression ratios while maintaining acceptable visual quality.
[0014] I-frames, or intra-coded frames, serve as the foundation of the video sequence. These frames are self-contained, each one encoding a complete image without reference to any other frame. In terms of compression, I-frames are the least compressed among all frame types, thus carrying the most data. However, their independence provides several benefits, including being the starting point for decompression and enabling random access, crucial for functionalities like fast-forwarding or rewinding the video.
[0015] P-frames, or predictive frames, utilize temporal redundancy in video sequences to achieve greater compression. Instead of encoding an entire image like an I-frame, a P-frame represents the difference between itself and the closest preceding I- or P-frame. The process, known as motion compensation, identifies and encodes only the changes that have occurred, thereby significantly reducing the amount of data transmitted. Nonetheless, P-frames are dependent on previous frames for decoding. Consequently, any error during the encoding or transmission process may propagate to subsequent frames, impacting the overall video quality.
[0016] B-frames, or bidirectionally predictive frames, represent the highest level of compression.
[0017] Unlike P-frames, B-frames use both the preceding and following frames as references in their encoding process. By predicting motion both forwards and backwards in time, B-frames encode only the differences that cannot be accurately anticipated from the previous and next frames, leading to substantial data reduction. Although this bidirectional prediction makes B-frames more complex to generate and decode, B-frames do not propagate decoding errors since they are not used as references for other frames. Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
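The training trade-off described above, reducing the reconstruction difference while minimizing the data to transmit, is commonly written as a weighted sum of a distortion term and a rate term. The sketch below is one such objective under assumed choices: mean squared error as the distortion, bits per pixel as the rate, and a hypothetical weighting `lam`. It is not taken from this application.

```python
def rate_distortion_loss(original, reconstruction, bits, lam=0.01):
    """Mean squared error plus lam times the bit cost per pixel (assumed weighting)."""
    n = len(original)
    distortion = sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / n
    rate = bits / n  # bits per pixel
    return distortion + lam * rate


# Two-pixel toy image reconstructed with an error of 2 on one pixel, using 4 bits.
loss = rate_distortion_loss([0, 2], [0, 0], bits=4, lam=0.5)
```

A larger `lam` penalizes bit cost more heavily and so pushes training towards smaller, lower-quality encodings.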
[0018] An example of an AI based image compression process comprising a hyper-network is described in Balle, Johannes, et al. "Variational image compression with a scale hyperprior." arXiv preprint arXiv:1802.01436 (2018), which is hereby incorporated by reference.
[0019] An example of an AI based video compression approach is shown in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020), Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference.
[0020] A further example of an AI based video compression approach is shown in Mentzer, F., Agustsson, E., Balle, J., Minnen, D., Johnston, N., and Toderici, G. (2022, November). Neural video compression using GANs for detail synthesis and propagation. In Computer Vision-ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI (pp. 562-578), which is hereby incorporated by reference.
[0021] SUMMARY
[0022] According to an aspect there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of:
[0023] receiving a sequence of images at a first computer system;
[0024] assigning one or more of the images to a mini GOP; and
[0025] encoding, transmitting and decoding the images of the mini GOP by:
[0026] with a first neural network, encoding the images to produce latent representations;
[0027] transmitting the latent representations to a second computer system; and
[0028] with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP,
[0029] wherein the first and second neural network use a first set of weights and / or network architecture for encoding and decoding a first image of the mini GOP, and
[0030] wherein the first and second neural network use a second set of weights and / or network architecture for encoding and decoding a second image of the mini GOP.
[0031] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame type of the image.
[0032] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame position in the mini GOP of the image.
[0033] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a quality level in the mini GOP of the image.
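The selection options above can be pictured as a lookup from frame properties to a named weight set. The following is a minimal sketch, assuming a hypothetical `FrameInfo` record and a hypothetical `WEIGHT_SETS` registry keyed on frame type; the source does not specify how the sets are stored or chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameInfo:
    frame_type: str     # "I", "P" or "B"
    position: int       # index of the frame inside the mini GOP
    quality_level: int  # quality level assigned within the mini GOP


# Hypothetical registry: each value names a set of weights and / or an architecture.
WEIGHT_SETS = {
    "I": "weights_intra_large",
    "P": "weights_inter_medium",
    "B": "weights_inter_small",
}


def select_weight_set(frame: FrameInfo) -> str:
    """Pick a weight set by frame type; position or quality level could key the lookup instead."""
    return WEIGHT_SETS[frame.frame_type]


mini_gop = [FrameInfo("I", 0, 0), FrameInfo("B", 1, 2), FrameInfo("P", 2, 1)]
chosen = [select_weight_set(f) for f in mini_gop]
```

The encoder and decoder would both need to apply the same lookup so that each image is decoded with the weights that match those used to encode it.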
[0034] Optionally, the first and second neural network using the first set of weights and / or network architecture comprises more layers than the first and second neural network using the second set of weights and / or network architecture.
[0035] Optionally, the second image is assigned to a leaf node of the mini GOP.
[0036] Optionally, the second neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
[0037] Optionally, the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
[0038] Optionally, the first neural network comprises an encoder neural network, a hyper encoder neural network, and a hyper hyper encoder neural network.
[0039] Optionally, the first set of weights and / or network architecture associated with the hyper encoder neural network and / or the hyper hyper encoder neural network is different to the second set of weights and / or network architecture associated with the hyper encoder neural network and the hyper hyper encoder neural network.
[0040] Optionally, the first neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
[0041] Optionally, comprising producing a mask for masking at least a portion of an output of at least one of the decoder neural network, the hyper decoder neural network, and / or the hyper hyper decoder neural network, and the method comprising modifying the mask based on a frame type, a frame position in the mini GOP and / or a frame quality in the mini GOP of the image being encoded.
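One way to picture the mask is as a binary per-channel vector whose extent depends on the frame type. The sketch below rests on assumptions throughout: the source does not state what the mask contains, and the `KEEP_BY_FRAME_TYPE` policy and function names are hypothetical.

```python
def make_channel_mask(num_channels, keep):
    """Binary mask that keeps the first `keep` channels and zeroes the rest."""
    return [1 if c < keep else 0 for c in range(num_channels)]


def apply_mask(output, mask):
    """Element-wise product of a per-channel output with the mask."""
    return [o * m for o, m in zip(output, mask)]


# Assumed policy: keep fewer channels for frame types treated as cheaper to code.
KEEP_BY_FRAME_TYPE = {"I": 4, "P": 3, "B": 2}

decoder_output = [0.5, -1.0, 2.0, 0.25]
mask_b = make_channel_mask(len(decoder_output), KEEP_BY_FRAME_TYPE["B"])
masked = apply_mask(decoder_output, mask_b)
```

The same construction could be keyed on frame position or frame quality in the mini GOP instead of frame type.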
[0042] According to an aspect there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of:
[0043] receiving a sequence of images at a first computer system;
[0044] assigning one or more of the images to a mini GOP; and
[0045] encoding and transmitting the images of the mini GOP by:
[0046] with a first neural network, encoding the images to produce latent representations; transmitting the latent representations to a second computer system;
[0047] wherein the first neural network uses a first set of weights and / or network architecture for encoding a first image of the mini GOP, and
[0048] wherein the first neural network uses a second set of weights and / or network architecture for encoding a second image of the mini GOP.
[0049] According to an aspect there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of:
[0050] receiving latent representations at a second computer system, the latent representations produced by receiving a sequence of images at a first computer system, assigning one or more of the images to a mini GOP, and encoding the images of the mini GOP by, with a first neural network, encoding the images to produce latent representations;
[0051] the method further comprising decoding the images of the mini GOP by:
[0052] with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP, wherein the second neural network uses a first set of weights and / or network architecture for decoding a first image of the mini GOP, and
[0053] wherein the second neural network uses a second set of weights and / or network architecture for decoding a second image of the mini GOP.
[0054] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame type of the image.
[0055] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame position in the mini GOP of the image.
[0056] Optionally, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a quality level in the mini GOP of the image.
[0057] Optionally, the first and second neural network using the first set of weights and / or network architecture comprises more layers than the first and second neural network using the second set of weights and / or network architecture.
[0058] Optionally, the second image is assigned to a leaf node of the mini GOP.
[0059] Optionally, the second neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
[0060] Optionally, the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
[0061] Optionally, the first neural network comprises an encoder neural network, a hyper encoder neural network, and a hyper hyper encoder neural network.
[0062] Optionally, the first set of weights and / or network architecture associated with the hyper encoder neural network and / or the hyper hyper encoder neural network is different to the second set of weights and / or network architecture associated with the hyper encoder neural network and the hyper hyper encoder neural network.
[0063] Optionally, the first neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
[0064] Optionally, comprising producing a mask for masking at least a portion of an output of at least one of the decoder neural network, the hyper decoder neural network, and / or the hyper hyper decoder neural network, and the method comprising modifying the mask based on a frame type, a frame position in the mini GOP and / or a frame quality in the mini GOP of the image being encoded.
[0065] According to an aspect, there is provided a data processing apparatus configured to perform any of the above methods.
[0066] According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
[0067] According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
[0068] According to an aspect there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of:
[0069] receiving a sequence of images at a first computer system;
[0070] encoding, transmitting to a second computer system and decoding at least one image of the sequence of images by:
[0071] predicting a target quality level and / or a reference image selection for encoding and decoding the at least one image;
[0072] with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce a latent representation;
[0073] transmitting the latent representation to a second computer system; and
[0074] with a second neural network, decoding the latent representation to produce an output image, wherein the output image is an approximation of the at least one image at the predicted target quality level.
[0075] Optionally, predicting the reference image selection is based on a property of the at least one image.
[0076] Optionally, predicting the reference image selection is based on a property of one or more previously decoded images of the sequence of images.
[0077] Optionally, predicting the reference image selection is based on how close respective representations of the one or more previously decoded images are to the at least one image.
[0078] Optionally, comprising estimating how close respective representations of the one or more previously decoded images are to the at least one image by estimating respective mean square errors between the representations and the at least one image.
[0079] Optionally, the respective representations comprise warped representations of the one or more previously decoded images, and
[0080] wherein the method comprises producing the respective warped representations of the one or more previously decoded images by estimating optical flow information indicative of differences between the at least one image and the respective previously decoded images, and using the estimated optical flow information to warp the respective previously decoded images.
[0081] Optionally, the respective representations further comprise non-warped representations of the one or more previously decoded images, and wherein predicting the reference image selection is based on a ratio between (i) how close a warped representation is to the at least one image and (ii) how close a corresponding non-warped representation is to the at least one image.
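The closeness measures above can be sketched with plain mean squared errors. In this minimal example, images are flat lists of numbers, each candidate reference is an already-computed (warped, non-warped) pair, and the selection rule (smallest warped error) is an assumption; the warping itself is not shown.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_reference(target, candidates, eps=1e-12):
    """Return (index, ratio) for the candidate whose warped version is closest to target.

    Each candidate is a (warped, non_warped) pair of flat images. The ratio is
    the warped MSE divided by the non-warped MSE of the chosen candidate.
    """
    errors = [mse(warped, target) for warped, _ in candidates]
    best = min(range(len(candidates)), key=errors.__getitem__)
    non_warped_error = mse(candidates[best][1], target)
    return best, errors[best] / max(non_warped_error, eps)


target = [1.0, 2.0, 3.0, 4.0]
candidates = [
    ([1.0, 2.0, 3.0, 2.0], [0.0, 0.0, 3.0, 2.0]),
    ([1.0, 2.0, 3.0, 5.0], [3.0, 2.0, 3.0, 6.0]),
]
best, ratio = select_reference(target, candidates)
```

A ratio well below 1 suggests that warping brings the reference much closer to the current image, which is one signal a predictor could use when choosing references.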
[0082] Optionally, predicting the target quality level is based on the at least one image.
[0083] Optionally, predicting the target quality level comprises predicting a target quality level from a plurality of discrete target quality levels.
[0084] Optionally, predicting the target quality level comprises selecting a target quality level based on a predetermined sequence of discrete target quality levels when a first condition is met, or selecting a target quality level deviating from the predetermined sequence when a second condition is met.
[0085] Optionally, predicting the target quality level comprises: with the first neural network, encoding at a starting target quality level the at least one image to produce a latent representation; with the second neural network decoding the latent representation to produce an approximation of the at least one image having the starting target quality level; estimating a difference between the at least one image and the approximation of the at least one image having the starting target quality level; updating the starting target quality level based on the difference; repeating the above steps iteratively until a predetermined condition is met to reach a final target quality level; and using the final target quality level as the predicted target quality level.
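The iterative prediction above can be sketched as a loop around an encode and decode round trip. Here `toy_codec` is a stand-in that simply rounds pixel values more coarsely at lower quality levels; it replaces the first and second neural networks purely for illustration. The stopping rule (error at or below `max_error`, or the highest level reached) is an assumed example of the predetermined condition.

```python
def toy_codec(image, quality_level):
    """Stand-in for the encode/decode round trip: coarser rounding at lower quality."""
    step = 2 ** (4 - quality_level)  # level 0 -> step 16, level 4 -> step 1
    return [round(p / step) * step for p in image]


def predict_quality_level(image, max_error, start=0, max_level=4):
    """Raise the quality level until the round-trip MSE is at most max_error."""
    level = start
    while True:
        approx = toy_codec(image, level)
        error = sum((a - b) ** 2 for a, b in zip(image, approx)) / len(image)
        if error <= max_error or level == max_level:
            return level
        level += 1  # update the level based on the measured difference


image = [3, 10, 21, 30]
level = predict_quality_level(image, max_error=3.0)
```

Running the loop on a downsampled copy of the image would reduce the cost of each iteration at the expense of a less exact error estimate.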
[0086] Optionally, comprising downsampling the at least one image, and wherein predicting the target quality level is based on the downsampled at least one image.
[0087] Optionally, the target quality level is associated with one or more variable rate control parameters, wherein the method comprises modifying an output of the first neural network using the one or more variable rate control parameters to produce the latent representation, and modifying the latent representation using the one or more variable rate control parameters before performing the decoding, and wherein the method comprises applying an offset to the one or more learned variable rate control parameters.
[0088] Optionally, the offset is based on the predicted target quality level and / or reference image selection.
[0089] Optionally, the offset comprises a learned offset.
[0090] Optionally, the one or more variable rate control parameters comprise a matrix of first values, wherein the offset comprises a matrix of second values, and wherein applying the offset to the variable rate control parameters comprises modifying the first values using the second values.
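A minimal sketch of the matrix case follows. It assumes that the variable rate control parameters act as element-wise gains (multiplied in on the encoder side, divided out before decoding) and that the offset is added element-wise. The source says only that the first values are modified using the second values, so the additive form and all names here are assumptions.

```python
def apply_offset(gains, offsets):
    """Add an offset matrix element-wise to a matrix of rate control gains."""
    return [[g + o for g, o in zip(g_row, o_row)] for g_row, o_row in zip(gains, offsets)]


def scale(tensor, gains):
    """Encoder side: multiply the network output element-wise by the gains."""
    return [[t * g for t, g in zip(t_row, g_row)] for t_row, g_row in zip(tensor, gains)]


def unscale(tensor, gains):
    """Decoder side: divide the latent by the same gains before decoding."""
    return [[t / g for t, g in zip(t_row, g_row)] for t_row, g_row in zip(tensor, gains)]


gains = [[1.0, 2.0], [4.0, 0.5]]
offsets = [[0.5, 0.0], [-2.0, 0.5]]
adjusted = apply_offset(gains, offsets)

encoder_output = [[3.0, 1.0], [2.0, 8.0]]
latent = scale(encoder_output, adjusted)
restored = unscale(latent, adjusted)
```

In a real codec the latent would be quantized between `scale` and `unscale`, so larger gains would preserve more detail at a higher bit cost; quantization is omitted here.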
[0091] Optionally, comprising repeating the predicting, encoding, and decoding for each image of the sequence of images to predict a sequence of target quality levels and reference image selections of a scalable video coding scheme optimised for the sequence of images.
[0092] Optionally, comprising transmitting from the first computer system to the second computer system information representing the sequence of target quality levels and reference image selections.
[0093] According to an aspect there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of:
[0094] receiving a sequence of images at a first computer system;
[0095] encoding, transmitting to a second computer system at least one image of the sequence of images by:
[0096] predicting a target quality level and / or a reference image selection for encoding and decoding the at least one image;
[0097] with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce a latent representation;
[0098] transmitting the latent representation to a second computer system.
[0099] According to an aspect there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of:
[0100] receiving a latent representation at a second computer system, the latent representation produced by predicting a target quality level and / or reference image selection, with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce the latent representation; and
[0101] with a second neural network, decoding the latent representation to produce an output image, wherein the output image is an approximation of the at least one image at the predicted target quality level.
[0102] Optionally, predicting the reference image selection is based on a property of the at least one image.
[0103] Optionally, predicting the reference image selection is based on a property of one or more previously decoded images of the sequence of images.
[0104] Optionally, predicting the reference image selection is based on how close respective representations of the one or more previously decoded images are to the at least one image.
[0105] Optionally, comprising estimating how close respective representations of the one or more previously decoded images are to the at least one image by estimating respective mean square errors between the representations and the at least one image.
[0106] Optionally, the respective representations comprise warped representations of the one or more previously decoded images, and wherein the method comprises producing the respective warped representations of the one or more previously decoded images by estimating optical flow information indicative of differences between the at least one image and the respective previously decoded images, and using the estimated optical flow information to warp the respective previously decoded images.
[0107] Optionally, the respective representations further comprise non-warped representations of the one or more previously decoded images, and wherein predicting the reference image selection is based on a ratio between (i) how close a warped representation is to the at least one image and (ii) how close a corresponding non-warped representation is to the at least one image.
[0108] Optionally, predicting the target quality level is based on the at least one image.
[0109] Optionally, predicting the target quality level comprises predicting a target quality level from a plurality of discrete target quality levels.
[0110] Optionally, predicting the target quality level comprises selecting a target quality level based on a predetermined sequence of discrete target quality levels when a first condition is met, or selecting a target quality level deviating from the predetermined sequence when a second condition is met.
[0111] Optionally, predicting the target quality level comprises:
[0112] with the first neural network, encoding at a starting target quality level the at least one image to produce a latent representation;
[0113] with the second neural network decoding the latent representation to produce an approximation of the at least one image having the starting target quality level;
[0114] estimating a difference between the at least one image and the approximation of the at least one image having the starting target quality level;
[0115] updating the starting target quality level based on the difference;
[0116] repeating the above steps iteratively until a predetermined condition is met to reach a final target quality level; and
[0117] using the final target quality level as the predicted target quality level.
[0118] Optionally, comprising downsampling the at least one image, and wherein predicting the target quality level is based on the downsampled at least one image.
[0119] Optionally, the target quality level is associated with one or more variable rate control parameters, wherein the method comprises modifying an output of the first neural network using the one or more variable rate control parameters to produce the latent representation, and modifying the latent representation using the one or more variable rate control parameters before performing the decoding, and wherein the method comprises applying an offset to the one or more learned variable rate control parameters.
[0120] Optionally, the offset is based on the predicted target quality level and / or reference image selection.
[0121] Optionally, the offset comprises a learned offset.
[0122] Optionally, the one or more variable rate control parameters comprise a matrix of first values, wherein the offset comprises a matrix of second values, and wherein applying the offset to the variable rate control parameters comprises modifying the first values using the second values.
[0123] Optionally, comprising repeating the predicting, encoding, and decoding for each image of the sequence of images to predict a sequence of target quality levels and reference image selections of a scalable video coding scheme optimised for the sequence of images.
[0124] Optionally, comprising transmitting from the first computer system to the second computer system information representing the sequence of target quality levels and reference image selections.
[0125] According to an aspect, there is provided a data processing apparatus configured to perform any of the above methods.
[0126] According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
[0127] According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
[0128] According to an aspect there is provided a method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of:
[0129] receiving a sequence of images at a first computer system;
[0130] setting a mini group of pictures (GOP) size based on at least two images of the sequence and assigning one or more of the images to a mini GOP of the size;
[0131] encoding, transmitting and decoding the images of the mini GOP by:
[0132] with a first neural network, encoding the images to produce latent representations; transmitting the latent representations to a second computer system; and
[0133] with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP.
[0134] Optionally, setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of:
[0135] (i) estimating a difference between at least two images of the sequence, and
[0136] (ii) changing the mini GOP size, until the difference exceeds a threshold.
[0137] Optionally, the difference between the at least two images comprises optical flow information.
[0138] Optionally, the optical flow information comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
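The size-setting loop above can be sketched as growing the mini GOP until the motion between consecutive images becomes too large. In this minimal example the optical flow is assumed to be precomputed (for instance by the third neural network), each `flows[i]` is treated as one flattened slice of the flow tensor between image i and image i + 1, and the L2 norm and `max_size` cap are assumptions.

```python
import math


def flow_norm(flow):
    """L2 norm of a flat list of optical flow components."""
    return math.sqrt(sum(v * v for v in flow))


def set_mini_gop_size(flows, threshold, start=1, max_size=8):
    """Grow the mini GOP while the flow between consecutive images stays small.

    flows[i] is the estimated flow between image i and image i + 1.
    """
    size = start
    while size < max_size and size - 1 < len(flows):
        if flow_norm(flows[size - 1]) > threshold:
            break  # difference exceeds the threshold: stop growing
        size += 1
    return size


flows = [[0.1, 0.0], [0.2, 0.1], [3.0, 4.0], [0.0, 0.0]]
size = set_mini_gop_size(flows, threshold=1.0)
```

Here the third flow has norm 5.0, so the mini GOP closes after three images and the fourth image starts the next one.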
[0139] Optionally, with a third neural network and using the at least two images, producing the optical flow information.
[0140] Optionally, setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of:
[0141] (i) with a fourth neural network and using the at least two images, producing a mask for occluding pixels of a first image of the at least two images when the first image is a reference frame for encoding and decoding a second image of the at least two images,
[0142] (ii) estimating a statistical value associated with the mask, and
[0143] (iii) changing the mini GOP size, until the statistical value exceeds a threshold.
[0144] Optionally, the mask comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
[0145] Optionally, changing the mini GOP size comprises increasing the mini GOP size. Optionally, the at least two images of the sequence comprise a first image and a next image in the sequence.
[0146] Optionally, assigning the one or more images to the mini GOP comprises specifying, for the one or more images, a frame type parameter, a quality level parameter, and/or an encode order parameter.
[0147] Optionally, comprising assigning all images of the sequence to one or more mini GOPs before performing the encoding, transmitting and decoding.
[0148] Optionally, comprising performing the encoding, transmitting and decoding of a first mini GOP before assigning one or more of the images to a second mini GOP.
[0149] Optionally, comprising downsampling the at least two images of the sequence, and wherein setting the mini GOP size is based on the at least two downsampled images of the sequence.
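By way of illustration only, the optional size-setting loop described above (estimate a difference, change the mini GOP size, stop once a threshold is exceeded) might be sketched in Python as follows. This is one possible reading, not the disclosed implementation: a mean absolute pixel difference stands in for the optical flow information or mask statistic, and the names `frame_difference`, `choose_mini_gop_size`, `start_size`, `max_size` and `threshold`, together with their default values, are assumptions introduced for the example.

```python
def frame_difference(a, b):
    """Mean absolute difference between two equally sized greyscale frames."""
    total = sum(abs(pa - pb) for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))
    count = sum(len(row) for row in a)
    return total / count


def choose_mini_gop_size(frames, start, start_size=1, max_size=8, threshold=10.0):
    """Grow the mini GOP beginning at `start` until the difference exceeds `threshold`."""
    size = start_size
    while size < max_size and start + size < len(frames):
        # Compare the first image of the mini GOP with the next candidate image.
        if frame_difference(frames[start], frames[start + size]) > threshold:
            break
        size += 1  # "changing the mini GOP size" is taken here to mean increasing it
    return size


def flat(value, h=2, w=2):
    """Hypothetical helper: an h x w frame filled with a single value."""
    return [[value] * w for _ in range(h)]


# Three similar frames followed by a scene change.
frames = [flat(0), flat(1), flat(2), flat(50), flat(51)]
first_size = choose_mini_gop_size(frames, start=0)
second_size = choose_mini_gop_size(frames, start=first_size)
```

In this toy sequence the first mini GOP stops growing at the scene change, and the second covers the remaining frames.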
[0150] According to an aspect there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of:
[0151] receiving a sequence of images at a first computer system;
[0152] setting a mini group of pictures (GOP) size based on at least two images of the sequence and assigning one or more of the images to a mini GOP of the size;
[0153] encoding and transmitting the images of the mini GOP by:
[0154] with a first neural network, encoding the images to produce latent representations; and transmitting the latent representations to a second computer system.
[0155] According to an aspect there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of:
[0156] receiving latent representations at a second computer system, the latent representations produced at a first computer system by setting a mini group of pictures (GOP) size based on at least two images of a sequence of images, assigning one or more of the images to a mini GOP of the size, and encoding the images of the mini GOP by: with a first neural network, encoding the images to produce the latent representations; and
[0157] with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP.
[0158] Optionally, setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of:
[0159] (i) estimating a difference between at least two images of the sequence, and
[0160] (ii) changing the mini GOP size, until the difference exceeds a threshold.
[0161] Optionally, the difference between the at least two images comprises optical flow information.
[0162] Optionally, the optical flow information comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
[0163] Optionally, with a third neural network and using the at least two images, producing the optical flow information.
[0164] Optionally, setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of:
[0165] (i) with a fourth neural network and using the at least two images, producing a mask for occluding pixels of a first image of the at least two images when the first image is a reference frame for encoding and decoding a second image of the at least two images,
[0166] (ii) estimating a statistical value associated with the mask, and
[0167] (iii) changing the mini GOP size, until the statistical value exceeds a threshold.
[0168] Optionally, the mask comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
[0169] Optionally, changing the mini GOP size comprises increasing the mini GOP size. Optionally, the at least two images of the sequence comprise a first image and a next image in the sequence.
[0170] Optionally, assigning the one or more images to the mini GOP comprises specifying, for the one or more images, a frame type parameter, a quality level parameter, and/or an encode order parameter.
[0171] Optionally, comprising assigning all images of the sequence to one or more mini GOPs before performing the encoding, transmitting and decoding.
[0172] Optionally, comprising performing the encoding, transmitting and decoding of a first mini GOP before assigning one or more of the images to a second mini GOP.
[0173] Optionally, comprising downsampling the at least two images of the sequence, and wherein setting the mini GOP size is based on the at least two downsampled images of the sequence.
[0174] According to an aspect, there is provided a data processing apparatus configured to perform any of the above methods.
[0175] According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
[0176] According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
[0177] According to an aspect there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of:
[0178] receiving an input image at a first computer system;
[0179] encoding the input image using a first trained neural network to produce a latent representation;
[0180] selecting a parameter associated with a position in a frame hierarchy scheme, and using the selected parameter to produce values for modifying the latent representation;
[0181] processing the latent representation using the values to produce a modified latent representation;
[0182] transmitting the modified latent representation to a second computer system; and processing the modified latent representation using the values and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at the position in the frame hierarchy scheme.
[0183] Optionally, producing the values comprises applying a function to the selected parameter, wherein the function is based on one or more learned gain units.
[0184] Optionally, each gain unit comprises a tensor of learned values, and wherein the function comprises an interpolation function for producing, based on the selected parameter, the values for modifying the latent representation by interpolating between the learned values of the gain unit tensors.
[0185] Optionally, the selected parameter comprises one of a plurality of learned parameters. Optionally, at least one of the plurality of learned parameters is based on at least one other of the plurality of learned parameters.
[0186] Optionally, at least one of the plurality of learned parameters comprises a cumulative sum of at least two others of the plurality of learned parameters.
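By way of illustration only, the gain-unit options above might be sketched in Python as follows. The sketch assumes linear interpolation between two gain-unit vectors, multiplicative scaling at the encoder and the inverse scaling at the decoder; none of these choices, nor the names `interpolate_gain`, `apply_gain`, `remove_gain`, `gain_units` and `increments`, are taken from the disclosure.

```python
from itertools import accumulate


def interpolate_gain(gain_units, position):
    """Linearly interpolate between adjacent gain-unit vectors at a fractional position."""
    lower = int(position)
    upper = min(lower + 1, len(gain_units) - 1)
    t = position - lower
    return [(1 - t) * g0 + t * g1 for g0, g1 in zip(gain_units[lower], gain_units[upper])]


def apply_gain(latent, gain):
    """Encoder side: scale each latent channel by its gain value."""
    return [y * g for y, g in zip(latent, gain)]


def remove_gain(modified_latent, gain):
    """Decoder side: undo the scaling before decoding (assumed inverse operation)."""
    return [y / g for y, g in zip(modified_latent, gain)]


# Two hypothetical learned gain units, each holding one value per latent channel.
gain_units = [[1.0, 2.0], [3.0, 4.0]]

# Hypothetical learned increments; their cumulative sum gives one parameter per
# position in the frame hierarchy, so each parameter depends on the earlier ones.
increments = [0.0, 0.25, 0.5]
parameters = list(accumulate(increments))

gain = interpolate_gain(gain_units, parameters[1])
modified = apply_gain([4.0, 8.0], gain)
restored = remove_gain(modified, gain)
```

Building the parameters as a cumulative sum of non-negative increments is one simple way to keep them ordered across hierarchy positions.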
[0187] According to an aspect there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of:
[0188] receiving an input image at a first computer system;
[0189] encoding the input image using a first trained neural network to produce a latent representation;
[0190] selecting a parameter associated with a position in a frame hierarchy scheme, and using the selected parameter to produce values for modifying the latent representation;
[0191] processing the latent representation using the values to produce a modified latent representation; and
[0192] transmitting the modified latent representation to a second computer system.
[0193] According to an aspect there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of:
[0194] receiving a modified latent representation at a second computer system, the modified latent representation produced by encoding an input image using a first trained neural network to produce a latent representation, selecting a parameter associated with a position in a frame hierarchy scheme, using the selected parameter to produce values for modifying the latent representation, and processing the latent representation using the values to produce the modified latent representation; and
[0196] processing the modified latent representation using the values and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at the position in the frame hierarchy scheme.
[0197] Optionally, producing the values comprises applying a function to the selected parameter, wherein the function is based on one or more learned gain units.
[0198] Optionally, each gain unit comprises a tensor of learned values, and wherein the function comprises an interpolation function for producing, based on the selected parameter, the values for modifying the latent representation by interpolating between the learned values of the gain unit tensors.
[0199] Optionally, the selected parameter comprises one of a plurality of learned parameters. Optionally, at least one of the plurality of learned parameters is based on at least one other of the plurality of learned parameters.
[0200] Optionally, at least one of the plurality of learned parameters comprises a cumulative sum of at least two others of the plurality of learned parameters.
[0201] According to an aspect, there is provided a data processing apparatus configured to perform any of the above methods.
[0202] According to an aspect, there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the above methods.
According to an aspect, there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the above methods.
BRIEF DESCRIPTION OF THE DRAWINGS
Aspects of the invention will now be described by way of examples, with reference to the following figures in which:
[0203] Figure 1 illustrates an example of an image or video compression, transmission and decompression pipeline.
[0204] Figure 2 illustrates a further example of an image or video compression, transmission and decompression pipeline including a hyper-network.
[0205] Figure 3 illustrates an example of a video compression, transmission and decompression pipeline.
[0206] Figure 4 illustrates an example of a video compression, transmission and decompression system.
[0207] Figure 6 illustrates an example of a video compression, transmission and decompression pipeline.
[0208] Figure 7 illustrates an example of a video compression, transmission and decompression pipeline.
[0209] Figure 8 illustrates a reference frame dependency scheme.
[0210] Figure 9 illustrates a predictive reference frame dependency scheme.
[0211] Figure 10a illustrates a step of a predictive reference frame dependency and quality level scheme.
[0212] Figure 10b illustrates a step of a predictive reference frame dependency and quality level scheme.
[0213] Figure 10c illustrates a step of a predictive reference frame dependency and quality level scheme.
[0214] Figure 10d illustrates a step of a predictive reference frame dependency and quality level scheme.
[0215] Figure 10e illustrates a step of a predictive reference frame dependency and quality level scheme.
[0216] Figure 11 illustrates a part of a video compression, transmission and decompression pipeline.
[0217] Figure 12 illustrates a part of a video compression, transmission and decompression pipeline.
[0218] Figure 13 illustrates a part of a video compression, transmission and decompression pipeline.
[0219] Figure 14 illustrates a part of a video compression, transmission and decompression pipeline.
[0220] Figure 15 illustrates a part of a video compression, transmission and decompression pipeline.
[0221] DETAILED DESCRIPTION OF THE DRAWINGS
[0222] Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression.
[0223] In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H x W x C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image. Each H x W data point of the image represents a pixel value of the image at the corresponding location. Each channel C of the image represents a different component of the image for each pixel, and the components are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue components of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video.
[0224] The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way. An example of such a method is using the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network.
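By way of illustration only, the mean square error between an input image and an output image stored as H x W x C tensors might be computed as in the following Python sketch. Nested lists stand in for tensors, and the names `mean_squared_error`, `x` and `x_hat` are assumptions introduced for the example.

```python
def mean_squared_error(x, x_hat):
    """MSE over all pixel values of two images stored as nested H x W x C lists."""
    squared = [
        (p - q) ** 2
        for row, row_hat in zip(x, x_hat)
        for pixel, pixel_hat in zip(row, row_hat)
        for p, q in zip(pixel, pixel_hat)
    ]
    return sum(squared) / len(squared)


# A 1 x 2 RGB image (H=1, W=2, C=3) and a slightly distorted reconstruction.
x = [[[10, 20, 30], [40, 50, 60]]]
x_hat = [[[10, 22, 30], [40, 50, 57]]]
distortion = mean_squared_error(x, x_hat)  # (2**2 + 3**2) / 6
```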
[0226] Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
[0227] AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
[0228] Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
[0229] Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network.
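By way of illustration only, the operations of a single layer (a convolution with a kernel, a bias and an activation function) might be sketched in Python as follows. The sketch uses a one-dimensional input and a ReLU activation for brevity, and computes the sliding dot product without flipping the kernel, as is common in neural network libraries. The names `relu` and `conv1d` and the numeric values are assumptions introduced for the example.

```python
def relu(v):
    """Activation function: pass positive values, clamp negative values to zero."""
    return max(0.0, v)


def conv1d(signal, kernel, bias=0.0):
    """Slide the kernel over the signal, add the bias and apply the activation."""
    k = len(kernel)
    out = []
    for i in range(len(signal) - k + 1):
        window = signal[i:i + k]
        weighted_sum = sum(w * s for w, s in zip(kernel, window))
        out.append(relu(weighted_sum + bias))
    return out


# The kernel values and the bias are the trainable parameters of this layer.
output = conv1d([1.0, 2.0, 4.0, 3.0], kernel=[-1.0, 1.0], bias=0.5)
```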
[0230] Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
[0232] To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The result of the loss function may be calculated using the difference between the output of the network and the expected output to determine the gradient of the loss function. Back-propagation, together with gradient descent on the loss function, may be used to update the parameters of the neural network using the gradients dL/dy of the loss function. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network.
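By way of illustration only, the training procedure described above might be sketched in Python for the simplest possible model, y = w * x with a squared-error loss. The gradient dL/dy is propagated back to the single parameter w by the chain rule. The name `train`, the learning rate, the number of steps and the fixed (rather than randomized) initial parameter are assumptions made so that the example is reproducible.

```python
def train(inputs, targets, w=0.0, learning_rate=0.05, steps=200):
    """Fit y = w * x by gradient descent on the squared error L = (y - t) ** 2."""
    for _ in range(steps):
        for x, t in zip(inputs, targets):
            y = w * x                    # forward pass
            dL_dy = 2.0 * (y - t)        # gradient of the loss with respect to the output
            dL_dw = dL_dy * x            # chain rule: back-propagate to the parameter
            w -= learning_rate * dL_dw   # gradient descent update
    return w


# Ground truth generated by w = 2; training should recover a value close to 2.
w = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```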
[0233] In the context of image or video compression, this type of system, in which the elements or the whole network architecture are trained simultaneously with back-propagation, may be referred to as end-to-end, learned image or video compression. Unlike traditional compression algorithms that use primarily handcrafted, manually constructed steps, an end-to-end learned system learns by itself during training what combination of parameters best achieves the goal of minimising the loss function. This approach is advantageous compared to systems that are not end-to-end learned because an end-to-end system has greater flexibility to learn weights and parameters that might be counter-intuitive to someone handcrafting features.
[0234] It will be appreciated that the term "training" or "learning" as used herein means the process of optimizing an artificial intelligence or machine learning model, based on a given set of data. This involves iteratively adjusting the parameters of the model to minimize the discrepancy between the model's predictions and the actual data, represented by the above-described rate-distortion loss function.
[0235] The training process may comprise multiple epochs. An epoch refers to one complete pass of the entire training dataset through the machine learning algorithm. During an epoch, the model's parameters are updated in an effort to minimize the loss function. It is envisaged that multiple epochs may be used to train a model, with the exact number depending on various factors including the complexity of the model and the diversity of the training data.
[0237] Within each epoch, the training data may be divided into smaller subsets known as batches. The size of a batch, referred to as the batch size, may influence the training process. A smaller batch size can lead to more frequent updates to the model's parameters, potentially leading to faster convergence to the optimal solution, but at the cost of increased computational resources. Conversely, a larger batch size involves fewer updates, which can be more computationally efficient but might converge slower or even fail to converge to the optimal solution.
[0238] The learnable parameters are updated by a specified amount each time, determined by the learning rate. The learning rate is a hyperparameter that decides how much the parameters are adjusted during the training process. A smaller learning rate implies smaller steps in the parameter space and a potentially more accurate solution, but it may require more epochs to reach that solution. On the other hand, a larger learning rate can expedite the training process but may risk overshooting the optimal solution or causing the training process to diverge.
[0239] The training described herein may involve use of a validation set, which is a portion of the data not used in the initial training, which is used to evaluate the model's performance and to prevent overfitting. Overfitting occurs when a model learns the training data too well, to the point that it fails to generalize to unseen data. Regularization techniques, such as dropout or L1/L2 regularization, can also be used to mitigate overfitting.
[0240] It will be appreciated that training a machine learning model is an iterative process that may comprise selection and tuning of various parameters and hyperparameters. As will be appreciated, the specific details of the training process, such as hyperparameters and so on, may vary and it is envisaged that producing a trained model in this way may be achieved in a number of different ways with different epochs, batch sizes, learning rates, regularisations, and so on, the details of which are not essential to enabling the advantages and effects of the present disclosure, except where stated otherwise. The point at which an "untrained" neural network is considered to be "trained" is envisaged to be case specific and to depend, for example, on a number of epochs, a plateauing of any further learning, or some other metric, and is not considered to be essential in achieving the advantages described herein.
[0241] More details of an end-to-end, learned compression process will now be described. It will be appreciated that in some cases, end-to-end, learned compression processes may be combined with one or more components that are handcrafted or trained separately.
[0242] In the case of AI based image or video compression, the loss function may be defined by the rate distortion equation. The rate distortion equation may be represented by Loss = D + λ * R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
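By way of illustration only, the rate distortion equation (distortion plus a weighted rate term) might be evaluated as in the following Python sketch. The rate is estimated here as the sum of -log2 of the probability of each quantised symbol, which is a common convention but an assumption of this example. The names `rate_bits`, `rate_distortion_loss` and `lam` (the weighting factor), and the probability table, are also assumptions.

```python
import math


def rate_bits(symbols, probabilities):
    """Estimated rate: sum of -log2 p(symbol) over the quantised latent symbols."""
    return sum(-math.log2(probabilities[s]) for s in symbols)


def rate_distortion_loss(distortion, rate, lam):
    """Loss = D + lam * R, with lam weighting the rate term against the distortion."""
    return distortion + lam * rate


# Hypothetical probability model over three symbol values.
probabilities = {0: 0.5, 1: 0.25, -1: 0.25}
rate = rate_bits([0, 0, 1, -1], probabilities)  # 1 + 1 + 2 + 2 bits
loss = rate_distortion_loss(distortion=2.0, rate=rate, lam=0.5)
```

A larger weighting factor favours a lower rate during training, while a smaller one favours lower distortion.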
[0243] In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the ImageNet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P ("professional") and M ("mobile") (for example at http://challenge.compression.cc/tasks/).
[0244] An example of an AI based compression, transmission and decompression process 100 is shown in Figure 1. As a first step in the AI based compression process, an input image 5 is provided. The input image 5 is provided to a trained neural network 110 characterized by a function fθ acting as an encoder. The encoder neural network 110 produces an output based on the input image. This output is referred to as a latent representation of the input image 5. In a second step, the latent representation is quantised in a quantisation process 140 characterised by the operation Q, resulting in a quantized latent. The quantisation process transforms the continuous latent representation into a discrete quantized latent. An example of a quantization process is a rounding function.
[0245] In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network.
[0246] In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder.
[0247] Entropy encoding processes such as range or arithmetic encoding are typically able to losslessly compress given input data up to close to the fundamental entropy limit of that data, as determined by the total entropy of the distribution of that data. Accordingly, one way in which end-to-end, learned compression can minimise the rate loss term of the rate-distortion loss function and thereby increase compression effectiveness is to learn autoencoder parameter values that produce low entropy latent representation distributions. Producing latent representations distributed with as low an entropy as possible allows entropy encoding to compress the latent distributions as close to or to the fundamental entropy limit for that distribution. The lower the entropy of the distribution, the more entropy encoding can losslessly compress it and the lower the amount of data in the corresponding bitstream. In some cases where the latent representation is distributed according to a Gaussian or Laplacian distribution, this learning may comprise learning optimal location and scale parameters of the Gaussian or Laplacian distributions. In other cases, it allows the learning of more flexible latent representation distributions which can further help to achieve the minimising of the rate-distortion loss function in ways that are not intuitive or possible to do with handcrafted features. Examples of these and other advantages are described in WO2021/220008A1, which is incorporated in its entirety by reference.
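By way of illustration only, the relationship between the entropy of a latent distribution and the amount of data an entropy coder needs might be sketched in Python as follows. The sketch computes the empirical Shannon entropy, in bits per symbol, of two hypothetical sets of quantised latent values. The name `entropy_bits_per_symbol` and the example values are assumptions.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(symbols):
    """Empirical Shannon entropy: a lower bound on the average lossless code length."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


peaked = [0, 0, 0, 0, 0, 0, 1, -1]     # most of the mass on a single value
spread = [-3, -2, -1, 0, 1, 2, 3, 4]   # eight equally likely values

peaked_entropy = entropy_bits_per_symbol(peaked)
spread_entropy = entropy_bits_per_symbol(spread)
```

The peaked set needs roughly one bit per symbol while the spread set needs three, which is why learning latents with a concentrated distribution reduces the rate.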
[0249] Something which is closely linked to the entropy encoding of the latent distribution, and which accordingly also has an effect on the effectiveness of compression of end-to-end learned approaches, is the quantisation step. During inference, a rounding function may be used to quantise a latent representation distribution into bins of given sizes. However, a rounding function is not differentiable everywhere. Rather, a rounding function is effectively one or more step functions whose gradient is either zero (at the top of the steps) or infinity (at the boundary between steps). Back propagating a gradient of a loss function through a rounding function is therefore challenging. Instead, during training, quantisation by rounding function is replaced by one or more other approaches. For example, the functions of a noise quantisation model are differentiable everywhere and accordingly do allow backpropagation of the gradient of the loss function through the quantisation parts of the end-to-end, learned system. Alternatively, a straight-through estimator (STE) quantisation model or one or more other quantisation models may be used. It is also envisaged that different quantisation models may be used during evaluation of different terms of the loss function. For example, noise quantisation may be used to evaluate the rate or entropy loss term of the rate-distortion loss function while STE quantisation may be used to evaluate the distortion term.
[0250] In a similar manner to how learning parameters to produce certain distributions of the latent representation facilitates achieving better rate loss term minimisation, end-to-end learning of the quantisation process achieves a similar effect. That is, learnable quantisation parameters provide the architecture with a further degree of freedom to achieve the goal of minimising the loss function. For example, parameters corresponding to quantisation bin sizes may be learned which is likely to result in an improved rate-distortion loss outcome compared to approaches using hand-crafted quantisation bin sizes.
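By way of illustration only, the three quantisation models mentioned above might be sketched in Python as follows. The sketch assumes a bin size of one and additive uniform noise of width one for the noise model, and writes the straight-through estimator as an explicit forward and backward pair because no automatic differentiation library is used. All function names are assumptions introduced for the example.

```python
import random


def quantise_round(latent):
    """Inference-time quantisation: round each value to the nearest integer.

    Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [float(round(v)) for v in latent]


def quantise_noise(latent, rng):
    """Training-time proxy: add uniform noise of width one instead of rounding."""
    return [v + rng.uniform(-0.5, 0.5) for v in latent]


def ste_forward(latent):
    """Straight-through estimator, forward pass: identical to rounding."""
    return quantise_round(latent)


def ste_backward(upstream_gradient):
    """Straight-through estimator, backward pass: the gradient passes through unchanged."""
    return list(upstream_gradient)


latent = [0.2, 1.7, -2.4]
rounded = quantise_round(latent)
noisy = quantise_noise(latent, random.Random(0))
gradient = ste_backward([0.1, -0.3, 2.0])
```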
[0252] Further, as the rate-distortion loss function constantly has to balance a rate loss term against a distortion loss term, it has been found that the more degrees of freedom the system has during training, the better the architecture is at achieving optimal rate and distortion trade off.
[0253] The system described above may be distributed across multiple locations and / or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
[0254] The Al based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder
[0255]
[0256] and a trained neural network 125 acting as a hyper-decoder
[0257]
[0258] . An example of such a system is shown in Figure 2. Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-decoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by Qhto produce a quantized hyper-latent. The quantization process 145 characterised by Qhmay be the same as the quantisation process 140 characterised by Q discussed above.
[0259] In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to the trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-encoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation.
[0260] In the example shown in Figure 2, only a single entropy decoding process 165 and hyper-decoder 125 are shown for simplicity. However, in practice, as the decompression process usually takes place on a separate device, duplicates of these processes will be present on the device used for encoding to provide the parameters to be used in the entropy encoding process 150.
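As a minimal sketch of why such parameters help, the snippet below estimates how many bits one quantized latent value would cost under a discretised Gaussian probability model whose mean and standard deviation stand in for hyper-decoder outputs. The Gaussian form, the unit-width quantisation bins and the function names are assumptions made for illustration only; they are not taken from the description above.

```python
import math

def gaussian_cdf(x, mean, std):
    # Cumulative distribution function of a Gaussian with the given parameters.
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def estimated_bits(symbol, mean, std):
    # Probability mass of the unit-width bin centred on the integer symbol,
    # converted to an ideal code length in bits.
    p = gaussian_cdf(symbol + 0.5, mean, std) - gaussian_cdf(symbol - 0.5, mean, std)
    return -math.log2(max(p, 1e-12))

# Same quantized value, coded with and without an accurate predicted mean.
bits_with_prediction = estimated_bits(3, mean=3.0, std=1.0)
bits_without_prediction = estimated_bits(3, mean=0.0, std=1.0)
```

In this toy setting, the closer the predicted mean is to the value actually being coded, the fewer bits the entropy coder needs, which is the sense in which the meta-information improves the compression process.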
[0261] Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper-latent may be converted to a residual value before the entropy encoding process 150, 155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper-latent. The residual values may also be normalised.
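The residual conversion can be sketched as follows. Dividing by a standard deviation is only one possible normalisation and is an assumption here, as are the function names and the choice of a single scalar mean and standard deviation for all values.

```python
def to_residuals(latents, mean, std):
    # Subtract the distribution mean, then normalise by the standard deviation.
    return [(value - mean) / std for value in latents]

def from_residuals(residuals, mean, std):
    # Undo the normalisation and add the mean back.
    return [residual * std + mean for residual in residuals]

residuals = to_residuals([2.0, 4.0, 6.0], mean=4.0, std=2.0)
restored = from_residuals(residuals, mean=4.0, std=2.0)
```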
[0262] To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step. The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indication may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between an input ground truth and input fake.
[0263] When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously.
[0264] During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6.
[0265] Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by decoder 120.
[0266] Details of a video compression process will now be described. As discussed above, a video is made up of a series of images arranged in sequential order. The AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
[0267] The frames in a video may be labelled based on the information from other frames that is used to decode the frame in a video compression, transmission and decompression process. As described above, frames which are decoded using no information from other frames may be referred to as I-frames. Frames which are decoded using information from past frames may be referred to as P-frames. Frames which are decoded using information from past frames and future frames may be referred to as B-frames. Frames may not be encoded and / or decoded in the order that they appear in the video. For example, a frame at a later time step in the video may be decoded before a frame at an earlier time.
[0268] The images represented by each frame of a video may be related. For example, a number of frames in a video may show the same scene. In this case, a number of different parts of the scene may be shown in more than one of the frames. For example, objects or people in a scene may be shown in more than one of the frames. The background of the scene may also be shown in more than one of the frames. If an object or the perspective is in motion in the video, the position of the object or background in one frame may change relative to the position of the object or background in another frame. The transformation of a part of the image from a first position in a first frame to a second position in a second frame may be referred to as flow, warping or motion compensation. The flow may be represented by a vector. One or more flows that represent the transformation of at least part of one frame to another frame may be referred to as a flow map.
[0269] An example AI based video compression, transmission, and decompression process 200 is shown in Figure 3. The process 200 shown in Figure 3 is divided into an I-frame part 201 for decompressing I-frames, and a P-frame part 202 for decompressing P-frames. It will be understood that these divisions into different parts are arbitrary and the process 200 may also be considered as a single, end-to-end pipeline.
[0270] As described above, I-frames do not rely on information from other frames so the I-frame part 201 corresponds to the compression, transmission, and decompression process illustrated in Figures 1 or 2. The specific details will not be repeated here but, in summary, an input image $x_0$ is passed into an encoder neural network 203 producing a latent representation which is quantised and entropy encoded into a bitstream 204. The subscript 0 in $x_0$ indicates the input image corresponds to a frame of a video stream at position t = 0. This may be the first frame of an entire video stream or the first frame of a chunk of a video stream made up of, for example, an I-frame and a plurality of subsequent P-frames and / or B-frames. The bitstream 204 is then entropy decoded and passed into a decoder neural network 205 to reproduce a reconstructed image $\hat{x}_0$ which in this case is an I-frame. The decoding step may be performed both locally at the same location as where the input image compression occurs as well as at the location where the decompression occurs. This allows the reconstructed image $\hat{x}_0$ to be available for later use by components of both the encoding and decoding sides of the pipeline.
[0271] In contrast to I-frames, P-frames (and B-frames) do rely on information from other frames. Accordingly, the P-frame part 202 at the encoding side of the pipeline takes as input not only the input image $x_t$ that is to be compressed (corresponding to a frame of a video stream at position t), but also one or more previously reconstructed images $\hat{x}_{t-1}$ from an earlier frame t-1. As described above, the previously reconstructed image $\hat{x}_{t-1}$ is available at both the encode and decode side of the pipeline and can accordingly be used for various purposes at both the encode and decode sides.
[0272] At the encode side, previously reconstructed images may be used for generating a flow map containing information indicative of inter-frame movement of pixels between frames. In the example of Figure 3, both the image being compressed $x_t$ and the previously reconstructed image from an earlier frame $\hat{x}_{t-1}$ are passed into a flow module part 206 of the pipeline. The flow module part 206 comprises an autoencoder such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 207 has been trained to produce a latent representation of a flow map from inputs $\hat{x}_{t-1}$ and $x_t$, which is indicative of inter-frame movement of pixels or pixel groups between $\hat{x}_{t-1}$ and $x_t$.
[0273] The latent representation of the flow map is quantised and entropy encoded to compress it and then transmitted as a bitstream 208. On the decode side, the bitstream is entropy decoded and passed to a decoder neural network 209 to produce a reconstructed flow map $f$.
[0274] The reconstructed flow map $f$ is applied to the previously reconstructed image $\hat{x}_{t-1}$ to generate a warped image $\hat{x}_{t-1,w}$.
[0276] It is envisaged that any suitable warping technique may be used, for example bi-linear or tri-linear warping, as is described in Agustsson, E., Minnen, D., Johnston, N., Balle, J., Hwang, S. J., and Toderici, G. (2020). Scale-space flow for end-to-end optimized video compression. In Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition (pp. 8503-8512), which is hereby incorporated by reference. It is further envisaged that a scale-space flow approach as described in the above paper may also optionally be used. The warped image $\hat{x}_{t-1,w}$ is a prediction of how the previously reconstructed image $\hat{x}_{t-1}$ might have changed between frame positions t-1 and t, based on the output flow map produced by the flow module part 206 autoencoder system from the inputs of $x_t$ and $\hat{x}_{t-1}$.
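By way of illustration only, bi-linear warping of a single-channel image with a per-pixel flow map can be sketched as below. The backward-warping convention (each output pixel samples the reference image at its own position plus the flow vector), the clamping at image borders, the nested-list image layout and the function names are all assumptions of this sketch rather than details taken from the description.

```python
import math

def bilinear_sample(image, y, x):
    # Clamp the sampling position to the image, then blend the four neighbours.
    height, width = len(image), len(image[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * image[y0][x0] + wx * image[y0][x1]
    bottom = (1 - wx) * image[y1][x0] + wx * image[y1][x1]
    return (1 - wy) * top + wy * bottom

def warp(reference, flow):
    # flow[y][x] is a (dy, dx) pair telling output pixel (y, x) where to sample.
    return [
        [bilinear_sample(reference, y + flow[y][x][0], x + flow[y][x][1])
         for x in range(len(reference[0]))]
        for y in range(len(reference))
    ]

reference = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
zero_flow = [[(0.0, 0.0)] * 3 for _ in range(2)]
half_pixel_flow = [[(0.0, 0.5)] * 3 for _ in range(2)]
unchanged = warp(reference, zero_flow)
shifted = warp(reference, half_pixel_flow)
```

A zero flow map leaves the reference image unchanged, while a half-pixel horizontal flow produces values interpolated between neighbouring pixels.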
[0277] As with the I-frame, the reconstructed flow map $f$ and corresponding warped image $\hat{x}_{t-1,w}$ may be produced both on the encode side and the decode side of the pipeline so they are available for use by other components of the pipeline on both the encode and decode sides.
[0278] In the example of Figure 3, both the image being compressed $x_t$ and the warped image $\hat{x}_{t-1,w}$ are passed into a residual module part 210 of the pipeline. The residual module part 210 comprises an autoencoder system such as that of the autoencoder systems of Figures 1 and 2 but where the encoder neural network 211 has been trained to produce a latent representation of a residual map indicative of differences between the input image $x_t$ and the warped image $\hat{x}_{t-1,w}$. The latent representation of the residual map is then quantised and entropy encoded into a bitstream 212 and transmitted. The bitstream 212 is then entropy decoded and passed into a decoder neural network 213 which reconstructs a residual map $r$ from the decoded latent representation.
[0279] Alternatively, a residual map may first be pre-calculated between $x_t$ and $\hat{x}_{t-1,w}$ and the pre-calculated residual map may be passed into an autoencoder for compression only. This hand-crafted residual map approach is computationally simpler, but reduces the degrees of freedom with which the architecture may learn weights and parameters to achieve its goal during training of minimising the rate-distortion loss function.
[0280] Finally, on the decode side, the residual map $r$ is applied (e.g. combined by addition, subtraction or a different operation) to the warped image to produce a reconstructed image $\hat{x}_t$ which is a reconstruction of image $x_t$ and accordingly corresponds to a P-frame at position t in a sequence of frames of a video stream.
[0281] It will be appreciated that the reconstructed image $\hat{x}_t$ can then be used to process the next frame. That is, it can be used to compress, transmit and decompress $x_{t+1}$, and so on until an entire video stream or chunk of a video stream has been processed.
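The frame-by-frame decode loop described above can be sketched as follows. To stay short, the sketch replaces the learned flow and residual decoders with already decoded values, uses a whole-pixel horizontal shift as a toy stand-in for warping, and combines warped image and residual by addition; these simplifications and all names are assumptions for illustration.

```python
def shift_right(frame, flow):
    # Toy warp: move every row `flow` whole pixels to the right, repeating the edge.
    return [[row[max(x - flow, 0)] for x in range(len(row))] for row in frame]

def add_residual(warped, residual):
    # Combine the warped prediction and the residual map by element-wise addition.
    return [[w + r for w, r in zip(w_row, r_row)]
            for w_row, r_row in zip(warped, residual)]

def decode_chunk(i_frame, p_payloads, warp_fn=shift_right):
    # Each P-frame is predicted from the most recently reconstructed frame.
    frames = [i_frame]
    for flow, residual in p_payloads:
        warped = warp_fn(frames[-1], flow)
        frames.append(add_residual(warped, residual))
    return frames

i_frame = [[1, 2, 3, 4]]
payloads = [(1, [[0, 0, 0, 0]]), (1, [[5, 0, 0, 0]])]
frames = decode_chunk(i_frame, payloads)
```

The second payload shows the role of the residual: the value 5 corrects content at the left edge that the shifted previous frame could not predict.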
[0282] Alternatively, the residual autoencoder may be trained to reconstruct the frame $x_t$ directly from the entropy decoded bitstream by removing the connection between $\hat{x}_{t-1,w}$ and the output of the residual block 210, thereby eliminating any direct combination step with the warped previously decoded image to speed up inference. In this case, the flow information is intuitively understood to be indirectly captured within the residual information, which the residual decoder is able to learn to use to directly reconstruct the output image $\hat{x}_t$.
[0283] Alternatively, the residual autoencoder may be trained to reconstruct the frame $x_t$ directly from the entropy decoded bitstream in combination with some representation of flow injected into one or more layers of the residual decoder. In this case, the flow information is intuitively understood to be indirectly captured within the injected information, which the residual decoder is able to learn to use while decoding the latent representation of flow information to directly reconstruct the output image $\hat{x}_t$.
[0284] Thus, for a block of video frames comprising an I-frame and n subsequent P-frames, the bitstream may contain (i) a quantised, entropy encoded latent representation of the I-frame image, and (ii) a quantised, entropy encoded latent representation of a flow map and residual map of each P-frame image. For completeness, whilst not illustrated in Figure 3, any of the autoencoder systems of Figure 3 may comprise hyper and hyper-hyper networks such as those described in connection with Figure 2. Accordingly, the bitstream may also contain the hyper and hyper-hyper parameters of those networks, their quantised, entropy encoded latent representations, and so on, as applicable.
[0285] Finally, the above approach may generally also be extended to B-frames, for example as is described in Pourreza, R., and Cohen, T. (2021). Extending neural p-frame codecs for b-frame coding. In Proceedings of the IEEE / CVF International Conference on Computer Vision (pp. 6680-6689).
[0286] The above-described flow and residual based approach is highly effective at reducing the amount of data that needs to be transmitted because, as long as at least one reconstructed frame (e.g. I-frame $\hat{x}_{t-1}$) is available, the encode side only needs to compress and transmit a flow map and a residual map (and any hyper or hyper-hyper parameter information, as applicable) to reconstruct a subsequent frame.
[0287] Figure 4 shows an example of an AI image or video compression process such as that described above in connection with Figures 1-3 implemented in a video streaming system 400. The system 400 comprises a first device 401 and a second device 402. The first and second devices 401, 402 may be user devices such as smartphones, tablets, AR / VR headsets or other portable devices. In contrast to known systems which primarily perform inference on GPUs such as Nvidia A100, GeForce 3090 or GeForce 4090 GPU cards, the system 400 of Figure 4 performs inference on a CPU of the first and second devices respectively. That is, the compute for both encoding and decoding is performed by the respective CPUs of the first and second devices 401, 402. This places very different power usage, memory and runtime constraints on the implementation of the above methods than when implementing AI-based compression methods on GPUs. In one example, the CPU of the first and second devices 401, 402 may comprise a Qualcomm Snapdragon CPU.
[0288] The first device 401 comprises a media capture device 403, such as a camera, arranged to capture a plurality of images, referred to hereafter as a video stream 404, of a scene 404. The video stream 404 is passed to a pre-processing module 406 which splits the video stream into blocks of frames, various frames of which will be designated as I-frames, P-frames, and / or B-frames. The blocks of frames are then compressed by an AI-compression module 407 comprising the encode side of the AI-based video compression pipeline of Figure 3. The output of the AI-compression module is accordingly a bitstream 408a which is transmitted from the first device 401 via a communications channel, for example over one or more of a WiFi, 3G, 4G or 5G channel, which may comprise internet or cloud-based 409 communications.
[0289] The second device 402 receives the communicated bitstream 408b which is passed to an AI-decompression module 410 comprising the decode side of the AI-based video compression pipeline of Figure 3. The output of the AI-decompression module 410 is the reconstructed I-frames, P-frames and / or B-frames which are passed to a post-processing module 411 where they can be prepared, for example passed into a buffer, for streaming 412 to and rendering on a display device 413 of the second device 402.
[0290] It is envisaged that the system 400 of Figure 4 may be used for live video streaming at 30 fps of a 1080p video stream, which means a cumulative latency of both the encode and decode side is below substantially 50 ms, for example substantially 30 ms or less. Achieving this level of runtime performance with only CPU compute on user devices presents challenges which are not addressed by known methods and systems or in the wider AI-compression literature.
[0291] For example, execution of different parts of the compression pipeline during inference may be optimized by adjusting the order in which operations are performed using one or more known CPU scheduling methods. Efficient scheduling can allow for operations to be performed in parallel, thereby reducing the total execution time. It is also envisaged that efficient management of memory resources may be implemented, including optimising caching methods such as storing frequently-accessed data in faster memory locations, and memory reuse, which minimizes memory allocation and deallocation operations.
[0292] A number of concepts related to the AI compression processes and / or their implementation in a hardware system discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above.
[0293] Concept 1: AI-based compression with hierarchy schemes
[0294] Scalable video coding is used in traditional video compression for bitrate and frame rate control. Temporal and spatial reference frame dependency or hierarchy schemes such as 'L1T2', 'L1T3', 'L2T1', 'L2T1_h', 'L2T1_KEY', 'L2T2', 'L2T2_KEY', 'L2T2_KEY_SHIFT', 'L3T1', 'L3T3', 'L3T3_KEY', and 'S2T1', and so on, specify whether a frame at a given position in a group of pictures (GOP) is to be encoded and decoded as an I-frame, a P-frame, or a B-frame, and the hierarchy of the reference frame spatial and temporal dependencies relative to the rest of the frames of the GOP that will be used when performing the encoding and / or decoding according to the scheme.
[0295] Figure 5 illustratively shows an example L2T3 frame dependency scheme 500. There are two resolution layers (spatial layers "S1" and "S2") and each of those layers has three available temporal dependency possibilities (temporal layers "T0", "T1", and "T2"). These layers are set out on the Y-axis in Figure 5. On the X-axis in Figure 5 is the frame index. For example, the first frame of a GOP may have index 0, the second frame of the GOP may have index 1, the third frame of the GOP may have index 2, and the fourth frame of the GOP may have index 3, and so on.
[0296] Frames that have been assigned to a "T0" temporal layer do not have any temporal dependency on other frames, so they are I-frames. In the 'L2T3' scheme of Figure 5, the frame at index 0 is an I-frame and is encoded and decoded at two spatial resolutions ("S1" and "S2"). For each of these spatial resolutions, the next frames at index 1, 2, and 3 are encoded as P-frames and variously use a frame at an index position of one or two frames back as a reference frame, respectively "T1" and "T2" layers.
[0297] For example, the frame 503, 504 at index position 1 uses the frame 501, 502 at index position 0 as a reference frame (i.e., one frame back) for encoding and decoding. The frame 505, 506 at index position 2 also uses the frame 501, 502 at index position 0 as a reference frame (i.e., two frames back), while the frame 507, 508 at index position 3 uses the frame 505, 506 at index position 2 as a reference frame (i.e., one frame back).
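The temporal dependencies just described for one block of the example scheme can be written down as a small lookup table. The sketch below covers a single spatial layer only, and the dictionary layout and function names are hypothetical.

```python
# Reference frame index used by each P-frame index within one block
# (index 0 is the I-frame and has no reference).
REFERENCES = {1: 0, 2: 0, 3: 2}

def reference_gap(index):
    # How many frames back the reference frame lies.
    return index - REFERENCES[index]

def decode_chain(index):
    # Frames that must be decoded, in order, to reconstruct the frame at `index`.
    chain = [index]
    while chain[-1] in REFERENCES:
        chain.append(REFERENCES[chain[-1]])
    return list(reversed(chain))

gaps = {index: reference_gap(index) for index in REFERENCES}
chain_for_last_frame = decode_chain(3)
```

Because the frame at index 1 does not appear in the chain for index 3, it can be dropped to lower the frame rate without breaking the decoding of later frames in the block.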
[0298] Typically, the smaller the index position gap, the fresher the information in the reference frame will be when encoding and decoding the P- or B-frame, and thus the greater the amount of fine detail that will be captured in the frame. Thus, frames in the "T2" layer, where the gap is small, will capture finer motion details more accurately than frames in the "T1" layer, where the gap is larger.
[0299] Optionally, one or more of the lower spatial layer frames (e.g., frame 501) may also be used when encoding and decoding higher spatial frames, as shown by the dotted line linking frames 501 and 502 using cross-layer prediction. That is, a residual between the lower resolution frame 501 and the higher resolution frame 502 is predicted and transmitted instead of transmitting the high resolution frame 502 in the bitstream. As long as the lower resolution frame 501 is available on the decode side, the higher resolution frame 502 can be constructed from it using the residual. A similar hierarchical relationship between the respective resolution P-frames may also exist, as indicated by the dotted lines between frames 503 and 504, between frames 505 and 506, and between frames 507 and 508. This approach and the other "LnTn" type approaches listed above allow the bitrate and framerate to be adapted on the fly as the number of spatial and temporal layers can be reduced or increased based on available resources to provide a smooth user experience. Whilst not shown in Figure 5, the frame dependency scheme may be repeated in identical blocks such as that illustrated in Figure 5, working through a sequence of frames to encode and decode it.
[0300] Applying scalable video coding schemes, such as those described above, in AI-based compression presents a number of challenges. For example, unlike in traditional scalable video coding schemes where the lowest spatial layer configuration (e.g. "S1") determines the baseline bitrate of the scheme, and higher resolutions (e.g. "S2") correspond to higher bitrates, such a straightforward relationship between resolution and bitrate may not generally apply to AI-based compression. For example, bitrate control in AI-based compression pipelines is typically not based on discrete, hardcoded resolutions, but is instead learned, for example using a gain unit approach such as is described in Cui, Ze, et al. "Asymmetric gained deep image compression with continuous rate adaptation." Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition. 2021, or by using a plurality of networks, each trained to target different rate distortion levels, or indeed using other methods.
[0301] In each of these cases, bitrate and distortion (i.e. image reconstruction quality) are associated with the learned weights of the networks, and / or with the learned gain units or with any other learned rate control mechanism being used, rather than explicitly with some predefined, discrete resolutions. The inventors have found that introducing traditional scalable video coding schemes into AI-based compression pipelines can result in poor bitrates and poor image reconstruction qualities. One reason for poor performance may be that each component of an AI-based compression pipeline (e.g. a flow part 206, a residual part 210, and so on) is highly sensitive to even very small changes in input conditions. For example, if a flow part 206 has been trained to expect a current frame $x_t$ and a reference frame $\hat{x}_{t-1}$ as input, then changing the reference frame temporal dependency from $\hat{x}_{t-1}$ to $\hat{x}_{t-2}$ (or some other frame with a frame index even further apart from t than t-2) every n frames, and / or making a current frame depend on such a frame (mimicking the "T1", "T2" scheme of traditional scalable video coding) may result in model failure modes. For example, one failure mode in such circumstances is that the predicted optical flows are not accurate representations of the actual optical flow between the reference frame and the current frame.
[0304] The inventors have found that there exist opportunities to not only overcome these and other problems, but also to substantially improve upon traditional scalable video coding schemes. Firstly, learnable rate control methods, such as those described above, facilitate scalable video coding schemes with learned hierarchies, for example the quality levels of a scheme may be learned and / or the reference frame temporal dependencies of a scheme may be learned, thereby substantially eliminating the guesswork and associated problems that arise from handcrafting scalable video coding schemes. For example, the quality levels of a scheme (for example but not limited to the learned values of gain unit vectors / matrices) may be learned during training and then frozen during inference and / or the quality levels may be predicted based on an input frame and / or input frame sequences. Additionally or alternatively, the temporal dependency of reference frames (i.e. how far away the index position of the reference frame is relative to the current frame) may be predicted, more details of which will be described later below. In very general terms, taking the known L2T3 scalable video coding scheme of Figure 5 as an illustrative example, the hierarchical dependencies both along the layer y-axis and along the frame index x-axis may be learned, as may the position of the levels along the y-axis. These may then be frozen during inference, and / or made adaptive (i.e. predicted) on a per input frame or frame sequence basis.
[0305] Learned variable rate or quality control
[0306] We start with the first case of learning the quality levels. In order to introduce the learning of quality levels in a scalable video coding scheme, a non-limiting, illustrative bitrate or quality level control mechanism is described first below with reference to Figure 6. This illustrative example comprises a control or gain unit. Another implementation of a gain unit approach is illustrated in Cui, Ze, et al. "Asymmetric gained deep image compression with continuous rate adaptation." Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition. 2021. However, it will be appreciated that other methods may also be used.
[0307] Figure 6 illustratively shows a compression pipeline 600 corresponding to that of Figure 2, or the I-frame part 201 of Figure 3, except that now a control unit $C$ 601 and corresponding inverse control unit $C^{-1}$ 602 are provided. On the encode side, the control unit 601 is positioned before the quantisation module. On the decode side, the inverse control unit 602 is positioned after the entropy decoder module. Note the inverse control unit 602 need not be a strict mathematical inverse.
[0308] The control unit 601 receives as input the latent representation and scales one or more channels of the latent representation by processing it with a learned matrix (or vector when considering the latent on a per channel basis). Here processing may refer to multiplication, for example channel-wise multiplication. This has the effect of transforming the latent representation (i.e. modifying its values) before it is quantised and accordingly provides a degree of control over what input the quantisation module receives, which in turn affects how well a given image will be compressed, i.e. it facilitates rate control.
[0309] More generally, the control unit 601 may apply a control matrix $A \in \mathbb{R}^{c \times n}$, where $c$ is the number of channels and $n$ is the number of control vectors $a_i = \{a_{i,0}, a_{i,1}, \ldots, a_{i,c-1}\}$, with $i$ denoting the index of the control vector in the control matrix A, and where $a_{i,j} \in \mathbb{R}$ represents the $j$th control value in the control vector $a_i$, with $j$ ranging from 0 to $c - 1$, so that each channel may be associated with its own value or values. It follows that applying the control matrix A changes the latent representation by $\bar{y}_{i,j} = y_j \times a_{i,j}$. More specifically, the operation applied by the control unit 601 may be defined as $\bar{y}_i = C_\alpha(y, i) = y \odot a_i$, where $C_\alpha(\cdot)$ is the control unit's operation and $\odot$ is channel-wise multiplication.
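A minimal sketch of this channel-wise operation is given below. The latent is held as a list of channels, and the control matrix is stored as a list of n control vectors of c values each (that is, one list per control vector); this layout and the function name are illustrative assumptions.

```python
def apply_control(latent, control_vectors, i):
    # Scale every value in channel j by the j-th entry of control vector i.
    a_i = control_vectors[i]
    return [[value * a_i[j] for value in channel]
            for j, channel in enumerate(latent)]

latent = [[1.0, -2.0], [0.5, 4.0]]          # c = 2 channels, 2 values each
control_vectors = [[1.0, 1.0], [2.0, 0.5]]  # n = 2 control vectors of c values
scaled = apply_control(latent, control_vectors, i=1)
```

Selecting a different index i applies a different per-channel scaling to the same latent, which is what allows one network to serve several rate targets.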
[0312] The modification of the latent representation by control matrix A can be understood at a more general level as corresponding to simulating better or worse compression rates by emphasising how many bits the network should assign to one or more of the input channels of the latent representation.
[0314] At one extreme, the control matrix A blows up all of the values of the latent representation towards infinity. That is, if the quantisation bin size stays fixed, then the quantisation error relative to the scaled values reduces as the range of values that the latent can occupy increases. In practice, however, this extreme can never be reached, because it runs into numerical problems.
[0315] At the other extreme, the control matrix A may completely transform all the channels of the latent representation into 0, which may be losslessly compressed at virtually no rate cost but from which it is practically impossible to reconstruct any meaningful image.
[0316] Between these two extremes are a set of target reconstruction qualities (with associated compression rates) which are defined by the control vectors $a_i$ of the control matrix A and which emphasise or de-emphasise higher rate or conversely higher distortion during training and which encourage the network to learn to assign more or fewer bits to the various channels of the latent representation.
[0317] On the decode side, a corresponding inverse control matrix $A^{-1}$ is applied by the inverse control unit $C^{-1}$ 602 to the reconstructed latent representation $\hat{y}$ output by the entropy decoder before it is passed into the decoder neural network to reconstruct the output image. Similarly to the control matrix A, the inverse control matrix $A^{-1}$ comprises a number of inverse control vectors $a_i^{-1} = \{a_{i,0}^{-1}, a_{i,1}^{-1}, \ldots, a_{i,c-1}^{-1}\}$, and the operation of the inverse control unit $C^{-1}$ 602 can be defined as $y'_i = C_\beta^{-1}(\hat{y}, i) = \hat{y} \odot a_i^{-1}$. Here $C_\beta^{-1}(\cdot)$ is the inverse control unit's operation and $\odot$ corresponds to the same channel-wise multiplication operation as on the encode side.
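The effect of a control vector and its inverse around a fixed quantiser can be sketched as follows. Rounding to the nearest integer stands in for the quantisation module, and the inverse control vector is taken to be the exact reciprocal purely for simplicity (the learned inverse need not be a strict mathematical inverse); the values and names are hypothetical.

```python
def encode(latent, control_vector):
    # Scale each channel value, then quantise with a fixed bin size of 1.
    return [round(value * a) for value, a in zip(latent, control_vector)]

def decode(symbols, inverse_control_vector):
    # Undo the scaling on the decode side.
    return [symbol * a_inv for symbol, a_inv in zip(symbols, inverse_control_vector)]

def max_error(latent, control_vector, inverse_control_vector):
    restored = decode(encode(latent, control_vector), inverse_control_vector)
    return max(abs(original - value) for original, value in zip(latent, restored))

latent = [0.3, -1.2, 2.6]  # one value per channel
low_rate_error = max_error(latent, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
high_rate_error = max_error(latent, [4.0, 4.0, 4.0], [0.25, 0.25, 0.25])
low_rate_symbols = encode(latent, [1.0, 1.0, 1.0])
high_rate_symbols = encode(latent, [4.0, 4.0, 4.0])
zeroed_symbols = encode(latent, [0.0, 0.0, 0.0])
```

With the larger control values the quantised symbols span a wider range (costing more bits) and the reconstruction error shrinks, while control values of zero collapse every channel to the symbol 0, matching the two extremes discussed above.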
[0322] As indicated above, the elements of the control matrix A are learned jointly with the other parameters of the compression pipeline (e.g. the weights of the encoders and decoders, and so on). Taking the control matrix A and the inverse control matrix $A^{-1}$ together, it is noted that there will always be pairs of corresponding control vectors $\{a_i, a_i^{-1}\}$ bound by the corresponding index i, whereby one part of the pair is associated with the encode side and one part is associated with the decode side.
[0325] Given that the control matrix A and inverse control matrix $A^{-1}$ in general terms operate to simulate higher or lower compression rates by modifying the latent representation, they effectively act as regularisation terms. We can accordingly associate different Lagrange multipliers to be applied to the rate or distortion term of the loss function with each control vector pair during training of the compression pipeline.
[0326] More specifically, for one training iteration, we select one Lagrange multiplier (which will emphasise rate or distortion more for that training iteration), together with its associated pair of control vectors $\{a_i, a_i^{-1}\}$, evaluate the loss function with the rate or distortion term regularised by that Lagrange multiplier and control vectors, and finally update the elements of the network, including the values of the elements of the control unit and inverse control unit, based on the evaluation of that loss function.
[0329] For the next iteration, we may select a different Lagrange multiplier and its associated different control vector pairs and repeat the process but this time when we update the elements, it will be for the different control vectors. We then repeat this process throughout training, selecting different Lagrange multipliers and associated control vector pairs each iteration.
[0330] At the start of training, the randomly initialised control vectors and network weights will produce high losses. After a few iterations, however, the loss decreases and the values of the elements of the control vectors start to converge, resulting in a set of control vector pairs $\{a_i, a_i^{-1}\}$, each associated with a different regularisation of the rate or distortion terms of the loss function, and thus each yielding a reconstruction of the target image at a different trade-off between rate and distortion. Given that rate effectively determines how many artefacts a reconstructed image will have (e.g. as a measure of quality), the control vectors allow the network to target specific rates and thus specific reconstruction image qualities.
[0333] Accordingly, during inference, a pair of the resulting, learned control vectors of the set can be selected and applied respectively to the latent representation on the encode side and to the output of the entropy decoder on the decode side to emphasise either more rate or more distortion. If more distortion (i.e. a lower rate and corresponding to worse image quality) is desired, the control vector pair associated with the higher distortion regularisation amount may be selected.
[0334] In this way, the compression pipeline can be controlled to vary the rate (and thus target reconstruction quality) of the reconstructions it produces simply by selecting a different pair of the control vectors to apply to the latent representation and output of the entropy decoder respectively.
[0335] Further training details of the variable rate compression network will now be provided in more detail. Consider the typical rate distortion loss function based on the steps of the pipeline of Figure 1:
[0336] $L = R(Q(f_\theta(x))) + \lambda D(g_\theta(\hat{y}))$
[0339] where $x$ is the input image, $f_\theta$ is the encoder network that produces the latent representation from the input image, $Q$ represents the quantisation operation applied to the latent representation, $\hat{y}$ is the reconstructed latent representation, $\lambda$ is the Lagrange multiplier, and $g_\theta$ is the decoder network that reconstructs the output image $\hat{x}$ from the reconstructed latent representation.
[0340] In order to introduce a dependence of the loss on the control unit and the inverse control unit, $C_\alpha$ and $C_\beta^{-1}$, the loss function is modified by the introduction of the regularisation term $\lambda_i$ associated with a randomly initialised control vector pair $\{a_i, a_i^{-1}\}$ that is linked to the regularisation term by the index $i$, that is:
[0345] $L = R(Q(C_i(f_\theta(x)))) + \lambda_i \cdot D(g_\theta(C_i^{-1}(\hat{y})))$
[0347] For the sake of example, assume we would like a set of five control vector pairs $\{a_1, a_1^{-1}\}$, $\{a_2, a_2^{-1}\}$, $\{a_3, a_3^{-1}\}$, $\{a_4, a_4^{-1}\}$, $\{a_5, a_5^{-1}\}$, each pair associated with a target image quality or rate $L_1$, $L_2$, $L_3$, $L_4$, $L_5$. We set up a corresponding set of Lagrange multipliers $\lambda_1 = 0.5$, $\lambda_2 = 0.05$, $\lambda_3 = 0.005$, $\lambda_4 = 0.0005$, $\lambda_5 = 0.00005$. The specific values here are illustrative only; a higher value will emphasise distortion more compared to rate, whereas a lower value will emphasise rate more over distortion.
[0350] In the first training iteration, we initialise the control vector pairs (e.g. with random values) and the weights of the networks of the compression pipeline and then randomly select an $i$ value, let's say $i = 3$, which sets $\lambda_3 = 0.005$, inserts the control vector pair $\{a_3, a_3^{-1}\}$ into the pipeline at control unit $C_3$ and inverse control unit $C_3^{-1}$, and thus results in the loss function for this iteration as:
[0351] $L = R(Q(C_3(f_\theta(x)))) + 0.005 \times D(g_\theta(C_3^{-1}(\hat{y})))$
[0353] Backpropagation is performed, and the weights of the network and the values of the control vector pair $\{a_3, a_3^{-1}\}$ are updated.
[0354] For the next iteration, a different i is randomly selected and the process is repeated for a predetermined number of training iterations or until some end of training condition is met (e.g. the training and / or validation loss stops decreasing, and so on). Whilst the initial training steps will not produce good reconstructions at any target image quality, after a few iterations, the output reconstructions associated with each randomly selected i will start to converge to respective positions along the rate distortion curve, such as those shown illustratively in Figure 5.
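The per-iteration selection described above may be sketched as follows. This is only an illustrative sketch: the multiplier values are those of the example above, whereas the function names, the uniform sampling, the seed and the placeholder rate and distortion values are assumptions and do not come from any particular implementation.

```python
import random

# Illustrative multipliers from the example above; a higher value weights
# distortion more heavily relative to rate.
LAMBDAS = [0.5, 0.05, 0.005, 0.0005, 0.00005]


def rd_loss(rate, distortion, lam):
    """Rate-distortion loss L = R + lambda * D for one iteration."""
    return rate + lam * distortion


def sample_training_schedule(num_iterations, num_levels, seed=0):
    """Draw one quality index per iteration, uniformly at random.

    The returned 0-based index selects both the multiplier and the control
    vector pair that are active (and therefore updated) in that iteration.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_levels) for _ in range(num_iterations)]


schedule = sample_training_schedule(1000, len(LAMBDAS))
updates_per_pair = [schedule.count(i) for i in range(len(LAMBDAS))]

# Loss for one iteration in which i = 3 (0-based index 2) was drawn; the rate
# and distortion values are placeholders for what the networks would produce.
example_loss = rd_loss(rate=1.0, distortion=100.0, lam=LAMBDAS[2])
```

Because every index is drawn many times over the course of training, each control vector pair receives its own share of updates while the shared network weights are updated in every iteration.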
[0355] After training is completed, we have the set of five control vector pairs $\{a_1, a_1^{-1}\}$, $\{a_2, a_2^{-1}\}$, $\{a_3, a_3^{-1}\}$, $\{a_4, a_4^{-1}\}$, $\{a_5, a_5^{-1}\}$ that, when applied during inference, result in the network reconstructing images at the different target reconstruction qualities on the rate distortion curve. Note for completeness that even though discrete control vector pairs are learned, interpolating between the discrete values of $a_i$ and $a_{i+1}$ is possible using e.g. linear interpolation or any other interpolation technique known to the skilled person, facilitating intermediate quality levels in inference between the discrete values.
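A minimal sketch of the linear interpolation option is given below. The function name, the blending weight and the vector values are hypothetical. It is further assumed, for illustration only, that the inverse control vectors would be interpolated in the same way with the same weight.

```python
def interpolate_control_vectors(a_i, a_next, weight):
    """Linearly blend two control vectors of equal length.

    weight = 0.0 returns a_i, weight = 1.0 returns a_next, and values in
    between give an intermediate operating point.
    """
    if len(a_i) != len(a_next):
        raise ValueError("control vectors must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [(1.0 - weight) * lo + weight * hi for lo, hi in zip(a_i, a_next)]


# Hypothetical learned vectors for two neighbouring quality levels.
a_3 = [1.0, 2.0, 4.0]
a_4 = [3.0, 4.0, 8.0]
halfway = interpolate_control_vectors(a_3, a_4, 0.5)
```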
[0359] It will be appreciated that one or more bits indicating which target reconstruction quality is being used may be included in the bitstream as metadata. This may be in the main bitstream or, preferably, in the hypernetwork bitstream. In the latter case, the hypernetwork bitstream is decoded first, allowing the entropy decoded modified latent representation to be processed by the inverse control unit 602 using the correct control vectors before the processed modified latent representation is fed into the decoder to reconstruct the image at the target image quality. More generally it is envisaged that the different control vectors of the control units once learned will form part of the pipeline's weights and architecture installed on the encode and decode side so it is not necessary to send all the values of the control vectors. Instead, a simple indication of which vector pair from the set of vector pairs has been selected may be sent, and a lookup table used to retrieve the correct control vectors during decode.
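The signalling of the selected vector pair may be sketched as follows. The fixed-width binary index, the table contents and all function names are assumptions made for illustration; no particular bitstream syntax is implied.

```python
def bits_needed(num_pairs):
    """Smallest number of bits that can index `num_pairs` table entries."""
    if num_pairs < 1:
        raise ValueError("need at least one control vector pair")
    return max(1, (num_pairs - 1).bit_length())


def encode_index(index, num_pairs):
    """Write the selected index as a fixed-width bit string (metadata)."""
    if not 0 <= index < num_pairs:
        raise ValueError("index out of range")
    return format(index, "0{}b".format(bits_needed(num_pairs)))


def decode_index(bits, table):
    """Read the index back and fetch the matching control vector pair."""
    return table[int(bits, 2)]


# Hypothetical lookup table installed identically on encode and decode sides.
TABLE = [([0.5, 0.5], [2.0, 2.0]),
         ([1.0, 1.0], [1.0, 1.0]),
         ([2.0, 2.0], [0.5, 0.5])]

signalled = encode_index(2, len(TABLE))
a_sel, a_inv_sel = decode_index(signalled, TABLE)
```

Only the short index travels in the bitstream; the vectors themselves are retrieved locally on the decode side.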
[0360] In practice, the effect the control vector pairs have in inference is to modify the latent representation before quantisation to make it more easily or less easily quantised into coarse or fine quantisation bins. That is, the learned values of the control vector pairs associated with the low rates typically end up with vector values that modify the distribution of the latent representation to be easily quantised into coarse bins whereas the values of the control vector pairs associated with higher rates typically result in a latent representation distribution that requires finer quantisation bins and which can accordingly not be compressed as easily. As described above, the exact values are learned parameters and may accordingly vary depending on the number of vector pairs in the set of vector pairs, as well as the values of the regularisation parameters in the set of regularisation parameters.
[0361] Learned quality level offsets
[0362] With an exemplary rate control mechanism introduced, we may now continue with learned quality levels of a scalable video coding scheme. As described above, during training, a set of vectors (which together may form a matrix) may be learned, one of which may be applied 601 at one or more specified places in the pipeline, for example as shown in Figure 6. Modifying the data object it is applied to makes that data object more efficiently losslessly compressible in the entropy encoding step, thereby controlling the bitrate to quality trade-off at which a given network operates. A corresponding inverse set of vectors (which together may form an inverse matrix) is also learned and a corresponding one of the vectors may be applied 602 on the decode side. Each vector and inverse vector pair may thus define some target bitrate and quality level of the pipeline, and together these may define some bitrate range at which the pipeline is intended to operate. For example, if one vector pair is associated with a bitrate of around 1 Mbps, and one vector pair is associated with a bitrate of around 6 Mbps, and the other vector pairs of a set are associated with bitrates somewhere between those two bitrates, then together that set of vectors allows the pipeline to operate at a bitrate range of 1-6 Mbps, with target quality levels between those two extremes.
[0364] During training, the values of the vectors are learned in the manner described above. However, there is no guarantee that the values that the vectors settle on will correspond to useful quality levels of some hierarchical, scalable video coding scheme. For example, in some cases AI-based compression pipelines underperform at very low quality levels (e.g. in the toy example above, this may be around 1 Mbps). Conversely, AI-based compression pipelines often perform very well at high quality levels (e.g. in the toy example above, this may be around 6 Mbps or higher). Thus, in some cases it may be advantageous for the lowest quality level to be made higher (e.g. by modifying the values of the set of vectors by some amount), and / or for the highest quality level to be made lower, and / or for some modification or offset to be made to the intermediate quality levels, when the set of vectors are used in a pipeline that is being operated using a scalable video coding scheme. In other words, how far along the quality level scale each quality level in a scalable video coding scheme ought to be for optimal performance in a given scenario may not be the same as the quality levels that were initially learned during training. Indeed, this may also be dependent on what temporal dependency a current frame has. For example, if a reference frame being used for the current frame has a frame index of $t-2$, then the optimal quality level to be used for that reference frame may be different from the quality level that ought to be used for that reference frame were it to have a frame index of $t-1$. Indeed, the optimal quality level might also not correspond directly to one of the other quality levels but might be somewhere in between the main levels. Given that the number of possible reference frame dependencies and quality level combinations that may exist in an arbitrary scalable video coding scheme is very high, handcrafting some optimal set of combinations is impractical and burdensome.
[0365] To address this problem, a learned offset for each quality level in the form of a learned modification to the values of gain unit vectors conditioned on an input frame index flag and input intended quality level flag, may be introduced during some or all of a training schedule.
[0366] During inference, this learned offset may be applied as a modification to the gain unit vectors to increase and / or decrease some or all of their values for a given frame, based on whatever input frame index flag and intended quality level flag has been specified for that frame.
[0367] Using a toy example of two gain unit vector pairs, $\{a_1, a_1^{-1}\}$, $\{a_2, a_2^{-1}\}$, we may now additionally introduce a set of learnable offsets $\omega_{q,t}$ associated with intended quality levels (denoted by $q$) and temporal frame dependency (denoted by $t$, the frame position index) within a miniGOP. A miniGOP is a repeatable block of a hierarchical scheme, for example between I-frames, or P- or B-frames. The learnable offsets $\omega_{q,t}$ specify, for each vector pair, how much to modify the values of the vector pair when a given input frame is accompanied by an intended quality level flag $q$ and / or a temporal frame dependency flag $t$. Thus, we now get a rate control mechanism whose effect is adaptive to different target quality levels and / or temporal frame dependencies: $\omega_{q,t}(\{a_1, a_1^{-1}\}, \{a_2, a_2^{-1}\})$.
[0371] This mechanism thus facilitates optimal rate control in any hierarchical scalable video coding scheme using an AI-based compression pipeline without handcrafting, irrespective of what the hierarchies of the chosen scheme actually are. There are a number of consequences of this. This approach not only enables a learned, optimal implementation of known scalable video coding schemes (such as any of 'L1T2', 'L1T3', 'L2T1', 'L2T1_h', 'L2T1_KEY', 'L2T2', 'L2T2_KEY', 'L2T2_KEY_SHIFT', 'L3T1', 'L3T3', 'L3T3_KEY', and 'S2T1', and so on) where the specific levels of the scheme are learned through the vector pair offsets targeting a training objective such as minimising a rate distortion loss, but it also facilitates the implementation of any arbitrary scalable video coding schemes that may have substantially more complex hierarchies than the example schemes listed above. That is, it facilitates fully learned hierarchies of an arbitrary scalable video coding scheme because quality levels are optimisable for any reference frame dependency hierarchy by operation of the learnable offset. Thus, if the networks of a pipeline are converging during training to some learned reference frame hierarchy, whatever that hierarchy is, the optimal quality levels (i.e. values for the gain unit vectors with some offset applied) for that hierarchy are automatically learned.
[0372] Figure 7 illustrates an AI-based compression pipeline 701, corresponding to that of Figure 3, but now also showing the input flags 701, 702 of intended quality level $q$ and reference frame index position $t-j$ relative to a current frame at position $t$. In this example the flags 701, 702 are passed into the residual part 210, but it is envisaged that the computation of the learnable offset may occur anywhere in the pipeline. Note also that where Figure 3 shows only an immediately previous frame $t-1$ as a reference frame, it will be appreciated that this may be extended to a reference frame with another index, e.g. $t-2$, $t-3$, ..., $t-j$. These flags 701, 702 are associated with how the pipeline is intending to encode and decode the current frame $x_t$ and may be, for example, following a predetermined temporal and / or quality level hierarchy scheme. Each unique combination of intended quality level $q_n$ and reference frame index position $t-j$ may have a corresponding unique learnable offset to be applied to a corresponding gain unit vector. For example, if there are three intended target quality levels ($q_1$, $q_2$, $q_3$) with associated gain unit vector pairs ($\{a_1, a_1^{-1}\}$, $\{a_2, a_2^{-1}\}$, $\{a_3, a_3^{-1}\}$) being used with a hierarchical scheme that has two possible reference frame temporal dependencies ($t-1$, $t-2$), then there are six unique learnable offsets $\omega_{q_1,t-1}$, $\omega_{q_1,t-2}$, $\omega_{q_2,t-1}$, $\omega_{q_2,t-2}$, $\omega_{q_3,t-1}$, $\omega_{q_3,t-2}$. This results in a rate control mechanism comprising the following selectable combinations of gain unit vectors and learned offsets:
[0378] $\omega_{q_1,t-1}\{a_1, a_1^{-1}\}$, $\omega_{q_1,t-2}\{a_1, a_1^{-1}\}$, $\omega_{q_2,t-1}\{a_2, a_2^{-1}\}$, $\omega_{q_2,t-2}\{a_2, a_2^{-1}\}$, $\omega_{q_3,t-1}\{a_3, a_3^{-1}\}$, $\omega_{q_3,t-2}\{a_3, a_3^{-1}\}$
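The six selectable combinations of this toy example may be organised as a simple lookup, sketched below. The description does not fix how an offset modifies a gain unit vector, so this sketch ASSUMES an element-wise additive offset. All names and numerical values are hypothetical, and only the encode-side vector is shown (the decode-side vector would be handled analogously).

```python
from itertools import product

QUALITY_LEVELS = ["q1", "q2", "q3"]
REFERENCE_DISTANCES = [1, 2]          # j in t - j

# Hypothetical gain unit vector pairs, one pair per quality level.
GAIN_PAIRS = {
    "q1": ([0.5, 0.5], [2.0, 2.0]),
    "q2": ([1.0, 1.0], [1.0, 1.0]),
    "q3": ([2.0, 2.0], [0.5, 0.5]),
}

# One offset per (quality level, reference distance) combination. Real values
# would be learned; zeros are used here so the table is easy to inspect.
OFFSETS = {key: [0.0, 0.0]
           for key in product(QUALITY_LEVELS, REFERENCE_DISTANCES)}
OFFSETS[("q2", 2)] = [0.25, -0.25]    # made-up example of a learned offset


def select_gain(quality, distance):
    """Return the encode-side gain vector with its offset applied.

    ASSUMPTION: the offset is added element-wise to the gain vector.
    """
    gain, _inverse = GAIN_PAIRS[quality]
    offset = OFFSETS[(quality, distance)]
    return [g + o for g, o in zip(gain, offset)]
```

For example, `select_gain("q2", 2)` returns the second gain vector shifted by its offset, whereas `select_gain("q1", 1)` returns the first gain vector unchanged because its offset is zero in this sketch.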
[0381] In practice, this set specifies what hierarchical combinations are possible in the miniGOP being used. Further, only a subset of these will be available at any given time as, once a frame has been encoded at a quality level, the other quality levels of that frame are not available as it is not envisaged retrospectively going backward to re-encode past frames at different quality levels (although this is possible in settings where an abundance of compute resources is available and run time is not a concern).
[0382] Each of the six learnable offsets of this toy example may be randomly initialized during training and be optimisable using any suitable optimisation algorithm such as gradient descent, Adam, or some other optimiser algorithm, based on a rate distortion loss either on a per frame basis, or accumulated across a plurality of frames of a GOP which is being encoded and decoded according to whatever hierarchical scalable video coding scheme is used. Thus, by the end of training the learnable offsets will have converged to values that result in an optimal rate distortion performance by modifying the quality levels to be optimal for the hierarchy of the scalable video coding scheme being used. During inference, an intended quality flag and reference frame index position flag may be passed into the pipeline together with the current frame $x_t$, thereby selecting the associated gain unit and learned offset, which will be applied when encoding and decoding the current frame $x_t$.
[0383] The above-described method of training is generalised in the pseudocode provided below:
[0384] Algorithm 1 Training with learned quality level offsets
[0385] Inputs:
[0386] Training dataset 𝒳, learning rate η, regularisation parameter λᵢ, number of epochs E,
[0387] network architecture fθ including set of gain vectors {aₙ, aₙ⁻¹} and set of quality level
[0388] offsets ω
[0389] Initialize network parameters θ
[0390] for epoch = 1 to E do
[0391]     for each batch in 𝒳 do
[0392]         Compute forward pass through fθ
[0393]         Compute loss Loss = D(xₜ, x̂ₜ) + λᵢR
[0394]         Backward pass to compute gradients ∇θLoss
[0395]         Update parameters with optimizer 𝒪:
[0396]         θ ← 𝒪(θ, ∇θLoss, η), where θ includes set of gain vectors {aₙ, aₙ⁻¹} and set of quality level offsets ω
[0398]     end for
[0399]     Optionally evaluate on validation set
[0400] end for
[0401] That is, a training data set $\mathcal{X}$, a learning rate $\eta$, a regularisation parameter $\lambda_i$ appropriately picked for each quality level, and a number of training steps or epochs $E$ are selected. The network architecture $f_\theta$, including a set of gain vectors $\{a_n, a_n^{-1}\}$ for each desired quality level $n$ and a set of corresponding quality level offsets $\omega$, is defined, for example as shown in Figure 3. The network parameters $\theta$ are randomly initialised and then the training loop is started. For each batch in the training data $\mathcal{X}$, the forward pass is computed for the network being trained $f_\theta$. The total loss is then calculated by combining a distortion term $D$, a rate term $R$, and any other loss terms (not shown). The backward pass is then performed to compute gradients based on the loss, and the parameters $\theta$, which include the set of gain vectors $\{a_n, a_n^{-1}\}$ and the set of quality level offsets $\omega$, are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser. Optionally, a validation loss can be calculated, and the training loop is repeated until the predetermined number of steps or epochs $E$ has been completed, or some other criterion has been reached. The learning rate, batch size, and / or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally. A training schedule may also be set whereby some parameters of the network architecture $f_\theta$ are frozen for some number of steps while others are updated, and vice versa. For example, for some first number of steps, the quality level offsets may be set to not apply any offset to allow the networks to learn a reasonable starting set of gain vectors, and only after the first number of steps will the quality level offsets be made optimisable with the optimizer $\mathcal{O}$. Whilst not shown, if the network architecture $f_\theta$ is being used with a scalable video coding scheme, the training dataset $\mathcal{X}$ may be sampled according to that scheme by sampling reference frame temporal and quality level hierarchies from the scheme being used, in order to expose the networks to the temporal and quality hierarchies that the networks will typically face in inference when being used with that scalable video coding scheme.
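The freezing schedule mentioned above, in which the quality level offsets are held inactive for a first number of steps, may be sketched as follows. The parameter group names, the `warmup_steps` argument and the representation of an offset as a list are hypothetical. The sketch shows only the scheduling logic, not the optimiser or the networks themselves.

```python
def trainable_groups(step, warmup_steps):
    """Parameter groups the optimiser may update at a given training step.

    During the first `warmup_steps` steps the quality level offsets are
    frozen so that the gain vectors can settle first.
    """
    groups = {"network_weights", "gain_vectors"}
    if step >= warmup_steps:
        groups.add("quality_level_offsets")
    return groups


def effective_offset(learned_offset, step, warmup_steps):
    """Offset actually applied in the forward pass at `step`.

    Before the warm-up ends no offset is applied (all zeros); afterwards
    the current learned values are used.
    """
    if step < warmup_steps:
        return [0.0] * len(learned_offset)
    return list(learned_offset)
```

A training loop would call `trainable_groups` once per step to decide which parameters receive gradient updates, and `effective_offset` whenever a gain vector is about to be modified.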
[0405] During inference, each input current frame $x_t$ may be accompanied by an intended quality level flag $q_n$ and a flag specifying which index position $t-j$ the reference frame being used has in the frame sequence. Responsive to receiving the $q_n$ and $t-j$ flags, the corresponding quality level offset and gain vectors are selected and used when encoding and decoding the current frame $x_t$ in the manner described above. That is, the learned quality level offset and gain vectors allow for any arbitrary, hierarchical scalable video coding scheme that has temporal and quality level hierarchies to be implemented in an optimal way without relying on handcrafted features.
[0406] One illustrative example is shown in Figure 8, which shows a hierarchical scheme with two quality levels, lower quality "L1" and higher quality "L2", along with two possible reference frame temporal dependencies $t-1$ and $t-2$, whereby each frame may either depend on the immediately preceding frame ($t-1$) or a frame that is two positions back ($t-2$). The first frame 801 of the sequence has no temporal dependency on any other frame and is thus an I-frame. It serves as the primary source of information for the subsequent P-frames so is encoded and decoded at the higher quality level "L2". The next frame 802 only has one possible option to use for a reference frame because only an I-frame 801 is available in the toy example so far, thus the frame 802 is encoded as a P-frame using the immediately preceding frame 801 in position $t-1$ as its reference frame. Because the reference frame is a higher quality L2 frame, the encoding and decoding of the current frame 802 can rely more on the reference frame 801, allowing the current frame 802 to be of lower quality level L1. The next frame 803 has a choice of whether to use the higher quality L2 I-frame 801 or the immediately preceding L1 P-frame 802 as the reference frame. In this toy example scheme, the higher quality L2 I-frame (i.e. a reference frame temporal dependency of $t-2$) is chosen. However, in other cases, the other choice may be more optimal. For example, substantial motion relative to I-frame 801 may have occurred in the current frame and so even though the higher quality L2 I-frame 801 is available, a lower quality but more recent frame in which at least some of this motion has already happened is a better choice for a reference frame, as it results in more accurate optical flow estimation. In other cases, it is beneficial to retain as much of the quality as possible and use the less recent but higher quality frame, where available. This hierarchical block then repeats for frames 804, 805, 806 and so on progressing through frame index positions 0, 1, 2, 3, 4, 5 and so on until the next I-frame.
[0408] As will be understood, the quality levels L2 and L1 of the toy example scheme of Figure 8 may be implemented using the above-described learned offsets. That is, during training, two gain units and two learnable quality level offsets may be initialised and trained, giving four unique combinations:
[0410] $\omega_{q_1,t-1}\{a_1, a_1^{-1}\}$, $\omega_{q_1,t-2}\{a_1, a_1^{-1}\}$, $\omega_{q_2,t-1}\{a_2, a_2^{-1}\}$, $\omega_{q_2,t-2}\{a_2, a_2^{-1}\}$. One of the combinations may be selected by passing in the intended target quality level flag and the reference frame temporal dependency flag. If there is no temporal dependency flag, the pipeline assumes the frame is an I-frame. For example, the flag sequence for one I-frame followed by five P-frames with the hierarchy in the toy example of Figure 8 would be: $\{q_2\}$, $\{q_1, t-1\}$, $\{q_2, t-2\}$, $\{q_1, t-1\}$, $\{q_2, t-2\}$, $\{q_1, t-1\}$, and so on, resulting in the corresponding gain unit vectors and learnable offset combinations being selected during encoding and decoding.
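The flag sequence of the toy example of Figure 8 can be generated programmatically, as sketched below. The function name and the tuple representation of the flags are hypothetical, and a reference distance of `None` is used here to stand for the absence of a temporal dependency flag.

```python
def figure8_flags(num_frames):
    """Flag sequence for one I-frame followed by P-frames, as in Figure 8.

    Each entry is (quality_level, reference_distance); a distance of None
    means no temporal dependency flag, i.e. the frame is an I-frame.
    """
    flags = []
    for index in range(num_frames):
        if index == 0:
            flags.append(("q2", None))       # I-frame at higher quality
        elif index % 2 == 1:
            flags.append(("q1", 1))          # lower quality, reference t-1
        else:
            flags.append(("q2", 2))          # higher quality, reference t-2
    return flags
```

Calling `figure8_flags(6)` yields the six flag sets listed above for one I-frame followed by five P-frames.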
[0411] An optional additional modification of the above methodology is the introduction of a reference frame temporal dependency and quality level prediction step. That is, predicting which of the available reference frames to use for a given current frame, or whether to follow some predetermined reference frame dependency. For example, with reference to Figure 8, when we get to the frame 805 at index position 4 there are different options of frames that may be used as a reference frame: the L2 I-frame 801 at $t-4$, the L1 P-frame 802 at $t-3$, the L2 P-frame 803 at $t-2$, or the L1 P-frame 804 at $t-1$. If a more complex hierarchical scheme is used, the number of options is likely to be even greater both in terms of quality levels and in terms of choices for $j$ in the relative frame index position $t-j$. In these cases, there may be at least one frame that is an optimal frame to use as the reference frame in terms of rate and distortion performance compared to the other available frames. Sticking to a rigid hierarchy in these cases results in sub-optimal performance when the optimal frame is not used. The simplest case of this has already been alluded to above when describing the choice between the immediately preceding ($t-1$) lower quality L1 P-frame 802 and the I-frame 801 in the $t-2$ position when encoding frame 803. In some cases, picking one results in better performance than the other, and vice versa.
[0413] Optimal reference frame temporal dependency prediction
[0414] Taking this simplest case as an illustrative example, we introduce a reference frame temporal dependency prediction step that predicts which of the available reference frames is likely to result in the best rate distortion performance and dynamically updates the reference frame quality level and / or temporal frame dependency flags accordingly to point to an optimal reference frame. A toy example of this method is illustrated in Figure 9.
[0415] Figure 9 shows a hierarchical scheme 900 similar to that of Figure 8, except now with a reference frame dependency prediction step incorporated into the sequence. The scheme shown has two quality levels, lower quality "L1" and higher quality "L2", along with two possible reference frame temporal dependencies $t-1$ and $t-2$, whereby each frame may either depend on the immediately preceding frame ($t-1$) or a frame that is two positions back ($t-2$). The first frame 901 of the sequence has no temporal dependency on any other frame and is thus an I-frame. It serves as the primary source of information for the subsequent P-frames so is encoded and decoded at the higher quality level "L2". The next frame 902 only has one possible option to use for a reference frame because only an I-frame 901 is available in the toy example so far, thus the frame 902 is encoded as a P-frame using the immediately preceding frame 901 in position $t-1$ as its reference frame. Because the reference frame is a higher quality L2 frame, the encoding and decoding of the current frame 902 can rely more on the reference frame 901, allowing the current frame 902 to be of lower quality level L1. As was the case in Figure 8, the next frame 903 has a choice of whether to use the higher quality L2 I-frame 901 or the immediately preceding L1 P-frame 902 as the reference frame. Unlike in Figure 8, where the choice was based on the predetermined hierarchy schedule of the scheme, in Figure 9 the choice of reference frame is predicted based on one or more properties of the current frame 903, the previously encoded and / or decoded frames 901, 902 available to be reference frames (which may be only a small number of frames, or many frames up to all of the available frames of the GOP encoded and decoded so far), and / or one or more metrics or values associated with these. This is denoted by a function of the form:
[0416] $e(x_{(0,L_n)}, \ldots, x_{(t-1,L_n)}, x_t)$
[0419] where $t$ indicates the index of the current frame in a GOP with frame indices 0 to $t$, and $L_n$ indicates what the quality level of a given frame is that is available to be a reference frame. For example, in Figure 9, the function of the first reference frame dependency prediction step 907 takes as inputs the I-frame 901 $x_{(0,L2)}$, the first P-frame 902 $x_{(1,L1)}$, and the current frame 903 $x_2$ that has yet to be encoded and decoded, together giving: $e(x_{(0,L2)}, x_{(1,L1)}, x_2)$.
[0422] The form of the function e is not prescriptive. That is, any function that in some way predicts which available frame is likely to result in lower rate and / or distortion values for the current frame is envisaged to be suitable. For the sake of illustration, one such function which is based on a computed difference (mean squared error MSE) between a representation of the current frame and warped representations of each of the available previously encoded and decoded frames is described in more detail below with reference to the example frames 901, 902, 903 of Figure 9. It is envisaged that in other examples, not shown, e may comprise a neural network trained to output a reference frame temporal and quality level dependency (i.e. output the reference frame temporal flag and the quality level flag for the current frame to use) on a per-frame input basis, whereby training of the neural network may be end-to-end with the rest of the neural networks of the pipeline with a rate distortion loss function. In this way, the neural network learns to predict for a given current frame and available reference frames, which available reference frame results in the lowest rate distortion scores for the current frame.
[0423] Proceeding for now with the example where e is based on an MSE score of its inputs, first the following metrics associated with the I-frame 901 $x_{(0,L2)}$, the first P-frame 902 $x_{(1,L1)}$, and the current frame 903 $x_2$ that has yet to be encoded and decoded are computed: (i) $\mathrm{mse}(x_{(0,L2,w)}, x_2)$, which is the MSE score between a representation of the I-frame 901 $x_{(0,L2,w)}$ (where $w$ indicates the representation has been warped using computed optical flow information between $x_2$ and $x_{(0,L2)}$) and the current frame $x_2$, and (ii) $\mathrm{mse}(x_{(1,L1,w)}, x_2)$, which is the MSE score between a representation of the P-frame 902 $x_{(1,L1,w)}$ (where $w$ indicates the representation has been warped using computed optical flow information between $x_2$ and $x_{(1,L1)}$) and the current frame $x_2$.
[0428] Second, the same metrics are calculated but without the warping. That is, the following further metrics are calculated: (i) $\mathrm{mse}(x_{(0,L2)}, x_2)$, which is the MSE score between a representation of the I-frame 901 $x_{(0,L2)}$ and the current frame $x_2$, and (ii) $\mathrm{mse}(x_{(1,L1)}, x_2)$, which is the MSE score between a representation of the P-frame 902 $x_{(1,L1)}$ and the current frame $x_2$.
[0433] This gives us the following four metrics associated with the I-frame 901 $x_{(0,L2)}$, the first P-frame 902 $x_{(1,L1)}$, and the current frame 903 $x_2$ that has yet to be encoded and decoded:
[0434] $\mathrm{mse}(x_{(0,L2,w)}, x_2)$ - how close is the warped I-frame to current frame
[0435] $\mathrm{mse}(x_{(1,L1,w)}, x_2)$ - how close is the warped P-frame to current frame
[0436] $\mathrm{mse}(x_{(0,L2)}, x_2)$ - how close is the (not warped) I-frame to current frame
[0437] $\mathrm{mse}(x_{(1,L1)}, x_2)$ - how close is the (not warped) P-frame to current frame
[0438] These four metrics may then be used to compute the following ratios:
[0439] $R_1 = \dfrac{\mathrm{mse}(x_{(1,L1,w)}, x_2)}{\mathrm{mse}(x_{(1,L1)}, x_2)}$
[0441] $R_2 = \dfrac{\mathrm{mse}(x_{(0,L2,w)}, x_2)}{\mathrm{mse}(x_{(0,L2)}, x_2)}$
[0443] In very general terms, $R_1$ is indicative of how much closer the warped I-frame $x_{(0,L2,w)}$ is to the current frame $x_2$ than the (not warped) I-frame $x_{(0,L2)}$, whereas $R_2$ is indicative of how much closer the warped P-frame $x_{(1,L1,w)}$ is to the current frame than the (not warped) P-frame $x_{(1,L1)}$. A ratio value of 1 means neither is closer than the other, a ratio value above 1 means the warped frame is closer to the current frame, a ratio value below 1 means the not-warped frame is closer to the current frame. These ratios are useful because they allow us to determine which of the two reference frame options in this toy example results in optical flow information that results in a warping operation that brings the reference frame closest to the current frame. In this example, we want to use the reference frame whose warped representation is as close to the current frame as possible as this results in a best rate and / or distortion score when we encode and decode the current frame using that reference frame. In other words, in this toy example, the relevant properties, metrics and / or values associated with the I-frame 901 $x_{(0,L2)}$, the first P-frame 902 $x_{(1,L1)}$, and the current frame 903 $x_2$ are these ratios, although it will be appreciated that other properties, metrics and / or values that are indicative of an optimal rate and / or distortion score may also be used.
[0445] In this example with these illustrative ratio metrics, we can define the reference frame prediction function e as:
[0446] $$e(x_{(0,L2)}, x_{(1,L1)}, x_2) = \begin{cases} \text{select } x_{(0,L2)} \text{ as reference frame} & \text{if } R_2 > R_1 \\ \text{select } x_{(1,L1)} \text{ as reference frame} & \text{otherwise} \end{cases}$$
[0449] That is, e predicts that the I-frame x(0, L2) will be a better reference frame to use when encoding the current frame x2 when R2 > R1; otherwise, x(1, L1) will be the better reference frame to use.
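By way of illustration only, the toy selection rule above may be sketched in Python as follows. The sketch assumes that frames are flat lists of pixel values and that the warped frames have already been produced elsewhere. The names `mse` and `select_reference_toy` and the guard constant `eps` are hypothetical and do not form part of the disclosure. The ratios are computed as written above (warped MSE over not-warped MSE), and the I-frame is returned when R2 > R1.

```python
from typing import Sequence, Tuple


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean square error between two equally sized flat frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def select_reference_toy(i_frame: Sequence[float], i_warped: Sequence[float],
                         p_frame: Sequence[float], p_warped: Sequence[float],
                         current: Sequence[float],
                         eps: float = 1e-12) -> Tuple[str, float, float]:
    """Return ("I" or "P", R1, R2) for the two-candidate toy example."""
    # R1 is formed from the P-frame x(1, L1), R2 from the I-frame x(0, L2).
    # Assumption: eps only guards against division by zero.
    r1 = mse(p_warped, current) / max(mse(p_frame, current), eps)
    r2 = mse(i_warped, current) / max(mse(i_frame, current), eps)
    # Rule as stated in the text: choose the I-frame if R2 > R1.
    choice = "I" if r2 > r1 else "P"
    return choice, r1, r2
```

The function returns both ratios alongside the choice so that a caller can log or inspect them.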
[0450] If another function e is used then different properties, metrics and / or values associated with the inputs to e may be used, whereby, at a general level, the function e predicts which frame of the available previously encoded and decoded frames is the reference frame that produces some optimum rate distortion score for encoding and decoding the current frame. For example, e may be instead based on estimating a magnitude of optical flow information (e.g. motion vector magnitude) across a plurality of GOP sizes forward from the current frame, and then adaptively selecting a GOP size based on which GOP size has a smallest magnitude of optical flow information, thereby following a default reference frame dependency scheme within a GOP but predicting each GOP size adaptively. Further, whilst the example function e above is based on properties, metrics and / or values of only two available frames and a current frame, it will be appreciated that this can be generalized to any number of frames, facilitating prediction that can result in highly complex and variable reference frame dependencies across a GOP.
[0451] These and similar approaches can be generalised to:
$$e(X) = \begin{cases} \text{select } x_{(i,L_n)} \in X \text{ as reference frame} & \text{if } C(x_{(i,L_n)}) \text{ is satisfied} \\ \text{select } x_{(k,L_m)} \in X \text{ as reference frame} & \text{otherwise} \end{cases}$$
[0457] That is, we define a function e that selects a reference frame from a set of possible frames X = {x(i, Ln)}, based on a certain criterion C. The set X represents a collection of possible frames, each labeled as x(i, Ln), and may include the current frame being encoded and decoded. Here, i is an index identifying the specific frame within the set, and Ln is an associated quality level.
[0460] The condition C(x(i, Ln)) is an arbitrary criterion function applied to each frame x(i, Ln) within X. As noted above, this criterion C could be any function, logical condition, rule, or quantitative measure that determines whether a particular frame should be chosen as the reference frame, based on one or more properties, metrics and / or values associated with one or more frames in X.
[0461] In the first case, the function e will select frame x(i, Ln) as the reference frame if the criterion C(x(i, Ln)) is satisfied for that frame. In the edge case where no frame in the set X meets the criterion defined by C, the function defaults to selecting a specified frame x(k, Lm) as the reference frame. One example method of selecting a default frame is to follow a specified, predetermined reference frame dependency scheme (e.g. an L1T2 scheme such as that of Figure 8, or some other scheme).
[0465] Thus, the above generalised approach may at a high level be considered as following a predetermined reference frame dependency scheme as the default unless some exit criterion is met, in which case that reference frame dependency scheme is exited early and a different reference frame dependency is chosen.
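Purely as a non-limiting sketch, the generalised function e may be expressed in Python as below. The `Frame` class, the `select_reference` name, and the choice to return the first frame that satisfies the criterion are assumptions made for illustration; the description above leaves the criterion C and any tie-breaking between several satisfying frames open.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Frame:
    index: int    # position i of the frame in the sequence
    quality: str  # quality level label Ln, e.g. "L1" or "L2"


def select_reference(frames: Iterable[Frame],
                     criterion: Callable[[Frame], bool],
                     default: Frame) -> Frame:
    """Return a frame of X satisfying criterion C, else the default frame."""
    for frame in frames:
        # Assumption: the first frame meeting the criterion is chosen.
        if criterion(frame):
            return frame
    # No frame met the criterion: fall back to the predetermined scheme.
    return default
```

For example, `select_reference(frames, lambda f: f.quality == "L2", default=frames[-1])` would exit the default scheme only when an L2 frame is available.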
[0466] Returning now to the specific toy example of Figure 9, once frame 903 x2 is encoded we move on to the next frame 904 x3, which uses the previous frame 903 x2 as a reference frame. Then, for the next frame 905 x4, we perform a reference frame dependency prediction 908 again, but this time based on a property, metric and / or value associated with the frames x(2, L2) 903, x(3, L1) 904, and the current input frame x4. In this case we may use the same function e as described above, or some other function. After selecting the reference frame for encoding and decoding frame 905 x4, we encode and decode frame 906 x5 using frame 905 x4 as the reference frame, and then proceed again with a further reference frame dependency prediction 909 and so on.
[0467] In the example of Figure 9, the reference frame dependency check is made every even frame after the I-frame 901 x₀, but it will be appreciated that this is illustrative only and in other examples the check may be made every frame, or every n frames, following some other predetermined schedule, and so on.
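The schedule of Figure 9 may, as one possible sketch, be written as a loop that runs the prediction step on every n-th frame and otherwise falls back to the immediately preceding frame. The function name `plan_references`, the `predict` callable and the restriction of the candidates to the two preceding frames are assumptions drawn from the toy example, not a definitive implementation.

```python
from typing import Callable, Dict, List


def plan_references(num_frames: int,
                    predict: Callable[[List[int], int], int],
                    check_every: int = 2) -> Dict[int, int]:
    """Map each frame index t >= 1 to the index of its reference frame."""
    refs: Dict[int, int] = {}
    for t in range(1, num_frames):
        if t >= 2 and t % check_every == 0:
            # Prediction step: choose between the two preceding frames,
            # as in the toy example (an assumption of this sketch).
            candidates = [t - 2, t - 1]
            choice = predict(candidates, t)
            if choice not in candidates:
                raise ValueError("predictor must return one of the candidates")
            refs[t] = choice
        else:
            # Default dependency: use the immediately preceding frame.
            refs[t] = t - 1
    return refs
```

Setting `check_every=1` corresponds to making the check on every frame from x2 onwards.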
[0468] Optimal quality level prediction
[0469] So far, we have not touched on predicting what quality level a current frame is to be encoded and decoded at, given some set of available reference frames and a current frame. That is, in the example of Figure 9, the quality level for encoding and decoding follows a predetermined alternating quality level order of L2, L1, L2, L1 and so on. In order to make the present approach even more adaptive, it is envisaged that the quality level at which to encode and decode the current frame may be predicted based on one or more properties, metrics and / or values of the available reference frames and / or the current frame. That is, a function q is introduced that predicts an optimum quality level that the current frame is to be encoded and decoded at, given the available reference frames and / or current frame.
[0470] In one example, the function q may be based on estimating how much a rate and distortion score of a current frame changes for a change in quality level, that is q may be based on (e.g. a function of) estimating δmse(x̂ₜ,L, xₜ) / δL, where mse(x̂ₜ,L, xₜ) is a mean square error between (i) a reconstructed current frame x̂ₜ,L obtained by performing a forward pass through the AI-compression pipeline at quality level L and computed using some reference frame (e.g. predicted using e or predetermined according to some default scheme) and (ii) the corresponding ground truth current frame xₜ. For example:
$$q(x_t) = h\left(\frac{\delta\,\mathrm{mse}(\hat{x}_{t,L}, x_t)}{\delta L}\right)$$
[0474] where h is any suitable function such that q(xt) can be said to be based on the change in distortion with respect to a change in quality level L.
[0475] In the case where the quality levels are discrete and non-differentiable, or when it is desirable not to calculate the derivative exactly due to computational and / or device constraints, the function q (and thus h) may be implemented using finite differences, for example:
$$q(x_t) \approx h\left(\frac{\mathrm{mse}(\hat{x}_{t,L+\Delta L}, x_t) - \mathrm{mse}(\hat{x}_{t,L}, x_t)}{\Delta L}\right)$$
[0479] where ΔL represents a change in quality level, for example increasing or decreasing the quality level by 1.
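As a minimal sketch of the finite-difference form above, the following Python assumes a caller-supplied function `distortion_at(L)` that returns the MSE of the current frame reconstructed at quality level L (for instance from a forward pass of the pipeline). That callable, the names `finite_difference` and `q`, and the identity default for h are hypothetical.

```python
from typing import Callable


def finite_difference(distortion_at: Callable[[int], float],
                      level: int, delta: int = 1) -> float:
    """Approximate the change in distortion per unit change in quality level."""
    if delta == 0:
        raise ValueError("delta must be non-zero")
    return (distortion_at(level + delta) - distortion_at(level)) / delta


def q(distortion_at: Callable[[int], float], level: int,
      h: Callable[[float], float] = lambda slope: slope,
      delta: int = 1) -> float:
    """Apply h to the finite-difference estimate.

    Assumption: h defaults to the identity; the text allows any suitable h.
    """
    return h(finite_difference(distortion_at, level, delta))
```

A negative `delta` gives the backward difference, corresponding to decreasing the quality level by one step.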
[0480] Training using the usual rate distortion loss function will result in q converging to where it is consistently outputting an optimum quality level (i.e. the quality level that achieves a minimal rate distortion score) given an input frame and the available reference frame(s). Note it is also envisaged that, where there are no compute and / or memory constraints, gradient descent may be performed during inference to learn the optimal quality level at inference time by forward passing through the model many times with different quality levels. In this setup there is no real distinction between q and L, but this is computationally very costly and impractical in most real time or near real time use cases.
[0481] Given that finding an optimal quality level L by iteratively evaluating q may result in multiple forward passes of the AI-compression pipeline being performed, the compute and memory overhead of q may be large. Accordingly, it is envisaged that q may instead be run on a lower resolution representation of the current frame. That is, xt may be downsampled a predetermined number of times and the predicted optimal quality level L may thus be based on the downsampled representation of xt to reduce compute and memory overhead.
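One simple way to obtain such a lower resolution representation is repeated 2x2 average pooling, sketched below for a single-channel image held as a list of rows. The choice of average pooling and the names `downsample_2x` and `downsample` are assumptions for illustration; the description does not prescribe a particular downsampling method.

```python
from typing import List

Image = List[List[float]]


def downsample_2x(image: Image) -> Image:
    """Halve height and width by averaging each 2x2 block of pixels."""
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4.0
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]


def downsample(image: Image, times: int) -> Image:
    """Apply the 2x2 average pooling a predetermined number of times."""
    for _ in range(times):
        image = downsample_2x(image)
    return image
```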
[0482] It will further be appreciated that the prediction of an optimal L by optimising q may be based not just on evaluating q using a single, predetermined or predicted reference frame, but also on evaluating q on all of the available reference frames, thereby jointly optimising L and the choice of reference frame t − j (where j is frame index of a GOP). Thus, q takes on the role of the reference frame temporal dependency prediction function e as well as playing the role of predicting an optimal quality level to use when encoding and decoding the current frame.
[0484] In this case, we may define q as a function we are trying to jointly optimise to minimise rate and distortion scores of the pipeline for a given input frame and / or available reference frames, whereby q is some function (for example a neural network whose parameters are learnable using a rate distortion loss, or some other function that may but need not be learnable) that maps information associated with the input frame and / or available reference frames (e.g. a difference between them) onto an optimal quality level L and / or reference frame temporal dependency t − j:
$$q_L(x_t) \approx h_L\left(\frac{\mathrm{mse}(\hat{x}_{t,L+\Delta L}, x_t) - \mathrm{mse}(\hat{x}_{t,L}, x_t)}{\Delta L}\right)$$
$$q_j(x_t) \approx h_j\left(\frac{\mathrm{mse}(\hat{x}_{t,L,x_{t-j+\Delta j}}, x_t) - \mathrm{mse}(\hat{x}_{t,L,x_{t-j}}, x_t)}{\Delta j}\right)$$
[0490] where mse(x̂ₜ,L,xₜ₋ⱼ, xₜ) is the mean square error between the ground truth current frame and the frame reconstructed at quality level L with a forward pass using the reference frame at position t − j, where ΔL is an incremental change in quality level (e.g. one quality level higher or lower than L), and Δj is an incremental change in the index position of the frame used as a reference frame. Alternatively, it is envisaged that the forward pass may be performed across multiple different quality levels L and ground truth references (effectively with "infinite" quality level), with a difference being estimated from these.
[0491] If we are optimising with gradient descent, then for each optimization iteration, we may update L and j, based on the rate distortion loss gradient, with:
$$L_{new} = L_{old} - \alpha \frac{d\,\mathrm{loss}(L_{old})}{dL}$$
$$j_{new} = j_{old} - \beta \frac{d\,\mathrm{loss}(j_{old})}{dj}$$
[0496] where α and β are the sizes of the incremental increases or decreases ΔL and Δj.
[0497] After some number of iterations, for example when L and j converge or after a predetermined number of steps, a jointly optimal L and j will be reached for the current frame. These optimal values of L and j will then be passed as flags into the pipeline, so that the current frame is encoded and decoded at the optimal quality level using the optimal reference frame, before the same process is repeated for the next frame.
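The iterative update may be sketched, for the quality level only, as a plain gradient descent loop. The sketch treats L as a continuous value (a relaxation that is an assumption here) and takes the loss gradient as a caller-supplied function; `optimise_level`, `loss_gradient`, `max_steps` and `tol` are hypothetical names. The standard descent direction (subtracting the scaled gradient) is assumed.

```python
from typing import Callable


def optimise_level(loss_gradient: Callable[[float], float],
                   level: float, alpha: float = 0.1,
                   max_steps: int = 100, tol: float = 1e-6) -> float:
    """Gradient descent on a continuous relaxation of the quality level L."""
    for _ in range(max_steps):
        # L_new = L_old - alpha * d loss(L_old) / dL
        new_level = level - alpha * loss_gradient(level)
        if abs(new_level - level) < tol:
            # Converged: further updates would change L by less than tol.
            return new_level
        level = new_level
    # Stop after the predetermined number of steps.
    return level
```

The returned value may then be rounded to the nearest available discrete quality level before being passed as a flag into the pipeline.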
[0498] Note, however, that this is expensive to perform in inference. Instead, the learned functions qL(xt) and qj(xt) predict, based on information associated with a trial encode of the current frame using the available reference frames (e.g. information such as the derivative d mse(L) / dL), an output predicted optimal quality level L and optimal reference frame index j, without the need to perform expensive optimisation iterations during inference.
[0499] Note also that, in the case of optimising j (i.e. where an optimal reference structure is being learned), gradient descent is difficult as the reference frame index j is a discrete variable. Thus, the function qj facilitates the prediction of an optimal reference frame index j without the need to perform gradient descent.
[0500] Whilst not shown, it is also envisaged that where compute and memory overheads are not a concern on the encode side, a brute force approach may be used to precalculate all possible quality level and reference frame temporal dependency combinations for a GOP of given input frames, and then to use the combination which achieves the best rate distortion score when encoding and decoding the GOP. This approach is envisaged to be used where the number of quality level combinations and the maximum GOP size are small, or where a given input sequence of frames is likely to be used very regularly (e.g. in video on demand scenarios). It will also be appreciated that the brute force approach quickly becomes burdensome as the number of possible combinations increases.
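The brute force approach can be sketched as an exhaustive search over per-frame (quality level, reference offset) pairs. The scoring function `rd_score`, which would in practice involve encoding and decoding the GOP, is supplied by the caller and is hypothetical, as are the names `Flag` and `brute_force_gop`. The sketch assumes frame 0 is an I-frame and that frame t may reference any earlier frame.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple

Flag = Tuple[str, Optional[int]]  # (quality level, reference offset j or None)


def brute_force_gop(num_frames: int, levels: Sequence[str],
                    rd_score: Callable[[Tuple[Flag, ...]], float]
                    ) -> Tuple[Flag, ...]:
    """Return the per-frame (level, offset) plan with the lowest score."""
    # Frame 0 is an I-frame and has no reference frame.
    options: List[List[Flag]] = [[(level, None) for level in levels]]
    for t in range(1, num_frames):
        # Frame t may reference any earlier frame t - j with 1 <= j <= t.
        options.append([(level, j) for level in levels
                        for j in range(1, t + 1)])
    # Exhaustively score every combination; cost grows very quickly.
    return min(product(*options), key=rd_score)
```

With n quality levels, the number of combinations is n multiplied by n·t for each later frame t, which illustrates why the approach is only practical for small GOPs.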
[0501] Combined optimal temporal and quality level hierarchy prediction
[0502] Figure 10a, Figure 10b, Figure 10c, Figure 10d, and Figure 10e together show an illustrative toy example that combines a reference frame temporal dependency prediction step e with a current frame quality level prediction step q, together providing a fully predictive scalable video coding scheme whose quality level hierarchies and reference frame temporal dependencies are adaptive. In this toy example, we provide three quality levels L3 (highest quality), L2 (medium quality level), and L1 (lowest quality level).
[0503] Starting with Figure 10a, the first frame 1001 x0 at index position 0 is a highest quality L3 I-frame as no other frames are available. The second frame 1002 x1 at index position 1 is a P-frame that only has the option of the I-frame 1001 as its reference frame but does have a choice of what quality level to be: L3, L2 or L1. The quality level prediction step 1003 q(x1) is performed, and in this example predicts that an optimal quality level is L1, thus the current frame x1 is encoded and decoded at quality level L1. Thus, the predicted reference frame dependency and quality level flags passed into the pipeline for the frames of Figure 10a are {L3}, {L1, t − 1}.
[0505] In Figure 10b, we get the next frame 1004 x2. We now have two options for reference frames: the higher quality I-frame two index positions back, or the lower quality P-frame one index position back. The reference frame dependency prediction step 1005 e(x(0,L3), x(1,L1), x2) is then performed, predicting that the optimal reference frame to use is the immediately preceding frame 1002, thus giving a reference frame temporal dependency flag of t − 1. The quality level prediction step 1006 q(x2) is then performed to predict, using the predicted t − 1 frame 1002 as the reference frame, what quality level the current frame x2 should be encoded and decoded at. In this toy example, q(x2) predicts L2. Thus, our sequence of flags is now {L3}, {L1, t − 1}, {L2, t − 1}.
[0506] In Figure 10c, we get the next frame 1007 x3. We now have three options for reference frames: frame 1001, frame 1002, or frame 1004. The reference frame dependency prediction step 1008 e(x(0,L3), x(1,L1), x(2,L2), x3) is performed, predicting here that the optimal reference frame is the frame two steps back at index position 1, thus giving us a reference frame temporal dependency flag of t − 2. The quality level prediction step 1009 q(x3) is then performed with the t − 2 frame 1002 as reference frame, giving a predicted quality level to use for encoding and decoding the current frame of L1. Thus, our sequence of flags is now {L3}, {L1, t − 1}, {L2, t − 1}, {L1, t − 2}.
[0511] In Figure 10d, we get the next frame 1010 x4. We now have four options for reference frames: frame 1001, frame 1002, frame 1004, or frame 1007. The reference frame dependency prediction step 1011 e(x(0,L3), x(1,L1), x(2,L2), x(3,L3), x4) is performed, predicting here that the optimal reference frame is the frame four steps back at index position 0, thus giving us a reference frame temporal dependency flag of t − 4. The quality level prediction step 1012 q(x4) is then performed with the t − 4 frame 1001 as reference frame, giving a predicted quality level to use for encoding and decoding the current frame of L3. Thus, our sequence of flags is now {L3}, {L1, t − 1}, {L2, t − 1}, {L1, t − 2}, {L3, t − 4}.
[0512] Finally, in Figure 10e, we get the next frame 1013 x5. We now have five options for reference frames: frame 1001, frame 1002, frame 1004, frame 1007, or frame 1010. The reference frame dependency prediction step 1011 e(x(0,L3), x(1,L1), x(2,L2), x(3,L3), x(4,L3), x5) is performed, predicting here that the optimal reference frame is the frame five steps back at index position 0, thus giving us a reference frame temporal dependency flag of t − 5. The quality level prediction step 1014 q(x5) is then performed with the t − 5 frame 1001 as reference frame, giving a predicted quality level to use for encoding and decoding the current frame of L1. Thus, our sequence of flags is now {L3}, {L1, t − 1}, {L2, t − 1}, {L1, t − 2}, {L3, t − 4}, {L1, t − 5}.
[0514] These steps may then be repeated an arbitrary number of times, for example up to a maximum number of steps, or until some other condition is met, to force a current frame to be an I-frame again and to commence the start of a new GOP.
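The per-frame procedure of Figures 10a to 10e may be sketched as the following loop, in which the prediction steps e and q are supplied as callables. The signatures of `e` and `q`, the `Flag` type and the function name `predict_flags` are assumptions of this sketch; in the described pipeline these steps would operate on the frames themselves rather than on indices.

```python
from typing import Callable, List, Optional, Tuple

Flag = Tuple[str, Optional[int]]  # (quality level, reference offset j)


def predict_flags(num_frames: int,
                  e: Callable[[List[Flag], int], int],
                  q: Callable[[List[Flag], int, int], str],
                  i_frame_level: str = "L3") -> List[Flag]:
    """Build the per-frame flag sequence for one GOP."""
    flags: List[Flag] = [(i_frame_level, None)]  # frame 0 is the I-frame
    for t in range(1, num_frames):
        # With a single available frame there is nothing to choose between.
        offset = 1 if t == 1 else e(flags, t)
        if not 1 <= offset <= t:
            raise ValueError("reference offset must point at an earlier frame")
        # The quality level is predicted given the chosen reference frame.
        level = q(flags, t, offset)
        flags.append((level, offset))
    return flags
```

Supplying lookup-table stubs for `e` and `q` that return the choices of the toy example yields the flag sequence {L3}, {L1, t − 1}, {L2, t − 1}, {L1, t − 2}, {L3, t − 4}, {L1, t − 5}, with each tuple holding the level and the offset j.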
[0515] It will be appreciated that the toy example of Figures 10a-10e is illustrative only and provided to demonstrate that the methods of the present disclosure facilitate fully adaptive, predicted temporal and quality hierarchies for scalable video coding. It will also be appreciated that the present methods are synergistic with end-to-end, learned video compression by virtue of the quality levels being fully learnable (e.g. through learnable offsets applied to gain or control units), thus facilitating implementation without burdensome handcrafting of the quality levels within the adaptive, predictive scheme.
[0516] Whilst not shown in Figures 10a-10e, it is also envisaged that the present methods may be extended to reference frames at index positions forward from the current frame, for example at positions t+1, t+2, t+3, ... t+j. That is, the present methods are not limited to P-frames only, but can be extended to include bidirectional reference frame (B-frame) dependency predictions, where such forward frames are available to use as reference frames. Advantageously, extending the present methods to B-frames comprises including any frames with forward index positions that are available as input to the reference frame prediction function e. That is: e(x(0, Ln), ..., x(t+j, Ln), xt). Optionally, because the quality level and / or reference frame dependency prediction step may comprise running one or more of the components of the AI-based compression pipeline (for example, when relying on one or more warped representations of one or more available frames, the optical flow is calculated), the reference frame dependency prediction step may increase compute and / or memory overhead. To help to address this, it is envisaged that one or more of these steps may be computed on downsampled representations of the current frame (and / or past frames). The inventors have found that the quality level and reference frame dependency prediction step is still effective even in downsampled space, thus achieving the same effect with a smaller compute and / or memory overhead than when predicting an optimum reference frame dependency in full resolution space.
[0517] Concept 2: MiniGOP control
[0518] As described above, a miniGOP is a repeatable block of a hierarchical scheme, for example between I-frames, or P- and / or B-frames. In some hierarchical schemes, it can be beneficial to set a maximum miniGOP size, for example a maximum number of frames that can be assigned to that miniGOP as I-, P- and / or B-frames, at some quality level and / or in some encode order. Typically, as the miniGOP size increases, the spacing between P-frames increases, leading to a greater number of B-frames in an image sequence. This can cost fewer bits to encode, but may lead to poorer distortion scores, particularly in scenes with high motion, as the temporal gap over which motion is estimated increases. Note that high motion may refer to image sequences where pixel values change significantly between frames. In high motion scenes it can be beneficial for rate distortion performance to have smaller miniGOP sizes, as this typically results in more I- and P-frames in a sequence, which can help to prevent errors and artefacts propagating across the frames as the high motion occurs, even if this higher number of I- and P-frames costs more bits.
[0519] Determining what the miniGOP size should be for a given sequence of images may comprise receiving a sequence of images and assigning metadata to each image specifying a frame type parameter (for example, specifying whether that image should be an I-frame, a P-frame or a B-frame, and where applicable which other images in the sequence should serve as its reference frame(s)), a quality level parameter (for example, specifying one or more of the quality levels at which that frame should be encoded, such as the quality levels described above in respect of concept 1), and / or an encode order parameter (for example, specifying an index position in an ordered list at which that frame will be encoded).
[0520] Taking a toy example of a sequence of 16 images, metadata specifying the above parameters might comprise assigning 8 of the images to a first miniGOP of size 8, with frame types of I-P-B-B-P-B-P-B-P, a second miniGOP of size 4 with frame types P-B-B-P, a third miniGOP of size 2 with frame types P-B, and a fourth miniGOP of size 2 with frame types P-P. The metadata associated with each frame in the miniGOPs may also be a list of reference frame dependencies defining on which other frame(s) each of the frames depends, and any accompanying encode order information. This metadata may be passed together with the images to the encoder, and / or sent in the bitstream to the decoder to use when encoding and decoding.
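As an illustration of how such metadata might be held in practice, the following sketch defines a per-frame record and splits a sequence into miniGOP index ranges. The class `FrameMetadata`, its field names, and the function `partition_minigops` are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FrameMetadata:
    frame_type: str      # "I", "P" or "B"
    quality_level: str   # e.g. "L1"
    encode_order: int    # index position in the encode order
    references: List[int] = field(default_factory=list)  # reference frame indices


def partition_minigops(num_frames: int,
                       sizes: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) index ranges, end exclusive, one per miniGOP."""
    if any(size <= 0 for size in sizes) or sum(sizes) != num_frames:
        raise ValueError("sizes must be positive and sum to num_frames")
    ranges, start = [], 0
    for size in sizes:
        ranges.append((start, start + size))
        start += size
    return ranges
```

For the 16-image toy example, `partition_minigops(16, [8, 4, 2, 2])` gives the four index ranges to which per-frame `FrameMetadata` records would then be attached.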
[0521] Determining optimal miniGOP size(s) for a given sequence of images is non-trivial. A naive approach of setting a static or default miniGOP size is typically not optimal because, as described above, there are scene types where larger miniGOP sizes are beneficial to rate distortion performance, and there are scene types where smaller miniGOP sizes are beneficial to rate distortion performance.
[0523] Instead, setting miniGOP size(s) for an image sequence may be based on information derived from at least two images of the image sequence, for example a difference between the at least two images. This difference may be, for example, optical flow information, and / or some other information indicative of a difference between two frames.
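A very simple instance of setting the miniGOP size from a difference between two images is sketched below. It uses the mean absolute pixel difference as a stand-in for "information indicative of a difference", and maps larger differences to smaller miniGOP sizes in line with the high-motion discussion above. The threshold values, the candidate sizes and the function names are hypothetical placeholders rather than values from the disclosure.

```python
from typing import Sequence, Tuple


def mean_abs_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute pixel difference between two equally sized flat frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def choose_minigop_size(frame_a: Sequence[float], frame_b: Sequence[float],
                        thresholds: Sequence[Tuple[float, int]] = ((0.02, 8), (0.10, 4)),
                        smallest: int = 2) -> int:
    """Pick a larger miniGOP when the two frames differ little."""
    difference = mean_abs_difference(frame_a, frame_b)
    # Assumption: thresholds are (limit, size) pairs sorted by ascending limit.
    for limit, size in thresholds:
        if difference <= limit:
            return size
    # Large difference (e.g. high motion): use the smallest miniGOP.
    return smallest
```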
[0524] Two non-limiting example approaches to producing such information will now be described with reference to an I-frame and flow-residual AI-based compression, transmission and decompression pipeline.
[0525] Figures 11 to 13 illustratively show components of a flow residual AI-based compression, transmission and decompression pipeline.
[0526] Figure 11 illustrates the components of an I-frame part comprising an encoder neural network 1101 (such as the encoder neural network 203 of Figure 3) and a decoder neural network 1102 (such as the decoder neural network 205 of Figure 3).
[0527] The encoder neural network 1101 in Figure 11 may be considered as a single neural network or as a plurality of neural network components made up of an encoder 1103, a hyper encoder 1104, and a hyper hyper encoder 1105, as well as a decoder 1106, a hyper decoder 1107, and a hyper hyper decoder 1108, with corresponding entropy encoding modules 1109, 1110, 1111 that together use the outputs of the decoders 1106, 1107, 1108 to entropy encode the latent representation outputs of the encoders 1103, 1104, 1105 into a bitstream.
[0528] The decoder neural network 1102 in Figure 11 may be considered as a single neural network or as a plurality of neural network components made up of a decoder 1106, a hyper decoder 1107, and a hyper hyper decoder 1108 with corresponding entropy decoding modules 1112, 1113, 1114 that together use the bitstreams and respective outputs of each other to entropy decode the bitstream and reconstruct the various latent representations into an output image x̂0, which is a lossy approximation of an input image x0.
[0529] Figure 12 illustrates the components of a flow part of a pipeline comprising a flow encoder neural network 1201 (such as the encoder neural network 207 of Figure 3) and a flow decoder neural network 1202 (such as the decoder neural network 209 of Figure 3).
[0530] The flow encoder neural network 1201 in Figure 12 may be considered as a single neural network or as a plurality of neural network components made up of: a flow encoder 1206, a flow hyper encoder 1208, a flow hyper hyper encoder 1209, a flow decoder 1210, a flow hyper decoder 1211 and a flow hyper hyper decoder 1212. Also provided are respective entropy encoding modules 1213, 1214, 1215 that together use the outputs of the flow decoders 1210, 1211, 1212 to entropy encode the latent representation outputs of the flow encoders 1206, 1208, 1209 into a bitstream.
[0532] The flow decoder neural network 1202 in Figure 12 may be considered as a single neural network or as a plurality of neural network components made up of a flow decoder 1210, a flow hyper decoder 1211, and a flow hyper hyper decoder 1212 with corresponding entropy decoding modules 1216, 1217, 1218 that together use the bitstreams and respective outputs of each other to entropy decode the bitstream and reconstruct the various latent representations into a representation of optical flow information indicative of a difference between an input previously decoded image xt−1 and an input current image xt. The representation of optical flow information is used to warp the previously decoded image xt−1 to produce a warped representation of the previously decoded image xt−1,w for later use by the residual part of the pipeline.
[0535] Certain scene types such as scene cuts, occlusion scenes (where an object moves and reveals a pixel structure background behind it), large portions of the frame changing due to high motion, failure to describe high motion resulting in mispositioned objects in the frame, and other non-motion-compensatable changes in scene content can be challenging for AI-based compression pipelines to handle well. Non-motion-compensatable changes means changes between a reference frame and a current frame where no amount of optical flow information based motion compensation can model a change because the reference frame does not contain enough or any information that is useful for predicting the current frame. Such scenes contain frame sequences that may result in the networks of an AI-based compression pipeline spending a disproportionate number of bits on compression, and in artefacts and other failure modes arising in the reconstructed frames. It will be appreciated that a failure mode of a neural network arises when the output of the neural network based on a given input deviates substantially from what the network would be expected to produce from the input. For example, an AI-based compression pipeline might for many scene types produce reconstructed frames at some target rate score and / or distortion score but for other scene types suffer from a failure mode which results in a large spike in, for example, the distortion score, indicating the output frame looks nothing like the input frame.
[0536] In general terms, this underperformance can be attributed in part to how flow residual based compression pipelines operate. That is, they produce a reference frame xt−1,w that is used by the residual part to reconstruct xt. As described above, this reference frame xt−1,w is produced by optical flow information based motion compensation through the warping of a reference frame xt−1 with the flow map produced by the flow part. However, there are several cases in which a frame produced by motion compensation is not a useful base for the residual part, for example scene cuts, large portions of the frame changing due to high motion, or failure to describe high motion resulting in mispositioned objects in the frame, and so on. In effect, this leads to image content at a given location varying drastically between xt−1,w and xt, which the residual part struggles to correct without artefacts. In these cases, the inventors have realised that it is often bitrate-efficient and perceptually-pleasing to apply a coarse, cheap correction to these areas when producing the final reference frame xt−1,w for the residual part.
[0538] In particular, a correction can be applied to a (warped or ground truth) reference frame before it is passed to the residual part of the pipeline, thereby reducing the difficulty of the task the residual part has to perform, reducing the number of artefacts in the final output image xt, and reducing the number of bits spent by the residual part to produce the final image xt. In general terms, this correction may be implemented by introducing a new output tensor on the decode side of the flow part of the pipeline, where the new output tensor comprises information about the current frame xt and information indicative of to what extent that information can be used to correct xt−1,w.
[0539] For example, the flow hyper decoder 1211 may be configured not only to output entropy parameters (e.g. location, scale parameters and so on), but also a learned representation of xt referred to hereinafter as xt,mini 1218, for example a tensor having luma and chroma channels and a predetermined resolution that may be the same resolution as xt or a lower resolution. The base flow decoder 1210 may then be configured to output not only the flow map, but also a learned occlusion mask mocc that specifies where and by how much the warped reference frame xt−1,w is to be corrected by replacing or modifying its pixels using the newly produced xt,mini. In very general terms, the flow part of the decode side of the pipeline now produces an additional output that the networks are able to learn to use to correct problematic xt−1,w that arise from failure modes in the flow part so that the residual part is not burdened with that task.
[0540] More formally, a corrected warped reference frame xt−1,w,corr may be defined as:
$$x_{t-1,w,corr} = m_{occ} \cdot x_{t-1,w} + (1 - m_{occ}) \cdot x_{t,mini}$$
[0544] where mocc is a learned mask 1219 output by the base flow decoder 1210, xt,mini is a learned representation of the input frame xt which was fed into the encode side of the flow part, and (1 − mocc) is the inverse of the mask mocc. In one implementation, the mask mocc may be a binary mask and thus specifies which pixels of xt−1,w,corr are taken from xt−1,w and which are taken from xt,mini. Alternatively, the mask need not be binary and may instead specify a pixel-wise blend of xt−1,w and xt,mini. Alternatively, to reduce the impact that the miniframe approach has on runtime, xt,mini can be replaced by a mechanism where occluded areas are filled in with grey or other single colour pixel values. This reduces the compute overhead and, it has been found, does not significantly negatively impact overall distortion performance. In this case, the corrected warped reference frame becomes:
[0548] $$x_{t-1,w,corr} = m_a \cdot x_{a,w} + m_b \cdot x_{b,w} + m_{grey} \cdot g$$
where g is a constant grey and each mask satisfies 0 ≤ m ≤ 1, with ma + mb + mgrey = 1. This approach is conceptually similar to the miniframe approach but fills out areas where there is likely to be bad flow with grey pixels, rather than with pixels from a representation of the current frame.
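The two correction formulas above may be sketched per pixel as follows, with frames and masks held as flat lists of values in the range 0 to 1. The function names `blend_with_mask` and `blend_with_grey`, the default grey value of 0.5 and the tolerance `tol` are assumptions of this sketch; in the described pipeline the masks and xt,mini would be tensors output by the flow part networks.

```python
from typing import List, Sequence


def blend_with_mask(warped: Sequence[float], mini: Sequence[float],
                    mask: Sequence[float]) -> List[float]:
    """Compute m_occ * x_{t-1,w} + (1 - m_occ) * x_{t,mini} per pixel."""
    if not (len(warped) == len(mini) == len(mask)):
        raise ValueError("inputs must have the same number of pixels")
    return [m * w + (1.0 - m) * x for w, x, m in zip(warped, mini, mask)]


def blend_with_grey(warped_a: Sequence[float], warped_b: Sequence[float],
                    mask_a: Sequence[float], mask_b: Sequence[float],
                    mask_grey: Sequence[float], grey: float = 0.5,
                    tol: float = 1e-9) -> List[float]:
    """Compute m_a * x_{a,w} + m_b * x_{b,w} + m_grey * g per pixel."""
    inputs = (warped_a, warped_b, mask_a, mask_b, mask_grey)
    if len({len(values) for values in inputs}) != 1:
        raise ValueError("inputs must have the same number of pixels")
    out = []
    for xa, xb, ma, mb, mg in zip(*inputs):
        # The masks act as blending weights, so they must sum to 1.
        if min(ma, mb, mg) < 0.0 or abs(ma + mb + mg - 1.0) > tol:
            raise ValueError("masks must be non-negative and sum to 1")
        out.append(ma * xa + mb * xb + mg * grey)
    return out
```

With a binary mask, a mask value of 1 keeps the warped pixel and a value of 0 takes the pixel from xt,mini, matching the formula as written above.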
[0549] During training, the networks of the pipeline learn to use mocc and xt,mini to construct any portions of xt−1,w,corr primarily from xt,mini when there is very high motion, and primarily from xt−1,w when there is low motion. This means the networks are learning to correct any bad flow artefacts in xt−1,w by replacing pixels with pixels taken from xt,mini. At the same time, the networks learn to include in xt,mini substantially only those pixels which are likely to be used for the corrections. As a result, xt−1,w,corr more accurately captures actual motion between xt and xt−1, and thus the difficulty of the task that the residual part has to perform is reduced and the resulting final reconstructed frame xt contains fewer artefacts and uses up fewer bits in the bitstream.
[0552] Configuring the networks of the flow part to output m_occ and x_{t,mini} allows the flow part to effectively operate simultaneously as a form of I-frame module whereby it learns during training that it may in some cases have to reconstruct relevant information from x_t directly (in the form of x_{t,mini}) and discard some or all of the pixels of x_{t-1,w} whenever non-motion-compensatable changes between x_t and x_{t-1} occur, for example if scene content differs drastically between x_t and x_{t-1} such that it would result in a spike in bitrate and/or distortion scores. In the extreme case, for example when there is a scene change or scene cut, m_occ and x_{t,mini} even allow the flow part networks to choose to construct x_{t-1,w,corr} entirely from the pixels of x_{t,mini}, whereby all of the pixels of x_{t-1,w} are discarded (or, in the case of the greyed out pixel alternative described above, the information about the current frame is effectively entirely encoded and decoded by the residual part, which is doing all of the work). In the toy example where the mask m_occ is a binary mask, this means m_occ comprises all zeroes and x_{t-1,w,corr} = x_{t,mini}, so that the mask operates as a form of learned scene cut detector. When the residual part receives the corrected warped frame x_{t-1,w,corr}, its task is now simply to tweak x_{t-1,w,corr} to get as close to x_t as possible, which is much more straightforward to do with fewer bits than trying to get a badly warped, artefact-filled x_{t-1,w} as close to x_t as possible.
[0556] Whilst the above-described example uses the flow decoder 1210 and flow hyper decoder 1211 to produce m_occ and x_{t,mini}, any of the networks in the flow part may be used for this purpose. For example, one or more of the flow decoder, hyper decoder and/or hyper hyper decoder may be configured to produce additional output tensors x_{t,mini} and m_occ and configured to use those outputs in the manner as described above. In general terms this may comprise introducing one or more output layers to the networks, and the tensors output by those layers (i.e. x_{t,mini} and m_occ) may be passed, as applicable, to the function that performs the correction of x_{t-1,w}, namely:
[0557] x_{t-1,w,corr} = m_occ * x_{t-1,w} + (1 - m_occ) * x_{t,mini}
[0558] Because the use of these new outputs directly affects the bitrate and/or distortion during a forward pass, the weights of the networks of the flow decoder, hyper decoder and/or hyper hyper decoder are updated during training so that the computed values for the x_{t,mini} and m_occ tensors minimise the bitrate and/or distortion of the rate distortion loss function. When the pixel values of x_{t,mini} and m_occ are monitored during training, it can be seen that the networks have a tendency to learn to use these outputs in the manner as described above, that is, as a mechanism for correcting poorly warped reference frames arising from the flow network's failure modes.
[0559] For example, it is possible to visualise x_{t,mini} and m_occ in a plot which shows that x_{t,mini} is substantially empty at pixel coordinates where there is no or low motion (where empty means a grey pixel or some other colour that is the same for all pixels), allowing for efficient compression using very few bits, while at the same time containing a representation of x_t wherever there are non-motion-compensatable changes between x_t and x_{t-1}. Further, m_occ ends up as a mask that covers the empty pixel parts of the frame, but lets through non-motion-compensatable change parts. During training, the networks will learn to use the new outputs as a mechanism to correct poor flow estimation as described above by predicting values for these outputs based on the given input frames that produce this effect.
[0560] Given that a purpose of x_{t,mini} is to be a source of pixels that can be used to correct "bad" pixels in x_{t-1,w}, where m_occ specifies where the correction is to be made, the inventors have realised that it is not necessary for the generation of x_{t,mini} and/or m_occ to be in the same higher resolution as x_t and x_{t-1}. This is at least because the residual part 210 is effective at compensating for poor resolution artefacts as long as the pixels are approximately correct, so the generation of x_{t,mini} and/or m_occ does not have to be perfectly at full resolution. Thus x_{t,mini} and/or m_occ may be generated efficiently in a downsampled, lower resolution space relative to x_t and x_{t-1} (e.g. h/8, w/8 or h/16, w/16 or some other resolution that is lower than the h and w resolutions of x_t and x_{t-1}), and then upsampled to the resolution of x_{t-1,w} before performing the correction.
[0562] Alternatively, the inventors have found that there are advantages to generating m_occ at a higher resolution, for example full resolution. This is because m_occ ultimately determines where the corrections are made to x_{t-1,w} and because m_occ can produce high-entropy shapes very cheaply (few bits), due to such shapes sharing boundaries with what is being transmitted in the optical flow information, allowing the mutual information between these to be exploited. This in turn means that x_{t,mini} can spend fewer bits on making accurate shapes and more bits on accurate colours. Thus, generating m_occ even at full resolution is relatively cheap and facilitates more accurate and better corrections to x_{t-1,w}.
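By way of non-limiting illustration only, the low-resolution option described above may be sketched as follows: a miniframe and a mask are held at a reduced resolution, upsampled to the resolution of the warped frame, and then used for the correction. The description does not specify the upsampling method; nearest-neighbour repetition is assumed here purely for simplicity, and the function names and nested-list layout are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Repeat every value `factor` times along both axes of a 2-D nested list."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def correct_from_low_res(x_warped, x_mini_lr, m_occ_lr, factor):
    """Upsample the low-resolution miniframe and mask, then blend per pixel."""
    x_mini = upsample_nearest(x_mini_lr, factor)
    m_occ = upsample_nearest(m_occ_lr, factor)
    if len(x_mini) != len(x_warped) or len(x_mini[0]) != len(x_warped[0]):
        raise ValueError("upsampled tensors must match the warped frame size")
    return [
        [m * w + (1.0 - m) * c for w, c, m in zip(row_w, row_c, row_m)]
        for row_w, row_c, row_m in zip(x_warped, x_mini, m_occ)
    ]
```

The same helper could be applied to the miniframe alone when the mask is instead generated at full resolution.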
[0563] More details will now be described of an illustrative training implementation with reference to the pseudocode below.
[0564] Algorithm 2 Training with mini frames for warping correction
[0565] Inputs: training dataset X, learning rate η, network architecture f_θ comprising flow part g_φ and residual part h_ψ, number of epochs E.
[0569] Initialize network parameters θ, φ, ψ
[0570] for epoch = 1 to E do
[0571]     for each batch in X do
[0572]         Perform forward pass to compute x̂_t = f_θ(x̂_{t-1}, x_t), including: predicting m_occ and x_{t,mini} and flow with g_φ(x̂_{t-1}, x_t); warping x̂_{t-1} with the predicted flow to produce x_{t-1,w}; using m_occ and x_{t,mini} to produce x_{t-1,w,corr}; and finally computing x̂_t = h_ψ(x_{t-1,w,corr}, x_t)
[0581]         Compute Loss = D + λR
[0582]         Backward pass to compute gradients ∇_{θ,φ,ψ} Loss
[0583]         Update parameters with optimizer O: (θ, φ, ψ) ← O((θ, φ, ψ), ∇_{θ,φ,ψ} Loss)
[0586]     end for
[0587] end for
[0588] Optionally evaluate on validation set
[0589] That is, a training data set X, a learning rate η, and a number of training steps or epochs E are selected, and the network architecture of f_θ that comprises a flow part g_φ and a residual part h_ψ is defined, for example as shown in Figure 3. The network parameters θ, φ, ψ are initialised (e.g. randomly) and then the training loop is started. For each batch in the training data X, a forward pass is performed through f_θ. As described above, this comprises predicting optical flow information with the flow part g_φ of the architecture, using the optical flow information to warp a previously decoded frame x̂_{t-1} to produce x_{t-1,w}, as well as using the flow part g_φ to predict a mask m_occ and associated representation of the current frame x_{t,mini}. The mask m_occ is then used to combine x_{t,mini} and x_{t-1,w} to produce x_{t-1,w,corr}. This combined, warped previously decoded frame x_{t-1,w,corr} is then fed into the residual part of the architecture together with the current frame x_t to compute x̂_t = h_ψ(x_{t-1,w,corr}, x_t). The total Loss is then calculated by combining a distortion term D and a rate term R, and any other loss terms (not shown). The backwards pass is then performed to compute gradients based on the loss, and the parameters θ, φ, ψ are optimised using the optimiser, such as stochastic gradient descent (SGD) or some other known optimiser.
[0599] Optionally, a validation loss can be calculated. The learning rate, batch size, and / or number of epochs may be optimised during training, for example using a learning rate scheduler or some other hyperparameter optimisation method. More generally, the hyperparameters may be optimised experimentally.
[0600] One issue that can arise in training is that the networks of f_θ comprising flow part g_φ and residual part h_ψ have a tendency to "cheat" by finding a local rather than global loss minimum, learning to produce occlusion masks m_occ whose values completely mask away x_{t,mini} for all input frames (i.e. resulting in x_{t-1,w,corr} = x_{t-1,w}). This means that the benefits of this architectural mechanism, acting as a correction that compensates for the effects of bad flows and bad warps, are not realised. To prevent this, an annealing or regularisation schedule may be introduced that gradually increases the effect that m_occ can have on the overall loss by changing which loss terms contribute to the overall loss, and/or by using supervised learning using a teacher network that has been trained to produce high quality x_{t,mini} and m_occ.
[0603] For example, assume for the sake of argument that the loss function comprises, in addition to the usual rate and distortion terms, the following further loss term M:
[0604] M = mse(x_{t-1,w,corr}, x_t) + mse(m_occ, m_teacher)
[0606] where mse is a function that estimates a mean square error value between its input arguments, where m_teacher is computed by a teacher network, and where m_occ and x_{t-1,w,corr} (the reference frame corrected using x_{t,mini}) are computed by the networks being trained, i.e. by f_θ with flow part g_φ and residual part h_ψ. The teacher network may comprise a network having the same or similar architecture as f_θ that comprises a flow part g_φ and a residual part h_ψ but which has been trained using a training scheme that allows it to produce high-quality occlusion masks and miniframes, but not necessarily in a way that makes it feasible for end-device deployment. For example, the teacher's architecture may be similar in structure to the student's but much larger than what could be run on-device, and the unsupervised training schedule may take much longer than is feasible for iterative development. By way of non-limiting example, if the student training schedule has 200,000 training steps, then the teacher training schedule may have 1,000,000 or more steps.
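By way of non-limiting illustration only, the further loss term M described above may be sketched in plain Python. Tensors are represented as flat lists of floats, and the function names `mse` and `teacher_term` are hypothetical; in practice these values would be computed on framework tensors so that gradients can be propagated.

```python
def mse(a, b):
    """Mean square error between two equally sized flat lists."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def teacher_term(x_corr, x_t, m_occ, m_teacher):
    """M = mse(x_corr, x_t) + mse(m_occ, m_teacher).

    The first term rewards a corrected warped frame close to the current
    frame; the second pulls the student's mask towards the teacher's mask.
    """
    return mse(x_corr, x_t) + mse(m_occ, m_teacher)
```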
[0608] During the forward pass, the new overall loss may thus be computed as:
[0609] Loss = D + λR + M
[0612] that is, Loss = D + λR + mse(x_{t-1,w,corr}, x_t) + mse(m_occ, m_teacher)
[0613] However, at the start of training there is a conflict between the gradients from mse(x_{t-1,w,corr}, x_t) and mse(m_occ, m_teacher). The miniframe x_{t,mini} (used to produce x_{t-1,w,corr}) is at this point of poor quality, so mse(x_{t-1,w,corr}, x_t) penalises the occlusion mask m_occ for being anything other than all 1s (i.e. applying a full mask and thus not using the poor quality miniframe x_{t,mini} at all). On the other hand, the teacher's occlusion mask m_teacher generally has more 0 areas (i.e. using the miniframe x_{t,mini} more) than the student because it assumes the teacher's high-quality miniframe x_{t,mini} is to be used, and so the term mse(m_occ, m_teacher) encourages those areas of the student's mask to be 0. To ensure that the mask is learned efficiently, gradients can be prevented from flowing from mse(x_{t-1,w,corr}, x_t) to the occlusion mask m_occ at the start of training and only gradually introduced with an annealing or regularisation schedule. Two ways to accomplish this are to detach the mask in the function that produces the warped reference frame for this part of training, or to use the teacher's occlusion mask m_teacher. The latter option is preferred as it performs slightly better, but both options are possible.
[0617] One reason the latter option is preferred is that the student occlusion mask m_occ quickly becomes almost entirely all 1s (i.e. operating to fully mask out the miniframe) once the supervised training is introduced. However, the only gradients that the miniframe x_{t,mini} receives in the initial training are through mse(x_{t-1,w,corr}, x_t), so if the all-1 student mask m_occ is used then the miniframe x_{t,mini} effectively sits out training until the student mask improves. This leads to undertrained miniframes x_{t,mini}, which is detrimental to performance. Using the teacher's mask m_teacher in the warping operation means that the miniframe x_{t,mini} is used a lot more (i.e. through the more frequent 0 areas of the teacher mask m_teacher), and so it receives useful gradients throughout training.
[0619] Returning to the annealing or regularisation schedule, we define the overall loss as: Loss = D + λR + M_anneal
[0620] where:
[0621] M_anneal = α * mse(x_{t-1,w,corr}, x_t) + (1 - α) * mse(x_{t-1,w,corr_occ_detach}, x_t) + β * mse(m_occ, m_teacher)
[0623] where x_{t-1,w,corr_occ_detach} is the warped reference frame corrected using the miniframe x_{t,mini} and the mask m_occ detached, so that gradients are prevented from flowing through mse(x_{t-1,w,corr}, x_t) to the occlusion mask m_occ, and where x_{t-1,w,corr} is the warped reference frame corrected using the miniframe x_{t,mini} and the mask m_occ without the detach operation, so that gradients can flow through to the occlusion mask m_occ.
[0626] It will be appreciated that gradients on the occlusion mask m_occ are thus of the form:
[0627] ∂L/∂m_occ = ∂D/∂m_occ + λ * ∂R/∂m_occ + α * ∂mse(x_{t-1,w,corr}, x_t)/∂m_occ + β * ∂mse(m_occ, m_teacher)/∂m_occ
[0628] which facilitates the gradual toggling in of the gradients from mse(x_{t-1,w,corr}, x_t) using the regularisation or annealing parameters α and β.
[0631] At the start of the training schedule, α = 0, indicating that gradients may not propagate through to the mask from mse(x_{t-1,w,corr}, x_t), thereby allowing the networks to focus on learning x_{t,mini} and m_occ as separate tasks. At the end of the training schedule, α = 1, indicating that gradients may propagate fully through to the mask from mse(x_{t-1,w,corr}, x_t), better aligning the tasks of learning the two tensors (i.e. to minimise mse(x_{t-1,w,corr}, x_t)). The value of β may be set to 0 at this point, so that the occlusion mask m_occ fully adapts to the miniframes x_{t,mini} that the networks produce, rather than receiving encouragement from mse(m_occ, m_teacher) to match masks that assume the teacher's miniframes. The value of α may gradually move from 0 to 1 following some predefined schedule based on e.g. the number of training steps or some other criterion, for example how close the overall loss is to converging on some predetermined loss value, or some other requirement.
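By way of non-limiting illustration only, one possible annealing schedule may be sketched as follows. The linear ramp, the step-based switch of β to zero, the weighting of the detached term by (1 - α), and all function and parameter names are assumptions made for this sketch; the description states only that α moves from 0 to 1 on some predefined schedule and that β may be set to 0 at the end.

```python
def anneal_alpha(step, ramp_start, ramp_end):
    """Linear ramp of alpha from 0 (before ramp_start) to 1 (after ramp_end)."""
    if ramp_end <= ramp_start:
        raise ValueError("ramp_end must be greater than ramp_start")
    frac = (step - ramp_start) / (ramp_end - ramp_start)
    return min(1.0, max(0.0, frac))


def anneal_beta(step, ramp_end, beta_initial=1.0):
    """Keep the teacher term active until alpha reaches 1, then switch it off."""
    return beta_initial if step < ramp_end else 0.0


def m_anneal(alpha, beta, mse_corr, mse_corr_detached, mse_mask):
    """Weighted sum of the three mse values making up the annealed term.

    mse_corr and mse_corr_detached have the same numerical value in a forward
    pass; they differ only in whether gradients reach the occlusion mask, which
    a plain-Python sketch cannot show and a tensor framework would handle.
    """
    return alpha * mse_corr + (1.0 - alpha) * mse_corr_detached + beta * mse_mask
```

For instance, with `ramp_start=1000` and `ramp_end=5000`, α is 0 for the first 1000 steps, 0.5 at step 3000, and 1 from step 5000 onwards, at which point β becomes 0.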
[0634] For completeness, it is noted that an annealing or regularisation schedule such as that illustrated above is particularly effective at facilitating the use of m_occ and x_{t,mini} in combination with the encoder masks of concept 2 above. This is because the inventors have found that using m_occ and x_{t,mini} together with an encoder mask makes it easier for the networks to "cheat" during training, as both the encoder mask and the occlusion mask m_occ are synergistically used by the networks to control what information is sent by the encode side and/or masked out on the decode side. When used together, a local minimum can be reached in training where the networks mask away x_{t,mini} for all input frames on the decode side and the encoder mask then learns to mask on the encode side any information that would be used for
[0636] As will be appreciated, whilst the examples described above use the flow hyper decoder 1211 and flow base decoder 1210 to predict m_occ and x_{t,mini}, this is illustrative only. At the most general level, the decoder networks are correcting the warped previously decoded frame x_{t-1,w} using some reference that contains useful information to perform that correction. In the examples above, this additional functionality is bootstrapped onto the networks that primarily serve other purposes (i.e. the networks of the flow part of the pipeline). However, it is additionally or alternatively envisaged that dedicated networks may be introduced into the pipeline whose primary purpose is to produce the reference information to be used for correcting x_{t-1,w}.
[0637] Returning now to Figure 12, it can be seen that the warped reference image that is being passed to the residual part of the pipeline is the corrected warped representation of the reference image xt−1,w,corr. For the sake of avoiding crowded notation, this corrected warped representation of the reference image will simply be referred to herein as xt-1,w.
[0638] Figure 13 illustrates the components of a residual part of a pipeline comprising an encoder neural network 1301 (such as the encoder neural network 211 of Figure 3) and a decoder neural network 1302 (such as the decoder neural network 213 of Figure 3).
[0639] The encoder neural network 1301 in Figure 13 may be considered as a single neural network or as a plurality of neural network components made up of a residual encoder neural network 1304, a residual hyper encoder neural network 1305, a residual hyper hyper encoder neural network 1306, a residual decoder neural network 1307, a residual hyper decoder neural network 1308 and a residual hyper hyper decoder neural network 1309, together with corresponding entropy encoding modules 1310, 1311, 1312 that together use the outputs of the decoders 1307, 1308, 1309 to entropy encode the latent representation outputs of the encoders 1304, 1305, 1306 into a bitstream. As shown in Figure 13, the latent representation outputs of the encoders are based on the current image x_t and the warped representation of the (previously decoded) reference image x_{t-1,w} produced by the flow part of the pipeline.
[0640] The residual decoder neural network 1302 in Figure 13 may be considered as a single neural network or as a plurality of neural network components made up of a residual decoder 1307, a residual hyper decoder 1308 and a residual hyper hyper decoder 1309, with corresponding entropy decoding modules 1313, 1314, 1315 that together use the bitstreams and respective outputs of each other to entropy decode the bitstream and reconstruct the various latent representations into an output image x̂_t which is a lossy approximation of the input current image x_t fed into the residual encoder 1304. Note that the operation of one or more of the residual decoder 1307, the residual hyper decoder 1308 and the residual hyper hyper decoder 1309 of the residual decoder neural network 1302 may be conditioned on the warped representation of the previously decoded image x_{t-1,w} produced by the flow part of the pipeline, for example by combining its tensor into an input or output of one or more layers of the networks.
[0641] Note also for completeness that the warped representation of the previously decoded reference image x_{t-1,w} may be produced from a single image or from multiple images of the image sequence. This may be in either ground truth form x_{t-1} and/or in reconstructed form x̂_{t-1}. This need not be limited to a strictly temporally "past" frame either. For example, in a pipeline where B-frames are being used, the previously decoded image may be a "future" frame in the image sequence that has been encoded and decoded before the current frame, as would be applicable in a random access encoding and decoding scheme. Thus, the previously decoded image may be said to be a reference image that may be selected from or produced from one or more images of the image sequence being encoded and decoded. It is in particular envisaged that a reference frame predictor neural network comprising a plurality of convolution and/or activation layers may be used to receive as input the one or more previously decoded images, and to output a tensor that combines all these images into a single composite image. This composite image may then be used as a reference image against which to compare the current image when producing optical flow information and/or residual information indicative of a difference between the current frame and the one or more previously decoded images. Thus, the term "past" as used herein may refer not only to a "past" frame (in the sense of a temporal order of a sequence of images) but also to a future frame (as in B-frames). Thus throughout all of the present description, the term x_{t-1,w} can mean x_{t-n,w}, and the term f̂_{t-1} can mean f̂_{t-m}, where n and m can be both positive and negative, and/or relate to multiple images of a sequence.
[0642] The flow part and residual part of the pipeline shown in Figures 12 and 13 may be combined into a P- and/or B-frame module, which may be combined with the I-frame part of the pipeline, which together may be used as an end-to-end, AI-based I-frame, P- and/or B-frame video compression pipeline. That is, the I-frame part may compress, transmit, and reconstruct a first frame of an image sequence as an I-frame, and then a plurality of other frames of the image sequence may be compressed, transmitted, and reconstructed as P- and/or B-frames, together forming a group of pictures (GOP), which in turn may be subdivided into miniGOPs.
[0644] We now return to mini GOP control, that is, setting mini GOP size(s) for an image sequence based on information derived from at least two images of the image sequence. One non-limiting example of such information may be the optical flow information produced by the flow part of the pipeline, which is indicative of a difference between the at least two images. Another non-limiting example of such information may be the occlusion mask m_occ, also produced by the flow part of the pipeline, which, as described above, is indicative of non-motion-compensatable differences between the at least two images. Other types of difference information may also or alternatively be used including, but not limited to, an MSE score, a MAD-based score, and so on. The different types of difference information may be used alone or in combination.
[0645] An example implementation using the optical flow information and the occlusion mask to control the mini-GOP size will be described in more detail below.
[0646] First, it has been found to be advantageous in AI-based compression to start with the assumption that a small miniGOP size is optimal, for example a starting miniGOP size of 1 (each mini-GOP contains one P-frame). One reason for this is that very high motion and non-motion-compensatable differences are particularly problematic for AI-based compression pipelines and, accordingly, there are many cases where a miniGOP size of 1 will be optimal. If the starting assumption is instead that the miniGOP size is 16 or 32, and this turns out to be too high, then the additional compute that may be performed to work down towards an optimal, smaller miniGOP size may significantly slow down runtime.
[0647] Once the starting miniGOP size has been set, two images of an image sequence are selected, for example the images at frame index 0 and frame index 1, and a difference between them is estimated. In this implementation, the difference comprises the optical flow information tensor and the occlusion mask tensor. These can be indicative of high motion and/or non-motion-compensatable differences.
[0648] Next, one or more statistical values associated with each of these tensors are calculated. The statistical value(s) are indicative of a size or magnitude of one or more values of the tensor(s). For example, norms of each of the slices of the optical flow information tensor and of each of the slices of the occlusion mask tensor are calculated. These may then be compared to a threshold value. The threshold value against which the optical flow information norms may be compared, and the threshold value against which the occlusion mask norms may be compared, may be set based on empirical experimentation, may be learned during training, or some combination thereof (for example an approximate threshold may be initialized empirically to be close to a best guess value and then made learnable during training of the rest of the pipeline).
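By way of non-limiting illustration only, the per-slice norm and threshold comparison described above may be sketched as follows. The choice of the L2 norm, the nested-list tensor layout (slices, rows, values), and the function names are assumptions made for this sketch. It is further assumed that the second tensor is expressed so that larger values indicate more occlusion (for example 1 - m_occ under the convention in which m_occ = 1 keeps the warped pixel).

```python
import math


def slice_norms(tensor):
    """L2 norm of each 2-D slice of a tensor stored as nested lists."""
    return [math.sqrt(sum(v * v for row in sl for v in row)) for sl in tensor]


def exceeds_threshold(flow, occlusion, flow_threshold, occlusion_threshold):
    """True if any flow slice norm or any occlusion slice norm is too large."""
    return (
        any(n > flow_threshold for n in slice_norms(flow))
        or any(n > occlusion_threshold for n in slice_norms(occlusion))
    )
```

A `True` result corresponds to the high-motion or non-motion-compensatable case, in which the miniGOP is not extended.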
[0650] If any of the norms exceed the threshold value(s), then it indicates that there is high motion and / or non-motion-compensatable differences between the images at frame index 0 and at frame index 1 and accordingly it indicates that even the starting miniGOP size of 1 is not optimal, and instead, the image at frame index 1 may be transmitted as an I-frame.
[0651] If none of the norms exceed the threshold value(s), then it indicates that the difference between the two images is not especially great and accordingly there are likely to be bitrate gains that can be made without negatively impacting distortion scores by increasing the miniGOP size from 1 to a higher number. In this example it may be doubled to 2.
[0652] Next, the image at frame index 0 is compared to the image at frame index 2 in the same way. If any of the norms exceed the threshold value(s), the miniGOP size of 2 is too large and accordingly the miniGOP size is left at its starting size of 1. If none of the norms exceed the threshold value(s), the miniGOP size is increased from 2 to a higher number. In this example it may be doubled to 4.
[0653] This process is then repeated until at least one of the norms exceeds the threshold(s), or until some predefined maximum miniGOP size is reached. Once at least one of the norms exceeds the threshold(s), then the miniGOP size from the previous difference estimation where the threshold was not exceeded is selected as the optimal miniGOP size for the images of the sequence up to the frame index that has been checked. The images up to that frame index may then be assigned to a current miniGOP with the image at the index at the end of the miniGOP being assigned to the miniGOP as a P-frame, and the frames in between the first frame of the miniGOP and the end P-frame being assigned to the miniGOP as B-frames. The images of the miniGOP are then encoded, transmitted and decoded. The information indicating the miniGOP size, the type of frame (I-, P-, B-) the image is to be encoded as, and / or its reference frame dependenc(ies) in a hierarchical structure (learned or hardcoded) may be transmitted in the bitstream as meta-information.
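By way of non-limiting illustration only, the assignment of frame types within a miniGOP may be sketched as follows. The sketch assumes that the miniGOP covers the frames after a reference (anchor) frame up to and including the end frame, and that every frame before the end frame is treated as a B-frame; the function name and the dictionary return type are hypothetical.

```python
def assign_frame_types(anchor_idx, end_idx):
    """Map each frame index in (anchor_idx, end_idx] to a frame type.

    The end frame becomes a P-frame and the frames before it become B-frames.
    """
    if end_idx <= anchor_idx:
        raise ValueError("end_idx must be greater than anchor_idx")
    types = {idx: "B" for idx in range(anchor_idx + 1, end_idx)}
    types[end_idx] = "P"
    return types
```

A mapping of this kind is one form the frame-type meta-information mentioned above could take before being written to the bitstream.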
[0654] The above methodology in generalised form is set out in more detail in the pseudocode below, where N is the multiple by which the miniGOP size is increased after each loop, starting_size is the starting miniGOP size which is gradually increased, max_minigop_size is a predefined maximum miniGOP size, delta is a function, such as the networks of the flow part of the pipeline, that produces difference information on which the miniGOP size is based, such as the optical flow information and the occlusion mask, and decision_criterion is a function, such as a threshold function, which determines whether to increase the miniGOP size or not:
[0656] Algorithm 3 MiniGOP Control
[0657] Input: end_of_gop_idx, next_frame_idx, N, anchor_idx, minigop_size
[0659] not_at_end_of_gop ← (next_frame_idx < end_of_gop_idx)
[0660] while not_at_end_of_gop and minigop_size < max_minigop_size do
[0661]     new_minigop_end_idx ← next_frame_idx + (N - 1) * minigop_size
[0662]     if new_minigop_end_idx > end_of_gop_idx then
[0663]         not_at_end_of_gop ← false
[0664]         new_minigop_end_idx ← end_of_gop_idx
[0665]     end if
[0667]     (flow, mask) ← delta(anchor_idx, new_minigop_end_idx)
[0668]     should_extend_minigop ← decision_criterion(flow, mask)
[0669]     if should_extend_minigop then
[0670]         minigop_size ← minigop_size × N
[0671]         next_frame_idx ← new_minigop_end_idx
[0672]     else
[0673]         break
[0673]     end if
[0674] end while
[0675] Note that multiplication by N is illustrative only; other increases, such as increasing by some integer such as 1 or 2, are also envisaged.
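By way of non-limiting illustration only, the miniGOP control loop may be sketched in Python as follows. The callables `delta` and `decision_criterion` stand in for the flow-part networks and the threshold function respectively and are supplied by the caller; their signatures, the default values and the returned tuple are assumptions made for this sketch.

```python
def choose_minigop_end(anchor_idx, next_frame_idx, end_of_gop_idx,
                       delta, decision_criterion,
                       n=2, starting_size=1, max_minigop_size=32):
    """Grow the miniGOP geometrically while the decision criterion allows it.

    Returns (minigop_size, end_idx), where end_idx is the last frame index
    that passed the criterion (or the initial next_frame_idx if none did).
    """
    minigop_size = starting_size
    not_at_end_of_gop = next_frame_idx < end_of_gop_idx
    while not_at_end_of_gop and minigop_size < max_minigop_size:
        new_end_idx = next_frame_idx + (n - 1) * minigop_size
        if new_end_idx > end_of_gop_idx:
            # Clamp to the end of the GOP and make this the final check.
            not_at_end_of_gop = False
            new_end_idx = end_of_gop_idx
        difference = delta(anchor_idx, new_end_idx)
        if decision_criterion(difference):
            minigop_size *= n
            next_frame_idx = new_end_idx
        else:
            break
    return minigop_size, next_frame_idx
```

As a toy usage, taking `delta` as the frame distance and accepting distances of at most 4, a call with `anchor_idx=0`, `next_frame_idx=1` and `end_of_gop_idx=32` checks frames 2, 4 and 8 in turn and stops at frame 4. When the candidate end index is clamped to the end of the GOP, the returned end index rather than the size should be taken as authoritative.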
[0676] Note that the approach of starting from a very small starting miniGOP size and increasing up from there can result in long sequences with a miniGOP size of 1 (effectively resulting in very long sequences of only P-frames) when motion in an image sequence is very large. It has been found that AI-based compression pipelines can outperform the rate distortion performance of traditional codecs when compressing these types of frame-type sequences and accordingly can achieve comparable distortion performance at lower bitrates than traditional codecs by setting the compression quality of the P-frames (e.g. using a gain unit approach such as that described in Liu, Jinming, et al. "Rate-distortion-cognition controllable versatile neural image compression." European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2024.) to be lower when such long sequences are detected. One approach envisaged for this is to check whether an image is at the end of a run of P-frames (so far) of even length and, if so, to reduce its quality level by 1.
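By way of non-limiting illustration only, the even-run check described above may be sketched as follows. The sketch assumes that the reduction is applied relative to a fixed base quality level rather than cumulatively, and that quality levels are integers with a lower bound; the function name and the `min_quality` parameter are hypothetical.

```python
def quality_levels(frame_types, base_quality, min_quality=0):
    """Return one quality level per frame.

    A P-frame that ends a run of P-frames (so far) of even length is given
    base_quality - 1; every other frame keeps base_quality.
    """
    levels = []
    run_length = 0
    for frame_type in frame_types:
        run_length = run_length + 1 if frame_type == "P" else 0
        quality = base_quality
        if frame_type == "P" and run_length % 2 == 0:
            quality = max(min_quality, base_quality - 1)
        levels.append(quality)
    return levels
```

For the frame-type sequence I-P-P-P-P with a base quality of 3, the second and fourth P-frames receive quality level 2 and all other frames keep level 3.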
[0678] It will be appreciated that the use of optical flow information and/or the occlusion mask for mini-GOP length determination is exemplary only. Other methods of identifying motion, or indeed other differences between at least two frames of a sequence on which miniGOP size may be based, are also envisaged. For example, a lightweight MSE difference calculation, one or more cost volume calculations, and/or any other method of difference estimation may also be used to set the miniGOP size. The mini-GOP control methods described in this section may be used in a standalone manner in an AI-based compression pipeline, and/or may be used together with the other concepts described herein. For example, if used together with the methods of concept 1, the threshold may be a learnable
[0679] Concept 3: Hierarchy sequence specialisation
[0680] As described in the introductory sections above, AI-based compression pipelines may use temporal and spatial reference frame dependency or hierarchy schemes, which may be learned as in concept 1, or predetermined based on some heuristics and/or frame statistics as in concept 2, or indeed naively hardcoded for a given frame sequence. Using temporal and spatial reference frame dependency or hierarchy schemes in AI-based compression presents a number of synergistic opportunities to improve one or both of rate distortion performance and runtime of AI-based compression pipelines by virtue of the ability to specialize the networks of one or more parts of the pipeline based on what type of frame the frame is being compressed as, its quality level, and/or where it is temporally in the reference frame hierarchy being used.
[0681] The rate distortion performance of Al-based compression pipelines is based on the weights and / or biases of the networks which are learned during training by minimizing a rate distortion loss function for a given training data set. Networks that are trained on only one type of scene (e.g. low motion, natural landscape scenes) will often perform very well on such scene types but perform poorly on another scene ty pe not included in the training data (e g. high motion, video game scenes) and so on. Including a mix of all scene types on training data can help to achieve acceptable overall performance and generalization to all such scene ty pes but this can come at the cost of reduced performance on each of the individual scene types. The same considerations apply to frame types, frame quality level, and where the frame is in the L1T2 / W0
[0682] temporal reference frame hierarchy. For example, if the networks of the flow-residual part of the pipeline are trained only on P-frames that appear immediately after an I-frame in a frame type sequence of I-P-I-P-I-P-..., then the networks of the flow-residual part of the pipeline will become specialized at encoding and decoding P- frames immediately after I-frames with very good rate distortion performance. However, this comes at the cost of having reduced performance on other frame types and frame positions in the hierarchy. For example, if such networks were used for P-frames in the frame type sequence I-P-P-P-P-P-P-P-... then the networks will likely perform very well on the first P-frame but start to perform very poorly as the temporal distance from the 1-frame increases as artefacts introduced into each consecutive P-frame begin to propagate and become worse as the sequence progresses. This can be conceptually understood as the networks being used to always receiving low distortion I-frames as reference frames and thus being used to not needing to use many bits for the P-frame because the 1-frame reference frame contains a lot of high quality information that can be re-used. When the networks receive increasingly higher distortion P-frames, they treat them as high quality I-frames and thus re-use a lot of information, even if they contain increasingly worse artefacts. Similarly, if such networks are used to encode P- and B-frames in the sequence I-P-B-B-B-P-B-B-B-..., the networks do not generalise well to using reference frames containing bidirectional reference frame information as the only reference frame they have seen in training is an immediately preceding 1-frame. One approach to help address this, as was the case with the scene type problem, is that the training data may comprise a mix of different frame types, qualities, and positions in different temporal and spatial reference frame dependency and hierarchy schemes. 
This allows the networks of the flow-residual part of the pipeline to generalise well enough to a wide variety of scenarios they will likely face in inference. But, as was the case with the scene types, this ability to generalise comes at the cost of lower performance on each individual frame type, quality and each individual position in different temporal reference frame dependency and hierarchy schemes.
[0683] The present inventors have realised that the ability of one or more networks of AI-based compression pipelines to specialise on not just scene type, but also on frame type, quality and position in different temporal reference frame dependency and hierarchy schemes, allows AI-based compression pipelines to achieve overall higher rate distortion performance compared to using networks that generalise well but are not specialised. In very general terms, during inference one or more of the networks may be provided with different weight sets and / or architecture configurations based on what the frame type is and / or where the frame is in the hierarchy scheme (e.g. is it an I-, P-, or B-frame, is it a root node of the mini GOP, is it a leaf node of the mini GOP in that it does not have any frames that use it as a reference frame, is it a high quality or low quality frame, and so on).
[0684] Each of the different weight sets and / or architecture configurations is one that was learned by minimising a rate distortion loss function on training data with a higher proportion (up to 100%) of the frame type, frame quality and / or hierarchy position that that weight set and / or architecture configuration is intended to specialise in. During inference, each frame is encoded and decoded by the various encoder and / or decoder networks with a flag indicating one or more of its frame type, quality and / or where in the hierarchy the frame is. Based on the flag, the weight sets and / or network configurations that have been specialised on that frame type and / or that hierarchy position are used in the pipeline to encode and decode that frame. This approach has a very minor increase in memory overhead, as the encoder side and / or decoder side networks that have multiple weight sets and / or configurations available are deployed with these additional weight files and / or configuration files. However, this slight increase in memory overhead does not significantly increase runtime of the pipeline.
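By way of illustration only, the flag-based selection described above can be sketched as a simple lookup. The names `FrameFlag`, `SPECIALISED`, `GENERALISED` and `select_weight_set`, the string identifiers standing in for weight files, and the fallback to a generalised weight set when no specialised entry exists are all assumptions made for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FrameFlag:
    """Flag accompanying a frame through the pipeline (hypothetical layout)."""
    frame_type: str                # "I", "P" or "B"
    position: str                  # e.g. "root" or "leaf" in the mini GOP
    quality: Optional[str] = None  # carried with the flag, not used as a key here


# Hypothetical registry: the strings stand in for weight / configuration files.
SPECIALISED: Dict[Tuple[str, str], str] = {
    ("P", "root"): "residual_decoder_p_root",
    ("P", "leaf"): "residual_decoder_p_leaf",
    ("B", "leaf"): "residual_decoder_b_leaf",
}
GENERALISED = "residual_decoder_general"


def select_weight_set(flag: FrameFlag) -> str:
    """Return the specialised weight set for the flag, else the generalised one."""
    return SPECIALISED.get((flag.frame_type, flag.position), GENERALISED)


chosen = select_weight_set(FrameFlag(frame_type="B", position="leaf"))
fallback = select_weight_set(FrameFlag(frame_type="B", position="root"))
```

In this sketch only the selected identifier changes per frame; loading the corresponding weights into the network is left out.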
[0685] Indeed, because the specialised weight sets and / or configurations have better rate distortion performance than the generalised weight sets and / or configurations, the size and complexity of the networks using the specialised weight sets and / or configurations may be reduced (e.g. in terms of number of network layers, number of channels and / or spatial dimensions in the various input and output tensors of the layers and so on) to achieve comparable performance to the generalised weight sets and / or configurations. For example, if the one or more networks using the generalised weight sets and / or configurations comprise 5 conv relu blocks, then reducing this down to 3 conv relu blocks and specialising the network during training can achieve comparable rate distortion performance while running faster.
[0686] An illustrative example of this approach will now be described with reference to Figure 14.
[0687] Figure 14 illustratively shows a residual part of an AI-based compression pipeline corresponding to that of Figure 13. Like-numbered references refer to like-numbered elements. However, unlike in Figure 13, the residual part in Figure 14 also illustrates how a plurality 1401 of specialized weight sets and / or network configurations may be provided for the residual part (base) decoder, in addition to the generalised (base) residual decoder(s) 1307. The entropy decoding module 1313 entropy decodes the bitstream part that comprises the residual latent representation information using the entropy parameters produced by the residual hyper decoder 1308 and a flag indicating one or more of (i) what type of frame is being decoded and (ii) where in the frame dependency hierarchy the frame is (and optionally its quality level).
[0688] The weight set and / or network configuration that has been specialized on that type of frame and / or reference frame hierarchy position is selected based on the flag, and the decoder neural network runs using the selection to decode the residual latent representation, using the warped reference frame to produce the output image x_t as that frame type and / or in that position in the frame hierarchy. As will be appreciated, the flag may be provided as metadata in the bitstream. Alternatively, it may be predetermined using heuristics and / or statistics associated with the distribution of the latent representation, or using any other method.
[0689] Note that in this example, the residual hyper decoder 1308 and the residual hyper hyper decoder 1309 do not have specialised weight sets and / or network configurations. However, it is envisaged that such weight sets and / or configurations may be used. For example, the learned entropy parameters used for entropy encoding and decoding the residual hyper hyper latent representation may be different for each of the plurality 1401 of specialised weight sets and / or network configurations. This can be advantageous because the distributions of the hyper hyper latent representations associated with different frame types in different positions in the hierarchy are likely to have very different spatial variances. Accordingly, a hyper hyper network that shares a single set of entropy parameters may find it challenging to model the distributions of the hyper hyper latent representation well enough to achieve acceptable compression rates.
[0690] Note also that some parts of the plurality 1401 of specialised weight sets and / or network configurations may be shared. For example, if pre-processing is performed on the input latent representation (such as performing a Haar transform using one or more convolution operations), then these operations may be the same for each of the weight sets and / or configurations. In another example, if rate control is being implemented using a gain unit approach, for example such as is described in Cui, Ze, et al. "Asymmetric gained deep image compression with continuous rate adaptation." Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition, 2021, then the gain units may be shared across the plurality 1401 of specialised weight sets and / or network configurations. In another example, quantisation parameters associated with the layers (for example associated with quantising the weights of the layers, or the weight kernels, and so on) may be shared.
[0691] Also shown in Figure 14 are a plurality 1402 of tensors having learnable values associated with the residual hyper encoder 1305, and a plurality 1403 of tensors having learnable values associated with the residual hyper hyper encoder 1306. The tensors are combined (e.g. by element wise addition or multiplication) with one or more inputs and / or outputs of one or more layers of the residual hyper encoder 1305 networks or the residual hyper hyper encoder 1306 networks respectively.
[0692] For example, one or more of the tensors may be combined by way of element-wise addition with the residual latent representation that is input into the residual hyper encoder 1305 and / or with the residual hyper latent representation that is input into the residual hyper hyper encoder 1306.
[0693] During training, a unique tensor for each type of frame (e.g. P- or B-frame) in the training data and / or its place in a hierarchy (e.g. a first P-frame after an I-frame, a root node, a leaf node, and so on) is first initialised and then updated by an optimiser function after a backward pass, with a training objective of minimising the rate distortion loss function. Conceptually, these pluralities 1402, 1403 of tensors can be understood as an architectural mechanism that facilitates conditioning of the residual hyper encoder 1305 and residual hyper hyper encoder 1306 on the frame type and frame position in the hierarchy scheme, and that provides an extra degree of freedom to the residual hyper and hyper hyper networks to specialise on the different frame types and frame positions in the hierarchy scheme. That is, on the encode side, when a frame type and / or hierarchy position flag is provided, the corresponding tensor of the pluralities 1402, 1403 of tensors may be injected into the residual hyper encoder 1305 network and residual hyper hyper encoder 1306 network, thus modifying the values of the tensors where they were injected so as to become specialised to that specific frame type and / or hierarchy position.
[0694] Consider a toy example frame sequence I-P-B-B. A unique tensor may be learned for each of the P-frame, the first B-frame, and the second B-frame. When the corresponding flag indicating which frame is being encoded is provided with the frame, the corresponding tensor of the pluralities 1402, 1403 of tensors is selected and combined by way of element-wise addition with the residual hyper latent representation and hyper hyper latent representation respectively, thereby modifying their values so as to make the overall rate distortion performance of the pipeline when compressing the I-P-B-B sequence better than without the injection of the respective tensors.
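The toy I-P-B-B example can be sketched as follows, purely for illustration. The tensor values, the flag strings "P", "B1" and "B2", the flattened one-dimensional latent and the function name `inject` are hypothetical; in the described pipeline the tensors would be learned during training rather than written by hand.

```python
from typing import Dict, List

Tensor = List[float]  # flattened latent representation, for illustration only

# Hypothetical learned tensors, one per non-I frame of the toy I-P-B-B sequence.
LEARNED_TENSORS: Dict[str, Tensor] = {
    "P": [0.5, -0.25, 0.0, 0.125],
    "B1": [0.0, 0.25, -0.5, 0.25],
    "B2": [-0.125, 0.0, 0.25, -0.5],
}


def inject(latent: Tensor, flag: str) -> Tensor:
    """Combine the tensor selected by the flag with the latent by element-wise addition."""
    tensor = LEARNED_TENSORS[flag]
    if len(tensor) != len(latent):
        raise ValueError("learned tensor and latent must have the same shape")
    return [x + t for x, t in zip(latent, tensor)]


hyper_latent = [1.0, 2.0, 3.0, 4.0]
conditioned = inject(hyper_latent, "B1")
```

The same latent yields a different conditioned input for each flag, which is the extra degree of freedom the paragraph above describes.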
[0695] Figure 15 illustrates an additional or alternative method for specializing one or more networks of the pipeline on frame type and / or frame position in a frame hierarchy.
[0696] That is, Figure 15 illustrates the encoder side of a residual part of an AI-based compression pipeline, such as the encoder side of Figure 13. Like-numbered numerals refer to like elements, and are not repeated here. However, Figure 15 further introduces encoder-side learned mask(s) m_i, m_j, m_k that fully or partially mask some or all of the information in one or more of the residual latent representation, the residual hyper latent representation and / or the residual hyper hyper latent representation before it is transmitted.
[0697] In very general terms, this provides an architectural mechanism for the encoder-side networks to explicitly reduce the entropy of the distribution of the latent representation, hyper latent representation and / or hyper hyper latent representation by masking any portions that may not be used on the decode side of the pipeline in a meaningful way, thereby allowing the various latent representations to be entropy encoded into a smaller bitstream size.
[0698] For example, the encoder-side learned mask(s) m_i, m_j, m_k may comprise a tensor applied, by element wise multiplication, addition or other operation, to one or all of the residual latent representation, the residual hyper latent representation or the residual hyper hyper latent representation. The values of the mask tensors may be produced by a dedicated encoder mask neural network separate to the networks of the residual part of the pipeline, or they may be output directly by the existing neural networks thereof. In both cases, it is envisaged that the neural network(s) that produce the encoder-side learned mask are trained together with the other neural networks of the AI-based compression pipeline to allow the networks to learn how best to use the architectural mechanism to produce an encoder-side learned mask that best minimises whatever loss function is used during training (e.g. a rate distortion loss function).
[0699] In Figure 15, the mask(s) m_i, m_j, m_k are generated by a mask generator g(φ). The specific form of g(φ) is not important; for example, it may comprise a simple MLP, a convolutional neural network, or any other network with trainable parameters. In one example, g(φ) comprises an alternating sequence of a convolutional layer followed by an activation layer, all or some with learnable weights, and optionally one or more reshaping layers to ensure the output mask shape matches that of the residual latent representation, the residual hyper latent representation and / or the residual hyper hyper latent representation it is to be applied to.
[0700] Returning now to specialising on frame type and / or frame position in a frame hierarchy, Figure 15 further shows respective pluralities 1501, 1502, 1503 of learned tensors whose values are learned in the same way as those of Figure 14.
[0701] That is, during training, a unique tensor for each type of frame (e.g. P- or B-frame) in the training data and / or its place in a hierarchy (e.g. a first P-frame after an I-frame, a root node, a leaf node, and so on) is first initialised and then updated by an optimiser function after a backward pass, with a training objective of minimising the rate distortion loss function. Conceptually, these pluralities 1501, 1502, 1503 of tensors can be understood as an architectural mechanism that facilitates conditioning of the one or more learned masks on the frame type and frame position in the hierarchy scheme, and that provides an extra degree of freedom to allow the pipeline to specialise on the different frame types and frame positions in the hierarchy scheme.
[0702] That is, on the encode side, when a frame type and / or hierarchy position flag is provided, the corresponding tensor of the pluralities 1501, 1502, 1503 of tensors may be injected into one or more of the masks m_i, m_j, m_k, thus modifying the values of the tensors where they were injected so as to become specialised to that specific frame type and / or hierarchy position.
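A minimal sketch of a mask conditioned on frame type and hierarchy position follows. It assumes, for illustration only, that the mask generator outputs unbounded values ("logits"), that the learned tensor is injected by element-wise addition before a sigmoid, and that the resulting mask is applied by element-wise multiplication; the specification leaves all of these choices open. `MASK_TENSORS`, `sigmoid` and `masked_latent` are hypothetical names and the values are made up.

```python
import math
from typing import Dict, List

Tensor = List[float]  # flattened latent representation, for illustration only

# Hypothetical learned tensors, one per (frame type, hierarchy position) flag.
MASK_TENSORS: Dict[str, Tensor] = {
    "P_root": [2.0, 2.0, 0.0, 0.0],
    "B_leaf": [0.0, -4.0, -4.0, -4.0],
}


def sigmoid(value: float) -> float:
    """Squash a value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def masked_latent(latent: Tensor, mask_logits: Tensor, flag: str) -> Tensor:
    """Inject the flag's tensor into the mask, then apply the mask to the latent."""
    injected = [m + t for m, t in zip(mask_logits, MASK_TENSORS[flag])]
    mask = [sigmoid(v) for v in injected]          # soft mask in (0, 1)
    return [x * m for x, m in zip(latent, mask)]   # element-wise multiplication


latent = [1.0, 1.0, 1.0, 1.0]
logits = [0.0, 0.0, 0.0, 0.0]  # stand-in for the output of the mask generator
root_out = masked_latent(latent, logits, "P_root")
leaf_out = masked_latent(latent, logits, "B_leaf")
```

With these made-up values the leaf B-frame keeps far less of the latent than the root P-frame, which is the kind of per-position behaviour the learned tensors are intended to allow.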
[0703] It will be appreciated that specialisation on frame type and / or hierarchy position, quality level, and so on, in a mini GOP using the above three methods are intended to be illustrative only, and other approaches are also envisaged.
[0704] Each of the above approaches may be used in a standalone manner, or used together.
[0705] Concept 4: Quality level awareness
[0706] Rate control using gain units has already been described above in connection with concept 1. By way of reminder, a plurality of gain units may be introduced into the pipeline, for example at or around the input layers, that modify the values of the latent representation, e.g. by element wise multiplication with the values of the gain unit tensor, so as to transform them to make their distribution more efficiently entropy encodable. The gain unit tensors may be discrete but, in order to facilitate continuous rate or quality control, an interpolation function (e.g. a linear interpolation function or any other interpolation function that is constructed from the discrete gain units) may be applied to obtain gain unit values between the discrete gain units.
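As a sketch only, a linear interpolation function over discrete gain units might look as follows. Linear interpolation is just one of the options mentioned above, and the function names `interpolate_gain` and `apply_gain` and the gain values are hypothetical.

```python
from typing import List

GainUnit = List[float]  # one gain value per latent channel, for illustration


def interpolate_gain(gain_units: List[GainUnit], index: float) -> GainUnit:
    """Linearly interpolate between the two discrete gain units around `index`."""
    last = len(gain_units) - 1
    if last < 1 or not 0.0 <= index <= last:
        raise ValueError("need at least two gain units and 0 <= index <= last index")
    lower = min(int(index), last - 1)  # keeps lower + 1 in range when index == last
    frac = index - lower
    return [(1.0 - frac) * a + frac * b
            for a, b in zip(gain_units[lower], gain_units[lower + 1])]


def apply_gain(latent: List[float], gain: GainUnit) -> List[float]:
    """Modify the latent by element-wise multiplication with the gain unit values."""
    return [x * g for x, g in zip(latent, gain)]


units = [[1.0, 1.0], [2.0, 4.0], [3.0, 8.0]]  # three made-up discrete gain units
halfway = interpolate_gain(units, 0.5)
scaled = apply_gain([10.0, 10.0], halfway)
```

Passing an integer index returns the corresponding discrete gain unit unchanged, so the discrete levels remain available as a special case.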
[0707] As the complexity of frame hierarchy schemes increases, the number of gain units a given pipeline would be expected to have increases. That is, each possible temporal position, quality level and frame type in a hierarchy scheme may be given its own set of gain units, and this increases training difficulty: each gain unit is learned by sampling during training, so as the number of gain units increases, the number of training steps needed to ensure each gain unit is sampled a sufficient number of times also increases. Additionally, the more gain units there are, the harder it is for them to be monotonic. Further, the more gain units there are, the greater the codebase complexity, which can introduce technical debt and maintenance challenges.
[0708] One approach to overcoming this problem is to use learned gain unit offsets as described above in concept 1. That is, a predetermined number of discrete gain units are initialized and trained, and from these an interpolation is constructed that takes as input argument a gain unit index (i.e. how far up or down from the discrete gain unit values the interpolation has to go). A unique gain unit index offset is then learned for each of the different frame types and temporal positions in a hierarchy scheme.
[0709] When a frame in a given hierarchy position is fed into the pipeline with a flag indicating that position, the corresponding gain unit index offset is fed into the interpolation function to obtain the gain unit values associated with that hierarchy position, which in turn are used to modify the various latent representations as the frame progresses through the pipeline.
[0710] To illustrate this, consider a toy example with 5 gain units. Without an interpolation function, rate control is only possible at 5 discrete levels: 0, 1, 2, 3, 4, whereby the discrete gain unit values are selected by the pipeline responsive to receiving a flag indicating which gain unit to use. With an interpolation function, rate control becomes continuous in that any number (e.g. 0.1, 3.2, 4.7123..., and so on) can be fed into the interpolation function and it will produce gain unit values suitably interpolated between the discrete gain unit values. However, in order to suitably select what number to feed into the interpolation function to produce the desired rate control effect for a given frame type and / or hierarchy position, these values are learned. For example, if the highest quality frame type and position (e.g. a root node P-frame) uses the highest quality gain unit values, e.g. index 0 as a starting point, it may be that a value of 0.37 might produce better performance for a given rate, and so during training the offset of 0.37 is learned. During inference, when a root node P-frame is received, a look up table may be used to retrieve the offset of 0.37, which is fed into the interpolation function to retrieve the actual gain unit values that are to be used by the pipeline. In practice, the offsets may be implemented using a look up table mapping frame type and hierarchy position to the offset values. Optionally, some maximum ceiling gain unit index may be set, and if an offset pushes the value passed into the interpolation function beyond this ceiling, it is replaced with the corresponding end of range discrete gain unit index number, and the other discrete gain unit index numbers are rescaled accordingly. For example, if a maximum gain unit index number is 4, and an offset for a lowest quality frame is learned that pushes this number to 8, then this frame is assigned to the end of range discrete gain unit index number 4 (i.e. divided by 2), and the other gain unit index numbers and offsets are scaled down as well (i.e. divided by 2) to keep the gain unit indices plus learned offsets within the range of 0-4. The maximum may be set initially rather than being computed on the fly to ensure the rescaling factor is always consistent.
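The look up table and the rescaling step in the example above can be sketched as follows. This is one possible reading of the rescaling, in which every value is multiplied by the same factor so that the largest value lands on the maximum index. The tables `BASE_INDEX` and `LEARNED_OFFSET`, their keys and values (other than the 0.37, 4 and 8 taken from the example), and the function name `rescaled_indices` are hypothetical.

```python
from typing import Dict, Tuple

FrameKey = Tuple[str, str]  # (frame type, hierarchy position)

MAX_INDEX = 4.0  # maximum gain unit index, fixed up front as in the example

# Hypothetical starting indices and learned offsets per frame type and position.
BASE_INDEX: Dict[FrameKey, float] = {
    ("P", "root"): 0.0,
    ("B", "inner"): 2.0,
    ("B", "leaf"): 4.0,
}
LEARNED_OFFSET: Dict[FrameKey, float] = {
    ("P", "root"): 0.37,
    ("B", "inner"): 1.0,
    ("B", "leaf"): 4.0,  # pushes the leaf B-frame to 8, beyond MAX_INDEX
}


def rescaled_indices(max_index: float = MAX_INDEX) -> Dict[FrameKey, float]:
    """Add each learned offset to its base index, then rescale into [0, max_index]."""
    raw = {key: BASE_INDEX[key] + LEARNED_OFFSET[key] for key in BASE_INDEX}
    largest = max(raw.values())
    if largest <= max_index:
        return raw
    scale = max_index / largest  # 4 / 8 = 0.5 here, i.e. "divided by 2"
    return {key: value * scale for key, value in raw.items()}


indices = rescaled_indices()
leaf_index = indices[("B", "leaf")]
root_index = indices[("P", "root")]
```

Each resulting value is what would be passed into the interpolation function for a frame carrying the corresponding flag.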
[0711] Using this approach, the problem of increased training difficulty is mitigated. Because the number of discrete gain units remains small (for example 5-10 gain units, preferably 7 gain units), training time need not increase significantly: it is not as challenging to ensure that each gain unit is sampled enough times in a reasonable time. Code base complexity also does not increase, for example because look up tables with additional mappings to different offset scalars can be scaled very easily.
[0712] As has been described above in more detail in concept 1, these learned gain unit index offsets facilitate continuous rate control without the need to handcraft gain unit indices or values.
[0713] However, the inventors have found that, during training, learned offsets have a tendency to produce increasingly large values to the point where the values that are passed into the interpolation function frequently exceed the maximum allowed gain unit index value. This bunching at the higher end of the gain unit index range can be problematic because it may result in a loss of granularity when scaling is applied to bring the gain unit index numbers passed into the interpolation function back into the allowable range. The inventors have found that continuous rate control is improved when the values passed into the interpolation function are distributed across the whole of the allowed range, rather than bunching around the upper end of it.
[0714] The inventors have realised that a cause of this is that the learned offsets are not aware of each other. That is, during training, a unique offset is learned for each frame type and / or position in a hierarchy scheme, but there is nothing telling the networks where the offsets are converging to. It has been found that introducing level awareness by including in the offsets associated with the highest quality frame type (e.g. a root node P-frame) a cumulative sum of the offsets of the lower quality frame types down to the lowest quality frame type (e.g. a leaf node B-frame) solves the bunching problem.
[0715] Conceptually, this can be understood as making the offsets aware of each other by telling the networks what the offsets of lower frame quality add up to, so the offset for the next level up is only how much "extra" needs to be done on top of everything below it.
[0716] In more detail, if the interpolation function is denoted by f, and there are 3 possible hierarchy levels a, b, c for which we want learned offsets $i_1, i_2, i_3$ to produce gain unit values $G_a, G_b, G_c$, then:
[0717] $G_a = f(i_a) = f(i_3 + i_2 + i_1)$
[0718] $G_b = f(i_b) = f(i_2 + i_1)$
[0719] $G_c = f(i_c) = f(i_1)$
[0720] Or more generally:
[0721] $G_n = f(i_n + i_{n-1} + \dots + i_1) = f\left(\sum_{j=1}^{n} i_j\right)$
[0724] That is, the learned offsets associated with a given quality level are based on a cumulative sum of all the offsets below that quality level plus an extra offset amount, to thereby make the offsets aware of each other.
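The cumulative sum can be illustrated with a short sketch. The offset values are made up, and the function name `level_aware_indices` is hypothetical; the list is ordered so that the first entry is i1 (level c) and the last is i3 (level a).

```python
from itertools import accumulate
from typing import List


def level_aware_indices(extra_offsets: List[float]) -> List[float]:
    """Return the running sums i1, i1 + i2, i1 + i2 + i3, ... of the learned offsets."""
    return list(accumulate(extra_offsets))


# Made-up learned "extra" offsets i1, i2, i3.
i1, i2, i3 = 0.5, 1.25, 0.75
index_c, index_b, index_a = level_aware_indices([i1, i2, i3])
```

Here `index_c`, `index_b` and `index_a` are the arguments passed to the interpolation function f to obtain Gc, Gb and Ga respectively. Changing i1 during training shifts all three arguments, which is one way of seeing how the offsets become aware of each other.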
[0725] The training process for learning these offsets, now with a cumulative sum introduced, remains largely unchanged from that described in concept 1. That is, an offset is initialized with the condition $\sum_{j=1}^{n} i_j$ for each frame type or position in a hierarchy scheme. During training, one of those frame types is selected, and the offsets associated with it are optimized during the backward pass. This process is repeated until some end-of-training condition is met, thereby producing a set of learned offsets that are aware of each other. It has been found that this approach produces a set of learned offsets that make use of the full range of available values to pass into the interpolation function, thereby at least partially addressing the bunching problem described above.
[0726] The concept of quality level awareness through level aware gain unit offsets may be used alone or together with the other concepts described herein.
[0727] The subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. The subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. The computer storage medium is not, however, a propagated signal.
[0728] The term "data processing apparatus" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
[0729] The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
[0730] A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
[0731] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
[0732] Computers suitable for the execution of a computer program can, by way of example, be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices.
[0733] Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a VR headset, a game console, a Global Positioning System (GPS) receiver, a server, a tablet computer, a notebook computer, a music player, an e-book reader, a laptop or desktop computer, a smart phone, or other stationary or portable device that includes one or more processors and computer readable media, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.
[0734] Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
[0735] The subject matter described in this specification can be implemented in a computing system that includes a back end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back end, middleware, or front end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (" LAN") and a wide area network (" WAN"), e.g., the Internet.
[0736] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
[0737] While this specification contains many specific implementation details, these should be construed as descriptions of features that may be specific to particular examples of particular inventions. Certain features that are described in this specification in the context of separate examples can also be implemented in combination in a single example. Conversely, various features that are described in the context of a single example can also be implemented in multiple examples separately or in any suitable sub-combination.
[0738] Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.
[0739] Moreover, the separation of various system modules and components in the examples described above should not be understood as requiring such separation in all examples, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
[0740] Finally, it will be appreciated that GOPs and mini GOPs may be defined as graphs whereby connections between nodes define which frames depend on which other frames. The graph may be a directed acyclic graph with leaf nodes, root nodes, and so on, whereby leaf nodes do not have any other frames dependent on them.
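For illustration, a mini GOP represented as a directed acyclic graph can be sketched as a mapping from each frame to the frames it uses as references. The frame labels, the example dependencies, and the treatment of a root node as a frame with no reference inside the mini GOP are assumptions of this sketch; only the definition of a leaf node is taken from the description.

```python
from typing import Dict, List

# Hypothetical mini GOP: each frame maps to the frames it uses as references.
# References to frames outside the mini GOP are not listed.
DEPENDS_ON: Dict[str, List[str]] = {
    "P4": [],
    "B2": ["P4"],
    "B1": ["B2"],
    "B3": ["B2", "P4"],
}


def leaf_frames(depends_on: Dict[str, List[str]]) -> List[str]:
    """Frames that no other frame in the mini GOP uses as a reference."""
    referenced = {ref for refs in depends_on.values() for ref in refs}
    return sorted(frame for frame in depends_on if frame not in referenced)


def root_frames(depends_on: Dict[str, List[str]]) -> List[str]:
    """Frames with no reference inside the mini GOP (assumed meaning of root node)."""
    return sorted(frame for frame, refs in depends_on.items() if not refs)


leaves = leaf_frames(DEPENDS_ON)
roots = root_frames(DEPENDS_ON)
```

The root or leaf status computed this way is the kind of hierarchy position information that a per-frame flag could carry.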
Claims
1. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of:
receiving a sequence of images at a first computer system;
assigning one or more of the images to a mini GOP; and
encoding, transmitting and decoding the images of the mini GOP by:
with a first neural network, encoding the images to produce latent representations;
transmitting the latent representations to a second computer system; and
with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP,
wherein the first and second neural network use a first set of weights and / or network architecture for encoding and decoding a first image of the mini GOP, and
wherein the first and second neural network use a second set of weights and / or network architecture for encoding and decoding a second image of the mini GOP.
2. The method of claim 1, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame type of the image.
3. The method of claim 1 or 2, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame position in the mini GOP of the image.
4. The method of any of claims 1 to 3, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a quality level in the mini GOP of the image.
5. The method of any of claims 1 to 4, wherein the first and second neural network using the first set of weights and / or network architecture comprises more layers than the first and second neural network using the second set of weights and / or network architecture.
6. The method of claim 5, wherein the second image is assigned to a leaf node of the mini GOP.
7. The method of any of claims 1 to 6, wherein the second neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
8. The method of claim 7, wherein the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
9. The method of claim 7, wherein the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
10. The method of any of claims 1 to 9, wherein the first neural network comprises an encoder neural network, a hyper encoder neural network, and a hyper hyper encoder neural network.
11. The method of claim 10, wherein the first set of weights and / or network architecture associated with the hyper encoder neural network and / or the hyper hyper encoder neural network is different to the second set of weights and / or network architecture associated with the hyper encoder neural network and the hyper hyper encoder neural network.
12. The method of claim 10 or 11, wherein the first neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
13. The method of claim 12, comprising producing a mask for masking at least a portion of an output of at least one of the decoder neural network, the hyper decoder neural network, and / or the hyper hyper decoder neural network, and the method comprising modifying the mask based on a frame type, a frame position in the mini GOP and / or a frame quality in the mini GOP of the image being encoded.
14. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a sequence of images at a first computer system; assigning one or more of the images to a mini GOP; and encoding and transmitting the images of the mini GOP by: with a first neural network, encoding the images to produce latent representations; transmitting the latent representations to a second computer system; wherein the first neural network uses a first set of weights and / or network architecture for encoding a first image of the mini GOP, and wherein the first neural network uses a second set of weights and / or network architecture for encoding a second image of the mini GOP.
15. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving latent representations at a second computer system, the latent representations produced by receiving a sequence of images at a first computer system, assigning one or more of the images to a mini GOP, and encoding the images of the mini GOP by, with a first neural network, encoding the images to produce latent representations; the method further comprising decoding the images of the mini GOP by: with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP, wherein the second neural network uses a first set of weights and / or network architecture for decoding a first image of the mini GOP, and wherein the second neural network uses a second set of weights and / or network architecture for decoding a second image of the mini GOP.
16. The method of claim 14 or 15, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame type of the image.
17. The method of any of claims 14 to 16, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a frame position in the mini GOP of the image.
18. The method of any of claims 14 to 17, comprising, for each image of the mini GOP, selecting the first or the second set of weights and / or network architecture based on a quality level in the mini GOP of the image.
19. The method of any of claims 14 to 18, wherein the first and second neural network using the first set of weights and / or network architecture comprises more layers than the first and second neural network using the second set of weights and / or network architecture.
20. The method of claim 19, wherein the second image is assigned to a leaf node of the mini GOP.
21. The method of any of claims 14 to 20, wherein the second neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
22. The method of claim 21, wherein the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
23. The method of claim 21, wherein the first set of weights and / or network architecture associated with the decoder neural network is different to the second set of weights and / or network architecture associated with the decoder neural network.
24. The method of any of claims 14 to 23, wherein the first neural network comprises an encoder neural network, a hyper encoder neural network, and a hyper hyper encoder neural network.
25. The method of claim 24, wherein the first set of weights and / or network architecture associated with the hyper encoder neural network and / or the hyper hyper encoder neural network is different to the second set of weights and / or network architecture associated with the hyper encoder neural network and the hyper hyper encoder neural network.
26. The method of claim 24 or 25, wherein the first neural network comprises a decoder neural network, a hyper decoder neural network, and a hyper hyper decoder neural network.
27. The method of claim 26, comprising producing a mask for masking at least a portion of an output of at least one of the decoder neural network, the hyper decoder neural network, and / or the hyper hyper decoder neural network, and the method comprising modifying the mask based on a frame type, a frame position in the mini GOP and / or a frame quality in the mini GOP of the image being encoded.
28. A data processing apparatus configured to perform the method of any of claims 1 to 27.
29. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 1 to 27.
30. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 1 to 27.
31. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a sequence of images at a first computer system; encoding, transmitting to a second computer system and decoding at least one image of the sequence of images by: predicting a target quality level and / or a reference image selection for encoding and decoding the at least one image; with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce a latent representation; transmitting the latent representation to a second computer system; and with a second neural network, decoding the latent representation to produce an output image, wherein the output image is an approximation of the at least one image at the predicted target quality level.
32. The method of claim 31, wherein predicting the reference image selection is based on a property of the at least one image.
33. The method of claim 31 or 32, wherein predicting the reference image selection is based on a property of one or more previously decoded images of the sequence of images.
34. The method of claim 33, wherein predicting the reference image selection is based on how close respective representations of the one or more previously decoded images are to the at least one image.
35. The method of claim 34, comprising estimating how close respective representations of the one or more previously decoded images are to the at least one image by estimating respective mean square errors between the representations and the at least one image.
36. The method of claim 34 or 35, wherein the respective representations comprise warped representations of the one or more previously decoded images, and wherein the method comprises producing the respective warped representations of the one or more previously decoded images by estimating optical flow information indicative of differences between the at least one image and the respective previously decoded images, and using the estimated optical flow information to warp the respective previously decoded images.
37. The method of any of claims 34 to 36, wherein the respective representations further comprise non-warped representations of the one or more previously decoded images, and wherein predicting the reference image selection is based on a ratio between (i) how close a warped representation is to the at least one image and (ii) how close a corresponding non-warped representation is to the at least one image.
38. The method of any of claims 31 to 37, wherein predicting the target quality level is based on the at least one image.
39. The method of any of claims 31 to 38, wherein predicting the target quality level comprises predicting a target quality level from a plurality of discrete target quality levels.
40. The method of any of claims 31 to 39, wherein predicting the target quality level comprises selecting a target quality level based on a predetermined sequence of discrete target quality levels when a first condition is met, or selecting a target quality level deviating from the predetermined sequence when a second condition is met.
41. The method of claim 40, wherein predicting the target quality level comprises: with the first neural network, encoding at a starting target quality level the at least one image to produce a latent representation; with the second neural network decoding the latent representation to produce an approximation of the at least one image having the starting target quality level; estimating a difference between the at least one image and the approximation of the at least one image having the starting target quality level; updating the starting target quality level based on the difference; repeating the above steps iteratively until a predetermined condition is met to reach a final target quality level; and using the final target quality level as the predicted target quality level.
42. The method of claim 41, comprising downsampling the at least one image, and wherein predicting the target quality level is based on the downsampled at least one image.
43. The method of any of claims 31 to 42, wherein the target quality level is associated with one or more variable rate control parameters, wherein the method comprises modifying an output of the first neural network using the one or more variable rate control parameters to produce the latent representation, and modifying the latent representation using the one or more variable rate control parameters before performing the decoding, and wherein the method comprises applying an offset to the one or more learned variable rate control parameters.
44. The method of claim 43, wherein the offset is based on the predicted target quality level and / or reference image selection.
45. The method of claim 43 or 44, wherein the offset comprises a learned offset.
46. The method of any of claims 43 to 45, wherein the one or more variable rate control parameters comprise a matrix of first values, wherein the offset comprises a matrix of second values, and wherein applying the offset to the variable rate control parameters comprises modifying the first values using the second values.
47. The method of any of claims 31 to 46, comprising repeating the predicting, encoding, and decoding for each image of the sequence of images to predict a sequence of target quality levels and reference image selections of a scalable video coding scheme optimised for the sequence of images.
48. The method of claim 47, comprising transmitting from the first computer system to the second computer system information representing the sequence of target quality levels and reference image selections.
49. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a sequence of images at a first computer system; encoding, transmitting to a second computer system at least one image of the sequence of images by: predicting a target quality level and / or a reference image selection for encoding and decoding the at least one image; with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce a latent representation; transmitting the latent representation to a second computer system.
50. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a latent representation at a second computer system, the latent representation produced by predicting a target quality level and / or reference image selection, with a first neural network, at the predicted target quality level and / or using the predicted reference image selection, encoding the at least one image to produce the latent representation; and with a second neural network, decoding the latent representation to produce an output image, wherein the output image is an approximation of the at least one image at the predicted target quality level.
51. The method of claim 49 or 50, wherein predicting the reference image selection is based on a property of the at least one image.
52. The method of any of claims 49 to 51, wherein predicting the reference image selection is based on a property of one or more previously decoded images of the sequence of images.
53. The method of claim 52, wherein predicting the reference image selection is based on how close respective representations of the one or more previously decoded images are to the at least one image.
54. The method of claim 53, comprising estimating how close respective representations of the one or more previously decoded images are to the at least one image by estimating respective mean square errors between the representations and the at least one image.
55. The method of claim 53 or 54, wherein the respective representations comprise warped representations of the one or more previously decoded images, and wherein the method comprises producing the respective warped representations of the one or more previously decoded images by estimating optical flow information indicative of differences between the at least one image and the respective previously decoded images, and using the estimated optical flow information to warp the respective previously decoded images.
56. The method of any of claims 53 to 55, wherein the respective representations further comprise non-warped representations of the one or more previously decoded images, and wherein predicting the reference image selection is based on a ratio between (i) how close a warped representation is to the at least one image and (ii) how close a corresponding non-warped representation is to the at least one image.
57. The method of any of claims 49 to 56, wherein predicting the target quality level is based on the at least one image.
58. The method of any of claims 49 to 57, wherein predicting the target quality level comprises predicting a target quality level from a plurality of discrete target quality levels.
59. The method of any of claims 49 to 58, wherein predicting the target quality level comprises selecting a target quality level based on a predetermined sequence of discrete target quality levels when a first condition is met, or selecting a target quality level deviating from the predetermined sequence when a second condition is met.
60. The method of claim 59, wherein predicting the target quality level comprises: with the first neural network, encoding at a starting target quality level the at least one image to produce a latent representation; with the second neural network decoding the latent representation to produce an approximation of the at least one image having the starting target quality level; estimating a difference between the at least one image and the approximation of the at least one image having the starting target quality level; updating the starting target quality level based on the difference; repeating the above steps iteratively until a predetermined condition is met to reach a final target quality level; and using the final target quality level as the predicted target quality level.
61. The method of claim 60, comprising downsampling the at least one image, and wherein predicting the target quality level is based on the downsampled at least one image.
62. The method of any of claims 49 to 61, wherein the target quality level is associated with one or more variable rate control parameters, wherein the method comprises modifying an output of the first neural network using the one or more variable rate control parameters to produce the latent representation, and modifying the latent representation using the one or more variable rate control parameters before performing the decoding, and wherein the method comprises applying an offset to the one or more learned variable rate control parameters.
63. The method of claim 62, wherein the offset is based on the predicted target quality level and / or reference image selection.
64. The method of claim 62 or 63, wherein the offset comprises a learned offset.
65. The method of any of claims 62 to 64, wherein the one or more variable rate control parameters comprise a matrix of first values, wherein the offset comprises a matrix of second values, and wherein applying the offset to the variable rate control parameters comprises modifying the first values using the second values.
66. The method of any of claims 1 to 65, comprising repeating the predicting, encoding, and decoding for each image of the sequence of images to predict a sequence of target quality levels and reference image selections of a scalable video coding scheme optimised for the sequence of images.
67. The method of claim 66, comprising transmitting from the first computer system to the second computer system information representing the sequence of target quality levels and reference image selections.
68. A data processing apparatus configured to perform the method of any of claims 31 to 67.
69. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 31 to 67.
70. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 31 to 67.
71. A method for lossy image or video encoding and transmission, and decoding, the method comprising the steps of: receiving a sequence of images at a first computer system; setting a mini group of pictures (GOP) size based on at least two images of the sequence and assigning one or more of the images to a mini GOP of the size; encoding, transmitting and decoding the images of the mini GOP by: with a first neural network, encoding the images to produce latent representations; transmitting the latent representations to a second computer system; and with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP.
72. The method of claim 71, wherein setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of: (i) estimating a difference between at least two images of the sequence, and (ii) changing the mini GOP size, until the difference exceeds a threshold.
73. The method of claim 72, wherein the difference between the at least two images comprises optical flow information.
74. The method of claim 73, wherein the optical flow information comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
75. The method of any of claims 71 to 74, with a third neural network and using the at least two images, producing the optical flow information.
76. The method of claim 71, wherein setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of: (i) with a fourth neural network and using the at least two images, producing a mask for occluding pixels of a first image of the at least two images when the first image is a reference frame for encoding and decoding a second image of the at least two images, (ii) estimating a statistical value associated with the mask, and (iii) changing the mini GOP size, until the statistical value exceeds a threshold.
77. The method of claim 76, wherein the mask comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
78. The method of any of claims 71 to 77, wherein changing the mini GOP size comprises increasing the mini GOP size.
79. The method of any of claims 71 to 78, wherein the at least two images of the sequence comprise a first image and a next image in the sequence.
80. The method of any of claims 71 to 79, wherein assigning the one or more images to the mini GOP comprises specifying, for the one or more images, a frame type parameter, a quality-level parameter, and / or an encode order parameter.
81. The method of any of claims 71 to 80, comprising assigning all images of the sequence to one or more mini GOPs before performing the encoding, transmitting and decoding.
82. The method of any of claims 71 to 81, comprising performing the encoding, transmitting and decoding of a first mini GOP before assigning one or more of the images to a second mini GOP.
83. The method of any of claims 71 to 82, comprising downsampling the at least two images of the sequence, and wherein setting the mini GOP size is based on the at least two downsampled images of the sequence.
84. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving a sequence of images at a first computer system; setting a mini group of pictures (GOP) size based on at least two images of the sequence and assigning one or more of the images to a mini GOP of the size; encoding and transmitting the images of the mini GOP by: with a first neural network, encoding the images to produce latent representations; and transmitting the latent representations to a second computer system.
85. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving latent representations at a second computer system, the latent representations produced at a first computer system by setting a mini group of pictures (GOP) size based on at least two images of a sequence of images, assigning one or more of the images to a mini GOP of the size, and encoding the images of the mini GOP by: with a first neural network, encoding the images to produce the latent representations; and with a second neural network, decoding the latent representations to produce output images, wherein the output images are approximations of the images of the mini GOP.
86. The method of claim 84 or 85, wherein setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of: (i) estimating a difference between at least two images of the sequence, and (ii) changing the mini GOP size, until the difference exceeds a threshold.
87. The method of claim 86, wherein the difference between the at least two images comprises optical flow information.
88. The method of claim 87, wherein the optical flow information comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
89. The method of any of claims 86 to 88, with a third neural network and using the at least two images, producing the optical flow information.
90. The method of claim 84 or 85, wherein setting the mini GOP size comprises setting a starting mini GOP size and repeating the steps of: (i) with a fourth neural network and using the at least two images, producing a mask for occluding pixels of a first image of the at least two images when the first image is a reference frame for encoding and decoding a second image of the at least two images, (ii) estimating a statistical value associated with the mask, and (iii) changing the mini GOP size, until the statistical value exceeds a threshold.
91. The method of claim 90, wherein the mask comprises a tensor of values and wherein the threshold is exceeded when a norm of at least one slice of the tensor exceeds a threshold value.
92. The method of any of claims 86 to 91, wherein changing the mini GOP size comprises increasing the mini GOP size.
93. The method of any of claims 86 to 92, wherein the at least two images of the sequence comprise a first image and a next image in the sequence.
94. The method of any of claims 86 to 93, wherein assigning the one or more images to the mini GOP comprises specifying, for the one or more images, a frame type parameter, a quality-level parameter, and / or an encode order parameter.
95. The method of any of claims 86 to 94, comprising assigning all images of the sequence to one or more mini GOPs before performing the encoding, transmitting and decoding.
96. The method of any of claims 86 to 95, comprising performing the encoding, transmitting and decoding of a first mini GOP before assigning one or more of the images to a second mini GOP.
97. The method of any of claims 86 to 96, comprising downsampling the at least two images of the sequence, and wherein setting the mini GOP size is based on the at least two downsampled images of the sequence.
98. A data processing apparatus configured to perform the method of any of claims 71 to 97.
99. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 71 to 97.
100. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 71 to 97.
101. A method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; selecting a parameter associated with a position in a frame hierarchy scheme, and using the selected parameter to produce values for modifying the latent representation; processing the latent representation using the values to produce a modified latent representation; transmitting the modified latent representation to a second computer system; and processing the modified latent representation using the values and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at the position in the frame hierarchy scheme.
102. The method of claim 101, wherein producing the values comprises applying a function to the selected parameter, wherein the function is based on one or more learned gain units.
103. The method of claim 102, wherein each gain unit comprises a tensor of learned values, and wherein the function comprises an interpolation function for producing, based on the selected parameter, the values for modifying the latent representation by interpolating between the learned values of the gain unit tensors.
104. The method of any of claims 101 to 103, wherein the selected parameter comprises one of a plurality of learned parameters.
105. The method of claim 104, wherein at least one of the plurality of learned parameters is based on at least one other of the plurality of learned parameters.
106. The method of claim 104 or 105, wherein at least one of the plurality of learned parameters comprises a cumulative sum of at least two others of the plurality of learned parameters.
107. A method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; selecting a parameter associated with a position in a frame hierarchy scheme, and using the selected parameter to produce values for modifying the latent representation; processing the latent representation using the values to produce a modified latent representation; and transmitting the modified latent representation to a second computer system.
108. A method for lossy image or video receipt and decoding, the method comprising the steps of: receiving a modified latent representation at a second computer system, the modified latent representation produced by encoding an input image using a first trained neural network to produce a latent representation, selecting a parameter associated with a position in a frame hierarchy scheme, using the selected parameter to produce values for modifying the latent representation, and processing the latent representation using the values to produce the modified latent representation; and processing the modified latent representation using the values and decoding the processed modified latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image at the position in the frame hierarchy scheme.
109. The method of claim 107 or 108, wherein producing the values comprises applying a function to the selected parameter, wherein the function is based on one or more learned gain units.
110. The method of any of claims 107 to 109, wherein each gain unit comprises a tensor of learned values, and wherein the function comprises an interpolation function for producing, based on the selected parameter, the values for modifying the latent representation by interpolating between the learned values of the gain unit tensors.
111. The method of any of claims 107 to 110, wherein the selected parameter comprises one of a plurality of learned parameters.
112. The method of claim 111, wherein at least one of the plurality of learned parameters is based on at least one other of the plurality of learned parameters.
113. The method of claim 111 or 112, wherein at least one of the plurality of learned parameters comprises a cumulative sum of at least two others of the plurality of learned parameters.
114. A data processing apparatus configured to perform the method of any of claims 101 to 113.
115. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 101 to 113.
116. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 101 to 113.
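Purely as a non-limiting illustration of the gain-unit features recited in claims 102 to 106, the following sketch interpolates between gain units to produce values for a selected parameter. The function names, the numeric gain units, the use of linear interpolation, and the choice of element-wise multiplication at the encoder and division at the decoder are all assumptions made for this sketch; the claims only require that the values are used to process the latent representation.

```python
from itertools import accumulate


def cumulative_parameters(increments):
    """Build each parameter as a running sum of the increments before it."""
    return list(accumulate(increments))


def interpolate_gain(gain_units, parameters, selected):
    """Linearly interpolate between the two gain units bracketing `selected`.

    gain_units[k] is the (hypothetical) gain vector tied to parameters[k];
    parameters must be sorted in ascending order.
    """
    if not parameters[0] <= selected <= parameters[-1]:
        raise ValueError("selected parameter lies outside the learned range")
    for k in range(len(parameters) - 1):
        lo, hi = parameters[k], parameters[k + 1]
        if lo <= selected <= hi:
            t = 0.0 if hi == lo else (selected - lo) / (hi - lo)
            return [(1 - t) * a + t * b
                    for a, b in zip(gain_units[k], gain_units[k + 1])]
    return list(gain_units[-1])


def apply_gain(latent, values):
    """Encoder side (assumed): scale each latent channel by its gain value."""
    return [y * g for y, g in zip(latent, values)]


def remove_gain(modified_latent, values):
    """Decoder side (assumed): undo the scaling before decoding."""
    return [y / g for y, g in zip(modified_latent, values)]


if __name__ == "__main__":
    # Made-up numbers standing in for learned quantities.
    parameters = cumulative_parameters([1.0, 0.5, 0.5])
    gain_units = [[1.0, 2.0], [2.0, 4.0], [4.0, 8.0]]
    values = interpolate_gain(gain_units, parameters, selected=1.25)
    modified = apply_gain([0.2, -0.4], values)
    print(parameters, values, modified, remove_gain(modified, values))
```

Building the parameters as a cumulative sum keeps them ordered whenever the increments are positive, which is one possible reason for making each parameter depend on the others.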