Method and apparatus for encoding a picture and decoding a bitstream using a neural network

Neural networks with padding and cropping layers in multi-stage context models and hyperscale decoders improve video coding efficiency by managing tensor sizes, addressing bandwidth and memory constraints.

JP2026521805A · Pending · Publication Date: 2026-07-01 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: JP
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2024-06-21
Publication Date: 2026-07-01

AI Technical Summary

Technical Problem

Existing video coding technologies face challenges in achieving efficient compression with minimal loss of image quality, particularly when dealing with limited network bandwidth and memory resources.

Method used

The use of neural networks with multi-stage context models, hyperscale decoders, and composite transform networks that incorporate padding and cropping layers to manage tensor sizes, ensuring integer dimensions and reducing data processing, thereby improving coding efficiency.

Benefits of technology

This approach minimizes information loss and enhances coding efficiency by maintaining device interoperability and reducing memory usage, while supporting various signal sizes and dimensions.

✦ Generated by Eureka AI based on patent content.

Figure 2026521805000001_ABST

Abstract

This disclosure relates to a method, neural network, encoder, and decoder for processing pictures. Specifically, a padding layer is added before a layer having the same function as a downsampling layer, and a cropping layer is added after a layer having the same function as an upsampling layer, thereby reducing the amount of data processed by the neural network and improving encoding efficiency.

Description

[Technical Field]

[0001] This disclosure relates to methods for processing pictures using a neural network, a neural network, an encoder and decoder for performing these methods, and a computer-readable storage medium.

[Background Technology]

[0002] Video coding (video encoding and decoding) is used in a wide range of digital video applications, such as broadcast digital TV, video transmission over the internet and mobile networks, real-time conversational applications such as video chat and video conferencing, DVDs and Blu-ray discs, video content acquisition and editing systems, and camcorders in security applications.

[0003] Even the amount of video data required to render a relatively short video can be substantial, which can pose challenges when the data is streamed or otherwise transmitted over communication networks with limited bandwidth. Therefore, video data is generally compressed before being transmitted over modern telecommunications networks. Video size can also be a problem when video is stored on a storage device, as memory resources can be limited. Video compressors often use software and / or hardware at the source to encode video data before transmission or storage, thereby reducing the amount of data required to represent a digital video image. The compressed data is received at the destination by a video decompressor that decodes the video data. Given limited network resources and the increasing demand for higher video quality, improved compression and decompression techniques that improve compression ratios with little to no sacrifice of image quality are desirable.

[0004] Neural networks and deep learning technologies that utilize neural networks have long been used in the fields of video and image encoding and decoding.

[0005] In such cases, a bitstream typically represents, or is, data that can be reasonably represented by a two-dimensional matrix of values. For example, this applies to bitstreams that represent, or are, images, video sequences, or similar data. Apart from 2D data, the neural networks and frameworks referred to in this disclosure may be applied to further source signals, such as audio signals, or other signals, which are typically represented as 1D signals.

[0006] For example, neural networks can be used for both image recognition using deep learning neural networks and for encoding pictures. Correspondingly, such networks can be used to decode encoded pictures. Other source signals, such as signals with fewer or more dimensions than two, can also be processed by similar networks.

[0007] It may be desirable to provide a neural network framework that can be efficiently applied to a variety of different signals that may have different sizes.

[Summary of the Invention]

[Means for Solving the Problem]

[0008] Embodiments of this disclosure may ensure that the original information of a picture can be reconstructed with as little information loss as possible, and may enable effective processing of the picture while improving coding efficiency.

[0009] One embodiment of the present disclosure relates to a method for processing images using a neural network (NN), wherein the NN includes a multi-stage context model (MCM), and the MCM comprises a plurality of MCM_k models that are applied to reshuffled data, with a first down-shuffle layer applied to the first input of the MCM process and an up-shuffle layer applied to the output of the MCM process. There is a first padding layer before the first down-shuffle layer and a cropping layer after the up-shuffle layer. The method includes: a step of obtaining a first tensor, where the first tensor is the reconstructed residual; a step of obtaining a second tensor, where the second tensor is the explicit prediction output by the hyperdecoder; a step of padding the first tensor with the first padding layer; a step of down-shuffling the padded first tensor with the first down-shuffle layer to obtain a reshuffled first tensor; a step of processing the reshuffled first and second tensors with the plurality of MCM_k models to obtain a latent space tensor; a step of up-shuffling the latent space tensor with the up-shuffle layer to obtain a reshuffled latent space tensor; and a step of cropping the reshuffled latent space tensor with the cropping layer to obtain a reconstructed latent tensor, wherein the depth and stride of the padding layer are equal to the depth and stride of the cropping layer.

[0010] The present invention, in an MCM structure, adds a padding layer before the downsampling layer or a layer with the same function as the downsampling layer, and adds a cropping layer after the upsampling layer or a layer with the same function as the upsampling layer, thereby avoiding non-integer sized tensors in any processing step, avoiding loss of device interoperability (because fractional sizes of tensors are undefined and can therefore be treated differently by different devices and processors), reducing the amount of processed data (and memory usage) in the neural network, and improving coding efficiency.

[0011] The MCM_k models are also referred to as stages or MCM_k stages. The number of stages may be 8, but it is not limited to 8; it may be 6 or fewer, or 10 or more.

[0012] Furthermore, the exact design of the MCM structure may change in the future; for example, the MCM structure may include more or fewer stages, more or fewer down-shuffle or up-shuffle layers, or some other layers to change the size or shape of the tensor. However, regardless of how the MCM structure changes, it can be understood that there must be a padding layer before the downsampling / down-shuffle layer, or a layer with the same function as the downsampling layer, and a cropping layer after the upsampling / up-shuffle layer, or a layer with the same function as the upsampling layer. Thus, non-integer sized tensors can be avoided at any processing step, the lack of device interoperability (since fractional sized tensors are undefined and can therefore be treated differently by different devices and processors) can be avoided, the amount of data processed (and memory usage) in the neural network can be reduced, and coding efficiency can be improved.
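
The pad, down-shuffle, process, up-shuffle, crop sequence described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the claimed implementation: the shuffle factor of 2, padding on the bottom and right edges, and the identity placeholder `stage_fn` standing in for the MCM_k stages are all hypothetical choices, and the second (explicit prediction) tensor is omitted for brevity. Replication padding follows the embodiment that creates the padding layer by replication.

```python
import numpy as np

def pad_to_multiple(x, factor=2):
    """Replicate-pad bottom/right so H and W become multiples of factor (assumed placement)."""
    _, h, w = x.shape
    pad_h = (-h) % factor
    pad_w = (-w) % factor
    return np.pad(x, ((0, 0), (0, pad_h), (0, pad_w)), mode="edge")

def down_shuffle(x, factor=2):
    """Space-to-depth: (C, H, W) -> (C*factor^2, H/factor, W/factor)."""
    c, h, w = x.shape
    x = x.reshape(c, h // factor, factor, w // factor, factor)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * factor * factor, h // factor, w // factor)

def up_shuffle(x, factor=2):
    """Depth-to-space: the exact inverse of down_shuffle."""
    c_in, h, w = x.shape
    c = c_in // (factor * factor)
    x = x.reshape(c, factor, factor, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * factor, w * factor)

def crop(x, h, w):
    """Keep the top-left h x w region, undoing the bottom/right padding."""
    return x[:, :h, :w]

def mcm_boundary_roundtrip(residual, stage_fn=lambda t: t):
    """Pad -> down-shuffle -> stages (placeholder) -> up-shuffle -> crop."""
    _, h, w = residual.shape
    padded = pad_to_multiple(residual)
    shuffled = down_shuffle(padded)      # every dimension is an integer here
    latent = stage_fn(shuffled)          # hypothetical stand-in for the MCM_k stages
    restored = up_shuffle(latent)
    return crop(restored, h, w)          # back to the original H x W

x = np.arange(2 * 5 * 7, dtype=np.float32).reshape(2, 5, 7)
y = mcm_boundary_roundtrip(x)
print(x.shape, pad_to_multiple(x).shape, down_shuffle(pad_to_multiple(x)).shape, y.shape)
```

With an odd 5 x 7 input, the padded tensor is 6 x 8, the shuffled tensor is 3 x 4 with four times the channels, and the cropped output returns to 5 x 7, so no step ever sees a fractional size.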

[0013] In one embodiment, the number of inputs to the tensor slice of the first down-shuffle layer is 2.

[0014] In one embodiment, the number of inputs to the tensor slice in the up-shuffle layer is 2.

[0015] In one embodiment, the first padding layer is denoted Padd(H_in, W_in), where H_in and W_in are the height and width of the first tensor.

[0016] In one embodiment, the first padding layer has a stride of 2 and a depth of 5.

[0017] In one embodiment, the second padding layer has a stride of 2 and a depth of 5.

[0018] In one embodiment, the cropping layer has a stride of 2 and a depth of 5.

[0019] In one embodiment, the cropping layer is denoted Crop(H_in, W_in), where H_in and W_in are the height and width of the tensor output to the synthesis transformation.

[0020] In one embodiment, the first tensor is a reconstructed residual tensor, and the second tensor is an explicit prediction tensor.

[0021] One embodiment of the present invention discloses a neural network (NN), the NN includes a multi-stage context model (MCM), the MCM includes a plurality of MCM k models, a first downshuffle layer, and an upshuffle layer. There is a first padding layer before the first downshuffle layer, and a cropping layer after the upshuffle layer. The first downshuffle layer is used to change the size or shape of the tensor, and the upshuffle layer is used to change the size or shape of the tensor.

[0022] In one embodiment, the number of inputs of the tensor slice of the first downshuffle layer is 2.

[0023] In one embodiment, the number of inputs of the tensor slice of the upshuffle layer is 2.

[0024] In one embodiment, the first padding layer is denoted Padd(H_in, W_in), where H_in and W_in are the height and width of the first tensor.

[0025] In one embodiment, the cropping layer is denoted Crop(H_in, W_in), where H_in and W_in are the height and width of the tensor output to the composite transformation.

[0026] In one embodiment, the first padding layer receives a tensor having a first size and outputs a tensor having a second size.

[0027] In one embodiment, the first padding layer is created by replication.

[0028] In one embodiment, the MCM_k model takes as input the output tensors of the preceding MCM_k models, that is, those with indices 0 to k-1.

[0029] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a multi-stage context model (MCM), the MCM comprising a plurality of MCM_k models and further comprising a first down-shuffle layer and an up-shuffle layer, with a first padding layer preceding the first down-shuffle layer and a cropping layer following the up-shuffle layer. The encoder further comprises a transmitter that outputs a bitstream, and the encoder is adapted to perform the method according to any of the embodiments described above.

[0030] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a multi-stage context model (MCM), the MCM comprising a plurality of MCM_k models and further comprising a first down-shuffle layer and an up-shuffle layer, with a first padding layer preceding the first down-shuffle layer and a cropping layer following the up-shuffle layer. The decoder further comprises a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any one of the embodiments described above.

[0031] One embodiment of the present invention discloses a method for processing an image using a neural network (NN), the NN including a hyperscale decoder, the hyperscale decoder including a base operating point and a high operating point. With respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit; a first quantized convolutional layer, followed by a second normalized linear unit; a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit; and a second quantized convolutional layer. With respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by a normalized linear unit. The method includes: a step of obtaining an input tensor; a step of obtaining the size of the input tensor; a step of obtaining an operating point indicator; a step of deciding, based on the operating point indicator, whether to process the input tensor using the base operating point or the high operating point; and a step of outputting the processed tensor.

[0032] A hyperscale decoder includes two processing pipelines, one of which is the base operating point and the other is the high operating point. In some other embodiments, the base operating point may also be referred to as the base profile, baseline, base channel, base subnetwork, or some other name, and correspondingly, the high operating point may also be referred to as the high profile, high line, high channel, high subnetwork, or some other name.

[0033] Furthermore, the exact design of the hyperscale decoder may change in the future, and it may include more or fewer quantized transposed convolutional layers, and it may include some other layers to change the size or shape of the tensor. However, regardless of how the hyperscale decoder changes, it can be understood that there must be a padding layer before the downsampling / downshuffle layer, or a layer that functions the same as the downsampling layer, and a cropping layer after the upsampling / upshuffle layer, or a layer that functions the same as the upsampling layer.

[0034] This invention proposes using tensor boundary processing in a hyperscale decoder to reduce the amount of data processed in a neural network and improve coding efficiency. Tensor boundary processing means adding a padding layer before the downsampling layer or a layer with the same function as the downsampling layer, and adding a cropping layer after the upsampling layer or a layer with the same function as the upsampling layer. This ensures that the tensor size is an integer at each step, and thus uncertainty is avoided (uncertain processes cause platform dependency and eliminate device interoperability). Since the hyperscale decoder outputs parameters for the entropy decoder, it must have bit-accurate behavior; otherwise, what is parsed from the bitstream bits cannot be correctly interpreted.
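
The integer-size argument can be made concrete with a little size arithmetic. The sketch below is illustrative only: the ceiling rule for the downsampled size and the transposed-convolution padding of 1 are assumptions (this section specifies only a 4 × 4 kernel and a stride of 2 for the quantized transposed convolutional layers), and the helper names are hypothetical. The output-length formula is the standard one for a transposed convolution without dilation or output padding.

```python
def down_size(n, stride=2):
    """Size after padding n up to a multiple of stride and downsampling: ceil(n / stride)."""
    return -(-n // stride)

def tconv_out_size(n, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one axis (padding=1 is an assumption)."""
    return (n - 1) * stride - 2 * padding + kernel

def crop_amount(n_original, kernel=4, stride=2, padding=1):
    """Samples the cropping layer must drop so the upsampled size matches the original."""
    n_low = down_size(n_original, stride)
    n_up = tconv_out_size(n_low, kernel, stride, padding)
    return n_up - n_original

for n in (16, 17):
    # n / 2 is fractional for odd n, which is the undefined case the padding layer avoids
    print(n, n / 2, down_size(n), tconv_out_size(down_size(n)), crop_amount(n))
```

For an even size such as 16 nothing needs to be cropped. For an odd size such as 17, plain halving would give 8.5; padding first yields 9, upsampling yields 18, and the cropping layer removes the one extra sample, so every device computes the same integer sizes.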

[0035] In one embodiment, both the first quantized transposed convolutional layer and the second quantized transposed convolutional layer have a kernel size of 4 × 4 and a stride of 2.

[0036] In one embodiment, the first cropping layer has a stride of 2 and a depth of 6.

[0037] In one embodiment, the second cropping layer has a stride of 2 and a depth of 5.

[0038] In one embodiment, the cropping layer following the first of the two quantized convolutional layers at the high operating point has a stride of 2 and a depth of 6, and the cropping layer following the second of the two quantized convolutional layers at the high operating point has a stride of 2 and a depth of 5.

[0039] In one embodiment, the processed tensor is a hyperscale decoder standard deviation tensor.

[0040] In one embodiment, when the operating point indicator is equal to 0, the input tensor is processed using the base operating point.

[0041] In one embodiment, when the operating point indicator is equal to 1, the high operating point is used to process the input tensor.
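
The selection between the two pipelines can be sketched as a small dispatch function. This is a hedged illustration: the function name, the `(tensor, size)` pipeline signature, and the error raised for other indicator values are assumptions, since the text only states that an indicator of 0 selects the base operating point and an indicator of 1 selects the high operating point.

```python
from typing import Any, Callable, Tuple

Size = Tuple[int, int]
Pipeline = Callable[[Any, Size], Any]

def process_with_operating_point(tensor: Any, size: Size, indicator: int,
                                 base: Pipeline, high: Pipeline) -> Any:
    """Route the input tensor to the base (0) or high (1) operating point."""
    if indicator == 0:
        pipeline = base
    elif indicator == 1:
        pipeline = high
    else:
        # Behaviour for other values is not described in the text; rejecting them is an assumption.
        raise ValueError(f"unsupported operating point indicator: {indicator}")
    # The size is passed along so cropping layers inside the pipeline can use it.
    return pipeline(tensor, size)
```

Passing the size explicitly mirrors the method steps, which obtain the size of the input tensor before the pipeline is chosen.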

[0042] In one embodiment, the first quantized convolutional layer has a kernel size of 3 × 3.

[0043] In one embodiment, the pixel shuffle layer is configured to change the number of channels from 4C to C.
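
At the high operating point each of the first two quantized convolutional layers is followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit. The NumPy sketch below illustrates that post-convolution chain for a shuffle factor of 2, which is what a 4C to C channel change implies. It rests on assumptions: the channel ordering follows the common depth-to-space convention, the crop keeps the top-left region, and the activation is taken to be a ReLU. The convolution itself is omitted.

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c_in, h, w = x.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

def shuffle_crop_activate(x, target_h, target_w, r=2):
    """Pixel shuffle, then crop to the target size, then apply a ReLU (assumed activation)."""
    y = pixel_shuffle(x, r)
    y = y[:, :target_h, :target_w]   # top-left crop is an assumption
    return np.maximum(y, 0)

x = (np.arange(4 * 2 * 3) - 10).reshape(4, 2, 3)   # 4C = 4 channels, so C = 1
y = shuffle_crop_activate(x, target_h=3, target_w=5)
print(x.shape, "->", y.shape)
```

Here a 4 x 2 x 3 tensor becomes 1 x 4 x 6 after the shuffle and 1 x 3 x 5 after the crop, showing how the cropping layer trims the upsampled tensor to an exact target size.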

[0044] One embodiment of the present invention discloses a neural network (NN) which includes a hyperscale decoder, the hyperscale decoder includes a base operating point and a high operating point, and with respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer. For high operating points, the hyperscale decoder includes two quantized convolutional layers, each followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by normalized linear units.

[0045] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point. With respect to the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit; a first quantized convolutional layer, followed by a second normalized linear unit; a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit; and a second quantized convolutional layer. With respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by a normalized linear unit. The encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of the embodiments described above.

[0046] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point. With respect to the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit; a first quantized convolutional layer, followed by a second normalized linear unit; a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit; and a second quantized convolutional layer. With respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by a normalized linear unit. The decoder further includes a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any one of the embodiments described above.

[0047] One embodiment of the present invention discloses a method for processing an image using a neural network (NN), the NN comprising a composite transform network, the composite transform network comprising a connected layer configured to concatenate a principal tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point. With respect to the base operating point, the composite transform network comprises a lightweight residual block, followed by a first transposed convolutional layer combined with a first cropping layer and a first residual activation unit; a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit; a first convolutional layer combined with a third residual activation unit; and a second convolutional layer, followed by a first pixel shuffle layer and a third cropping layer. With respect to the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation; a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer; and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation; and is completed by a fifth transposed convolutional layer and a seventh cropping layer. The method includes: a step of obtaining the input tensor by concatenating the principal tensor and the auxiliary tensor; a step of obtaining an operating point indicator; a step of obtaining the size of the input tensor; a step of deciding, based on the operating point indicator, whether to process the input tensor using the base operating point or the high operating point; and a step of outputting the processed tensor.

[0048] A composite transformation network can be understood to include two processing pipelines, one of which is the base operating point and the other is the high operating point. In some other embodiments, the base operating point may also be referred to as the base profile, baseline, base pipeline, base channel, base subnetwork, or some other name, and correspondingly, the high operating point may also be referred to as the high profile, high line, high pipeline, high channel, high subnetwork, or some other name.

[0049] Furthermore, the exact design of the composite transformation network may change in the future, and the composite transformation network may include more or fewer quantized transpose convolutional layers, more or fewer quantized convolutional layers, and more or fewer pixel shuffle layers. The composite transformation network may also include some other layers to change the size or shape of the tensors. However, regardless of how the composite transformation network is modified, it can be understood that there must be padding layers before downsampling / downshuffle layers, or layers that function the same as downsampling layers, and there must be cropping layers after upsampling / upshuffle layers, or layers that function the same as upsampling layers.

[0050] This invention proposes using tensor boundary processing in a synthetic transform network to reduce the amount of data processed in the neural network and improve coding efficiency. Tensor boundary processing means adding a padding layer before a downsampling layer or a layer with the same function as a downsampling layer, and adding a cropping layer after an upsampling layer or a layer with the same function as an upsampling layer. This ensures that the tensor size is an integer without introducing uncertainty at any step of the processing.

[0051] In one embodiment, the first cropping layer has a stride of 2 and a depth of 4.

[0052] In one embodiment, the second cropping layer has a stride of 2 and a depth of 3.

[0053] In one embodiment, the third cropping layer has a stride of 4 and a depth of 1.

[0054] In one embodiment, the fourth cropping layer has a stride of 2 and a depth of 4.

[0055] In one embodiment, the fifth cropping layer has a stride of 2 and a depth of 3.

[0056] In one embodiment, the sixth cropping layer has a stride of 2 and a depth of 2.

[0057] In one embodiment, the seventh cropping layer has a stride of 2 and a depth of 1.
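
The embodiments above give a stride and a depth for each cropping layer without defining how the two map to a tensor size. One plausible reading, assumed here and not stated in this section, is that a cropping layer with stride s and depth d crops to ceil(H / s^d) by ceil(W / s^d), where H and W are the picture dimensions. The sketch below tabulates the resulting crop targets for the fourth to seventh cropping layers of the high operating point; the function names and the 1080 x 1920 picture size are illustrative.

```python
def level_size(full, stride, depth):
    """ceil(full / stride**depth), using integer arithmetic only (assumed size rule)."""
    return -(-full // stride ** depth)

# (name, stride, depth) for the high operating point, taken from the embodiments above
HIGH_POINT_CROPS = [
    ("fourth", 2, 4),
    ("fifth", 2, 3),
    ("sixth", 2, 2),
    ("seventh", 2, 1),
]

def crop_targets(height, width, layers):
    """Return (name, target_height, target_width) for each cropping layer."""
    return [(name, level_size(height, s, d), level_size(width, s, d))
            for name, s, d in layers]

for name, h, w in crop_targets(1080, 1920, HIGH_POINT_CROPS):
    print(f"{name} cropping layer -> {h} x {w}")
```

Under this reading the depth decreases by one at each upsampling step, so the crop targets double from layer to layer while staying integers even when the picture height (1080) is not divisible by 16.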

[0058] In one embodiment, when the operating point indicator is equal to 0, the input tensor is processed using the base operating point.

[0059] In one embodiment, when the operating point indicator is equal to 1, the high operating point is used to process the input tensor.

[0060] In one embodiment, the principal tensor is a reconstructed latent space tensor.

[0061] One embodiment of the present invention discloses a neural network (NN) comprising a synthetic transform network, the synthetic transform network comprising a connected layer configured to concatenate a principal tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point, wherein with respect to the base operating point, the synthetic transform network comprises a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, and a subsequent first pixel shuffle layer and a third cropping layer. For high operating points, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and is completed by a fifth transposed convolutional layer and a subsequent seventh cropping layer.

[0062] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a composite transform network, the composite transform network comprising a connected layer configured to concatenate a principal tensor and an auxiliary tensor as input tensors, a base operating point and a high operating point, the composite transform network comprising a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer. For high operating points, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and is completed by a fifth transposed convolutional layer and a subsequent seventh cropping layer. The encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of the embodiments described above.

[0063] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a composite transform network, the composite transform network comprising a connected layer configured to concatenate a principal tensor and an auxiliary tensor as input tensors, a base operating point and a high operating point, the composite transform network comprising a lightweight residual block, a first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and subsequently a first pixel shuffle layer and a third cropping layer. For high operating points, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer combined with a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and is completed by a fifth transposed convolutional layer and a seventh cropping layer, and the decoder further includes a transmitter for outputting the decoded picture, and the decoder is adapted to perform any one of the methods of the embodiments described above.

[0064] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising one or more processors for implementing a neural network (NN), the one or more processors being adapted to perform the method according to any one of the embodiments described above.

[0065] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising one or more processors implementing a neural network (NN), the one or more processors being adapted to perform the method according to any one of the embodiments described above.

[0066] Embodiments of the present invention disclose a computer program product that, when executed on a computer system, includes computer-executable instructions that cause the computer system to perform a method according to any one of the embodiments described above.

[0067] One embodiment of the present invention discloses a neural network (NN), the NN including a multi-stage context model (MCM), the MCM including a plurality of MCM_k models and further comprising one or more down-shuffle layers and one or more up-shuffle layers, with a padding layer before each of the one or more down-shuffle layers and a cropping layer after each of the one or more up-shuffle layers.

[0068] In one embodiment, the NN further includes a hyperscale decoder, the hyperscale decoder includes a baseline and a highline, the baseline includes two or more quantized transposed convolutional layers, each of which is followed by a cropping layer and a normalized linear unit, and the highline includes two or more quantized convolutional layers, each of which is followed by a pixel shuffle layer, a cropping layer and a normalized linear unit in that order.

[0069] In one embodiment, the NN further comprises a composite transformation network, the composite transformation network comprising a baseline and a high line. The baseline comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each followed by a pixel shuffle layer and a cropping layer. The high line comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation, and a convolutional layer followed by a residual nonlocal attention block combined with a pixel shuffle layer and a cropping layer.

[0070] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture, a transmitter for outputting a bitstream, and one or more processors configured to implement a neural network (NN) as described in any one of claims 42 to 44.

[0071] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving a bitstream, a transmitter for outputting a decoded picture, and one or more processors configured to implement a neural network (NN) as described in any one of claims 42 to 44.

[0072] One embodiment of the present invention discloses a method for processing a picture using a neural network (NN), wherein the NN includes a multi-stage context model (MCM), and the MCM includes a plurality of MCM_k models and further comprises one or more down-shuffle layers and one or more up-shuffle layers, with a padding layer before each of the one or more down-shuffle layers and a cropping layer after each of the one or more up-shuffle layers. The method includes: obtaining an input tensor; padding a first tensor using the padding layer preceding each of the one or more down-shuffle layers, wherein the first tensor is the input tensor or a tensor obtained by processing the input tensor; and, after a second tensor is output from each of the one or more up-shuffle layers, cropping the second tensor using the cropping layer.
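The pad, down-shuffle, up-shuffle, crop sequence of paragraph [0072] can be illustrated with a minimal NumPy sketch. It assumes that a down-shuffle layer is a space-to-depth rearrangement with factor r, that an up-shuffle layer is its inverse, and that padding is zero padding at the bottom and right edges. These choices, and all function names, are illustrative assumptions rather than the layers defined in this disclosure.

```python
import numpy as np

def pad_to_multiple(x, r):
    """Zero-pad H and W of a (C, H, W) tensor at the bottom/right to multiples of r."""
    _, h, w = x.shape
    return np.pad(x, ((0, 0), (0, (-h) % r), (0, (-w) % r)), mode="constant")

def down_shuffle(x, r):
    """Space-to-depth: (C, H, W) -> (C*r*r, H/r, W/r); H and W must be divisible by r."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0, "pad first so the output size is an integer"
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def up_shuffle(x, r):
    """Depth-to-space: (C*r*r, H, W) -> (C, H*r, W*r); inverse of down_shuffle."""
    crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(c, r, r, h, w)
    return x.transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

def crop(x, h, w):
    """Cropping layer: keep the top-left h x w region of every channel."""
    return x[:, :h, :w]

# A 5x7 input is not divisible by r = 2, so it is padded to 6x8 before the
# down-shuffle and cropped back to 5x7 after the up-shuffle.
x = np.arange(2 * 5 * 7, dtype=np.float64).reshape(2, 5, 7)
restored = crop(up_shuffle(down_shuffle(pad_to_multiple(x, 2), 2), 2), 5, 7)
```

Without the padding step the reshape inside `down_shuffle` would fail for odd sizes, which is the integer-dimension problem the padding layer is meant to avoid.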

[0073] In one embodiment, the NN further includes a hyperscale decoder. The hyperscale decoder includes a baseline and a high line. The baseline includes two or more quantized transposed convolutional layers, each of which is followed by a cropping layer and a rectified linear unit (ReLU), and the high line includes two or more quantized convolutional layers, each of which is followed by a pixel shuffle layer, a cropping layer and a rectified linear unit. The method includes: obtaining an operating point indicator; and determining, based on the operating point indicator, whether to process a third tensor using the baseline or the high line.

[0074] In one embodiment, the NN further comprises a composite transformation network comprising a baseline and a high line. The baseline comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each followed by a pixel shuffle layer and a cropping layer. The high line comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation unit, and a convolutional layer followed by a residual non-local attention block combined with a pixel shuffle layer and a cropping layer. The method includes: obtaining a second input tensor by concatenating a main tensor and an auxiliary tensor; obtaining an operating point indicator; and determining, based on the operating point indicator, whether to process the second input tensor using the baseline or the high line.

[0075] In one embodiment, when the operating point indicator is equal to 0, the baseline is used to process the second input tensor.

[0076] In one embodiment, when the operating point indicator is equal to 1, the high line is used to process the second input tensor.
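The selection rule of paragraphs [0075] and [0076] amounts to a two-way dispatch on the operating point indicator. The sketch below shows only that dispatch; the function name is hypothetical, and `baseline` and `high_line` are placeholder callables standing in for the two network branches.

```python
def process_with_operating_point(tensor, operating_point_indicator, baseline, high_line):
    """Route the tensor through the branch selected by the operating point indicator."""
    if operating_point_indicator == 0:
        return baseline(tensor)      # indicator 0: baseline branch
    if operating_point_indicator == 1:
        return high_line(tensor)     # indicator 1: high-line branch
    # Other values are not described in the text; rejecting them is an assumption.
    raise ValueError(f"unsupported operating point indicator: {operating_point_indicator}")
```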

[0077] In general, a picture in the context of this disclosure may constitute a still image or a moving image such as a video or video sequence. Furthermore, a portion of a larger picture or video sequence may be encompassed by the term "picture." A picture may also be referred to as a frame or image.

[0078] Applied to the input, and its size S is set in at least one dimension.

[0079] In this regard, obtaining one of several resizing methods should be understood as meaning that while multiple resizing methods are available for image encoding, preferably, one is used not arbitrarily, but depending on additional information. This can result in the selection of a resizing method that is particularly suitable for obtaining the intended output of the neural network, for example, with respect to the size of the input or output.

[0080] The input to the neural network may be a two-dimensional input, such as the image itself, or a matrix representing sample values of the image, or other structure representing the image. The input does not necessarily have to be the picture itself, but may be related to a pre-processed or otherwise processed version of the picture. Pre-processing or processing of the image provided as input to the neural network may include, for example, preparing or modifying the image for further processing by the neural network.

[0081] In the context of this disclosure, a downsampling layer can be understood as a layer that reduces the size of an input, for example, by applying a convolution to the input. This can include reducing the size by a factor, also called the downsampling ratio of the downsampling layer, where the downsampling ratio is the size of the input reduced from the input size S.

[0082] The output of a neural network can be referred to as an encoded picture, but the output of a neural network is not necessarily a bitstream representing an already encoded picture. The output encoding the picture may be binarized and may also include additional information, for example, about the resizing method used to apply resizing.

[0083] This embodiment allows selecting a resizing method depending on the situation and applying the resizing method for resizing. For example, in some cases, the size S of the input may be increased during resizing to a size larger than S before the input was processed by the neural network.

[0084] In further embodiments, multiple resizing methods include padding, zero padding, reflection padding, repeating padding, cropping, and resizing the input size S.

[0085] This evaluation allows us to determine, for example, whether increasing or decreasing the size is more efficient in terms of computation, and accordingly, to decide which resizing method to use (e.g., padding or cropping).

[0086] If, during this comparison, it is found that C is equal to F, then the resizing method that changes the input size S is not applied. Using these formulas, a reliable evaluation can be made as to whether it is efficient to increase or decrease the size S.
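The formulas referred to in paragraphs [0085] and [0086] are not reproduced in this text. As a hedged sketch only, assume that C and F denote the nearest sizes above and below S that are integer multiples of the combined downsampling ratio of the network. Under that assumption C equals F exactly when S is already such a multiple, which is the case in which no resizing is applied. The function name and the tie-breaking rule (prefer padding) are hypothetical.

```python
import math

def choose_resizing(size, downsampling_ratios):
    """Return (method, target_size) for one dimension of size `size`.

    Assumption: C = ceil(S / R) * R and F = floor(S / R) * R, where R is the
    product of the downsampling ratios of all downsampling layers.
    """
    total = math.prod(downsampling_ratios)
    c = -(-size // total) * total   # smallest multiple of R that is >= S
    f = (size // total) * total     # largest multiple of R that is <= S
    if c == f:
        return ("none", size)       # S already fits; no resizing needed
    # Illustrative rule: change as few samples as possible, preferring
    # padding on a tie because padding discards no information.
    if c - size <= size - f:
        return ("pad", c)
    return ("crop", f)

# With four layers of ratio 2 (R = 16): 64 needs no resizing, 110 is padded
# to 112 (2 samples added), and 100 is cropped to 96 (4 samples removed).
```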

[0087] In one embodiment, one or more instructions include an instruction, the first value of which indicates that padding or cropping should be applied as the resizing method, and the second value of which indicates that interpolation should be applied as the resizing method. In this context, the first and second values of the instruction mean that the instruction can take either the first or second value. This allows information about which resizing method to use to be provided for encoding with as little information as possible. This instruction will also be referred to below as the “first instruction” for ease of distinction from other instructions. It may or may not be present, independently of the presence or absence of the other instructions described below.

[0088] Specifically, the instruction may be a flag having a size of 1 bit, or may be specified to include such a flag. This allows for indicating with less information whether to apply an increase or decrease in the input size S during resizing.

[0089] In one embodiment, one or more instructions include an instruction, the first value of which indicates that size S should be increased, and the second value of which indicates that size S should be decreased. This instruction will also be referred to below as the “second instruction” for ease of distinction from other instructions.

[0090] In further embodiments, one or more instructions include a first value of the instruction indicating that padding should be applied as the resizing method, and a second value of the instruction indicating that cropping should be applied as the resizing method. This also provides information on whether to use padding or cropping for resizing. This instruction is also referred to below as the “fourth instruction” for ease of distinction from other instructions. This instruction may exist independently of the presence or absence of other instructions. However, in some embodiments, it may be provided when the first instruction indicates that padding or cropping should be applied as the resizing method.

[0091] Specifically, the instruction may be a flag having a size of 1 bit, or may include one. This minimizes the size of the instruction while ensuring that the necessary information is provided.

[0092] In other embodiments, one or more instructions include a value indicating whether zero padding, reflection padding, or repeating padding should be applied as a resizing method. This allows for different types of padding to be provided. This instruction is also referred to below as the “fifth instruction” for ease of distinction from other instructions. This instruction may exist independently of the presence or absence of other instructions. However, in some embodiments, it may be provided when the fourth instruction indicates that padding or cropping should be applied as a resizing method.
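The three padding types named in paragraph [0092] can be compared on a small example. The sketch below maps them onto NumPy's `np.pad` modes; treating "repeating padding" as NumPy's `edge` mode (repeat the border sample) is an assumption, and the helper name `pad_right` is hypothetical.

```python
import numpy as np

# Assumed mapping from the padding types in the text to np.pad modes.
PAD_MODES = {
    "zero": "constant",     # fill with zeros
    "reflection": "reflect",  # mirror the samples next to the border
    "repeating": "edge",    # repeat the border sample
}

def pad_right(samples, amount, kind):
    """Append `amount` padded samples to the right of a 1-D array."""
    return np.pad(np.asarray(samples), (0, amount), mode=PAD_MODES[kind])

row = [1, 2, 3]
zero_padded = pad_right(row, 2, "zero")            # [1, 2, 3, 0, 0]
reflect_padded = pad_right(row, 2, "reflection")   # [1, 2, 3, 2, 1]
repeat_padded = pad_right(row, 2, "repeating")     # [1, 2, 3, 3, 3]
```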

[0093] Specifically, information regarding the resizing method used may include the input size, the picture size, the resizing method applied, one or more instructions, and at least one of the downsampling ratios of at least one downsampling layer of the NN. The instructions may be the first to fifth instructions described above. However, other instructions are also possible. This disclosure is not limited to the instructions provided.

[0094] This information can be used to reliably decode the bitstream.

[0095] Specifically, the instruction may be provided to be a flag having a size of 1 bit, or to contain such a flag. This reduces the size of the instruction to the minimum.

[0096] Furthermore, a computer-readable storage medium is provided that contains computer-executable instructions which, when executed on a computer system, cause the computer system to perform a method according to any of the embodiments described above.

[Brief description of the drawings]

[0097] [Figure 1A] Block diagram showing an example of a video coding system configured to implement an embodiment of the present disclosure.

[0098] [Figure 1B] This block diagram shows another example of a video coding system configured to implement some embodiments of the present disclosure.

[0099] [Figure 2] This is a block diagram showing an example of an encoding or decoding device.

[0100] [Figure 3] This is a block diagram showing other examples of encoding or decoding devices.

[0101] [Figure 4] This figure shows an encoder and decoder together according to one embodiment.

[0102] [Figure 5] A schematic diagram of input encoding and decoding is shown.

[0103] [Figure 6] This figure shows an encoder and decoder in accordance with the VAE framework.

[0104] [Figure 7] This figure shows the components of an encoder according to one embodiment shown in Figure 4.

[0105] [Figure 8] This diagram shows the components of the decoder according to Figure 4 in one embodiment.

[0106] [Figure 9] This figure shows the rescaling and processing of the input.

[0107] [Figure 10] This is a diagram showing an encoder and decoder.

[0108] [Figure 11] This figure shows additional encoders and decoders.

[0109] [Figure 12] This figure shows input rescaling and processing according to one embodiment.

[0110] [Figure 13] This figure shows an embodiment of signaling a rescaling option according to one embodiment.

[0111] [Figure 14] This figure shows a more specific implementation of the embodiment shown in Figure 13.

[0112] [Figure 15] This figure shows a more specific implementation of the embodiment shown in Figure 14.

[0113] [Figure 16] This figure compares different possibilities for padding calculations.

[0114] [Figure 17] This figure shows a further comparison of different possibilities for padding calculations.

[0115] [Figure 18] This figure shows the relationship between an encoder and a decoder, and the processing of inputs to the encoder and decoder, according to one embodiment.

[0116] [Figure 19] A schematic diagram of a neural network as part of an encoder according to one embodiment is shown.

[0117] [Figure 20] A flowchart of a method for encoding a picture according to one embodiment is shown.

[0118] [Figure 21] This figure shows one or more embodiments of instructions provided for encoding.

[0119] [Figure 22] A schematic diagram of a neural network as part of a decoder according to one embodiment is shown.

[0120] [Figure 23] A flowchart of a method for decoding a bitstream according to one embodiment is shown.

[0121] [Figure 24] This figure shows one or more embodiments of instructions provided for decoding.

[0122] [Figure 25] A schematic diagram of an encoder according to one embodiment is shown.

[0123] [Figure 26] A schematic diagram of a decoder according to one embodiment is shown.

[0124] [Figure 27A] A schematic diagram of the residual activation unit is shown.

[0125] [Figure 27B] A schematic diagram of the residual activation function is shown.

[0126] [Figure 27C] A schematic diagram of residual nonlocal attention block (RBAN) is shown.

[0127] [Figure 27D] A schematic diagram of the residual block (RB) is shown.

[0128] [Figure 27E] A schematic diagram of a lightweight residual block (LRB) is shown.

[0129] [Figure 28] A flowchart of a method for processing images according to one embodiment is shown.

[0130] [Figure 29] This figure shows an example of a multi-stage context modeling process.

[0131] [Figure 29A] This figure shows another example of a multi-stage context modeling process.

[0132] [Figure 30] This figure shows an example of a channel network process.

[0133] [Figure 31] A flowchart of a method for processing pictures according to one embodiment is shown.

[0134] [Figure 32] This figure shows an example of a hyperscale decoder.

[0135] [Figure 33] A flowchart of a method for processing pictures according to one embodiment is shown.

[0136] [Figure 34] This figure illustrates an example of a composite transformation network.

[Modes for carrying out the invention]

[0137] Several embodiments will be described below with reference to the drawings. Figures 1 to 3 show video coding systems and methods that may be used in conjunction with the more specific embodiments of the present invention shown in the further figures. Specifically, the embodiments described in relation to Figures 1 to 3 may be used in conjunction with encoding/decoding techniques, described further below, that use neural networks to encode and/or decode bitstreams.

[0138] The following description refers to the accompanying drawings, which form part of the disclosure and illustrate specific aspects of embodiments of the disclosure or specific aspects in which embodiments of the disclosure may be used. It is understood that embodiments may be used in other aspects and may include structural or logical modifications not shown in the drawings. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of the disclosure is defined by the accompanying claims.

[0139] For example, disclosure relating to a described method may also apply to a corresponding device or system configured to perform the method, and vice versa. For example, if one or more specific method steps are described, the corresponding device may include one or more units, e.g., functional units, to perform the one or more described method steps (e.g., one unit performing the one or more steps, or multiple units each performing one or more of the multiple steps). Conversely, if a specific apparatus is described based on one or more units, e.g., functional units, the corresponding method may include one or more steps to perform the functions of the one or more units (e.g., one step performing the functions of the one or more units, or multiple steps each performing the functions of one or more of the multiple units). Furthermore, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other unless otherwise specified.

[0140] Video coding generally refers to the processing of a sequence of pictures that form a video or video sequence. In the field of video coding, the terms "frame" and "image" may be used as synonyms for the term "picture." Video coding (or coding in general) consists of two parts: video encoding and video decoding. Video encoding is performed on the source side and typically involves processing the original video pictures (e.g., by compression) to reduce the amount of data required to represent them (for more efficient storage and/or transmission). Video decoding is performed on the destination side and typically involves the inverse processing compared to the encoder in order to reconstruct the video pictures. Embodiments referring to "coding" of video pictures (or pictures in general) shall be understood to relate to "encoding" or "decoding" of video pictures or respective video sequences. The combination of the encoding part and the decoding part is also referred to as a codec (coding and decoding).

[0141] In lossless video coding, the original video pictures can be reconstructed, meaning the reconstructed video pictures have the same quality as the original video pictures (assuming no transmission loss or other data loss during storage or transmission). In lossy video coding, further compression is performed, for example by quantization, to reduce the amount of data representing the video pictures, which then cannot be fully reconstructed at the decoder, meaning the quality of the reconstructed video pictures is lower than that of the original video pictures.

[0142] Several video coding standards belong to the group of “lossy hybrid video codecs” (i.e., they combine spatial and temporal prediction in the sample domain with 2D transform coding for applying quantization in the transform domain). Each picture in a video sequence is generally divided into a set of non-overlapping blocks, and coding is generally performed at the block level. In other words, in an encoder, video is typically processed, or encoded, at the block (video block) level, for example, by generating predicted blocks using spatial (intra-picture) prediction and/or temporal (inter-picture) prediction, subtracting the predicted block from the current block (the block currently being processed or to be processed) to obtain a residual block, transforming the residual block, and quantizing the residual block in the transform domain to reduce (compress) the amount of data to be transmitted. In a decoder, the inverse processing compared to the encoder is applied to the encoded or compressed block to reconstruct the current block for representation. Furthermore, the encoder duplicates the decoder processing loop so that both generate identical predictions (e.g., intra-predictions and inter-predictions) and/or reconstructions for processing, i.e., coding, the subsequent blocks. Recently, part or all of the encoding and decoding chain has been implemented using neural networks, or more generally any machine learning or deep learning framework.
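The block-level hybrid coding loop described in paragraph [0142] (predict, subtract, transform, quantize, then invert at the decoder) can be sketched as follows. The sketch assumes an orthonormal 2D DCT-II as the transform and a single uniform quantization step. Both are illustrative choices, not the transform or quantizer of any particular standard, and the function names are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def encode_block(current, predicted, qstep):
    """Residual -> 2D transform -> uniform quantization to integer levels."""
    d = dct_matrix(current.shape[0])
    residual = current - predicted
    coeffs = d @ residual @ d.T
    return np.round(coeffs / qstep).astype(np.int64)

def decode_block(levels, predicted, qstep):
    """Dequantize -> inverse transform -> add the prediction back."""
    d = dct_matrix(levels.shape[0])
    residual = d.T @ (levels * qstep) @ d
    return predicted + residual

# Example: an 8x8 block predicted by a flat block at its mean value.
rng = np.random.default_rng(0)
current = rng.integers(0, 256, size=(8, 8)).astype(np.float64)
predicted = np.full((8, 8), current.mean())
levels = encode_block(current, predicted, qstep=4.0)
reconstructed = decode_block(levels, predicted, qstep=4.0)
```

Because the encoder and the decoder start from the same `predicted` block and the same `levels`, they arrive at the same reconstruction, which is why the encoder can duplicate the decoder loop. The reconstruction differs from `current` only by the quantization error, which grows with `qstep`.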

[0143] In the following, embodiments of the video coding system 10, the video encoder 20 and the video decoder 30 are described with reference to Figure 1.

[0144] Figure 1A is a schematic block diagram showing an exemplary coding system 10 that may utilize the techniques of this application, for example, a video coding system 10 (or simply coding system 10). The video encoder 20 (or simply encoder 20) and video decoder 30 (or simply decoder 30) of the video coding system 10 represent examples of devices that may be configured to perform the techniques described in the various examples of this application.

[0145] As shown in Figure 1A, the coding system 10 includes a source device 12 configured to provide encoded picture data 21 to, for example, a destination device 14 for decoding the encoded picture data 21.

[0146] The source device 12 includes an encoder 20 and may further include, optionally, a picture source 16, a preprocessor (or preprocessing unit) 18, such as a picture preprocessor 18, and a communication interface or communication unit 22. Some embodiments of the present disclosure (e.g., relating to initial rescaling or rescaling between two preceding layers) may be implemented by the encoder 20. Some embodiments (e.g., relating to initial rescaling) may be implemented by the picture preprocessor 18.

[0147] The picture source 16 includes, or may include, any type of picture capture device, e.g., a camera for capturing real-world pictures, and / or any type of picture generation device, e.g., a computer graphics processor for generating computer-animated pictures, or any other type of device for acquiring and / or providing real-world pictures, computer-generated pictures (e.g., screen content, virtual reality (VR) pictures), and / or any combination thereof (e.g., augmented reality (AR) pictures). The picture source may be any type of memory or storage for storing any of the aforementioned pictures.

[0148] To distinguish it from the result of the processing performed by the preprocessor 18 (preprocessing unit 18), the picture or picture data 17 may also be referred to as the raw picture or raw picture data 17.

[0149] The preprocessor 18 is configured to receive the (raw) picture data 17 and to perform preprocessing on the picture data 17 to obtain a preprocessed picture 19 or preprocessed picture data 19. The preprocessing performed by the preprocessor 18 may include, for example, cropping, color format conversion (e.g., RGB to YCbCr), color correction, or noise reduction. It can be understood that the preprocessing unit 18 may be an optional component.
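As an illustration of the color format conversion mentioned in paragraph [0149], the sketch below converts 8-bit RGB samples to YCbCr. The text does not say which conversion matrix is used. The full-range BT.601 coefficients (as used in JFIF) are assumed here purely for the example, and the function name is hypothetical.

```python
import numpy as np

# Full-range BT.601 RGB -> YCbCr matrix (an assumed choice, see lead-in).
_RGB_TO_YCBCR = np.array([
    [0.299, 0.587, 0.114],
    [-0.168736, -0.331264, 0.5],
    [0.5, -0.418688, -0.081312],
])

def rgb_to_ycbcr(rgb):
    """Convert an array of shape (..., 3) holding 8-bit RGB values to YCbCr."""
    rgb = np.asarray(rgb, dtype=np.float64)
    ycbcr = rgb @ _RGB_TO_YCBCR.T
    ycbcr[..., 1:] += 128.0   # center the chroma components at 128
    return ycbcr

# Neutral colors carry no chroma: white maps to Y = 255, Cb = Cr = 128.
white = rgb_to_ycbcr([255, 255, 255])
```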

[0150] The video encoder 20 is configured to receive pre-processed picture data 19 and provide encoded picture data 21.

[0151] The communication interface 22 of the source device 12 may be configured to receive the encoded picture data 21 and transmit the encoded picture data 21 (or any further processed version thereof) to another device, such as the destination device 14 or any other device, via the communication channel 13 for storage or direct reconstruction.

[0152] The destination device 14 includes a decoder 30 (e.g., a video decoder 30) and may additionally, or optionally, include a communication interface or communication unit 28, a post-processor 32 (or post-processing unit 32), and a display device 34.

[0153] The communication interface 28 of the destination device 14 is configured to receive the encoded picture data 21 (or any further processed version thereof), for example, directly from the source device 12 or from any other source, such as a storage device (e.g., an encoded picture data storage device), and to provide the encoded picture data 21 to the decoder 30.

[0154] Communication interfaces 22 and 28 may be configured to transmit or receive encoded picture data 21 or encoded data 21 via a direct communication link between the source device 12 and the destination device 14, for example, via a direct wired or wireless connection, or via any type of network, for example, a wired or wireless network or any combination thereof, or any type of private and public network or any combination thereof.

[0155] The communication interface 22 may be configured, for example, to package the encoded picture data 21 into an appropriate format, such as a packet, and / or to process the encoded picture data using any kind of transmit encoding or processing for transmission over a communication link or communication network.

[0156] A communication interface 28, which is the counterpart to communication interface 22, may be configured, for example, to receive transmitted data and process the transmitted data using any type of corresponding transmission decoding or processing and / or depackaging to obtain encoded picture data 21.

[0157] Both communication interfaces 22 and 28 may be configured as one-way communication interfaces, as indicated by the arrow of the communication channel 13 pointing from the source device 12 to the destination device 14 in Figure 1A, or as two-way communication interfaces, and may be configured, for example, to send and receive messages, e.g., to set up a connection, and to acknowledge and exchange any other information related to the communication link and/or data transmission, such as encoded picture data transmission.

[0158] The decoder 30 is configured to receive the encoded picture data 21 and provide the decoded picture data 31 or the decoded picture 31 (further details will be explained below, for example, with reference to Figure 3).

[0159] The post-processor 32 of the destination device 14 is configured to post-process the decoded picture data 31 (also referred to as reconstructed picture data), for example, the decoded picture 31, to obtain post-processed picture data 33, for example, the post-processed picture 33. The post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., YCbCr to RGB), color correction, cropping, or resampling, or any other processing to prepare the decoded picture data 31 for display by the display device 34, for example.

[0160] Some embodiments of this disclosure may be implemented by the decoder 30 or by the postprocessor 32.

[0161] The display device 34 of the destination device 14 is configured to receive post-processed picture data 33 for displaying the picture to, for example, a user or viewer. The display device 34 may be any type of display for representing the reconstructed picture, for example, an integrated or external display or monitor, or may include one. The display may include, for example, a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a plasma display, a projector, a microLED display, a liquid crystal on silicon (LCoS), a digital light processor (DLP), or any other type of display.

[0162] Although Figure 1A shows the source device 12 and the destination device 14 as separate devices, embodiments of devices may also include both functionalities, i.e., the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality. In such embodiments, the source device 12 or corresponding functionality and the destination device 14 or corresponding functionality may be implemented using the same hardware and/or software, or by separate hardware and/or software, or any combination thereof.

[0163] As will be apparent to those skilled in the art based on the description, the presence of different units or functions within the source device 12 and / or destination device 14, as shown in Figure 1A, and the (exact) division of functions may vary depending on the actual device and application.

[0164] The encoder 20 (e.g., a video encoder 20) or the decoder 30 (e.g., a video decoder 30), or both the encoder 20 and the decoder 30, may be implemented via processing circuitry as shown in Figure 1B, such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, circuitry dedicated to video coding, or any combination thereof. The encoder 20 may be implemented via the processing circuit 46 to embody the various modules and/or any other encoder system or subsystem described herein. The decoder 30 may be implemented via the processing circuit 46 to embody the various modules and/or any other decoder system or subsystem described herein. The processing circuit may be configured to perform various operations, as described later. As shown in Figure 3, when the techniques are partially implemented in software, a device may store instructions for the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the techniques of this disclosure. Either the video encoder 20 or the video decoder 30 may be integrated as part of a combined encoder/decoder (codec) in a single device, for example, as shown in Figure 1B.

[0165] The source device 12 and the destination device 14 can include any of a wide range of devices, such as any type of handheld device or fixed device, e.g., a notebook computer or laptop computer, mobile phone, smartphone, tablet or tablet computer, camera, desktop computer, set-top box, television, display device, digital media player, video game console, a video streaming device (such as a content service server or content delivery server), a broadcast receiver device, a broadcast transmitter device, etc., and may use no operating system or any type of operating system. In some cases, the source device 12 and the destination device 14 can be equipped for wireless communication. Thus, the source device 12 and the destination device 14 can be wireless communication devices.

[0166] In some cases, the video coding system 10 shown in Figure 1A is merely an example, and the techniques of this application can be applied to video coding settings (e.g., video encoding or video decoding) that do not necessarily involve data communication between an encoding device and a decoding device. In other examples, the data is retrieved from local memory, streamed over a network, or the like. A video encoding device can encode data and store it in memory, and/or a video decoding device can retrieve data from memory and decode it. In some examples, encoding and decoding are performed by devices that do not communicate with each other but simply encode data into memory and/or retrieve and decode data from memory.

[0167] For convenience of explanation, some embodiments are described herein by reference to, for example, High Efficiency Video Coding (HEVC) or the reference software of Versatile Video Coding (VVC), the next-generation video coding standard developed by the Joint Collaborative Team on Video Coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Those skilled in the art will understand that the embodiments of the present invention are not limited to HEVC or VVC.

[0168] Figure 2 is a schematic diagram of a video coding device 400 according to an embodiment of the present invention. The video coding device 400 is suitable for implementing the disclosed embodiments as described herein. In one embodiment, the video coding device 400 can be a decoder such as the video decoder 30 of Figure 1A, or an encoder such as the video encoder 20 of Figure 1A.

[0169] The video coding device 400 includes an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data, a processor, logic unit, or central processing unit (CPU) 430 for processing data, a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data, and a memory 460 for storing data. The video coding device 400 may also include optical - electrical (OE) components and electrical - optical (EO) components coupled to the ingress port 410, the receiver unit 420, the transmitter unit 440, and the egress port 450 for the egress or ingress of optical or electrical signals.

[0170] The processor 430 is implemented by hardware and software. The processor 430 may be implemented as one or more CPU chips, cores (for example, as a multi-core processor), FPGAs, ASICs, and DSPs. The processor 430 communicates with the ingress port 410, the receiver unit 420, the transmitter unit 440, the egress port 450, and the memory 460. The processor 430 includes a coding module 470. The coding module 470 implements the embodiments disclosed above. For example, the coding module 470 implements, processes, prepares, or provides various coding operations. Thus, the inclusion of the coding module 470 provides a substantial improvement to the functionality of the video coding device 400 and effects a transformation of the video coding device 400 to a different state. Alternatively, the coding module 470 is implemented as instructions stored in the memory 460 and executed by the processor 430.

[0171] Memory 460 may include one or more disks, tape drives, and solid-state drives, and may be used as an overflow data storage device to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. Memory 460 may be, for example, volatile and/or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and/or static random access memory (SRAM).

[0172] Figure 3 is a simplified block diagram of a device 500 that can be used as either or both of the source device 12 and destination device 14 in Figure 1, according to an exemplary embodiment.

[0173] The processor 502 within the device 500 may be a central processing unit. Alternatively, the processor 502 may be any other type of device, or multiple devices, capable of manipulating or processing information, whether now existing or developed in the future. While the disclosed embodiments can be implemented with a single processor, e.g., processor 502, advantages in speed and efficiency can be achieved using two or more processors.

[0174] The memory 504 within the device 500 may, in one implementation, be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504. Memory 504 may contain code and data 506 accessed by the processor 502 using the bus 512. Memory 504 may further include an operating system 508 and an application program 510, the application program 510 including at least one program that enables the processor 502 to perform the method described herein. For example, the application program 510 may include applications 1 to N, the applications 1 to N further including a video coding application that performs the method described herein.

[0175] The device 500 may also include one or more output devices, such as a display 518. The display 518 may, in one example, be a touch-sensitive display that combines a display with a touch-sensitive element that can operate to sense touch input. The display 518 may be coupled to the processor 502 via the bus 512.

[0176] Although shown here as a single bus, the bus 512 of device 500 may consist of multiple buses. Furthermore, the secondary storage device 514 may be directly coupled to other components of device 500 or accessed via a network, and may include a single integrated unit such as a memory card, or multiple units such as multiple memory cards. Thus, device 500 can be implemented in a wide variety of configurations.

[0177] More specific and non-limiting exemplary embodiments of the present invention are described below. Before that, some explanations are provided to aid in understanding this disclosure.

[0178] Artificial neural networks (ANNs), or connectionist systems, are computing systems vaguely inspired by the biological neural networks that make up animal brains. ANNs are based on a collection of connected units or nodes, called artificial neurons, which roughly model neurons in a biological brain. Each connection can transmit signals to other neurons, much like synapses in a biological brain. The receiving artificial neuron can then process these signals and transmit them to the neurons it connects to. In ANN implementations, the "signals" in a connection are real numbers, and the output of each neuron can be calculated by some nonlinear function of the sum of its inputs. These connections are called edges. Neurons and edges typically have weights that are adjusted as learning progresses. The weights increase or decrease the intensity of the signals in a connection. Neurons may have thresholds such that they transmit a signal only when the aggregate signal exceeds a threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. The signal moves from the first layer (input layer), and, in some cases, traverses multiple layers, before reaching the final layer (output layer).

[0179] The original goal of the ANN approach was to solve problems in the same way the human brain does. Over time, attention shifted to performing specific tasks, leading to a departure from biology. ANNs have been used for a variety of tasks, including computer vision.

[0180] The name "Convolutional Neural Network" (CNN) indicates that the network uses a mathematical operation called convolution. Convolution is a special type of linear operation. A convolutional network is simply a neural network that uses convolution instead of general matrix multiplication in at least one of its layers. A convolutional neural network consists of an input layer, an output layer, and several hidden layers. The input layer is the layer to which the input is provided for processing. For example, the neural network in Figure 6 is a CNN. The hidden layers of a CNN typically consist of a series of convolutional layers that convolve their input by multiplication or another dot product. The result of a layer is one or more feature maps, sometimes called channels. Some or all of the layers may involve subsampling. As a result, the feature maps may become smaller. The activation function in a CNN can be a ReLU (rectified linear unit) layer or a GDN layer, as exemplified above, followed by additional layers such as pooling layers, fully connected layers, and normalization layers. These layers are called hidden layers because their inputs and outputs are masked by the activation function and the final convolution. These layers are colloquially called convolutions, but this is merely a convention. Mathematically, the operation is technically a sliding dot product or cross-correlation. This matters for the indices in the matrix, in that it affects how the weight is determined at a particular index point.

[0181] When programming a CNN to process pictures or images, the input is a tensor with shape (number of images) × (image width) × (image height) × (image depth). After passing through a convolutional layer, the image is abstracted into a feature map with shape (number of images) × (feature map width) × (feature map height) × (feature map channels). A convolutional layer in a neural network should have the following attributes: a convolutional kernel defined by width and height (hyperparameters); the number of input and output channels (hyperparameters); and the depth of the convolutional filter (input channels) must be equal to the number of channels (depth) of the input feature map.

[0182] In the past, conventional multilayer perceptron (MLP) models have been used for image recognition. However, due to the full connectivity between nodes, they suffered from high dimensionality and did not scale well to higher-resolution images. A 1000×1000-pixel image with RGB color channels requires 3 million weights per fully connected neuron, which is too many to process efficiently at scale. Furthermore, such network architectures do not take into account the spatial structure of the data, treating input pixels that are far apart in the same way as pixels that are close together. This ignores locality of reference in image data, both computationally and semantically. Therefore, for purposes such as image recognition, which are dominated by spatially local input patterns, full connectivity of neurons is wasteful. Convolutional neural network (CNN) models mitigate the problems posed by the MLP architecture by exploiting the strong spatially local correlation present in natural images. Convolutional layers are the core building blocks of a CNN. The parameters of a layer consist of a set of learnable filters (the kernels described above), which have a small receptive field but extend through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter entries and the input to produce a two-dimensional activation map for that filter. As a result, the network learns filters that activate when they detect a particular type of feature at some spatial position in the input.
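The weight count quoted above follows from simple arithmetic. The sketch below compares the number of weights feeding one fully connected neuron with the number of weights in one convolutional filter. The 5×5 kernel size and the function names are illustrative assumptions, not values taken from this disclosure.

```python
def dense_weights_per_neuron(width, height, channels):
    """Weights needed by one neuron connected to every input sample."""
    return width * height * channels


def conv_weights_per_filter(kernel_w, kernel_h, in_channels):
    """Weights in one filter that spans the full input depth."""
    return kernel_w * kernel_h * in_channels


# 1000x1000 RGB image, as in the example in the text.
dense = dense_weights_per_neuron(1000, 1000, 3)
# Assumed 5x5 kernel over the same three input channels.
conv = conv_weights_per_filter(5, 5, 3)
```

Under these assumptions the fully connected neuron needs 3,000,000 weights while the filter needs 75, and the filter's weights are reused at every spatial position.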

[0183] The entire output volume of the convolutional layer is formed by stacking the activation maps of all filters along the depth dimension. Therefore, all entries in the output volume can also be interpreted as the outputs of neurons that look at a small region in the input and share parameters with neurons in the same activation map. A feature map, or activation map, is the output activation of a given filter. Feature map and activation have the same meaning. In some papers, it is called an activation map because it is a mapping corresponding to the activation of different parts of an image, and also called a feature map because it is also a mapping of where certain features are found in the image. High activation means that a certain feature has been found.

[0184] Another important concept in CNNs is pooling, a form of nonlinear downsampling. There are several nonlinear functions for performing pooling, the most common of which is max pooling. It divides the input image into a set of non-overlapping rectangles and outputs the maximum value for each such sub-region. Intuitively, the exact location of a feature is less important than its rough location relative to other features. This is the idea behind the use of pooling in convolutional neural networks. The pooling layer progressively reduces the spatial size of the representation, thereby reducing the number of parameters in the network, the memory footprint, and the amount of computation, and thus also helps control overfitting. In CNN architectures, it is common to periodically insert pooling layers between consecutive convolutional layers. The pooling operation provides another form of translation invariance.
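As a rough illustration of the max pooling described above, the following Python sketch partitions a 2-D input into non-overlapping k × k blocks and keeps the maximum of each. The function name `max_pool2d`, the default block size of 2, and the choice to ignore incomplete blocks at the border are assumptions made for this example only.

```python
def max_pool2d(image, k=2):
    """Max-pool a 2-D list of numbers over non-overlapping k x k blocks.

    Rows and columns that do not fill a complete block are ignored
    (an assumption of this sketch, not a requirement of the disclosure).
    """
    rows = len(image) // k
    cols = len(image[0]) // k
    return [
        [
            max(image[r * k + i][c * k + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 3],
]
pooled = max_pool2d(feature_map)  # 4x4 input -> 2x2 output
```

Each output sample records only that a large activation occurred somewhere in its block, which is why small shifts of a feature inside a block leave the output unchanged.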

[0185] ReLU, mentioned above, stands for rectified linear unit and applies a non-saturating activation function. It effectively removes negative values from the activation map by setting them to zero. This increases the nonlinear properties of the decision function and of the network as a whole without affecting the receptive fields of the convolutional layer. Other functions, such as the saturating hyperbolic tangent and sigmoid functions, are also used to increase nonlinearity. ReLU is often preferred over other functions because it trains neural networks several times faster without a significant penalty to generalization accuracy.
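A minimal sketch of the ReLU operation on a flat list of activations is given below. The function name and the sample values are illustrative assumptions.

```python
def relu(values):
    """Apply ReLU element-wise: negative entries become zero."""
    return [max(0.0, v) for v in values]


activations = relu([-2.0, -0.5, 0.0, 1.5])
```

Positive inputs pass through unchanged, so the output is not bounded from above, which is what "non-saturating" refers to.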

[0186] After several convolutional and max-pooling layers, high-level inference in a neural network is performed through fully connected layers. Neurons in fully connected layers have connections to all activations in the previous layer, as seen in typical (non-convolutional) artificial neural networks. Thus, their activations can be computed as an affine transformation, that is, a matrix multiplication followed by a bias offset (vector addition of a learned or fixed bias term).
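The affine computation of a fully connected layer can be sketched as y = W x + b. The helper name `fully_connected` and the small weight matrix below are hypothetical and serve only to make the arithmetic concrete.

```python
def fully_connected(x, weights, bias):
    """Compute y = W x + b for one fully connected layer.

    weights is a list of rows, one row per output neuron.
    """
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


x = [1.0, 2.0, 3.0]                           # activations of the previous layer
W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]       # two output neurons
b = [0.5, -1.0]                               # one bias per output neuron
y = fully_connected(x, W, b)
```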

[0187] An autoencoder is a type of artificial neural network used to learn efficient data coding in an unsupervised manner. The purpose of an autoencoder is to learn a representation (encode) of a set of data, typically for dimensionality reduction, by training the network to ignore signal "noise". Along with the reduction side, the reconstruction side is learned, and the autoencoder attempts to generate a representation from the reduced encoding that is as close as possible to its original input, hence its name.

[0188] Picture size: Refers to the width or height of a picture, or the width-height pair. The width and height of an image are usually measured in terms of the number of luminance samples.

[0189] Downsampling: Downsampling is a process in which the sampling rate (sampling interval) of a discrete input signal is reduced. For example, if the input signal is an image with a size of height h and width w (hereinafter also referred to as H and W), and the output of downsampling has a height of h2 and a width of w2, at least one of the following holds. · h2 < h · w2 < w

[0190] In one embodiment, downsampling can be implemented to retain only every mth sample and discard the rest of the input signal (basically a picture in the context of the present invention).
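The "keep every m-th sample" form of downsampling can be sketched with list slicing. The function name, the choice to start at sample 0, and the 4 × 6 test picture are assumptions for illustration.

```python
def downsample(picture, m):
    """Keep every m-th sample in both dimensions of a 2-D list."""
    return [row[::m] for row in picture[::m]]


# Picture with h = 4 rows and w = 6 columns; sample value encodes its position.
picture = [[10 * r + c for c in range(6)] for r in range(4)]
small = downsample(picture, 2)  # h2 = 2, w2 = 3
```

Here both h2 < h and w2 < w hold, which satisfies the definition of downsampling given above.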

[0191] Upsampling: Upsampling is a process in which the sampling rate (sampling interval) of a discrete input signal is increased. For example, if the size of the input image is h and w (hereinafter also referred to as H and W), and the output of upsampling is h2 and w2, at least one of the following holds. · h < h2 · w < w2

[0192] Resampling: Both the downsampling and upsampling processes are examples of resampling. Resampling is a process of changing the sampling rate (sampling interval) of an input signal.

[0193] Interpolation filtering: During the upsampling or downsampling process, filtering can be applied to improve the accuracy of the resampled signal and reduce the impact of aliasing. An interpolation filter typically includes a weighted combination of sample values at sample positions around the resampling position. This can be implemented as follows.

Number

[0194] Cropping: Cutting off the outer edges of a digital image. Cropping can be used to make an image smaller (in terms of the number of samples) and / or to change the aspect ratio (length to width) of an image.

[0195] Padding: Padding refers to increasing the size of an input (for example, an input image) by generating new samples at the boundaries of the image. This can be done, for example, by using predefined sample values or by using sample values at positions within the input image.
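The two definitions above (cropping and padding) can be sketched as follows. The function names, the decision to pad only at the right and bottom borders, and the replication of the nearest edge sample as the "sample values within the input image" variant are all assumptions of this example.

```python
def pad(image, pad_h, pad_w, value=None):
    """Append pad_w columns on the right and pad_h rows at the bottom.

    With value=None the nearest edge sample is replicated; otherwise the
    predefined value (for example 0) is used for every new sample.
    """
    padded = []
    for row in image:
        fill = row[-1] if value is None else value
        padded.append(list(row) + [fill] * pad_w)
    for _ in range(pad_h):
        if value is None:
            padded.append(list(padded[-1]))  # repeat the last row
        else:
            padded.append([value] * len(padded[0]))
    return padded


def crop(image, new_h, new_w):
    """Discard rows and columns beyond new_h x new_w."""
    return [row[:new_w] for row in image[:new_h]]


image = [[1, 2], [3, 4]]
zero_padded = pad(image, 1, 1, value=0)   # predefined sample value
edge_padded = pad(image, 1, 1)            # values taken from the image
restored = crop(edge_padded, 2, 2)        # cropping undoes the padding
```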

[0196] Resizing: Resizing is a general term for changing the size of an input image. This can be done using either padding or cropping. It can also be done through resizing operations that use interpolation. Hereafter, resizing will also be referred to as rescaling.

[0197] Integer division: Integer division is division in which the fractional part (remainder) is discarded.
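A short illustration of integer division in Python is shown below. The values 513 and 4 are chosen to match a size and downsampling ratio used as a numerical example elsewhere in this description. Note that Python's `//` operator rounds toward negative infinity, which coincides with discarding the fractional part only for non-negative operands such as picture sizes.

```python
size = 513   # size in the dimension being downsampled
ratio = 4    # downsampling ratio

quotient = size // ratio    # integer division discards the fractional part
remainder = size % ratio    # the part that integer division throws away
exact = size / ratio        # true division keeps the fraction
```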

[0198] Convolution: Convolution is given by the following general formula, where f() can be defined as an input signal and g() can be defined as a filter.

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n-m]$$
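For finite sequences, discrete convolution sums the products f[m]·g[n−m] over all m for which both indices are valid. The sketch below assumes that samples outside the given ranges are zero (the "full" convolution); the function name and the sample signal and filter are illustrative.

```python
def convolve(f, g):
    """Full discrete convolution of two finite sequences.

    Samples outside either sequence are treated as zero (an assumption
    of this sketch).
    """
    out = [0] * (len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out


signal = [1, 2, 3, 4]   # input signal f
kernel = [1, 0, -1]     # filter g
result = convolve(signal, kernel)
```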

[0199] Downsampling layer: A processing layer, such as a layer in a neural network, that results in a reduction of at least one of the dimensions of the input. Generally, the input can have three or more dimensions, and these dimensions can include the number of channels, width, and height. However, this disclosure is not limited to such signals. Rather, signals that may have one or two dimensions (such as audio signals or audio signals with multiple channels) can be processed. A downsampling layer typically refers to a reduction in the width and / or height dimension. This can be implemented by operations such as convolution, averaging, and max pooling. Other downsampling methods are also possible, and the present invention is not limited in this respect.

[0200] Upsampling layer: A processing layer, such as a layer in a neural network, that results in an increase in one of the dimensions of the input. Generally, the input can have three or more dimensions, which can include the number of channels, width, and height. An upsampling layer typically refers to an increase in the width and / or height dimension. This can be implemented with operations such as deconvolution and replication. Other upsampling methods are also possible, and the present invention is not limited in this respect.

[0201] Several deep learning-based image and video compression algorithms follow the variational autoencoder (VAE) framework, for example, G-VAE: A Continuously Variable Rate Deep Image Compression Framework (Ze Cui, Jing Wang, Bo Bai, Tiansheng Guo, Yihui Feng), available at https://arxiv.org/abs/2003.02012.

[0202] The VAE framework can be regarded as a nonlinear transform coding model.

[0203] The transformation process can be divided into four main parts. Figure 4 illustrates the VAE framework. In Figure 4, the encoder 601 maps the input image x to a latent representation (denoted by y) via the function y=f(x). This latent representation will hereafter be referred to as part of the "latent space" or a point within it. The function f() is a transformation function that converts the input signal x to a more compressible representation y. The quantizer 602 converts the latent representation y into a quantized latent representation with (discrete) values.

number

number

number

[0204] The latent space can be understood as a compressed representation of data where similar data points are closer to each other within the latent space. The latent space is useful for learning data features and finding simpler representations of the data for analysis.

[0205] Quantized latent representation T,

number

number

[0206] Furthermore, the image is a reconstructed representation of the quantized latent representation.

number

number

number

number

number

[0207] In Figure 4, component AE 605 is an arithmetic coding module and a quantized latent representation

number

number

number

[0208] Arithmetic decoding (AD) 606 is the process of reversing the binarization process, where the binary numbers are converted back to sample values. Arithmetic decoding is provided by the arithmetic decoding module 606.

[0209] Please note that this disclosure is not limited to this particular framework. Furthermore, this disclosure is not limited to image or video compression, but may also apply to object detection, image generation, and recognition systems.

[0210] Figure 4 shows two interconnected subnetworks. In this context, a subnetwork is a logical division between parts of an entire network. For example, in Figure 4, modules 601, 602, 604, 605, and 606 are referred to as the "encoder/decoder" subnetwork. The "encoder/decoder" subnetwork is responsible for encoding (generating) and decoding (parsing) the first bitstream, "Bitstream 1". The second subnetwork in Figure 4, including modules 603, 608, 609, 610, and 607, is referred to as the "hyperencoder/decoder" subnetwork. The second subnetwork is responsible for generating the second bitstream, "Bitstream 2". The two subnetworks have different purposes. The first subnetwork is responsible for: • transforming the input image x into its latent representation y (which is easier to compress than x) (601), • quantizing the latent representation y into a quantized latent representation

number

number

number

[0211] The purpose of the second subnetwork is to obtain the statistical properties of the samples in "bitstream 1" (e.g., mean, variance, and correlation between samples in bitstream 1) so that the compression of bitstream 1 by the first subnetwork becomes more efficient. The second subnetwork generates a second bitstream, "bitstream 2," which contains the information (e.g., mean, variance, and correlation between samples in bitstream 1).

[0212] The second network is a quantized latent representation

number

number

number

number

number

number

number

number

number

number

number

number

number

[0213] Figure 4 illustrates an example of a VAE (Variational Autoencoder), the details of which may differ in different implementations. For example, in a particular implementation, additional components may exist to more efficiently obtain the statistical properties of the samples of bitstream 1. In one such implementation, there may be a context modeler that aims to extract cross-correlation information of bitstream 1. The statistical information provided by the second subnetwork may be used by the AE (Arithmetic Encoder) 605 and AD (Arithmetic Decoder) 606 components.

[0214] Figure 4 shows an encoder and decoder in a single diagram. As will be obvious to those skilled in the art, encoders and decoders may be, and very often are, incorporated into different devices.

[0215] Figures 7 and 8 show the encoder components and the decoder components of the VAE framework, respectively, in isolation. What is described below with respect to Figures 7 and 8 may also apply to the neural networks, encoders, and decoders provided below, particularly with respect to Figures 19, 22, 25, and 26.

[0216] As input, the encoder receives a picture, according to some embodiments. The input picture may include one or more channels, such as a color channel or other types of channels, for example, a depth channel or a motion information channel. The outputs of the encoder (shown in Figure 7) are bitstream 1 and bitstream 2. Bitstream 1 is the output of the encoder's first subnetwork, and bitstream 2 is the output of the encoder's second subnetwork.

[0217] Similarly, in Figure 8, two bitstreams, namely bitstream 1 and bitstream 2, are received as input, and the resulting image is reconstructed (decoded).

number

[0218] As described above, a VAE can be divided into different logical units that perform different actions. This is illustrated in Figures 7 and 8, where Figure 7 shows the components involved in encoding a signal such as video and providing encoded information. This encoded information is then received, for example, by the decoder components shown in Figure 8 for decoding. The encoder and decoder components indicated by numerals 9xx and 10xx may correspond in function to the components referred to above in Figure 4 and indicated by numerals 6xx.

[0219] Specifically, as shown in Figure 7, the encoder includes an encoder 901 that converts input x into a signal y, which is then provided to a quantizer 902. The quantizer 902 provides information to the arithmetic coding module 905 and the hyperencoder 903. The hyperencoder 903 provides the aforementioned bitstream 2 to the hyperdecoder 907, which then signals the information to the arithmetic coding module 905.

[0220] Encoding can utilize convolution, as will be explained in more detail below with respect to Figure 19. Decoding can utilize deconvolution, as will be explained in more detail below with respect to Figures 19 and 22.

[0221] The output of the arithmetic coding module is bitstream 1. Bitstreams 1 and 2 are the outputs of the coded signals, which are then provided (transmitted) to the decoding process.

[0222] Unit 901 is referred to as the "encoder," but the complete subnetwork shown in Figure 7 can also be called the "encoder." The term encoder generally refers to a unit (module) that transforms an input into an encoded (e.g., compressed) output. From Figure 7, it can be seen that unit 901 can actually be considered the core of the entire subnetwork, as it performs the transformation of input x to y, which is a compressed version of x. Compression in encoder 901 can be achieved, for example, by applying a neural network, or generally any processing network having one or more layers. In such a network, compression can be achieved by a cascade of processing steps that include downsampling, which reduces the size and/or number of channels of the input. Thus, the encoder is sometimes referred to as, for example, a neural network (NN)-based encoder.

[0223] The remaining parts in the diagram (quantization unit, hyperencoder, hyperdecoder, arithmetic encoder/decoder) are all responsible for improving the efficiency of the encoding process or for converting the compressed output y into a sequence of bits (bitstream). Quantization may be provided to further compress the output of the NN encoder 901 by lossy compression. The AE 905, in combination with the hyperencoder 903 and hyperdecoder 907 used to configure the AE 905, can perform binarization, which can further compress the quantized signal by lossless compression. Therefore, the entire subnetwork in Figure 7 can also be called the "encoder".

[0224] Most deep learning (DL)-based image / video compression systems reduce the dimensionality of a signal before converting it to binary digits (bits). For example, in the VAE framework, the encoder, which is a nonlinear transformation, maps the input image x to y, where y has smaller width and height than x. Since y has smaller width and height, and therefore smaller size, the dimensionality (size) of the signal is reduced, and thus it is easier to compress the signal y. Generally, it should be noted that encoders do not necessarily need to reduce the size in both (or generally all) dimensions. Rather, some exemplary implementations may provide encoders that reduce the size in only one dimension (or generally a subset thereof).

[0225] The general principle of compression is illustrated in Figure 5. The latent space, which is the output of the encoder and the input of the decoder, represents the compressed data. Note that the size of the latent space can be much smaller than the size of the input signal. Here, the term size may refer to the resolution, for example, the number of samples in the feature map output by the encoder. The resolution may be given as the product of the number of samples per dimension (e.g., width × height × number of channels of the input image or feature map).

[0226] The reduction in the size of the input signal is illustrated in Figure 5, which represents a deep learning-based encoder and decoder. In Figure 5, the input image x corresponds to the input data, which is the input to the encoder. The transformed signal y corresponds to the latent space, which is smaller than the input signal in at least one dimension. Each row of circles represents a layer in the processing chain of the encoder or decoder. The number of circles in each layer indicates the size or dimensionality of the signal in that layer.

[0227] Figure 5 shows that the encoding operation corresponds to reducing the size of the input signal, and the decoding operation corresponds to reconstructing the image to its original size.

[0228] One way to reduce signal size is downsampling. Downsampling is a process in which the sampling rate of the input signal is reduced. For example, if the size of the input image is h and w, and the output of downsampling is h2 and w2, then at least one of the following is true: · h2 < h · w2 < w

[0229] Reducing the signal size is typically done stepwise, rather than all at once, along a chain of processing layers. For example, if an input image x has dimensions h and w (representing height and width) and a latent space y has dimensions h / 16 and w / 16, then reducing the size might be done in four layers during encoding, with each layer reducing the signal size by half in each dimension.
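The stepwise reduction can be traced layer by layer. The sketch below assumes four layers that each halve both dimensions, as in the example above; the function name and the 512 × 768 input size are illustrative, and the input is chosen so that every intermediate size is an integer.

```python
def stepwise_sizes(h, w, num_layers=4, factor=2):
    """Return the (height, width) after each downsampling layer.

    Uses integer division, so sizes that are not divisible by the
    factor would be silently rounded down in this sketch.
    """
    sizes = [(h, w)]
    for _ in range(num_layers):
        h, w = h // factor, w // factor
        sizes.append((h, w))
    return sizes


sizes = stepwise_sizes(512, 768)
latent_h, latent_w = sizes[-1]  # equals (h / 16, w / 16) for this input
```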

[0230] Several deep learning-based video / image compression methods employ multiple downsampling layers. For example, the VAE framework in Figure 6 utilizes six downsampling layers, marked 801-806. Layers containing downsampling are indicated by downward arrows in the layer description. Layer Description

number

number

number

number

[0231] Figure 6 also shows a decoder including upsampling layers 807-812. A further layer 820, implemented as a convolutional layer but not providing upsampling to the received input, is provided between the upsampling layers 811 and 810 in the order of input processing. A corresponding convolutional layer 830 is also shown for the decoder. Such layers may be provided within the NN to perform operations on inputs that do not change the size of the input but change certain characteristics. However, such layers are not required.

[0232] Looking at the processing order of bitstream 2 as it passes through the decoder, the upsampling layers are executed in reverse order, that is, from upsampling layer 812 to upsampling layer 807. Each upsampling layer, here,

number

[0233] In the first subnetwork, several convolutional layers (801-803) are followed by generalized divisive normalization (GDN) on the encoder side and inverse GDN (IGDN) on the decoder side. In the second subnetwork, the activation function applied is ReLU. Note that this disclosure is not limited to such implementations, and generally, other activation functions may be used instead of GDN or ReLU.

[0234] Image and video compression systems generally cannot handle arbitrary input image sizes. This is because some processing units in a compression system (such as transform units and motion compensation units) operate on a minimum unit size, and if the input image size is not an integer multiple of the minimum processing unit, the image cannot be processed.

[0235] For example, HEVC specifies four transform unit (TU) sizes (4x4, 8x8, 16x16, and 32x32) for coding prediction residuals. Since the minimum transform unit size is 4x4, it is not possible to process an input image with a size of 3x3 using an HEVC encoder and decoder. Similarly, if the image or picture size is not a multiple of 4 in one dimension, the image or picture cannot be processed, because it cannot be divided into sizes that can be processed by the valid transform units (4x4, 8x8, 16x16, and 32x32). Therefore, a requirement of the HEVC standard is that the size of the input image or picture must be a multiple of the minimum coding unit size, which is 8x8. Otherwise, the input image or picture cannot be compressed by HEVC. Similar requirements are imposed by other codecs. It may be desirable to maintain such restrictions in order to utilize existing hardware or software, or to maintain some or even partial interoperability with existing codecs. However, this disclosure is not limited to any particular transform block size.

[0236] Some DNN (Deep Neural Network) or NN (Neural Network) based image and video compression systems utilize multiple downsampling layers. In Figure 6, for example, four downsampling layers are included in the first subnetwork (layers 801-804), and two additional downsampling layers are included in the second subnetwork (layers 805-806). Therefore, given the sizes of the input image as w and h (representing width and height), the outputs of the first subnetwork are w / 16 and h / 16, and the outputs of the second subnetwork are w / 64 and h / 64.

[0237] In deep neural networks, the term "deep" usually refers to the number of processing layers applied sequentially to the input. While there are no clear definitions or guidelines for which networks should be called deep networks, neural networks with a large number of layers are generally referred to as deep neural networks. Therefore, for the purposes of this application, there is no significant difference between DNNs and NNs. A DNN may refer to an NN having two or more layers.

[0238] During downsampling, for example, when convolution is applied to the input, a fractional (final) size of the encoded picture may be obtained. Such a fractional size cannot be reasonably processed by subsequent layers of the neural network or by the decoder.

[0239] In other words, some downsampling operations (such as convolution) may expect (e.g., by design) that the size of the input to a particular layer of the neural network satisfies certain conditions, and as a result, the operations that perform downsampling, or operations that follow downsampling, within the layers of the neural network are still clearly defined mathematical operations. For example, the downsampling ratio.

number

[0240] To provide a numerical example, the downsampling ratio of a layer may be 4. The first input has a size of 512 in the dimension to which downsampling is applied. 512 is an integer multiple of 4, since 128 × 4 = 512. Therefore, the input can be processed by the downsampling layer and yield a reasonable output. The second input may have a size of 513 in the dimension to which downsampling is applied. 513 is not an integer multiple of 4, and therefore this input cannot be processed reasonably by the downsampling layer or subsequent downsampling layers if they, for example, by design, expect a certain input size (e.g., 512). Taking this into consideration, rescaling (also called resizing) may be applied before the input is processed by the neural network to ensure that the input can be processed reasonably by each layer of the neural network (according to a given layer input size), even if the input size is not always the same. This rescaling involves changing or adapting the actual size of the input to the neural network (e.g., to the input layer of the neural network), so that the above condition is satisfied for all downsampling layers of the neural network. This rescaling reduces the size of the input in the dimension to which downsampling is applied.

number

[0241] This ensures that the size of the inputs to the neural network is such that each layer can process its respective input, for example, according to a predetermined input size configuration for that layer.

[0242] However, there are limits to reducing the size of the picture to be encoded by providing such rescaling, and accordingly, the size of the encoded picture that can be provided to the decoder for reconstructing the encoded information also has a lower limit. Furthermore, the approaches provided so far may add a considerable amount of entropy to the bitstream (when increasing the size of the bitstream by rescaling) or a considerable amount of information loss (when decreasing the size of the bitstream by rescaling). Both can negatively affect the quality of the decoded bitstream.

[0243] Therefore, it is difficult to obtain high-quality encoded / decoded bitstreams and the data they represent, while simultaneously providing encoded bitstreams with reduced size.

[0244] The size of the output of each layer in the network cannot be fractional (the number of rows and columns of samples must be an integer), which imposes constraints on the input image size. In Figure 6, to ensure reliable processing, the input image size must be an integer multiple of 64 in both the horizontal and vertical directions. Otherwise, the output size of the second network would not be an integer.

[0245] To solve this problem, the input image can be padded with zeros so that its size is a multiple of 64 samples in each direction. According to this solution, the input image size can be expanded in width and height by the following amounts:

$$h_{\mathrm{diff}} = \left\lceil \frac{h}{64} \right\rceil \cdot 64 - h, \qquad w_{\mathrm{diff}} = \left\lceil \frac{w}{64} \right\rceil \cdot 64 - w$$

[0246] Another possibility for solving the above problem is to crop the input image, that is, to discard rows and columns of samples from the edges of the input image, so that the input image size is a multiple of 64 samples. The minimum number of rows and columns that need to be removed by cropping can be calculated as follows:

$$h_{\mathrm{diff}} = h - \left\lfloor \frac{h}{64} \right\rfloor \cdot 64, \qquad w_{\mathrm{diff}} = w - \left\lfloor \frac{w}{64} \right\rfloor \cdot 64$$

[0247] Using the above, the new height (h_new) and width (w_new) of the input image are as follows:
For padding:
· h_new = h + h_diff
· w_new = w + w_diff
For cropping:
· h_new = h − h_diff
· w_new = w − w_diff
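The padding and cropping rules for h_new and w_new can be sketched in Python as follows. This is a minimal illustration only, assuming the required multiple of 64 from Figure 6; the function names `pad_amount`, `crop_amount`, and `resized_shape` are hypothetical and do not appear in the disclosure.

```python
def pad_amount(size: int, multiple: int = 64) -> int:
    """Smallest number of samples to add so that size becomes a multiple."""
    return (-size) % multiple


def crop_amount(size: int, multiple: int = 64) -> int:
    """Smallest number of samples to discard so that size becomes a multiple."""
    return size % multiple


def resized_shape(h: int, w: int, mode: str, multiple: int = 64) -> tuple:
    """Return (h_new, w_new) after padding or cropping to the given multiple."""
    if mode == "pad":
        return h + pad_amount(h, multiple), w + pad_amount(w, multiple)
    if mode == "crop":
        return h - crop_amount(h, multiple), w - crop_amount(w, multiple)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(resized_shape(240, 416, "pad"))   # (256, 448)
    print(resized_shape(240, 416, "crop"))  # (192, 384)
```

For a size that is already a multiple of 64, both amounts are zero and the shape is left unchanged.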

[0248] This is also shown in Figures 10 and 11. Figure 10 shows that the encoder and decoder (collectively shown as 1200) may include several downsampling and upsampling layers. Each layer applies downsampling by a factor of 2 or upsampling by a factor of 2. Furthermore, the encoder and decoder may include further components such as a generalized divisive normalization (GDN) 1201 on the encoder side and an inverse GDN (IGDN) 1202 on the decoder side. In addition, both the encoder and decoder may include one or more ReLUs, specifically leaky ReLUs 1203. The encoder may also be provided with a factorized entropy model 1205, and the decoder with a Gaussian entropy model 1206. Furthermore, multiple convolution masks 1204 may be provided. In addition, in the embodiments of Figures 10 and 11, the encoder includes a universal quantizer (UnivQuan) 1207, and the decoder includes an attention module 1208. For ease of reference, functionally corresponding components have corresponding numbers in Figure 11.

[0249] The total number of downsampling operations and strides defines conditions regarding the input channel size, i.e., the size of the input to the neural network.

[0250] Here, if the input channel size is an integer multiple of 64 = 2 × 2 × 2 × 2 × 2 × 2, the channel size remains an integer after all downsampling operations have been performed. By applying the corresponding upsampling operation in the decoder during upsampling, and by applying the same rescaling at the end of processing the input through the upsampling layer (for example, using the FWD sizing module shown in this figure), the output size is again identical to the input size in the encoder.

[0251] This results in a reliable reconstruction of the original input.

[0252] Figure 11 shows a more general example of what was described in Figure 10. This example also shows an encoder and decoder, which together are shown as 1300. The m downsampling layers have downsampling ratios s_i, and the corresponding upsampling layers have corresponding upsampling ratios. Here, the input channel size is

number

[0253] As mentioned above, this mode, which changes the size of the input, can still have some drawbacks.

[0254] In Figure 6, the bitstreams indicated by "Bitstream 1" and "Bitstream 2" have sizes that are, respectively,

number

number

[0255] The purpose of compression is to reduce the size of the bitstream while maintaining high quality of the reconstructed image, so in order to reduce the bitrate, h_new and w_new should be as small as possible.

[0256] Therefore, the problem with "padding with zeros" is the increase in bitrate due to the larger input size. In other words, the size of the input image increases by adding redundant data to the input image, which means that more side information must be sent from the encoder to the decoder for the reconstruction of the input signal. As a result, the size of the bitstream increases.

[0257] As an example, using the encoder / decoder pair in Figure 6, if the input image has a size of 416×240, a common image size format known as WQVGA (Wide Quarter Video Graphics Array), the input image must be padded to a size of 448×256, which is equivalent to a 15% increase in bitrate due to the inclusion of redundant data.
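The WQVGA figure can be checked with a few lines of Python. The sketch below assumes, as a simplification, that the bitrate scales with the number of samples; the helper names `padded_size` and `padding_overhead` are hypothetical.

```python
import math


def padded_size(size: int, multiple: int = 64) -> int:
    """Round size up to the next integer multiple."""
    return math.ceil(size / multiple) * multiple


def padding_overhead(width: int, height: int, multiple: int = 64) -> float:
    """Relative increase in sample count caused by padding (0.15 means 15%)."""
    padded = padded_size(width, multiple) * padded_size(height, multiple)
    return padded / (width * height) - 1.0


if __name__ == "__main__":
    w, h = 416, 240  # WQVGA
    print(padded_size(w), padded_size(h))   # 448 256
    print(f"{padding_overhead(w, h):.1%}")  # 14.9%
```

The padded image holds 448 × 256 = 114688 samples against 416 × 240 = 99840 in the original, which rounds to the 15% quoted above.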

[0258] The problem with the second approach (cropping the input image) is the loss of information. The purpose of compression and decompression is to transmit the input signal while maintaining high fidelity, so discarding part of the signal is counterproductive. Therefore, cropping is not advantageous unless it is known that there are unwanted parts in the input signal.

[0259] For example, resizing of the input image is performed before all downsampling or upsampling layers in a DNN-based picture or video compression system. More specifically, if a downsampling layer has a downsampling ratio of 2 (the input size is halved at the layer's output) and the input to that layer has an odd number of sample rows or columns, input resizing is applied to the layer's input; if the number of sample rows or columns is even (a multiple of 2), no padding is applied.

[0260] Furthermore, if a corresponding downsampling layer applies resizing at its input, the resizing operation may be applied last, for example, at the output of an upsampling layer. The corresponding layers of a downsampling layer can be found by counting the number of upsampling layers starting from the reconstructed image and counting the number of downsampling layers starting from the input image. This is illustrated by Figure 18, where upsampling layer 1 and downsampling layer 1 are corresponding layers, upsampling layer 2 and downsampling layer 2 are corresponding layers, and so on.

[0261] The resizing operations applied at the input of the downsampling layer and the resizing operations applied at the output of the upsampling layer are complementary so that the size of the data at both outputs remains the same.

[0262] As a result, the increase in bitstream size is minimized. An exemplary embodiment can be illustrated with reference to Figure 12, in contrast to Figure 9 which illustrates other approaches. In Figure 9, the input is resized before it is provided to the DNN so that the resized input can be processed throughout the DNN. The example shown in Figure 9 can be implemented with the encoder / decoder described in Figure 6.

[0263] In Figure 12, an input image of arbitrary size is given to the neural network. In this example, the neural network includes N downsampling layers, and each layer i (1 <= i <= N) has a downsampling ratio r_i, where "<=" indicates less than or equal to. The downsampling ratios r_i are not necessarily the same for different values of i, but in some embodiments they may all be equal, for example, r_i = r = 2. In Figure 12, downsampling layers 1 to M are summarized as subnet 1 of the downsampling layers. Subnet 1 provides bitstream 1 as its output. However, this summary of the downsampling layers is for illustrative purposes only in this context. A second subnet 2, containing layers M+1 to N, provides bitstream 2 as its output.

[0264] In this example, the input to the downsampling layer, for example, downsampling layer M, is provided to the downsampling layer, but after being processed by the previous downsampling layer (in this case, layer M-1), the input to downsampling layer M is sized

number

[0265] In Figure 9, the input image is padded to accommodate all the downsampling layers that will process the data sequentially (this is a form of image resizing). In Figure 9, the downsampling ratio is chosen to be equal to 2 for illustrative purposes. In this case, there are N layers that each perform downsampling by a factor of 2, so the input image is padded (zero-filled) so that its size becomes an integer multiple of 2^N. Note that in this specification, an integer "multiple" may still be equal to 1; that is, "multiple" does not imply plurality but multiplication (e.g., by one or more).

[0266] An example is shown in Figure 12. In Figure 12, input resizing is applied before each downsampling layer. The input is resized to be an integer multiple of the downsampling ratio of each layer. For example, if the downsampling ratio of a layer is 3:1 (input size:output size), i.e., a ratio of 3, the input of the layer is resized to be a multiple of 3.
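The difference between resizing once at the network input (Figure 9) and resizing before each downsampling layer (Figure 12) can be illustrated with a one-dimensional Python sketch. It assumes padding as the resizing method; the function names are hypothetical.

```python
import math


def sizes_global_padding(size: int, ratios: list) -> list:
    """Pad once at the input to a multiple of the product of all ratios,
    then return the size after each downsampling layer."""
    total = math.prod(ratios)
    size = math.ceil(size / total) * total
    sizes = []
    for r in ratios:
        size //= r
        sizes.append(size)
    return sizes


def sizes_per_layer_padding(size: int, ratios: list) -> list:
    """Pad before each layer to a multiple of that layer's ratio only,
    then return the size after each downsampling layer."""
    sizes = []
    for r in ratios:
        size = math.ceil(size / r) * r  # resize the input of this layer
        size //= r                      # downsample by ratio r
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    ratios = [2] * 6
    print(sizes_global_padding(416, ratios))     # [224, 112, 56, 28, 14, 7]
    print(sizes_per_layer_padding(416, ratios))  # [208, 104, 52, 26, 13, 7]
```

In this example the final size is the same in both cases, but the intermediate tensors are smaller with per-layer resizing, so any bitstream derived from an intermediate layer carries fewer redundant samples.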

[0267] Some examples can also be applied to Figure 6. In Figure 6, there are six layers with downsampling, namely layers 801, 802, 803, 804, 805, and 806. All of the downsampling layers have a downsampling factor of 2. In one example, input resizing is applied before all six layers. In Figure 6, resizing is also applied after each of the upsampling layers (807, 808, 809, 810, 811, and 812) in a corresponding manner (as described in the paragraph above). This means that resizing applied before a downsampling layer at a particular order or position in the encoder's neural network is applied at the corresponding position in the decoder.

[0268] In some embodiments, there may be two options for rescaling the input, one of which may be selected depending on the situation or conditions, as will be further described below, for example. These embodiments will be described with reference to Figures 13 to 15.

[0269] The first option 1501 may include, for example, padding the input with zeros or redundant information from the input itself to increase the input size to a size that matches an integer multiple of the downsampling ratio. On the decoder side, for rescaling, this option may employ cropping to reduce the input size to, for example, a size that matches the target input size of the preceding upsampling layer.

[0270] This option can be implemented in a computationally efficient manner, but it only allows the size to be increased on the encoder side.

[0271] The second option 1502 allows the use of interpolation in the encoder and decoder to rescale / resize the input. This means that interpolation can be used to increase the input size to an integer multiple of the downsampling ratio of all downsampling layers, or to an intended size such as the target input size of all upsampling layers, or to decrease the input size to an integer multiple of the combined downsampling ratio of all downsampling layers of the NN, or to an intended size such as the target input size of all upsampling layers of the NN. This makes it possible to apply resizing in the encoder by increasing or decreasing the input size. Furthermore, option 1502 allows the use of different interpolation filters, thereby providing spectral characteristic control.

[0272] The different options 1501 and 1502 may be signaled in the bitstream, for example, as side information. The distinction between the first option (option 1) 1501 and the second option (option 2) 1502 may be signaled using an indicator such as the syntax element methodIdx, which can take one of two values. For example, the first value (e.g., 0) indicates padding / cropping, and the second value (e.g., 1) indicates that interpolation is used for resizing. For example, a decoder may receive a bitstream that encodes a picture and contains side information, potentially including the element methodIdx. By parsing this bitstream, the side information can be obtained and the value of methodIdx can be derived. Based on the value of methodIdx, the decoder may proceed with the corresponding resizing or rescaling method, using padding / cropping if methodIdx has the first value, or using interpolation if methodIdx has the second value.

[0273] This is shown in Figure 13. Depending on whether the value of methodIdx is 0 or 1, either clipping (including either padding or cropping) or interpolation is selected.

[0274] While the embodiment in Figure 13 refers to the selection or determination between clipping (including padding / cropping) and interpolation based on methodIdx as a method used to achieve resizing, it should be noted that the present invention is not limited thereto. The method described in relation to Figure 13 can also be applied when the first option 1501 is interpolation to increase the size during the resizing operation, and the second option 1502 is interpolation to decrease the size during the resizing operation. Any two or more different resizing methods (depending on the bit length of methodIdx) as described above and below may be selected from and signaled using methodIdx. In general, methodIdx does not need to be a separate syntax element; it may be indicated or coded together with one or more other parameters.

[0275] As shown in Figure 14, further indications or flags may be provided. In addition to methodIdx, a Size Change flag (1 bit) SCIdx may be conditionally signaled only in the case of the second option 1502. In the embodiment of Figure 14, the second option 1502 includes the use of interpolation to achieve resizing. In Figure 14, the second option 1502 is selected when methodIdx=1. The Size Change flag SCIdx may have a third or fourth value, which may be either 0 (e.g., for the third value) or 1 (e.g., for the fourth value). In this embodiment, "0" indicates downsizing and "1" indicates upsizing. Thus, when SCIdx is 0, interpolation to achieve resizing is performed so that the size of the input becomes smaller. If SCIdx is 1, interpolation to achieve resizing is performed so that the size of the input becomes larger. Conditional coding of SCIdx can provide a more concise and efficient syntax. However, the present invention is not limited by such conditional syntax, and SCIdx may be specified independently of methodIdx, or together with methodIdx (for example, within a common syntax element which may take only a subset of values from the values representing all combinations of SCIdx and methodIdx).

[0276] Similar to the indication methodIdx, SCIdx can also be obtained by the decoder by parsing the bitstream, which potentially also encodes the picture to be reconstructed. Once a value for SCIdx is obtained, downsizing or upsizing can be selected.

[0277] In addition to, or instead of, the above indications, an additional (side) indication for the resize filter index RFIdx may be signaled (indicated within the bitstream), as shown in Figure 15.

[0278] In some embodiments, RFIdx may be conditionally indicated for a second option 1502, which may include signaling when methodIdx=1 and not signaling when methodIdx=0. RFIdx may have a size greater than 1 bit and may, for example, signal which interpolation filter is used in the interpolation to achieve resizing, depending on its value. Alternatively, or in addition, RFIdx may specify filter coefficients from a plurality of interpolation filters, which may be, for example, bilinear, bicubic, Lanczos3, Lanczos5, and Lanczos8, among others.

[0279] As described above, any one of methodIdx, SCIdx, and RFIdx, any two of them, or all of them may be given in the bitstream that also encodes the picture to be reconstructed, or in an additional bitstream. The decoder may then parse the respective bitstream and obtain the values of methodIdx and / or SCIdx and / or RFIdx. Depending on the values, the actions described above may be taken.

[0280] The filter used for interpolation to achieve resizing can be determined, for example, by the scaling ratio.

[0281] As shown in item 1701 in the lower right of Figure 15, the value of RFIdx can be explicitly signaled. Alternatively, or in addition, RFIdx may be obtained from a lookup table such that RFIdx = LUT(SCIdx).

[0282] In another example, there are two lookup tables, one for upsizing and one for downsizing. In this case, LUT1(SCIdx) shows the resize filter when downsizing is selected, and LUT2(SCIdx) shows the resize filter for upsizing.

[0283] In general, the present invention is not limited to any particular method of signaling for RFIdx. It may be signaled individually and independently of other elements, or together with them.

[0284] The above directives methodIdx, SCIdx, and RFIdx are provided as a nested structure in which the existence of SCIdx and RFIdx may depend on the value of methodIdx. However, each of methodIdx, SCIdx, and RFIdx may be provided independently, even if one or more of the other directives are not provided.
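The nested, conditional signaling of methodIdx, SCIdx, and RFIdx can be sketched as a small parser. This is an illustration under stated assumptions only: the disclosure does not define a binary syntax, so the bit order, the 3-bit width chosen for RFIdx (`RF_IDX_BITS`), and the helper `BitReader` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

RF_IDX_BITS = 3  # assumed width; the text only says RFIdx may exceed 1 bit


class BitReader:
    """Reads bits MSB-first from a bytes object (hypothetical helper)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.pos = 0  # current position in bits

    def read(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.pos // 8]
            bit = (byte >> (7 - self.pos % 8)) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value


@dataclass
class ResizeParams:
    method_idx: int                # 0: padding / cropping, 1: interpolation
    sc_idx: Optional[int] = None   # 0: downsizing, 1: upsizing
    rf_idx: Optional[int] = None   # index of the interpolation filter


def parse_resize_params(reader: BitReader) -> ResizeParams:
    """Parse methodIdx, then SCIdx and RFIdx only if methodIdx == 1."""
    method_idx = reader.read(1)
    if method_idx == 0:
        return ResizeParams(method_idx)
    sc_idx = reader.read(1)
    rf_idx = reader.read(RF_IDX_BITS)
    return ResizeParams(method_idx, sc_idx, rf_idx)


if __name__ == "__main__":
    # Bits 1, 1, 010: interpolation, upsizing, filter index 2
    print(parse_resize_params(BitReader(bytes([0b11010000]))))
```

When methodIdx is 0, no further bits are consumed, which reflects the more compact syntax obtained by conditional coding.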

[0285] Furthermore, according to some embodiments, a further indication may be provided instead of or in addition to these indications, which may be an index indicating an entry in a lookup table. The lookup table (LUT) may contain multiple entries, each entry specifying a method of resizing. The LUT may contain entries specifying that padding, cropping, or interpolation should be used. In addition, or alternatively, the LUT may include entries specifying that a particular type of padding (reflection padding, replication padding, or zero padding) should be used. Likewise, instead of or in addition to entries specifying that interpolation should be used, the LUT may include entries specifying whether interpolation should increase or decrease the size, and / or entries specifying the filter to be used.

[0286] By way of example, a LUT may include four entries for padding / cropping: one specifying cropping, one specifying zero padding, one specifying replication padding, and one specifying reflection padding. Furthermore, the table may include entries for interpolation used to increase the size by resizing. Each of these entries can specify a different interpolation filter, which can include bilinear, bicubic, Lanczos3, Lanczos5, Lanczos8, and N-tap filters. This means there may be six entries specifying different ways to increase the size by interpolation (one for each filter). Additionally, six entries may be provided for decreasing the size by interpolation, each specifying the corresponding filter to be used in the interpolation. Thus, the index may take 16 different values corresponding to the 16 different entries in the LUT (four for the padding methods and cropping, plus six entries each for interpolation to increase the size using a specific filter and interpolation to decrease the size using a specific filter). The LUT may be available to both the decoder and the encoder so that, depending on the indicated value, each can determine how the image should be resized.
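A 16-entry table of this kind can be sketched as follows. The ordering of the entries and the names `RESIZE_LUT` and `lookup_resize_method` are assumptions made for illustration; the text only fixes the number and kinds of entries.

```python
FILTERS = ["bilinear", "bicubic", "lanczos3", "lanczos5", "lanczos8", "n_tap"]

# Each entry is (operation, variant). The index order is an assumption.
RESIZE_LUT = (
    [("crop", None), ("pad", "zero"), ("pad", "replication"), ("pad", "reflection")]
    + [("interpolate_up", f) for f in FILTERS]
    + [("interpolate_down", f) for f in FILTERS]
)


def lookup_resize_method(index: int) -> tuple:
    """Map a signaled index to the resizing method it specifies."""
    if not 0 <= index < len(RESIZE_LUT):
        raise IndexError(f"resize index out of range: {index}")
    return RESIZE_LUT[index]


if __name__ == "__main__":
    print(len(RESIZE_LUT))           # 16
    print(lookup_resize_method(4))   # ('interpolate_up', 'bilinear')
    print(lookup_resize_method(15))  # ('interpolate_down', 'n_tap')
```

With 16 entries, the index fits into 4 bits if coded with a fixed length.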

[0287] Figures 16 and 17 show several examples of resizing methods. Figures 16 and 17 illustrate three different types of padding operations and their performance. The horizontal axis in the figures represents the sample position, and the vertical axis represents the value of each sample.

[0288] It should be noted that the following description is illustrative only and is not intended to limit the present invention to any particular type of padding operation. The vertical lines indicate the boundaries of the input (a picture, according to the embodiment), and to the right of the boundary are the sample positions to which the padding operation is applied to generate new samples. These parts are also referred to below as “unavailable parts,” meaning that they are not present in the original input but are added by padding during a rescaling operation for further processing. To the left of the input boundary lines are the samples that are available and are part of the input. The three padding methods shown in the figure are replication padding, reflection padding, and zero padding. For downsampling operations performed according to some embodiments, the input to one or more downsampling layers of the NN is the padded information, i.e., the original input expanded by the applied padding.

[0289] In Figure 16, the unavailable positions (i.e., sample positions) that can be filled by padding are positions 4 and 5. With zero padding, the unavailable positions are filled with samples having a value of 0. With reflection padding, the sample value at position 4 is set equal to the sample value at position 2, and the value at position 5 is set equal to the value at position 1. In other words, reflection padding is equivalent to mirroring the available samples about position 3, which holds the last available sample at the input boundary. With replication padding, the sample value at position 3 is copied to positions 4 and 5. Different padding types may be preferred for different applications.
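The three padding types can be reproduced on a short one-dimensional signal. The sketch below assumes that the available samples occupy positions 0 to 3 and uses arbitrary sample values, not the values plotted in the figures; the function name `pad_right` is hypothetical.

```python
def pad_right(samples: list, count: int, mode: str) -> list:
    """Append `count` padded samples to the right of a 1-D signal."""
    if mode == "zero":
        extra = [0] * count
    elif mode == "replication":
        # Repeat the last available sample.
        extra = [samples[-1]] * count
    elif mode == "reflection":
        # Mirror about the last available sample, excluding that sample itself.
        if count > len(samples) - 1:
            raise ValueError("not enough samples to reflect")
        extra = [samples[-2 - k] for k in range(count)]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return samples + extra


if __name__ == "__main__":
    signal = [1, 3, 6, 10]  # positions 0..3; positions 4 and 5 are unavailable
    print(pad_right(signal, 2, "zero"))         # [1, 3, 6, 10, 0, 0]
    print(pad_right(signal, 2, "reflection"))   # [1, 3, 6, 10, 6, 3]
    print(pad_right(signal, 2, "replication"))  # [1, 3, 6, 10, 10, 10]
```

In the reflection case, position 4 receives the value of position 2 and position 5 receives the value of position 1, matching the description above.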

[0290] Specifically, the type of padding applied may depend on the task being performed. For example, padding or filling with zeros can be reasonable for computer vision (CV) tasks such as recognition or detection tasks. It adds no information, and therefore does not change the amount, value, or importance of the information already present in the original input.

[0291] Reflection padding can be a computationally easy technique because the added value only needs to be copied from the existing value along a defined "reflection line" (i.e., the boundary of the original input).

[0292] Replication padding (also known as repetition padding) can be preferable for compression tasks using convolutional layers because it preserves the continuity of most sample values and derivatives. The derivatives of the samples (including available and padded samples) are shown on the right side of Figures 16 and 17. For example, with reflection padding, the derivative of the signal shows a sharp change at position 4 (a value of -9 is obtained at this position for the exemplary values shown in the figure). Since smooth signals (signals with small derivatives) are easily compressed, reflection padding may not be desirable for video compression tasks.

[0293] In the illustrated example, replication padding exhibits the smallest change in the derivative. While this is advantageous from the standpoint of video compression tasks, it adds more redundant information at the boundaries. This means that the information at the boundaries may carry more weight than intended for other tasks, and therefore, in some implementations, the overall performance of zero-padding may outperform reflection padding.

[0294] Figure 18 shows a further example, where encoder 2010 and decoder 2020 are shown side by side. In the illustrated example, the encoder includes multiple downsampling layers 1 to N. The downsampling layers can be grouped together or form parts of subnetworks 2011 and 2012 of a neural network within encoder 2010. These subnetworks can be responsible for providing specific bitstreams 1 and 2, for example, which may be provided to decoder 2020. In this sense, the subnetworks of the downsampling layers of the encoder can form logical units that cannot be reasonably separated. As shown in Figure 18, the first subnetwork 2011 of encoder 2010 includes downsampling layers 1 to 3, each downsampling layer having its own downsampling ratio. The second subnetwork 2012 includes downsampling layers M to N, each having its own downsampling ratio.

[0295] Decoder 2020 has a corresponding structure of upsampling layers 1 to N. One subnetwork 2022 of decoder 2020 contains upsampling layers N to M, and the other subnetwork 2021 contains upsampling layers 3 to 1 (here, the numbering is done in descending order to match the decoder when viewed in terms of the processing order of each input).

[0296] As described above, the rescaling applied to the input before downsampling layer 2 of the encoder corresponds to the rescaling applied to the output of upsampling layer 2. This means that the size of the input to downsampling layer 2 is the same as the size of the output of upsampling layer 2.

[0297] More generally, a rescaling applied to the input of the downsampling layer n of an encoder corresponds to a rescaling applied to the output of the upsampling layer n, so that the size of the rescaled input is the same as the size of the rescaled output.
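The complementary behaviour of the encoder-side and decoder-side rescaling can be sketched in one dimension. The sketch assumes padding at the input of each downsampling layer and cropping at the output of the corresponding upsampling layer. For simplicity the target sizes are passed directly to the decoder function; how a decoder would learn them is not addressed here. All function names are hypothetical.

```python
import math


def encoder_sizes(size: int, ratios: list) -> tuple:
    """Return the final size and the unpadded input size of each downsampling layer."""
    layer_input_sizes = []
    for r in ratios:
        layer_input_sizes.append(size)     # size before rescaling at layer n
        padded = math.ceil(size / r) * r   # pad to a multiple of r
        size = padded // r                 # downsample by ratio r
    return size, layer_input_sizes


def decoder_size(latent: int, ratios: list, layer_input_sizes: list) -> int:
    """Upsample in reverse layer order, cropping each output to the size
    of the input of the corresponding downsampling layer."""
    size = latent
    for r, target in zip(reversed(ratios), reversed(layer_input_sizes)):
        size *= r                          # upsample by ratio r
        if size < target:
            raise ValueError("upsampled output smaller than target size")
        size = target                      # crop the output of upsampling layer n
    return size


if __name__ == "__main__":
    ratios = [2, 2, 2]
    latent, inputs = encoder_sizes(513, ratios)
    print(latent, inputs)                        # 65 [513, 257, 129]
    print(decoder_size(latent, ratios, inputs))  # 513
```

After cropping, the output of upsampling layer n has the same size as the input of downsampling layer n before padding, so the reconstructed size equals the original input size.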

[0298] Figure 19 shows a further exemplary embodiment of a neural network 2100, which may be part of an encoder as described in relation to Figure 25, for example, and is used to encode a picture according to embodiments of the present disclosure.

[0299] The neural network 2100 may include multiple layers 2110, 2120, 2130, and 2140 for this purpose. During encoding, for example, an image input as input 2101 is expected to be reduced in size by processing the input through subsequent layers of the neural network 2100. Finally, the encoded picture may be provided as output 2105. Specifically, the output may be a binarized version of the encoded picture that constitutes the bitstream 2105, and can be considered the output of the neural network 2100, or more generally, the encoder on which the neural network is implemented.

[0300] During this processing of the input through the neural network 2100, the input 2101, which may be a picture or some already processed version of a picture, is successively input to successive layers of the neural network 2100 in the processing order shown in the figure, thereby potentially yielding intermediate outputs 2102, 2103, and 2104 that are output by the current layer of the neural network and provided as input to the immediately following layer of the neural network. In the embodiment of Figure 19, one input 2101 is shown that is converted into a single output 2105 during processing by the neural network, but it is also possible that one or more intermediate outputs are provided by the neural network after the input has been processed in, for example, layer 2120. After processing the input in layer 2120, an intermediate bitstream or subbitstream may be output that is already smaller in size compared to the original input but has not been processed by the subsequent layers 2130 and 2140 of the neural network 2100. This may be provided, for example, when the encoder is implemented in the manner illustrated in Figures 4 and 7, and the encoder provides a first bitstream (bitstream 1) and a second bitstream (bitstream 2) as outputs. However, this is not mandatory and may be done depending on the circumstances.

[0301] According to this disclosure, the size reduction may be achieved by a neural network that includes one or more downsampling layers that apply downsampling to the received input. The neural network shown in Figure 19 includes four layers 2110, 2120, 2130, and 2140. Not all of these layers are necessarily implemented as downsampling layers. Some of the layers, for example layers 2130 and 2140, may be implemented as layers that do not apply downsampling to the input but process the input in other ways.

[0302] The downsampling layer may be associated with a downsampling ratio r that has an integer value greater than 1. When it receives an input of a given size S, the downsampling layer reduces the size of the input during processing to S / r.

[0303] For example, if a neural network includes six downsampling layers, each with a downsampling ratio r=2, the original size S of the input is reduced to S / 64.

[0304] Generally, the size of the output 2105 of a neural network may be denoted by P. According to this disclosure, the size P may generally be smaller than the size S of the input, taking the above into consideration.

[0305] When processing input 2101 via a neural network, the input size may preferably be an integer multiple of the product of the downsampling ratios of all downsampling layers. Downsampling layers typically apply matrix operations or similar operations that require processing an integer number of samples. When the input to a downsampling layer has a size S (and therefore the number of samples S) that is not an integer multiple of the downsampling ratio of that layer, reasonable processing of this input may not be possible.

[0306] For example, if the NN has a total of two downsampling layers (e.g., layers 2110 and 2120 in Figure 19), each with a downsampling ratio of 2 (along with other processing layers that do not perform downsampling), and the size of the input image is 1024 × 512, no problem is observed. After two downsampling operations, the resulting downsampled output is 256 × 128. However, if the input has a size of 1024 × 511, the expected size of the intermediate output 2102 after the first downsampling layer is 512 × 255.5, which is not an integer: it would imply a fractional (partial) sample that the NN cannot represent, so the input cannot be processed by the NN. This means that the NN in this example cannot process input images whose size is not a multiple of 4 × 4, where 4 in each dimension is the product of the downsampling ratios of the two downsampling layers in this example.
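The two cases in this example can be traced with exact arithmetic. This is a small sketch with hypothetical function names; sizes are kept as fractions so that a non-integer intermediate size is detected rather than silently rounded.

```python
from fractions import Fraction


def trace_sizes(shape: tuple, ratios: list) -> list:
    """Sizes of the input and of the output of each downsampling layer."""
    sizes = [tuple(Fraction(s) for s in shape)]
    for r in ratios:
        sizes.append(tuple(s / r for s in sizes[-1]))
    return sizes


def is_processable(shape: tuple, ratios: list) -> bool:
    """True if every intermediate size is an integer number of samples."""
    return all(
        s.denominator == 1
        for layer in trace_sizes(shape, ratios)
        for s in layer
    )


if __name__ == "__main__":
    print(is_processable((1024, 512), [2, 2]))  # True
    print(is_processable((1024, 511), [2, 2]))  # False
    print([float(s) for s in trace_sizes((1024, 511), [2, 2])[1]])  # [512.0, 255.5]
```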

[0307] The problem is illustrated above for a small number of downsampling layers (e.g., 2). However, image compression is a complex task (since images or pictures are usually quite large) and typically requires a deep neural network. This typically means that the number of downsampling layers in the NN is greater than 2, often much greater. This complicates matters further: for example, if the number of downsampling layers is 6 (each with a downsampling ratio of 2) and the neural network applies downsampling in two dimensions, the NN can only process input sizes that are multiples of 2^6 × 2^6 = 64 × 64. Most images acquired by different end-user devices do not meet this requirement.

[0308] To achieve downsampling, the downsampling layer may, for example, be implemented using convolution.

[0309] Such a convolution involves element-wise multiplication of the entries of the original input matrix (in this example, a matrix with 1024 × 512 entries, denoted M_ij) with a kernel K that is slid (shifted) across this matrix and typically has a size smaller than that of the input. The convolution of two discrete variables can be written as follows:

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$

[0310] Therefore, evaluating the function (f*g)[n] for all possible values of n is equivalent to sliding (shifting) a kernel or filter f[] over the input array g[] and performing element-wise multiplication at each shift position.

[0311] In the example above, kernel K is a 2×2 matrix that is run over the input with a stride of 2. The first entry D_11 of the downsampled bitstream D is obtained by calculating the inner product of kernel K and the submatrix having entries M_11, M_12, M_21, M_22. The next entry in the horizontal direction, D_12, is obtained by calculating the inner product of the kernel and the submatrix having entries M_13, M_14, M_23, M_24. The same is done in the vertical direction, so that ultimately each entry D_ij is obtained as the inner product of the corresponding submatrix of M and K, yielding a matrix D that has half the number of entries in each direction or dimension.
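The 2×2 kernel with a stride of 2 can be sketched in plain Python on a small 4×4 matrix standing in for the 1024 × 512 input. As is common in neural network frameworks, the sketch applies the kernel without flipping it (a cross-correlation); the function name `strided_conv2d` and the example values are illustrative assumptions.

```python
def strided_conv2d(m: list, k: list, stride: int) -> list:
    """Slide kernel k over matrix m with the given stride (no padding, no flip)."""
    kh, kw = len(k), len(k[0])
    out_h = (len(m) - kh) // stride + 1
    out_w = (len(m[0]) - kw) // stride + 1
    d = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Inner product of the kernel and the submatrix at this shift position.
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    acc += k[a][b] * m[i * stride + a][j * stride + b]
            row.append(acc)
        d.append(row)
    return d


if __name__ == "__main__":
    M = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16],
    ]
    K = [[1, 0], [0, 1]]
    print(strided_conv2d(M, K, stride=2))  # [[7, 11], [23, 27]]
```

The 4×4 input yields a 2×2 output: half the number of entries in each dimension, as described above. The first output entry is 1·1 + 0·2 + 0·5 + 1·6 = 7.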

[0312] In other words, the amount of shift used to obtain the convolutional output determines the downsampling ratio. If the kernel is shifted by 2 samples between each computation step, the output is downsampled by a factor of 2. A downsampling ratio of 2 can be expressed in the above formula as follows:

\[ (f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[2n - m] \]

[0313] The transposed convolution operation can be expressed mathematically in the same way as the convolution operation. Transposed convolution can be performed during the decoding of an encoded picture, as illustrated with respect to Figures 22-24. The term "transposed" reflects the fact that a transposed convolution operation corresponds to the inversion of a particular convolution operation. From an implementation standpoint, however, the transposed convolution operation can be implemented similarly by using the above formula. The upsampling operation using transposed convolution can be implemented by the following function:

\[ (f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g\!\left[\operatorname{int}\!\left(\frac{n}{u}\right) - m\right] \]

[0314] In the above formula, u corresponds to the upsampling ratio, and the int() function corresponds to conversion to an integer. The int() operation can be implemented, for example, as a rounding operation.

[0315] In the above formula, the values m and n can be scalar indices if the convolution kernel or filter f() and the input variable array g() are one-dimensional arrays. If the kernel and input array are multidimensional, m and n can also be understood as multidimensional indices.

[0316] This disclosure is not limited to downsampling or upsampling via convolution and deconvolution. Any possible method of downsampling or upsampling may be implemented in layers of a neural network (NN).

[0317] This process (downsampling) can be repeated if two or more downsampling layers are provided within the neural network to further reduce the size. This allows the encoded bitstream 2105 to be provided as the output from the neural network as shown in Figure 19. This repeated downsampling can be implemented in the encoder as described in Figures 6, 10, and 11.

[0318] The encoder, specifically the layers of the neural network 2100, is not limited to simply including downsampling layers that apply convolution; other downsampling layers that do not necessarily apply convolution to achieve a reduction in input size are also conceivable.

[0319] Furthermore, the layers of the neural network 2100 may include, or be associated with, additional units that perform other operations on the respective inputs and/or outputs of those corresponding layers of the neural network. For example, layer 2120 of the neural network may include a downsampling layer, and a rectified linear unit (ReLU) and/or a batch normalizer may be provided ahead of the downsampling in the processing order of the input to this layer.

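[0320] The rectified linear unit applies a rectification to the entries P_ij of a matrix P to obtain modified entries P'_ij as follows:

\[ P'_{ij} = \max(0,\, P_{ij}) \]

[0321] This ensures that all values in the modified matrix are greater than or equal to 0. This may be necessary or advantageous for certain applications.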

[0322] The batch normalizer first calculates an average value V from the entries P_ij of a matrix P having size M×N:

\[ V = \frac{1}{M \cdot N} \sum_{i=1}^{M} \sum_{j=1}^{N} P_{ij} \]

[0323] Using this average value V, a batch-normalized matrix P' having entries P'_ij is obtained by P'_ij = P_ij - V.

[0324] Neither the calculation performed by the batch normalizer nor the calculation performed by the rectified linear unit changes the number of entries (or the size); both only change the values in the matrix.
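A short sketch of the two units described above, assuming the simple forms given in the text (rectification to non-negative values, and subtraction of the mean V of all entries). The function names `rectify` and `subtract_mean` and the sample matrix are hypothetical.

```python
import numpy as np

def rectify(P):
    """Rectified linear unit: negative entries become 0, others are unchanged."""
    return np.maximum(P, 0)

def subtract_mean(P):
    """Batch normalizer in the simple form described above: P'_ij = P_ij - V,
    where V is the mean over all M x N entries."""
    V = P.mean()
    return P - V

# Hypothetical 2x2 matrix, for illustration only.
P = np.array([[-1.0, 2.0],
              [3.0, 4.0]])
R = rectify(P)          # same shape as P, all entries >= 0
B = subtract_mean(P)    # same shape as P, entries shifted by V = 2.0
```

Both results have the same shape as the input, which is why these units can be placed before or after a downsampling layer without affecting the size constraints.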

[0325] Such units may be placed before or after each downsampling layer, depending on the situation. Specifically, because the downsampling layers reduce the number of entries in the matrix, it may be more appropriate to place the batch normalizer after each downsampling layer in the processing order of the bitstream. This significantly reduces the number of calculations required to obtain V and P'_ij. When convolution is used in the downsampling layer, it may be advantageous to place the rectified linear unit before the convolution: some entries then become zero, which simplifies the multiplications and allows for a smaller matrix.

[0326] However, the present invention is not limited in this respect, and batch normalizers or rectified linear units may be arranged in a different order relative to the downsampling layer.

[0327] Furthermore, each layer does not necessarily have one of these additional units, or other additional units that perform other modifications or calculations may be used. When a neural network processes input, matrix operations such as the convolution described above are applied.

[0328] Here, since matrix operations are performed, it is preferable that the input to the neural network 2100 has a size that is an integer multiple of the product of all the downsampling ratios, so that each downsampling layer can process the input. Continuing the above example, if we assume there are six downsampling layers, each with a downsampling ratio of 2, the input to the neural network should have a size that is an integer multiple of 64 in order to be reliably processed by the neural network. Now, if we consider an input with a size of 540 in at least one dimension, this input cannot be reasonably processed by the neural network because 540 is not an integer multiple of the product of all the downsampling ratios of the downsampling layers of the neural network.
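The size constraint can be checked with simple integer arithmetic. The following is a minimal sketch; the helper names `required_multiple` and `is_processable` are assumptions introduced for illustration.

```python
from math import prod

def required_multiple(ratios):
    """Product of the downsampling ratios of all downsampling layers."""
    return prod(ratios)

def is_processable(size, ratios):
    """True if `size` is an integer multiple of the product of all ratios."""
    return size % required_multiple(ratios) == 0

ratios = [2] * 6                          # six layers, each with ratio 2
multiple = required_multiple(ratios)      # 2**6 = 64
ok_540 = is_processable(540, ratios)      # 540 = 8 * 64 + 28, so not processable
ok_512 = is_processable(512, ratios)      # 512 = 8 * 64
```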

[0329] Therefore, before processing the input in the neural network, resizing or rescaling (these terms can be used interchangeably) is applied to the input so that its size S can be processed reasonably by the neural network.

number

number

[0330] As already mentioned above, several different methods can be used for this resizing. For example, it is possible to enlarge or reduce the size of the input to match an integer multiple of the product of all downsampling ratios of the neural network 2100. Reducing the size can be achieved in different ways, for example, by cropping the input (which essentially involves removing sample values ​​from the input) or by applying interpolation. When interpolation is applied, instead of two adjacent samples (or more), a single new sample value (e.g., the mean) representing these two samples can be used, thereby reducing the overall size of the input by 1. The more samples that are interpolated, the smaller the size of the input can be.

[0331] When increasing the size S of the input, interpolation can be used. In this case, "intermediate" or new samples may be generated by taking the average of two adjacent samples, separating these adjacent samples, and inserting a new sample between them. Alternatively, padding can be used, which involves including additional samples with a specific value in the input to increase its size. This padding can include, for example, padding with zero, or padding with information already available in the input, such as repeating padding or reflection padding as already described above.
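The padding variants can be illustrated with NumPy's `np.pad`. This is a sketch under stated assumptions: the mapping of "repeating padding" to NumPy's `edge` mode and of "reflection padding" to its `reflect` mode, the helper name `pad_to_size`, and the choice to append samples at the end of a one-dimensional input are all illustrative and not taken from the disclosure.

```python
import numpy as np

# Assumed mapping from the padding types named in the text to np.pad modes.
PAD_MODES = {"zero": "constant", "repeat": "edge", "reflect": "reflect"}

def pad_to_size(x, target, method="zero"):
    """Append samples to the end of a 1-D array until it has `target` samples."""
    extra = target - x.shape[0]
    if extra < 0:
        raise ValueError("target must not be smaller than the input size")
    return np.pad(x, (0, extra), mode=PAD_MODES[method])

x = np.array([1, 2, 3, 4])
zero_padded = pad_to_size(x, 6, "zero")        # new samples are 0
repeat_padded = pad_to_size(x, 6, "repeat")    # last sample is repeated
reflect_padded = pad_to_size(x, 6, "reflect")  # samples mirrored about the last one
```

Zero padding adds no information from the picture, whereas the other two modes reuse samples that are already available in the input.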

[0332] The resizing method actually chosen may depend on specific circumstances, such as the intended output size P of the neural network. If this size P has a certain value, it may not be appropriate to reduce the input size to the nearest smaller integer multiple of the product of the neural network's downsampling ratios, but rather it may be appropriate to increase the input size.

[0333] Following the example above where the product of the downsampling ratios was 64, consider an input with size S of 540. This is not an integer multiple of 64, but 512 and 576 are. If the intention is to provide an output of size P=8, it is not appropriate to increase the size to 576. In that case, the size S of the input is rather reduced to 8 × 64 = 512.

[0334] Furthermore, it may be a user choice to increase the input size to avoid information loss, or to reduce the input size during encoding when the encoded picture must be as small as possible. Additionally, when processing a picture, the encoder performing the encoding method may try multiple resizing methods and select the most appropriate one to ensure that high quality can be obtained when decoding the bitstream containing the encoded picture.

[0335] To consider these options, Figure 20 shows a method for encoding a picture according to one embodiment.

[0336] An input, or an input somehow related to this picture (e.g., a pre-processed or otherwise modified input), has a size S (e.g., corresponding to the number of samples in the picture) and is received in step 2210 by the encoder or neural network 2100 in Figure 19. Depending on additional information such as the user's selection of a resizing method, the intended output size P, or other instructions described further below, a resizing method to be used during encoding may be obtained in step 2220. In the next step, using this resizing method, the size S of the input is resized by applying this resizing method.

number

number

number

[0337] In this disclosure, size

number

[0338] In some embodiments, the resizing method may be obtained depending on the input size S and information related to the neural network. This information may include, for example, the downsampling ratio of one or more downsampling layers of the neural network, or a number representing the product of the downsampling ratios of all downsampling layers of the neural network. Furthermore, the information may include the intended output size P of the neural network and the product of one or more downsampling ratios or the downsampling ratios of all downsampling layers.

[0339] This information can be used to determine how the size S should be changed, if it should be changed. For example, suppose the input has a size S = 512. The information provided can indicate that the output needs to have a size P = 8. Furthermore, the product of all downsampling ratios in the downsampling layer could be 64. Multiplying 8 by 64 equals 512, and therefore it can be determined that no change in the input size is necessary when applying resizing. In that case, step 2230 can include the resizing being identical, which means that no change in the input size is applied.

[0340] Instead, consider the case where the input has a size of 540, as exemplified above. If the output P has a size of 8, even though it is theoretically possible to increase or decrease the size of the input, this might result in choosing a resizing method that reduces the size of the input to 512.

[0341] If the intended output size P is not specified, increasing or decreasing the size S (as the first step in selecting the resizing method) may be chosen, for example, to minimize the modifications applied to the original input with size S. This may involve calculating the difference between the input size S and the nearest smaller integer multiple and the nearest larger integer multiple of the product of all downsampling ratios of all downsampling layers of the neural network. This is a function

number

[0342] For example, a value (which represents the difference between the nearest larger integer multiple of the product of the downsampling ratios of all downsampling layers and the input size S)

number

number

[0343] Depending on which of these values ​​C and F is larger, or which of the absolute values ​​|C| and |F| is larger, a resizing method may be selected that involves increasing or decreasing the size S. For example, if F is smaller than C, the input size S is closer to the nearest smaller integer multiple of the product of all the downsampling ratios, and reducing the input size S to this nearest smaller integer multiple results in less modification to the original input in terms of size reduction or increase. The same applies if value C is smaller than value F. In that case, increasing the size to the nearest larger integer multiple of the product of all the downsampling ratios results in less modification to the original input size S.
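The selection between increasing and decreasing the size can be sketched as follows. Since the exact definitions of C and F are not reproduced here, this sketch assumes C is the distance from S up to the nearest larger integer multiple and F is the distance from S down to the nearest smaller integer multiple, both taken as non-negative. The function name `choose_resize` and the tie-breaking rule (increase on a tie) are likewise assumptions.

```python
def choose_resize(S, r):
    """Return (new_size, direction) for input size S and product of ratios r."""
    lower = (S // r) * r          # nearest smaller (or equal) multiple of r
    upper = -(-S // r) * r        # nearest larger (or equal) multiple of r
    F = S - lower                 # assumed definition: distance down to `lower`
    C = upper - S                 # assumed definition: distance up to `upper`
    if F == 0:
        return S, "none"          # S is already a multiple of r
    if F < C:
        return lower, "decrease"  # fewer samples change when reducing the size
    return upper, "increase"      # assumed tie-break: prefer increasing

result_540 = choose_resize(540, 64)   # F = 28, C = 36
result_560 = choose_resize(560, 64)   # F = 48, C = 16
result_512 = choose_resize(512, 64)   # already a multiple of 64
```

When an intended output size P is specified, the target size would instead follow from P multiplied by the product of the downsampling ratios, as described earlier.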

[0344] Furthermore, the intended output size P may be provided in the form of an index indicating an entry in a pre-stored lookup table, such as a LUT, which has multiple entries, each entry indicating a different output size. By providing this instruction, a size P can be selected, from which an appropriate resizing method can be chosen, as already illustrated above.

[0345] As part of obtaining the resize method, the size S of the input is used to size

number

[0346] In addition, one or more instructions (e.g., as part of additional information) may be provided specifying a resizing method to be selected, and based on one of these instructions, a resizing method may be selected instead of 2220.

[0347] Once the resize method is obtained, size S will be used to size

number

[0348] The output can then be binarized to provide a bitstream. Alternatively, further processing may be performed, such as including information about the applied resizing method, for example, one or more instructions regarding a selected resizing method. After including or adding this information, the output and information of the neural network can be binarized to obtain a bitstream. The bitstream can then be transferred, for example, to a decoder, in which the bitstream may be decoded to reconstruct the picture using the provided information in addition to the encoded picture in the bitstream.

[0349] Figure 21 provides further examples regarding instructions on which resizing method to apply.

[0350] Figure 21 shows several ellipses 2310, 2320, 2330, 2340, and 2350. Each of these ellipses constitutes an instruction that may or may not be provided to the encoder to obtain the resizing method in step 2220 of Figure 20. The numbers within these ellipses constitute the instruction value and its corresponding reference sign for ease of explanation. The instruction value can be understood as referring to the value that each instruction may or may not have. Specifically, each instruction may potentially have multiple different values, but it is understood that each instruction can actually take only one of these different values. For example, the first instruction can take either the value 2311 or the value 2312, but cannot take both at the same time.

[0351] In some embodiments, all of these instructions may be provided in the information provided to the encoder, regardless of their actual values. In some embodiments, one or more of these instructions may only be present if the preceding instruction takes a specific value. This will be discussed in more detail below.

[0352] Figure 21 shows the first instruction 2310. This instruction can take two values, for example. A first value 2311 may indicate that a resizing method including padding or cropping of the input should be applied. A further value 2312 may indicate that interpolation should be applied as a resizing method (regardless of whether the size should be increased or decreased in the resizing). Advantageously, the first instruction 2310 may be provided in the form of a flag having a size of 1 bit, where the first value 2311 (e.g., 0) indicates that padding or cropping is used, and the second value (e.g., 1) 2312 indicates that interpolation is used.

[0353] Depending on the actual value of the first instruction 2310, the resizing method can be determined in a way that allows the encoding to proceed by applying the resizing. For example, if the value of the first instruction 2310 indicates (by value 2311) that padding or cropping should be used based on further information such as the input size S and the intended output size P, then whether padding or cropping should be applied can be determined during step 2220 in Figure 20, without this necessarily being signaled in an additional instruction. This is because, given that the input size S is known and the downsampling ratio of the downsampling layer of the neural network is fixed, the intended output size P can only be obtained in one way: by applying padding to increase the input size S, or by applying cropping to decrease the input size.

number

number

[0354] The method by which the input is padded may be arbitrary or may be determined as appropriate by the encoder.

[0355] In one embodiment, if the value of the first instruction 2310 indicates that interpolation should be used, a second instruction 2320 may be provided. This second instruction 2320 may take a first value 2321 indicating that the size S of the input should be increased by using interpolation, and a second value 2322 of the second instruction may indicate that the size of the input should be decreased. Depending on which value this instruction takes, the size of the input may be increased or decreased.

[0356] Similar to the first instruction, the second instruction also has only two options: increasing or decreasing the size S of the input using interpolation. Therefore, it can be advantageously provided in the form of a flag having a size of 1 bit. These two options can be encoded with a single bit, thereby reducing the amount of information.

[0357] Furthermore, if the first instruction 2310 uses its value 2312 to indicate that interpolation should be applied as a resizing method, a third instruction 2330 may be provided. This third instruction is shown here to have multiple values from 2323 to 2326. Each of these values may point to or indicate an interpolation filter to be applied during interpolation (regardless of, or possibly depending on, the value of the second instruction 2320). For example, the third instruction 2330 may have a value provided as an index indicating an entry in a lookup table that may be available to the encoder or encoding method. In this lookup table, each entry may specify an interpolation filter. By using the index, the entry in the lookup table can be identified, and the interpolation filter can accordingly be inferred without explicitly including the interpolation filter or its value in the third instruction 2330. Alternatively, the third instruction 2330 may explicitly specify an interpolation filter by one or more of its values from 2323 to 2326.

[0358] In the other case, where the first instruction 2310 indicates (by its value 2311) that padding or cropping should be used, a fourth instruction 2340 may be provided. This fourth instruction may also take different values, one value 2313 indicating that padding should be used for resizing, and a second value 2314 indicating that cropping should be used. This also specifies whether to increase the size of the input (using padding) or decrease it (using cropping). Similar to the first and second instructions, the fourth instruction may also be provided in the form of a flag having a size of 1 bit, for example, 0 indicating that padding should be used and 1 indicating that cropping should be applied.

[0359] In some embodiments, if the fourth instruction indicates that padding should be applied (value 2313), a fifth instruction may be provided. This fifth instruction 2350 may indicate, based on its values ​​2331-2333, whether zero padding, reflection padding, repeating padding, or other padding methods should be used for padding. Thus, the fourth and fifth instructions specify the amount of padding to be applied during resizing.

[0360] However, the choice of which padding mode is applied may be left open and may not be explicitly indicated in step 2220 of Figure 20; therefore, the fifth instruction may not be present.

[0361] Alternatively, instead of the fifth instruction 2350, the information regarding the padding to be used may be contained within the fourth instruction 2340 itself. Assuming the three exemplary padding methods mentioned above (padding with zeros, reflection padding, and repetition padding), and further taking the option of cropping, this creates four values ​​for the fourth instruction 2340, which can specify whether padding or cropping mode should be applied. This can be encoded into an instruction having a size of 2 bits, thus representing the four values. Thus, this information can also be provided in an instruction having a relatively small size.

[0362] As described above in Figure 21, if the value of the first instruction 2310 indicates that interpolation should be applied, then the second and third instructions may exist. If the value of the first instruction 2310 indicates that padding or cropping should be used instead, then the second and / or third instructions may not exist, thereby further reducing the amount of information. Similarly, if the first instruction 2310 indicates that interpolation should be used, then the fourth and fifth instructions may not exist in order to keep the size small. Alternatively, all of the above instructions may exist anyway. However, since processing the first instruction 2310 makes the information on whether to use interpolation, padding, or cropping in resizing available, the values ​​of each of the other instructions are no longer relevant and can therefore be set to 0 by default or to any other reasonable value.
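The dependency between the instructions can be sketched as a small decision function. Only the 0/1 meaning of the first and fourth flags follows the examples given in the text. The value assignments for the second, third, and fifth instructions, the contents of `FILTERS` and `PADDINGS`, and the function name `resizing_method` are hypothetical and serve only to illustrate the branching.

```python
# Hypothetical tables; the disclosure leaves the concrete entries open.
FILTERS = ["bilinear", "bicubic", "Lanczos3"]
PADDINGS = ["zero", "reflection", "repeating"]

def resizing_method(first, second=0, third=0, fourth=0, fifth=0):
    """Derive a resizing method from up to five instruction values.

    first : 0 = padding or cropping, 1 = interpolation
    second: 0 = increase size, 1 = decrease size (assumed assignment)
    third : index into FILTERS (assumed)
    fourth: 0 = padding, 1 = cropping
    fifth : index into PADDINGS (assumed)
    """
    if first == 1:
        # Only the second and third instructions matter for interpolation.
        direction = "increase" if second == 0 else "decrease"
        return {"method": "interpolation", "direction": direction,
                "filter": FILTERS[third]}
    if fourth == 1:
        return {"method": "cropping"}
    # Only the fifth instruction matters once padding is selected.
    return {"method": "padding", "type": PADDINGS[fifth]}

a = resizing_method(first=1, second=1, third=1)
b = resizing_method(first=0, fourth=1)
c = resizing_method(first=0, fourth=0, fifth=2)
```

Instructions that are irrelevant for the selected branch are simply ignored here, which mirrors the option of setting them to a default value.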

[0363] By processing instructions and potential further information regarding the input size and / or downsampling ratio and / or intended output size P of the neural network's downsampling layer, the encoder can determine the resizing method to be applied in step 2220 of Figure 20.

[0364] The embodiment referenced in Figure 21 may be used in an encoder to obtain a method for resizing in step 2220, but the instructions presented in Figure 21 may also be included in a bitstream containing the output of a neural network. This information is then made available to the decoder, which can use this information to apply appropriate resizing during decoding, as described below, thereby ensuring that a reliable reconstruction of the picture is obtained.

[0365] For instructions, please also refer to Figures 13, 14, and 15, which refer to the corresponding instructions. In this context, the first instruction may be the instruction indicated by methodIdx. The second instruction may be the instruction indicated by SCIdx, and the third instruction may be the instruction indicated by RFIdx. Therefore, everything described above in Figures 13 to 15 also applies to the first, second, and third instructions referenced in Figure 21.

[0366] As shown in Figure 21, the instructions described above are explained as existing in accordance with the values ​​of other instructions. For example, the existence of instruction 2320 was explained as depending on the value of instruction 2310, which is shown as the first instruction.

[0367] Alternatively, the existence of each of the first through fifth instructions independently of the other instructions is also included in this disclosure.

[0368] In this context, naming instructions as the first, second, third, etc., is used solely for the purpose of easier identification of different instructions. Since these can be provided as independent instructions, they can also be referred to simply as "instructions." Furthermore, the numbering of instructions as first, second, etc., is not intended to restrict these instructions to a specific order in which they occur, for example, in a bitstream. Rather, it is simply a naming convention for different instructions to enable easier identification.

[0369] Furthermore, according to some embodiments, (additional) instructions are provided instead of, or in addition to, these first to fifth instructions, which enable the acquisition of a method for resizing from a table.

[0370] This instruction may be, or may contain, an index pointing to an entry in the lookup table. This lookup table LUT may contain multiple entries, each specifying the method of resizing. Entries may exist in the LUT specifying that padding, cropping, or interpolation should be used. In addition, or alternatively, the LUT may include entries specifying the specific type of padding (reflection padding, repeating padding, or zero padding) that each entry should use. Additionally or alternatively, the LUT may include entries specifying that interpolation should be used, entries specifying that interpolation should be used to increase or decrease the size by resizing, and / or entries specifying filters to be used during interpolation.

[0371] For example, a LUT may include four entries for padding/cropping: one specifying cropping, one specifying zero padding, one specifying repeating padding, and one specifying reflection padding. Furthermore, the table may include one or more entries for interpolation to be used to increase the input size by resizing. Each of these entries can specify a different interpolation filter, which may include bilinear, bicubic, Lanczos3, Lanczos5, Lanczos8, and N-tap filters, or any other filter or any other number of different filters.

[0372] In certain embodiments, this may include having six entries (one for each filter) specifying different ways to increase the size by interpolation. Furthermore, the LUT may also provide six entries for decreasing the size by interpolation, each entry specifying the corresponding filter to be used in the interpolation.

[0373] Therefore, the index may be provided to take 16 different values ​​corresponding to 16 different entries in the LUT (four for padding method and cropping, and six each for interpolation to increase size using a specific filter and interpolation to decrease size using a specific filter). The LUT may be available to the encoder so that the encoder can determine the resizing method to be applied depending on the indicated value.
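A lookup table with these 16 entries could be laid out as follows. The ordering of the entries and the names `LUT` and `lookup` are assumptions made for illustration, since the text fixes only the number and kind of entries.

```python
FILTERS = ["bilinear", "bicubic", "Lanczos3", "Lanczos5", "Lanczos8", "N-tap"]

# Assumed ordering: 4 padding/cropping entries, then 6 + 6 interpolation entries.
LUT = (
    [("cropping", None),
     ("padding", "zero"),
     ("padding", "repeating"),
     ("padding", "reflection")]
    + [("interpolation-increase", f) for f in FILTERS]
    + [("interpolation-decrease", f) for f in FILTERS]
)

def lookup(index):
    """Return the (method, detail) pair stored at the given LUT index."""
    return LUT[index]

num_entries = len(LUT)                        # 4 + 6 + 6
bits_needed = (num_entries - 1).bit_length()  # bits for a fixed-length index
```

With 16 entries, a fixed-length index fits into 4 bits, so a single index can replace the separate first to fifth instructions.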

[0374] Instructions containing an index to the LUT, like the other instructions mentioned above, can be given to the encoder, for example, in a bitstream, in addition to or along with the picture to be encoded. Alternatively, the instructions may be derived from user input specifying, for example, a resizing method to be applied by one or more inputs.

[0375] Figure 22 shows a schematic diagram of a neural network 2400 that may be part of a decoder receiving a bitstream representing a picture encoded for decoding. The input to the neural network is shown as 2401 and may be related to the output 2105 of the neural network 2100 shown in Figure 19.

[0376] The general structure of neural network 2400 may be comparable to the structure of neural network 2100 shown in Figure 19. Similar to Figure 19, neural network 2400 may include multiple layers that process the inputs they receive, such as layers 2410, 2420, 2430, and 2440. In this context, input 2401 may be processed by layers, each providing outputs 2402, 2403, and 2404 which are used as inputs for the next layer of the neural network, and finally, after input 2401 has been processed by all layers of neural network 2400, output 2405, which may be a decoded picture, is obtained.

[0377] For this purpose, the neural network 2400 includes an upsampling layer that applies upsampling to the inputs it receives. This can be thought of as the inverse operation of the downsampling applied in the downsampling layer shown in Figure 19, and is usually associated with the upsampling ratio u of the corresponding upsampling layer. Specifically, this upsampling ratio can be a natural number greater than 1 such that when the input, e.g., input 2401, is processed by the upsampling layer 2410 of the neural network 2400, the size increases by the upsampling ratio in at least one of its dimensions. This can be achieved, for example, by applying deconvolution to the input as an inverse transformation to convolution, as illustrated in Figure 19. Upsampling can generally be a property of a layer that performs the transformation of its input. For example, the layer may be a convolutional layer or an activation layer (e.g., consisting of rectified linear units) having the upsampling property. A layer having this property is generally referred to as an upsampling layer in this application.

[0378] The output is obtained by processing the input 2401 with all the upsampling layers of the neural network 2400. The size T of the input 2401 and the size of the intermediate output 2405 provided by the last upsampling layer 2440 are determined by the upsampling applied by each of the upsampling layers.

number

number

number

[0379] The relationship between the size T of input 2401 and the size of the output is:

number

[0380] To illustrate upsampling, please note the following:

[0381] If the input size T is 8 and the neural network 2400 includes 6 upsampling layers, each with an upsampling ratio u=2, then the intermediate output, for example output 2405, is of size T × u^6 = 8 × 2^6 = 512.

[0382] As explained above regarding Figures 19 to 21, during the processing of the input by the encoder, the size S of the input received by the encoder is used.

number

number

[0383] However, even when applying upsampling layers with the same upsampling ratios as the encoder's downsampling layers, the size of the output obtained from the neural network in the decoder corresponds to the product of the size P (equal to T) and the upsampling ratios of all the upsampling layers. Therefore, the output obtained from the neural network 2400 generally has a size that does not necessarily match the size S of the original input to the encoder.

number

number

number

number

number

number

number

number

number

[0384] Therefore, when a picture is processed by the decoder's neural network, it is usually not yet reconstructed. The cascading application of upsampling layers to the input in the decoder makes it impossible to achieve certain target sizes in the output. For example, if the total upsampling ratio of the decoder is K and the input size is T, the size of the decoder's intermediate output may, in one example, be equal to K x T. This means that only output sizes that are multiples of K can be achieved by this decoder neural network. However, if it is desirable to make the output size equal to the encoder's input size S, this may not be possible, especially if S is not a multiple of K. This will result in a potential loss of information (intermediate size).

number

number
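The limitation described above can be made concrete with a short calculation. The helper name `intermediate_size` is an assumption, and the example reuses the figures from the text (six upsampling layers with ratio 2, original input size S = 540).

```python
from math import prod

def intermediate_size(T, up_ratios):
    """Size after all upsampling layers: T times the product of all ratios."""
    return T * prod(up_ratios)

up_ratios = [2] * 6
K = prod(up_ratios)                         # total upsampling ratio, 64
size_from_8 = intermediate_size(8, up_ratios)
size_from_9 = intermediate_size(9, up_ratios)
S = 540                                     # original encoder input size
reachable = S % K == 0                      # can K x T ever equal S?
```

Because 540 lies between the two reachable sizes 512 and 576, an additional resizing step after the last upsampling layer is needed to restore the original size.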

[0385] Therefore, in some embodiments of the present disclosure, after processing an input having size T in at least one dimension using all upsampling layers of the neural network, a resize may be applied to the intermediate output obtained from processing using all upsampling layers of the neural network, the resize being the size of the intermediate output.

number

number

[0386] This intermediate output may be explicitly output by the neural network, or more specifically, by the last layer of the neural network. After obtaining this output, resizing may be applied. Alternatively, resizing may be applied, for example, as part of the last layer of the neural network, while the neural network is still processing the input. Resizing is applied to the size

number

number

number

[0387] On the other hand, size

number

number

number

number

number

number

number

number

[0388] The resizing applied can be done in various ways, similar to encoding, including interpolation, cropping, padding, and increasing or decreasing the size.

number

number
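A minimal sketch of such a decoder-side resizing, restricted to cropping and zero padding of a one-dimensional intermediate output. The function name `resize_to_target` and the choice to remove or append samples at the end are assumptions, and interpolation is omitted for brevity.

```python
import numpy as np

def resize_to_target(y, target):
    """Crop or zero-pad a 1-D intermediate output to exactly `target` samples."""
    n = y.shape[0]
    if n > target:
        return y[:target]                                    # cropping
    if n < target:
        return np.pad(y, (0, target - n), mode="constant")   # zero padding
    return y                                                 # sizes already match

cropped = resize_to_target(np.arange(576), 540)   # encoder padded 540 -> 576
padded = resize_to_target(np.arange(512), 540)    # encoder cropped 540 -> 512
```

Cropping undoes an encoder-side padding without loss, whereas padding after an encoder-side cropping cannot recover the removed samples.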

[0389] In this regard, Figure 23 shows a method 2500 according to one embodiment for decoding a bitstream. In the first step 2510, an input having size T is received, for example, as a bitstream encoded picture or some preprocessed form of this bitstream. In the next step 2520 (the temporal order of which may differ as described below), the applicable resizing method is, for example, the size discussed above.

number

[0390] In the next step 2530 of this method, an input having size T may be processed by a neural network. This involves processing the input sequentially by each of the upsampling layers of the neural network, thereby, in step 2540, size

number

number

number

number

[0391] After obtaining this intermediate output in step 2540, the resizing method determined or obtained in step 2520 is used in step 2550.

number

number

number

number

number

number

number

number

[0392] size

number

number

[0393] As explained above, in step 2520, the resizing method to be applied in step 2550 may be obtained. This can be efficient if information about the resizing method to be selected is encoded or provided within the bitstream. When processing or parsing the bitstream, this information may be obtained when the input is received, from which the resizing method to be applied may be obtained. However, the resizing method is size

number

[0394] As already mentioned above, the resizing method to apply is the size that can be provided as the output size.

number

number

number

number

number

number

number

number

number

[0395] In addition, information regarding which resizing method to apply may already be provided in the bitstream or additional bitstream in the form of one or more instructions.

[0396] In this regard, Figure 24 shows exemplary embodiments of instructions that may be provided to a decoder implementing a decoding method, either as part of the bitstream or in an additional bitstream, to enable obtaining an applicable resizing method. These instructions may be provided in the bitstream by the encoder encoding the picture, thereby ensuring that the decoder applies an appropriate resizing method with the appropriate information when decoding the bitstream to obtain the decoded picture.

[0397] In this regard, much of what has been described in relation to Figure 21 also applies to the one or more instructions provided to the decoder. Specifically, a first instruction 2610 may be provided as part of the bitstream. The value of the first instruction 2610 may indicate whether padding or cropping (value 2611) or interpolation (value 2612) should be used as the resizing method. Depending on which value the first instruction 2610 takes, a second instruction 2620 and a third instruction 2630, or a fourth instruction 2640 and a fifth instruction 2650, may be provided as already described in relation to Figure 21.

[0398] As shown in Figure 24, the above-mentioned instructions are explained as existing based on the values of other instructions. For example, the existence of instruction 2620 was explained as depending on the value of instruction 2610, which is shown as the first instruction.

[0399] Alternatively, the disclosure also includes the possibility that each of the first through fifth instructions may exist independently of the other instructions. In this context, naming them first, second, third, etc., serves solely to make the different instructions easier to identify. Since they may be provided as independent instructions, they may also simply be referred to as "instructions." Furthermore, numbering the instructions as first, second, etc., is not intended to limit them to a specific order in which they occur. Rather, it is simply a naming convention that allows the different instructions to be identified more easily.

[0400] Furthermore, according to some embodiments, (additional) instructions may be provided instead of, or in addition to, these first to fifth instructions, which would enable the acquisition of a method for resizing from a table.

[0401] This instruction may be, or may include, an index pointing to an entry in the lookup table. This lookup table LUT may contain multiple entries, each specifying the method of resizing. Entries may exist in the LUT specifying that padding, cropping, or interpolation should be used. In addition, or alternatively, the LUT may include entries specifying the specific padding (reflection padding, repeating padding, or zero padding) to be used for each entry. Additionally or alternatively, the LUT may include entries specifying that interpolation should be used, entries specifying that interpolation should be used to increase or decrease the size of the intermediate output by resizing, and / or entries specifying the filters to be used during interpolation.

[0402] For example, a LUT may include four entries for padding / cropping: one specifying cropping, one specifying zero padding, one specifying repeating padding, and one specifying reflection padding. Furthermore, the table may include one or more entries for interpolation that is used to increase the size of the intermediate output by resizing. Each of these entries can specify a different interpolation filter, which may include bilinear, bicubic, Lanczos3, Lanczos5, Lanczos8, and N-tap filters, or any other filter or any other number of different filters.

[0403] In certain embodiments, this may include having six entries (one for each filter) that specify different ways of increasing the size of the intermediate output by interpolation. Furthermore, six entries may be provided within the LUT to decrease the size of the intermediate output by interpolation, each entry specifying the corresponding filter to be used in the interpolation.

[0404] Therefore, the index may be provided to take 16 different values corresponding to 16 different entries in the LUT (four for padding and cropping, and six each for interpolation to increase the size using a specific filter and interpolation to decrease the size using a specific filter). The LUT may be available to the decoder so that, depending on the indicated value, the decoder can determine how to resize the image.
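A 16-entry lookup table of the kind described in paragraphs [0401] to [0404] might be organized as in the following sketch. The entry order, the dictionary layout, and the names `build_resize_lut` and `lookup` are assumptions for illustration; the text above fixes only the number and kinds of entries.

```python
# Hypothetical entry order: 4 padding/cropping entries, then 6 filters for
# size increase, then the same 6 filters for size decrease (4 + 6 + 6 = 16).
PAD_CROP_MODES = ["crop", "zero_padding", "repeating_padding", "reflection_padding"]
FILTERS = ["bilinear", "bicubic", "lanczos3", "lanczos5", "lanczos8", "n_tap"]


def build_resize_lut():
    """Build the list of resizing methods, indexed by the signalled value."""
    lut = []
    for mode in PAD_CROP_MODES:
        if mode == "crop":
            lut.append({"method": "crop"})
        else:
            lut.append({"method": "pad", "mode": mode})
    for direction in ("increase", "decrease"):
        for name in FILTERS:
            lut.append({"method": "interpolate",
                        "direction": direction,
                        "filter": name})
    return lut


def lookup(lut, index):
    """Return the resizing method selected by an index parsed from the bitstream."""
    if not 0 <= index < len(lut):
        raise ValueError(f"index {index} is outside the LUT")
    return lut[index]


if __name__ == "__main__":
    lut = build_resize_lut()
    print(len(lut))
    print(lookup(lut, 4))
```

Because the encoder and decoder would both hold the same table, a single index is enough to signal the method, the direction, and the filter together.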

[0405] Instructions containing an index to the LUT, like the other instructions mentioned above, may be provided to the decoder, for example, in addition to the bitstream encoding the picture, or as part of the bitstream encoding the picture.

[0406] As mentioned above, one or more of these instructions and / or additional information, for example, the intended size

number

[0407] In this regard, it should be noted that the information provided in one or more instructions to the decoder may be identical to the information provided in one or more instructions to the encoder according to Figure 21. Therefore, in some embodiments, these one or more instructions may be copied to the bitstream by the encoder. This informs the decoder of which operations the encoder has applied. If the encoder applies cropping to the input before processing the downsampling layer of the neural network, the decoder will have a size

number

number

[0408] With this in mind, in one embodiment, the instructions shown or described in relation to Figure 24 indicate the opposite or inverse of the operations applied by the encoder when encoding the picture. Accordingly, when the encoder encodes the picture and provides instructions in the bitstream, these instructions may be obtained by inverting the instructions described with respect to Figure 21, for example, by inverting the values of flags insofar as they relate to whether increasing or decreasing the size should be used.

[0409] Figure 25 shows an encoder 2700 for encoding a picture. The encoder includes one or more processors 2701 adapted to implement a neural network and a transmitter 2702 for outputting a bitstream, the neural network including multiple layers, each including at least one downsampling layer adapted to apply downsampling to the input in the order in which the picture passes through the neural network. The encoder 2700 and in particular one or more processors 2701 are,

number

[0410] Furthermore, the encoder may include a receiver 2703 for receiving a picture or data associated with a picture.

[0411] Figure 26 shows an embodiment of a decoder 2800 for decoding a bitstream representing a picture, the decoder 2800 comprising a receiver 2801 for receiving a bitstream and one or more processors 2802 configured to implement a neural network, the neural network comprising a plurality of layers, each having at least one upsampling layer adapted to apply upsampling to the input in the order in which the bitstream passes through the neural network, and a transmitter 2803 for outputting the decoded picture, the decoder

number

[0412] The encoder embodiment shown in Figure 25 and the decoder embodiment shown in Figure 26 are intended to be adapted to implement all embodiments mentioned above with respect to picture encoding (for the encoder) or bitstream decoding (for the decoder), specifically the embodiments described in Figures 19 to 24.

[0413] The encoder and decoder shown in Figures 25 and 26 can be implemented in any technically reasonable manner. The encoder and / or decoder may be implemented using hardware and software components running on the hardware, the software components performing the functions described above. Alternatively, dedicated hardware may be provided to implement specific functions. Similarly, the encoder and / or decoder may be implemented using virtual devices, including virtual processors.

[0414] For primary and secondary color components, the code streams can be analyzed independently and reconstructed using modules consisting of the same sequence of neural network layers, differing only in the size of the input tensor and the number of tensor channels.

[0415] Decoded hyperprior tensor

number

[0416] The hyperdecoder generates explicit predictions that are input to the multi-stage context model (MCM), which is a multi-stage neural network process that takes the reconstructed residuals as input.

number

number

number

[0417] For the purposes of this specification, the following terms and definitions apply.

[0418] Padding layer

[0419] The padding layer is denoted Padd(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the tensor, s_d is the stride of the progressive convolution, and d is the depth of the convolutional layer in the deep learnable encoder. The padding layer takes a tensor of size [C, h_{d-1}, w_{d-1}] and outputs a tensor of size [C, s_d·h_d, s_d·w_d], where h_d = ceil(h_{d-1} / s_d); w_d = ceil(w_{d-1} / s_d), h_0 = H_in, w_0 = W_in. By default, padding is done by replication. Different padding modes can be specified (for example, padding by zeros).
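The size rule of the padding layer can be sketched with NumPy as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `padd`, the choice to pad only at the bottom and right edges, and the use of `numpy.pad` are assumptions. Replication corresponds to `mode="edge"` and zero padding to `mode="constant"`.

```python
import numpy as np


def padd(x: np.ndarray, s: int, mode: str = "edge") -> np.ndarray:
    """Pad a [C, h, w] tensor to [C, s*ceil(h/s), s*ceil(w/s)].

    Assumption: the extra rows and columns are appended at the bottom
    and right edges; the source does not state where padding is placed.
    """
    _, h, w = x.shape
    h_d = -(-h // s)  # ceil(h / s) using integer arithmetic
    w_d = -(-w // s)  # ceil(w / s)
    return np.pad(x, ((0, 0), (0, s * h_d - h), (0, s * w_d - w)), mode=mode)


if __name__ == "__main__":
    x = np.arange(15).reshape(1, 3, 5)
    print(padd(x, 2).shape)                   # spatial size becomes a multiple of 2
    print(padd(x, 2, mode="constant")[0, 3])  # zero padding fills the new row with 0
```

After this step a stride-s layer produces exactly h_d by w_d outputs, so no fractional size ever arises.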

[0420] Cropping layer

[0421] The cropping layer is denoted Crop(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the tensor output by the composite transformation, s_d is the stride of the progressive transposed convolution, and d is the depth of the convolutional layer in the deep learning-enabled reconstruction process. The cropping layer receives a tensor of size [C, s_d·h_d, s_d·w_d] and outputs a tensor of size [C, h_{d-1}, w_{d-1}], where h_d = ceil(h_{d-1} / s_d); w_d = ceil(w_{d-1} / s_d), h_0 = H_in, w_0 = W_in. Cropping is done by discarding redundant elements.
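The recursion h_d = ceil(h_{d-1} / s_d) and the corresponding crop can be sketched as below. The helper names `size_at_depth` and `crop_layer`, the use of one common stride s for every depth, and the choice to discard the trailing (bottom and right) elements are assumptions for illustration only.

```python
import math

import numpy as np


def size_at_depth(n_in: int, d: int, s: int = 2) -> int:
    """Apply n -> ceil(n / s) d times, starting from n_in at depth 0."""
    n = n_in
    for _ in range(d):
        n = math.ceil(n / s)
    return n


def crop_layer(x: np.ndarray, H_in: int, W_in: int, d: int, s: int = 2) -> np.ndarray:
    """Crop a [C, s*h_d, s*w_d] tensor to [C, h_{d-1}, w_{d-1}].

    Assumption: redundant elements are the trailing rows and columns.
    """
    h_prev = size_at_depth(H_in, d - 1, s)
    w_prev = size_at_depth(W_in, d - 1, s)
    return x[:, :h_prev, :w_prev]


if __name__ == "__main__":
    # Heights for a 1080-row picture with stride 2 at depths 0..5.
    print([size_at_depth(1080, d) for d in range(6)])
    up = np.zeros((1, 2 * 68, 2 * 120))  # output of a stride-2 upsampling at depth 4
    print(crop_layer(up, 1080, 1920, d=4).shape)
```

For H_in = 1080, the depth-4 height is 68, so a stride-2 upsampling yields 136 rows, one more than the 135 rows at depth 3; the cropping layer removes that extra row.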

[0422] Convolutional layer

[0423] The 2D convolution is

Number

[0424] Transposed convolution

[0425] The 2D transposed convolution is,

number

[0426] Quantized convolution

[0427] Two-dimensional quantized convolution is,

number

[0428] Quantized transposed convolution

[0429] The two-dimensional quantized transposed convolution is,

number

[0430] Pixel shuffle layer

[0431] The pixel shuffle layer, also known as subpixel convolution, is denoted as PixelShuffle(s), where s > 1 is the upscaling factor. This layer rearranges the elements of an input tensor of shape [C_in, h_in, w_in] into an output tensor of shape [C_out, h_out, w_out], where h_out = s·h_in; w_out = s·w_in; C_out = C_in / s^2.
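The shape relation of the pixel shuffle layer can be checked with a small NumPy sketch. The element ordering used here (each group of s^2 consecutive channels filling one s x s block in raster order) is an assumption for illustration; the text above specifies only the input and output shapes.

```python
import numpy as np


def pixel_shuffle(x: np.ndarray, s: int) -> np.ndarray:
    """Rearrange [C_in, h, w] into [C_in / s^2, s*h, s*w]."""
    c_in, h, w = x.shape
    if c_in % (s * s) != 0:
        raise ValueError("C_in must be divisible by s^2")
    c_out = c_in // (s * s)
    x = x.reshape(c_out, s, s, h, w)  # split channels into an s x s block index
    x = x.transpose(0, 3, 1, 4, 2)    # reorder to [c_out, h, s, w, s]
    return x.reshape(c_out, h * s, w * s)


if __name__ == "__main__":
    x = np.arange(4).reshape(4, 1, 1)
    print(pixel_shuffle(x, 2))
```

No element is created or discarded; only the layout changes, which is why C_in must be divisible by s^2.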

[0432] Residual activation unit

[0433] The residual activation unit is denoted ResAU(K_ver × K_hor). This layer receives a tensor of size [C, h_k, w_k], performs the sequence of steps shown in Figure 27A, and then outputs a tensor of the same size. Here,

number

number

[0434] Residual activation: the residual activation is denoted ResA(K_ver × K_hor). This layer receives a tensor of size [C, h_k, w_k], performs the sequence of steps shown in Figure 27B, and then outputs a tensor of the same size. Here,

number

[0435] Residual nonlocal attention block

[0436] Residual nonlocal attention block is,

number

number

number

number

number

[0437] Residual block

[0438] The residual block is denoted RB. This layer receives a tensor of size [C, h_k, w_k], performs the sequence of steps shown in Figure 27D, and then outputs a tensor of the same size.

[0439] Lightweight residual block

[0440] The lightweight residual block is denoted LRB. This layer receives a tensor of size [C, h_k, w_k], performs the sequence of steps shown in Figure 27E, and then outputs a tensor of the same size.

[0441] Normalized linear unit (rectified linear unit, ReLU)

[0442] The normalized linear unit is,

number

number

[0443] Leaky normalized linear unit (leaky ReLU)

[0444] Leaky normalized linear units are

number

number

[0445] opIdx is an identifier for the operating point, where 0 means the "base" operating point and 1 means the "high" operating point.

[0446] ABS calculation

[0447] ABS calculations are

number

number

[0448] Skip Model Process

[0449] This is also called the SKIP mode decoder process, the SKIP process, or the decoder-side SKIP operation. In the decoder, the inputs to the skip mode process are:
- 1D sequences [num_res_elements] for forming "stream-y" after decoding by me-tANS
- mask_skip[num_skip_params, C, h4, w4]

[0450] The output of this process is,

number

[0451] The output of the reversible decoding process is a 1D array {s_k} whose size is equal to the total number of "1"s in the maskAggregate[C,h4,w4] tensor.

[0452] In other words, the maskAggregate[C,h4,w4] tensor is a residual tensor.

number
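The relation between the 1D array {s_k} and the maskAggregate tensor stated in paragraph [0451] can be illustrated as follows. This is only a sketch under assumptions: it assumes that positions holding a "1" receive the decoded elements in raster order and that all other positions are set to zero; neither the scan order nor the fill value is confirmed by the text, and the name `scatter_residuals` is hypothetical.

```python
import numpy as np


def scatter_residuals(s: np.ndarray, mask_aggregate: np.ndarray) -> np.ndarray:
    """Place the decoded 1D array s into a [C, h4, w4] tensor.

    Assumptions (not from the source): mask value 1 marks a decoded
    position, elements are written in raster order, others stay 0.
    """
    if s.size != int(mask_aggregate.sum()):
        raise ValueError("len(s) must equal the number of 1s in the mask")
    out = np.zeros(mask_aggregate.shape, dtype=s.dtype)
    out[mask_aggregate == 1] = s  # boolean indexing fills in raster order
    return out


if __name__ == "__main__":
    mask = np.array([[[1, 0], [0, 1]]])
    print(scatter_residuals(np.array([5, 7]), mask))
```

The size check mirrors the statement that the length of {s_k} equals the number of "1"s in the mask.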

[0453] The residual skip mode process in the decoder is as follows:

number

[0454] In one embodiment of the present invention, tensor boundary processing is used in multi-stage context modeling (MCM) to reduce the amount of data processed in a neural network and improve coding efficiency. When tensor sizes are not uniform and a downsampling convolution, or any other type of layer that reduces the size by half, is expected in the NN structure, a padding layer must precede the downsampling layer and a cropping layer must follow the deconvolution / upsampling layer. The present invention adds a padding layer in the MCM structure before the downsampling layer or a layer that functions the same as a downsampling layer, and adds a cropping layer in the MCM structure after the upsampling layer or a layer that functions the same as an upsampling layer, thereby avoiding non-integer sized tensors in any processing step, avoiding loss of device interoperability (since fractional sized tensors are undefined and can therefore be treated differently by different devices and processors), reducing the amount of data processed in the neural network (and memory usage as well), and improving coding efficiency.

[0455] One embodiment of the present invention discloses a method for processing images using a neural network (NN), shown in Figure 28, wherein the NN includes a multi-stage context model (MCM), and the MCM comprises a plurality of MCM_k models, a first down-shuffle layer, a second down-shuffle layer, and an up-shuffle layer, with a first padding layer before the first down-shuffle layer, a second padding layer before the second down-shuffle layer, and a cropping layer after the up-shuffle layer. Method 2800 includes the following steps:
2810. A step of obtaining a first tensor, wherein the first tensor is the output of the skip model process.
2820. A step of obtaining a second tensor, wherein the second tensor is the output of the hyperdecoder.
2830. A step of padding the first tensor using the first padding layer.
2840. A step of obtaining a reshuffled first tensor by down-shuffling the padded first tensor based on the first down-shuffle layer.
2850. A step of padding the second tensor using the second padding layer.
2860. A step of obtaining a reshuffled second tensor by down-shuffling the padded second tensor based on the second down-shuffle layer.
2870. A step of processing the reshuffled first tensor and the reshuffled second tensor based on the plurality of MCM_k models to obtain a latent space tensor. An MCM_k model is also called a stage or an MCM_k stage; the number of stages may be equal to 8, but is not limited to 8; the number of stages may be equal to 6 or less, or 10 or more.
2880. A step of obtaining a reshuffled latent space tensor by up-shuffling the latent space tensor based on the up-shuffle layer.
2890. A step of cropping the reshuffled latent space tensor based on the cropping layer to obtain a reconstructed latent tensor.

[0456] Furthermore, the exact design of the MCM structure may change in the future, and the MCM structure may include more or fewer stages, more or fewer down-shuffle or up-shuffle layers, and the MCM structure may include several other layers for changing the size or shape of the tensor. However, regardless of how the MCM structure changes, it can be understood that there must be a padding layer before the downsampling / down-shuffle layer, or a layer with the same function as the downsampling layer, and a cropping layer after the upsampling / up-shuffle layer, or a layer with the same function as the upsampling layer. Thus, non-integer sized tensors can be avoided at any processing step, loss of device interoperability can be avoided (since fractional sized tensors are undefined and can therefore be treated differently by different devices and processors), the amount of data processed in the neural network (and memory usage as well) can be reduced, and coding efficiency can be improved.

[0457] In one embodiment, the number of tensor slice inputs in the first down-shuffle layer is 2, and the number of tensor slice inputs in the second down-shuffle layer is 1.

[0458] In one embodiment, the number of inputs to the tensor slice in the up-shuffle layer is 2.

[0459] In one embodiment, the first padding layer is denoted Padd(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the first tensor, s_d is the stride of the progressive convolution, and d is the depth of the convolutional layer in the deep-learnable encoder.

[0460] In one embodiment, the cropping layer is denoted Crop(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the tensor output to the composite transformation, s_d is the stride of the progressive transposed convolution, and d is the depth of the convolutional layer in the deep-learnable reconstruction process.

[0461] In one embodiment, the first padding layer has a stride of 2 and a depth of 5.

[0462] In one embodiment, the second padding layer has a stride of 2 and a depth of 5.

[0463] In one embodiment, the cropping layer has a stride of 2 and a depth of 5.

[0464] In one embodiment, the first tensor is a reconstructed residual tensor, and the second tensor is an explicit prediction tensor.

[0465] One embodiment of the present invention discloses a neural network (NN), wherein the NN includes a multi-stage context model (MCM), and the MCM includes a plurality of MCM_k models, a first down-shuffle layer, a second down-shuffle layer, and an up-shuffle layer, with a first padding layer preceding the first down-shuffle layer, a second padding layer preceding the second down-shuffle layer, and a cropping layer following the up-shuffle layer, where the first and second down-shuffle layers are used to change the size or shape of the tensor, and the up-shuffle layer is used to change the size or shape of the tensor.

[0466] In one embodiment, the number of tensor slice inputs in the first down-shuffle layer is 2, and the number of tensor slice inputs in the second down-shuffle layer is 1.

[0467] In one embodiment, the number of inputs to the tensor slice in the up-shuffle layer is 2.

[0468] In one embodiment, the first padding layer is denoted Padd(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the tensor, s_d is the stride of the progressive convolution, and d is the depth of the convolutional layer in the deep-learnable encoder.

[0469] In one embodiment, the cropping layer is denoted Crop(H_in, W_in, d, s_d), where H_in, W_in are the height and width of the tensor output to the composite transformation, s_d is the stride of the progressive transposed convolution, and d is the depth of the convolutional layer in the deep-learnable reconstruction process.

[0470] In one embodiment, the first padding layer receives a tensor having a first size and outputs a tensor having a second size.

[0471] In one embodiment, the first padding layer and the second padding layer perform padding by replication.

[0472] In one embodiment, the MCM_k model uses as input the output tensors of the preceding MCM models, that is, the models with indices 0 to k-1.

[0473] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a multi-stage context model (MCM), the MCM comprising a plurality of MCM_k models and further comprising a first down-shuffle layer, a second down-shuffle layer, and an up-shuffle layer, wherein the first down-shuffle layer is preceded by a first padding layer, the second down-shuffle layer is preceded by a second padding layer, and the up-shuffle layer is followed by a cropping layer, the encoder further comprises a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any of the embodiments described above.

[0474] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a multi-stage context model (MCM), the MCM comprising a plurality of MCM_k models and further comprising a first down-shuffle layer, a second down-shuffle layer, and an up-shuffle layer, wherein the first down-shuffle layer is preceded by a first padding layer, the second down-shuffle layer is preceded by a second padding layer, and the up-shuffle layer is followed by a cropping layer, the decoder further comprises a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any of the embodiments described above.

[0475] In one embodiment, the MCM process and some related terms are described in detail below.

Multistage Context Modeling (MCM)

[0476] The input to this MCM process is,

number

[0477] The output of this MCM process is:

number

[0478] The MCM process consists of the following steps:

number

[0479] Figure 29 shows an example of a multi-stage context modeling process. The MCM process is an iterative process. Later stages use elements of the previously obtained output tensor as input.

[0480] Another example of a multi-stage context modeling process is shown in Figure 29A.

[0481] In this example, the multi-stage MCM includes four MCM_k models and further includes one down-shuffle layer and one up-shuffle layer, with a padding layer before the down-shuffle layer and a cropping layer after the up-shuffle layer.

[0482] This multi-stage MCM can be used to carry out an image processing method, the method including:
- a step of obtaining an input tensor, wherein the input tensor may be the output of a residual decoder, and the input tensor may be called a reconstructed residual tensor;
- a step of padding a first tensor with the padding layer before the down-shuffle layer, wherein the first tensor is either the input tensor or a tensor obtained by processing the input tensor; and
- a step of cropping a second tensor using the cropping layer after the second tensor is output from the up-shuffle layer.

[0483] Based on Figure 29A, it can be understood that the first tensor is padded and down-shuffled to obtain the reshuffled first tensor, the reshuffled first tensor is then input to the four MCM_k models, and the output of the four MCM_k models is up-shuffled by the up-shuffle layer to output the second tensor. In other words, the reshuffled first tensor is one input to the four MCM_k models, the output of the four MCM_k models is the input to the up-shuffle layer, and the second tensor is the output of the up-shuffle layer. The four MCM_k models further have other inputs, which are tensor outputs from the hyperdecoder.

[0484] In a particular embodiment, the input to this process is

number

[0485] The output of this process is,

number

[0486] This method consists of the following process:

number

[0487] This is a recursive process in which subsequent stages use previously obtained elements of the output tensor as input. The data flow is shown by following the arrows, with MCM stages 1-3 using the output of the previous stage.

[0488] The multi-stage context modeling process in Figures 29 and 29A is flexible in many respects, and it can be understood that certain components of the embodiments / architectures shown in Figures 29 and 29A can be interchangeable. For example, there can be eight stages as in Figure 29, or there can be four stages. Also, the number of stages is not limited; there may be six or fewer stages, or there may be ten or more stages. The number of down-shuffle layers or up-shuffle layers is not limited; Figure 29 has two down-shuffle layers and a padding layer, or there may be only one down-shuffle layer and one padding layer. The MCM structure may include more or fewer down-shuffle layers or up-shuffle layers, and the MCM structure may include several other layers for changing the size or shape of the tensor.

[0489] Down shuffle operation

[0490] The inputs to this process are:
- M: the number of tensor slices
- a[MC,h4,w4]: a 3D tensor

[0491] The output of this process is,

number

[0492] In a down-shuffle operation, the input is first divided into M slices along the channel dimension, and then, for each slice, the elements of the tensor are grouped into four groups: 0, 1, 2, and 3 (note that the group numbers are in zigzag order, not raster order), and these groups are then reshuffled into the channel dimension.

[0493] The down-shuffling operation, like a downsampling convolution with a stride of 2, alters the spatial size of the tensor, so a padding layer (h4,w4,1,2) is placed before this down-shuffling process, as shown in Figure 29. Zero padding is performed. In other words, down-shuffling is used to change the size or shape of a tensor.
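The padding and down-shuffle steps described above can be sketched in NumPy as follows. Several details are assumptions made only for illustration: the mapping of groups 0 to 3 to positions inside each 2 x 2 block (`GROUP_OFFSETS`), the order in which the groups are stacked along the channel dimension, and the placement of the zero padding at the bottom and right edges. The source states only that the group order is zigzag rather than raster.

```python
import numpy as np

# Hypothetical (row, column) offsets of groups 0..3 inside each 2x2 block.
GROUP_OFFSETS = ((0, 0), (1, 1), (0, 1), (1, 0))


def zero_pad_to_even(a: np.ndarray) -> np.ndarray:
    """Zero-pad a [C, h, w] tensor so that h and w are multiples of 2."""
    _, h, w = a.shape
    return np.pad(a, ((0, 0), (0, h % 2), (0, w % 2)), mode="constant")


def down_shuffle(a: np.ndarray, M: int, offsets=GROUP_OFFSETS) -> np.ndarray:
    """Map [M*C, h, w] to [4*M*C, h/2, w/2] without changing any element."""
    mc, h, w = a.shape
    if mc % M or h % 2 or w % 2:
        raise ValueError("channels must split into M slices; h and w must be even")
    out = []
    for sl in np.split(a, M, axis=0):  # M slices along the channel dimension
        groups = [sl[:, dy::2, dx::2] for dy, dx in offsets]
        out.append(np.concatenate(groups, axis=0))  # groups go to channels
    return np.concatenate(out, axis=0)


if __name__ == "__main__":
    a = np.arange(2 * 3 * 5).reshape(2, 3, 5)
    padded = zero_pad_to_even(a)
    print(padded.shape, down_shuffle(padded, M=2).shape)
```

Without the padding step, an input with odd height or width could not be split into complete 2 x 2 blocks, which is the situation the padding layer is meant to prevent.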

[0494] Up-shuffle operation

[0495] This process is the reverse of the down-shuffle operation.

[0496] The input to this process is,

number

[0497] The output of this process is:
- a reshuffled 3D tensor a[MC,h4,w4] with the same elements.

[0498] In the up-shuffle operation, the input is first divided into M slices along the channel dimension, and for each slice, the elements of the tensor are reshuffled in zigzag order, so that the number of channels is reduced by a factor of 4 and each spatial dimension is increased by a factor of 2.

[0499] The up-shuffle operation, like the inverse convolution with stride 2, changes the spatial size of the tensor, so this up-shuffle process is followed by a cropping layer (h4,w4,1,2). In other words, up-shuffling is used to change the size or shape of a tensor.
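The up-shuffle step and the cropping that follows it can be sketched in the same way. As before, the group-to-position mapping `GROUP_OFFSETS`, the channel layout, and the choice to crop trailing rows and columns are assumptions for illustration, not details confirmed by the text.

```python
import numpy as np

# Hypothetical (row, column) offsets of groups 0..3 inside each 2x2 block.
GROUP_OFFSETS = ((0, 0), (1, 1), (0, 1), (1, 0))


def up_shuffle(b: np.ndarray, M: int, offsets=GROUP_OFFSETS) -> np.ndarray:
    """Map [4*M*C, h, w] to [M*C, 2*h, 2*w] without changing any element."""
    c4, h, w = b.shape
    if c4 % (4 * M):
        raise ValueError("channel count must be divisible by 4*M")
    out = []
    for sl in np.split(b, M, axis=0):  # M slices along the channel dimension
        c = sl.shape[0] // 4
        full = np.empty((c, 2 * h, 2 * w), dtype=b.dtype)
        for g, (dy, dx) in enumerate(offsets):
            full[:, dy::2, dx::2] = sl[g * c:(g + 1) * c]  # group g back to its positions
        out.append(full)
    return np.concatenate(out, axis=0)


def crop(x: np.ndarray, h4: int, w4: int) -> np.ndarray:
    """Discard trailing rows and columns so the result is [C, h4, w4]."""
    return x[:, :h4, :w4]


if __name__ == "__main__":
    b = np.arange(4).reshape(4, 1, 1)
    up = up_shuffle(b, M=1)
    print(up)
    print(crop(up, 1, 2))
```

The crop restores the size the tensor had before any padding was added, so the padded elements never reach later processing.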

[0500] ChannelNet

[0501] The input to this process is,

number

[0502] The output of this process is

number

[0503] The process of the channel net is shown in Figure 30. First, the input proceeds to an up-shuffle process, followed by a set of three 3x3 convolutions with stride 1, where the first convolution doubles the number of channels.

number

[0504] Stage 0 of Multistage Context Modeling

[0505] The input to this process is,

number

[0506] The output of this process is,

number

[0507] The process is as follows:

number

[0508] Stage 1 of Multistage Context Modeling

[0509] The input to this process is,

number

[0510] The output of this process is,

number

[0511] The process is as follows:

number

[0512] Stage 2 of Multistage Context Modeling

[0513] The input to this process is,

number

[0514] The output of this process is,

number

[0515] The process is as follows:

number

[0516] Stage 3 of Multistage Context Modeling

[0517] The input to this process is,

number

[0518] The output of this process is,

number

[0519] The process is as follows:

number

[0520] Stage 4 of Multistage Context Modeling

[0521] The input to this process is,

number

[0522] The output of this process is,

number

[0523] The process is as follows:

number

[0524] Stage 5 of Multistage Context Modeling

[0525] The input to this process is,

number

[0526] The output of this process is,

number

[0527] The process is as follows:

number

[0528] Stage 6 of Multistage Context Modeling

[0529] The input to this process is,

number

[0530] The output of this process is,

number

[0531] The process is as follows:

number

[0532] Stage 7 of Multistage Context Modeling

[0533] The input to this process is,

number

[0534] The output of this process is,

number

[0535] The process is as follows:

number

[0536] MCM design elements

[0537] The three types of operations that change the size or shape of a tensor are 1) down-shuffle, 2) up-shuffle, and 3) per-channel padding. The two subnetworks are the channel network and the predictive fusion network.

[0538] In one embodiment of the present invention, tensor boundary processing is used in a hyperscale decoder to reduce the amount of data processed in a neural network and improve coding efficiency. Tensor boundary processing means adding a padding layer before a downsampling / downshuffle layer, or a layer with the same function as a downsampling / downshuffle layer, and adding a cropping layer after an upsampling / upshuffle layer, or a layer with the same function as an upsampling / upshuffle layer. This ensures that the tensor size is an integer at every step, and thus uncertainty is avoided (uncertain processes cause platform dependency and eliminate device interoperability). The hyperscale decoder outputs parameters for the entropy decoder, so it must have bit-accurate behavior; otherwise, what is parsed from the bitstream cannot be correctly interpreted. The convolution has a parameter s called the "stride". When the stride is 1, the height and width of the input tensor and output tensor are the same. However, if s > 1, h_out = h_in / s and w_out = w_in / s. Assuming h_in is 33 and s = 2, the height of the output tensor is h_out = 33 / 2 = 16.5. The tensor height is a number of elements and must be an integer. Keeping the specification equation h_out = h_in / s, which yields a fraction, introduces uncertainty: some implementations take h_out as 16 (rounding 16.5 down) and others as 17 (rounding up). This is problematic, because different devices then perform different operations, and decoders will crash when decoding streams coming from other devices (lack of interoperability between devices).
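The arithmetic in the example above (h_in = 33, s = 2) can be worked through in a few lines. The function names are hypothetical; the sketch only shows that padding to a multiple of the stride first makes the division exact, so every implementation arrives at the same integer.

```python
import math


def padded_size(h_in: int, s: int) -> int:
    """Smallest multiple of s that is not less than h_in."""
    return s * math.ceil(h_in / s)


def downsampled_size(h_in: int, s: int) -> int:
    """Output size of a stride-s layer applied after padding (exact division)."""
    return padded_size(h_in, s) // s


if __name__ == "__main__":
    h_in, s = 33, 2
    print(h_in / s)                   # fractional size without padding
    print(padded_size(h_in, s))       # size after the padding layer
    print(downsampled_size(h_in, s))  # unambiguous integer output size
```

With the padding layer in place, 33 rows are padded to 34 and the stride-2 layer outputs 17 rows on every device, instead of 16 on some and 17 on others.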

[0539] One embodiment of the present invention discloses a method for processing an image using a neural network (NN) as shown in Figure 31, wherein the NN includes a hyperscale decoder, the hyperscale decoder includes a base operating point and a high operating point, and with respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer. For high operating points, the hyperscale decoder includes two quantized convolutional layers, each followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then includes three quantized convolutional layers, with two of the three quantized convolutional layers followed by a normalized linear unit, and method 3100 is, 3110. Steps to obtain the input tensor, 3120. Steps to obtain the size of the input tensor, 3130. Steps to obtain the operating point indicator, 3140. A step of deciding to process the input tensor using the base operating point or high operating point based on the operating point indicator, 3150. Includes the step of outputting the processed tensor.

[0540] A hyperscale decoder may include two processing pipelines, one of which is the base operating point and the other the high operating point. In some other embodiments, the base operating point may also be referred to as the base profile, baseline, base pipeline, base channel, base subnetwork, or some other name, and correspondingly, the high operating point may also be referred to as the high profile, high line, high pipeline, high channel, high subnetwork, or some other name.

[0541] Furthermore, the exact design of the hyperscale decoder may change in the future, and it may include more or fewer quantized transposed convolutional layers, and it may include some other layers to change the size or shape of the tensor. However, regardless of how the hyperscale decoder changes, it can be understood that there must be a padding layer before the downsampling / downshuffle layer, or a layer that functions the same as the downsampling layer, and a cropping layer after the upsampling / upshuffle layer, or a layer that functions the same as the upsampling layer.

[0542] In one embodiment, both the first quantized transposed convolutional layer and the second quantized transposed convolutional layer have a kernel size of 4 × 4 and a stride of 2.

[0543] In one embodiment, the first cropping layer has a stride of 2 and a depth of 6.

[0544] In one embodiment, the second cropping layer has a stride of 2 and a depth of 5.

[0545] In one embodiment, at the high operating point, the cropping layer following the first of the two quantized convolutional layers has a stride of 2 and a depth of 6, and the cropping layer following the second of the two quantized convolutional layers has a stride of 2 and a depth of 5.

[0546] In one embodiment, the processed tensor is a hyperscale decoder standard deviation tensor.

[0547] In one embodiment, when the operating point indicator is equal to 0, the input tensor is processed using the base operating point.

[0548] In one embodiment, when the operating point indicator is equal to 1, the high operating point is used to process the input tensor.
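The selection by operating point indicator can be sketched as a simple dispatch. This is a minimal illustration, not the disclosed implementation: the two pipeline functions are hypothetical placeholders that return their input unchanged, and the tensor is modelled as nested Python lists.

```python
from typing import Callable, Dict, List, Tuple

Tensor = List[List[List[int]]]  # [channels][height][width], integers only


def base_pipeline(x: Tensor) -> Tensor:
    # Hypothetical placeholder for the base operating point layers.
    return x


def high_pipeline(x: Tensor) -> Tensor:
    # Hypothetical placeholder for the high operating point layers.
    return x


PIPELINES: Dict[int, Callable[[Tensor], Tensor]] = {
    0: base_pipeline,  # operating point indicator equal to 0
    1: high_pipeline,  # operating point indicator equal to 1
}


def run_hyperscale_decoder(x: Tensor, op_idx: int) -> Tuple[Tuple[int, int, int], Tensor]:
    """Obtain the tensor size, pick a pipeline from op_idx, return the output."""
    size = (len(x), len(x[0]), len(x[0][0]))
    if op_idx not in PIPELINES:
        raise ValueError(f"unsupported operating point indicator: {op_idx}")
    return size, PIPELINES[op_idx](x)


size, out = run_hyperscale_decoder([[[1, 2], [3, 4]]], op_idx=0)
print(size)
```

Rejecting unknown indicator values explicitly, rather than falling back to a default pipeline, keeps the behavior identical across implementations.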

[0549] In one embodiment, the first quantized convolutional layer has a kernel size of 3 × 3.

[0550] In one embodiment, the pixel shuffle layer is configured to change the number of channels from 4C to C.
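The change from 4C channels to C channels can be illustrated with a plain Python pixel shuffle of stride 2. The channel ordering used here (source channel c·r·r + (i mod r)·r + (j mod r)) is an assumption taken from the common depth-to-space convention; the text does not specify the ordering.

```python
from typing import List

Tensor = List[List[List[int]]]  # [channels][height][width]


def pixel_shuffle(x: Tensor, r: int) -> Tensor:
    """Rearrange a [C*r*r, H, W] tensor into [C, H*r, W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Assumed ordering of the r*r source channels per output channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out


# Four channels (C = 1) of size 1x2 become one channel of size 2x4.
x = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
y = pixel_shuffle(x, 2)
print(len(y), len(y[0]), len(y[0][0]))  # channels, height, width
```

The spatial size doubles in each dimension while the channel count is divided by 4, which is why a cropping layer follows the pixel shuffle in the pipelines described here.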

[0551] One embodiment of the present invention discloses a neural network (NN) which includes a hyperscale decoder, the hyperscale decoder includes a base operating point and a high operating point, and with respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer. For high operating points, the hyperscale decoder includes two quantized convolutional layers, each followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by normalized linear units.

[0552] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point. For the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer followed by a second normalized linear unit, a second quantized transposed convolutional layer followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer. For the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then includes three quantized convolutional layers, two of which are followed by a normalized linear unit. The encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of the embodiments described above.

[0553] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point. For the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer followed by a second normalized linear unit, a second quantized transposed convolutional layer followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer. For the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then includes three quantized convolutional layers, two of which are followed by a normalized linear unit. The decoder further includes a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any one of the embodiments described above.

[0554] The hyperscale decoder is further disclosed as follows:

[0555] Hyperscale Decoder

[0556] An example of a hyperscale decoder is shown in Figure 32.

[0557] The input to the hyperscale decoder is:

[Equation not reproduced in the source text]

[0558] The output of the hyperscale decoder is the standard deviation tensor σ [C, h4, w4].

[0559] In the hyperscale decoder, all operations are integer operations, the accumulator in every calculation stays within the 32-bit integer range, and the model parameters are quantized to 8-bit integers. This ensures bit-accurate behavior of this neural network module. The hyperscale decoder uses special types of operations: quantized convolution and quantized transposed convolution. For each quantized convolution in the process, a clipping value {d_k} and a descaling shift parameter {p_k} are specified:

[Equation not reproduced in the source text]

[0560] Note - The magnitude of the weights incorporated in the quantized model does not exceed 2^15 - 1, and the combination of shift and clipping values ensures that the quantized convolution register is 32 bits or less.
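The integer-only behavior described above can be illustrated for a single output sample. The exact formula is not recoverable from the text here, so this sketch makes loud assumptions: descaling is modelled as a rounding arithmetic right shift by p bits, clipping as a symmetric clamp to [-d, d], and the function name and argument layout are hypothetical.

```python
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def quantized_dot(inputs, weights, bias, shift_p, clip_d):
    """One output sample of a hypothetical quantized convolution."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w
        # The accumulator is required to stay inside the 32-bit integer range.
        if not INT32_MIN <= acc <= INT32_MAX:
            raise OverflowError("accumulator left the 32-bit range")
    # Assumption: descale with a rounding arithmetic right shift.
    if shift_p > 0:
        acc = (acc + (1 << (shift_p - 1))) >> shift_p
    # Assumption: clip symmetrically to [-clip_d, clip_d].
    return max(-clip_d, min(clip_d, acc))


wide = quantized_dot([100, -50, 25], [3, 7, -2], bias=8, shift_p=2, clip_d=255)
tight = quantized_dot([100, -50, 25], [3, 7, -2], bias=8, shift_p=2, clip_d=10)
print(wide, tight)
```

Since only integer additions, multiplications, shifts, and comparisons are used, the result does not depend on floating-point rounding modes, which is the property the entropy decoder relies on.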

[0561] Depending on the operating point indicator (opIdx), the hyperscale decoder performs the following sequence of steps:

[0562] In the hyperscale decoder, at the base operating point (opIdx=0), the number of channels is C for all hidden layers. The hyperscale decoder starts with a quantized transposed convolution with a kernel size of 4x4 and a stride of 2, followed by a cropping layer (stride 2, depth 6) and a normalized linear unit. Next, there is a quantized convolution with a kernel size of 3x3 and a stride of 1, followed by a normalized linear unit. Then there is another quantized transposed convolution with a kernel size of 4x4 and a stride of 2, followed by a cropping layer (stride 2, depth 5) and a normalized linear unit, and finally a quantized convolution with a kernel size of 3x3 and a stride of 1.

[0563] For the high operating point (opIdx=1), there are two sets, each consisting of a quantized convolution with kernel size 3x3 and stride 1 that increases the number of channels to 4C, followed by a pixel shuffle (stride 2) that returns the number of channels to C, a cropping layer (stride 2, with depths 6 and 5, respectively), and a normalized linear unit with C channels. Then there are three quantized convolutions with kernel size 3x3 and stride 1, two of which are followed by a normalized linear unit.
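The two layer sequences can be written down as plain data, which makes the boundary rule (every upsampling step is followed by a cropping layer) easy to check mechanically. The layer tags ("qtconv", "qconv", "shuffle", "crop", "act") and the helper names are hypothetical, and placing the two activations after the first two of the final three convolutions is an assumption, since the text does not say which two are meant.

```python
BASE = [
    ("qtconv", {"kernel": 4, "stride": 2}),
    ("crop", {"stride": 2, "depth": 6}),
    ("act", {}),
    ("qconv", {"kernel": 3, "stride": 1}),
    ("act", {}),
    ("qtconv", {"kernel": 4, "stride": 2}),
    ("crop", {"stride": 2, "depth": 5}),
    ("act", {}),
    ("qconv", {"kernel": 3, "stride": 1}),
]

HIGH = [
    ("qconv", {"kernel": 3, "stride": 1}),  # C -> 4C
    ("shuffle", {"stride": 2}),             # 4C -> C
    ("crop", {"stride": 2, "depth": 6}),
    ("act", {}),
    ("qconv", {"kernel": 3, "stride": 1}),  # C -> 4C
    ("shuffle", {"stride": 2}),             # 4C -> C
    ("crop", {"stride": 2, "depth": 5}),
    ("act", {}),
    ("qconv", {"kernel": 3, "stride": 1}),
    ("act", {}),                            # assumed position
    ("qconv", {"kernel": 3, "stride": 1}),
    ("act", {}),                            # assumed position
    ("qconv", {"kernel": 3, "stride": 1}),
]

UPSAMPLING = {"qtconv", "shuffle"}


def upsamplers_are_cropped(layers) -> bool:
    """True if every upsampling layer is directly followed by a matching crop."""
    for i, (kind, params) in enumerate(layers):
        if kind in UPSAMPLING:
            if i + 1 >= len(layers):
                return False
            next_kind, next_params = layers[i + 1]
            if next_kind != "crop" or next_params["stride"] != params["stride"]:
                return False
    return True


def total_upsampling(layers) -> int:
    """Product of the strides of all upsampling layers."""
    factor = 1
    for kind, params in layers:
        if kind in UPSAMPLING:
            factor *= params["stride"]
    return factor


print(upsamplers_are_cropped(BASE), upsamplers_are_cropped(HIGH))
print(total_upsampling(BASE), total_upsampling(HIGH))
```

Both pipelines upsample by the same overall factor; they differ only in whether the upsampling is done by transposed convolution or by convolution plus pixel shuffle.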

[0564] In one embodiment of the present invention, tensor boundary processing is used in a synthetic transform network to reduce the amount of data processed by the neural network and improve encoding efficiency. Tensor boundary processing means adding a padding layer before the downsampling layer or a layer with the same function as the downsampling layer, and adding a cropping layer after the upsampling layer or a layer with the same function as the upsampling layer. This ensures that the tensor size is an integer without introducing uncertainty at any step of the processing. If the cropping layer is not properly configured in the synthetic transform, the reconstructed picture size will differ from the encoded picture size.

[0565] One embodiment of the present invention discloses a method for processing an image using a neural network (NN), as shown in Figure 33, wherein the NN includes a composite transform network, and the composite transform network includes a concatenation layer configured to concatenate a principal tensor and an auxiliary tensor as an input tensor, a base operating point, and a high operating point. For the base operating point, the composite transform network includes a lightweight residual block, a first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a first pixel shuffle layer and a third cropping layer. For the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and is completed by a fifth transposed convolutional layer and a subsequent seventh cropping layer. Method 3300 includes: step 3310, concatenating the principal tensor and the auxiliary tensor to obtain the input tensor; step 3320, obtaining the operating point indicator; step 3330, obtaining the size of the input tensor; step 3340, deciding, based on the operating point indicator, whether to process the input tensor using the base operating point or the high operating point; and step 3350, outputting the processed tensor.

[0566] A composite transformation network can be understood to include two processing pipelines, one of which is the base operating point and the other the high operating point. In some other embodiments, the base operating point may also be referred to as the base profile, baseline, base pipeline, base channel, base subnetwork, or some other name, and correspondingly, the high operating point may also be referred to as the high profile, high line, high pipeline, high channel, high subnetwork, or some other name.

[0567] Furthermore, the exact design of the composite transformation network may change in the future. For example, the composite transformation network may include more or fewer quantized transpose convolutional layers, more or fewer quantized convolutional layers, and more or fewer pixel shuffle layers. The composite transformation network may also include several other layers for changing the size or shape of the tensors. However, regardless of how the composite transformation network is modified, it is important to understand that a padding layer must be present before downsampling / downshuffle layers, or layers with the same function as downsampling / downshuffle layers, and a cropping layer must be present after upsampling / upshuffle layers, or layers with the same function as upsampling / upshuffle layers.

[0568] In one embodiment, the first cropping layer has a stride of 2 and a depth of 4.

[0569] In one embodiment, the second cropping layer has a stride of 2 and a depth of 3.

[0570] In one embodiment, the third cropping layer has a stride of 4 and a depth of 1.

[0571] In one embodiment, the fourth cropping layer has a stride of 2 and a depth of 4.

[0572] In one embodiment, the fifth cropping layer has a stride of 2 and a depth of 3.

[0573] In one embodiment, the sixth cropping layer has a stride of 2 and a depth of 2.

[0574] In one embodiment, the seventh cropping layer has a stride of 2 and a depth of 1.

[0575] In one embodiment, when the operating point indicator is equal to 0, the input tensor is processed using the base operating point.

[0576] In one embodiment, when the operating point indicator is equal to 1, the high operating point is used to process the input tensor.

[0577] In one embodiment, the principal tensor is a reconstructed latent space tensor.

[0578] One embodiment of the present invention discloses a neural network (NN) comprising a synthetic transform network, the synthetic transform network comprising a concatenation layer configured to concatenate a principal tensor and an auxiliary tensor as an input tensor, a base operating point, and a high operating point. For the base operating point, the synthetic transform network comprises a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, and a subsequent first pixel shuffle layer and a third cropping layer. For the high operating point, the synthetic transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and is completed by a fifth transposed convolutional layer and a subsequent seventh cropping layer.

[0579] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network (NN), the NN comprising a composite transform network, the composite transform network comprising a concatenation layer configured to concatenate a principal tensor and an auxiliary tensor as input tensors, a base operating point and a high operating point, the composite transform network comprising a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer. For high operating points, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, followed by a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and terminates with a fifth transposed convolutional layer and a seventh cropping layer. The encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of the embodiments described above.

[0580] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network (NN), the NN comprising a composite transform network, the composite transform network comprising a concatenation layer configured to concatenate a principal tensor and an auxiliary tensor as an input tensor, a base operating point, and a high operating point. For the base operating point, the composite transform network comprises a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer. For the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer combined with a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, terminating with a fifth transposed convolutional layer and a subsequent seventh cropping layer. The decoder further includes a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any one of the embodiments described above.

[0581] An example of a composite transformation network is shown in Figure 34.

[0582] Learning-based reconstruction (also known as synthetic transformation) consists of two pipelines with identical neural network architectures, except for the input size and number of channels.

[0583] The input for analysis and transformation is:

[Equation not reproduced in the source text]

[0584] The output of the analysis transformation is:

[Equation not reproduced in the source text]

[0585] Composite transformations are,

[Equation not reproduced in the source text]

[0586] For the base operating point (opIdx=0), the number of input channels is C + C_d. A single lightweight residual block is followed by a series of two transposed convolutions with kernel size 4x4, each combined with a cropping layer (stride 2, depths 4 and 3, respectively) and a residual activation unit with kernel size 3x3. The numbers of output channels of the transposed convolutions are C1 and C2, respectively. The stride of both transposed convolutions is 2. The next step in the process is a normal convolution with kernel size 3x3, stride 1, and an unchanged number of channels C2, combined with a residual activation unit (kernel size 3x3). Then there is a 3x3 convolution with stride 1 that increases the number of channels from C2 to 16·C_in. This is done so that the output of the next step, a pixel shuffle with stride 4, has C_in channels. The process is completed by the cropping layer (stride 4, depth 1).
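The channel bookkeeping of the base operating point can be traced with a short sketch. The function names and the concrete channel numbers in the example call are hypothetical and chosen only for illustration; the rule that a pixel shuffle with stride r divides the channel count by r·r is the standard one.

```python
def shuffle_channels(channels: int, stride: int) -> int:
    """Channel count after a pixel shuffle with the given stride."""
    if channels % (stride * stride) != 0:
        raise ValueError("channel count must be divisible by stride*stride")
    return channels // (stride * stride)


def base_channel_trace(c: int, c_d: int, c1: int, c2: int, c_in: int):
    """Channel count after each stage of the base synthesis pipeline."""
    return [
        ("concatenated input", c + c_d),
        ("transposed conv 1 + crop (stride 2, depth 4)", c1),
        ("transposed conv 2 + crop (stride 2, depth 3)", c2),
        ("conv 3x3, stride 1", c2),
        ("conv 3x3, stride 1, expand", 16 * c_in),
        ("pixel shuffle stride 4 + crop (stride 4, depth 1)",
         shuffle_channels(16 * c_in, 4)),
    ]


# Hypothetical channel numbers, for illustration only.
trace = base_channel_trace(c=8, c_d=4, c1=6, c2=5, c_in=3)
for name, channels in trace:
    print(f"{name}: {channels}")
```

The expansion to 16·C_in channels exists only so that the stride-4 pixel shuffle lands on exactly C_in output channels.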

[0587] For the high operating point (opIdx=1), the number of input channels is C + C_d. Two residual blocks are followed by a series of two transposed convolutions with kernel size 3x3, each combined with a cropping layer (stride 2, depths 4 and 3, respectively) and a residual activation of kernel size 3x3. The number of output channels of both transposed convolutions is C. The stride is 2 for both transposed convolutions. The next step in the process is a normal convolution with kernel size 3x3, stride 1, and 4C output channels. This is done so that the output of the next step, a pixel shuffle with stride 2, has C channels. Then, a residual nonlocal attention block (

[Equation not reproduced in the source text]

[0588] Embodiments of the present invention disclose a decoder for decoding a bitstream representing a picture, the decoder comprising one or more processors for implementing a neural network (NN), the one or more processors being adapted to perform the method according to any one of the embodiments described above.

[0589] Embodiments of the present invention disclose an encoder for encoding a picture, the encoder comprising one or more processors implementing a neural network (NN), the one or more processors being adapted to perform the method according to any one of the embodiments described above.

[0590] Embodiments of the present invention disclose a computer program product that, when executed on a computer system, includes computer-executable instructions that cause the computer system to perform a method according to any one of the embodiments described above.

[0591] One embodiment of the present invention discloses a neural network (NN), wherein the NN includes a multi-stage context model (MCM), the MCM includes a plurality of MCM_k models, and the MCM further includes one or more down-shuffle layers and one or more up-shuffle layers, with a padding layer before each of the one or more down-shuffle layers and a cropping layer after each of the one or more up-shuffle layers.

[0592] In one embodiment, the NN further includes a hyperscale decoder, the hyperscale decoder includes a baseline and a highline, the baseline includes two or more quantized transposed convolutional layers, each of which is followed by a cropping layer and a normalized linear unit, and the highline includes two or more quantized convolutional layers, each of which is followed by a pixel shuffle layer, a cropping layer and a normalized linear unit in that order.

[0593] In one embodiment, the NN further comprises a composite transformation network, the composite transformation network comprising a baseline and a high line. The baseline comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each followed by a pixel shuffle layer and a cropping layer. The high line comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation, and a convolutional layer followed by a residual nonlocal attention block combined with a pixel shuffle layer and a cropping layer.

[0594] One embodiment of the present invention discloses an encoder for encoding a picture, the encoder comprising a receiver for receiving a picture, a transmitter for outputting a bitstream, and one or more processors configured to implement a neural network according to any one of the above embodiments.

[0595] One embodiment of the present invention discloses a decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream, a transmitter for outputting the decoded picture, and one or more processors configured to implement a neural network according to any one of the above embodiments.

[0596] One embodiment of the present invention discloses a method for processing a picture using a neural network (NN), wherein the NN includes a multi-stage context model (MCM), the MCM includes a plurality of MCM_k models, and the MCM further includes one or more down-shuffle layers and one or more up-shuffle layers, with a padding layer before each of the one or more down-shuffle layers and a cropping layer after each of the one or more up-shuffle layers. The method includes: obtaining an input tensor; padding a first tensor with the padding layer before each of the one or more down-shuffle layers, wherein the first tensor is the input tensor or a tensor obtained by processing the input tensor; and, after a second tensor is output from each of the one or more up-shuffle layers, cropping the second tensor using the cropping layer.
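The pad, down-shuffle, up-shuffle, crop sequence can be sketched on a single-channel picture to show that the original size is restored exactly. This is an illustration under assumptions: zero padding on the bottom and right edges is assumed, the processing by the MCM_k models between the two shuffles is omitted, and all function names are hypothetical.

```python
def pad2d(x, r):
    """Zero-pad bottom and right so both dimensions are multiples of r."""
    h, w = len(x), len(x[0])
    hp, wp = -(-h // r) * r, -(-w // r) * r
    rows = [row + [0] * (wp - w) for row in x]
    rows += [[0] * wp for _ in range(hp - h)]
    return rows


def down_shuffle(x, r):
    """[H, W] -> [r*r, H/r, W/r] (space-to-depth)."""
    h, w = len(x), len(x[0])
    if h % r or w % r:
        raise ValueError("pad the tensor before down-shuffling")
    return [[[x[i * r + di][j * r + dj] for j in range(w // r)]
             for i in range(h // r)]
            for di in range(r) for dj in range(r)]


def up_shuffle(planes, r):
    """[r*r, H, W] -> [H*r, W*r], the inverse of down_shuffle."""
    h, w = len(planes[0]), len(planes[0][0])
    return [[planes[(i % r) * r + (j % r)][i // r][j // r]
             for j in range(w * r)]
            for i in range(h * r)]


def crop2d(x, h, w):
    """Keep the top-left h x w region."""
    return [row[:w] for row in x[:h]]


picture = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]            # 3x3, odd size
planes = down_shuffle(pad2d(picture, 2), 2)            # 4 planes of 2x2
restored = crop2d(up_shuffle(planes, 2), 3, 3)         # back to 3x3
print(len(planes), len(planes[0]), len(planes[0][0]))
print(restored == picture)
```

Without the padding step, the 3x3 input could not be split into 2x2 blocks at all; without the cropping step, the output would be 4x4 instead of 3x3.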

[0597] In one embodiment, the NN further includes a hyperscale decoder, the hyperscale decoder includes a baseline and a high line, the baseline includes two or more quantized transposed convolutional layers, each of which is followed by a cropping layer and a normalized linear unit, and the high line includes two or more quantized convolutional layers, each of which is followed by a pixel shuffle layer, a cropping layer, and a normalized linear unit. The method further includes: obtaining the operating point indicator; and determining, based on the operating point indicator, whether to process a third tensor using the baseline or the high line.

[0598] In one embodiment, the NN further comprises a composite transformation network, the composite transformation network comprising a baseline and a high line. The baseline comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each followed by a pixel shuffle layer and a cropping layer. The high line comprises two or more transposed convolutional layers, each followed by a cropping layer and a residual activation, and a convolutional layer followed by a residual nonlocal attention block combined with a pixel shuffle layer and a cropping layer. The method further includes: obtaining a second input tensor by concatenating the principal tensor and the auxiliary tensor; obtaining the operating point indicator; and deciding, based on the operating point indicator, whether to process the second input tensor using the baseline or the high line.

[0599] In one embodiment, when the operating point indicator is equal to 0, the baseline is used to process the second input tensor.

[0600] In one embodiment, when the operating point indicator is equal to 1, the high line is used to process the second input tensor.

Mathematical operators

[0601] The mathematical operators used in this application are similar to those used in the C programming language. However, the results of integer division and arithmetic shift operations are more precisely defined, and additional operations such as exponentiation and real-valued division are defined. Numbering and counting conventions generally start from 0; for example, "the first" is equivalent to the 0th, "the second" is equivalent to the 1st, and so on.

Arithmetic operators

The following arithmetic operators are defined as follows: [Table 1]

Logical operators

The following logical operators are defined as follows:

[Operator definition tables not reproduced in the source text]
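The remark that integer division and arithmetic shifts need a more precise definition than in C can be made concrete with a small sketch. In C, the result of right-shifting a negative signed value is implementation-defined, and floor and truncating division differ for negative operands. Which rule this application selects is not recoverable from the text here, so the code below only contrasts the candidates; the function names are hypothetical.

```python
def div_trunc(a: int, b: int) -> int:
    """Integer division with truncation toward zero."""
    q = abs(a) // abs(b)
    return q if (a < 0) == (b < 0) else -q


def div_floor(a: int, b: int) -> int:
    """Integer division rounding toward negative infinity."""
    return a // b


def arithmetic_shift_right(x: int, n: int) -> int:
    """Sign-preserving right shift (Python integers behave this way)."""
    return x >> n


print(div_trunc(7, 4), div_trunc(-7, 4), div_trunc(-7, -4))
print(div_floor(7, 4), div_floor(-7, 4))
print(arithmetic_shift_right(-7, 1))
```

For positive operands all variants agree; the results diverge only for negative values, which is exactly where two decoders could otherwise produce different outputs.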

[0602] Order of operations

If the precedence in an expression is not explicitly indicated by parentheses, the following rules apply:
- Operations with higher priority are evaluated before any operations with lower priority.
- Operations with the same priority are evaluated sequentially from left to right.

[0603] The table below specifies the order of operations from highest to lowest, with higher positions in the table indicating higher priority.

[0604] For operators also used in the C programming language, the precedence used herein is the same as that used in the C programming language.

[0605] Table: Order of operations from highest priority (top of table) to lowest priority (bottom of table) [Table 2]

[0606] Text description of logical operations

[0607] In the text, a statement of logical operations that would be described mathematically in the following form:

if (condition 0)
    statement 0
else if (condition 1)
    statement 1
...
else /* useful notes regarding the remaining conditions */
    statement n

may be described in the following manner:

... as follows / ... the following applies:
- If condition 0, statement 0
- Otherwise, if condition 1, statement 1
- ...
- Otherwise (useful notes regarding the remaining conditions), statement n

[0608] Each "If ... Otherwise, if ... Otherwise, ..." statement in the text is introduced with "... as follows" or "... the following applies", immediately followed by "If ...". The last condition of "If ... Otherwise, if ... Otherwise, ..." is always "Otherwise, ...". Interleaved "If ... Otherwise, if ... Otherwise, ..." statements can be identified by matching "... as follows" or "... the following applies" with the trailing "Otherwise, ...".

[0609] In this text, logical operation statements are in the following format:

[Statement format not reproduced in the source text]

[0610] In the text, a statement of logical operations that would be described mathematically in the following form:

if (condition 0)
    statement 0
if (condition 1)
    statement 1

may be described in the following manner:

- If condition 0, statement 0
- If condition 1, statement 1

[0611] While embodiments of the present invention have been described primarily in relation to video coding, it should be noted that embodiments of the coding system 10, encoder 20, and decoder 30 (and correspondingly system 10), as well as other embodiments described herein, may also be configured for still image processing or coding, i.e., processing or coding of an individual picture independent of any preceding or consecutive picture, as in video coding. Generally, when picture processing coding is limited to a single picture 17, only the inter-prediction units 244 (encoder) and 344 (decoder) may not be available. All other functions (also referred to as tools or techniques) of the video encoder 20 and video decoder 30 may equally be used for still image processing, such as residual calculation 204 / 304, transformation 206, quantization 208, inverse quantization 210 / 310, (inverse) transformation 212 / 312, segmentation 262 / 362, intra prediction 254 / 354, and / or loop filtering 220, 320, as well as entropy coding 270 and entropy decoding 304. In general, embodiments of the present disclosure may also be applied to other source signals, such as audio signals.

[0612] For example, embodiments of encoder 20 and decoder 30, and functions described herein with reference to encoder 20 and decoder 30, may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, functions may be stored as one or more instructions or code on a computer-readable medium or transmitted over a communication medium and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, including any medium that enables the transfer of a computer program from one location to another, for example according to a communication protocol. Thus, computer-readable media may generally correspond to (1) non-transitory tangible computer-readable storage media, or (2) communication media such as signals or carrier waves. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and / or data structures for implementing the techniques described herein. A computer program product may include a computer-readable medium.

[0613] As an example, and not a limitation, such computer-readable storage media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other media that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also appropriately referred to as a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but instead refer to non-transitory tangible storage media. As used herein, the terms "disk" and "disc" include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks (registered trademark), and Blu-ray discs, where a "disk" typically reproduces data magnetically, and a "disc" reproduces data optically using a laser. Any combination of the above should also be included within the scope of computer-readable media.

[0614] Instructions may be executed by one or more processors, such as digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable logic arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Therefore, the term “processor” as used herein may refer to any of the aforementioned structures or any other structure suitable for implementing the techniques described herein. Furthermore, in some embodiments, the functions described herein may be provided within dedicated hardware and / or software modules configured for encoding and decoding, or incorporated into a composite codec. The techniques may also be fully implemented in one or more circuits or logic elements.

[0615] The techniques of this disclosure can be implemented in a wide variety of devices or apparatus, including wireless handsets, integrated circuits (ICs), or sets of ICs (e.g., chipsets). Various components, modules, or units are described in this disclosure to highlight the functional aspects of devices configured to perform the disclosed techniques, but do not necessarily require implementation by different hardware units. Rather, as described above, various units may be combined in a codec hardware unit or provided by a set of interoperable hardware units, including one or more processors as described above, along with suitable software and / or firmware.

Claims

1. A method for processing a picture using a neural network, NN, wherein the NN includes a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, the MCM further includes a first down-shuffle layer and an up-shuffle layer, a first padding layer precedes the first down-shuffle layer, and a cropping layer follows the up-shuffle layer, the method comprising: obtaining a first tensor; obtaining a second tensor, wherein the second tensor is the output of a hyperdecoder; padding the first tensor using the first padding layer; down-shuffling the padded first tensor based on the first down-shuffle layer to obtain a re-shuffled first tensor; processing the re-shuffled first tensor and the second tensor based on the plurality of MCM_k models to obtain a latent space tensor; up-shuffling the latent space tensor based on the up-shuffle layer to obtain a re-shuffled latent space tensor; and cropping the re-shuffled latent space tensor based on the cropping layer to obtain a reconstructed latent tensor.
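
The padding, down-shuffle, up-shuffle, and cropping steps of claim 1 can be illustrated with a minimal sketch on a single-channel plane. Several details here are assumptions made only for illustration and are not fixed by the claim: a shuffle factor of 2, replication padding applied at the bottom and right edges, and the function names `pad_to_multiple`, `down_shuffle`, `up_shuffle`, and `crop`. The processing by the MCM_k models between the two shuffles is omitted.

```python
def pad_to_multiple(x, factor=2):
    """Replicate the last row/column until height and width are multiples of factor.
    (Assumption: bottom/right replication padding.)"""
    h, w = len(x), len(x[0])
    pad_h = (-h) % factor
    pad_w = (-w) % factor
    rows = [row + [row[-1]] * pad_w for row in x]
    rows += [list(rows[-1]) for _ in range(pad_h)]
    return rows


def down_shuffle(x, factor=2):
    """Space-to-depth: split an H x W plane into factor*factor planes of size H/factor x W/factor."""
    h, w = len(x), len(x[0])
    return [
        [[x[i][j] for j in range(dj, w, factor)] for i in range(di, h, factor)]
        for di in range(factor)
        for dj in range(factor)
    ]


def up_shuffle(planes, factor=2):
    """Depth-to-space: inverse of down_shuffle."""
    h, w = len(planes[0]), len(planes[0][0])
    out = [[None] * (w * factor) for _ in range(h * factor)]
    for idx, plane in enumerate(planes):
        di, dj = divmod(idx, factor)  # same plane ordering as down_shuffle
        for i in range(h):
            for j in range(w):
                out[i * factor + di][j * factor + dj] = plane[i][j]
    return out


def crop(x, h, w):
    """Keep the top-left h x w region, discarding the padded rows and columns."""
    return [row[:w] for row in x[:h]]


# A 3 x 5 plane has odd dimensions, so it cannot be down-shuffled by 2 directly.
original = [[r * 5 + c for c in range(5)] for r in range(3)]
padded = pad_to_multiple(original)          # 4 x 6
planes = down_shuffle(padded)               # 4 planes of 2 x 3
restored = crop(up_shuffle(planes), 3, 5)   # back to 3 x 5
```

In this sketch the padding guarantees integer plane sizes after the down-shuffle, and the crop restores the original height and width after the up-shuffle.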

2. The method according to claim 1, wherein the number of inputs to the tensor slice of the first down-shuffle layer is 2.

3. The method according to claim 1 or 2, wherein the number of inputs to the tensor slice of the up-shuffle layer is 2.

4. The method according to any one of claims 1 to 3, wherein the first padding layer is denoted Padd(H_in, W_in), where H_in and W_in are the height and width of the first tensor.

5. The method according to claim 4, wherein the first padding layer has a stride of 2 and a depth of 5.

6. The method according to any one of claims 1 to 5, wherein the cropping layer is denoted Crop(H_in, W_in), where H_in and W_in are the height and width of the tensor output to the synthesis transformation.

7. The method according to claim 6, wherein the cropping layer has a stride of 2 and a depth of 5.

8. The method according to any one of claims 1 to 7, wherein the first tensor is a reconstructed residual tensor and the second tensor is an explicit prediction tensor.

9. A neural network, NN, wherein the NN includes a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, the MCM further includes a first down-shuffle layer and an up-shuffle layer, the first down-shuffle layer having a first padding layer and the up-shuffle layer having a cropping layer, and wherein the first down-shuffle layer is used to change the size or shape of a tensor and the up-shuffle layer is used to change the size or shape of a tensor.

10. The neural network according to claim 9, wherein the number of inputs to the tensor slice in the first down-shuffle layer is 2.

11. The neural network according to claim 9 or 10, wherein the number of inputs to the tensor slice in the up-shuffle layer is 2.

12. The NN according to any one of claims 9 to 11, wherein the first padding layer is denoted Padd(H_in, W_in), where H_in and W_in are the height and width of the first tensor.

13. The NN according to any one of claims 9 to 12, wherein the cropping layer is denoted Crop(H_in, W_in), where H_in and W_in are the height and width of the tensor output to the composite transformation.

14. The neural network according to any one of claims 9 to 13, wherein the first padding layer receives a tensor having a first size and outputs a tensor having a second size.

15. The NN according to any one of claims 9 to 14, wherein the first padding layer is formed by replication.

16. The neural network according to any one of claims 9 to 15, wherein an MCM_k model uses as input the output tensors of the preceding MCM_k models, for k from 0 to k-1.

17. A method for processing a picture using a neural network, NN, wherein the NN includes a hyperscale decoder, the hyperscale decoder includes a base operating point and a high operating point, wherein, with respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer, and wherein, with respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then includes three quantized convolutional layers, with two of the three quantized convolutional layers followed by a normalized linear unit, the method comprising: obtaining an input tensor; obtaining the size of the input tensor; obtaining an operating point indicator; determining, based on the operating point indicator, whether to process the input tensor using the base operating point or the high operating point; and outputting the processed tensor.

18. The method according to claim 17, wherein both the first quantized transposed convolutional layer and the second quantized transposed convolutional layer have a kernel size of 4 × 4 and a stride of 2.
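
As a worked illustration of why a cropping layer can be needed after a transposed convolution with a 4 × 4 kernel and a stride of 2, the sketch below computes output sizes along one dimension using the standard transposed-convolution size formula (dilation 1, no output padding). The padding value of 1 and the helper names `transposed_conv_out` and `crop_amount` are assumptions introduced for this example; the claims do not specify them.

```python
import math


def transposed_conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a transposed convolution along one dimension
    (dilation 1, no output padding). padding=1 is an assumed value."""
    return (size - 1) * stride - 2 * padding + kernel


def crop_amount(target, upsampled):
    """Number of samples a cropping layer would remove to reach the target length."""
    if upsampled < target:
        raise ValueError("upsampled tensor is smaller than the target size")
    return upsampled - target


# An odd-sized dimension: 5 samples are padded to 6 and downsampled by 2 to 3.
target = 5
downsampled = math.ceil(target / 2)           # 3
upsampled = transposed_conv_out(downsampled)  # (3 - 1) * 2 - 2 + 4 = 6
excess = crop_amount(target, upsampled)       # 1 sample to crop
```

With these assumed parameters the transposed convolution exactly doubles the input length, so any dimension that was padded before downsampling comes back one sample too long and is trimmed by the crop.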

19. The method according to claim 17 or 18, wherein the first cropping layer has a stride of 2 and a depth of 6.

20. The method according to any one of claims 17 to 19, wherein the second cropping layer has a stride of 2 and a depth of 5.

21. The method according to any one of claims 17 to 20, wherein the cropping layer following the first of the two quantized convolutional layers at the high operating point has a stride of 2 and a depth of 6, and the cropping layer following the second of the two quantized convolutional layers at the high operating point has a stride of 2 and a depth of 5.

22. The method according to any one of claims 17 to 21, wherein the processed tensor is a hyperscale decoder standard deviation tensor.

23. A method for processing a picture using a neural network, NN, wherein the NN includes a composite transform network, the composite transform network includes a concatenation layer configured to concatenate a main tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point, wherein, with respect to the base operating point, the composite transform network includes a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer, and wherein, with respect to the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and terminates with a fifth transposed convolutional layer and a subsequent seventh cropping layer, the method comprising: concatenating the main tensor and the auxiliary tensor to obtain an input tensor; obtaining an operating point indicator; obtaining the size of the input tensor; determining, based on the operating point indicator, whether to process the input tensor using the base operating point or the high operating point; and outputting the processed tensor.

24. The method according to claim 23, wherein the first cropping layer has a stride of 2 and a depth of 4.

25. The method according to claim 23 or 24, wherein the second cropping layer has a stride of 2 and a depth of 3.

26. The method according to any one of claims 23 to 25, wherein the third cropping layer has a stride of 4 and a depth of 1.

27. The method according to any one of claims 23 to 26, wherein the fourth cropping layer has a stride of 2 and a depth of 4.

28. The method according to any one of claims 23 to 27, wherein the fifth cropping layer has a stride of 2 and a depth of 3.

29. The method according to any one of claims 23 to 28, wherein the sixth cropping layer has a stride of 2 and a depth of 2.

30. The method according to any one of claims 23 to 29, wherein the seventh cropping layer has a stride of 2 and a depth of 1.

31. A neural network, NN, comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point, wherein, with respect to the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer, and wherein, with respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, followed by three quantized convolutional layers, two of which are followed by a normalized linear unit.

32. A neural network, NN, wherein the NN includes a composite transform network, the composite transform network includes a concatenation layer configured to concatenate a main tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point, wherein, with respect to the base operating point, the composite transform network includes a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer, and wherein, with respect to the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer, followed by a second pixel shuffle layer, followed by a sixth cropping layer and a third residual activation, and terminates with a fifth transposed convolutional layer and a seventh cropping layer.

33. An encoder for encoding a picture, comprising a receiver for receiving a picture and one or more processors configured to implement a neural network, NN, wherein the NN includes a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, the MCM further comprises a first down-shuffle layer and an up-shuffle layer, the first down-shuffle layer has a first padding layer before it and the up-shuffle layer has a cropping layer, the encoder further comprises a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of claims 1 to 8.

34. An encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network, NN, the NN comprising a hyperscale decoder, the hyperscale decoder comprising a base operating point and a high operating point, wherein, with respect to the base operating point, the hyperscale decoder comprises a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer, and wherein, with respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by a normalized linear unit, the encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of claims 17 to 22.

35. An encoder for encoding a picture, the encoder comprising a receiver for receiving a picture and one or more processors configured to implement a neural network, NN, wherein the NN comprises a composite transform network, the composite transform network comprising a concatenation layer configured to concatenate a main tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point, wherein, with respect to the base operating point, the composite transform network comprises a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer, and wherein, with respect to the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and terminates with a fifth transposed convolutional layer and a seventh cropping layer, the encoder further includes a transmitter for outputting a bitstream, and the encoder is adapted to perform the method according to any one of claims 23 to 30.

36. An encoder for encoding a picture, the encoder comprising one or more processors implementing a neural network, NN, wherein the one or more processors are adapted to perform the method according to any one of claims 1 to 8, or any one of claims 17 to 22, or any one of claims 23 to 30.

37. A decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network, NN, wherein the NN comprises a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, the MCM further comprises a first down-shuffle layer and an up-shuffle layer, the first down-shuffle layer is preceded by a first padding layer and the up-shuffle layer is followed by a cropping layer, the decoder further comprises a transmitter for outputting a decoded picture, and the decoder is adapted to perform the method according to any one of claims 1 to 8.

38. A decoder for decoding a bitstream representing a picture, wherein the decoder includes a receiver for receiving the bitstream and one or more processors configured to implement a neural network, NN, wherein the NN includes a hyperscale decoder, the hyperscale decoder includes a base operating point and a high operating point, wherein, with respect to the base operating point, the hyperscale decoder includes a first quantized transposed convolutional layer, followed by a first cropping layer and a first normalized linear unit, a first quantized convolutional layer, followed by a second normalized linear unit, a second quantized transposed convolutional layer, followed by a second cropping layer and a third normalized linear unit, and a second quantized convolutional layer, and wherein, with respect to the high operating point, the hyperscale decoder includes two quantized convolutional layers, each of which is followed in order by a pixel shuffle layer, a cropping layer, and a normalized linear unit, and then three quantized convolutional layers, two of which are followed by a normalized linear unit, the decoder further includes a transmitter for outputting a decoded picture, and the decoder is adapted to perform the method according to any one of claims 17 to 22.

39. A decoder for decoding a bitstream representing a picture, the decoder comprising a receiver for receiving the bitstream and one or more processors configured to implement a neural network, NN, wherein the NN comprises a composite transform network, the composite transform network comprising a concatenation layer configured to concatenate a main tensor and an auxiliary tensor as input tensors, a base operating point, and a high operating point, wherein, with respect to the base operating point, the composite transform network comprises a lightweight residual block, a subsequent first transposed convolutional layer combined with a first cropping layer and a first residual activation unit, a second transposed convolutional layer combined with a second cropping layer and a second residual activation unit, a first convolutional layer combined with a third residual activation unit, a second convolutional layer, and a subsequent first pixel shuffle layer and a third cropping layer, and wherein, with respect to the high operating point, the composite transform network includes two residual blocks, followed by a third transposed convolutional layer combined with a fourth cropping layer and a first residual activation, a fourth transposed convolutional layer combined with a fifth cropping layer and a second residual activation, a third convolutional layer, followed by a second pixel shuffle layer, and a residual nonlocal attention block combined with a sixth cropping layer and a third residual activation, and terminates with a fifth transposed convolutional layer and a seventh cropping layer, the decoder further includes a transmitter for outputting the decoded picture, and the decoder is adapted to perform the method according to any one of claims 23 to 30.

40. A decoder for decoding a bitstream representing a picture, wherein the decoder includes one or more processors for implementing a neural network, NN, the one or more processors are adapted to perform the method according to any one of claims 1 to 8, or any one of claims 17 to 22, or any one of claims 23 to 30.

41. A computer-readable storage medium comprising a computer-executable instruction, wherein, when executed on a computing system, the computer-executable instruction causes the computing system to perform the method according to any one of claims 1 to 8, or any one of claims 17 to 22, or any one of claims 23 to 30.

42. A neural network, NN, wherein the NN includes a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, and the MCM further comprises one or more down-shuffle layers and one or more up-shuffle layers, each of the one or more down-shuffle layers having a padding layer before it and each of the one or more up-shuffle layers having a cropping layer after it.

43. The NN according to claim 42, further comprising a hyperscale decoder, the hyperscale decoder comprising a baseline and a high line, the baseline comprising two or more quantized transposed convolutional layers, each of the two or more quantized transposed convolutional layers followed by a cropping layer and a normalized linear unit, the high line comprising two or more quantized convolutional layers, at least one of the two or more quantized convolutional layers followed in order by a pixel shuffle layer, a cropping layer and a normalized linear unit.

44. The NN according to claim 42 or 43, further comprising a composite transform network, the composite transform network comprising a baseline and a high line, the baseline comprising two or more transposed convolutional layers, each of the two or more transposed convolutional layers followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each of the one or more convolutional layers followed by a pixel shuffle layer and a cropping layer, the high line comprising two or more transposed convolutional layers, each of the two or more transposed convolutional layers followed by a cropping layer and a residual activation, and a residual nonlocal attention block combined with a pixel shuffle layer and a cropping layer following the convolutional layer.

45. An encoder for encoding a picture, the encoder comprising a receiver for receiving a picture, a transmitter for outputting a bitstream, and one or more processors configured to implement a neural network, NN, as described in any one of claims 42 to 44.

46. A decoder for decoding a bitstream representing a picture, wherein the decoder includes a receiver for receiving the bitstream, a transmitter for outputting the decoded picture, and one or more processors configured to implement a neural network, NN, as described in any one of claims 42 to 44.

47. A method for processing a picture using a neural network, NN, wherein the NN includes a multi-stage context model, MCM, the MCM comprises a plurality of MCM_k models, the MCM further includes one or more down-shuffle layers and one or more up-shuffle layers, each of the one or more down-shuffle layers has a padding layer before it and each of the one or more up-shuffle layers has a cropping layer after it, the method comprising: obtaining an input tensor; padding a first tensor using the padding layer before each of the one or more down-shuffle layers, wherein the first tensor is the input tensor or a tensor obtained by processing the input tensor; and cropping a second tensor using the cropping layer after the second tensor has been output from each of the one or more up-shuffle layers.

48. The method according to claim 47, wherein the NN further includes a hyperscale decoder, the hyperscale decoder includes a baseline and a high line, the baseline includes two or more quantized transposed convolutional layers, each of the two or more quantized transposed convolutional layers followed by a cropping layer and a normalized linear unit, the high line includes two or more quantized convolutional layers, each of the two or more quantized convolutional layers followed by a pixel shuffle layer, a cropping layer and a normalized linear unit, and the method further comprises: obtaining an operating point indicator; and determining, based on the operating point indicator, whether to process a third tensor using the baseline or the high line.

49. The method, wherein the NN further comprises a composite transform network, the composite transform network comprising a baseline and a high line, the baseline comprising two or more transposed convolutional layers, each of the two or more transposed convolutional layers followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each of the one or more convolutional layers followed by a pixel shuffle layer and a cropping layer, the high line comprising two or more transposed convolutional layers, each of the two or more transposed convolutional layers followed by a cropping layer and a residual activation unit, and one or more convolutional layers, each of the one or more convolutional layers followed by a residual nonlocal attention block combined with a pixel shuffle layer and a cropping layer, and the method further comprises: obtaining a second input tensor by concatenating the main tensor and the auxiliary tensor; obtaining an operating point indicator; and determining, based on the operating point indicator, whether to process the second input tensor using the baseline or the high line.

50. The method according to any one of claims 47 to 49, wherein the second input tensor is processed using the baseline when the operating point indicator is equal to 0.

51. The method according to any one of claims 47 to 50, wherein the second input tensor is processed using the high line when the operating point indicator is equal to 1.
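
The selection described in claims 50 and 51 amounts to a two-way dispatch on the operating point indicator. The following is a minimal sketch, assuming the two processing paths are available as callables; the names `select_line`, `baseline_fn`, and `high_line_fn` are hypothetical and stand in for the actual networks.

```python
def select_line(operating_point_indicator, baseline_fn, high_line_fn):
    """Return the processing path chosen by the operating point indicator:
    0 selects the baseline, 1 selects the high line."""
    if operating_point_indicator == 0:
        return baseline_fn
    if operating_point_indicator == 1:
        return high_line_fn
    raise ValueError(f"unsupported operating point indicator: {operating_point_indicator}")


# Placeholder paths that only tag the tensor with the chosen line.
def baseline_fn(tensor):
    return ("baseline", tensor)


def high_line_fn(tensor):
    return ("high line", tensor)


second_input_tensor = [1, 2, 3]
processed = select_line(0, baseline_fn, high_line_fn)(second_input_tensor)
```

Rejecting any indicator value other than 0 or 1 is a design choice of this sketch; the claims do not state how other values are handled.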