Low complexity enhancement video coding

By combining a base codec with low-complexity enhancement layers that use small transform kernels and sparse residual processing, the approach addresses the high complexity and limited bandwidth savings of existing scalable video encoders, giving flexible and efficient video coding suited to a variety of application scenarios.

CN114503573B · Active · Publication Date: 2026-06-19 · V NOVA INT LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents (China)
Current Assignee / Owner
V NOVA INT LTD
Filing Date
2020-03-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing scalable video encoders such as SVC and SHVC are insufficient in terms of complexity and bandwidth savings, making it difficult to meet the needs of modern video delivery, especially in situations where resources are limited and ecosystem upgrades are difficult.

Method used

A low-complexity enhancement video coding framework is adopted. It combines a base codec with at least two enhancement layers, uses small transform kernels and sparse residual processing to reduce computational complexity and improve coding efficiency, and allows the enhancement layers to be processed in parallel.

Benefits of technology

It provides a flexible video encoding solution with low computational complexity and resource consumption, suitable for a variety of application scenarios, including OTT transmission and live UHD broadcasting, compatible with the existing ecosystem, and reduces encoding and decoding costs.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN114503573B_ABST
Patent Text Reader

Abstract

Examples of low-complexity enhancement video coding are described, including encoding and decoding methods and the corresponding encoders and decoders. The enhancement coding can operate on top of a base layer, which provides base encoding and decoding. Spatial scaling can be applied across different layers. The base layer alone may encode the full video at a lower resolution. The enhancement coding instead operates on computed sets of residuals. The sets of residuals are computed for multiple layers, which may represent different levels of scaling in one or more dimensions. Several encoding and decoding components or tools are described, which may involve the application of transformation, quantization, entropy coding, and temporal buffering. At an example decoder, an encoded base stream and one or more encoded enhancement streams can be independently decoded and combined to reconstruct the original video.

Description

Technical Field

[0001] This invention relates to video coding technology. More specifically, it relates to methods and systems for encoding and decoding video data. In certain instances, the methods and systems can be used to generate compressed representations for streaming and/or storage.

Background Technology

[0002] Typical video codecs operate using a single-layer, block-based approach, whereby the raw signal is processed by several encoding tools to produce an encoded signal, which can then be reconstructed through a corresponding decoding process. For simplicity, encoding and decoding algorithms or processes are often referred to as "codecs"; the term "codec" is used to encompass one or more encoding and decoding processes designed according to a common framework. Typical codecs include (but are not limited to) MPEG-2, AVC/H.264, HEVC/H.265, VP8, VP9, and AV1. Other codecs are currently being developed by international standards organizations such as MPEG/ISO/ITU and by industry consortia such as the Alliance for Open Media (AoM).

[0003] In recent years, adaptations to single-layer block-based methods have been proposed. For example, there exists a class of codecs that use multi-layer block-based methods. These codecs are often referred to in the video coding industry as "scalable" codecs. They typically replicate the operations performed by single-layer block-based methods across several layers, where the set of layers is obtained by downsampling the original signal. In some cases, the efficiency of single-layer block-based methods can be achieved by reusing information from lower layers to encode (and decode) upper layers. These scalable codecs intend to provide scalability to operators in the sense that they need to guarantee that the quality of the downscaled decoded signal (e.g., a lower resolution signal) meets the quality requirements of existing services, and that the quality of the unscaled decoded signal (e.g., a higher resolution signal) is comparable to that produced by the corresponding single-layer codec.

[0004] An example of a “scalable” codec is Scalable Video Coding (SVC) (see, for example, “The Scalable Video Coding Extension of the H.264/AVC Standard,” H. Schwarz and M. Wien, IEEE Signal Processing Magazine, March 2008, which is incorporated herein by reference). SVC is the scalable form of the Advanced Video Coding standard (AVC, also known as H.264). In SVC, each scalable layer is processed using the same AVC-based single-layer process, and the upper layer receives information from the lower layer (e.g., inter-layer predictions containing residual and motion information), which is used when coding the upper layer to reduce the amount of coded information at that layer. Conversely, for decoding, the SVC decoder needs to receive various overhead information and decode the lower layer in order to be able to decode the upper layer.

[0005] Another example of a scalable codec is SHVC, the scalable extension of the High Efficiency Video Coding (HEVC) standard (see, for example, "Overview of SHVC: Scalable Extensions of the High Efficiency Video Coding Standard," J. Boyce, Y. Ye, J. Chen, and A. Ramasubramonian, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 26, No. 1, January 2016, which is incorporated herein by reference). Similar to SVC, SHVC uses the same HEVC-based process for each scalable layer, but it allows the lower layer to use either AVC or HEVC. In SHVC, the upper layer also receives information from the lower layer during its encoding process (e.g., inter-layer processing that includes motion information and/or an upsampled form of the lower layer as an additional reference picture for upper-layer encoding) to reduce the amount of encoded information at the upper layer. Again, similar to SVC, the SHVC decoder needs to receive various overhead information and decode the lower layer in order to be able to decode the upper layer.

[0006] Both SVC and SHVC can be used to encode data in multiple streams at different quality levels. For example, SVC and SHVC can be used to encode SD (Standard Definition) and HD (High Definition) streams, or HD and UHD (Ultra High Definition) streams. The base stream (at the lowest quality level) is typically encoded such that its quality is the same as if it were encoded as a single stream, separately from any higher-level streams. Both SVC and SHVC can be considered primarily as a collection of parallel copies of a common encoder and decoder architecture, where the outputs of these parallel copies are multiplexed and demultiplexed, respectively.

[0007] More specifically, within the example SVC encoding, a UHD stream (e.g., a series of images) can be downsampled to generate an HD stream. The UHD stream and HD stream are then encoded separately using an AVC encoder. Although this example describes a two-layer encoder (for encoding two streams: the UHD stream and the HD stream), an SVC encoder can have n layers (where n > 2), with each layer operating as an independent AVC encoder.

[0008] According to standard AVC encoding, the AVC encoder of each SVC layer encodes each pixel block using either inter-frame prediction (where different frames are used to estimate the value of the current frame) or intra-frame prediction (where other blocks within the same frame are used to estimate the value of a given block in the same frame). These pixel blocks are often referred to as "macroblocks". Inter-frame prediction involves performing motion compensation, which involves determining the motion between pixel blocks in previous frames and corresponding pixel blocks in the current frame. Both inter-frame and intra-frame prediction within a layer involve calculating so-called "residuals". These "residuals" are the differences between pixel blocks in the data stream of a given layer and corresponding pixel blocks within the same layer determined using inter-frame or intra-frame prediction. Thus, these "residuals" are the differences between the current pixel block in the layer and either: 1) a prediction of the current pixel block based on one or more pixel blocks within the frame that are not the current pixel block (e.g., typically adjacent pixel blocks within the same layer); or 2) a prediction of the current pixel block within the layer based on information from other frames within the layer (e.g., using motion vectors).

[0009] In SVC, although implemented as a collection of parallel AVC encoders, some efficiency can be gained by reusing information obtained for lower-quality streams (e.g., HD streams) to encode higher-quality streams (e.g., UHD streams). This reuse of information involves a process called "inter-layer signaling." It should be noted that this differs from "inter-frame" and "intra-frame" prediction, which are "intra-layer" coding methods. For example, without inter-layer signaling, the total bandwidth $BW_{Tot}$ of an SVC stream can be expressed as $BW_{Tot} = BW_{HD} + BW_{UHD}$, where $BW_{HD}$ is the bandwidth associated with sending the encoded HD stream on its own and $BW_{UHD}$ is the bandwidth associated with sending the encoded UHD stream on its own (assuming no information is shared between the streams). However, by using inter-layer signaling, the bandwidth $BW_{UHD}$ of the UHD stream can be reduced compared to sending the UHD stream separately from the HD stream. Typically, by using inter-layer signaling, the total bandwidth can be reduced, making...

[0010] In SVC, inter-layer signaling can include one of three types of information: inter-layer intra-prediction (where upsampled pixel blocks from the HD stream are used in intra-prediction for the UHD stream), inter-layer residual prediction (which, for a given pixel block, involves computing a residual between the upsampled residual calculated for the HD stream and the residual calculated for the UHD stream), and inter-layer motion compensation (which involves performing motion compensation for the UHD stream using motion compensation parameters determined for the HD stream).

[0011] Similar to SVC being a scalable extension of AVC, SHVC is a scalable extension of HEVC. AVC involves dividing a frame into macroblocks (typically 16×16 pixels in size). A given macroblock can be predicted from other macroblocks within the frame (intra-prediction) or from macroblocks in previous frames (inter-prediction). HEVC's macroblock-like structure is the coding tree unit (CTU), which can be larger than a macroblock (e.g., up to 64×64 pixels in size), and it is further divided into coding units (CUs). HEVC offers several improvements over AVC, including improved motion vector determination, motion compensation, and intra-prediction, which can allow for better data compression compared to AVC. However, the "scalable" aspect of HEVC is very similar to that of AVC; that is, both use the concept of parallel coding streams, thereby gaining some efficiency through inter-layer information exchange. For example, SHVC also provides inter-layer communication that includes inter-layer intra-prediction, inter-layer residual prediction, and inter-layer motion compensation. Similar to SVC, different quality levels, such as HD and UHD, are encoded by parallel layers and then combined in the stream for decoding.

[0012] Despite the availability of SVC and SHVC, the utilization of scalable codecs has consistently fallen short of expectations. One reason for this is the complexity of these solutions and their modest bandwidth savings. Within the video delivery field, many leading industry experts believe that currently available solutions cannot address the challenges of delivering video in the 21st century. These industry experts encompass a wide range of entities, from vendors and traditional broadcasters to satellite providers and over-the-top (OTT) service providers such as social media companies.

[0013] Generally, video service providers need to navigate complex ecosystems. The choice of video codec is often based on a variety of factors, including maximum compatibility with the existing ecosystem and the cost of deploying the technology (both resource and monetary costs). Once a choice is made, it is difficult to change the codec without further large-scale investment in equipment and time. Currently, upgrading an ecosystem is difficult without replacing it completely. Furthermore, the resource costs and complexities of delivering an increasing number of services using decentralized infrastructure, such as so-called “cloud” configurations, are becoming critical issues for service operators of all sizes. This is compounded by the growth of low-resource, battery-powered edge devices (e.g., nodes in the so-called Internet of Things). All these factors must be balanced against both the need to reduce resource consumption (e.g., becoming more environmentally friendly) and the need to scale (e.g., increasing the number of users and services offered).

[0014] Many comparative codecs were developed at a time when large-scale commodity hardware was unavailable. This is no longer the case. Large-scale data centers offer inexpensive, general-purpose data processing hardware. This is at odds with traditional video coding solutions, which require custom hardware to operate effectively.

Summary of the Invention

[0015] Various aspects of the invention are set forth in the appended independent claims. Certain variations of the invention are then set forth in the appended dependent claims.

Attached Figure Description

[0016] Examples of the invention will now be described, by way of example only, with reference to the accompanying drawings.

[0017] Figure 1 is a schematic diagram of an encoder according to a first example.

[0018] Figure 2 is a schematic diagram of a decoder according to the first example.

[0019] Figure 3A is a schematic diagram of an encoder according to a first variant of a second example.

[0020] Figure 3B is a schematic diagram of an encoder according to a second variant of the second example.

[0021] Figure 4 is a schematic diagram of an encoder according to a third example.

[0022] Figure 5A is a schematic diagram of a decoder according to the second example.

[0023] Figure 5B is a schematic diagram of a first variant of a decoder according to the third example.

[0024] Figure 5C is a schematic diagram of a second variant of the decoder according to the third example.

[0025] Figure 6A is a schematic diagram showing an example 4x4 coding unit of residuals.

[0026] Figure 6B is a schematic diagram showing how coding units can be arranged into tiles.

[0027] Figures 7A to 7C are schematic diagrams showing possible color plane layouts.

[0028] Figure 8 is a flowchart showing a method of configuring a bitstream.

[0029] Figure 9A is a schematic diagram showing how a color plane can be decomposed into multiple layers.

[0030] Figures 9B to 9J are schematic diagrams showing various sampling methods.

[0031] Figures 10A to 10I are schematic diagrams showing various methods of entropy encoding quantized data.

[0032] Figures 11A to 11C are schematic diagrams showing different temporal modes.

[0033] Figures 12A and 12B are schematic diagrams showing components for applying temporal prediction according to examples.

[0034] Figures 12C and 12D are schematic diagrams showing how temporal signaling relates to coding units and tiles.

[0035] Figure 12E is a schematic diagram showing an example state machine for run-length encoding.

[0036] Figures 13A and 13B are flowcharts showing methods of applying temporal processing according to an example.

[0037] Figures 14A to 14C are schematic diagrams showing examples of cloud control.

[0038] Figure 15 is a schematic diagram showing residual weighting according to an example.

[0039] Figures 16A to 16D are schematic diagrams showing the calculation of predicted average elements according to various examples.

[0040] Figures 17A and 17B are schematic diagrams showing a rate controller that can be applied to one or more of the first-level and second-level enhancement coding.

[0041] Figure 18 is a schematic diagram showing a rate controller according to a first example.

[0042] Figure 19 is a schematic diagram showing a rate controller according to a second example.

[0043] Figures 20A to 20D are schematic diagrams showing different aspects of quantization that can be used in examples.

[0044] Figures 21A and 21B are schematic diagrams showing different bitstream configurations.

[0045] Figures 22A to 22D are schematic diagrams showing different aspects of an example neural network upsampler.

[0046] Figure 23 is a schematic diagram showing an example of how a frame can be encoded.

[0047] Figure 24 is a schematic diagram of a decoder according to a fourth example.

[0048] Figure 25 is a schematic diagram of an encoder according to a fifth example.

[0049] Figure 26 is a schematic diagram of a decoder according to the fifth example.

[0050] Figure 27 is a flowchart showing a decoding process according to an example.

[0051] Figures 28A to 28E show parse trees for prefix coding examples.

[0052] Figure 29A shows two types of bitstream that can be used to check the conformance of a decoder.

[0053] Figure 29B shows an example of a combined decoder.

[0054] Figure 30 shows example positions of chroma samples for the top and bottom fields of an example frame.

Detailed Implementation

[0055] Introduction

[0056] The specific examples described herein relate to a framework for a novel video coding technique that is flexible, adaptable, efficient, and computationally inexpensive. It combines a selectable base codec (e.g., AVC, HEVC, or any other current or future codec) with at least two enhancement layers of coded data. This framework provides a low-complexity yet flexible method for enhancing video data.

[0057] The specific examples described herein build on a novel multi-layer approach that has been developed. Details of this approach are described, for example, in U.S. patents numbered US8,977,065, US8,948,248, US8,711,943, US9,129,411, US8,531,321, US9,510,018, US9,300,980, and US9,626,772, and in PCT applications numbered PCT/EP2013/059833, PCT/EP2013/059847, PCT/EP2013/059880, PCT/EP2013/059853, PCT/EP2013/059885, PCT/EP2013/059886, and PCT/IB2014/060716, all of which are incorporated herein by reference. This new multi-layer approach uses a hierarchy of layers, where each layer can be associated with a different quality level, such as a different video resolution.

[0058] Examples of low-complexity enhancement video coding are described, including encoding and decoding methods and the corresponding encoders and decoders. The enhancement coding can operate on top of a base layer, which provides base encoding and decoding. Spatial scaling can be applied across different layers. The base layer alone may encode the complete video at a lower resolution. The enhancement coding, on the other hand, operates on computed sets of residuals. The sets of residuals are computed for multiple layers, which may represent different levels of scaling in one or more dimensions. Several encoding and decoding components or tools are described, which may involve the application of transforms, quantization, entropy coding, and temporal buffering. At an example decoder, an encoded base stream and one or more encoded enhancement streams can be independently decoded and combined to reconstruct the original video.

[0059] The general structure of the example encoding scheme presented herein uses a downsampled source signal encoded by a base codec, adds first-level correction data to the decoded output of the base codec to generate a corrected picture, and then adds a further level of enhancement data to an upsampled form of the corrected picture.
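The reconstruction order in this paragraph can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the nearest-neighbour upsampler, and the tiny 2×2 planes are all assumptions chosen only to make the three steps (correct, upsample, enhance) visible.

```python
# Hypothetical sketch of the two-level reconstruction described in [0059].
# All names and the nearest-neighbour upsampler are illustrative assumptions.

def upsample_nearest(plane):
    """Double width and height by repeating each sample (assumed upsampler)."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_planes(a, b):
    """Element-wise sum of two equally sized planes."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def reconstruct(decoded_base, level1_residuals, level2_residuals):
    # Step 1: add first-level correction data to the decoded base output.
    corrected = add_planes(decoded_base, level1_residuals)
    # Step 2: upsample the corrected picture to the full resolution.
    upsampled = upsample_nearest(corrected)
    # Step 3: add the second level of enhancement data.
    return add_planes(upsampled, level2_residuals)


decoded_base = [[10, 20], [30, 40]]
level1 = [[1, 0], [0, -1]]
level2 = [[0, 0, 1, 0],
          [0, 0, 0, 0],
          [0, -2, 0, 0],
          [0, 0, 0, 3]]
output = reconstruct(decoded_base, level1, level2)
print(output)
```

In this sketch the base picture and the Level 1 data share the lower resolution, while the Level 2 data has the full output resolution.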

[0060] The encoded stream described herein can be considered as comprising a base stream and an enhancement stream. The enhancement stream may have multiple layers (e.g., two are described in the example). The base stream may be decoded by a hardware decoder, while the enhancement stream may be adapted to a software processing implementation with appropriate power consumption.

[0061] The specific examples described herein have a structure that provides multiple degrees of freedom, allowing great flexibility and adaptability to many situations. This means that the coding format is suitable for many use cases, including OTT transmission, live streaming, live UHD broadcasting, and more.

[0062] Although the decoded output of the base codec is not intended for viewing, it is a fully decoded video at a lower resolution, thus making the output compatible with existing decoders and usable as a lower resolution output where appropriate.

[0063] The following description sets out specific example architectures for video encoding and decoding. These architectures employ a small number of simple coding tools to reduce complexity. When combined synergistically, these tools offer improved visual quality compared to a full-resolution picture encoded with the base codec, while also allowing greater flexibility in how they can be used.

[0064] The examples described herein provide a solution to the emerging need for ever lower power consumption, helping to reduce the computational cost of encoding and decoding while improving performance. The examples can operate as a software layer on top of existing infrastructure to achieve the desired performance. They provide a solution that is compatible with existing (and future) video streaming and delivery ecosystems, while achieving video encoding at a lower computational cost than would be available through a simple upgrade. Combining the coding efficiency of the latest codecs with the reduced processing requirements of the examples described herein can improve the technical case for adopting next-generation codecs.

[0065] The specific examples described herein operate on residuals. Residuals can be calculated by comparing two image or video signals. In one case, residuals are calculated by comparing frames from the input video stream with frames from a reconstructed video stream. In the case of a Level 1 enhancement stream as described herein, residuals can be calculated by comparing a downsampled input video stream with a first video stream that has been encoded by a base encoder and then decoded by a base decoder (e.g., the first video stream simulates the decoding and reconstruction of the downsampled input video stream at the decoder). In the case of a Level 2 enhancement stream as described herein, residuals can be calculated by comparing the input video stream (e.g., at a higher quality level or resolution than the downsampled or base video stream) with a second video stream reconstructed from an upsampled form of the first video stream plus a set of decoded Level 1 residuals (e.g., the second video stream simulates decoding both the base stream and the Level 1 enhancement stream, reconstructing the video stream at the lower or downsampled quality level, and then upsampling this reconstructed video stream). This is shown, for example, in Figures 1 to 5C.
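The two residual calculations can be sketched as follows. This is an illustrative sketch under stated assumptions, not the encoder of the figures: the 2×2 averaging downsampler, the nearest-neighbour upsampler, and `base_codec_round_trip` (coarse quantization standing in for a real base encoder and decoder) are hypothetical, and the Level 1 residuals are treated as losslessly decoded.

```python
# Hypothetical sketch of Level 1 and Level 2 residual generation ([0065]).
# The samplers and the stand-in base codec are assumptions for illustration.

def downsample_2x(plane):
    """Average each 2x2 block (assumed downsampler)."""
    return [[(plane[y][x] + plane[y][x + 1]
              + plane[y + 1][x] + plane[y + 1][x + 1]) // 4
             for x in range(0, len(plane[0]), 2)]
            for y in range(0, len(plane), 2)]


def upsample_2x(plane):
    """Repeat each sample twice horizontally and vertically (assumed upsampler)."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(2)]
        out += [wide, list(wide)]
    return out


def base_codec_round_trip(plane, step=8):
    """Stand-in for base encode followed by base decode: lossy quantization."""
    return [[(v // step) * step for v in row] for row in plane]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def compute_residuals(source):
    down = downsample_2x(source)
    base_recon = base_codec_round_trip(down)   # "first video stream"
    level1 = subtract(down, base_recon)        # Level 1 residuals
    corrected = add(base_recon, level1)        # assumes lossless Level 1 decoding
    predicted = upsample_2x(corrected)         # "second video stream"
    level2 = subtract(source, predicted)       # Level 2 residuals
    return base_recon, level1, level2


source = [[12, 14, 50, 52],
          [16, 18, 54, 56],
          [90, 90, 33, 35],
          [90, 90, 37, 39]]
base_recon, level1, level2 = compute_residuals(source)
rebuilt = add(upsample_2x(add(base_recon, level1)), level2)
print(level1, level2)
```

Because nothing in this sketch discards Level 1 or Level 2 data, `rebuilt` equals `source`; with lossy residual coding the match would only be approximate.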

[0066] In specific examples, the residuals can therefore be viewed as an error or difference at a particular quality level or resolution. In the described examples, there are two quality levels or resolutions and therefore two sets of residuals (Level 1 and Level 2). Each set of residuals described herein models a different form of error or difference. For example, Level 1 residuals typically correct for characteristics of the base encoder, such as artifacts introduced by the base encoder as part of the encoding process. In contrast, Level 2 residuals typically correct for the combined effects introduced by the transition between quality levels and the differences introduced by the Level 1 correction (e.g., artifacts generated by the Level 1 encoding pipeline over a wider spatial scale, such as a region of 4 or 16 pixels). This means that it is not self-evident that an operation performed on one set of residuals will have the same effect on the other set; for example, each set of residuals may have different statistical patterns and sets of correlations.

[0067] In the examples described herein, the residuals are encoded by an encoding pipeline. This may include transform, quantization, and entropy encoding operations. It may also include residual ranking, weighting and filtering, and temporal processing. These pipelines are shown in Figures 1, 3A and 3B. The residuals are then transmitted to the decoder, for example as Level 1 and Level 2 enhancement streams, which can be combined with the base stream as a hybrid stream (or transmitted separately). In one case, a bit rate is set for the hybrid data stream comprising the base stream and the two enhancement streams, and different adaptive bit rates are then applied to the individual streams, based on the data being processed, to meet the set bit rate. For example, high-quality video perceived with low levels of artifacts can be constructed by adaptively assigning bit rate to the different individual streams, even at a frame-by-frame level, so that the constrained data is used by the individual stream that is most perceptually influential, which can change as the image data changes.
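The quantization and entropy coding stages of such a pipeline can be sketched in a few lines. The sketch below is an assumption-laden illustration rather than the coding tools of the examples: it uses a uniform quantizer that truncates toward zero and a simple zero-run-length token format (`"Z"` for a run of zeros, `"V"` for a non-zero value), both invented here, and it omits the transform stage.

```python
# Hypothetical sketch of quantization followed by zero-run-length coding.
# The quantizer, the token format, and the step size are illustrative assumptions.

def quantize(values, step):
    """Uniform quantization, truncating toward zero so small values become 0."""
    return [int(v / step) for v in values]


def dequantize(levels, step):
    return [q * step for q in levels]


def run_length_encode(levels):
    """Emit ('Z', run) for runs of zeros and ('V', value) for non-zero levels."""
    tokens, run = [], 0
    for q in levels:
        if q == 0:
            run += 1
        else:
            if run:
                tokens.append(("Z", run))
                run = 0
            tokens.append(("V", q))
    if run:
        tokens.append(("Z", run))
    return tokens


def run_length_decode(tokens):
    levels = []
    for kind, value in tokens:
        levels.extend([0] * value if kind == "Z" else [value])
    return levels


residuals = [0, 1, -1, 0, 0, 9, 0, 0, 0, -7, 2, 0]
quantized = quantize(residuals, 4)
tokens = run_length_encode(quantized)
restored = dequantize(run_length_decode(tokens), 4)
print(tokens)
```

Twelve input values collapse to five tokens here because quantization maps the small residuals to zero and the zeros then group into runs.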

[0068] The sets of residuals described herein can be considered sparse data: in many cases there is no difference for a given pixel or region, and the resulting residual value is zero. Looking at the distribution of the residuals, much of the probability mass is assigned to small residual values close to zero; for example, for some videos the values -2, -1, 0, 1, and 2 occur most frequently. In some cases, the distribution of residual values is symmetrical or approximately symmetrical about 0. In some test video cases, the distribution of residual values was found to have a shape similar to a logarithmic or exponential distribution about 0 (e.g., symmetrical or approximately symmetrical). The exact distribution of the residual values can depend on the content of the input video stream.

[0069] The residuals can be treated as a two-dimensional image in their own right, such as a difference image. Viewed in this way, the sparsity of the data can be seen to relate to features visible in the residual image, such as “points,” small “lines,” “edges,” and “corners.” These features have been found to be generally not fully correlated (e.g., spatially and/or temporally). They have properties that differ from those of the image data from which they originate (e.g., the pixel characteristics of the original video signal).

[0070] Because the characteristics of the present residuals, including transformed residuals in the form of coefficients, differ from those of the image data from which they originate, it is generally not possible to apply standard coding methods, such as those found in traditional Moving Picture Experts Group (MPEG) encoding and decoding standards. For example, many comparative schemes use large transforms (e.g., transforms of large regions of pixels in a normal video frame). Given the characteristics of the residuals described above, using these comparative large transforms on residual images would be extremely inefficient. For example, it would be very difficult to encode a small point in a residual image using a large block designed for regions of a normal image.

[0071] The specific examples described herein address these problems by instead using small and simple transform kernels (e.g., the 2×2 or 4×4 kernels presented herein: the directional decomposition and the directional decomposition squared). This moves in a different direction from comparative video coding methods. Applying these new methods to blocks of residuals yields compression efficiency. For example, some transforms generate uncorrelated coefficients (e.g., in space) that can be compressed efficiently. While correlations between coefficients could be exploited, for example for lines in the residual image, doing so can lead to coding complexity that is difficult to implement on legacy and low-resource devices, and often generates other complex artifacts that need to be corrected. In the present examples, a transform (Hadamard) different from those of the comparative methods is used to encode the correction data and residuals. For example, the transform presented herein is far more efficient than transforming larger blocks of data using the Discrete Cosine Transform (DCT), which is the transform used in SVC/SHVC.
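A 2×2 Hadamard-based kernel of the kind mentioned above can be sketched as follows. This is a minimal illustration, assuming an unnormalized forward transform with a divide-by-four inverse and assuming the coefficient labels A, H, V, D (average, horizontal, vertical, diagonal); the exact kernel, scaling, and naming used by the examples may differ.

```python
# Hypothetical sketch of a 2x2 Hadamard directional decomposition.
# Row order, labels (A, H, V, D) and scaling are illustrative assumptions.

HADAMARD_2X2 = [
    [1,  1,  1,  1],   # A: average-like component
    [1, -1,  1, -1],   # H: difference between left and right columns
    [1,  1, -1, -1],   # V: difference between top and bottom rows
    [1, -1, -1,  1],   # D: diagonal difference
]


def forward_dd(block):
    """Transform a 2x2 block of residuals into four coefficients [A, H, V, D]."""
    flat = [block[0][0], block[0][1], block[1][0], block[1][1]]
    return [sum(k * r for k, r in zip(row, flat)) for row in HADAMARD_2X2]


def inverse_dd(coeffs):
    """Invert the transform; the kernel is symmetric and its square is 4 * I."""
    flat = [sum(k * c for k, c in zip(row, coeffs)) // 4 for row in HADAMARD_2X2]
    return [flat[0:2], flat[2:4]]


flat_block = [[3, 3], [3, 3]]      # constant residual: only A is non-zero
edge_block = [[2, -2], [2, -2]]    # left/right edge: only H is non-zero
point_block = [[1, 0], [0, 0]]     # single point: spreads over all four

print(forward_dd(flat_block), forward_dd(edge_block), forward_dd(point_block))
```

The transform needs only additions and subtractions, which is one reason a kernel of this size suits low-complexity and parallel implementations.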

[0072] The specific examples described herein also take into account the temporal as well as the spatial characteristics of residuals. For example, details such as "edges" and "points" observable in a residual "image" exhibit very little temporal correlation. This is because "edges" in a residual image typically do not translate or rotate as edges do in a normal video stream. For example, within a residual image, "edges" can actually change shape over time; a head turning, for instance, may be captured within multiple residual image "edges," but these may not move in a standard manner (because the "edges" reflect a composite difference that depends on factors such as illumination, scaling factors, coding factors, etc.). These temporal aspects of residual images, such as a residual "video" comprising sequential residual "frames" or "pictures," are typically different from the temporal aspects of conventional images, such as normal video frames (e.g., in the Y, U, or V planes). Therefore, it is not obvious how conventional coding methods can be applied to residual images; in fact, it has been found that comparative video coding schemes and standard motion compensation methods are ineffective at encoding residual data in a useful way.

[0073] An AVC layer within SVC may involve calculating data that is referred to in that comparative standard as "residuals." However, these comparative "residuals" are the difference between a pixel block of that layer's data stream and a corresponding pixel block determined using inter-frame prediction or intra-frame prediction, and they are significantly different from the residuals encoded in the present examples. In SVC, a "residual" is the difference between a pixel block of a frame and a predicted pixel block for that frame (predicted using inter-frame prediction or intra-frame prediction). In contrast, the present examples calculate residuals as the difference between a coding block and a reconstructed coding block (e.g., one that has undergone downsampling and subsequent upsampling, and has been corrected for encoding/decoding errors).

[0074] Furthermore, many comparative video coding methods attempt to provide temporal prediction and motion compensation by default for conventional video data. These "built-in" methods may not only fail when applied to sequential residual images, but may also consume unnecessary processing resources (e.g., resources spent while actually degrading the video encoding). They may also generate unnecessary bits that consume the assigned bit rate. How to address these problems is not apparent from conventional methods.

[0075] Certain examples described herein (e.g., as described in the "Temporal Aspects" section and elsewhere) provide an efficient way to predict temporal features within residual images. These examples use zero-motion-vector prediction to efficiently predict temporal aspects and movement within the residuals. This may be seen as predicting movement for relatively static features (e.g., applying a second temporal mode, inter-frame prediction, to residual features that persist over time) and then using a first temporal mode (e.g., intra-frame prediction) for everything else. Therefore, the examples described herein do not waste scarce resources and bit rate on predicting transient, uncorrelated temporal features in the residual "video".
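The two temporal modes described above can be illustrated with a minimal Python sketch. It assumes a lossless path, a per-coding-unit buffer holding the previous frame's residuals, and a simple sum-of-absolute-values cost for mode selection; the names `INTRA`, `INTER`, `temporal_encode`, `temporal_decode`, and `choose_mode` are hypothetical and do not come from the text.

```python
import numpy as np

INTRA, INTER = 0, 1  # hypothetical labels for the first and second temporal modes


def temporal_encode(residuals, buffer, mode):
    """Return the values to code for one coding unit and the updated buffer.

    With zero-motion prediction the predictor is the co-located buffer
    content: no motion search and no neighbouring coding units are used.
    """
    if mode == INTER:
        to_code = residuals - buffer   # code only the change since the last frame
    else:
        to_code = residuals.copy()     # code the residuals directly
    return to_code, residuals.copy()   # the buffer now holds the current residuals


def temporal_decode(coded, buffer, mode):
    """Invert temporal_encode for one coding unit."""
    recon = coded + buffer if mode == INTER else coded.copy()
    return recon, recon.copy()


def choose_mode(residuals, buffer):
    """Pick the cheaper mode, using the sum of absolute values as a crude cost."""
    inter_cost = np.abs(residuals - buffer).sum()
    intra_cost = np.abs(residuals).sum()
    return INTER if inter_cost < intra_cost else INTRA
```

A residual feature that persists unchanged between frames costs nothing to code in the inter mode, whereas a transient feature is cheaper to code directly in the intra mode. A practical encoder would keep the buffer in step with what the decoder reconstructs after quantization, which this sketch omits.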

[0076] Certain examples described herein allow legacy, existing, and future codecs to be enhanced. These examples can thus leverage the capabilities of those codecs as part of a base layer and provide improvements in the form of an enhancement layer.

[0077] Certain examples described herein exhibit low complexity. They enable a base codec to be enhanced with low computational complexity and / or in a manner that allows widespread parallelization. If downsampling is used before the base codec (e.g., for spatial scalability), the video signal at the original input resolution can be provided with reduced computational complexity compared to using the base codec at the original input resolution. This allows for the widespread adoption of ultra-high-resolution video. For example, many advantages can be achieved by combining the processing of a lower-resolution input video by a single-layer existing codec with a simple and small set of highly specialized tools that add detail to an upsampled version of the processed video.

[0078] Certain examples described herein implement several modular yet specialized video coding tools. The tools that form the enhancement layer (comprising two enhancement sub-layers at two distinct points) are designed for a specific type of data: residual data. As described herein, residual data is generated by comparing an original data signal with a reconstructed data signal. The reconstructed data signal is generated in a manner different from comparative video coding schemes. For example, the reconstructed data signal relates to a specific small spatial portion of an input video frame, namely a coding unit. The set of coding units of a frame can be processed in parallel because, unlike inter-frame and intra-frame prediction in comparative video coding techniques, the residual data is not generated using other coding units of the frame or coding units of other frames. Although temporal processing can be applied, it is applied at the coding unit level using previous data for the current coding unit. There are no interdependencies between coding units.

[0079] Certain video coding tools described herein are particularly well-suited for processing sparse residual data. Because it is generated differently, the residual data used herein has properties different from those of comparative video coding techniques. As illustrated in the figures, certain examples described herein provide an enhancement layer that processes one or two layers of residual data. The residual data is generated by taking the difference between a reference video frame (e.g., the source video) and a base-decoded version of the video (e.g., with or without upsampling, depending on the layer). The resulting residual data is sparse information, typically edges, points, and details, which is then processed using small transforms designed to handle sparse information. These small transforms can be scale-invariant, for example, having integer values in the range {-1, 1}.

[0080] The specific examples described in this paper allow for the efficient use of existing codecs. For instance, a base encoder is typically applied at a lower resolution (e.g., compared to the original input signal). A base decoder is then used to decode the output of the base encoder at this lower resolution, and the resulting decoded signal is used to generate decoded data. Because of this, the base codec operates on a smaller number of pixels, thus allowing the codec to operate at a higher quality level (e.g., a smaller quantization step size) and to use its own internal encoding tools more efficiently. It also consumes less power.

[0081] Certain examples described herein provide a flexible and adaptive encoding process. For instance, the configuration of the enhancement layer makes the overall encoding process resilient to the typical coding artifacts introduced by traditional block-based Discrete Cosine Transform (DCT) codecs that may be used in the base layer. The first enhancement sub-layer (Layer 1 residuals) corrects artifacts introduced by the base codec, while the second enhancement sub-layer (Layer 2 residuals) adds detail and sharpness to the corrected, upsampled version of the signal. The level of correction can be adjusted by controlling the bit rate, up to a version that provides maximum fidelity and lossless encoding. Generally, the worse the base reconstruction, the more the first enhancement sub-layer can contribute to correction (e.g., in the form of the encoded residual data output by that sub-layer). Conversely, the better the base reconstruction, the more bit rate can be allocated to the second enhancement sub-layer (Layer 2 residuals) to sharpen the video and add fine details.

[0082] Certain examples described herein provide enhancement that is agnostic to the base layer. For example, these examples can be used to enhance any base codec, from existing codecs such as MPEG-2, VP8, AVC, HEVC, VP9, and AV1 to future codecs under development such as EVC and VVC. This is possible because the enhancement layer operates on the decoded output of the base codec, and can therefore be used with any format, since it requires no information about how the base layer has been encoded and / or decoded.

[0083] As described below, certain examples described herein allow the enhancement layer encoding to be parallelized. For example, the enhancement layer does not perform any form of inter-block (i.e., between blocks) prediction. The image is processed by applying small (2×2 or 4×4) independent transform kernels to the layers of residual data. Because no prediction is made between blocks, each 2×2 or 4×4 block can be processed independently and in parallel. Furthermore, each layer is processed separately, allowing blocks and layers to be decoded in a massively parallel manner.
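The block independence described above can be illustrated with a short Python sketch using NumPy. The 2×2 kernel below is an assumption chosen for illustration (a 4×4 Hadamard matrix with entries in {-1, 1} acting on a flattened 2×2 block), and `to_blocks` and `transform_plane` are hypothetical helper names. Because no block reads data from another block, all blocks can be transformed in one vectorized step.

```python
import numpy as np

# Illustrative kernel with entries in {-1, 1}: a 4x4 Hadamard matrix acting on
# a flattened 2x2 block. This is an assumption, not a kernel quoted from the text.
KERNEL = np.array([[1,  1,  1,  1],
                   [1, -1,  1, -1],
                   [1,  1, -1, -1],
                   [1, -1, -1,  1]])


def to_blocks(plane):
    """Split an (H, W) residual plane into flattened 2x2 blocks: (H/2, W/2, 4)."""
    h, w = plane.shape
    blocks = plane.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3)
    return blocks.reshape(h // 2, w // 2, 4)


def transform_plane(plane):
    """Apply the kernel to every 2x2 block; each block is handled independently."""
    return to_blocks(plane) @ KERNEL.T
```

Changing a residual inside one block alters only that block's four coefficients, which is why the blocks could equally be distributed across threads or GPU lanes. The sketch assumes even plane dimensions.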

[0084] In the examples described herein, errors introduced by the encoding / decoding process and by the downsampling / upsampling process can be corrected separately to regenerate the original video at the decoder side. The encoded residuals and the encoded correction data are therefore smaller in size than the input video itself, and can thus be sent to the decoder more efficiently than the input video (and hence more efficiently than a comparative UHD stream under the SVC and SHVC approaches).

[0085] In a further comparison with SVC and SHVC, certain examples described herein send the encoded residuals and correction data to the decoder instead of the encoded UHD stream itself. In contrast, in SVC and SHVC, both the HD and UHD images are encoded as separate video streams and sent to the decoder. The examples described herein can allow a significant reduction in the overall bit rate used to send the encoded data to the decoder, for example such that, in certain cases, the total bandwidth used to send both the HD and UHD streams may be less than the bandwidth required by comparative standards to send only the UHD stream.

[0086] The examples described herein further allow coding units or blocks to be processed in parallel rather than sequentially. This is because the examples described herein do not apply intra-frame prediction (there is very limited spatial correlation between the coefficients of different blocks), whereas SVC / SHVC provide for intra-frame prediction. This is more efficient than the comparative approaches of SVC / SHVC, which involve processing blocks sequentially (e.g., because the UHD stream relies on predictions from individual pixels of the HD stream).

[0087] The enhancement coding described in the examples herein can be viewed as an enhancement codec that encodes and decodes streams of residual data. This differs from comparative SVC and SHVC implementations, where the encoder receives video data as input at each spatial resolution level and the decoder outputs video data at each spatial resolution level. Comparative SVC and SHVC can thus be viewed as a parallel implementation of a set of codecs, each with a video-in / video-out coding structure. The enhancement codec described herein, on the other hand, receives residual data and also outputs residual data at each spatial resolution level. For example, in SVC and SHVC the outputs at each spatial resolution level are not summed to generate the output video, as this would not make sense.

[0088] It should be noted that, in the examples, references to level 1 and level 2 are to be taken as arbitrary labels for the enhancement sub-layers. These may alternatively be referred to by different names (e.g., using a reversed numbering system in which level 1 and level 2 are labeled level 1 and level 0 respectively, with the "level 0" base layer below being level 2).

[0089] Definitions and Terms

[0090] In some instances described in this article, the following terms are used.

[0091] "Access Unit" - This refers to a set of Network Abstraction Layer (NAL) units that are associated with each other according to a specified classification rule. The NAL units may be consecutive in decoding order and contain an encoded image (i.e., frame) of a video (in some cases, exactly one).

[0092] "Base Layer" - This refers to the layer pertaining to the encoded base image, where "base" refers to the codec that receives the processed input video data. It may pertain to the portion of the bitstream that relates to the base.

[0093] "Bitstream" - This is a sequence of bits that can be supplied as a NAL unit stream or a byte stream. It can form a representation of encoded images and associated data, thereby forming one or more encoded video sequences (CVS).

[0094] A “block” is an MxN (M columns by N rows) array of samples, or an MxN array of transform coefficients. The terms “coding unit” or “coding block” are also used to refer to an MxN array of samples. These terms can be used to refer to a set of pixels (e.g., the values of pixels in a particular color channel), a set of residual features, a set of values representing processed residual features, and / or a set of encoded values. The term “coding unit” is sometimes used to refer to a coding block of luminance samples or chrominance samples in an image with three sample arrays, or a coding block of samples in a black-and-white image or an image encoded using three separate color planes and a syntax structure for encoding the samples.

[0095] A “byte” is an 8-bit sequence in which the leftmost and rightmost bits represent the most significant and least significant bits, respectively, when the sequence is written or read as a bit value.

[0096] "Byte-alignment" - A position in the bitstream is byte-aligned when the position is an integer multiple of 8 bits from the position of the first bit in the bitstream. A bit, byte, or syntax element is considered byte-aligned when the position at which it appears in the bitstream is byte-aligned.

[0097] "Byte stream" - This can be used to refer to the encapsulation of a NAL unit stream containing a start code prefix and NAL units.

[0098] "Chroma" - This is used as an adjective to specify a sample array or a single sample representing a color signal. It can be one of the two color difference signals related to the primary colors, e.g., as represented by the symbols Cb and Cr. It can also refer to a channel within a set of color channels that provides information about the coloring of an image. The term chroma is used instead of chrominance to avoid implying the use of linear light transfer characteristics often associated with the term chrominance.

[0099] "Cluster" - This refers to the entropy-encoded portion of data containing quantized transform coefficients belonging to a coefficient group.

[0100] "Encoded image" - This refers to the set of encoded units that represent an image.

[0101] "Encoded base image" - This can refer to the encoded representation of an image encoded using a base encoding process that is separate from (and often different from) the enhancement encoding process.

[0102] "Encoded representation" - A data element as represented in its encoded form.

[0103] "Coefficient Group (CG)" - refers to a syntactic structure containing encoded data associated with a specific set of transform coefficients (i.e., the set of transformed residual values).

[0104] "Component" or "color component" - This is used to refer to an array or a single sample from one of a set of color component arrays. The color components may comprise one luma and two chroma components and / or red, green, and blue (RGB) components. The color components may not have a one-to-one sampling frequency; for example, the components may constitute an image in a 4:2:0, 4:2:2, or 4:4:4 color format. Certain examples described herein may also refer to a single monochrome (e.g., luma or grayscale) image, where there is a single array, or a single sample of the array, that constitutes the monochrome image.
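As a small illustration of the color formats mentioned above, the following Python sketch derives the sample-array size of each plane from the format. It relies on the conventional meaning of the three formats (4:2:0 halves chroma in both directions, 4:2:2 halves it horizontally only, 4:4:4 keeps full resolution); the function name `plane_sizes` and the assumption of even frame dimensions are illustrative.

```python
# (horizontal divisor, vertical divisor) applied to the chroma planes
CHROMA_SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}


def plane_sizes(width, height, fmt):
    """Return the (width, height) of the luma plane and both chroma planes."""
    dx, dy = CHROMA_SUBSAMPLING[fmt]
    chroma = (width // dx, height // dy)
    return {"Y": (width, height), "Cb": chroma, "Cr": chroma}
```

For a 3840×2160 frame in 4:2:0, each chroma plane is 1920×1080, so the two chroma planes together carry half as many samples as the luma plane.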

[0105] "Data block" - This refers to a syntax structure containing bytes corresponding to a data type.

[0106] "Decoded base image" - This refers to a decoded image derived by decoding an encoded base image.

[0107] "Decoded Image" - A decoded image can be derived by decoding an encoded image. A decoded image can be a decoded frame or a decoded field. A decoded field can be a decoded top field or a decoded bottom field.

[0108] "Decoded Image Buffer (DPB)" - This refers to a buffer that holds decoded images for reference or output reordering.

[0109] "Decoder" - a device or apparatus that embodies the decoding process.

[0110] "Decoding order" - This can refer to the order in which syntax elements are processed during the decoding process.

[0111] "Decoding process" - This refers to the process of reading a bitstream and deriving a decoded image from it.

[0112] "Emulation prevention byte" - This is used in certain examples to refer to a byte equal to 0x03 that can be present within a NAL unit. Emulation prevention bytes can be used to ensure that no sequence of consecutive byte-aligned bytes in a NAL unit contains a start code prefix.
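A minimal Python sketch of how such a 0x03 byte can be inserted and removed is given below. It follows the convention used in AVC and HEVC, where this byte is called an emulation prevention byte: a 0x03 is inserted whenever two zero bytes are followed by a byte with value 0x03 or less. Whether the present bitstream uses exactly this rule is an assumption, and both function names are hypothetical.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 after any two zero bytes that precede a byte <= 0x03."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(payload: bytes) -> bytes:
    """Drop every 0x03 that directly follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)
```

After insertion, the payload cannot contain the three-byte sequence 0x000001, and removal restores the original bytes exactly.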

[0113] "Encoder" - a device or apparatus that embodies the encoding process.

[0114] "Encoding process" - this refers to the process of generating a bit stream (i.e., an encoded bit stream).

[0115] "Enhancement Layer" - This refers to the layer containing encoded enhancement data used to enhance the "base layer" (sometimes called the "base"). It may pertain to the portion of the bitstream that comprises planes of residual data. The singular term is used to refer to encoding and / or decoding processes that are distinct from the "base" encoding and / or decoding processes.

[0116] "Enhancement Sublayer" - In some instances, an enhancement layer comprises multiple sublayers. For example, the first and second levels described below are "enhancement sublayers" of the layer considered to be the enhancement layer.

[0117] "Field" - This term is used in certain examples to refer to an assembly of alternate rows of a frame. A frame consists of two fields: a top field and a bottom field. The term "field" can be used in the context of interlaced video frames.

[0118] "Video frame" - In some instances, a video frame may comprise a frame consisting of a luminance sample array in monochrome format or a luminance sample array and two corresponding chrominance sample arrays. The luminance and chrominance samples may be supplied in 4:2:0, 4:2:2, and 4:4:4 color formats (and others). A frame may consist of two fields: a top field and a bottom field (these terms may be used in the context of interlaced video, for example).

[0119] "Group of Pictures (GOP)" - This term refers to a collection of consecutive encoded base pictures, starting with an intra picture. The encoded base pictures provide a reference ordering for the enhancement data for those pictures.

[0120] "Instantaneous Decoding Refresh (IDR) picture" - This refers to a picture whose NAL unit contains a global configuration data block.

[0121] "Inverse transform" - this refers to part of the decoding process that transforms the set of transform coefficients into residuals.

[0122] "Layer" - This term is used in certain examples to refer to one of a set of syntax structures in a non-branching hierarchical relationship, such as when referring to the "base" and "enhancement" layers or to the two (sub-)"layers" of the enhancement layer.

[0123] "Luma" - This term is used as an adjective to specify a sample array or single sample representing the lightness or monochrome signal related to the primary colors. Luma samples can be represented by the symbol or subscript Y or L. The term "luma" is used instead of "luminance" to avoid implying the use of linear light transfer characteristics often associated with the term "luminance." Sometimes the symbol L is used instead of the symbol Y to avoid confusion with the symbol y, such as that used for vertical positions.

[0124] "Network Abstraction Layer (NAL) Unit (NALU)" - This is a syntax structure containing an indication of the type of data to follow, and bytes containing that data in the form of a raw byte sequence payload (RBSP - see definition below).

[0125] "Network Abstraction Layer (NAL) Unit Stream" - a sequence of NAL units.

[0126] "Output Order" - This is used in a specific instance to refer to the order in which the decoded images are output from the decoded image buffer (for decoded images to be output from the decoded image buffer).

[0127] "Segmentation" - This term is used in a specific context to refer to dividing a set into subsets. It can be used to refer to a situation where every element of the set is in exactly one of the subsets.

[0128] "Plane" - This term is used to refer to a set of data related to color components. For example, a plane may include the Y (lightness) or Cx (chroma) plane. In some cases, monochrome video may have only one color component, and therefore an image or frame may include one or more planes.

[0129] "Picture" is used as a common term for fields or frames. In some cases, the terms frame and picture are used interchangeably.

[0130] "Random access" - This is used in a specific context to refer to the action of starting the decoding process for a bitstream at a point other than the beginning of the stream.

[0131] "Raw Byte Sequence Payload (RBSP)" - An RBSP is a syntax structure containing an integer number of bytes encapsulated within a NAL unit. An RBSP can be empty, or can take the form of a string of data bits containing syntax elements, followed by an RBSP stop bit and then zero or more subsequent bits equal to 0. An RBSP may be interspersed with emulation prevention bytes as needed.

[0132] "Raw Byte Sequence Payload (RBSP) Stop Bit" - This is a bit that can be set to 1 and is included in the Raw Byte Sequence Payload (RBSP) after the data bit string. The position of the end of the data bit string within the RBSP can be identified by searching for the RBSP stop bit (which is the last non-zero bit in the RBSP) from the end of the RBSP.
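The backward search described above can be sketched in Python as follows. The function name `sodb_bit_length` is hypothetical; the sketch assumes bits are numbered from the most significant bit of each byte and that any emulation prevention bytes have already been removed.

```python
def sodb_bit_length(rbsp: bytes) -> int:
    """Return the number of data bits that precede the RBSP stop bit."""
    if not rbsp:
        return 0  # an empty RBSP carries no data bits
    for i in range(len(rbsp) - 1, -1, -1):   # scan from the end of the RBSP
        byte = rbsp[i]
        if byte:
            # Index (0 = least significant) of the lowest set bit, which is
            # the last non-zero bit of the RBSP, i.e. the stop bit.
            lowest = (byte & -byte).bit_length() - 1
            return i * 8 + (7 - lowest)
    raise ValueError("no RBSP stop bit found")
```

For the single byte 0b10110100, the stop bit is the sixth bit, so the data bit string is the five bits 10110.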

[0133] "Reserved" - This term can refer to a value of a syntax element that is not used in the bitstream described herein but is reserved for future use or extension. The term "reserved zero" can refer to a reserved bit value that is set to zero in the examples.

[0134] "Residual" - this term is defined in other instances below. It generally refers to the difference between a reconstructed form of a sample or data feature and a reference to the same sample or data feature.

[0135] "Residual plane" - This term is used to refer to a set of residuals organized in a planar structure, such as a color component plane. A residual plane may include multiple residuals (i.e., residual cells) that may be array elements having a certain value (e.g., an integer value).

[0136] "Run-length encoding" - This is a method for encoding sequences of values, where consecutive occurrences of the same value are represented as a single value along with its number of occurrences.
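A generic run-length coder matching this definition can be sketched in Python as below. The (value, count) pair representation and the names `rle_encode` and `rle_decode` are illustrative assumptions; the run-length scheme actually used for the enhancement data may differ (for example, by coding only runs of zeros).

```python
def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```

Sparse residual data contains long runs of zeros, which is the case where this representation is most compact.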

[0137] "Source" - This term is used in specific instances to describe video material or some of its properties before encoding.

[0138] "Start Code Prefix" - This refers to a unique three-byte sequence equal to 0x000001 embedded in the byte stream as a prefix to each NAL unit. The position of the start code prefix can be used by the decoder to identify the beginning of a new NAL unit and the end of a previous NAL unit. Emulation of start code prefixes within a NAL unit can be prevented by including emulation prevention bytes within the NAL unit.
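The use of the start code prefix to delimit NAL units can be illustrated with the following Python sketch. The function name `split_nal_units` is hypothetical, and the sketch makes a simplifying assumption: any padding zero bytes placed before a start code are left attached to the preceding unit rather than stripped.

```python
START_CODE = b"\x00\x00\x01"  # the three-byte start code prefix


def split_nal_units(byte_stream: bytes):
    """Return the bytes found between consecutive start code prefixes."""
    units = []
    pos = byte_stream.find(START_CODE)
    while pos != -1:
        start = pos + len(START_CODE)
        nxt = byte_stream.find(START_CODE, start)
        end = len(byte_stream) if nxt == -1 else nxt
        units.append(byte_stream[start:end])
        pos = nxt
    return units
```

This simple search is only reliable because the payload of each NAL unit is guaranteed not to contain the sequence 0x000001.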

[0139] "String of Data Bits (SODB)" - This term refers to a sequence of bits representing a certain number of syntax elements present in the raw byte sequence payload before the raw byte sequence payload stop bit. In an SODB, the leftmost bit is considered the first and most significant bit, and the rightmost bit is considered the last and least significant bit.

[0140] "Syntax element" - This term can be used to refer to the elements of data represented in a bitstream.

[0141] "Syntax structure" - This term can be used to refer to zero or more syntax elements that exist together in a bitstream in a specific order.

[0142] "Piece" - This term is used in specific instances to refer to a rectangular area of blocks or coding units within a particular image, such as a region of a frame containing multiple coding units, where the size of the coding unit is set based on the applied transform.

[0143] "Transform coefficient" (or simply "coefficient") - This term refers to the value produced when a transform is applied to a residual or to data derived from a residual (e.g., a processed residual). It can be a scalar quantity considered to be in the transform domain. In one case, an M×N coding unit can be flattened into a one-dimensional array of length M*N. In this case, the transform can comprise multiplying the one-dimensional array by an (M*N)×(M*N) transform matrix, and the output can comprise another (flattened) one-dimensional array of length M*N. In this output, each element can relate to a different "coefficient"; for example, for a 2×2 coding unit, there can be four different types of coefficient. Thus, the term "coefficient" can also be associated with a specific index in the inverse transform portion of the decoding process, for example, a specific index in the aforementioned one-dimensional array of transformed residuals.
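The flatten-and-multiply formulation above can be worked through for a 2×2 coding unit in Python. The transform matrix `T` is an assumption chosen for illustration (a 4×4 Hadamard matrix with entries in {-1, 1}); the functions `forward` and `inverse` are hypothetical names.

```python
import numpy as np

# Illustrative 4x4 transform matrix for a flattened 2x2 coding unit.
T = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]])


def forward(cu):
    """Flatten a 2x2 coding unit and multiply by T, giving four coefficients."""
    return T @ cu.reshape(-1)


def inverse(coeffs):
    """Recover the coding unit; T @ T equals 4 * I, so divide by 4."""
    return ((T @ coeffs) // 4).reshape(2, 2)
```

For the coding unit [[1, 2], [3, 4]], the flattened array is [1, 2, 3, 4] and the four coefficients are [10, -2, -4, 0]: one per row of `T`, each occupying a fixed index in the output array.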

[0144] "Video Coding Layer (VCL) NAL Unit" - This is a common term for NAL units that have a reserved value of NalUnitType and are classified as VCL NAL units in some instances.

[0145] In addition to the terms mentioned above, the following abbreviations are sometimes used:

[0146] CG - Coefficient Group; CPB - Coded Picture Buffer; CPBB - Base Coded Picture Buffer; CPBL - Enhancement Coded Picture Buffer; CU - Coding Unit; CVS - Coded Video Sequence; DPB - Decoded Picture Buffer; DPBB - Base Decoded Picture Buffer; DUT - Decoder Under Test; HBD - Hypothetical Base Decoder; HD - Hypothetical Demultiplexer; HRD - Hypothetical Reference Decoder; HSS - Hypothetical Stream Scheduler; I - Intra; IDR - Instantaneous Decoding Refresh; LSB - Least Significant Bit; MSB - Most Significant Bit; NAL - Network Abstraction Layer; P - Predictive; RBSP - Raw Byte Sequence Payload; RGB - Red, Green, Blue (also usable as GBR - Green, Blue, Red - i.e., reordered RGB); RLE - Run-Length Encoding; SEI - Supplemental Enhancement Information; SODB - String of Data Bits; SPS - Sequence Parameter Set; and VCL - Video Coding Layer.

[0147] Example Encoders and Decoders

[0148] First Example Encoder - General Architecture

[0149] Figure 1 shows a first example encoder 100. The components shown can also be implemented as steps in a corresponding encoding process.

[0150] In encoder 100, the input full-resolution video 102 is received and processed to generate various encoded streams. At downsampling component 104, the input video 102 is downsampled. The output of downsampling component 104 is received by a base codec comprising base encoder 112 and base decoder 114. A first encoded stream (encoded base stream) 116 is generated by feeding the downsampled form of the input video 102 to the base codec (e.g., AVC, HEVC, or any other codec). At first subtraction component 120, a first residual set is obtained by taking the difference between the reconstructed base codec video output by base decoder 114 and the downsampled form of the input video (i.e., as output by downsampling component 104). Level 1 encoding component 122 is applied to the first residual set output by first subtraction component 120 to generate a second encoded stream (encoded level 1 stream) 126.

[0151] In the example of Figure 1, the level 1 encoding component 122 operates in conjunction with an optional level 1 temporal buffer 124. This can be used to apply temporal processing, as described later below. Following the first level of encoding by the level 1 encoding component 122, the encoded level 1 stream 126 can be decoded by the level 1 decoding component 128. A deblocking filter 130 can be applied to the output of the level 1 decoding component 128. In Figure 1, the output of the deblocking filter 130 is added to the output of the base decoder 114 (i.e., to the reconstructed base codec video) by the summing component 132 to generate a corrected form of the reconstructed base codec video. The output of the summing component 132 is then upsampled by the upsampling component 134 to produce an upsampled form of the corrected reconstructed base codec video.

[0152] At the second subtraction component 136, the difference between the upsampled form of the corrected reconstructed base codec video (i.e., the output of the upsampling component 134) and the input video 102 is taken. This produces a second residual set. The second residual set output by the second subtraction component 136 is passed to the level 2 encoding component 142. The level 2 encoding component 142 generates a third encoded stream (encoded level 2 stream) 146 by encoding the second residual set. The level 2 encoding component 142 can operate together with the level 2 temporal buffer 144 to apply temporal processing. One or more of the level 1 encoding component 122 and the level 2 encoding component 142 can apply residual selection, as described below. This is shown as being controlled by the residual mode selection component 150. The residual mode selection component 150 can receive the input video 102 and apply residual mode selection based on an analysis of the input video 102. Similarly, the level 1 temporal buffer 124 and the level 2 temporal buffer 144 can operate under the control of the temporal selection component 152. The temporal selection component 152 can receive one or more of the input video 102 and the output of downsampling component 104 to select a temporal mode. This is explained in more detail in later examples.
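The data flow of the first example encoder can be summarized in a minimal Python sketch using NumPy. Everything beyond the data flow is an assumption made for illustration: the base codec is replaced by a toy quantizer, downsampling averages 2×2 blocks, upsampling is nearest-neighbour, and the level 1 residuals are treated as lossless, so the level 1 encode / decode round trip, deblocking, transform, quantization, entropy coding, and temporal processing are all omitted. All function names are hypothetical.

```python
import numpy as np


def downsample(frame):
    """2x downsampling by averaging each 2x2 block (stand-in for component 104)."""
    h, w = frame.shape
    return frame.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(frame):
    """2x nearest-neighbour upsampling (stand-in for component 134)."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)


def base_codec(frame, step=8.0):
    """Toy lossy base codec: encode and decode collapsed into coarse quantization."""
    return np.round(frame / step) * step


def encode(frame):
    """Return the reconstructed base plus the two residual sets."""
    down = downsample(frame)            # downsampling component 104
    base_recon = base_codec(down)       # base codec (encode, then decode)
    level1 = down - base_recon          # first subtraction component 120
    corrected = base_recon + level1     # summing component 132 (lossless level 1 assumed)
    up = upsample(corrected)            # upsampling component 134
    level2 = frame - up                 # second subtraction component 136
    return base_recon, level1, level2
```

With lossless residuals, upsampling the sum of the base reconstruction and the level 1 residuals and then adding the level 2 residuals returns the input frame exactly. In practice both residual sets would be transformed, quantized, and entropy coded before being sent.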

[0153] First Example Decoder - General Architecture

[0154] Figure 2 shows a first example decoder 200. The components shown can also be implemented as steps in a corresponding decoding process. Decoder 200 receives three encoded streams: encoded base stream 216, encoded level 1 stream 226, and encoded level 2 stream 246. These three encoded streams correspond to the three streams generated by the encoder 100 of Figure 1. In the example of Figure 2, the three encoded streams are received together with a header 256 containing further decoding information.

[0155] The encoded base stream 216 is decoded by a base decoder 218 corresponding to the base codec used in encoder 100 (e.g., corresponding to the base decoder 114 in Figure 1). At the first summing component 220, the output of the base decoder 218 is combined with a decoded first residual set obtained from the encoded level 1 stream 226. Specifically, the level 1 decoding component 228 receives the encoded level 1 stream 226 and decodes the stream to produce the decoded first residual set. The level 1 decoding component 228 may use the level 1 temporal buffer 230 to decode the encoded level 1 stream 226. In the example of Figure 2, the output of the level 1 decoding component 228 is passed to the deblocking filter 232. The level 1 decoding component 228 may be similar to the level 1 decoding component 128 used by the encoder 100 of Figure 1. The deblocking filter 232 may also be similar to the deblocking filter 130 used by the encoder 100. In Figure 2, the output of the deblocking filter 232 forms the decoded first residual set, which is combined with the output of the base decoder 218 by the first summing component 220. The output of the first summing component 220 can be considered a corrected level 1 reconstruction, in which the decoded first residual set corrects the output of the base decoder 218 at a first resolution.

[0156] At upsampling component 234, the combined video is upsampled. Upsampling component 234 may implement a modified upsampling as described with respect to a later example. The output of upsampling component 234 is further combined with a decoded second residual set obtained from the encoded level 2 stream 246. Specifically, level 2 decoding component 248 receives the encoded level 2 stream 246 and decodes the stream to produce the decoded second residual set. The decoded second residual set output by level 2 decoding component 248 is combined with the output of upsampling component 234 by summing component 258 to produce decoded video 260. Decoded video 260 comprises a decoded representation of the input video 102 of Figure 1. The level 2 decoding component 248 may also use the level 2 temporal buffer 250 to apply temporal processing. One or more of the level 1 temporal buffer 230 and the level 2 temporal buffer 250 may operate under the control of the temporal selection component 252. The temporal selection component 252 is shown receiving data from header 256. This data may include data for performing temporal processing at one or more of the level 1 temporal buffer 230 and the level 2 temporal buffer 250. The data may indicate the temporal mode applied by the temporal selection component 252, as described with reference to a later example.
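The reconstruction performed by the first example decoder can be sketched in Python as follows. The sketch assumes that the base stream and both residual sets have already been decoded into NumPy arrays, and it uses nearest-neighbour upsampling as a stand-in for upsampling component 234; deblocking and temporal processing are omitted, and the function names are hypothetical.

```python
import numpy as np


def upsample(frame):
    """2x nearest-neighbour upsampling (stand-in for component 234)."""
    return frame.repeat(2, axis=0).repeat(2, axis=1)


def decode(base_decoded, level1_residuals, level2_residuals):
    """Combine the decoded base with both decoded residual sets."""
    corrected = base_decoded + level1_residuals   # first summing component 220
    upsampled = upsample(corrected)               # upsampling component 234
    return upsampled + level2_residuals           # summing component 258
```

The first sum corrects the base reconstruction at the lower resolution, and the second sum adds detail at the full resolution, so the output has twice the width and height of the decoded base in this sketch.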

[0157] Second Example Encoder - Encoding Sub-Processing and Temporal Prediction

[0158] Figures 3A and 3B show different variations of a second example encoder 300, 360. The second example encoders 300, 360 may comprise implementations of the first example encoder 100 of Figure 1. In the examples of Figures 3A and 3B, the encoding steps for the streams are described in more detail to provide examples of how the steps can be performed. Figure 3A shows a first variation in which temporal prediction is provided only in the second level of the enhancement process (i.e., with respect to level 2 encoding). Figure 3B shows a second variation in which temporal prediction is performed in both levels of enhancement (i.e., level 1 and level 2).

[0159] In Figure 3A, the encoded base stream 316 is generated substantially as described above with respect to Figure 1. That is, input video 302 is downsampled (i.e., a downsampling operation is applied to input video 302 by downsampling component 304 to generate a downsampled input video). The downsampled video is then encoded using a base codec, specifically by the base encoder 312 of the base codec. The encoding operation applied to the downsampled input video by the base encoder 312 generates an encoded base stream 316. The base codec may also be referred to as a first codec, because it may differ from the second codec used to generate the enhancement streams (i.e., encoded level 1 stream 326 and encoded level 2 stream 346). Preferably, the first or base codec is a codec suitable for hardware decoding. As in Figure 1, the output of the base encoder 312 (i.e., the encoded base stream 316) is received by the base decoder 314 (e.g., which forms part of the base codec or provides decoding operations for the base codec), which outputs a decoded form of the encoded base stream. The operations performed by the base encoder 312 and the base decoder 314 may be referred to as a base layer or base level. The base layer or level may be implemented separately from the enhancement or second layer or level, with the enhancement layer or level instructing and/or controlling the base layer or level (e.g., the base encoder 312 and the base decoder 314).

[0160] As noted with respect to Figure 1, the enhancement layer or level may include two levels that produce two corresponding streams. In this context, a first enhancement level (described herein as "level 1") provides a set of correction data that can be combined with the decoded form of the base stream to generate a corrected picture. This first enhancement stream, introduced with respect to Figure 1, is shown in Figure 3 as the encoded level 1 stream 326.

[0161] To generate the encoded level 1 stream, the encoded base stream is decoded; that is, the output of the base decoder 314 provides a decoded base stream. As in Figure 1, a difference between the decoded base stream and the downsampled input video (i.e., the output of downsampling component 304) is then created at the first subtraction component 320. That is, a subtraction operation is applied to the downsampled input video and the decoded base stream to generate a first set of residuals. Here, the term "residual" is used in the manner known in the art, i.e., the error between a reference frame and a desired frame. The reference frame is the decoded base stream, and the desired frame is the downsampled input video. The residuals used in the first enhancement level can therefore be considered a correction video, because they "correct" the decoded base stream to the downsampled input video that was used in the base encoding operation.
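
As a rough illustration of how the first set of residuals arises, the following Python sketch downsamples a frame, passes it through a stand-in lossy "base codec", and subtracts the decoded result from the downsampled input. The averaging downsampler, the `lossy_base_codec` function, and the pixel values are hypothetical; a real base codec such as AVC or HEVC behaves very differently internally.

```python
def downsample_2x(frame):
    """Average each 2x2 block (assumed downsampler; the text does not fix one)."""
    h, w = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1]
              + frame[y + 1][x] + frame[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def lossy_base_codec(frame, step=8):
    """Stand-in for base encode followed by base decode:
    values are coarsely rounded down to multiples of `step`."""
    return [[(v // step) * step for v in row] for row in frame]


def subtract_frames(desired, reference):
    """Residual = desired frame minus reference frame, element by element."""
    return [[d - r for d, r in zip(rd, rr)] for rd, rr in zip(desired, reference)]


input_video = [
    [100, 104, 50, 54],
    [96, 100, 46, 50],
    [10, 14, 200, 204],
    [6, 10, 196, 200],
]

downsampled = downsample_2x(input_video)        # desired frame
decoded_base = lossy_base_codec(downsampled)    # reference frame
first_residuals = subtract_frames(downsampled, decoded_base)

# Adding the residuals back to the reference recovers the desired frame.
corrected = [[r + e for r, e in zip(rr, re)]
             for rr, re in zip(decoded_base, first_residuals)]
```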

[0162] Generally, as used herein, the term "residual" refers to the difference between the values of a reference array or reference frame and the actual array or frame of data. The array may be a one-dimensional or two-dimensional array representing a coding unit. For example, a coding unit may be a 2×2 or 4×4 set of residual values corresponding to a similarly sized region of an input video frame. It should be noted that this generalized example is agnostic to the encoding operations performed and to the nature of the input signal. References to "residual data" as used herein refer to data derived from a residual set, such as the residual set itself or the output of a set of data processing operations performed on the residual set. Throughout this specification, a residual set generally contains multiple residuals or residual elements, each corresponding to a signal element, i.e., an element of the signal or original data. The signal may be an image or video. In these examples, the residual set corresponds to an image or frame of video, where each residual is associated with a pixel of the signal, the pixel being the signal element.

[0163] It should be noted, however, that the "residuals" described herein differ significantly from the "residuals" generated in comparative techniques such as SVC and SHVC. In SVC, the term "residual" refers to the difference between a pixel block of a frame and a predicted pixel block for the frame, where the predicted pixel block is predicted using inter-frame prediction or intra-frame prediction. In contrast, the present examples involve calculating the residuals as the difference between a coding unit and a reconstructed coding unit, e.g., a coding unit that has undergone downsampling and subsequent upsampling and that has been corrected for encoding/decoding errors. In the described examples, the base codec (i.e., base encoder 312 and base decoder 314) may be a different codec from the enhancement codec, e.g., the base stream and the enhancement streams are generated by different sets of processing steps. In one case, the base encoder 312 may include an AVC or HEVC encoder and thus internally generate residual data that is used to generate the encoded base stream 316. However, the processes used by the AVC or HEVC encoder differ from those used to generate the encoded level 1 and level 2 streams 326, 346.

[0164] Returning to Figures 3A and 3B, the output of the first subtraction component 320, i.e., the difference corresponding to the first residual set, is then encoded to generate the encoded level 1 stream 326 (i.e., an encoding operation is applied to the first residual set to generate a first enhancement stream). In the example implementations of Figures 3A and 3B, the encoding operation includes several sub-operations, each of which is optional and preferred and provides particular benefits. Figures 3A and 3B show a series of components that implement these sub-operations, which may be considered as implementing the level 1 and level 2 encoding 122 and 142 shown in Figure 1. In Figures 3A and 3B, the sub-operations generally include a residual grading mode step, a transform step, a quantization step, and an entropy coding step.

[0165] For level 1 encoding, the level 1 residual selection or grading component 321 receives the output of the first subtraction component 320. The level 1 residual selection or grading component 321 is shown as being controlled by the residual mode grading or selection component 350 (e.g., in a manner similar to the configuration of Figure 1). In Figure 3A, grading is performed by the residual mode grading component 350 and applied by the level 1 selection component 321, which selects or filters the first residual set based on the grading performed by the residual mode grading component 350 (e.g., based on an analysis of the input video 102 or other data). In Figure 3B, this arrangement is reversed: general residual mode selection control is performed by the residual mode selection component 350, but grading is performed at each enhancement level (e.g., as opposed to grading based on the input video 102). In the example of Figure 3B, grading may be performed by the level 1 residual mode grading component 321 based on an analysis of the first residual set output by the first subtraction component 320.

[0166] Generally, the second instance encoders 300, 360 identify whether a residual grading mode is selected. This may be performed by the residual mode grading or selection component 350. If a residual grading mode is selected, this may be indicated by the residual mode grading or selection component 350 to the level 1 residual selection or grading component 321 so that a residual grading step is performed. A residual grading operation may be performed on the first residual set to generate a graded residual set. The graded residual set may be filtered so that not all residuals are encoded into the first enhancement stream 326 (or correction stream). Residual selection may include selecting a subset of the received residuals to pass through for further encoding. Although the present examples describe a "grading" operation, this can be regarded as a general filtering operation performed on the first residual set (e.g., the output of the first subtraction component 320); that is, the level 1 residual selection or grading component 321 is one implementation of a general filtering component that may modify the first residual set. Filtering can be viewed as setting certain residual values to zero, so that those input residual values are filtered out and do not form part of the encoded level 1 stream 326.
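
Filtering as "setting certain residual values to zero" can be sketched as follows. The grading criterion used here (keep the `keep` largest residuals by absolute value) is purely an assumption for illustration; the text leaves the grading criterion open, and the function name is hypothetical.

```python
def filter_residuals(residuals, keep):
    """Grade residuals by absolute magnitude and zero all but the top `keep`.
    Assumption: magnitude is the grading criterion; ties break by position."""
    graded = [(abs(v), y, x)
              for y, row in enumerate(residuals)
              for x, v in enumerate(row)]
    graded.sort(key=lambda t: (-t[0], t[1], t[2]))
    kept = {(y, x) for _, y, x in graded[:keep]}
    return [[v if (y, x) in kept else 0 for x, v in enumerate(row)]
            for y, row in enumerate(residuals)]


first_residuals = [[5, -1], [0, -7]]
filtered = filter_residuals(first_residuals, keep=2)
```

The small residual of -1 is dropped, so it never reaches the transform, quantization, or entropy coding steps.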

[0167] In Figures 3A and 3B, the output of the level 1 residual selection or grading component 321 is then received by the level 1 transform component 322. The level 1 transform component 322 applies a transform to the first residual set, or to the graded or filtered first residual set (depending on whether a grading mode is selected), to generate a transformed residual set. The level 1 quantization component 323 is then applied to the output of the level 1 transform component 322 (i.e., the transformed residual set) to generate a quantized residual set. The level 1 entropy coding component 325 then applies an entropy coding operation to the quantized residual set (or to data derived from this set) to generate the first enhancement level stream, i.e., the encoded level 1 stream 326. Thus, in the level 1 layer, the first residual set is transformed, quantized, and entropy coded to produce the encoded level 1 stream 326. Further details of possible implementations of the transform, quantization, and entropy coding are described in later examples. Preferably, the entropy coding operation is a Huffman coding operation, a run-length coding operation, or both. Optionally, a control operation may be applied to the quantized residual set to correct for the effects of the grading operation. This may be applied by a level 1 residual mode control component 324, which may operate under the control of the residual mode grading or selection component 350.
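
The transform, quantization, and run-length steps can be sketched for a single 2×2 coding unit. The unnormalized Hadamard-style kernel producing A, H, V, and D coefficients, the truncating quantizer, and the (value, run) output format are assumptions chosen for brevity; the actual kernels, quantizer, and entropy coder are described in later examples.

```python
def transform_2x2(block):
    """Assumed 2x2 Hadamard-style transform: returns [A, H, V, D]
    (average, horizontal, vertical, diagonal), without normalization."""
    (a, b), (c, d) = block
    return [a + b + c + d, a - b + c - d, a + b - c - d, a - b - c + d]


def quantize(coeffs, step):
    """Divide by the step and truncate toward zero (small values become 0)."""
    return [int(c / step) for c in coeffs]


def run_length_encode(symbols):
    """Collapse repeated symbols into (value, run_length) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [tuple(r) for r in runs]


residual_block = [[3, 1], [2, 0]]
coeffs = transform_2x2(residual_block)
quantized = quantize(coeffs, step=4)
runs = run_length_encode(quantized)
```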

[0168] As described above, the enhanced stream may include a first enhancement level and a second enhancement level (i.e., levels 1 and 2). The first enhancement level can be considered as a corrected stream. The second enhancement level can be considered as another enhancement level that converts the corrected stream into the original input video. Another or second enhancement level is created by encoding another or second set of residuals, which is the difference between the upsampled form of the reconstructed level 1 video output by the summing component 332 and the input video 302. Upsampling is performed by the upsampling component 334. The second set of residuals is generated by subtraction applied by the second subtraction component 336, which takes the input video 302 and the output of the upsampling component 334 as inputs.

[0169] In Figures 3A and 3B, the first residual set is encoded by a level 1 encoding process. In the examples of Figures 3A and 3B, this process includes a level 1 transform component 322 and a level 1 quantization component 323. Before upsampling, the encoded first residual set is decoded using an inverse quantization component 327 and an inverse transform component 328. These components simulate the (level 1) decoding components that may be implemented at the decoder. Thus, the quantized (or controlled) residual set derived by applying the level 1 transform component 322 and the level 1 quantization component 323 is inverse quantized and inverse transformed, and a deblocking filter 330 is then applied, to generate the decoded first residual set (i.e., inverse quantization is applied to the quantized first residual set to generate a dequantized first residual set; an inverse transform is applied to the dequantized first residual set to generate a detransformed first residual set; and deblocking filtering is applied to the detransformed first residual set to generate the decoded first residual set). The deblocking filter 330 is optional, depending on the transform applied, and may include applying a weighted mask to each block of the detransformed first residual set.
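
The reason the encoder simulates the decoder can be seen in a small sketch: after quantization, the residuals the decoder will actually recover differ from the originals, and it is this decoded version that must feed the upsampler. The 2×2 Hadamard-style kernel, the uniform dequantizer, and all names below are assumptions for illustration; deblocking is omitted.

```python
def dequantize(quantized, step):
    """Assumed uniform inverse quantization: multiply by the step size."""
    return [q * step for q in quantized]


def inverse_transform_2x2(coeffs):
    """Inverse of an assumed unnormalized 2x2 Hadamard-style transform
    with coefficients ordered [A, H, V, D]."""
    a, h, v, d = coeffs
    return [[(a + h + v + d) / 4, (a - h + v - d) / 4],
            [(a + h - v - d) / 4, (a - h - v + d) / 4]]


original_block = [[3, 1], [2, 0]]   # first residuals for one coding unit
quantized = [1, 1, 0, 0]            # result of transform + quantize, step 4

decoded_block = inverse_transform_2x2(dequantize(quantized, step=4))

# Quantization error that the level 2 residuals can later absorb.
error = [[o - r for o, r in zip(ro, rr)]
         for ro, rr in zip(original_block, decoded_block)]
```

Because the encoder works from `decoded_block` rather than `original_block`, the quantization error is folded into the second residual set instead of accumulating as drift at the decoder.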

[0170] At summing component 332, the decoded base stream, as output by base decoder 314, is combined with the decoded first residual set, as received from deblocking filter 330 (i.e., a summation operation is performed on the decoded base stream and the decoded first residual set to generate a recreated first stream). As shown in Figures 3A and 3B, the combination is then upsampled by upsampling component 334 (i.e., an upsampling operation is applied to the recreated first stream to generate an upsampled recreated stream). The upsampled stream is then compared with the input video at the second subtraction component 336, which creates a second residual set (i.e., a difference operation is applied to the upsampled recreated stream to generate another residual set). The second residual set is then encoded into the encoded level 2 enhancement stream 346 (i.e., an encoding operation is applied to the other or second residual set to generate another or second encoded enhancement stream).

[0171] Similar to the encoded level 1 stream, the encoding applied to the second (level 2) residual set may include several operations. Figure 3A shows a level 2 residual selection component 340, a level 2 transform component 341, a level 2 quantization component 343, and a level 2 entropy coding component 344. Figure 3B shows a similar set of components, but in this variant the level 2 residual selection component 340 is implemented as a level 2 residual grading component 340, which is controlled by the residual mode selection component 350. As discussed above, grading and selection may be performed based on one or more of the input video 102 and the individual first and second residual sets. In Figure 3A, a level 2 temporal buffer 345 is additionally provided, the contents of which are subtracted from the output of the level 2 transform component 341 by a third subtraction component 342. In other examples, the third subtraction component 342 may be located elsewhere, including after the level 2 quantization component 343. The level 2 encoding shown in Figures 3A and 3B thus includes steps of grading, temporal prediction, transform, quantization, and entropy coding. Specifically, the second instance encoders 300, 360 may identify whether a residual grading mode is selected. This may be performed by one or more of the residual grading or selection component 350 and the individual level 2 selection and grading components 340. If a residual grading or filtering mode is selected, the residual grading step may be performed by one or more of the residual grading or selection component 350 and the individual level 2 selection and grading components 340 (i.e., a residual grading operation may be performed on the second residual set to generate a second graded residual set). The second graded residual set may be filtered so that not all residuals are encoded into the second enhancement stream (i.e., the encoded level 2 stream 346). The second residual set or the second graded residual set is then transformed by the level 2 transform component 341 (i.e., a transform operation is performed on the second graded residual set to generate a second transformed residual set). As shown by the connection between the output of summing component 332 and the level 2 transform component 341, the transform operation may use predicted coefficients or a predicted average derived from the recreated first stream before upsampling. This type of predicted average calculation is described with reference to other examples, and further information can be found elsewhere in this document. In level 2, the transformed residuals (whether temporally predicted or not) are then quantized and entropy coded in the manner described elsewhere (i.e., quantization is applied to the transformed residual set to generate a second quantized residual set, and entropy coding is applied to the quantized second residual set to generate the second level enhancement stream).

[0172] Figure 3A shows a variant of the second instance encoder 300 in which temporal prediction is performed as part of the level 2 encoding process. Temporal prediction is performed using a temporal selection component 352 and the level 2 temporal buffer 345. The temporal selection component 352 determines the temporal processing mode, as described in more detail below, and controls the use of the level 2 temporal buffer 345 accordingly. For example, if no temporal processing is to be performed, the temporal selection component 352 may instruct that the contents of the level 2 temporal buffer 345 be set to 0.

[0173] Figure 3B shows a variant of the second instance encoder 360 in which temporal prediction is performed as part of both the level 1 and level 2 encoding processes. In Figure 3B, a level 1 temporal buffer 361 is provided in addition to the level 2 temporal buffer 345. Although not illustrated, other variants in which temporal processing is performed at level 1 but not at level 2 are also possible.

[0174] When temporal prediction is selected, the second instance encoders 300, 360 may further modify the coefficients (i.e., the transformed residuals output by the transform component) by subtracting a corresponding set of coefficients derived from the appropriate temporal buffer. The corresponding set of coefficients may include a set of coefficients for the same spatial region (e.g., the same coding unit as located within the frame) derived from a previous frame (e.g., coefficients for the same region of the previous frame). The subtraction may be applied by subtraction components such as the third subtraction components 342 and 362 (for levels 2 and 1, respectively). This temporal prediction step is described further with respect to a later example. In summary, when temporal prediction is applied, the encoded coefficients correspond to the difference between a frame and another frame of the stream. The other frame may be an earlier or later frame (or a block within a frame) of the stream. Therefore, instead of encoding the residuals between the upsampled recreated stream and the input video, the encoding process may encode the difference between a transformed frame in the stream and the transformed residuals of the frame. Entropy can thus be reduced. Temporal prediction may be applied selectively to groups of coding units (referred to herein as "tiles") based on control information, and the application of temporal prediction at the decoder may be enabled by sending additional control information along with the encoded streams (e.g., within the headers or as another surface, as described in later examples).

[0175] As shown in Figures 3A and 3B, when temporal prediction is active, each transformed coefficient may be:

[0176] Δ = F_current − F_buffer

[0177] The temporal buffer may store data associated with a previous frame. Temporal prediction may be performed for a single color plane or for multiple color planes. Generally, the subtraction may be applied to a "frame" of video as a feature-by-feature (element-wise) subtraction, where the features of the frame represent transformed coefficients and the transform is applied with respect to a specific n-by-n coding unit size (e.g., 2×2 or 4×4). The difference generated by the temporal prediction (e.g., the difference Δ above) may be stored in the buffer for use in subsequent frames. In practice, therefore, the residual generated by the temporal prediction is a coefficient residual relative to the buffer. Although Figures 3A and 3B show temporal prediction being performed after the transform operation, it may also be performed after the quantization operation. This avoids the need to apply the level 2 inverse quantization component 372 and/or the level 1 inverse quantization component 364.
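
The subtraction Δ = F_current − F_buffer can be sketched for one coding unit's coefficients as follows. The `TemporalBuffer` class, its update rule (the buffer accumulates Δ so that it tracks the coefficients a decoder would rebuild, assuming no quantization loss), and the refresh-to-zero behaviour are assumptions for illustration; the actual buffer management is described in later examples.

```python
class TemporalBuffer:
    """Hypothetical per-coding-unit coefficient buffer."""

    def __init__(self, size):
        self.values = [0] * size

    def predict(self, current, use_temporal=True):
        """Return delta = current - buffer, then update the buffer.
        With use_temporal=False the buffer is first reset to zero,
        so the coefficients are sent unchanged."""
        if not use_temporal:
            self.values = [0] * len(self.values)
        delta = [c - b for c, b in zip(current, self.values)]
        # Assumed update: accumulate the delta so the buffer mirrors
        # what a decoder would hold after adding the same delta.
        self.values = [b + d for b, d in zip(self.values, delta)]
        return delta


buffer = TemporalBuffer(size=4)
delta_1 = buffer.predict([6, 4, 2, 0])                       # first frame
delta_2 = buffer.predict([6, 4, 3, 0])                       # nearly static
delta_3 = buffer.predict([6, 4, 3, 0], use_temporal=False)   # refresh
```

For the nearly static second frame the delta is mostly zero, which is what makes the subsequent entropy coding cheaper.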

[0178] Therefore, as shown in Figures 3A and 3B and described above, after the encoding process is performed, the output of the second instance encoders 300, 360 is an encoded base stream 316 and one or more enhancement streams, the enhancement streams preferably including an encoded level 1 stream 326 for a first enhancement level and an encoded level 2 stream 346 for another or second enhancement level.

[0179] Third instance encoder and second instance decoder - predicted residuals

[0180] Figure 4 shows a third instance encoder 400, which is a variant of the first instance encoder 100 of Figure 1. Corresponding reference numerals are used to refer to corresponding features from Figure 1 (i.e., feature 1xx of Figure 1 corresponds to feature 4xx of Figure 4). The example of Figure 4 shows in more detail how predicted residuals (e.g., a predicted average) may be applied as part of the upsampling operation. Furthermore, in Figure 4, the deblocking filter 130 is replaced by a more general configurable filter 430.

[0181] In Figure 4, the predicted residual component 460 receives an input at level 1 spatial resolution in the form of the output of the first summing component 432. This input includes at least a portion of the reconstructed video at level 1 output by the first summing component 432. The predicted residual component 460 also receives an input at level 2 spatial resolution from the upsampling component 434. The inputs may include a lower-resolution feature that is used to generate multiple higher-resolution features (e.g., a pixel that is upsampled to generate 4 pixels in a 2×2 block). The predicted residual component 460 is configured to calculate a modifier for the output of the upsampling component 434, which is added to that output by the second summing component 462. The modifier may be calculated so as to apply the predicted average processing described in detail in later examples. In particular, where an average delta is determined (e.g., a difference between a computed average coefficient and an average predicted from the lower level), the components of Figure 4 may be used to restore the average component outside of the level 2 encoding process 442. The output of the second summing component 462 is then used as the upsampled input to the second subtraction component 436.

[0182] Figure 5A shows how the predicted residual operation may be applied at a second instance decoder 500. Similar to Figure 4, the second instance decoder 500 may be considered a variant of the first instance decoder 200 of Figure 2. Corresponding reference numerals are used to refer to corresponding features from Figure 2 (i.e., feature 2xx of Figure 2 corresponds to feature 5xx of Figure 5A). The example of Figure 5A shows in more detail how predicted residuals (e.g., a predicted average) may be applied at the decoder as part of the upsampling operation. Furthermore, in Figure 5A, the deblocking filter 232 is replaced by a more general configurable filter 532. It should be noted that predicted residual processing may be applied asymmetrically at the encoder and decoder; e.g., the encoder need not be configured according to Figure 4 to allow decoding as shown in Figure 5A. For example, the encoder may apply a predicted average calculation as described in U.S. Patent 9,509,990, which is incorporated herein by reference.

[0183] The configuration of the second instance decoder 500 is similar to that of the third instance encoder 400 of Figure 4. A predicted residual component 564 receives a first input representing a level 1 frame from the first summing component 530, and a second input representing an upsampled version of the level 1 frame from the upsampling component 534. The inputs may be received as a lower-level feature and a corresponding set of higher-level features. The predicted residual component 564 uses the inputs to calculate a modifier for the output of the upsampling component 534, which is added to that output by the second summing component 562. This modifier may correct for the use of a predicted average, e.g., as described in U.S. Patent 9,509,990 or as calculated by the third instance encoder 400. The modified upsampled output is then received by a third summing component 558, which performs the level 2 correction or enhancement as in the previous examples.

[0184] The use of one or more of the prediction residual components 460 and 564 can implement other instances of "modified upsampling," wherein the "modification" is performed by modifiers calculated by the components and applied by the corresponding summing components. These instances can provide faster computation of the prediction average because the modifiers are added in the reconstructed video space (e.g., the modifiers are applied to the pixels of the reconstructed video rather than to the A, H, V, and D coefficient spaces of the transformed residuals) rather than requiring a conversion to the coefficient space representing the transformed residuals.
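
One way such a pixel-space modifier could work is sketched below: for each lower-resolution pixel, the modifier is the difference between that pixel and the mean of the 2×2 block it was upsampled into, so that after the addition the block average matches the lower-resolution value. This specific rule, the function name, and the sample values are assumptions for illustration; the precise predicted average calculation is given in later examples and in the referenced patent.

```python
def apply_predicted_average(lower, upsampled):
    """Add a per-block modifier to the upsampled frame in pixel space.
    Assumed rule: modifier = lower pixel - mean of its 2x2 upsampled block."""
    out = [list(row) for row in upsampled]
    for y, row in enumerate(lower):
        for x, value in enumerate(row):
            block = [upsampled[2 * y + dy][2 * x + dx]
                     for dy in (0, 1) for dx in (0, 1)]
            modifier = value - sum(block) / 4
            for dy in (0, 1):
                for dx in (0, 1):
                    out[2 * y + dy][2 * x + dx] += modifier
    return out


lower = [[10]]                      # one level 1 pixel
upsampled = [[8, 10], [12, 14]]     # hypothetical filter output, mean 11
modified = apply_predicted_average(lower, upsampled)
block_mean = sum(sum(row) for row in modified) / 4
```

No transform into the A, H, V, D coefficient space is needed: the whole correction is a handful of additions per block on reconstructed pixels.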

[0185] Third instance decoder - sub-operations and temporal prediction

[0186] Figures 5B and 5C show corresponding variants of the third instance decoders 580 and 590. These variants may be implemented so as to correspond to the second instance encoders 300 and 360 shown in Figures 3A and 3B. The third instance decoders 580 and 590 may be considered implementations of one or more of the first and second instance encoders 200, 400 from Figures 2 and 4. As before, similar reference numerals are used where possible to refer to features that correspond to features in earlier examples.

[0187] Figures 5B and 5C show example implementations of the decoding process briefly described above and shown in Figure 2. As will be clearly seen, the decoding steps and components are shown in more detail to provide an example of how decoding may be performed at each level. As with Figures 3A and 3B, Figure 5B shows a variant in which temporal prediction is used only for the second level (i.e., level 2), and Figure 5C shows a variant in which temporal prediction is used for both levels (i.e., levels 1 and 2). As before, other variants are envisaged (e.g., level 1 but not level 2), in which the configuration may be controlled using signalling information.

[0188] As shown in the examples of Figures 5A and 5C, during decoding the decoder may parse the headers 556 and configure itself based on those headers. The headers may include one or more of global configuration data, picture (i.e., frame) configuration data, and mixed data blocks (e.g., relating to features or groups of features within a picture). To recreate the input video (e.g., input video 102, 302, or 402 in the previous examples), an instance decoder, such as the third instance decoder, may decode each of the encoded base stream 516, the first enhancement or encoded level 1 stream 526, and the second enhancement or encoded level 2 stream 546. The frames of the streams may be synchronized and then combined to derive the decoded video 560.

[0189] As shown in Figure 5B, the level 1 decoding component 528 may include a level 1 entropy decoding component 571, a level 1 inverse quantization component 572, and a level 1 inverse transform component 573. These may be decoding counterparts of the corresponding level 1 encoding components 325, 323, and 322 of Figures 3A and 3B. The level 2 decoding component 548 may include a level 2 entropy decoding component 581, a level 2 inverse quantization component 582, and a level 2 inverse transform component 583. These may be decoding counterparts of the corresponding level 2 encoding components 344, 343, and 341 of Figures 3A and 3B. In each decoding process, the enhancement stream may undergo entropy decoding, inverse quantization, and inverse transform steps, using the aforementioned components or operations, to recreate a residual set.

[0190] Specifically, in Figure 5B, the encoded base stream 516 is decoded by a base decoder 518, which is implemented as part of a base codec 584. It should be noted that the base stream and the enhancement streams are typically encoded and decoded using different codecs, with the enhancement codec operating on residuals (i.e., implementing the level 1 and level 2 encoding and decoding components) and the base codec operating on video at level 1 resolution. Video at level 1 resolution may be at a lower resolution than that at which the base codec typically operates (e.g., a signal downsampled in two dimensions may be a quarter of the size), allowing the base codec to operate at high speed. This also highlights the difference from SVC, in which a common codec (AVC) is applied to each layer and operates on video data rather than residual data. Even in SHVC, all spatial layers are configured to operate in a video-in/video-out manner, where each video output represents a different playable video. In the present examples, the enhancement streams do not represent playable video in the conventional sense: the outputs of the level 1 and level 2 decoding components 528 and 548 (e.g., as received by the first summing component 530 and the second summing component 558) are "residual video", i.e., consecutive frames of residuals for multiple color planes rather than the color planes themselves. This allows for much larger bit rate savings than SVC and SHVC, because the enhancement streams will often be 0 (since a quantized difference is often 0), and 0 values can be compressed efficiently using run-length coding. It should also be noted that in the present examples, in contrast to standard intra-frame processing in SVC and SHVC, each coding unit of n by n features (e.g., a 2×2 or 4×4 pixel block that may be flattened into a one-dimensional array) does not depend on predictions involving other coding units within the frame. The encoding and decoding components of the enhancement streams can therefore be applied in parallel to different coding units (e.g., different regions of a frame can be processed efficiently in parallel) because, unlike in SVC and SHVC, there is no need to wait for the decoded result of another coding unit in order to compute a subsequent coding unit. This means that the enhancement codec can be implemented extremely efficiently on parallel processors, such as the shared graphics processing units found in computing devices (including mobile computing devices). This parallelism is not possible with the high-complexity processing of SVC and SHVC.
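
The independence of coding units can be illustrated with a short Python sketch in which each unit is decoded by a function that needs no data from any other unit, so the units can be handed to a pool of workers in any order. The 2×2 Hadamard-style inverse transform and the sample coefficients are assumptions; a thread pool is used only to show the structure and is not a claim about how a GPU implementation is organized.

```python
from concurrent.futures import ThreadPoolExecutor


def decode_unit(coeffs):
    """Inverse of an assumed unnormalized 2x2 Hadamard-style transform.
    Uses only its own [A, H, V, D] coefficients: no neighbour data needed."""
    a, h, v, d = coeffs
    return [[(a + h + v + d) / 4, (a - h + v - d) / 4],
            [(a + h - v - d) / 4, (a - h - v + d) / 4]]


# Hypothetical dequantized coefficients for three coding units;
# the all-zero unit is typical of a sparse enhancement stream.
units = [[4, 4, 0, 0], [0, 0, 0, 0], [8, 0, 0, 0]]

sequential = [decode_unit(u) for u in units]

with ThreadPoolExecutor(max_workers=3) as pool:
    parallel = list(pool.map(decode_unit, units))
```

Because `decode_unit` has no shared state, the parallel result is identical to the sequential one regardless of scheduling.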

[0191] Returning to Figure 5B, as in the previous examples, an optional filter, such as deblocking filter 532, may be applied to the output of the level 1 decoding component 528 to remove blocking or other artifacts, and the output of the filter is received by the first summing component 530, where it is added to the output of the base codec (i.e., the decoded base stream). It should be noted that the output of the base codec may resemble a low-resolution video as decoded by a conventional codec, whereas the level 1 decoding output is a (filtered) first residual set. This differs from SVC and SHVC, where this form of summation is meaningless because each layer outputs a complete video at its corresponding spatial resolution.

[0192] As in Figure 2, the modified upsampling component 587 receives the corrected reconstruction of the video at level 1 output by the first summing component 530 and upsamples it to generate an upsampled reconstruction. The modified upsampling component 587 may apply the modified upsampling shown in Figure 4. In other examples, such as when a predicted average is not used or is not applied in the manner described in U.S. Patent 9,509,990, the upsampling may not be modified.

[0193] In Figure 5B, temporal prediction is applied during level 2 decoding. In the example of Figure 5B, temporal prediction is controlled by temporal prediction component 585. In this variant, control information for temporal prediction is extracted from the encoded level 2 stream 546, as indicated by the arrows from the stream to the temporal prediction component 585. In other embodiments, such as those shown in Figures 5A and 5C, control information for temporal prediction may be transmitted separately from the encoded level 2 stream 546, for example in header 556. The temporal prediction component 585 controls the use of the level 2 temporal buffer 550; for example, it can determine the temporal mode and control temporal refresh, as described with reference to later examples. The contents of the temporal buffer 550 can be updated based on data from previous frames of residuals. When the temporal buffer 550 is applied, the contents of the buffer are added to the second residual set. In Figure 5B, the contents of the temporal buffer 550 are added to the output of the level 2 decoding component 548 at the third summing component 594. In other instances, the contents of the temporal buffer can represent any intermediate decoded data set, and therefore the third summing component 586 can be moved so that the contents of the buffer are applied at the appropriate stage (e.g., if the temporal buffer is applied at the dequantized coefficient stage, the third summing component 586 can be located before the inverse transform component 583). The temporally corrected second residual set is then combined with the output of the upsampling component 587 by the second summing component 558 to generate the decoded video 560. The decoded video is at the level 2 spatial resolution, which can be higher than the level 1 spatial resolution. The second residual set applies a correction to the (viewable) upsampled reconstructed video, where the correction adds back fine detail and improves the sharpness of lines and features.
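The order of the summing and upsampling steps described for Figure 5B can be sketched as follows. This is only an illustrative sketch under stated assumptions: a single plane held as nested Python lists, a plain 2× sample-repeating upsampler standing in for upsampling component 587, and hypothetical function names (`add_planes`, `upsample_2x`, `reconstruct`) that do not come from the source.

```python
def add_planes(a, b):
    """Element-wise sum of two equally sized 2-D planes."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def upsample_2x(plane):
    """Stand-in upsampler (assumption): repeat every sample into a 2x2 block."""
    out = []
    for row in plane:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def reconstruct(base, l1_residuals, l2_residuals, temporal_buffer=None):
    """Combine a decoded base plane with level 1 and level 2 residuals."""
    # First summing step: decoded base + level 1 residuals.
    corrected = add_planes(base, l1_residuals)
    # Upsampling step: level 1 resolution -> level 2 resolution.
    upsampled = upsample_2x(corrected)
    # Optional temporal step: add the buffer contents to the level 2 residuals.
    if temporal_buffer is not None:
        l2_residuals = add_planes(l2_residuals, temporal_buffer)
    # Second summing step: upsampled picture + (corrected) level 2 residuals.
    decoded = add_planes(upsampled, l2_residuals)
    # The corrected residuals are returned so a caller can store them
    # as the temporal buffer for the next frame.
    return decoded, l2_residuals


decoded, stored = reconstruct(
    base=[[10, 20]],
    l1_residuals=[[1, -1]],
    l2_residuals=[[0, 1, 0, 0], [0, 0, 0, -1]],
)
print(decoded)  # [[11, 12, 19, 19], [11, 11, 19, 18]]
```

Because every step is an element-wise addition or a local copy, no sample depends on the decoded result of a neighbouring block, which is the property the description relies on for parallel processing.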

[0194] Figure 5C shows a third example decoder variant 590. In this case, the temporal prediction component 585 receives temporal prediction control data from the header 556. The temporal prediction component 585 controls both level 1 and level 2 temporal prediction, but in other instances separate control components may be provided for the two levels as needed. Figure 5C shows how the reconstructed second residual set, which is input to the second summing component 558, can be fed back to be stored in the level 2 temporal buffer for the next frame (this feedback is omitted from Figure 5B for clarity). A level 1 temporal buffer 591, operating in a similar manner to the level 2 temporal buffer 550 described above, is also shown, and the feedback loop for the buffer is illustrated in this figure. The contents of the level 1 temporal buffer 591 are added to the level 1 residual processing pipeline via a fourth summing component 595. Again, depending on where the temporal prediction is applied, the location of this fourth summing component 595 can vary along the level 1 residual processing pipeline (e.g., if the temporal prediction is applied in the transformed coefficient space, it can be located before the level 1 inverse transform component 573).

[0195] Figure 5C shows two ways in which temporal control information can be signaled to the decoder. The first is via header 556, as described above. The second, which can be used as an alternative or additional signaling path, is via data encoded within the residuals themselves. In the example of Figure 5C, this data 592 can be encoded as HH transformed coefficients and is therefore extracted after entropy decoding by the entropy decoding component 581. This data can be extracted from the level 2 residual processing pipeline and passed to the temporal prediction component 585.

[0196] Generally, the enhancement encoding and / or decoding components described herein are of low complexity (e.g., compared to schemes such as SVC and SHVC) and can be implemented in a flexible, modular manner. Additional filtering and other components can be inserted into the processing pipeline, as determined by the desired implementation. Level 1 and level 2 components can be implemented as copies or different versions of common operations, further reducing complexity. The base codec can operate as a separate modular black box, and therefore different codecs can be used depending on the implementation.

[0197] The data processing pipeline described herein can be implemented as a series of nested loops along the data dimensions. Subtraction and addition can be performed at the plane level (e.g., for each of the set of color planes of a frame) or using multidimensional arrays (e.g., an X×Y×C array, where C is the number of color channels, such as for YUV or RGB). In some cases, components can be configured to operate on N×N coding units (e.g., 2×2 or 4×4) and thus can be applied in parallel to the coding units of a frame. For example, a color plane of a frame of the input video can be decomposed into multiple coding units covering the region of the frame. This can create multiple small one-dimensional or two-dimensional arrays (e.g., 2×2 or 4×1 arrays, or 4×4 or 16×1 arrays) to which the components are applied. Thus, a reference to a residual set can include a reference to a set of small one-dimensional or two-dimensional arrays, each array including integer element values of a configured bit depth.
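The decomposition of a plane into small flattened arrays can be illustrated with a short sketch. It assumes a plane stored as nested lists whose dimensions are exact multiples of N, raster order for the units, and row-major flattening inside each unit; the function name `to_coding_units` is hypothetical.

```python
def to_coding_units(plane, n):
    """Split a 2-D plane into flattened n*n coding units, in raster order."""
    height, width = len(plane), len(plane[0])
    if height % n or width % n:
        raise ValueError("plane dimensions must be multiples of n")
    units = []
    for y in range(0, height, n):
        for x in range(0, width, n):
            # Row-major flattening: a 2x2 unit becomes a 4x1 array,
            # a 4x4 unit becomes a 16x1 array.
            units.append([plane[y + dy][x + dx] for dy in range(n) for dx in range(n)])
    return units


plane = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
]
print(to_coding_units(plane, 2))  # [[1, 2, 5, 6], [3, 4, 7, 8]]
```

Each returned unit is independent of the others, so the later stages can process the list in any order or in parallel.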

[0198] Each of the enhancement streams, or both together, can be encapsulated into one or more enhancement bitstreams using a set of Network Abstraction Layer Units (NALUs). A NALU is intended to encapsulate the enhancement bitstream so that the enhancement is applied to the correct base reconstructed frame. A NALU may, for example, contain a reference index to the NALU containing the base decoder reconstructed frame bitstream to which the enhancement must be applied. In this way, the enhancement can be synchronized to the base stream, and frames from each bitstream are combined to produce the decoded output video (i.e., the residuals of each frame of the enhancement levels are combined with the frames of the base decoded stream). A group of pictures can represent multiple NALUs.

[0199] Further description of the processing components

[0200] The preceding text describes how a set of processing components or tools can be applied throughout encoding and / or decoding in each of the enhancement streams (or to the input video). These processing components can be applied as modular components. They can be implemented as computer program code, i.e., executed by one or more processors, and / or configured as dedicated hardware circuit systems, such as being configured as individual or combined field-programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs). The computer program code may include firmware used by the operating system to provide video rendering services for embedded devices or codecs. The following provides an overview of each of the aforementioned tools and their functionality within the overall process shown in Figures 1 to 5C.

[0201] Downsampling: In this example, a downsampling process is applied by downsampling components (e.g., 104, 304, and 404). The downsampling process is applied to the input video to produce a downsampled video to be encoded by the underlying codec. Downsampling can be performed in both the vertical and horizontal directions, or alternatively only in the horizontal direction. The downsampling component can be further described as a downscaler.

[0202] Level 1 (L-1) encoding: The input to this component, shown as 122 in Figure 1, includes a first set of residuals obtained by taking the difference between the decoded output of the base codec and the downsampled video. This first set of residuals is then transformed, quantized, and encoded, as described further below.

[0203] Transformation: In some instances, there are two types of transformations that can be used by transformation components (e.g., transformation components 122, 322, and / or 341). The transformation can be a directional decomposition. The transformation can be used to decorrelate the residual values in the coding units (e.g., small N-by-N blocks of elements). The transformation can be applied as a matrix transformation, such as matrix multiplication applied to a flattened array representing the coding units.

[0204] In one case, the two types of transforms can correspond to two transform kernels of different sizes. The size of the coding unit can therefore be set based on the size of the transform kernel. The first transform has a 2×2 kernel applied to the 2×2 block of the residual. The resulting coefficients are as follows:

[0205]

[0206] The second transformation has a 4×4 kernel applied to the residual in 4×4 blocks. The resulting coefficients are as follows:

[0207]

[0208] These transformation matrices can include integer values within the following range:

[0209] {-1, 1} or {-1, 0, 1}. This simplifies computation and allows for fast hardware implementations using addition and subtraction. The transformation matrix can include a Hadamard matrix, which has advantageous properties such as orthogonal rows and self-inverse (i.e., the inverse transformation is the same as the forward transformation). If a Hadamard matrix is used, the inverse transformation can be called a transformation because the same matrix can be used for both the forward and inverse transformations.
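As an illustration of a transform built only from +1 and -1 entries, the sketch below applies a 4×4 Hadamard matrix to a flattened 2×2 block of residuals. The specific matrix, the ordering of its rows, and the labelling of the outputs as A, H, V, and D are assumptions made for this example; the kernels defined in the source are not reproduced here. Note that in this unnormalised integer form, applying the same matrix twice returns the input scaled by 4, so the inverse divides by 4.

```python
# Assumed kernel: a symmetric 4x4 Hadamard matrix with entries in {-1, 1}.
# Input order (assumption): [r00, r01, r10, r11], a flattened 2x2 block.
HADAMARD_4 = [
    [1,  1,  1,  1],   # sum of all four residuals ("A" in this sketch)
    [1, -1,  1, -1],   # left column minus right column ("H" in this sketch)
    [1,  1, -1, -1],   # top row minus bottom row ("V" in this sketch)
    [1, -1, -1,  1],   # main diagonal minus anti-diagonal ("D" in this sketch)
]


def apply_kernel(kernel, values):
    """Matrix-vector product using only additions and subtractions."""
    return [sum(k * v for k, v in zip(row, values)) for row in kernel]


def forward_transform(residuals):
    return apply_kernel(HADAMARD_4, residuals)


def inverse_transform(coefficients):
    # Same matrix as the forward transform; H * H = 4 * I, hence the division.
    return [value // 4 for value in apply_kernel(HADAMARD_4, coefficients)]


coefficients = forward_transform([3, 1, 2, 0])
print(coefficients)                     # [6, 4, 2, 0]
print(inverse_transform(coefficients))  # [3, 1, 2, 0]
```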

[0210] In some cases, the transformation may involve applying the predicted residuals (i.e., the use of the predicted average, as described in more detail in a later example).

[0211] Quantization: The set of transformed residuals (referred to herein as “coefficients”) is quantized using a quantization component such as component 323 or 343. An inverse quantization component, such as component 327, 364, 372, 572, or 582, can reconstruct a version of the value before quantization by multiplying the quantized value by a defined quantization factor. The coefficients can be quantized using a linear quantizer. The linear quantizer can use a dead zone of variable size, can use a dead zone whose size differs from the quantization step, and can use a non-centered dequantization offset. These variations are described in more detail with reference to later examples.
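One possible reading of a linear quantizer with a dead zone and a dequantization offset is sketched below. The exact rule is an assumption for illustration only: values whose magnitude is below `dead_zone` map to 0, other values are divided by `step`, and dequantization multiplies by `step` and optionally adds `offset`. The parameter names are hypothetical.

```python
def quantize(coeff, step, dead_zone):
    """Linear quantizer: magnitudes below dead_zone become 0 (assumed rule)."""
    magnitude = abs(coeff)
    if magnitude < dead_zone:
        return 0
    # At least level 1 once the value has left the dead zone.
    level = max(1, magnitude // step)
    return level if coeff > 0 else -level


def dequantize(level, step, offset=0):
    """Multiply by the step; a non-zero offset shifts the reconstruction point."""
    if level == 0:
        return 0
    magnitude = abs(level) * step + offset
    return magnitude if level > 0 else -magnitude


print(quantize(5, step=4, dead_zone=6))    # 0, inside the dead zone
print(quantize(-9, step=4, dead_zone=6))   # -2
print(dequantize(-2, step=4))              # -8
print(dequantize(1, step=4, offset=1))     # 5
```

A dead zone wider than the step pushes more small coefficients to 0, which increases the number of zero runs available to the entropy coder at the cost of some detail.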

[0212] Entropy coding: A set of quantized coefficients can be encoded using, for example, an entropy encoder such as Component 325 or 344. Two entropy coding schemes exist. In the first scheme, a run-length encoder (RLE) is used to encode the quantized coefficients. In the second scheme, the RLE is first used to encode the quantized coefficients, followed by processing the encoded output using a Huffman encoder.
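The benefit of run-length encoding for mostly-zero quantized coefficients can be shown with a minimal sketch. The output format used here, a list of (value, run length) pairs, is a hypothetical representation chosen for readability and is not the format used in the bitstream; the optional Huffman stage of the second scheme is not shown.

```python
def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    values = []
    for value, count in runs:
        values.extend([value] * count)
    return values


quantized = [0, 0, 0, 5, 0, 0]
encoded = rle_encode(quantized)
print(encoded)              # [(0, 3), (5, 1), (0, 2)]
print(rle_decode(encoded))  # [0, 0, 0, 5, 0, 0]
```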

[0213] Residual Mode (RM) Selection: If a residual (filtering) mode (RM) has been selected, the first set of residuals (i.e., level 1) can be further ranked and selected to determine which residuals should be transformed, quantized, and encoded. Residual filtering can be performed by one or more of components 321 or 340, for example, under the control of control components such as 150 or 350. Residual filtering can be performed anywhere in the residual processing pipeline, but preferably it is performed before entropy encoding.

[0214] Temporal Selection Mode: If the temporal selection mode is selected, for example by a component such as 152 or 352, the encoder can modify the coefficients (i.e., the transformed residuals or data derived from these transformed residuals) by subtracting the corresponding coefficients derived from a temporal buffer such as 345 or 361. This can be implemented by temporal prediction as described below. The decoder can then modify the coefficients by adding the corresponding coefficients derived from a temporal buffer such as one of components 230, 250, 530, 550, or 591.
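The subtraction at the encoder and the matching addition at the decoder can be sketched as follows. The sketch ignores quantization, assumes the buffer simply stores the previous frame's reconstructed coefficients for the same coding unit, and uses a hypothetical `refresh` flag to model a buffer reset; none of these names or policies are taken from the source.

```python
def temporal_encode(coeffs, buffer, refresh=False):
    """Encoder side: subtract the co-located buffer values from the coefficients."""
    if refresh:
        buffer = [0] * len(coeffs)  # a refresh discards the buffer contents
    deltas = [c - b for c, b in zip(coeffs, buffer)]
    return deltas, list(coeffs)     # second item: buffer for the next frame


def temporal_decode(deltas, buffer, refresh=False):
    """Decoder side: add the co-located buffer values back to the deltas."""
    if refresh:
        buffer = [0] * len(deltas)
    coeffs = [d + b for d, b in zip(deltas, buffer)]
    return coeffs, list(coeffs)     # second item: buffer for the next frame


previous = [4, 0, -2, 1]            # buffer contents from the previous frame
current = [4, 1, -2, 1]             # coefficients of the current frame
deltas, _ = temporal_encode(current, previous)
print(deltas)                                # [0, 1, 0, 0]
print(temporal_decode(deltas, previous)[0])  # [4, 1, -2, 1]
```

For largely static content the deltas are mostly 0, so fewer non-zero values reach the entropy coder.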

[0215] Level 1 (L-1) Decoding: This is shown as components 228 and 528. Input to this tool includes encoded level 1 streams 226 or 526 (i.e., L-1 encoded residuals), which pass through an entropy decoder (e.g., 571), a dequantizer (e.g., 572), and an inverse transform module (e.g., 573). The operations performed by these modules are the inverse of the operations performed by the modules described above. If the temporal selection mode is selected, residuals can be partially predicted from the co-located residuals from the temporal buffer.

[0216] Deblocking and Residual Filtering: In some cases, if a 4×4 transform is used, the decoded residuals can be fed to a filter module or deblocking filter such as 130, 232, 330, or 535. Deblocking is performed on each block of inverse-transformed residuals by applying a mask with specified weights. The general structure of the mask is as follows:

[0217]

[0218] Where 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1. The weights can be specified in the control signaling associated with the bit stream or retrieved from local memory.

[0219] Upsampling: The combination of the decoded (and filtered or deblocked, if applicable) first residual set (L-1) with the base decoded video is upsampled to generate an upsampled reconstructed video. Upsampling can be performed as described with respect to upsampling components 134, 234, 334, 434, 534, or 587. Examples of possible upsampling operations are described in more detail below. The upsampling method can be selectable and signaled in the byte stream. It should be noted that in the examples herein, the term "byte stream" or alternative terms such as stream, bit stream, or NALU stream may be used as appropriate.

[0220] Level 2 (L-2) encoding: This is represented by components 142, 442, and 548. The input to this encoding operation includes a second set of residuals (L-2) obtained by taking the difference between the upsampled reconstructed video and the input video. The second set of residuals (L-2) is then transformed, quantized, and encoded as further described herein. The transformation, quantization, and encoding are performed in the same manner as described with respect to L-1 encoding. If a residual filtering mode has been selected, the second set of residuals is further ranked and selected to determine which residuals should be transformed and encoded.

[0221] Predicted Coefficient (or Predicted Average) Mode: If the predicted coefficient mode is selected, the encoder can modify the transformed coefficient C00, also referred to herein as A, the average (for a 4×4 transform, this can be Ax, as described in more detail below). If a 2×2 transform is used, C00 can be modified by subtracting the value of the upsampled residual from which the transformed block of residuals is predicted. If a 4×4 transform is used, C00 can be modified by subtracting the average of the four upsampled residuals from which the transformed block of residuals is predicted. The predicted coefficient mode can be implemented at the decoder using modified upsampling as described herein.

[0222] Level 2 (L-2) Decoding: This is shown as components 248 and 548. The input to this decoding includes an encoded second set of residuals (L-2). The decoding process for the second set of residuals involves an entropy decoder (e.g., 581), a dequantizer (e.g., 582), and an inverse transform module (e.g., 583). The operations performed by these components are the inverse of the operations performed by the encoding components described above. If the temporal selection mode is selected, residuals can be predicted in part from the co-located residuals from the temporal buffer.

[0223] Modified Upsampling: The modified upsampling process comprises two steps, the second of which depends on signaling received by the decoder. In the first step, the combination of the decoded (and deblocked, if applicable) first residual set (L-1) and the base decoded video (the L-1 reconstructed video) is upsampled to generate an upsampled reconstructed video. If the predicted coefficient mode has been selected, the second step is performed. Specifically, the value of the element in the L-1 reconstructed video from which a 2×2 block in the upsampled reconstructed video was derived is added to that 2×2 block. Generally, modified upsampling can be based on the upsampled reconstructed video and on the pre-upsampling reconstructed lower-resolution video, as described with reference to Figure 4.

[0224] Dithering: In some instances, a final dithering stage can be selectively applied to the decoded video 260 or 560 in Figures 2 and 5A to 5C. Dithering may include applying a small level of noise to the decoded video. Dithering can be applied by adding random or pseudo-random numbers within a range to the decoded video. The range can be configured based on local and / or transmitted parameters. The range can be based on defined minimum and maximum values, and / or defined scaling factors (e.g., applied to the output of a random number generator with a specific range). Dithering can reduce the visual appearance of quantization artifacts, as known in the art.
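A minimal sketch of adding bounded pseudo-random noise to a decoded plane is given below. The uniform integer noise, the symmetric range `[-strength, strength]`, the clipping to the sample bit depth, and all parameter names are assumptions for illustration; the source does not fix the distribution or how the range is signaled.

```python
import random


def dither(plane, strength, bit_depth=8, seed=None):
    """Add uniform integer noise in [-strength, strength] to every sample."""
    rng = random.Random(seed)            # seeded generator for repeatable output
    max_value = (1 << bit_depth) - 1     # e.g. 255 for 8-bit samples
    return [
        [min(max_value, max(0, sample + rng.randint(-strength, strength)))
         for sample in row]
        for row in plane
    ]


decoded = [[0, 128, 255], [64, 64, 64]]
noisy = dither(decoded, strength=2, seed=1)
# Every output sample stays within 2 of its input and inside [0, 255].
print(all(abs(a - b) <= 2 for ra, rb in zip(noisy, decoded) for a, b in zip(ra, rb)))
```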

[0225] Examples of 4×4 residual coding units and tiles

[0226] Figure 6A shows an example 600 of a residual set 610 arranged in a 4×4 coding unit 620. There are therefore 16 residual elements. The coding unit 620 may include an N-by-N array R of residuals with elements R[x][y]. For a 2×2 coding unit, there may be 4 residual elements. Transformations can be applied to the coding units, as shown.

[0227] Figure 6B shows how multiple coding units 640 can be arranged into a set of tiles 650. The set of tiles can collectively cover the entire area of an image or frame. In the example of Figure 6B, each tile consists of an 8×8 array of coding units. If the coding unit is 4×4, this means that each tile has 32×32 elements; if the coding unit is 2×2, this means that each tile has 16×16 elements.
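The tile arithmetic can be checked with a few lines of code. The sketch assumes square tiles of 8×8 coding units and, as a hypothetical choice not stated in the source, rounds up when the frame size is not an exact multiple of the tile size; `tile_grid` is an illustrative name.

```python
def tile_grid(frame_width, frame_height, unit_size, units_per_tile_side=8):
    """Return (tile side in elements, tiles across, tiles down) for a frame."""
    tile_side = unit_size * units_per_tile_side
    # Ceiling division, so partially covered tiles at the edges are counted.
    tiles_across = -(-frame_width // tile_side)
    tiles_down = -(-frame_height // tile_side)
    return tile_side, tiles_across, tiles_down


print(tile_grid(1920, 1080, unit_size=4))  # (32, 60, 34)
print(tile_grid(1920, 1080, unit_size=2))  # (16, 120, 68)
```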

[0228] Example image format

[0229] Figures 7A to 7C show several ways in which color components can be organized to form frames within an image or video. In the examples, frames of the input video 102, 302, and 402 can be referred to as source images, and the decoded output video 260 and 560 can be referred to as decoded images. An encoding process, such as that performed by an encoder, can generally produce a bitstream as described in the examples herein, which is transmitted to and received by a decoding process, such as that performed by a decoder. The bitstream may include a combined bitstream generated from at least a base encoded stream, a level 1 encoded stream, a level 2 encoded stream, and a header (e.g., as described in the examples herein). The video source represented by the bitstream can therefore be viewed as a sequence of images in decoding order.

[0230] In some instances, the source image and the decoded image each consist of one or more sample arrays. These arrays may include: a luminance (monochrome) component only (e.g., Y); luminance and two chroma components (e.g., YCbCr or YCgCo); green, blue, and red components (e.g., GBR or RGB); or other arrays representing other unspecified monochrome or tristimulus color samples (e.g., YZX, also known as XYZ). The specific instances described herein are presented with reference to luminance and chroma arrays (e.g., Y, Cb, and Cr arrays); however, those skilled in the art will understand that these instances can be suitably configured to operate using any known or future color representation.

[0231] In some instances, the chroma format sampling structure can be specified via `chroma_sampling_type` (e.g., this can be signaled to the decoder). Different sampling formats can have different relationships between different color components. For example: in 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luminance array; in 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luminance array; and in 4:4:4 sampling, each of the two chroma arrays has the same height and width as the luminance array. Monochrome sampling has only one sample array, nominally considered to be the luminance array. The number of bits required to represent each of the samples in the luminance and chroma arrays in a video sequence can range from 8 to 16 (inclusive), and the number of bits used in the luminance array can differ from the number of bits used in the chroma array.

[0232] Figures 7A to 7C show the different sampling types that can be represented by different values of the variable `chroma_sampling_type`. When the value of `chroma_sampling_type` is equal to 0, the nominal vertical and horizontal relative positions of the luminance samples 710 and chroma samples 720 in the image are as shown in Figure 7A. When the value of `chroma_sampling_type` is equal to 1, the chroma samples 720 are co-sited with the corresponding luminance samples 710, and the nominal positions in the image are as shown in Figure 7B. When the value of `chroma_sampling_type` is equal to 2, all array samples 710 and 720 are co-sited for all cases of images, and the nominal positions in the image are as shown in Figure 7C. In these cases, the variables SubWidthC and SubHeightC indicate how the chroma samples are shifted:

[0233]

| chroma_sampling_type | Color format | SubWidthC | SubHeightC |
|---|---|---|---|
| 0 | Monochrome | 1 | 1 |
| 1 | 4:2:0 | 2 | 2 |
| 2 | 4:2:2 | 2 | 1 |
| 3 | 4:4:4 | 1 | 1 |
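The relationship between the tabulated values and the chroma array dimensions can be sketched as a simple lookup. The dictionary and function names are hypothetical; the sketch assumes the luminance dimensions divide exactly by SubWidthC and SubHeightC, and treats the monochrome case as having no chroma arrays.

```python
# chroma_sampling_type -> (color format, SubWidthC, SubHeightC), from the table.
CHROMA_FORMATS = {
    0: ("monochrome", 1, 1),
    1: ("4:2:0", 2, 2),
    2: ("4:2:2", 2, 1),
    3: ("4:4:4", 1, 1),
}


def chroma_plane_size(luma_width, luma_height, chroma_sampling_type):
    """Return (width, height) of each chroma array, or None for monochrome."""
    name, sub_width_c, sub_height_c = CHROMA_FORMATS[chroma_sampling_type]
    if name == "monochrome":
        return None  # only a luminance array exists
    return luma_width // sub_width_c, luma_height // sub_height_c


print(chroma_plane_size(1920, 1080, 1))  # (960, 540)
print(chroma_plane_size(1920, 1080, 2))  # (960, 1080)
print(chroma_plane_size(1920, 1080, 3))  # (1920, 1080)
```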

[0234] Example bitstream processing

[0235] Figure 8 shows an example method 800 that can be used to process a bitstream encoded using the example encoders or encoding processes described herein. Method 800 can be implemented, for example, by an example decoder such as 200 or 500 in Figures 2 and 5. Method 800 shows an example flow that facilitates separation of the enhancement bitstream.

[0236] At block 802, method 800 includes receiving an input bit stream. At block 804, the start of a NALU is identified within the received bit stream. This then allows the identification of an entry point at block 806. The entry point can indicate which type of decoding process should be used to decode the bit stream. Next, at block 808, a payload enhancement configuration is determined. The payload enhancement configuration can indicate specific parameters of the payload. The payload enhancement configuration can be signaled once per stream. Optionally, the payload enhancement configuration can be signaled multiple times per picture group or for each NALU. The payload enhancement configuration can be used to extract payload metadata at block 810.

[0237] At block 812, the start of a group of pictures (GOP) is identified. Although the term group of pictures is used, it should be understood that the term refers to a structure corresponding to that of the base stream, and does not imply a specific structure for the enhancement stream. That is, the enhancement stream may not have a GOP structure in the strict sense, and strict adherence to a GOP structure is not required. If payload metadata is included, it may be included after the payload enhancement configuration and before the group of pictures. The payload metadata may, for example, contain HDR information. After block 812, the GOP can be retrieved. At block 814, if the NALU relates to the first bitstream frame, the method may further include retrieving a payload global configuration at block 816. The payload global configuration may indicate parameters of the decoding process; for example, it may indicate whether a predicted residual mode or temporal prediction is enabled in the encoder (and should therefore be enabled in the decoder), and thus whether a given mode should be used in the decoding method. The payload global configuration may be retrieved once for each GOP. At block 818, method 800 may further include retrieving a set of payload decoder control parameters indicating decoder parameters to be enabled during decoding, such as dithering or upsampling parameters. The payload decoder control parameters may be retrieved for each GOP. At block 820, method 800 includes retrieving a payload picture configuration from the bitstream. The payload picture configuration may include parameters relating to each picture or frame, for example quantization parameters such as a step width. The payload picture configuration may be retrieved once per NALU (i.e., once per picture or frame). At block 822, method 800 may then further include retrieving a payload of encoded data, which may include the encoded data for each frame. The payload of encoded data may be signaled once per NALU (i.e., once per picture or frame). The payload of encoded data may include a surface, plane, or layer of data, which may be separated into chunks, as described with reference to Figure 9A and the examples of Figures 21A and 21B. After the payload of encoded data is retrieved, the NALU may end at block 824.

[0238] If the GOP also ends, the method can continue retrieving a new NALU for the new GOP. If the NALU is not the first stream frame (as is the case here), the NALU can then optionally retrieve the entry point (i.e., an indication of the software version to be used for decoding). The method can then retrieve the payload global configuration, payload decoder control parameters, and payload picture configuration. The method can then retrieve the payload of the encoded data. The NALU will then end.

[0239] If the NALU does not relate to the first bitstream frame at block 814, then blocks 828 through 838 can be executed. Optional block 828 can be similar to block 806. Blocks 830 through 838 can be executed in a similar manner to blocks 816 through 824.

[0240] At blocks 840 and 842, after each NALU has ended, if the GOP has not yet ended, method 800 may include retrieving a new NALU from the stream at block 844. For each second and subsequent NALU of each GOP, method 800 may optionally retrieve an entry point indication at block 846 in a manner similar to blocks 806 and 828. Method 800 may then include retrieving payload picture configuration parameters at block 848, and retrieving the payload of encoded data for the NALU at block 850. Blocks 848 to 852 may therefore be performed in a manner similar to blocks 820 to 824 and blocks 834 to 838. The encoded payload data may include tile data.

[0241] As described above, if the NALU is not the last NALU of a GOP, the method may include retrieving another NALU (e.g., looping back to block 844). If the NALU is the last NALU in a GOP, method 800 may continue to block 854. If more GOPs exist, the method may loop back to block 812 and include retrieving another GOP and proceeding to block 814 as previously described. Once all GOPs have been retrieved, the bitstream ends at block 856.

[0242] Example form of encoded payload data

[0243] Figure 9A shows how encoded data 900 within an encoded bitstream can be separated into chunks. More precisely, Figure 9A shows an example data structure for a bitstream generated by an enhancement encoder (e.g., level 1 and level 2 encoded data). Multiple planes (nPlanes) are shown. Each plane relates to a specific color component. In the example of Figure 9A, the planes relate to YUV color planes (e.g., where the input video frame has three color channels, i.e., three values for each pixel). In this example, the planes are encoded individually.

[0244] The data in each plane is further organized into a number of levels (nLevels). In Figure 9A, there are two levels, one for each of enhancement levels 1 and 2. The data for each level is then further organized into a number of layers (nLayers). These layers are distinct from the base layer and the enhancement layers; in this case, they refer to the data for each of the groups of coefficients produced by the transform. For example, a 2×2 transform produces four distinct coefficients, which are then quantized and entropy encoded, and a 4×4 transform produces sixteen distinct coefficients, which are likewise quantized and entropy encoded. In these cases, there are therefore 4 and 16 layers, respectively, where each layer represents the data associated with one distinct coefficient. Where the coefficients are referred to as A, H, V, and D coefficients, the layers can be considered as A, H, V, and D layers. In some instances, these “layers” are also called “surfaces” because they can be viewed as “frames” of coefficients, in a manner similar to a set of two-dimensional arrays of color components.

[0245] The data in a set of layers can be viewed as “chunks.” Thus, each payload can be considered as being ordered into chunks in a hierarchical manner. That is, each payload is grouped into planes; within each plane, each level is grouped into layers; and each layer includes a set of chunks for that layer. A level represents an enhancement level (first or second), and a layer represents a set of transform coefficients. In any decoding process, a method may include retrieving chunks for the two enhancement levels of each plane. A method may include retrieving 4 or 16 layers for each level, depending on the size of the transform used. Therefore, each payload is ordered into a set of chunks for all layers in the first level, followed by a set of chunks for all layers in the next level of the plane. The payload then includes the set of chunks for the first level of the next plane, and so on.
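The hierarchical ordering of chunks (planes, then levels within a plane, then layers within a level) can be sketched as a nested enumeration. This is only an illustration of the ordering: it assumes one entry per layer, zero-based indices, and the same transform size for both levels, and the function name `chunk_order` is hypothetical.

```python
def chunk_order(n_planes, n_levels, n_layers):
    """List (plane, level, layer) index triples in payload order."""
    return [
        (plane, level, layer)
        for plane in range(n_planes)     # e.g. Y, U, V
        for level in range(n_levels)     # enhancement level 1, then level 2
        for layer in range(n_layers)     # 4 for a 2x2 transform, 16 for 4x4
    ]


order = chunk_order(n_planes=3, n_levels=2, n_layers=4)
print(len(order))  # 24
print(order[:5])   # [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3), (0, 1, 0)]
print(order[8])    # (1, 0, 0), the first layer of the first level of the next plane
```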

[0246] Thus, in the encoding and decoding methods described herein, video images can, for example, be partitioned into a hierarchical structure with a specified organization. Each image can consist of three distinct planes organized within the hierarchical structure. The decoding process may seek to obtain a set of decoded base image planes and a set of residual planes. The decoded base images correspond to the decoded output of the base decoder. The base decoder can be a known or conventional decoder, and therefore the bitstream syntax and decoding process for the base decoder can be determined based on the base decoder used. In contrast, the residual planes are new to the enhancement layer and can be partitioned as described herein. A “residual plane” can include a set of residuals associated with a particular color component. For example, although the planes 910 are shown as relating to the YUV planes of the input video, it should be noted that the data 920 does not include YUV values, as it would in comparative coding techniques. Rather, data 920 includes encoded residuals derived from data from each of the YUV planes.

[0247] In some instances, the residual plane can be divided into coding units, the size of which depends on the size of the transform used. For example, if a 2×2 directional decomposition transform is used, the coding unit may have a 2×2 size, or if a 4×4 directional decomposition transform is used, the coding unit may have a 4×4 size. The decoding process may include outputting one or more sets of residual surfaces, i.e., one or more collections of residuals. For example, these may be output by the level 1 decoding component 228 and the level 2 decoding component 248 of Figure 2. A first set of residual surfaces can provide a first enhancement level. A second set of residual surfaces can provide another enhancement level. Each set of residual surfaces can be combined, individually or collectively, with the reconstructed image derived from the base decoder, for example as shown in the example decoder 200 of Figure 2.

[0248] Example upsampling methods

[0249] Figures 9B to 9J and the following description relate to possible upsampling methods that can be used when implementing an upsampling component as described in the examples herein (e.g., upsampling components 134, 234, 334, 434, 534, or 587 in Figures 1 to 5C).

[0250] Figures 9B and 9C show two examples of how a frame to be upsampled can be divided. A reference to a frame can be taken as a reference to one or more planes of data, for example in YUV format. Each frame to be upsampled, called the source frame 910, is divided into two main parts: a central region 910C and a boundary region 910B. Figure 9B shows an example layout for the bilinear and bicubic upsampling methods. In Figure 9B, the boundary region 910B consists of four segments: the top segment 910BT, the left segment 910BL, the right segment 910BR, and the bottom segment 910BB. Figure 9C shows an example layout for the nearest upsampling method. In Figure 9C, the boundary region 910B consists of two segments: a right segment 910BR and a bottom segment 910BB. In both examples, the segments can be defined by a boundary size parameter (BS), which sets the width of the segment (i.e., the length by which the segment extends from the edge of the frame into the source frame). The boundary size can be set to 2 pixels for the bilinear and bicubic upsampling methods, or 1 pixel for the nearest method.

[0251] In use, the determination of whether a source frame pixel is located within a specific segment can be performed based on a defined set of pixel indices (e.g., in the x and y directions). Performing differential upsampling based on whether the source frame pixel is within the central region 910C or the boundary region 910B can help avoid boundary effects that may be attributed to discontinuities at the edges of the source frame.
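The segment test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function name `classify_pixel`, its parameters, and the returned labels are hypothetical, and the two layouts simply follow the descriptions of Figures 9B and 9C.

```python
def classify_pixel(x, y, width, height, bs, method="bilinear"):
    """Return "boundary" or "center" for source pixel (x, y).

    bs is the boundary size parameter (2 for bilinear/bicubic, 1 for nearest).
    All names here are illustrative assumptions.
    """
    in_right = x >= width - bs
    in_bottom = y >= height - bs
    if method == "nearest":
        # Figure 9C layout: right and bottom segments only.
        in_boundary = in_right or in_bottom
    else:
        # Figure 9B layout: top, left, right and bottom segments.
        in_boundary = x < bs or y < bs or in_right or in_bottom
    return "boundary" if in_boundary else "center"
```

For an 8×8 source frame, pixel (0, 0) falls in the boundary region for the bilinear layout but in the central region for the nearest layout, which has no top or left segment.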

[0252] Nearest upsampling

[0253] Figure 9C provides an overview of how a frame is upsampled using the nearest upsampling method. In Figure 9C, source frame 920 is upsampled to become destination frame 922. The nearest upsampling method upsamples the current source pixel 928 by copying it onto a 2×2 destination grid 924 of destination pixels, as indicated by arrow 925. The center and edge pixels are shown as 926 and 927, respectively. The destination pixel positions are calculated by doubling the index of the source pixel 928 on both axes and then adding +1 on each axis to extend the range to cover 4 pixels, as shown on the right side of Figure 9C. For example, the value of the source pixel 928 with index position (x=6, y=6) is copied to the destination grid 924, which includes the pixels with index positions (12, 12), (13, 12), (12, 13), and (13, 13). Each pixel in the destination grid 924 takes the value of the source pixel 928.
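The index arithmetic of the nearest method can be illustrated with a short Python sketch. The function name `nearest_upsample` and the list-of-rows frame representation are assumptions made for illustration only.

```python
def nearest_upsample(src):
    """Upsample a 2D frame (list of rows) by two on both axes.

    Each source pixel (x, y) is copied to the 2x2 destination grid
    starting at (2x, 2y).
    """
    height, width = len(src), len(src[0])
    dst = [[0] * (2 * width) for _ in range(2 * height)]
    for y in range(height):
        for x in range(width):
            for dy in (0, 1):          # +1 on the y axis
                for dx in (0, 1):      # +1 on the x axis
                    dst[2 * y + dy][2 * x + dx] = src[y][x]
    return dst
```

With this sketch, the source pixel at (x=6, y=6) lands on destination positions (12, 12), (13, 12), (12, 13), and (13, 13), matching the example in the text.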

[0254] The nearest upsampling method is a preferred, fast implementation for embedded devices with limited processor resources. However, the nearest method has the disadvantage that blocking or "pixelation" artifacts may need to be corrected by the level 2 residuals (e.g., resulting in more non-zero residual values that require more bits for transmission after entropy encoding). In some examples described below, bilinear and bicubic upsampling can produce sets of level 2 residuals that can be encoded more efficiently, requiring fewer bits after quantization and entropy encoding. For example, bilinear and bicubic upsampling can generate upsampled outputs that more accurately match the input signal, resulting in smaller level 2 residual values.

[0255] Bilinear upsampling

[0256] Figures 9E, 9F and 9G illustrate the bilinear upsampling method. The bilinear upsampling method can be divided into three main steps. The first step involves constructing a 2×2 source grid 930 for source pixels 932 in the source frame. The second step involves performing bilinear interpolation. The third step involves writing the interpolation result to destination pixels 936 in the destination frame.

[0257] Bilinear Upsampling - Step 1: Source Pixel Grid

[0258] Figure 9E shows an example of the construction of a 2×2 source grid 930 (which may also be called a bilinear grid). A 2×2 source grid 930 is used instead of a single source pixel 932 because the bilinear upsampling method performs upsampling by taking into account the values of the three pixels nearest to the base pixel 932B (i.e., the three nearest pixels falling within the 2×2 source grid 930). In this example, the base pixel 932B is located at the bottom right corner of the 2×2 source grid 930, but other locations are possible. During the bilinear upsampling method, a 2×2 source grid 930 can be determined for multiple source frame pixels in order to iteratively determine the destination frame pixel values for the entire destination frame. The location of the base pixel 932B is used to determine the address of the destination frame pixel.

[0259] Bilinear upsampling - Step 2: Bilinear interpolation

[0260] Figure 9F illustrates the derivation of the bilinear coefficients. In this example, bilinear interpolation is a weighted sum of the values of the four pixels in the 2×2 source grid 930. The weighted sum is used as the pixel value of the destination pixel 936 being calculated. The specific weights used depend on the position of the specific destination pixel 936 in the 2×2 destination grid 935. In this example, bilinear interpolation applies weights to each source pixel 932 in the 2×2 source grid 930 using the position of the destination pixel 936 in the 2×2 destination grid 935. For example, if the top-left destination pixel is being calculated (shown as 936/936B in Figure 9F), then the top-left source pixel value receives the maximum weighting coefficient 934 (e.g., weighting factor 9), the bottom-right pixel value (diagonally opposite) receives the minimum weighting coefficient (e.g., weighting factor 1), and the remaining two pixel values receive an intermediate weighting coefficient (e.g., weighting factor 3). These weights are shown in the 2×2 source grid 930 in Figure 9F.

[0261] For the pixel to the right of 936/936B within the 2×2 destination grid 935, the weighting applied to the weighted sum changes as follows: the top-right source pixel value receives the maximum weighting factor (e.g., weighting factor 9), the bottom-left pixel value (diagonally opposite) receives the minimum weighting factor (e.g., weighting factor 1), and the remaining two pixel values receive an intermediate weighting factor (e.g., weighting factor 3).

[0262] In Figure 9F, four destination pixels are calculated from the 2×2 source grid 930 for a base pixel 932B, but each destination pixel is determined using a different set of weights. These weights can be considered an upsampling kernel. In this way, four different sets of weights are applied to the original pixel values within the 2×2 source grid 930 to generate the 2×2 destination grid 935 for the base pixel 932B. After the four destination pixel values are determined, another base pixel with a different source grid is selected and the process starts again to determine the next four destination pixel values. This can be repeated iteratively until the pixel values for the entire destination (e.g., upsampled) frame are determined. The next section describes in more detail the mapping of these interpolated pixel values from the source frame to the destination frame.
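The four weight sets can be written out as a small Python sketch. The weights 9, 3 and 1 for the top-left and top-right destination pixels come from the description; the two bottom-row weight sets are assumed here by symmetry, and the normalization (division by 16 with rounding, since 9 + 3 + 3 + 1 = 16) is also an assumption, as are the names `KERNELS` and `interpolate_grid`.

```python
# Weight sets keyed by destination position (dx, dy) inside the 2x2 grid.
# Each set is laid out as ((top-left, top-right), (bottom-left, bottom-right))
# over the 2x2 source grid. Bottom-row sets are an assumption by symmetry.
KERNELS = {
    (0, 0): ((9, 3), (3, 1)),  # top-left destination pixel
    (1, 0): ((3, 9), (1, 3)),  # top-right destination pixel
    (0, 1): ((3, 1), (9, 3)),  # bottom-left (assumed)
    (1, 1): ((1, 3), (3, 9)),  # bottom-right (assumed)
}


def interpolate_grid(grid):
    """Map a 2x2 source grid to a 2x2 destination grid of weighted sums."""
    out = [[0, 0], [0, 0]]
    for (dx, dy), kernel in KERNELS.items():
        total = sum(kernel[r][c] * grid[r][c] for r in range(2) for c in range(2))
        out[dy][dx] = (total + 8) >> 4  # assumed: divide by 16, round to nearest
    return out
```

A flat source grid is reproduced unchanged, and a single bright top-left source pixel spreads over the destination grid in the proportions 9 : 3 : 3 : 1.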

[0263] Bilinear upsampling - Step 3: Destination pixel

[0264] Figure 9G shows an overview of the bilinear upsampling method, including source frame 940, destination frame 942, interpolation module 944, multiple 2×2 source grids 930 (a, b, c, d, h, j), and multiple 2×2 destination grids 935 (d, e, h, k). Source frame 940 and destination frame 942 have 0-based indices on each column and row for pixel addressing (alternative indexing schemes may be used).

[0265] Generally, each of the weighted averages generated from each 2×2 source grid 930 is mapped to the corresponding destination pixel 936 in the corresponding 2×2 destination grid 935. Unlike the nearest method, this mapping uses the source base pixel 932B of each 2×2 source grid 930 to map to the corresponding destination base pixel 936B of the corresponding 2×2 destination grid 935. The address of the destination base pixel 936B is calculated according to the following equation (applied to both axes):

[0266] Dst_base_addr = (Src_base_address × 2) - 1

[0267] Furthermore, the destination base pixel has three corresponding destination sub-pixels 936S, whose addresses are calculated according to the following equation:

[0268] Dst_sub_addr = Dst_base_addr + 1 (for both axes)

[0269] Therefore, each 2×2 destination grid 935 typically includes a destination base pixel 936B together with three destination sub-pixels 936S, located respectively to the right of, below, and diagonally down and to the right of the destination base pixel. This is shown in Figure 9F. However, other configurations of the destination grid and base pixel are possible.

[0270] The calculated destination base address and sub-addresses of destination pixels 936B and 936S may be out of range on destination frame 942. For example, pixel A (0, 0) on source frame 940 generates a destination base pixel address (-1, -1) for the 2×2 destination grid 935. The destination address (-1, -1) does not exist on destination frame 942. When this occurs, writes to destination frame 942 for these out-of-range values are ignored. This is expected to occur when upsampling at the boundary of the source frame. However, it should be noted that in this particular example, one of the destination sub-pixel addresses, (0, 0), is in range on destination frame 942. The weighted average of the 2×2 source grid 930 (i.e., with the highest weighting given to the bottom left pixel value) will be written to address (0, 0) on destination frame 942. Similarly, pixel B (1, 0) on source frame 940 generates a destination base pixel address (1, -1), which is out of range because there is no row -1. However, the destination sub-pixel addresses (1, 0) and (2, 0) are in range, and the corresponding weighted sums are written to those addresses. A similar situation occurs for pixel C, but only the two values in column 0 are written (i.e., addresses (0, 1) and (0, 2)). Pixel D at address (1, 1) of the source frame contributes a full 2×2 destination grid 935d based on the weighted averages of the source grid 930d. Figure 9G also shows the 2×2 destination grids 935e, 935h and 935k and the corresponding source grids 930e, 930h and 930k for pixels E, H and K.
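The address equations and the out-of-range rule can be checked with a small Python sketch. The function name `destination_addresses` and its parameters are hypothetical; only the base and sub-address arithmetic is taken from the text.

```python
def destination_addresses(src_x, src_y, dst_width, dst_height):
    """Return the in-range destination addresses for one source base pixel.

    Base address: (src * 2) - 1 on each axis; sub-addresses add +1 per axis.
    Out-of-range addresses are dropped, mirroring the ignored writes.
    """
    base_x = src_x * 2 - 1
    base_y = src_y * 2 - 1
    candidates = [(base_x + dx, base_y + dy) for dy in (0, 1) for dx in (0, 1)]
    return [(x, y) for (x, y) in candidates
            if 0 <= x < dst_width and 0 <= y < dst_height]
```

For a 16×16 destination frame, pixel A (0, 0) yields only (0, 0), pixel B (1, 0) yields (1, 0) and (2, 0), pixel C (0, 1) yields (0, 1) and (0, 2), and pixel D (1, 1) yields a full 2×2 grid, as described above.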

[0271] It will be understood that these equations effectively handle the boundary region 910B and its associated segments, and ensure that when the central region 910C is upsampled, it remains centered on the destination frame 942. Because of the way the destination sub-pixels are determined, any pixel value that is determined twice by this method can be ignored or overwritten.

[0272] Furthermore, the ranges of boundary segments 910BR and 910BB are extended by +1 in order to fill all pixels in the destination frame. In other words, source frame 940 is extrapolated to provide a new column of pixels in boundary segment 910BR (shown as index column number 8 in Figure 9G) and a new row of pixels in boundary segment 910BB (shown as index row number 8 in Figure 9G).

[0273] Cubic upsampling

[0274] Figures 9H, 9I and 9J together illustrate the cubic upsampling method, specifically the bicubic method. The cubic upsampling method in this example can be divided into three main steps. The first step involves constructing a 4×4 source grid 962 for the source pixels, where the base pixel 964B is located at local index (2, 2) within the 4×4 source grid 962. The second step involves performing bicubic interpolation. The third step involves writing the interpolation result to the destination pixels.

[0275] Cubic upsampling - Step 1: Source pixel grid

[0276] Figure 9H shows the construction of a 4×4 source grid 962 on a source frame 960, for an in-bound grid 962i and separately for an out-of-bound grid 962o. In this example, "in-bound" means that the grid covers source pixels within the source frame (e.g., the central region 910C and the boundary region 910B); "out-of-bound" means that the grid includes locations outside the source frame. The cubic upsampling method is performed using the 4×4 source grid 962, which is subsequently multiplied by a 4×4 kernel. This kernel may be called the upsampling kernel. During the generation of the 4×4 source grid 962, any pixel falling outside the frame limits of the source frame 960 (e.g., as shown in the out-of-bound grid 962o) is replaced by the value of a source pixel 964 located at the boundary of the source frame 960.
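A minimal Python sketch of the source grid construction follows. It assumes that an out-of-frame location is replaced by the nearest boundary pixel (coordinate clamping); the text only states that a boundary pixel value is used, so the choice of which boundary pixel is an assumption, as is the function name `source_grid_4x4`.

```python
def source_grid_4x4(src, base_x, base_y):
    """Build the 4x4 source grid whose local index (2, 2) is the base pixel.

    Locations outside the frame are replaced by the nearest boundary pixel
    (assumed clamping rule).
    """
    height, width = len(src), len(src[0])
    grid = []
    for j in range(4):
        y = min(max(base_y - 2 + j, 0), height - 1)
        row = []
        for i in range(4):
            x = min(max(base_x - 2 + i, 0), width - 1)
            row.append(src[y][x])
        grid.append(row)
    return grid
```

For a base pixel well inside the frame the grid is simply a 4×4 window; for the base pixel at (0, 0) the first three rows and columns of the grid repeat the frame's first row and column.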

[0277] Cubic upsampling - Step 2: Bicubic interpolation

[0278] The kernel used for the bicubic upsampling process typically has a 4×4 coefficient grid. However, the relative position of the destination pixel with respect to the source pixel yields a different set of coefficients, and because upsampling is by a factor of two in this example, four 4×4 kernels are used in the upsampling process. These kernels are represented by a 4-dimensional coefficient grid (2×2×4×4). For example, there is one 4×4 kernel for each destination pixel in the 2×2 destination grid that represents a single upsampled source pixel 964B.

[0279] In one case, the bicubic coefficients can be calculated from a fixed set of parameters. This set may include a core parameter (the bicubic parameter) and a set of spline creation parameters. In one example, a core parameter of -0.6 and four spline creation parameters [1.25, 0.25, -0.75, and -1.75] can be used. The filter implementation can use fixed-point calculations within a hardware device.

[0280] Cubic upsampling - Step 3: Destination pixels

[0281] Figure 9J shows an overview of the cubic upsampling method, including source frame 972, destination frame 980, interpolation module 982, 4×4 source grid 970, and 2×2 destination grid 984. Source frame 972 and destination frame 980 have 0-based indices on each column and row for pixel addressing (alternative indexing schemes may be used).

[0282] Similar to the bilinear method, the bicubic destination pixel has a base address calculated for both axes according to the following equation:

[0283] Dst_base_addr = (Src_base_address × 2) - 1

[0284] Furthermore, the destination sub-pixel addresses are calculated according to the following equation:

[0285] Dst_sub_addr = Dst_base_addr + 1 (for both axes)

[0286] Therefore, as for the bilinear method, each 2×2 destination grid 984 typically includes a destination base pixel together with three destination sub-pixels, located respectively to the right of, below, and diagonally down and to the right of the destination base pixel. However, other configurations of the destination grid and base pixel are possible.

[0287] Furthermore, these equations ensure that when the central region is upsampled, it remains centered in the destination frame. Additionally, the ranges of boundary segments 910BR and 910BB are extended by +1 to fill all pixels in the destination frame 980, in the same manner as described for the bilinear method. Because of the way the destination sub-pixels are determined, any pixel value that is determined twice by this method may be ignored or overwritten. The calculated destination base address and sub-addresses may be out of range. When this occurs, writes of these out-of-range values to the destination frame are ignored. This is expected to occur when upsampling the boundary regions.

[0288] Example entropy coding

[0289] Figures 10A to 10I illustrate different aspects of entropy coding. These aspects may relate to the entropy encoding performed, for example, by entropy coding components 325 and 344 in Figures 3A and 3B, and/or to the entropy decoding performed, for example, by entropy decoding components 571 and 581 in Figures 5B and 5C.

[0290] Figure 10A shows an example implementation 1000 of an entropy decoding component 1003 (e.g., one or more of entropy decoding components 571, 581 in Figures 5B and 5C). Entropy decoding component 1003 takes a set 1001 of entropy-coded residuals (Ae, He, Ve, De) 1002 as input and outputs a set 1006 of quantized coefficients 1007 (e.g., quantized transformed residuals in the example shown). The entropy-coded residuals 1002 may include a received encoded level 1 or level 2 stream (e.g., as shown at 226 or 246 in Figure 2). The entropy decoding component 1003 includes a Huffman decoder 1004 followed by a run-length decoder 1005. The Huffman decoder 1004 receives an encoded enhancement stream that has been encoded using Huffman coding and decodes it to produce a run-length encoded stream. The run-length encoded stream is then received by the run-length decoder 1005, which applies run-length decoding to generate the quantized coefficients 1007. Figure 10A shows the example of a 2×2 transform, so the coefficients are shown as the A, H, V and D coefficients of a 2×2 directional decomposition.

[0291] The entropy coding component can be arranged in the inverse manner to implementation 1000. For example, the input to the entropy coding component may include a surface (e.g., residual data derived from a quantized set of transformed residuals), and the component can be configured to produce an entropy-coded version of the residual data, such as data in the form of the coded stream data 1001 (where, for the 2×2 example, Ae, He, Ve, De are encoded and quantized coefficients).

[0292] Example entropy coding - Header formats

[0293] Figures 10B to 10E illustrate specific implementations of the header formats and show how code lengths can be written to the stream header depending on the number of non-zero codes.

[0294] Figure 10B shows the prefix code (i.e., Huffman) decoder stream header 1010 for the case where there are more than 31 non-zero codes. The first 5 bits indicate the minimum length of the prefix codes. The next 5 bits indicate the maximum length of the prefix codes. A subsequent bit then provides a compression flag 1011 indicating whether compression is applied. In the example of Figure 10B, three symbols follow: a first non-zero symbol 1014, a second zero symbol 1015, and a third non-zero symbol 1016. The non-zero length flag 1017 is a bit flag indicating, for each symbol, whether its code length is non-zero; the flags for the first and third symbols 1014 and 1016 are 1, while the flag for the second symbol 1015 is 0. For each non-zero symbol, the code length of the prefix code is then signaled as the code length minus the minimum length (e.g., as sent in the first 5 bits). The code lengths can be used to initialize the prefix (i.e., Huffman) decoder, for example the Huffman decoder 1004 in Figure 10A. In this example, the number of code length bits can be equal to log2(max_length - min_length + 1).
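The following Python sketch illustrates how such a header might be serialized for the case with more than 31 non-zero codes. It is a hedged illustration only: the function names, the use of a bit string, the position and meaning of the pass-through `flag` bit, and rounding log2 up to a whole number of bits are all assumptions; the field order follows the paragraph above.

```python
def code_length_bits(min_length, max_length):
    """Bits per code length field: log2(max - min + 1), rounded up (assumed)."""
    span = max_length - min_length + 1
    return (span - 1).bit_length()


def write_code_lengths(lengths, flag=0):
    """Serialize code lengths as a bit string (0 means the symbol is unused).

    Layout (assumed from the description): 5-bit minimum length, 5-bit
    maximum length, one flag bit, then per symbol a 1-bit non-zero flag
    followed, if set, by (length - minimum) in a fixed number of bits.
    """
    used = [n for n in lengths if n > 0]
    min_len, max_len = min(used), max(used)
    width = code_length_bits(min_len, max_len)
    bits = f"{min_len:05b}{max_len:05b}{flag:01b}"
    for n in lengths:
        if n == 0:
            bits += "0"
        else:
            bits += "1"
            if width:
                bits += f"{n - min_len:0{width}b}"
    return bits
```

For code lengths `[3, 0, 5]` (a non-zero, a zero and a non-zero symbol, as in the Figure 10B example), the minimum is 3, the maximum is 5, and each length difference needs 2 bits, so sending differences rather than absolute lengths keeps the header small.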

[0295] Therefore, in the example of Figure 10B, the data contains more than 31 non-zero values, and the header includes a minimum code length and a maximum code length. The code length of each symbol is then sent in sequence. A flag indicates whether the symbol length is non-zero. The code length bits are then sent as the difference between the code length and the minimum signaled length. This reduces the overall size of the header.

[0296] Figure 10C shows a header 1020 that is similar to that of Figure 10B but is used when there are fewer than 31 non-zero codes. This may be the normal case. Header 1020 again has a first 5 bits indicating the minimum length, a subsequent 5 bits indicating the maximum length, and a compression flag 1021 (e.g., which may be 0 or 1 to indicate compression, as described elsewhere herein). Header 1020 then further includes the number of symbols in the data, followed by a set of consecutive symbols 1024, 1025. Each symbol may include 8 bits indicating the symbol value, followed by the length of the codeword for that symbol, again transmitted as the difference between that length and the minimum length, as described with reference to Figure 10B.

[0297] In both cases, header 1010 or 1020 is used to initialize the entropy decoding component (more precisely, the Huffman or prefix code decoder) by reading the code lengths from the header.

[0298] Figures 10D and 10E illustrate additional headers 1030 and 1040 that can be sent in edge cases. For example, when all frequencies are zero, the stream header may be the header 1030 shown in Figure 10D, in which the 5-bit minimum and maximum length fields (1031 and 1032) are both set to 31 (i.e., to the maximum value) to indicate the special case. The header 1040 of Figure 10E shows the case where only one code exists in the Huffman tree. In this case, a 0 (i.e., minimum) value in the minimum and maximum length fields (1041 and 1042) indicates the special case of a single code, and these field values are then followed by the symbol value to be used, 1043. In the latter example, when only one symbol value exists, this can indicate that only one data value exists in the quantized coefficient set data.

[0299] Example entropy coding - RLE state machine

[0300] Figure 10F shows a state machine 1050 that can be used by a run-length decoder, for example the run-length decoder 1005 in Figure 10A. The run-length decoder is configured to read the set of run-length encoded data byte by byte. The state machine 1050 has three states: a run-length coded (RLC) residual least significant bit (LSB) state 1051; a run-length coded (RLC) residual most significant bit (MSB) state 1052; and a run-length coded (RLC) zero run state 1053. Different run-length encoders and decoders can be used for different types of data. For example, different run-length encoding and decoding configurations can be used for each of the following: coefficient groups, temporal signaling coefficient groups, and entropy-encoded pieces of data.

[0301] In some instances, the use of prefix or Huffman coding may be optional and signaled in the header (e.g., using the rle_only flag). The input to the RLE decoder may include a byte stream of Huffman-decoded data if Huffman coding is used (e.g., the rle_only flag is equal to 0), or a byte stream of raw data if Huffman coding is not used (e.g., the rle_only flag is equal to 1). The output of the RLE decoder may include a stream of quantized transform coefficients. In one case, these coefficients may belong to the blocks indicated in Figure 9A (e.g., by plane, level, and layer indices, described in later examples as the variables planeIdx, levelIndex, and layerIndex), or may form temporal signaling streams (temporal blocks that form part of the temporal layer used to implement temporal prediction, as described in later examples).

[0302] The run-length state machine 1050 of Figure 10F can be used to implement an RLE decoder for coefficient groups. The run-length state machine 1050 can be used by the Huffman encoding and decoding processes to determine which Huffman code to use for the current symbol or codeword. The RLE decoder uses the run-length state machine 1050 to decode sequences of zeros. It also decodes the frequency table used to construct the Huffman tree for Huffman decoding.

[0303] By construction, the first byte of data is guaranteed to be in state 1051 (i.e., the RLC residual LSB state). The RLE decoder uses state machine 1050 to determine the state of the next data byte based on the content of the received stream. The current state tells the decoder how to interpret the current data byte. Figures 10G, 10H and 10I show how the RLE decoder in this example is configured to interpret bytes.

[0304] As shown in Figure 10F, state machine 1050 has three states:

[0305] RLC residual LSB state 1051: This is where state machine 1050 begins. For bytes in the received stream, this state 1051 expects the 6 less significant bits (bits 6 to 1) to encode a non-zero element value. Figure 10G shows an example of a byte 1070 divided as expected by this state. The run bit 1071 indicates that the next byte encodes a count of a run of zeros. The element value is encoded in the data portion 1072. If the element value does not fit within the 6 data bits, the overflow bit 1073 is set, which in this example is the least significant bit of the byte (e.g., the overflow bit is set to 0 if there is no overflow and to 1 if there is an overflow). If the run bit 1071 is 0 and the overflow bit 1073 is 0, state machine 1050 remains in the RLC residual LSB state 1051. When the overflow bit 1073 is set (e.g., to 1), the state for the next byte moves to the RLC residual MSB state 1052, as indicated by arrow 1074 and as described below. The lower half of Figure 10G therefore shows a byte in the RLC residual LSB state 1051 that causes this state transition. When the overflow bit is set, as shown at 1075, the next state cannot be a run of zeros, so bit 7 can instead be used to encode data, as shown in the data portion 1076.

[0306] RLC residual MSB state: This state (shown as 1052) encodes bits 7 to 13 of element values that do not fit within the 6 data bits. The layout of a byte 1080 for the RLC residual MSB state is shown in Figure 10H. The data portion 1082 occupies the seven least significant bits. In this example, bit 7, the run bit 1081, encodes whether the next byte is a run of zeros. If the run bit is set (for example, to 1), the state transitions to the RLC zero run state 1053.

[0307] RLC zero run state: This state (shown as 1053) encodes 7 bits of a zero run count. The layout of a byte 1085 for the RLC zero run state 1053 is shown in Figure 10I. The data portion 1087 is provided in the seven least significant bits. The most significant bit 1086 is the run bit. The run bit is high if more bits are needed to encode the count. If the run bit is high (e.g., 1), state machine 1050 remains in the RLC zero run state 1053. If the run bit is low (e.g., 0), state machine 1050 transitions to the RLC residual LSB state 1051. In the RLC residual LSB state 1051, if the run bit is high (e.g., 1) and the overflow bit is low (e.g., 0), state machine 1050 transitions from the RLC residual LSB state 1051 to the RLC zero run state 1053.
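The three states can be sketched as a small byte-wise decoder in Python. This is an illustrative sketch under several assumptions that the text does not state: the MSB state returns to the LSB state when its run bit is 0; multi-byte zero counts are accumulated most significant group first; and the decoded element values are returned as raw unsigned numbers, with any mapping to signed coefficient values left out. The function name `rle_decode` is hypothetical.

```python
def rle_decode(data):
    """Decode a byte sequence using the three-state scheme described above."""
    LSB, MSB, ZERO_RUN = "lsb", "msb", "zero_run"
    state = LSB            # the first byte is always in the LSB state
    out = []
    pending = 0            # low 7 bits of a value that overflowed
    zero_count = 0
    for byte in data:
        if state == LSB:
            if byte & 0x01:                      # overflow bit set
                pending = (byte >> 1) & 0x7F     # bits 7..1 carry data
                state = MSB
            else:
                out.append((byte >> 1) & 0x3F)   # bits 6..1 carry data
                state = ZERO_RUN if byte & 0x80 else LSB
        elif state == MSB:
            out.append(pending | ((byte & 0x7F) << 7))  # value bits 7..13
            state = ZERO_RUN if byte & 0x80 else LSB    # else-branch assumed
        else:  # ZERO_RUN
            zero_count = (zero_count << 7) | (byte & 0x7F)  # assumed order
            if not byte & 0x80:                  # run bit low: count complete
                out.extend([0] * zero_count)
                zero_count = 0
                state = LSB
    return out
```

For instance, the bytes `10, 134, 4, 145, 1` decode to the element values 5 and 3, a run of four zeros, and then 200, which needs the overflow path because it does not fit in 6 bits.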

[0308] In this example, a frequency table is created for each state for use by the Huffman encoder. To ensure that the decoder starts in a known state, the first symbol in the encoded stream is always a residual. Bit values can, of course, be inverted (0/1, 1/0, etc.) without affecting functionality. Similarly, the positions of the flag bits within the symbols or bytes are purely illustrative.

[0309] Temporal prediction and signaling

[0310] The following describes some variations and implementation details of temporal prediction, including some aspects of temporal signaling.

[0311] In some examples described herein, information from two or more video frames associated with different time samples may be used. This can be described as a temporal mode, for example because it involves information from different times. Not all embodiments use the temporal aspect. The examples of Figures 1 to 5C show components used for temporal prediction. As described herein, the step of encoding one or more sets of residuals may use a temporal buffer arranged to store information related to the previous video frame. In one example, the step of encoding a set of residuals may include deriving a set of temporal coefficients from the temporal buffer and modifying the current set of coefficients using the retrieved set of temporal coefficients. In these examples, "coefficients" may include transformed residuals, such as those defined with reference to one or more coding units of a frame of a video stream, and the method may be applied to both residuals and coefficients. In some cases, the modification may include subtracting the set of temporal coefficients from the current set of coefficients. This method may be applied to multiple sets of coefficients, such as those relating to the level 1 stream and those relating to the level 2 stream. Modification of the current set of coefficients may be performed selectively, for example with reference to coding units within a frame of video data.

[0312] Temporal aspects can be applied at both the encoder and the decoder. The use of a temporal buffer is shown in the encoder 300 of Figures 3A and 3B and in the decoders 580 and 590 of Figures 5B and 5C. As described herein, the current set of coefficients may undergo one or more of grading and transformation before being modified. In one case, the dequantized transformed coefficients $dqC_{x,y,n-1}$ from the previously coded frame (n-1) at the corresponding location (e.g., the same location or a mapped location) are used to predict the coefficients $C_{x,y,n}$ in the frame (n) to be coded. If a 4×4 transform is used, x and y can be in the range [0, 3]; if a 2×2 transform is used, x and y can be in the range [0, 1]. The dequantized coefficients can be generated by an inverse quantization block or operation. For example, in Figure 3B, the dequantized coefficients are generated by the inverse quantization component 372.

[0313] In some examples, there may be at least two temporal modes.

[0314] • A first temporal mode that uses no temporal buffer, or that uses a temporal buffer with all-zero values. The first temporal mode can be considered an intra-frame mode because it only uses information from the current frame. In the first temporal mode, after any applied grading and transformation, the coefficients can be quantized without modification based on information from one or more previous frames.

[0315] • A second temporal mode that uses the temporal buffer, for example a temporal buffer with potentially non-zero values. The second temporal mode can be considered an inter-frame mode because it uses information from outside the current frame, such as from multiple frames. In the second temporal mode, after any applied grading and transformation, the dequantized coefficients from the previous frame can be subtracted from the coefficients to be quantized: $C_{x,y,n,inter} = C_{x,y,n} - dqC_{x,y,n-1}$.

[0316] In one case, the first temporal mode can be applied by performing the subtraction with a set of zeroed temporal coefficients. In another case, the subtraction can be performed selectively based on temporal signaling data. Figures 11A and 11B show example operations in the encoder for the two respective temporal modes. The first example 1100 in Figure 11A shows a set of coefficients $C_{x,y,n,intra}$ generated by the encoding component 1102 in the first temporal mode. These coefficients are then passed for quantization. In Figure 11B, the encoding component 1112 generates a set of coefficients $C_{x,y,n,inter}$ in the second temporal mode by means of subtraction 1114. These coefficients are then passed for quantization. The quantized coefficients in both cases are subsequently encoded as described with reference to Figures 3A and 3B. Note that in other examples, the temporal mode may be applied after quantization or at another point in the encoding pipeline.
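The two temporal modes can be expressed as one subtraction, with the first mode using a zeroed buffer. The Python sketch below assumes a coding unit is a list of rows of integer coefficients; the function name `apply_temporal_mode` and the boolean `inter` flag are hypothetical.

```python
def apply_temporal_mode(coeffs, temporal_buffer, inter):
    """Return the coefficients passed on for quantization.

    inter=True:  second temporal mode, C - dqC from the temporal buffer.
    inter=False: first temporal mode, equivalent to subtracting zeros.
    """
    if not inter:
        # First temporal mode: use a zeroed set of temporal coefficients.
        temporal_buffer = [[0] * len(row) for row in coeffs]
    return [[c - t for c, t in zip(c_row, t_row)]
            for c_row, t_row in zip(coeffs, temporal_buffer)]
```

When the current coefficients are close to the buffered ones, the second mode yields mostly small or zero values, which is what makes it cheaper to encode in those regions.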

[0317] Each of the two temporal modes can be signaled. Temporal signaling can be provided between the encoder and the decoder. The two temporal modes can be selectable within the video stream; for example, different modes can be applied to different parts of the video stream (e.g., different encoded pictures and/or different areas within a picture, such as tiles). The temporal mode can also, or alternatively, be signaled for the entire video stream. Temporal signaling can form part of, for example, the metadata transmitted from the encoder to the decoder. Temporal signaling can be encoded.

[0318] In one case, a global configuration variable can be defined for a video stream, for example for multiple frames within the video stream. For example, this could be a `temporal_enabled` flag, where a value of 0 indicates the first temporal mode and a value of 1 indicates the second temporal mode. In other cases, as well as or instead of a global configuration value, a flag indicating the temporal mode can be assigned to each frame or "picture" within the video stream. If the `temporal_enabled` flag is used as a global configuration variable, it can be set by the encoder and passed to the decoder.

[0319] In some cases, one or more portions of a frame in a video stream may be assigned a variable indicating the temporal mode of that portion. For example, the portion may be a coding unit or block, such as a 2×2 or 4×4 region transformed by a 2×2 or 4×4 transform matrix. In some cases, each coding unit may be assigned a variable indicating the temporal mode. For example, a value of 1 may indicate the first temporal mode (e.g., the unit is an "intra-frame" unit), and a value of 0 may indicate the second temporal mode (e.g., the unit is an "inter-frame" unit). The variable associated with each portion can be signaled between the encoder and the decoder. In one case, this can be done by setting one of the transform coefficients to the variable value, for example by setting the H coefficient of a 2×2 coding unit or the HH coefficient of a 4×4 coding unit to the variable value (e.g., 0 or 1). In another case, each coding unit may include metadata and/or sideband signaling indicating the temporal mode. The former case is shown in example 1120 of Figure 11C. In this example 1120, there are four coefficients 1122 generated by a 2×2 transform. These four coefficients 1122 can be generated by transforming a 2×2 coding unit of residuals (e.g., for a given plane). When a Hadamard transform is used, the four coefficients can be referred to as the A, H, V, and D components 1124, which represent the average, horizontal, vertical, and diagonal aspects within the coding unit, respectively. In example 1120 of Figure 11C, the H component is used to signal the temporal mode, as shown at 1126.

[0320] Temporal processing can be selectively applied at the encoder and/or decoder based on the indicated temporal mode. Temporal signaling carried in metadata and/or in a sideband channel for portions of a frame of an enhancement stream can be encoded, for example with run-length encoding, to reduce the size of the data to be transmitted to the decoder. Run-length encoding can be advantageous for small portions, such as coding units and/or tiles, where there are only a few temporal modes (e.g., because this metadata may comprise streams of '0's and '1's with runs of repeated values).

[0321] A temporal mode can be signaled for one or both of the two enhancement streams (e.g., at level 2 and/or level 1). For example, in one case, a temporal mode may be applied at LoQ2 (i.e., level 2) but not at LoQ1 (i.e., level 1). In another case, a temporal mode may be applied at both LoQ2 and LoQ1. The temporal mode can be signaled independently for each enhancement level (e.g., as discussed above). Each enhancement level may use a different temporal buffer. For LoQ1, the default may be to use no temporal mode (e.g., a value of 0 indicates that no temporal feature is used, and a value of 1 indicates that a temporal mode is used). Whether a temporal mode is used at a particular enhancement level may depend on the capabilities of the decoder. The temporal modes of operation described herein can be applied similarly at each enhancement level.

[0322] Temporal processing at the encoder

[0323] In some cases, the cost of each temporal mode can be estimated for at least a portion of the video. This can be performed at the encoder or in a different device. In some cases, the temporal mode with the lower cost is selected and signaled. In the encoder, this can be performed by the temporal mode selection block shown in Figures 3A and 3B. The decoder can then decode the signaling and apply the selected temporal mode, e.g., as indicated by the encoder.

[0324] The cost evaluation can be performed on a per-frame basis and/or a per-portion basis, such as per tile and/or per coding unit. In the latter case, the result of the cost evaluation can be used to set the temporal mode variable for the coding unit before quantization and encoding.

[0325] In some cases, a map can be provided that indicates an initial temporal mode for a frame, or for a set of portions of a frame, of the video. This map can be used by the encoder. In one case, a `temporal_type` variable can be obtained by the encoder for use in cost estimation, as described in more detail below.

[0326] In one case, the cost used to select a temporal mode can be controllable, for example, by setting parameters in a configuration file. In another case, the cost used to select a temporal mode can be based on the difference between the input frame and one or more sets of residuals (e.g., as reconstructed). In yet another case, the cost function can be based on the difference between the input frame and the reconstructed frame. The cost of each temporal mode can be evaluated, and the mode with the lowest cost can be selected. The cost can be calculated as a sum of absolute differences (SAD). The cost can be evaluated in this manner per frame and/or per coding unit.

[0327] For example, a first cost function can be based on $J_o = \sum |I_{x,y,n} - R_{x,y,n,o}|$, where $I_{x,y,n}$ is the input value, $R_{x,y,n,o}$ is the reconstructed residual, and $o$ is intra or inter (i.e., indicating the first or second temporal mode). The cost function can be evaluated using the reconstructed residuals from each temporal mode, and the results can then be compared across the temporal modes. A second cost function can add a term that penalizes non-zero quantized coefficients and/or a term based on the values of one or more directional components, if these components are used for signaling (e.g., after the transform). In this second case, the cost function can be based on $J_o = \sum |I_{x,y,n} - R_{x,y,n,o}| + \text{step\_widthAA} \cdot \sum \big( (qC_{x,y,n,o} \neq 0) + ((o = \text{intra}) \,\&\, (qC_{0,3,n,\text{intra}} = 0)) \big)$, where step_widthAA is a configurable weight or multiplier that can be tuned empirically, $qC_{x,y,n,o}$ is a quantized coefficient, and $qC_{0,3,n,\text{intra}}$ is the coefficient relating to the H (for a 2×2 transform) or HH (for a 4×4 transform) element. In other cases, when sideband signaling is used, the cost of setting these bits to 1 can be incorporated into the second cost function. For the first temporal mode (e.g., intra mode), the residuals can be reconstructed according to $R_{x,y,n,\text{intra}} = \text{Transform}(dqC_{x,y,n,\text{intra}})$, where "dq" indicates dequantization. For the second temporal mode (e.g., inter mode), the residuals can be reconstructed according to $R_{x,y,n,\text{inter}} = \text{Transform}(dqC_{x,y,n,\text{inter}} + dqC_{x,y,n-1})$. In both cases, "Transform" can indicate the inverse transform of the coefficients. If the transform matrix is self-inverse, a common or shared matrix can be used for both the forward and inverse transforms. As previously mentioned, the temporal mode used can be indicated in signaling information such as metadata and/or set parameter values.
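The two cost functions and the lowest-cost selection can be sketched as follows. This is a minimal illustration over flat lists of values for one coding unit. The function names, the `signal_index` parameter (position of the H or HH coefficient), and the choice to count the signaling penalty once per coding unit are assumptions, since the formula in the text can be read in more than one way.

```python
def sad(inputs, recon):
    """Sum of absolute differences between input values and reconstructed residuals."""
    return sum(abs(i - r) for i, r in zip(inputs, recon))


def cost_first(inputs, recon):
    """First cost function: plain SAD."""
    return sad(inputs, recon)


def cost_second(inputs, recon, q_coeffs, mode, step_width_aa, signal_index=1):
    """Second cost function: SAD plus a weighted penalty.

    The penalty counts non-zero quantized coefficients and, for the intra
    mode only, adds 1 when the signaling coefficient is currently zero
    (assumed to be counted once per coding unit).
    """
    nonzero = sum(1 for q in q_coeffs if q != 0)
    signal_penalty = 1 if (mode == "intra" and q_coeffs[signal_index] == 0) else 0
    return sad(inputs, recon) + step_width_aa * (nonzero + signal_penalty)


def select_mode(costs):
    """Return the mode name with the lowest cost (first entry wins on a tie)."""
    return min(costs, key=costs.get)


inputs = [4, -2, 0, 1]
costs = {
    "intra": cost_second(inputs, [4, -1, 0, 1], [3, 0, -1, 0], "intra", 2),
    "inter": cost_second(inputs, [3, -2, 1, 1], [1, 0, 0, 0], "inter", 2),
}
chosen = select_mode(costs)
```

In this toy case the inter mode has the larger SAD but fewer non-zero coefficients, so the penalty term changes which mode is selected.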

[0328] In one case, the cost can be evaluated at the encoder. For example, the cost can be evaluated by a temporal selection block. In other cases, the cost can be evaluated by a separate entity (e.g., a remote server, during preprocessing of the video stream) and the temporal mode communicated to the encoder and/or decoder.

[0329] If the second temporal mode is selected (e.g., inter-frame processing), the modified quantized coefficients (e.g., those derived from the output of the subtraction block 342 between the transform component 341 and the quantization component 343 in Figure 3B) are sent for entropy encoding. The dequantized values of these coefficients can then be retained for temporal prediction of the next frame, such as frame n+1. Although Figure 3B shows two separate inverse quantization operations for the level 1 stream, it should be noted that in some cases these may comprise a single common inverse quantization operation.
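A minimal sketch of the inter-mode path for one coding unit follows. It assumes a uniform quantizer that truncates toward zero and a buffer that accumulates the prediction plus the dequantized difference, in line with the inter-mode reconstruction described earlier, where the previous frame's dequantized coefficients are added back. The function name and the quantizer are illustrative assumptions.

```python
def encode_unit_inter(coeffs, buffer, step):
    """Encode one coding unit in the second (inter) temporal mode.

    coeffs: transform coefficients of the current frame
    buffer: temporal buffer contents at the same position
    step:   quantization step width (assumed uniform quantizer)
    Returns the quantized differences and the updated buffer contents.
    """
    deltas = [c - b for c, b in zip(coeffs, buffer)]   # temporal prediction (subtraction)
    quantised = [int(d / step) for d in deltas]        # truncation toward zero
    dequantised = [q * step for q in quantised]
    # Retain what the decoder will be able to reconstruct for the next frame.
    new_buffer = [b + d for b, d in zip(buffer, dequantised)]
    return quantised, new_buffer


q, buf = encode_unit_inter([10, -4, 6, 0], [8, -4, 0, 0], 2)
```

Only the quantized differences `q` would go on to entropy encoding; unchanged coefficients produce zeros, which is where the bit-rate saving comes from.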

[0330] Temporal mode selection and temporal prediction can be applied to one or more of the level 2 and level 1 streams shown in Figure 3B (e.g., to one or both sets of residuals). In some cases, the temporal mode can be configured and/or signaled separately for each stream.

[0331] Temporal refresh

[0332] As described in later sections, in some examples, the second temporal mode may use a temporal refresh parameter. This parameter can signal when the temporal buffer is to be refreshed, for example, when a first set of values stored in the temporal buffer is to be replaced with a second set of values. Temporal refresh can be applied at one or more of the encoder and the decoder. The temporal buffer can be any of temporal buffers 124, 144, 230, 250, 345, 361, 424, 444, 530, 550, and 591. For example, in the encoder, the temporal buffer may store the dequantized coefficients of a previous frame, which are loaded when the temporal refresh flag is set (e.g., equal to 1, indicating "refresh"). In this case, the dequantized coefficients are stored in the temporal buffer and used for temporal prediction of future frames (e.g., for subtraction) while the temporal refresh flag of a frame is not set (e.g., equal to 0, indicating "no refresh"). When a frame is received whose associated temporal refresh flag is set to 1, the contents of the temporal buffer are replaced. This can be performed on a per-frame basis and/or applied to portions of a frame, such as tiles or coding units.

[0333] The temporal refresh parameter can be useful for a set of frames representing a slowly changing or relatively static scene; for example, the values stored for the first frame of the set can be used for subsequent frames in the scene. When the scene changes again, the first frame in the set of frames for the next scene can indicate that a temporal refresh is needed again. This can help speed up temporal prediction operations.

[0334] The temporal refresh operation for a temporal buffer can be achieved by setting all values in the temporal buffer to zero.

[0335] The encoder can send a temporal refresh parameter to the decoder, for example, as a binary `temporal_refresh_bit`, where 1 indicates that the decoder is to refresh the temporal buffer for a specific encoded stream (e.g., level 1 or level 2).
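The refresh behavior can be sketched on the decoder side as below. The class and method names are hypothetical, and the accumulation of dequantized values into the buffer is an assumption for illustration; only the zeroing on `temporal_refresh_bit == 1` comes directly from the text.

```python
class TemporalBuffer:
    """Hypothetical temporal buffer for one enhancement level."""

    def __init__(self, size):
        self.values = [0] * size

    def refresh(self):
        """Temporal refresh: set every stored value to zero."""
        self.values = [0] * len(self.values)

    def apply(self, dequantised, temporal_refresh_bit):
        """Combine newly dequantized coefficients with the buffer.

        When the refresh bit is 1 the buffer is zeroed first, so the
        result equals the new coefficients alone.
        """
        if temporal_refresh_bit == 1:
            self.refresh()
        self.values = [b + d for b, d in zip(self.values, dequantised)]
        return list(self.values)


buf = TemporalBuffer(4)
first = buf.apply([1, 2, 3, 4], 0)
second = buf.apply([1, 0, 0, 0], 0)
third = buf.apply([5, 5, 5, 5], 1)
```

Because a refreshed buffer contributes only zeros, a frame decoded right after a refresh behaves as if no temporal prediction were applied.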

[0336] Temporal estimation and refresh for tiles

[0337] As described herein, in some examples, data can be grouped into tiles, such as 32×32 blocks of an image. In this case, temporal refresh operations, such as those described above, can be performed tile by tile for a frame, e.g., where coefficients are stored in the temporal buffer and can be addressed by tile. Mechanisms for tiled temporal refresh can be applied asymmetrically at the encoder and decoder.

[0338] In one case, temporal processing can be performed at the encoder to determine the temporal refresh logic on a per-frame or per-block/coding-unit basis. In some cases, the signaling for temporal refresh at the decoder can be adapted to save on the number of bits transmitted from the encoder to the decoder.

[0339] Figure 12A shows an example 1200 of temporal processing that can be performed at the encoder. Figure 12A shows a temporal processing subunit 1210 of an example encoder. This encoder can be based on the encoder 300 or 360 of Figure 3A or 3B. The temporal processing subunit receives a set of residuals indicated as R. These can be level 2 or level 1 residuals as described herein. They can comprise a set of ranked and filtered residuals or a set of unranked and unfiltered residuals. The temporal processing subunit 1210 outputs a set of quantized coefficients indicated as qC, which can then be entropy encoded. In the present example, the temporal processing subunit 1210 also outputs temporal signaling data indicated as TS for transmission to the decoder. The temporal signaling data TS can be encoded together with the quantized coefficients or separately. The temporal signaling data TS can be provided as header data and/or as part of a sideband signaling channel. In one case, the temporal data can be encoded as a separate surface that is transmitted to the decoder.

[0340] In example 1200 of Figure 12A, the residuals (R) are received by a transform component 1212. This may correspond to the transform component in other examples, such as one of the transform components 322 and 341 in the encoders of Figures 3A and 3B. The transform component 1212 outputs transform coefficients (i.e., transformed residuals) as described herein. The temporal processing subunit 1210 also includes a central temporal processor 1214. This also receives metadata in the form of a tile-based temporal refresh parameter `temporal_refresh_per_tile` and an estimate of the temporal mode `initial_temporal_mode`. The temporal mode estimate can be provided per coding unit of a frame, and the tile-based temporal refresh parameter can be provided per tile. For example, if a 2×2 transform is used, a coding unit relates to a 2×2 region, there are 16×16 such regions in a 32×32 tile, and therefore there are 256 coding units per tile. The metadata can be generated by another subunit of the encoder, for example, in a preprocessing operation, and/or can be supplied to the encoder, for example, via a web application programming interface (API).

[0341] In example 1200 of Figure 12A, the temporal processor 1214 receives the metadata and is configured to determine the temporal mode for each coding unit and the value of the temporal refresh bit for the entire frame or picture. The temporal processor 1214 controls the application of a temporal buffer 1222. The temporal buffer 1222 may correspond to the temporal buffers of the previous examples mentioned above. The temporal buffer 1222 receives dequantized (inverse quantized) coefficients from an inverse quantization component 1220, which may correspond to one of the inverse quantization components 372 or 364 in Figures 3A and 3B. The inverse quantization component 1220 is in turn communicatively coupled to the output of a quantization component 1216, which may correspond to one of the quantization components 323 or 343 in Figures 3A and 3B. The temporal processor 1214 can implement some of the functions of the temporal mode selection components 363 or 370 shown in Figures 3A and 3B. Although Figure 12A shows a particular coupling between the quantization component 1216, the inverse quantization component 1220, and the temporal buffer 1222, in other examples the temporal buffer 1222 may receive the output of the temporal processor 1214 before quantization, in which case the inverse quantization component 1220 may be omitted. Figure 12A also shows a temporal signaling component 1218 that generates the temporal signaling TS based on the operation of the temporal processor 1214.

[0342] Figure 12B shows a corresponding example 1230, e.g., as implemented at the decoder, where the decoder receives a `temporal_refresh` bit per frame and a `temporal_mode` bit per coding unit. As discussed above, in some cases, the temporal mode for each coding unit can be set within the encoded coefficients, for example, by replacing the H or HH value within the coefficients. In other examples, the temporal mode for each coding unit can be transmitted via additional signaling information, such as via a sideband and/or as part of the frame metadata.

[0343] In example 1230 of Figure 12B, a temporal processing subunit 1235 is provided at the decoder. This may implement at least a portion of a level 1 or level 2 decoding component. The temporal processing subunit 1235 includes an inverse quantization component 1240, an inverse transform component 1242, a temporal processor 1244, and a temporal buffer 1248. The inverse quantization component 1240 and the inverse transform component 1242 may be implementations of the inverse quantization components 572, 582 and the inverse transform components 573, 583 shown in Figures 5B and 5C. The temporal processor 1244 may correspond to the functionality applied by the temporal prediction component 585 and the third summing component 594, or by the temporal prediction component 585 and the fourth summing component 595. The temporal buffer 1248 may correspond to one of the temporal buffers 550 or 591. Figure 12B also shows a temporal signaling component 1246 that receives data 1232, which in this example is indicated in a set of headers H of the bitstream. These headers H may correspond to the headers 556 of Figure 5C. It should be noted that the temporal processing subunits 1210 and 1235 may, in some cases, be implemented with corresponding encoders and decoders that differ from those in other examples herein.

[0344] In some cases, when a temporal mode is enabled, such as when set by the global `temporal_enabled` bit, the temporal processor 1214 of Figure 12A is configured to use the tile-based temporal refresh parameter `temporal_refresh_per_tile` and the temporal mode estimate `initial_temporal_mode` to determine values for the temporal mode of each coding unit and for the temporal refresh bit of the entire frame that improve the communication efficiency between the encoder and decoder.

[0345] In one case, the temporal processor can determine costs based on the `initial_temporal_mode` estimate and use these costs to set the values transmitted to the decoder.

[0346] In one case, the temporal processor may initially determine whether a per-frame refresh should be performed and signaled, based on the percentages of the different estimated temporal modes across the set of coding units in the frame, e.g., where the coding units have initial estimates of the temporal mode. For example, first, all coding units of both estimated temporal modes (e.g., elements associated with 2×2 or 4×4 transforms) can be ignored if they have a zero sum of absolute differences (e.g., where there are no residuals). The refresh bit of the frame can then be estimated based on the proportions (e.g., percentages) of the non-zero coding units. In some examples, the refresh operation for the temporal buffer contents can be set based on the percentage of coding units initially estimated to relate to the first temporal mode. For example, if more than 60% of the coding units are estimated to relate to the first temporal mode when `temporal_refresh_per_tile` is not set, or if more than 75% of the coding units are deemed to relate to the first temporal mode when `temporal_refresh_per_tile` is set, the temporal buffer 1222 can be refreshed for the entire frame (e.g., by zeroing the values within the buffer) and appropriate signaling set for the decoder. In these cases, even if temporal processing is enabled (e.g., via `temporal_enabled` signaling), any subtraction is performed relative to the zeroed values within the temporal buffer 1222, so temporal prediction at the decoder is suppressed, similar to the first temporal mode. This can be used to revert to the first temporal mode based on changes within the video stream (e.g., if it is a live stream), even if a second temporal mode with temporal prediction is signaled. This can improve viewing quality.
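The per-frame refresh decision can be sketched as a threshold test on the share of intra-estimated coding units. The function name, the input representation (parallel lists of estimated modes and SAD values), and the behavior when every coding unit has a zero SAD are assumptions; the 60% and 75% figures are the example thresholds given in the text.

```python
INTRA = 1  # first temporal mode
INTER = 0  # second temporal mode


def frame_refresh_needed(initial_modes, sads, temporal_refresh_per_tile):
    """Decide whether to refresh the temporal buffer for the whole frame.

    initial_modes: estimated temporal mode per coding unit (INTRA or INTER)
    sads:          sum of absolute differences per coding unit
    Coding units with a zero SAD (no residuals) are ignored.
    """
    considered = [m for m, s in zip(initial_modes, sads) if s != 0]
    if not considered:
        return False  # assumption: nothing to refresh for when no residuals exist
    intra_share = sum(1 for m in considered if m == INTRA) / len(considered)
    threshold = 0.75 if temporal_refresh_per_tile else 0.60
    return intra_share > threshold


modes = [INTRA, INTRA, INTRA, INTER, INTER]
sads = [5, 3, 2, 4, 0]
without_tiles = frame_refresh_needed(modes, sads, temporal_refresh_per_tile=False)
with_tiles = frame_refresh_needed(modes, sads, temporal_refresh_per_tile=True)
```

Here three of the four non-zero units are intra (75%), which exceeds the 60% threshold but not the stricter 75% threshold used when per-tile refresh is available.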

[0347] Similarly, in some cases, even if the second temporal mode is selected for the coding units and signaled to the decoder, if the frame encoded by the base encoder is set as an I or intra frame (e.g., by setting the frame's `temporal_refresh_bit`), then the temporal buffer 1222 is refreshed as described above (e.g., implementing processing similar to the first temporal mode). This can help ensure that, for example, the group of pictures (GoP) boundaries of the encoded base stream are respected when temporal processing is enabled.

[0348] Whether a temporal refresh is performed, for example for a tile, may depend on whether noisy sequences are present together with isolated static edges. The exact form of the cost function may depend on the implementation.

[0349] Returning to the processing performed by the temporal processing subunit 1210 of Figure 12A: after the decision on refreshing the entire frame, a second stage may involve tile-based processing based on the `temporal_refresh_per_tile` bit value. This can be performed per tile for a given set of tiles of the frame. If `temporal_refresh_per_tile` is used, and if the flag `temporal_refresh_per_tile` is set in the metadata received by the temporal processor, the following processing can be performed.

[0350] In a first sub-stage, it can be checked whether the temporal buffer for a given tile is empty. If so, all temporal signals in the tile are zero, and the coding units in this tile are encoded in the second temporal mode (e.g., inter coding); for example, the temporal mode for each unit is set to the second mode, further temporal processing for this mode is performed at the encoder, and the temporal mode is signaled to the decoder (e.g., by setting coefficient values or via sideband signaling). Because the temporal buffer is empty, this effectively encodes the tile according to the first temporal mode (e.g., intra coding). If the second temporal mode (e.g., inter mode) is set via a 0 value in the temporal mode bits, this approach can reduce the number of bits that need to be sent to the decoder in cases where the temporal buffer will be empty.

[0351] If the flag `temporal_refresh_per_tile` is not set for a given tile, the first coding unit in the tile can be encoded according to the second temporal mode (e.g., as an inter unit), and the temporal signaling for this tile is not set. In this case, the cost evaluation described above is performed for the other coding units within the tile (e.g., the first or second temporal mode can be determined based on a sum of absolute differences (SAD) metric). For these other coding units, the initially estimated temporal mode information is recalculated based on the current (e.g., live) encoding conditions. All other coding units in the tile can undergo the above procedure and cost evaluation steps. Encoding the first coding unit in the tile in the second temporal mode can be used to instruct the initial temporal processing at the decoder (e.g., to instruct the initial refresh of the tile), with temporal processing for the other coding units then performed at the decoder based on the confirmed value of the `temporal_mode` bit set for each coding unit.

[0352] If the flag `temporal_refresh_per_tile` is set for a given tile and the temporal buffer for the tile is not empty, the temporal processor can arrange a temporal refresh for the tile, with the temporal signaling then set to instruct this at the decoder. This can be done by setting the temporal mode value of the first coding unit to 1 and the temporal mode values of all other coding units to 0. This pattern of 1 in the first coding unit and 0 in the other coding units indicates to the decoder that a refresh operation is to be performed for the tile, while reducing the information to be transmitted. In this case, the temporal processor effectively ignores the temporal mode values and encodes all coding units according to the first temporal mode (e.g., as intra coding units without temporal prediction).

[0353] Therefore, in these examples, when `temporal_refresh_per_tile` is set as part of the encoder metadata, the first coding unit can be used to instruct the decoder to clean (i.e., empty) its corresponding temporal buffer at the location of the tile, and the encoder logic can apply temporal processing according to the appropriate temporal mode.

[0354] The above approaches allow temporal prediction to be performed on a per-tile basis, based on the coding units within the tile. The configuration of a given tile can be set via a single coding unit within the tile. These approaches can be applied to one or more of the level 2 and level 1 streams, for example, to one or more of the sets of residuals.

[0355] In some cases, the global parameter `temporal_tile_intra_signalling` can be set for the video stream to indicate that the tile refresh logic described above will be used at the decoder.

[0356] Initial temporal mode flag

[0357] In some examples, `initial_temporal_mode` data can be provided for multiple frames (e.g., for the current frame and the next frame). In these examples, the `initial_temporal_mode` estimate for the next frame (e.g., frame n+1) can also be used to remove quantized values that are not considered important, in order to reduce the bit rate. The estimated temporal mode information can be used to control comparisons with one or more thresholds that trigger the removal of quantized values (e.g., at one of the quantization components 323 and 343 in Figure 3A or 3B, at one of the temporal mode selection components 363 and 370, or at the RM L-1 control components 324 and 365).

[0358] In some cases, if the `initial_temporal_mode` estimated for the coding unit at the same location in the next frame relates to the first temporal mode (e.g., intra mode), it can be assumed that the residuals to be encoded in the current coding unit will disappear in the next frame, and therefore residuals less than or equal to a given threshold can be removed. As an example, in a test case, this threshold can be set to 2, meaning that all quantized values with a magnitude smaller than 3 are removed from the coding unit.
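A minimal sketch of this pruning step is shown below. The function name is hypothetical, and treating "removed" as "set to zero" is an assumption; the default threshold of 2 is the test-case value quoted in the text.

```python
def prune_quantised(q_values, next_frame_is_intra, threshold=2):
    """Zero out small quantized values when the co-located coding unit in
    the next frame is estimated to use the first (intra) temporal mode.

    Values with |q| <= threshold are removed (set to zero here, by assumption).
    """
    if not next_frame_is_intra:
        return list(q_values)
    return [0 if abs(q) <= threshold else q for q in q_values]


pruned = prune_quantised([3, -2, 1, -5, 0, 2], next_frame_is_intra=True)
kept = prune_quantised([3, -2, 1, -5, 0, 2], next_frame_is_intra=False)
```

With the threshold at 2, only values of magnitude 3 or more survive, matching the example in the paragraph above.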

[0359] Figure 12C shows an example 1250 of how temporal signaling information can be provided for a frame of residuals 1251. In these examples, a reference to a "frame" can refer to a frame for a specific plane, e.g., where a separate frame of residuals is generated for each of the YUV planes. Thus, the terms "plane" and "frame" may be used interchangeably. The left-hand side of Figure 12C shows how the frame of residuals can be divided into several tiles 1252. The right-hand side of Figure 12C shows how temporal signaling information can be assigned to each tile. For example, circle 1253 indicates the first tile 1254. In frame 1251, the tiles form raster-like rows across the frame. The right-hand side shows the first tile 1254 in more detail.

[0360] The circle 1253 on the right-hand side of Figure 12C illustrates how each tile 1254 comprises several coding units. A coding unit may include one or more residuals. In one case, the coding unit may relate to a block of residuals associated with a transform operation, such as a 2×2 block as described herein, which may relate to a Directional Decomposition transform (DD, described in more detail below), or a 4×4 block as described herein, which may relate to a Directional Decomposition Squared transform (DDS). In Figure 12C, each coding unit within the tile has a `temporal_type` flag 1255 (shown as "TT"), and the tile 1254 has a `temporal_refresh_per_tile` flag 1256 (shown as "TR"). This information can be obtained by the encoder and used to apply temporal encoding as described above.

[0361] Other temporal signaling examples

[0362] As described above, in one case, temporal signaling can be provided "in-stream," for example, as part of the enhancement stream. This can be done by replacing specific coefficients after the transform, for example, by embedding the temporal signaling within the transform coefficients. In one case, a horizontal coefficient (e.g., H in a 2×2 Directional Decomposition transform or HH in a 4×4 Directional Decomposition Squared transform) can be used to signal the temporal mode for a specific coding unit. The horizontal coefficient can be used because this minimizes the effect on the reconstructed signal. In some cases, the effect of the horizontal coefficient can be reconstructed by the inverse transform at the decoder, e.g., based on the data carried by the other coefficients in the coding block.

[0363] In another case, temporal signaling can be performed using metadata. Metadata, as used here, may be a form of sideband signaling, e.g., signaling that does not form part of the base stream or the enhancement stream. In one case, the metadata is transmitted in a separate stream (e.g., by the encoder or a remote server) that is received by the decoder.

[0364] While "in-stream" temporal signaling offers some advantages for compression, sending the temporal data of a frame as a separate chunk of information (e.g., metadata) allows different and potentially more efficient entropy encoding to be used for this information. It also allows temporal control and processing, such as that described above, to be performed without requiring the received enhancement stream data. This allows the temporal buffer to be prepared in advance and makes in-loop temporal decoding a simple addition process.

[0365] In the second temporal mode (e.g., when temporal processing is enabled), three levels of temporal signaling can exist:

[0366] • At the first level, per-frame temporal signals may exist. These may include a per-frame temporal refresh signal, which may be a per-frame refresh bit. If this bit is set, the entire frame can be encoded without temporal prediction. Signals at this level can be used to encode the frame and can be transmitted to the decoder.

[0367] • At the second level, per-tile temporal signals may exist. For example, these can be set for each m by n tile, where m and n can be 32. The per-tile temporal signals may include a per-tile temporal refresh signal, which may be a per-tile refresh bit. If the temporal refresh signal is set for a tile, the entire tile is encoded without temporal information. This level of temporal signaling can be used to encode the frame. In one case, it may not be explicitly signaled to the decoder; in this case, the tile refresh signal can be indicated by the first temporal signal at the third level, as described below. In another case, the per-tile temporal refresh signal can be explicitly signaled to the decoder.

[0368] • At the third level, per-block or per-coding-unit temporal signals may exist. These may include a temporal mode signal for the block, which can be signaled to the decoder. If the per-tile temporal refresh signal is set to 1, and the entire tile is encoded without temporal information (e.g., according to the first temporal mode), this can be signaled to the decoder with a one-bit per-block temporal signal for the first block, which can be set to 1. If the per-tile temporal refresh signal is set to 0, the first transform block (e.g., a 2×2 or 4×4 block) in the tile can be encoded with temporal prediction (e.g., using the temporal buffer). In this case, the per-block temporal signal can be set to 0, indicating that temporal prediction has been used (e.g., encoding according to the second temporal mode). If the per-tile temporal refresh signal is set to 0, then all other transform blocks in the tile may have a one-bit temporal signal, which is set to 1 when the block is encoded without temporal information, and set to 0 when the transform coefficients of the previous frame at the same spatial location are first subtracted from the transform coefficients and the difference is then quantized and passed to the entropy encoder (i.e., when the second temporal mode and the temporal buffer are used).

[0369] Figure 12D shows a representation 1260 of the temporal signals for a 4×4 transform size (e.g., a DDS transform). A 2×2 transform size can be signaled in a corresponding manner. Figure 12D shows a frame (or plane) 1261 of elements 1262 (e.g., derived from residuals) with multiple tiles 1265, 1266 (e.g., similar to Figure 12C). The temporal signals are organized using the tiles 1265 and 1266. For a 4×4 transform and a 32×32 tile, there are 8×8 temporal signals per tile (i.e., 32/4 per side). For a 2×2 transform and a 32×32 tile, there are 16×16 temporal signals per tile (i.e., 32/2 per side). The set of temporal signals for a frame of residuals, e.g., as shown in Figure 12D, can be called a "temporal map". The temporal map can be passed from the encoder to the decoder.
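The arithmetic for the number of temporal signals per tile can be checked with a small helper. The function name is illustrative only; it assumes square tiles and square transform blocks.

```python
def signals_per_tile(tile_size, transform_size):
    """Return (signals per side, total signals) for a square tile.

    One temporal signal is carried per transform block, so a tile of
    tile_size x tile_size elements holds (tile_size / transform_size)^2 signals.
    """
    if tile_size % transform_size != 0:
        raise ValueError("tile size must be a multiple of the transform size")
    per_side = tile_size // transform_size
    return per_side, per_side * per_side


dds = signals_per_tile(32, 4)  # 4x4 transform
dd = signals_per_tile(32, 2)   # 2x2 transform
```

For a 32×32 tile this gives 8 per side (64 signals) for the 4×4 transform and 16 per side (256 signals) for the 2×2 transform.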

[0370] Figure 12D shows how the temporal signal for the first transform block 1268, 1269 within a tile 1265, 1266 can indicate whether the tile is to be processed in the first or second temporal mode. The temporal signal can be a bit indicating the temporal mode. If the bit is set to 1 for the first transform block, as shown for block 1268, this indicates that tile 1265 is to be decoded according to the first temporal mode, for example, without using the temporal buffer. In this case, the bits for the other transform blocks may not be set. This reduces the amount of temporal data transmitted to the decoder. If the temporal signaling bit for the first transform block is set to 0, as shown for block 1269 in Figure 12D, this indicates that tile 1266 is to be decoded according to the second temporal mode, for example, with temporal prediction and use of the temporal buffer. In this case, the temporal signaling bits for the remaining transform blocks are set to 0 or 1, thereby providing a level of temporal control at the (third) per-block level.
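On the decoder side, the interpretation of a tile's temporal bits can be sketched as follows. The function name, the returned dictionary, and the choice to expand a refreshed tile into all-intra block modes are assumptions made for illustration.

```python
def tile_block_modes(tile_bits, blocks_per_tile):
    """Expand the signaled bits for one tile into per-block temporal modes.

    tile_bits holds the bits actually received for the tile, in scan order.
    A leading 1 means the whole tile uses the first temporal mode and its
    temporal buffer region is refreshed; the remaining bits need not be present.
    A leading 0 means every block carries its own bit (1 = intra, 0 = inter).
    """
    if tile_bits[0] == 1:
        return {"refresh": True, "modes": [1] * blocks_per_tile}
    if len(tile_bits) != blocks_per_tile:
        raise ValueError("expected one bit per transform block")
    return {"refresh": False, "modes": list(tile_bits)}


refreshed = tile_block_modes([1], blocks_per_tile=64)
mixed = tile_block_modes([0, 1, 0, 0], blocks_per_tile=4)
```

The single leading bit therefore does double duty: it signals the tile refresh and, when 0, also serves as the first block's own mode bit.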

[0371] Encoding temporal signals

[0372] In some cases, the temporal signaling at the third level, as described above, can be encoded efficiently when it is sent as metadata (e.g., sideband data).

[0373] In the case described above, and for example as shown in Figure 12D, the temporal map of a frame can be sent to a run-length encoder (e.g., where the frame is a "picture" of encoded residuals). Run-length encoding can be used to encode the temporal map efficiently. It can be performed using the same run-length encoder used in one or more of the "entropy encoding" components of the first and second enhancement streams (or a copy of this encoder process). In other cases, a different run-length encoder can be used.

[0374] If run-length encoding is used, several operations can occur when the temporal map is received by the run-length encoder. In one case, if the first temporal signal in the tile is 1, the temporal signaling for the rest of the tile is skipped. This is illustrated by the arrow from the first transform block with a value of 1. If the first temporal signal in the tile is 0, as illustrated for the subsequent tile 1266 in Figure 12D, the tile's temporal signaling bits can be scanned line by line (e.g., along the first row of transform blocks, moving to the next column of transform blocks at each step, and then moving on to the next row of transform blocks). In Figure 12D, each tile has 8 rows and 8 columns. Therefore, for a 0 bit, the iteration runs over the 8 columns of the first row, then over the same 8 columns of the second row, and so on, until all the temporal signals of the transform blocks for that particular tile are encoded.
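The scan order and the skip rule can be sketched as a flattening step that runs before run-length encoding. Tiles are represented here as lists of rows of bits; the function names and this representation are illustrative assumptions.

```python
def scan_tile(tile):
    """Flatten one tile's temporal signals for the run-length encoder.

    tile is a list of rows, each a list of bits (1 = first mode, 0 = second).
    If the first signal is 1, the rest of the tile is skipped; otherwise the
    bits are emitted row by row, left to right.
    """
    if tile[0][0] == 1:
        return [1]
    return [bit for row in tile for bit in row]


def scan_temporal_map(tiles):
    """Concatenate the scanned bits of each tile, in tile order."""
    out = []
    for tile in tiles:
        out.extend(scan_tile(tile))
    return out


tiles = [
    [[1, 0], [0, 0]],  # leading 1: whole tile refreshed, remaining bits skipped
    [[0, 1], [1, 0]],  # leading 0: every block's bit is emitted
]
flat = scan_temporal_map(tiles)
```

A 2×2 grid of signals is used here only to keep the example short; the tiles in Figure 12D hold 8×8 signals.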

[0375] In one case, the run-length encoder for the temporal signals can have two states representing bit values 0 and 1 (i.e., the second temporal mode and the first temporal mode). These can be used to encode runs of 1s and runs of 0s. In another case, the run-length encoder can encode runs byte by byte, using 7 bits per byte to encode the run and using bit 7 to indicate either that more bits are needed to encode the run (when set to 1) or that the context changes. By convention, the first symbol in the stream is always encoded as 0 or 1, so that the decoder can initialize the state machine. A state machine 1280 that can be used is shown in Figure 12E. The data shown in Figure 12D can be referred to as a "temporal surface", i.e., a surface of temporal signaling data.

[0376] The state machine 1280 of Figure 12E has a start state 1281, followed by two subsequent states 1282 and 1283. The run-length decoder for the temporal signaling can read run-length encoded data byte by byte (e.g., the data shown in Figure 12D as encoded by the run-length encoder). By construction, the first byte of data is guaranteed to be in state 1281, which carries the first symbol in the stream. The decoder uses state machine 1280 to determine the state of the next byte of data. The bytes of data can be encoded in a manner similar to bytes 1080 and 1085 in Figures 10H and 10I. In these cases, the first subsequent state is a run state 1282. Its byte can have the most significant bit (bit 7) as a run flag bit (e.g., similar to 1081 in Figure 10H) and the remaining bits (bits 6 to 0, seven in total, similar to 1082 in Figure 10H) as the data part. The run state 1282 encodes 7 bits of a run count. If more bits are needed to encode the count, the run bit is high. State machine 1280 can move from the first symbol state 1281 to the run state 1282 if the run bit and the symbol bit are both 0 or both 1, and from the first symbol state 1281 to the zero-run state 1283 if the run bit and the symbol bit differ (e.g., 0 and 1 or 1 and 0). A run bit value of 0 toggles between the run state 1282 and the zero-run state 1283. The zero-run state 1283 can have a byte structure similar to that shown in Figure 10H (or 10I). The zero-run state encodes 7 bits of a zero-run count. If more bits are needed to encode the count, the run bit is high.
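By way of non-limiting illustration only, the byte-wise idea described above (7 data bits per byte, with bit 7 flagging that more bits are needed for the run) can be sketched in Python. This is a simplified sketch and does not reproduce state machine 1280: it assumes that the first byte carries the first symbol, that runs of 0s and 1s simply alternate, and that the most significant 7-bit group of a run count is sent first. All function names are hypothetical.

```python
def encode_run(count):
    """Encode a positive run count, 7 bits per byte, bit 7 = more bytes follow."""
    if count < 1:
        raise ValueError("run count must be positive")
    groups = []
    while count:
        groups.append(count & 0x7F)
        count >>= 7
    groups.reverse()  # assumption: most significant group first
    last = len(groups) - 1
    return bytes(g | 0x80 if i < last else g for i, g in enumerate(groups))


def decode_run(data):
    """Decode one run count; return (count, number_of_bytes_consumed)."""
    count = 0
    for used, byte in enumerate(data, start=1):
        count = (count << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return count, used
    raise ValueError("truncated run")


def rle_encode(bits):
    """Encode a non-empty list of 0/1 bits: first symbol, then alternating runs."""
    out = bytearray([bits[0]])
    current, run = bits[0], 0
    for b in bits:
        if b == current:
            run += 1
        else:
            out += encode_run(run)
            current, run = b, 1
    out += encode_run(run)
    return bytes(out)


def rle_decode(data):
    """Inverse of rle_encode."""
    symbol, pos, bits = data[0], 1, []
    while pos < len(data):
        count, used = decode_run(data[pos:])
        bits += [symbol] * count
        symbol ^= 1  # runs alternate between 0 and 1
        pos += used
    return bits
```

In this sketch a run of up to 127 symbols occupies a single byte, and longer runs spill into further bytes whose bit 7 is set on all but the last.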

[0377] In one case, the run-length decoder can write the 0 and 1 values into a temporal signal surface array TempSigSurface of size (PictureWidth / nTbs, PictureHeight / nTbs), where nTbs is the transform size (e.g., 2 or 4 in this example). If the value to be written at the write position (x, y) in TempSigSurface is 1 and x%(32 / nTbs) == 0 and y%(32 / nTbs) == 0, then the next write position moves to (x, y+32 / nTbs) when y+32 / nTbs < PictureWidth / nTbs, otherwise it moves to (x+32 / nTbs, 0). Run-length encoding and decoding for the temporal signaling can be implemented in a manner similar to the run-length encoding described for residual data (e.g., with reference to Figures 10A to 10I).

[0378] In one case, the information generated by the run-length encoder can be sent to an entropy encoder, which may comprise a Huffman encoder. The Huffman encoder can write two Huffman codes for each state, together with the Huffman-encoded values, to the metadata stream. Run-length encoding and entropy encoding can therefore use existing entropy encoding components and / or suitably adapted copies of these components (e.g., as suitably initialized threads). This simplifies encoding and decoding because the components can be reused with different configuration information. In some cases, Huffman or prefix coding can be implemented in a similar manner for both residual data and temporal signaling data (e.g., as described with reference to Figures 10A to 10I).

[0379] Example temporal processing flowchart

[0380] Figures 13A and 13B show two halves 1300 and 1340 of a flowchart illustrating a temporal processing method according to an example. The temporal processing method can be executed at the encoder and can implement some of the procedures described above. It can be applied to the frame of residuals shown in Figure 12C.

[0381] At box 1302, a check is made as to whether the current frame of residuals is an I-frame (i.e., an intra-coded frame). If it is, the temporal buffer is refreshed at box 1304, and the current frame of residuals is encoded as an inter-frame at box 1306, with the per-picture signaling set to 1 at box 1308. If it is determined at box 1302 that the current frame of residuals is not an I-frame, the first tile is selected, and a check is performed at box 1310 to determine whether the temporal_refresh_per_tile flag is set (e.g., has a value of 1). This could be the TR variable 1256 shown on the right-hand side of Figure 12C. If the temporal_refresh_per_tile flag is set, then at the next box 1320 the temporal_type flags of the cells within the current tile are analyzed. For example, for the first tile, these could be the temporal_type flags 1255 of the cells shown on the right-hand side of Figure 12C. At the next box 1324, the percentage of I or first temporal mode flag values (e.g., values of '1') is counted. If this is greater than 75%, the temporal buffer is refreshed at box 1328 and the tile is inter-coded at box 1330, with the temporal signals in each tile set to 0 at box 1332. If it is less than 75%, the method proceeds to Figure 13B (e.g., via node A). If temporal_refresh_per_tile is not set (e.g., has a value of 0), a similar process occurs: a check is performed at box 1322 to determine whether more than 60% of the cells within the current tile have their temporal_type flags set to I or first temporal mode (e.g., have a value of '1'). If so, processing continues as for the 75% check (e.g., boxes 1328 to 1332 are executed). If less than 60% of the cells within the current tile have their temporal_type flags set to I or first temporal mode, the method again proceeds to Figure 13B (e.g., via node B).
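By way of non-limiting illustration only, the per-tile percentage check of boxes 1322 and 1324 can be sketched in Python. The function name `tile_needs_refresh` is hypothetical, and the treatment of a share exactly equal to the threshold (here: no refresh) is an assumption, since the description only covers the "greater than" and "less than" cases.

```python
def tile_needs_refresh(temporal_types, refresh_per_tile):
    """Return True when the tile should follow boxes 1328 to 1332.

    temporal_types: 0/1 temporal_type flags of the cells in the tile.
    refresh_per_tile: value of the temporal_refresh_per_tile flag.
    """
    flags = list(temporal_types)
    if not flags:
        raise ValueError("tile has no cells")
    share = sum(flags) / len(flags)  # fraction of cells in I / first temporal mode
    threshold = 0.75 if refresh_per_tile else 0.60
    return share > threshold
```

When the function returns False, processing would continue via node A (flag set) or node B (flag not set) of Figure 13B.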

[0382] Turning to the second half 1340 of the flowchart shown in Figure 13B, and starting from node A on the left-hand side, if less than 75% of the cells have the I or first temporal mode, a check is made at box 1342 as to whether the temporal buffer is empty. If the temporal buffer is empty, the cells within the tile are inter-coded at box 1344, and the temporal signals for the cells in the tile are set to 0 at box 1346. If the temporal buffer is not empty, the cells within the tile are intra-coded at box 1348. In this case, at box 1350, the temporal signal for the first cell is set to 1, and the temporal signals for all other cells in the tile are set to 0.

[0383] Starting now from node B on the right-hand side of Figure 13B, if less than 60% of the cells have the I or first temporal mode, the first cell in the current tile is inter-coded at box 1352, and the temporal signal of the first cell is set to 0 at box 1354. Next, at box 1356, a check is made as to whether the temporal_type of the co-located n+1 cell (i.e., the co-located cell in the next frame) is set to 1. If so, and it is determined at box 1358 that the residual value is less than 2, the residual is removed at box 1360, for example, by setting the residual value to 0. If the residual value is not less than 2 at box 1358, or if the temporal_type of the co-located cell is not set to 1, a determination is made at box 1362 as to whether the next cell in the tile should be intra- or inter-coded based on a cost function. At box 1364, the temporal signal of the next cell can be set according to the cost function classification. This can be repeated for the remaining cells in the tile. The method can be repeated for each tile in the frame, for example, starting with the check of temporal_refresh_per_tile.
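By way of non-limiting illustration only, the residual removal of boxes 1356 to 1360 can be sketched in Python. The function name `prune_residual` is hypothetical, and comparing the magnitude of the residual (rather than its signed value) against the limit of 2 is an assumption of this sketch.

```python
def prune_residual(residual, next_temporal_type, limit=2):
    """Zero a small residual when the co-located cell in the next frame is intra.

    residual: residual value of the current cell.
    next_temporal_type: temporal_type of the co-located cell in frame n+1.
    limit: removal threshold (2 in the described example).
    """
    if next_temporal_type == 1 and abs(residual) < limit:
        return 0  # box 1360: remove the residual
    return residual  # otherwise continue to the cost-function decision (box 1362)
```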

[0384] Cloud configuration

[0385] In some examples, the encoder (or encoding process) can communicate with one or more remote devices. The encoder can be, for example, any of the encoders shown in Figures 1, 3A and 3B, or described in any other example.

[0386] Figure 14A shows an example 1400 of an encoder 1402 communicating across a network 1404. In one case, encoder 1402 can receive configuration data 1406 and / or transmit configuration data 1408 across network 1404. In the example of Figure 14A, the encoder receives configuration data 1406 in the form of one or more of encoder parameters, temporal signaling, and residual masks. The temporal signaling may include any of the temporal signaling discussed herein. Encoder parameters may include values for one or more parameters controlling the encoder. In one case, encoder parameters may include parameters for one or more of the base encoder, the processing components for the level 1 stream, and the processing components for the level 2 stream. Encoder parameters can be used to configure one or more of the stream resolution, quantization, sequence processing, bit rate, and codec for each stream. The residual mask may include weights, for example, from 0 to 1, applied to sets of residuals, such as 2×2 or 4×4 groups (i.e., blocks) of residuals. The residual mask may indicate a priority used for delivering blocks to the decoder and / or for encoding. In another case, the residual mask may include weights controlling the processing of blocks, such as visually enhancing or weighting specific blocks. Weights may be set based on a class (e.g., a label or value) applied to one or more residual blocks.

[0387] In some cases, encoder 1402 may be adapted to perform encoding at multiple bit rates. In this case, encoder parameters can be supplied for each of the multiple bit rates. In some cases, configuration data 1406 received from network 1404 may be provided as one or more of global configuration data, per-frame data, and per-block data. In an example, residual masks and temporal signaling may be provided per frame. The multiple bit rates may be set based on, for example, the available capacity of the communication channel, such as a measured bandwidth, and / or the intended use, such as using 2 Mbps of a 10 Mbps downlink channel.

[0388] Configuration data 1408 transmitted from encoder 1402 may include one or more of a base codec type, a set of required bit rates, and sequence information. The base codec type may indicate the type of base encoder used for the current set of processing. In some cases, different base encoders may be available. In one case, the base encoder may be selected based on a received base codec type parameter; in another case, the base codec type may be selected by local processing within the encoder and transmitted across the network. The set of required bit rates may indicate one or more bit rates to be used to encode one or more of the base stream and the two enhancement streams. Different streams may use different bit rates. The enhancement streams may use additional bandwidth if it is available; for example, if additional bandwidth is not available, the bandwidth may be used by the encoded base and level 1 streams to provide a first level of quality at a given bit rate, and the encoded level 2 stream may then use a second bit rate to provide further improvement. This approach may also be applied differentially to the base and level 2 streams, rather than the base and level 1 streams.

[0389] In one case, encoder parameters received across network 1404 can indicate one or more of the residual mode and the temporal mode to be applied by encoder 1402. The encoder parameters can indicate the mode for each enhancement stream individually, or indicate a common mode used for both enhancement streams. The residual mode parameters can be received by the residual mode selection components 150 and 350 shown in Figures 1, 3A and 3B. In some cases, the residual mode selection component can be omitted, and the residual mode parameters can be received directly by other components of the encoder; for example, the L-1 or L-2 encoding components 122, 142 of Figure 1, or the RM L-1 control and / or RM L-2 selection / ranking components 321, 340 of Figures 3A and 3B, can receive residual mode parameters from the cloud interface of encoder 1402. In some cases, each residual or temporal mode may be indicated by an integer value, such as '1' for temporal processing, and / or a residual mode 2 in which only specific coefficients are retained after the transform operation. The residual mode may indicate what form of predicted coefficient processing will be applied, such as whether a specific coefficient will be predicted, for example, using data from a lower-resolution stream.

[0390] In one case, encoder 1402 may have different configuration settings regarding remote or cloud configuration. In one mode, which may be a "default" mode, the encoder may be configured to make remote program calls across a network to retrieve initial configuration parameters to perform encoding as described herein. In another mode, which may be a "custom" mode, encoder 1402 may retrieve local parameter values indicating a specific user configuration, such as a specific set of tools used by the encoder and / or the configuration of those tools. In one case, encoder 1402 may have different modes indicating which parameters to retrieve from a remote device and which parameters to retrieve from local storage.

[0391] In one case, the temporal signaling can indicate certain processing for a frame of video data, for example, as described above. The temporal signaling can, for example, indicate the temporal mode of a specific frame as described above (e.g., a mode of 1 or 0 indicating an intra or inter frame). Temporal signaling can be provided for one or both of the enhancement streams.

[0392] Figure 14B shows that the encoder 1402 can send configuration data 1406, 1408 to, and / or receive it from, a remote control server 1412. The control server 1412 may include a server computing device implementing an application programming interface for receiving or sending data. For example, the control server may implement a RESTful interface, whereby data is transmitted via (secure) Hypertext Transfer Protocol (HTTP) requests and responses. In another case, a side channel implemented using a specific communication protocol (e.g., at the transport or application layer) may be used for communication between the control server 1412 and the encoder 1402 via the network 1404. The network 1404 may include one or more wired and / or wireless networks, including local area networks (LANs) and wide area networks (WANs). In one case, the network may include the Internet.

[0393] Figure 14C shows an encoder 1432 (which may implement any of the encoders described herein, including encoder 1402 of Figures 14A and 14B) that includes a configuration interface 1434 configured to communicate via a network, for example, with the remote control server 1412. The configuration interface 1434 may include hardware interfaces and / or software, such as Ethernet and / or wireless adapters, providing a communication stack for communication via one or more communication networks. In Figure 14C, configuration parameters and settings 1436 used and / or stored by encoder 1432 are transmitted via network 1404 using configuration interface 1434. Encoder configuration parameters 1438, which may be stored in one or more memories or registers, are received from the configuration interface. In one case, encoder configuration parameters 1438 may control one or more of the downsampling, base encoder, and base decoder components within encoder 1432, as shown in the figures. Configuration interface 1434 also transmits data to an L-1 stream control component 1440 and an L-2 stream control component 1442. These components can configure the use of tools on each enhancement stream. In one case, the L-1 and L-2 stream control components 1440, 1442 control one or more of the residual mode selection, transform, quantization, residual mode control, entropy coding, and temporal processing components (e.g., as shown in the figures and described herein).

[0394] Using a cloud configuration as described herein offers implementation advantages. For example, encoders can be remotely controlled, for instance, based on network control systems and measurements. Encoders can also be upgraded to provide new functionality, for instance, by providing additional data to enhancement processing via firmware upgrades based on measurements or preprocessing supplied by one or more remote data sources or control servers. This provides a flexible way to upgrade and control legacy hardware.

[0395] Residual mode selection

[0396] As described above, for example, with reference to Figures 3A and 3B, different residual processing modes can be implemented in specific examples. For example, in Figure 3A, a residual mode ranking component 350 controls residual mode selection components 321 and 340 in each of the level 1 and level 2 enhancement streams; in Figure 3B, a residual mode selection component 350 controls residual mode ranking components 321 and 340 in each of the level 1 and level 2 enhancement streams. Generally, the encoder may include a residual mode control component that selects and implements a residual mode, and residual mode implementation components that implement the processing for the selected residual mode on one or more enhancement streams.

[0397] In one example, once the residuals have been calculated, they can be processed to determine how they will be encoded and transmitted. As previously described, the residuals are calculated by comparing the original image signal with a reconstructed image signal. For example, in one case, the residuals of the level 2 enhancement stream are determined by subtracting an upsampled output (e.g., in Figures 1, 3A and 3B) from the original image signal (e.g., input video 120, 302 as indicated in the figures). The input to the upsampling can be referred to as a reconstruction of the signal following a simulated decoding. In another case, the residuals of the level 1 enhancement stream are determined by subtracting the image stream output by the base decoder from a downsampled form of the original image signal (e.g., the output of the downsampling components 104 and 304 in Figures 1, 3A and 3B).

[0398] To process residuals, for example, within a selected residual mode, the residuals can be classified. For instance, residuals can be classified in order to select a residual mode. The residual classification process can be performed, for example, based on specific spatial and / or temporal characteristics of the input image.

[0399] In one example, the input image is processed to determine, for each feature (e.g., a pixel or a region containing multiple pixels) and / or group of features, whether the feature and / or group of features possesses specific spatial and / or temporal characteristics. For example, features are measured against one or more thresholds to determine how they are classified with respect to the corresponding spatial and / or temporal characteristics. Spatial characteristics may include the level of spatial activity between specific features or groups of features (e.g., how much change exists between adjacent features), or the contrast between specific features and / or groups of features (e.g., the degree of difference between a group of features and one or more other groups of features). Spatial characteristics may be a measure of change in a set of spatial orientations (e.g., the horizontal and / or vertical orientations of a 2D planar image). Temporal characteristics may include the temporal activity of specific features and / or groups of features (e.g., the degree of difference between co-located features and / or groups of features on one or more previous frames). Temporal characteristics may be a measure of change in a temporal direction (e.g., along a time series). The characteristics may be determined per feature and / or group of features, which may be per pixel and / or per 2×2 or 4×4 residual block.

[0400] The classification can assign a corresponding weight to each element and / or group of elements based on the spatial and / or temporal characteristics of that element and / or group of elements. The weights can be normalized values between 0 and 1.

[0401] In a residual mode, a decision can be made regarding whether to encode and transmit a given set of residuals. For example, in a residual mode, particular residuals (and / or residual blocks, such as the 2×2 or 4×4 blocks described herein) can be selectively forwarded along the level 1 and / or level 2 enhancement processing pipelines by the RM L-x ranking components and / or RM L-x selection components shown in Figures 3A and 3B. In other words, different residual modes can have different residual processing in the level 1 and / or level 2 encoding components 122, 142 of Figure 1. For example, in a residual mode, a specific residual may not be forwarded for further level 1 and / or level 2 encoding; for instance, the specific residual may not undergo transform, quantization, and entropy encoding. In one case, forwarding of a residual may be prevented by setting the residual value to 0 and / or by setting a specific control flag for the residual or for a group containing the residual.

[0402] In one residual mode, binary weights of 0 or 1 may be applied to the residuals, for example, by the components discussed above. This may correspond to a mode where selective residual processing is “on.” In this mode, a weight of 0 may correspond to “ignoring” a particular residual, for example, not forwarding the residual for further processing in the enhancement pipeline. In another residual mode, there may be no weighting (or the weight of all residuals may be set to 1); this may correspond to a mode where selective residual processing is “off.” In yet another residual mode, normalized weights of 0 to 1 may be applied to the residuals or groups of residuals. Such a weight may indicate the importance or “usefulness” of a residual for reconstructing the video signal at the decoder, for example, where 1 indicates that the residual has normal usefulness, and values below 1 reduce the importance of the residual. In other cases, the normalized weights may lie in another range; for example, a range of 0 to 2 allows a weight greater than 1 to give a particular residual increased importance.

[0403] In the residual modes described above, residuals and / or groups of residuals can be multiplied by assigned weights, which can be assigned following a classification process applied to a set of corresponding features and / or groups of features. For example, in one case, each feature or group of features can be assigned a class represented by an integer value selected from a predefined set or range of integers (e.g., 10 classes from 0 to 9). Each class can then have a corresponding weight value (e.g., 0 for class 0, 0.1 for class 1, or some other nonlinear mapping). The relationship between classes and weight values can be determined through analysis and / or experimentation, for example, based on image quality measurements at the decoder and / or encoder. The weights can then be used to multiply the corresponding residuals and / or groups of residuals, for example, the residuals and / or groups of residuals corresponding to the features and / or groups of features. In one case, this correspondence can be spatial, for example, where residuals are calculated from specific input feature values and the classification is applied to those same input feature values to determine the weights of the residuals. In other words, classification can be performed on features and / or groups of features in the input image, where the input image can be a frame of a video signal, but the weights determined from this classification are then used to weight the co-located residuals and / or groups of residuals, rather than the features and / or groups of features. In this way, the characterization can be performed as a process separate from the encoding process, and it can therefore be computed in parallel with the encoding of the residuals.

[0404] Example of residual mode processing

[0405] Figure 15 shows an example of a residual mode. This example relates to the level 2 stream, but a similar set of components can be provided for the level 1 stream. A set of input image features $i_{ij}$ 1501 is classified via a classification component 1502 to generate a set 1503 of class indicators (e.g., in the range of 0 to 4). The class indicators 1503 are then used by a weight mapping component 1504 to retrieve a set 1505 of weights associated with the class indicators 1503. In parallel, a set of reconstructed upsampled features $u_{ij}$ 1506 is subtracted from the input image features $i_{ij}$ 1501 to generate an initial set of residuals $r_{ij}$ 1508. These residuals 1508 and the set of weights 1505 are then input to a weight multiplication component 1509, which multiplies the residuals 1508 by the set of weights 1505 to output a set of modified residuals $r'_{ij}$ 1510. Figure 15 shows that the residual mode selection may involve filtering out a subset 1512 of residual values (e.g., by multiplying them by a weight of 0) and passing through or modifying another subset 1511 of residual values (e.g., where there are non-zero weights).
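By way of non-limiting illustration only, the pipeline of Figure 15 can be sketched in Python. The class-to-weight table and the toy classifier below are placeholders assumed for this sketch: the classifier simply bins the input value, standing in for a classification based on spatial and / or temporal characteristics, and the weight values are not taken from the described examples.

```python
# Hypothetical mapping from class indicator (0..4) to weight.
CLASS_WEIGHTS = {0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}


def classify(feature, thresholds=(16, 64, 128, 192)):
    """Toy stand-in for classification component 1502: returns a class 0..4."""
    return sum(feature >= t for t in thresholds)


def weighted_residuals(inputs, upsampled, weights=CLASS_WEIGHTS):
    """Compute r'_ij = w(class(i_ij)) * (i_ij - u_ij) for co-located features."""
    modified = []
    for row_i, row_u in zip(inputs, upsampled):
        modified.append(
            [(i - u) * weights[classify(i)] for i, u in zip(row_i, row_u)]
        )
    return modified
```

Residuals whose co-located feature falls in class 0 are multiplied by 0 and are thereby filtered out, while the others are passed through or scaled.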

[0406] In some cases, the characterization can be performed at a location remote from the encoder and communicated to the encoder. For example, a pre-recorded movie or television program may be processed once to determine a set of weights for a set of residuals or groups of residuals. These weights may be communicated to the encoder over a network; for example, they may comprise the residual masks described with reference to Figures 14A to 14C.

[0407] In one case, as an alternative or in addition to weighting the residuals, the residuals can be compared against one or more thresholds derived from the classification process. For example, the classification process may determine a set of classes with associated weights and thresholds, or with associated thresholds only. In this case, the residuals are compared against the determined thresholds, and residuals that fall below one or more specific thresholds are discarded and not encoded. For example, an additional thresholding process may be applied to the modified residuals 1510 of Figure 15, and / or the weight mapping and weight multiplication components can be replaced by threshold mapping and threshold application stages. Generally, in both cases, the residuals are modified for further processing based on a classification process, where the classification process can be applied to the corresponding image features.
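By way of non-limiting illustration only, the threshold-based alternative can be sketched in Python. The function name `apply_thresholds`, the per-class threshold table, and the comparison of residual magnitudes (rather than signed values) against the threshold are assumptions of this sketch.

```python
def apply_thresholds(residuals, classes, thresholds):
    """Discard (zero) residuals whose magnitude falls below their class threshold.

    residuals: list of residual values.
    classes: class indicator for the feature co-located with each residual.
    thresholds: hypothetical mapping from class indicator to threshold.
    """
    return [
        r if abs(r) >= thresholds[c] else 0
        for r, c in zip(residuals, classes)
    ]
```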

[0408] The residual mode processing described above can be applied at the encoder but not at the decoder. It therefore represents a form of asymmetric coding that can exploit the greater resources available at the encoder to improve communication. For example, residuals can be weighted to reduce the size of the data transmitted between the encoder and decoder, thereby allowing improved quality at a limited bit rate (e.g., where the discarded residuals have reduced detectability at the decoder).

[0409] Predicted average

[0410] As described herein, a residual feature can be defined as the difference between an input frame feature and the corresponding / co-located upsampled feature, as indicated below:

[0411] $r_{ij} = i_{ij} - u_{ij}$

[0412] At the encoder, the residuals are transformed, then quantized, entropy-encoded, and transmitted to the decoder. Specifically, the encoder uses one of two possible transforms, the first called Directional Decomposition (DD) and the other called Directional Decomposition Squared (DDS). Further details about these transforms are contained in patent applications PCT / EP2013 / 059847 and PCT / GB2017 / 052632, which are incorporated herein by reference.

[0413] Figure 16A shows a process 1600 involving the DD transform at the encoder. In the case of DD, the transform is applied to each 2×2 block of a frame or plane of input data 1610. Figure 16A shows four 2×2 blocks 1611 of input values 1612. These are downsampled by a downsampling process 1615 (e.g., similar to the downsampling components 104 and 304 of Figures 1 and 3A / 3B) to generate a downsampled frame or plane 1620 with feature values 1621. The downsampled frame 1620 is then upsampled by an upsampling process 1625 (e.g., as shown via components 134 and 334 in Figures 1 and 3A / 3B). This produces an upsampled frame 1630 that also has blocks 1631 of upsampled values 1632, where, in Figure 16A, one downsampled value 1622 is upsampled to generate four upsampled values 1632 (i.e., one upsampled block 1631). In Figure 16A, the upsampled frame 1630 is subtracted 1635 from the input frame 1610 to generate a frame of residuals 1640, which includes blocks 1641 of residual values 1642.

[0414] During the transformation, the following coefficients are calculated for each residual block 1641 (for simplicity, the following expressions refer to the top leftmost 2×2 block, but similar expressions can be easily derived for other blocks):

[0415]

[0416]

[0417]

[0418]

[0419] Now observe the average component (A0), which can be decomposed as follows:

[0420]

[0421] It should be noted that each upsampled 2×2 block 1631 of upsampled values 1632 shown in Figure 16A is obtained by an upsampling operation starting from the corresponding lower-resolution feature 1622. This lower-resolution feature 1622 can be referred to as a "control feature". In the case of the top-left block, this feature is $d_{00}$.

[0422] Accordingly, $d_{00}$ can be added and subtracted to obtain the following:

[0423]

[0424] It can then be grouped as follows:

[0425]

[0426] The δA0 (difference average) is shown as 1650 and corresponds to the difference 1645 between the average of the features in the input image (e.g., block 1611) and the control feature 1622. The predicted average PA0 corresponds to the difference between the average of the upsampled features and the control feature. This can be calculated at the decoder.
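By way of non-limiting illustration only, the decomposition described above can be written out for the top-left block as follows. This is a sketch under stated assumptions: the average coefficient is taken to be the plain mean of the four residuals of the block, and the sign of $PA_0$ is chosen so that the average is the sum of the difference average and the predicted average, in line with the relations given later for the DDS case.

```latex
\begin{aligned}
A_0 &= \tfrac{1}{4}\,(r_{00} + r_{01} + r_{10} + r_{11}) \\
    &= \tfrac{1}{4}\,(i_{00} + i_{01} + i_{10} + i_{11})
     - \tfrac{1}{4}\,(u_{00} + u_{01} + u_{10} + u_{11}) \\
    &= \tfrac{1}{4}\,(i_{00} + i_{01} + i_{10} + i_{11}) - d_{00}
     + d_{00} - \tfrac{1}{4}\,(u_{00} + u_{01} + u_{10} + u_{11}) \\
    &= \underbrace{\left[\tfrac{1}{4}\,(i_{00} + i_{01} + i_{10} + i_{11}) - d_{00}\right]}_{\delta A_0}
     + \underbrace{\left[d_{00} - \tfrac{1}{4}\,(u_{00} + u_{01} + u_{10} + u_{11})\right]}_{PA_0}
\end{aligned}
```

Under these assumptions, only $\delta A_0$ depends on the input image, whereas $PA_0$ depends solely on values available at the decoder.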

[0427] Figure 16B shows the corresponding process 1655 at the decoder. A δA value 1658 is received as data from the encoder 1656. In parallel, a level 1 resolution frame 1660 is reconstructed and upsampled 1665 to form an upsampled frame 1666. Figure 16B shows a block 1661 of four lower-resolution elements 1662. These elements correspond to the reconstructed video signal. The upsampled frame 1666 is shown as four blocks 1668, each with four upsampled elements 1669. The decoder is able to calculate PA using the upsampled elements 1668 and the control elements 1662 obtained by decoding the lower-resolution frame (e.g., a frame obtained by decoding with a separate codec such as AVC, HEVC, etc.). In Figure 16B, the predicted average 1671 is determined as the difference 1670 between the average of a block of upsampled elements 1668 and the control element 1662. The original average 1675 can then be reconstructed by summing 1672 the δA value 1658 and the predicted average PA 1671. This is why this element is called the "predicted average": it is the component of the average that can be predicted at the decoder. The decoder therefore only needs the δA, which the encoder can provide because information about the input image frame is known at the encoder.

[0428] Accordingly, when using the DD transform type, the decoder is able to calculate a predicted average using one or more upsampled features and the corresponding feature (“control feature”) from the lower-resolution image that was used to generate the one or more upsampled features. It is then able to decode a value received from the encoder that represents the difference between one or more features in a reference (e.g., input) image and the control feature. It is then able to combine the predicted average with the decoded value to generate one of the transformed coefficients, namely the average coefficient.

[0429] When using the DD transform type, the encoder is able to calculate a value to be transmitted to the decoder that represents the difference between one or more elements in a reference (e.g., input) image and the corresponding element ("control element") from a lower-resolution image. The encoder is able to generate the control element by replicating the operations that the decoder would need to perform to reconstruct the image. Specifically, the control element corresponds to the element that the decoder would use to generate the one or more upsampled elements. The encoder is then able to further transmit the H, V, and D coefficients to the decoder.
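The split of the average coefficient into an encoder-side difference average and a decoder-side predicted average can be illustrated with a small worked example. This is a minimal sketch, not the patented implementation: the function names and sample values are hypothetical, "average" is taken to be the arithmetic mean, and the sign of the predicted average (control element minus the mean of the upsampled elements) is an assumption chosen so that δA + PA equals the mean of the residuals (input minus upsampled).

```python
def mean(values):
    """Arithmetic mean of a flat list of element values."""
    return sum(values) / len(values)


def encoder_delta_average(input_block, control):
    """Encoder side: mean of the input elements minus the control element."""
    return mean(input_block) - control


def decoder_predicted_average(upsampled_block, control):
    """Decoder side: computable without access to the input frame.

    Assumed sign convention: control element minus mean of upsampled elements.
    """
    return control - mean(upsampled_block)


# Hypothetical 2x2 blocks, flattened in raster order.
input_block = [10, 12, 14, 16]      # elements of the original frame
upsampled_block = [11, 11, 12, 12]  # elements produced by upsampling
control = 12                        # lower-resolution ("control") element

delta_a = encoder_delta_average(input_block, control)     # sent by the encoder
pa = decoder_predicted_average(upsampled_block, control)  # derived at the decoder
average = delta_a + pa                                    # reconstructed average

# The reconstructed average matches the mean of the residuals.
residuals = [i - u for i, u in zip(input_block, upsampled_block)]
```

With these values the encoder only has to send δA = 1.0, while the decoder derives PA = 0.5 on its own.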

[0430] In the case of DDS transformation, the operation is slightly modified. The DDS operation on a 4×4 block of residuals generates 16 transformed coefficients. DDS can be implemented in at least two ways. Directly, by summing and subtracting the 16 residuals in the 4×4 block – see below:

[0431]

[0432] Alternatively, and more efficiently, it can be implemented as a "two-step" transformation: first a DD transformation is performed on each 2×2 block of residuals to generate a 2×2 block of DD coefficients, and then a second DD transformation is applied to those coefficients.

[0433] First step:

[0434]

[0435]

[0436]

[0437]

[0438] Second step:

[0439]

[0440] As can be seen, in the DDS case, there are four “average” coefficients, one for each direction: (1) AA, or the average of the average coefficients; (2) AH, or the average of the horizontal coefficients; (3) AV, or the average of the vertical coefficients; and (4) AD, or the average of the diagonal coefficients.
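The two-step structure can be sketched in code. The transform equations are not reproduced in this text, so the 2×2 kernel below (unnormalized sums and differences for A, H, V and D) is an assumption, as are the function names. Coefficient names follow the convention used above, where the first letter is the second-step direction and the second letter is the first-step coefficient type (AH is the average of the horizontal coefficients).

```python
def dd(block):
    """Directional decomposition of a 2x2 block [[r00, r01], [r10, r11]].

    Assumed unnormalized kernel; the patent's own equations may scale differently.
    """
    (r00, r01), (r10, r11) = block
    return {
        "A": r00 + r01 + r10 + r11,
        "H": r00 - r01 + r10 - r11,
        "V": r00 + r01 - r10 - r11,
        "D": r00 - r01 - r10 + r11,
    }


def dds_two_step(residuals):
    """Two-step DDS on a 4x4 block of residuals, returning 16 named coefficients."""
    # First step: apply DD to each of the four 2x2 sub-blocks.
    first = [
        [dd([row[2 * j:2 * j + 2] for row in residuals[2 * i:2 * i + 2]])
         for j in range(2)]
        for i in range(2)
    ]
    # Second step: for each coefficient type, apply DD across the four sub-blocks.
    out = {}
    for kind in "AHVD":
        plane = [[first[i][j][kind] for j in range(2)] for i in range(2)]
        for direction, value in dd(plane).items():
            out[direction + kind] = value
    return out


coeffs = dds_two_step([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
])
```

Under this assumed kernel, AA is simply the sum of all 16 residuals, and the four "average" coefficients are AA, AH, AV and AD.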

[0441] Similar to the DD transform, each of these average coefficients can be decomposed into the difference average (to be calculated by the encoder and decoded at the decoder) and the prediction average (to be calculated by the decoder), as follows:

[0442] AA=δAA+PAA

[0443] AH=δAH+PAH

[0444] AV=δAV+PAV

[0445] AD=δAD+PAD

[0446] Accordingly, there are four difference averages to be calculated by the encoder, namely δAA, δAH, δAV and δAD.

[0447] Using the two-step method defined above, the four difference averages can be calculated as follows:

[0448]

[0449]

[0450]

[0451]

[0452] On the other hand, the predicted averages can be calculated as follows:

[0453]

[0454]

[0455]

[0456]

[0457] where

[0458]

[0459]

[0460]

[0461]

[0462] An alternative way of calculating the predicted averages is to first calculate the predicted average for each 2×2 block and then perform a directional decomposition on the results.

[0463] In other words, the first step is to calculate:

[0464]

[0465] and then:

[0466]

[0467]

[0468]

[0469]

[0470] Accordingly, when using the DDS transformation, the encoder can generate the difference averages δAA, δAH, δAV and δAD, and send them to the decoder together with the other DDS coefficients HA, HH, HV, HD, VA, VH, VV, VD, DA, DH, DV and DD.

[0471] At the decoder, the decoder can calculate PAA, PAH, PAV, and PAD, as explained above. Furthermore, in the current example, it receives the differential averages, decodes them, and then sums them with the predicted averages to obtain the averages AA, AH, AV, and AD. The averages are then combined with other DDS coefficients, the inverse DDS is applied, and the residuals are obtained from the inverse transform.

[0472] Alternatively, since the transformation and inverse transformation are linear operations, the inverse DDS can be performed on the difference averages δAA, δAH, δAV, and δAD together with the other DDS coefficients HA, HH, HV, HD, VA, VH, VV, VD, DA, DH, DV, and DD to obtain residuals, and the PA_ij values can then be added, after the inverse transformation, to the residuals in the corresponding 2×2 blocks.

[0473] Figures 16C and 16D show an encoding process 1680 and a decoding process 1690 that correspond to Figures 16A and 16B respectively, but where the transformation is one-dimensional, for example where downsampling and upsampling are performed in one direction instead of two. For example, this could be the case for a horizontal-only scaling mode that can be used for interlaced signals. This can be seen via the indicated element 1681, where two elements 1682 in block 1683 are downsampled to generate element 1684. The input data elements 1681 and the downsampled ("control") element are then used to generate the difference average (δA) 1685. Correspondingly, in the decoding process 1690, the two-element block 1691 of upsampled elements is compared with the downsampled element 1662 to determine the predicted average 1671.

[0474] Signaling within DDS

[0475] In some implementations, bit or byte stream signaling can be used to indicate whether one or more of the coefficients from the DDS transform are used for internal signaling (e.g., as opposed to carrying transformed coefficient values).

[0476] For example, in one case, a signaling bit can be set to 0 to indicate that internal signaling is not used (e.g., the predefined coefficient carries the transformed residual value for the coding unit), and can be set to 1 to indicate that internal signaling is used (e.g., any existing transformed residual value is replaced by a signaling value that carries information to the decoder). In the latter case, the value of the coefficient can be ignored when the inverse transform is applied to the transformed residuals; for example, the coefficient can be assumed to be 0 regardless of the signaling value it carries.

[0477] In one case, the HH coefficients of the DDS transform can be adapted to carry signaling when the signaling bit is set to 1. This coefficient can be chosen because its value has been determined to have minimal impact on the decoded residual value of the coded block.
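A decoder-side handling of this signaling could look like the following minimal sketch. The function name, the dictionary layout of the coefficients, and the sample values are hypothetical; only the behavior (read the value carried in HH, then treat HH as 0 for the inverse transform when the signaling bit is 1) is taken from the description above.

```python
def split_signaling(coeffs, signaling_bit, carrier="HH"):
    """Separate an embedded signaling value from the DDS coefficients.

    Returns (coefficients to feed to the inverse transform, signal or None).
    `coeffs` is a hypothetical dict mapping coefficient names to values.
    """
    cleaned = dict(coeffs)  # do not modify the caller's data
    if signaling_bit == 0:
        # No internal signaling: the carrier holds a normal transformed residual.
        return cleaned, None
    signal = cleaned[carrier]
    # The carrier is treated as 0 for the inverse transform, whatever it carried.
    cleaned[carrier] = 0
    return cleaned, signal


decoded = {"AA": 12, "HH": 3, "VV": -1}
for_transform, signal = split_signaling(decoded, signaling_bit=1)
```

A decoder that does not act on the signal can simply discard the second return value.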

[0478] The values carried in the intra-coefficient signaling can be used for a variety of purposes. If the decoder is configured to receive and act on the information, the information can be used at the decoder (e.g., at the decoder's discretion).

[0479] In one case, intra-coefficient signaling may indicate information associated with post-processing to be performed on a wider coding unit (e.g., a coding unit associated with a signaling coefficient). In another case, intra-coefficient signaling may indicate information associated with potential artifacts or impairments that may exist when the decoded coding unit is applied in one or more of the level 1 and level 2 enhancement operations. For example, intra-coefficient signaling may indicate that the decoded residual data (and / or a portion of the reconstructed video frame) associated with the coding unit may be subject to striping, blocking effects, etc. One or more post-processing algorithms may then use this information embedded in the coefficient data to selectively apply one or more post-processing operations to correct impairments and improve the reconstructed video.

[0480] Predicted residuals

[0481] As described above, certain examples may use methods that predict the coefficients generated by the transform stage. In one case, a "predicted average" calculation can be used for the average component (A). Calculating the predicted average allows a difference average to be transmitted rather than the full average. This can save a significant amount of data (e.g., reduce the required bit rate) because it reduces the entropy of the average component being encoded (e.g., the difference average is often small or zero).

[0482] For example, when decoding a level 2 enhancement stream, an element at level 1 resolution can be input to an upsampling operation, where it is used to create four elements at the upsampled, or level 2, resolution. As part of the reconstruction, the predicted average for the upsampled coding unit of four elements can be added to the upsampled values of the four elements.

[0483] In one case, a variation of the above-described predicted average calculation can be applied.

[0484] In this variant, the addition of the predicted average after upsampling can be modified. This addition can be modified by a linear or nonlinear function used to add different proportions of the predicted average to different positions within the upsampled coded block.

[0485] For example, in one case, information from one or more adjacent coding blocks can be used to weight the predicted average differently for different pixels. In this case, pixels adjacent to lower-value pixels receive a smaller share of the predicted average, and pixels adjacent to higher-value pixels receive a larger share. The weighting of the predicted average for a pixel can therefore be set based on the relative values of its neighboring pixels.

[0486] This can provide an improvement when edges exist within a coding block. Where an edge exists within an upsampled coding block, it may be advantageous to weight the predicted average based on the location of the edge. For example, if the edge is vertical, the elements within one column of a coding unit can be combined with higher or lower values than the elements in other columns of the coding unit, where the exact weighting depends on the gradient of the edge. Edges at other angles can have more complex weightings of the predicted average. This form of correction to the predicted average can be referred to as adding a form of "tilt." It can form part of the calculation of the predicted residuals. In these cases, each element can receive a different value for combination, in contrast to a single common predicted average.

[0487] After modification and transformation

[0488] In some instances, the transform process (e.g., as applied by the transform component 322 or 341 in Figures 3A and 3B) can be modified to reduce the bit rate required to encode a specific level of quality (e.g., LoQ1 or L-1) and/or to reduce the quantization step width used when quantizing the transformed residuals (also referred to as "coefficients") at the same output bit rate.

[0489] In one instance, it can be decided to retain only the average transformed coefficients (e.g., A for the 2×2 directional decomposition transform, or AA, AH, AV, AD for the 4×4 DDS transform), so that only those average coefficients are sent to the quantizer and entropy encoder. In another instance, particularly applicable to the directional decomposition squared (4×4) transform, it can be decided to retain only the average of the average coefficients, i.e., AA. In yet another instance, all coefficients are retained. In some cases, different coefficients can be weighted differently, e.g., each coefficient position within an x-by-y coding unit or block can have a different weight. Any combination can be used.

[0490] For example, in some cases, the residual processing described above can be applied after the transform stage rather than before it. In these cases, the results of the transform (referred to herein as coefficients) can be weighted instead of, or in addition to, the input residuals. For example, keeping certain coefficients is equivalent to weighting those coefficients by 1 and the others by 0.

[0491] In one instance, the decision about which coefficients to forward for further processing can be made before transforming the residuals. In other words, instead of performing a transformation and then discarding coefficients not selected for quantization and transmission, only the coefficients to be quantized, entropy-coded, and transmitted are computed, thus saving additional computation. For example, instead of weighting the output of the transformation, specific transformation operations can be performed selectively, such as performing only the average transformation (A or Ax). This could correspond to multiplying only a subset of the rows of the transformation matrix, such as multiplying only the residuals by the vector representing the first row of the transformation matrix to determine the average (A) coefficients (e.g., for a 2×2 case with a 4×4 transformation matrix).
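Computing only the selected rows of the transform can be sketched as follows for the 2×2 case. The kernel rows are an assumed unnormalized DD matrix (the patent's own matrix is not reproduced in this text), and the function and parameter names are hypothetical.

```python
# Assumed 4x4 DD transform matrix, one row per coefficient, applied to the
# flattened 2x2 residual block [r00, r01, r10, r11].
DD_ROWS = {
    "A": (1, 1, 1, 1),
    "H": (1, -1, 1, -1),
    "V": (1, 1, -1, -1),
    "D": (1, -1, -1, 1),
}


def selective_transform(residuals, keep=("A",)):
    """Compute only the requested coefficients; other rows are never evaluated."""
    return {
        name: sum(w * r for w, r in zip(DD_ROWS[name], residuals))
        for name in keep
    }


average_only = selective_transform([1, 2, 3, 4])            # first row only
all_coeffs = selective_transform([1, 2, 3, 4], keep="AHVD")  # full transform
```

Selecting `keep=("A",)` gives the same A value as the full transform followed by weighting H, V and D by 0, but avoids three of the four dot products.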

[0492] Each of the above choices can be associated with a corresponding transformation mode.

[0493] The selection is typically based on a decision associated with the bit rate to be used for the corresponding enhancement level (e.g., level 1 or level 2), and/or the corresponding quantization step width to be used for a particular enhancement level, but it can also use the residual pattern classification discussed above as an input. In one case, the bit rate to be used for the corresponding enhancement level can be determined based on data received via the network, as described with reference to Figures 14A to 14C.

[0494] Rate control and quantization

[0495] In some implementations, the quantization operation can be controlled to control the bit rate of one or more of the encoded streams. For example, the quantization parameters of the quantization components 323 and/or 343 in Figures 3A and 3B can be set to provide a desired bit rate in one or more of the encoded video streams (whether a common bit rate for all streams, so as to generate a common encoded stream, or different bit rates for different encoded streams).

[0496] In some cases, quantization parameters can be set based on analysis of one or more of the base coding and enhanced stream coding. Quantization parameters can be selected to provide the desired quality level within a set of predefined bit rate constraints, or to maximize the quality level. Several mechanisms can be used to control variations in the raw video.

[0497] Figure 17A shows an example encoder 1700. The encoder 1700 can be one of the encoders shown in Figures 1, 3A and 3B, with some components omitted for clarity. The encoder 1700 has two enhancement level coding components, 1700-1 and 1700-2. These correspond to components 122 and 142 in Figure 1. In addition to the components of the examples of Figures 1, 3A and 3B, the encoder 1700 of Figure 17A also includes a rate controller 1710. The rate controller 1710 controls the encoding rate of one or more of the enhancement level coding components 1700-1 and 1700-2. The rate controller 1710 may further receive information from a base codec 1730, which may correspond to the base encoder 112 and the base decoder 114. The example encoder 1700 also includes a buffer 1740. Unlike a temporal buffer, this is a buffer that receives the encoded streams, for example before transmission and/or storage. The rate controller 1710 may include software routines (e.g., in a fast low-level language such as C or C++) executed by a processor and/or dedicated electronic circuitry. The buffer 1740 may include a software-defined buffer (e.g., a reserved segment of memory resources) and/or a dedicated hardware buffer. In Figure 17A, the rate controller 1710 receives data from the base processing layer (e.g., at least the base encoder of the base codec 1730) and from the buffer 1740. The buffer 1740 is used to store and/or combine at least the encoded base stream (BS) and an encoded enhancement stream (L1S and/or L2S).

[0498] Figure 17A shows the use of the buffer 1740 with respect to the encoded base stream and the encoded L-1 stream; Figure 17B shows another example, where the buffer 1740 receives the encoded base stream and both the encoded level 1 and level 2 enhancement streams. In the example of Figure 17A, the rate controller 1710 controls quantization 1720 within the level 1 coding layer by supplying a set of quantization parameters. In the example of Figure 17B, the rate controller 1710 controls quantization 1720 within both enhancement coding layers by supplying quantization parameters to the respective quantization components (e.g., 1720-1 and 1720-2, which may correspond to quantization components 323 and 343). In another case, the buffer 1740 may be configured to receive the encoded base stream and the encoded level 2 stream.

[0499] In the examples of Figures 17A and 17B, the buffer 1740 is configured to receive input at a variable bit rate while its output is read at a constant rate. In the figures, the output is shown as a hybrid video stream (HVS). The rate controller 1710 reads the state of the buffer 1740 to ensure that it neither overflows nor becomes empty, so that data is always available to be read at its output.

[0500] Figures 18 and 19 show two possible implementations of a rate controller (e.g., rate controller 1710). These implementations use the state of the buffer to generate a set of quantization parameters Q_t for the current frame t. The quantization parameters can be supplied, for example, to the quantization components 323, 343 in one or more of the level 1 and level 2 encoding pipelines shown in Figures 3A and 3B. In one case, the architecture of Figure 18 or Figure 19 can be replicated for each of the level 1 and level 2 encoding pipelines, so that different quantization parameters are generated for each pipeline.

[0501] Figure 18 shows a first example rate controller 1800, which includes a Q (i.e., quantization) estimation component 1820 that receives a signal 1840 from a buffer and calculates a set of quantization parameters Q_t at a given time t. Figure 19 shows a second example rate controller 1900, which also includes a Q (i.e., quantization) estimation component 1920 that receives a signal 1940 from a buffer and calculates a set of quantization parameters Q′_t at a given time t. The second example rate controller 1900 also includes a target size estimation component 1910, a Q buffer 1930 for storing a set of quantization parameters for the next frame, an encoding component 1940, and a Q capping component 1950. The target size estimation component 1910 receives data 1942 from the base layer, and the encoding component 1940 receives input 1944.

[0502] The general operation of the rate controllers 1800 and 1900 is as follows: the quantization parameters Q_t are controlled based on the amount of data in the buffer (e.g., buffer 1740). In both Figures 18 and 19, the rate controller receives an indication of the amount of data in the buffer (i.e., how "full" the buffer is) via the "from buffer" signals 1840 and 1940. This is then used, directly or indirectly, by the Q estimation components 1820 and 1920 to estimate the set of quantization parameters that are used as operating parameters for the quantization components (e.g., 323 and/or 343).

[0503] In one case, the quantization parameter value is inversely related to the amount of data in the buffer. For example, if there is a large amount of data in the buffer when a new frame is received, the rate controller 1800 sets a low value for Q to reduce the amount of residual data that is encoded, where a low value for Q corresponds to a larger quantization step width that produces fewer quantization block groups within a given range of residual values. Conversely, if the buffer is relatively empty, the rate controller 1800 is configured to set a high value for Q (i.e., a small step width) so that more residual data is encoded into the hybrid video stream.
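The inverse relation between buffer fullness and Q can be illustrated with a deliberately simple mapping. This is only a sketch under stated assumptions: the text does not specify the mapping, so the linear function `1 - fullness`, the normalization of both quantities to the range 0 to 1, and the function name are all hypothetical.

```python
def estimate_q(buffer_fullness):
    """Map normalized buffer fullness (0 = empty, 1 = full) to a quality factor Q.

    Assumed linear inverse relation: a fuller buffer gives a lower Q, which in
    turn corresponds to a larger step width and less encoded residual data.
    """
    if not 0.0 <= buffer_fullness <= 1.0:
        raise ValueError("buffer_fullness must be normalized to [0, 1]")
    return 1.0 - buffer_fullness


q_when_nearly_empty = estimate_q(0.25)
q_when_nearly_full = estimate_q(0.75)
```

A practical controller would likely use a non-linear or adaptive mapping; the linear form here is chosen only to make the direction of the relationship explicit.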

[0504] The example of Figure 19 uses additional components to determine the set of quantization parameters. In the example of Figure 19, the rate controller 1900 also uses the amount of "filler" data that the base encoder intends to add to its stream (e.g., received via the "from base" signal 1942). In this case, the encoder can replace the base encoder "filler" data with additional enhancement stream data to make the most of the available bandwidth. In this case, if a high level of filler is present, the rate controller 1900 may be able to set a higher Q value (e.g., a lower step width value, so that more residual data is received within the buffer) because this "filler" data can be removed or replaced in the base encoder stream (e.g., before or at the buffer).

[0505] In Figure 19, the target size estimation component 1910 receives the status of the buffer and information about the amount of "filler" data that the base encoder is planning to add to the frame. The amount of data held in the buffer can be indicated by a "fill" parameter that can be normalized to a range of 0 to 1 or 0% to 100%, where 60% indicates that the buffer is 60% full (i.e., with 40% of its space remaining). In this case, a mapping function or lookup table can be defined to map from "fill" block groups to a "target size" parameter, where the target size is the target size of the next frame to be encoded by one or more of the level 1 and level 2 enhancement layers. In one case, the mapping function or lookup table can implement a non-linear mapping that can be set experimentally. In another case, the target size estimation can also be based on a configuration parameter that indicates the desired proportion of the hybrid video stream to be filled by the enhancement stream (e.g., with the remainder of the hybrid video stream being filled by the base stream).

[0506] In the example of Figure 19, the target size determined by the target size estimation component 1910 is passed to the Q estimation component 1920. In Figure 19, the Q estimation component 1920 additionally receives input from the Q buffer 1930, which stores Q values from a previous frame, and from an implementation of at least one of the enhancement coding pipelines. In Figure 19, the Q estimation component 1920 receives the "target size", Q_{t-1} (i.e., the set of quantization parameters determined for the previous frame), and the size of the current frame encoded with Q_{t-1} ("current size"). The size of the current frame is supplied by an implementation of at least one of the enhancement coding pipelines (e.g., level 1 and level 2 components). In some cases, the implementation of at least one of the enhancement coding pipelines may also supply the size of one or more previous frames encoded with Q_{t-1}. In one case, the "current size" information can be determined by a parallel copy of at least one of the enhancement coding pipelines: for example, the current frame is quantized with quantization parameters Q_t for transmission, while the Lx encoding component 1940 in Figure 19 receives Q_{t-1} and determines the current size by performing an encoding, based on these quantization parameters, that is not transmitted. Alternatively, the current size could be received from a cloud configuration interface, for example based on preprocessing of pre-recorded video. In this alternative, a parallel implementation may not be required.

[0507] In Figure 19, the Q estimation component 1920 takes its inputs (e.g., as described above) and computes an initial estimated set of quantization parameters Q′_t. In one case, this can be performed using a set of size functions that map a data size (e.g., as expressed by a target or current size) to quantization parameters. The data size and/or the quantization parameters can, for example, be normalized to values between 0 and 1. The quantization parameters can be associated with the quantization step width; for example, a quantization parameter can be a "quality factor" that is inversely proportional to the quantization step width, and/or it can be the quantization step width itself.

[0508] In the example of Figure 19, a set of curves can be defined to map the normalized size to the quantization parameters. Each curve can have one or more multipliers and offsets, which can depend on the nature of the current frame (e.g., the complexity of the information to be encoded within the frame). The multipliers and offsets define the shape of the curves. The multipliers can be applied to a size normalization function that is a function of the quantization parameter Q. In one case, the current size (i.e., the size of frame t encoded with Q_{t-1}) and Q_{t-1} can be used to define a point in a space containing the set of curves. This point can be used to select the curves from the set that are closest to the point. These can be the curves above and below the point, or the highest or lowest curve at the point. The set of closest curves, along with the point, can be used in an interpolation function to determine a new curve associated with the point. Once this new curve is determined, the multiplier and offset for the new curve can be determined. These values, along with the received target size, can then be used to determine the value of Q_t (e.g., the curve can define a function of size and Q).

[0509] In some cases, at least the Q-estimate of the rate controller is adaptive, where the properties of one or more previous frames influence the Q-estimate of the current frame. In one case, the set of curves can be stored in accessible memory and updated based on the set of curves determined for previous frames. In some cases, adaptive quantization can be applied differently for different coefficient locations within a coding unit or block, such as for different elements in an array of 4 or 16 coefficients (for 2×2 or 4×4 transforms).

[0510] Finally, the example of Figure 19 features a Q capping component 1950, which receives the estimated set of quantization parameters Q′_t output from the Q estimation component 1920 and corrects this set based on one or more factors. The estimated set of quantization parameters Q′_t may include one or more values. In one case, the initial set of quantization parameters Q′_t may be corrected based on one or more of the operating behavior of the base coding layer and changes in the quantization parameter Q. In one case, the estimated set of quantization parameters Q′_t can be capped based on a set of quantization parameters that is used by the base coding layer and that can be received with data from this layer. In one case, with or without adaptation using the base coding layer data, the estimated set of quantization parameters Q′_t can be limited based on the values of a previous set of quantization parameters. In this case, one or more of a minimum and a maximum value for Q′_t can be set based on a previous Q value (e.g., Q_{t-1}). The output of the capping is then provided as Q_t in Figure 19.
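The capping step can be sketched as a clamp around the previous frame's value, optionally tightened by a limit derived from the base coding layer. The window size `max_change`, the optional `base_cap` argument, the normalization of Q to the range 0 to 1, and the function name are all assumptions made for illustration; the text only states that minimum and maximum values can be set from a previous Q value and that base-layer data can be used for capping.

```python
def cap_q(q_estimate, q_previous, max_change=0.25, base_cap=None):
    """Limit an estimated Q'_t using Q_{t-1} and an optional base-layer cap.

    q_estimate and q_previous are assumed to be normalized to [0, 1].
    """
    lower = max(0.0, q_previous - max_change)
    upper = min(1.0, q_previous + max_change)
    if base_cap is not None:
        # A cap derived from the base layer can only tighten the upper limit.
        upper = min(upper, base_cap)
        lower = min(lower, upper)
    return min(max(q_estimate, lower), upper)


q_t = cap_q(q_estimate=0.9, q_previous=0.5)  # limited to 0.75
```

Limiting the frame-to-frame change in this way avoids abrupt quality swings when the buffer state changes quickly.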

[0511] In one case, the set of quantization parameters includes a single value for Q_t. In this case, the step width applied to frame t by one of the quantization components can be based on Q_t. The function used to determine the step width can also be based on a maximum step width (e.g., the step width can be between 0 and 10). An example step width calculation is:

[0512] Step width = [(1 - Q^0.2) · (Step width_max - 1)] + 1
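The example calculation can be written directly as a function. This sketch assumes that Q is normalized to the range 0 to 1 and uses a hypothetical default maximum step width of 10; the function and parameter names are illustrative.

```python
def step_width(q, max_step_width=10.0):
    """Step width = [(1 - Q^0.2) * (max_step_width - 1)] + 1, with Q in [0, 1]."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q is assumed to be normalized to [0, 1]")
    return (1.0 - q ** 0.2) * (max_step_width - 1.0) + 1.0


finest = step_width(1.0)    # highest quality factor
coarsest = step_width(0.0)  # lowest quality factor
```

With this formula, Q = 1 yields a step width of 1 and Q = 0 yields the maximum step width, so a higher Q always gives a finer quantization.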

[0513] Quantization features

[0514] Certain quantization variations are described with reference to Figures 20A to 20D.

[0515] Figure 20A provides an example 2000 of how quantization of residuals and/or coefficients (transformed residuals) can be performed based on block groups with a defined step width. The x-axis 2001 of Figure 20A represents residual or coefficient values. In this example, several block groups 2002 are defined with a step width of 5 (e.g., shown by 2003). The size of the step width 2004 can be selected, for example, based on a parameter value. In some cases, the size of the step width 2004 can be set dynamically, for example based on the rate control examples described above. In Figure 20A, the step width 2004 generates block groups 2002 corresponding to residual values in the ranges 0-4, 5-9, 10-14, and 15-19 (i.e., 0 to 4 includes both 0 and 4). The block group widths can be configured to include or exclude endpoints as needed. In this example, quantization is performed by replacing all values that fall within a block group with an integer value (e.g., residual values between 0 and 4, inclusive, have a quantized value of 1). In Figure 20A, quantization is performed by dividing by the step width 2004 (e.g., 5), taking the floor of the result (i.e., for positive values, the nearest integer less than or equal to the result), and then adding one (e.g., 3 / 5 = 0.6, floor(0.6) = 0, 0+1 = 1; or 16 / 5 = 3.2, floor(3.2) = 3, 3+1 = 4). Negative values can be handled in a similar way, for example by operating on the absolute value and then converting the result to a negative value after the calculation (e.g., abs(-9) = 9, 9 / 5 = 1.8, floor(1.8) = 1, 1+1 = 2, 2*-1 = -2). Figure 20A shows a case of linear quantization where all block groups share a common step width. It should be noted that various different implementations based on this method are possible; for example, the first block group may have a quantized value of 0 instead of 1, or may include values from 1 to 5 (inclusive). Figure 20A is merely one illustration of quantization based on block groups with a given step width.
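The worked arithmetic above (divide by the step width, take the floor, add one, restore the sign) can be sketched as follows. The function name is hypothetical, and treating a value of exactly 0 as belonging to the first positive block group is an assumption that follows from the 0-4 range in the example.

```python
import math


def quantize_linear(value, step_width=5):
    """Linear quantization: floor(|value| / step_width) + 1, with the sign restored."""
    magnitude = math.floor(abs(value) / step_width) + 1
    return -magnitude if value < 0 else magnitude


examples = {v: quantize_linear(v) for v in (3, 16, -9)}
```

These three inputs reproduce the worked examples in the text: 3 maps to 1, 16 maps to 4, and -9 maps to -2.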

[0516] Dead Zone

[0517] Figure 20B shows an example 2010 of how a so-called "dead zone" (DZ) can be implemented. In Figure 20B, residuals or coefficients with values within a predefined range 2012 are set to 0. In Figure 20B, the predefined range is a range around the value 0, as shown by the range limits 2011 and 2013. In Figure 20B, values less than 6 and greater than -6 are set to 0, as shown by 2014. The dead zone can be set to a fixed range (e.g., -6 to 6) or can be based on the step width. In one case, the dead zone can be set as a predefined multiple of the step width, for example as a linear function of the step width value. In the example of Figure 20B, the dead zone is set to 2.4 * step width. Therefore, with a step width of 5, the dead zone extends from -6 to +6. In other cases, the dead zone can be set as a non-linear function of the step width value.

[0518] In one case, the dead zone is set based on a dynamic step width, which can be adaptive, for example. In this case, the dead zone can change with the step width. For example, if the step width is updated to 3 instead of 5, a dead zone of 2.4 * step width changes from the range -6 to +6 to the range -3.6 to 3.6; or if the step width is updated to 10, the dead zone changes to extend from -12 to 12. In one case, the step width multiplier can be between 2 and 4. In another case, the multiplier can also be adaptive, for example based on operating conditions such as the available bit rate.

[0519] Having a dead zone helps reduce the amount of data to be transmitted over the network, for example by helping to reduce the bit rate. When a dead zone is used, residual or coefficient values that fall within the dead zone are effectively ignored. This method also helps remove low-level residual noise. Having an adaptive rather than constant dead zone means that smaller residual or coefficient values are not over-filtered when the step width decreases (e.g., if more bandwidth is available), and that the bit rate decreases appropriately if the step width increases. The dead zone only needs to be implemented at the encoder; the decoder simply receives a quantized value of 0 for any residual or coefficient that falls within the dead zone.
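A dead zone that scales with the step width can be sketched as below. The 2.4 multiplier and the resulting ranges come from the examples above; the function names and the choice of strict inequalities at the limits are assumptions.

```python
def dead_zone_limit(step_width, multiplier=2.4):
    """Half-width of a dead zone whose total width is multiplier * step_width."""
    return multiplier * step_width / 2


def apply_dead_zone(value, step_width, multiplier=2.4):
    """Set values strictly inside the dead zone to 0; pass other values through."""
    limit = dead_zone_limit(step_width, multiplier)
    return 0 if -limit < value < limit else value


# The same residual value survives or is zeroed depending on the step width.
kept_at_small_step = apply_dead_zone(4, step_width=3)     # limit is about 3.6
zeroed_at_large_step = apply_dead_zone(4, step_width=10)  # limit is about 12
```

Because the limits follow the step width, small residuals are filtered less aggressively when more bit rate is available.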

[0520] Block group folding

[0521] Figure 20C shows an example 2020 of how a method called block group folding can be applied. In the example of Figure 20C, block group folding is used together with a dead zone, but in other cases it can be used without a dead zone and/or with other quantization methods. In Figure 20C, block group folding is used to place all residual or coefficient values that reside above a selected quantization block group 2021 into the selected block group. For example, this can be viewed as a form of limiting. It is shown for positive values via limit 2021 and arrow 2022, and for negative values via limit 2023 and arrow 2024.

[0522] In the example of Figure 20C, a step width of 5 is again applied. A dead zone 2012 with a range of 2.4 * step width is also applied, such that values between -6 and 6 are set to 0. This can also be seen as falling into a larger first quantization block group (with a value of 0). Two quantization block groups 2002 with a width of 5 (as shown by 2003) are then defined for positive and negative values. For example, the block group with quantized value 1 is defined between 6 and 11 (e.g., with a step width of 5), and the block group with quantized value 2 is defined between 11 and 16. In this example, to perform block group folding, all residuals or coefficients that would normally fall into block groups above the second block group (e.g., with a value greater than 16) are "folded" 2022 into the second block group, for example limited to a quantized value of 2. This can be done by setting all values greater than a threshold to the maximum block group value (e.g., 2). A similar process occurs for negative values. This is indicated in Figure 20C by the large arrows 2022 and 2024.

[0523] Block group folding can be a selectable processing option at the encoder. It does not need to be taken into account during dequantization at the decoder (e.g., a "folded" or "limited" value of 2 is simply dequantized as if it were in the second block group). Block group folding can be performed to reduce the number of bits sent to the decoder over the network. Block group folding can be configured based on network conditions and/or underlying stream processing to reduce the bit rate.
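
The dead zone and block group folding of the Figure 20C example can be sketched as follows. This is a minimal illustration under stated assumptions, not the encoder's actual quantizer: the function name and parameter names are hypothetical, and the handling of values that lie exactly on a boundary (6, 11, 16) is an assumption, since the text does not specify it.

```python
def quantize(value, step_width=5.0, dead_zone_multiplier=2.4, max_group=2):
    """Quantize one residual or coefficient with a dead zone and folding."""
    # Half-width of the dead zone: 2.4 * 5 / 2 = 6, so values in (-6, 6) map to 0.
    half_dead_zone = dead_zone_multiplier * step_width / 2.0
    magnitude = abs(value)
    if magnitude < half_dead_zone:
        return 0
    # Block groups of width step_width start at the dead-zone edge:
    # [6, 11) -> 1, [11, 16) -> 2, and so on (boundary handling is assumed).
    index = int((magnitude - half_dead_zone) // step_width) + 1
    # Folding: anything above the last kept block group is limited to it.
    index = min(index, max_group)
    return index if value > 0 else -index
```

With the default values, an input of 40 and an input of 12 both produce the quantized value 2, which is why the decoder needs no knowledge of the folding step.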

[0524] Quantization offset

[0525] Figure 20D shows an example 2030 of how a quantization offset can be used in certain cases. A quantization offset can be used to shift the positions of the quantization block groups. Figure 20D shows a line 2031 indicating possible real-world counts over the range of residual or coefficient values along the x-axis. In this example, many values are close to zero, with the counts of higher values decreasing with distance from 0. If the count values are normalized, the line can also indicate the probability distribution of the residual or coefficient values.

[0526] The bars 2032 on the left-hand side of Figure 20D and the dashed lines 2033 on the right-hand side show a histogram that models the quantization. For ease of illustration, the count values of the first to third block groups after the dead zone are shown (for both positive and negative values, with the bars for the latter drawn striped). For example, bars 2035 show the counts of the quantized values 1, 2, 3 and -1, -2, -3. Due to quantization, the distribution modeled by the histogram differs from the actual distribution shown by the line. For example, an error 2037 is shown, which indicates the degree to which a bar differs from the line.

[0527] To modify the nature of error 2037, a quantization offset 2036 can be applied. For positive values, a positive quantization offset shifts each block group to the right, and a negative quantization offset shifts each block group to the left. In one case, a dead zone can be applied based on a first threshold set, where all values ​​less than (n*step_width) / 2 and greater than (n*step_width*-1) / 2 are set to 0, and block group folding can be applied based on a second threshold set (e.g., from the previous example), where all values ​​greater than 16 or less than -16 are set to 2. In this case, the quantization offset cannot shift the beginning of the first block group or the end of the last block group because these are based on the aforementioned higher and lower thresholds, but it can shift the position 2034 of the block groups between these thresholds. An example quantization offset can be 0.35.

[0528] In one case, the quantization offset 2036 can be configurable. In another case, the quantization offset can vary dynamically, for example, based on conditions during encoding. In this case, the quantization offset can be signaled to the decoder for use in dequantization.

[0529] In one case, at the encoder, the quantization offset can be subtracted from the residual or coefficient value before step-width-based quantization. Therefore, in the decoder, the transmitted offset can be added to the received quantized value before step-width-based dequantization. In some cases, the offset can be adjusted based on the sign of the residual or coefficient to allow symmetric operation about a zero value. In one case, the use of the offset can be disabled by setting the quantization or dequantization offset value to 0. In one case, the applied quantization offset can be adjusted based on a defined dead-zone width. In one case, the dead-zone width can be calculated at the decoder, for example, based on the step width and quantization parameters received from the encoder.
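
One possible reading of the offset handling in this paragraph is sketched below. The function names are hypothetical, the dead zone is left out for brevity, and two details are assumptions rather than statements from the text: the offset is applied to the magnitude so that positive and negative values behave symmetrically, and a quantized value of 0 is reconstructed as 0.

```python
def quantize_with_offset(value, step_width, offset):
    """Encoder side: subtract the offset, then apply step-width quantization."""
    sign = -1 if value < 0 else 1
    # Applying the offset to the magnitude keeps the operation symmetric about 0.
    shifted = max(abs(value) - offset, 0.0)
    return sign * int(shifted // step_width)


def dequantize_with_offset(quantized, step_width, offset):
    """Decoder side: add the signaled offset, then apply step-width dequantization."""
    if quantized == 0:
        return 0.0  # assumption: zero stays zero
    sign = -1 if quantized < 0 else 1
    return sign * (abs(quantized) + offset) * step_width
```

Setting `offset` to 0 reduces both functions to plain step-width quantization and dequantization, which matches the way the paragraph describes disabling the offset.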

[0530] Quantization matrix

[0531] In one case, the step width used for quantization can vary for different coefficients within a 2×2 or 4×4 block of coefficients. For example, a smaller step width can be assigned to coefficients that are experimentally determined to have a greater impact on the perception of the decoded signal. For instance, in a 4×4 directional decomposition (DD-squared or "DDS") as described above, smaller step widths can be assigned to the AA, AH, AV, and AD coefficients, with larger step widths assigned to later coefficients. In this case, a base_stepwidth parameter can be used to define the default step width, and a modifier can then be applied to this parameter to calculate a modified_stepwidth for use in quantization (and dequantization), for example, modified_stepwidth = base_stepwidth * modifier, where the modifier can be set based on the specific coefficient within the block or unit.

[0532] In some cases, the modifier may additionally or alternatively depend on the enhancement level. For example, for a level 1 enhancement stream, the step size can be smaller because it can affect multiple reconstructed pixels at a higher quality level.

[0533] In some cases, modifiers can be defined based on both the coefficients within a block and the enhancement level. In one case, a quantization matrix can be defined with a set of modifiers for different coefficients and different enhancement levels. This quantization matrix can be preset (e.g., at the encoder and / or decoder), signaled between the encoder and decoder, and / or dynamically constructed at the encoder and / or decoder. For example, in the latter case, the quantization matrix can be constructed at the encoder and / or decoder based on other stored and / or signaled parameters (e.g., parameters received via a configuration interface as previously described).
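
As an illustration of a quantization matrix holding modifiers per coefficient position and per enhancement level, consider the sketch below. All modifier values and names other than base_stepwidth and modified_stepwidth are hypothetical; the text does not give concrete matrix entries.

```python
# Hypothetical modifiers for a 2x2 block, indexed as [level][y][x].
# Level 1 is given smaller modifiers than level 2 purely for illustration.
QUANT_MATRIX = {
    1: [[0.5, 0.75],
        [0.75, 1.0]],
    2: [[1.0, 1.25],
        [1.25, 1.5]],
}


def modified_stepwidth(base_stepwidth, level, x, y):
    """Return base_stepwidth * modifier for one coefficient position and level."""
    modifier = QUANT_MATRIX[level][y][x]
    return base_stepwidth * modifier
```

For example, `modified_stepwidth(100, 1, 0, 0)` selects the modifier 0.5 and yields a step width of 50.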

[0534] In one scenario, different quantization modes can be defined. In one mode, a common quantization matrix can be used for two enhancement levels; in another mode, separate matrices can be used for different levels; and in yet another mode, a quantization matrix can be used for only one enhancement level, such as only for level 2. The quantization matrix can be indexed by the position of the coefficients within the block (e.g., 0 or 1 along the x-direction and 0 or 1 along the y-direction for a 2×2 block, or 0 to 3 for a 4×4 block).

[0535] In one case, a base quantization matrix can be defined by a set of values. This base quantization matrix can be modified by a scaling factor that is a function of the step width of one or more enhancement levels. In another case, the scaling factor can be a clamping function of a step width variable. At the decoder, the step width variable can be received from the encoder for one or more of the level 1 and level 2 streams. In yet another case, each entry in the quantization matrix can be scaled using an exponential function of the scaling factor, for example, each entry can be raised to the power of the scaling factor.

[0536] In one case, different quantization matrices can be used for each of the level 1 and level 2 streams (e.g., different quantization matrices are used when encoding and decoding the coefficients, i.e., the transformed residuals, for these levels). In another case, a particular quantization configuration can be set as a predefined default, and any variation from this default can be signaled between the encoder and decoder. For example, if different quantization matrices are used by default, no signaling of this between the encoder and decoder is required. However, if a common quantization matrix is to be used, this can be signaled to override the default configuration. Having a default configuration reduces the amount of signaling required (because the default configuration need not be signaled).

[0537] Tiling

[0538] As described above, for example with reference to Figure 12C, in certain configurations a frame of video data can be divided into two-dimensional portions called "tiles". For example, a 640x480 frame of video data may contain 1200 tiles of 16 pixels by 16 pixels (e.g., 40 tiles by 30 tiles). A tile can therefore comprise a non-overlapping, contiguous region within the frame, where each region has a defined size in each of the two dimensions. A common convention is that tiles run consecutively in rows across the frame; for example, one row of tiles may extend across the horizontal extent of the frame before the next row starts (the so-called "raster" format, although other conventions such as interlaced formats may also be used). A tile can be defined as a specific set of coding units; for example, a 16x16 pixel tile may comprise an 8x8 set of 2x2 coding units or a 4x4 set of 4x4 coding units.
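
The tile arithmetic in this paragraph can be checked with a short sketch. The function names are hypothetical, and the sketch assumes that the frame dimensions are exact multiples of the tile size and that tiles are numbered in raster order starting at 0.

```python
def tile_grid(frame_width, frame_height, tile_size=16):
    """Return (columns, rows) of tiles; assumes dimensions divide evenly."""
    return frame_width // tile_size, frame_height // tile_size


def tile_index(x, y, frame_width, tile_size=16):
    """Raster-order index of the tile that contains pixel (x, y)."""
    columns = frame_width // tile_size
    return (y // tile_size) * columns + (x // tile_size)


def coding_units_per_tile(tile_size, unit_size):
    """Number of unit_size x unit_size coding units inside one square tile."""
    per_side = tile_size // unit_size
    return per_side * per_side
```

For a 640x480 frame, `tile_grid(640, 480)` gives 40 columns and 30 rows, i.e., 1200 tiles, and a 16x16 tile holds 64 coding units of 2x2 or 16 coding units of 4x4.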

[0539] In some cases, the decoder may selectively decode portions of one or more of the base stream, the level 1 enhancement stream, and the level 2 enhancement stream. For example, it may be desired to decode only the data relating to a region of interest in the reconstructed video frame. In this case, the decoder may receive the complete data set for one or more of the base stream, the level 1 enhancement stream, and the level 2 enhancement stream, but may decode only the data within the streams that is needed to render the region of interest in the reconstructed video frame. This can be considered a form of partial decoding.

[0540] Partial decoding in this manner offers several advantages in different aspects.

[0541] When implementing virtual or augmented reality applications, only a portion of a wide field of view may be viewed at any one time. In this case, only a small region of interest relating to the viewed area may be reconstructed at a high level of quality, while the remaining area of the field of view is rendered at a low (i.e., lower) level of quality. Further details of this approach can be found in patent publication WO 2018/015764 A1, which is incorporated herein by reference. Similar approaches can be useful when transmitting video data relating to computer games.

[0542] Partial decoding can also offer advantages for resource-constrained mobile and/or embedded devices. For example, the base stream can be quickly decoded and presented to the user. The user can then select a portion of this base stream for more detailed rendering. After the region of interest has been selected, data relating to the region of interest from one or both of the level 1 and level 2 enhancement streams can be decoded and used to render a specific, limited area in high detail. A similar approach can also be advantageous for object recognition, where an object can be located in the base stream and this location can form a region of interest. Data relating to the region of interest from one or both of the level 1 and level 2 enhancement streams can then be decoded for further processing of the video data relating to the object.

[0543] In this example, partial decoding can be based on tiles. For instance, the region of interest can be defined as a set of one or more tiles within a frame of the reconstructed video stream, such as the reconstructed video stream at high quality or full resolution. Tiles in the reconstructed video stream can correspond to equivalent tiles in frames of the input video stream. Therefore, a set of tiles covering an area smaller than a full video frame can be decoded.

[0544] In certain configurations described herein, the encoded data forming part of at least the level 1 enhancement stream and the level 2 enhancement stream can be generated by run-length encoding followed by Huffman coding. In such an encoded data stream, it may not be possible to identify the data associated with a specific portion of the reconstructed video frame without first decoding the data (e.g., at least until the quantized transform coefficients organized into coding units are obtained).

[0545] In the above configurations, certain variations of the examples described herein may include a set of signaling within the encoded data of one or more of the level 1 enhancement stream and the level 2 enhancement stream, such that the encoded data associated with a particular tile can be identified prior to decoding. This allows the partial decoding discussed above.

[0546] For example, in some cases, one or more of the example encoding schemes shown in Figures 10A to 10I may be adapted to include header data that identifies specific tiles within a frame. The identifier may comprise a 16-bit integer that identifies the number of a specific tile within a regular tile grid (e.g., as shown in Figure 12C). For example, at the start of transmission of the encoded data associated with a specific tile of an input video frame, a tile identifier can be added to a header field of the encoded data. At the decoder, all data following the identifier is considered to relate to the identified tile until a new header field or a frame transition header field is detected within the encoded stream. In this case, the encoder signals tile identification information within one or more of the level 1 and level 2 enhancement streams, and this information can be received within the stream and extracted without decoding the stream. Therefore, where the decoder is to decode one or more tiles associated with a defined region of interest, the decoder can decode only the portions of the one or more enhancement streams relating to those tiles.

[0547] Using a tile identifier within an encoded enhancement stream allows variable-length data to be output, for example by a combination of Huffman and run-length encoding, while still enabling the data relating to specific regions of the reconstructed video frame to be determined before decoding. The tile identifier can therefore be used to identify different portions of the received bitstream.

[0548] In this example, the enhancement data associated with a tile (e.g., in the form of transformed coefficients and/or decoded residual data) can be independent of the enhancement data associated with other tiles within the enhancement stream. For example, residual data can be obtained for a given tile without needing data associated with other tiles. In this way, the present examples differ from comparative scalable video coding schemes, such as those associated with the AVC and HEVC standards (e.g., SVC and SHVC), which require additional intra- or inter-picture data to decode data associated with a specific region or macroblock of the reconstructed picture. This enables the present examples to be implemented efficiently using parallel processing: the reconstruction of different tiles and/or coding units of the reconstructed frame can be performed in parallel. This can significantly accelerate decoding and reconstruction on modern computing hardware where multiple CPU or GPU cores are available.

[0549] Tiles within a byte stream

[0550] Figure 21A shows another example 2100 of a bit or byte stream structure for an enhancement stream. Figure 21A can be regarded as another example similar to Figure 9A. The top of Figure 21A shows components 2112 to 2118 of an example byte stream 2110 for a single frame of video data. A video stream will therefore include multiple such structures, one for each video frame. The byte stream of a single frame includes a header 2112 and data associated with each of three planes. In this example, these planes are the color components of the frame, namely the Y, U, and V components 2114, 2116, and 2118.

[0551] The second level of Figure 21A shows the general structure of the byte stream for a given color plane 2115. In this case, the sub-portions of the Y plane are shown; other planes may have similar structures. In Figure 21A, each plane includes data 2120 associated with each of the two enhancement levels: a first level of quality (level 1 or LoQ 1) 2122 and a second level of quality (level 2 or LoQ 2) 2124. As discussed above, these may comprise the data of the level 1 enhancement stream and the level 2 enhancement stream.

[0552] In the third level of Figure 21A, each enhancement level 2125 further includes data 2130, which includes byte stream portions 2132 associated with multiple layers. Figure 21A shows N layers. Each layer can relate to a different "plane" of encoded coefficients, i.e., residual data after transform, quantization, and entropy coding. If a 2×2 coding unit is used, there can be four such layers (e.g., one for each direction of the directional decomposition, DD). If a 4×4 coding unit is used, there can be sixteen such layers (e.g., one for each direction of the directional decomposition squared, DDS). In one case, each layer can be decoded independently of the others; each layer can thus form an independently decodable unit (IDU). If a temporal mode is used, there can also be one or more layers relating to temporal information.

[0553] When a tiling configuration is used, for example for partial decoding, the data 2140 of each layer can be further decomposed 2135 into portions 2142 associated with multiple tiles. These tiles can correspond to rectangular regions of the original input video. The tile size can be fixed for each group of pictures (GOP). The tiles can be arranged in raster order. Figures 6B and 12C show examples of tile structures.

[0554] Figure 21A shows an example in which each layer further includes portions 2142 of the byte stream associated with M tiles. Each tile thus forms an IDU and can be decoded independently of the other tiles. This independence enables selective or partial decoding. Figure 21B shows an alternative example 2150, in which each level of quality 2120 or byte stream 2110 is first decomposed into portions 2140 associated with M tiles, with each tile portion then being decomposed into portions associated with each layer 2130. Either approach can be used.

[0555] In an example, each IDU may include header information such as an isAlive field (e.g., indicating whether the IDU contains non-zero data), a StreamLength field (indicating the data size of the stream portion), and a payload carrying the encoded data of the IDU. Using an indication of whether a particular tile contains data (e.g., isAlive = 1) can help reduce the amount of data to be transmitted, because a particular tile will often be all zeros owing to the use of residual data, so the additional tile data to be transmitted can be minimized.
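
The way such header fields permit selective decoding can be sketched as follows. The byte layout here is purely an assumption made for illustration (a one-byte isAlive field, followed for non-empty tiles by a four-byte big-endian StreamLength and then the payload); the text does not define field widths, and the function names are hypothetical.

```python
import struct


def pack_idu(payload):
    """Serialize one tile IDU using the assumed layout described above."""
    if not payload:
        return b"\x00"  # isAlive = 0: no length field and no payload
    return b"\x01" + struct.pack(">I", len(payload)) + payload


def extract_tiles(stream, wanted):
    """Return {tile_index: payload} for the wanted tiles, skipping the others."""
    found = {}
    position = 0
    tile = 0
    while position < len(stream):
        is_alive = stream[position]
        position += 1
        if is_alive:
            (length,) = struct.unpack_from(">I", stream, position)
            position += 4
            if tile in wanted:
                found[tile] = stream[position:position + length]
            # Unwanted payloads are skipped without being entropy-decoded.
            position += length
        elif tile in wanted:
            found[tile] = b""  # an all-zero tile carries no payload
        tile += 1
    return found
```

Because StreamLength precedes each payload, a decoder interested only in tile 2 can jump over the payloads of tiles 0 and 1 without interpreting them.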

[0556] When tiling is used, the header of, for example, a group of pictures (GOP) can be modified to include a tiling mode flag. In this case, a first flag value (e.g., 0) could indicate a "blank" mode that does not support partial decoding, and a second flag value (e.g., 1) could indicate a "tiled" mode that supports partial decoding. The second flag value could indicate that a specific fixed-size tiling mode is used, whereby a plane (e.g., one of the YUV planes) is divided into fixed-size rectangular regions (tiles) of size T_W × T_H that are indexed in raster order. In other cases, different flag values can indicate different tiling modes; for example, one mode can indicate a custom tile size that is transmitted with the header information.

[0557] In one case, the tile size can be signaled in the header information. The tile size can be signaled explicitly (e.g., by sending the tile width T_W in pixels and the tile height T_H in pixels). In one case, the tile size can be signaled by sending an index into a lookup table stored at the decoder. A single byte indicating one of up to 255 tile sizes can therefore be used to signal the tile size. One index value can also indicate a custom size (e.g., to be signaled separately in the header). If the tile size is explicitly signaled in the header information, it can be transmitted using 4 bytes (two bytes for each of the width and the height).
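
A decoder-side reading of these two signaling options might look like the sketch below. The lookup table contents, the use of index 255 for a custom size, and the big-endian byte order are all assumptions made for illustration; none of them is specified in the text.

```python
import struct

# Hypothetical lookup table stored at the decoder: index -> (width, height).
TILE_SIZE_TABLE = {0: (16, 16), 1: (32, 32), 2: (64, 64)}
CUSTOM_SIZE_INDEX = 255  # hypothetical index value meaning "custom size follows"


def read_tile_size(header, position=0):
    """Return ((width, height), next_position) from the header bytes."""
    index = header[position]
    position += 1
    if index == CUSTOM_SIZE_INDEX:
        # Explicit signaling: two bytes for the width, two for the height.
        width, height = struct.unpack_from(">HH", header, position)
        return (width, height), position + 4
    return TILE_SIZE_TABLE[index], position
```

A table index costs one byte, while a custom size costs that byte plus the four explicit bytes mentioned in the paragraph.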

[0558] If a tiled mode is signaled, one or more tile-specific configurations can be present in the header information. In one case, a data aggregation mode can be signaled (e.g., using a 1-bit flag). A value of 1 can indicate that the tile data segments within the byte stream, such as the isAlive / StreamLength / payload portions described above, are to be grouped or aggregated (e.g., the data stream first contains the isAlive header information for a set of tiles, followed by the StreamLength information for the set of tiles, and then the payload information for the set of tiles). Organizing the byte stream in this way facilitates selective decoding of tiles, for example because the stream length information for each tile can be received before the payload data. In this case, the aggregated data can also optionally be compressed using run-length and Huffman coding (e.g., as described herein), and this can also be flagged (e.g., using a 1-bit field). Different portions of the aggregated data stream can have different compression settings. If information such as the stream length fields is Huffman encoded, these can be encoded as absolute or relative values (e.g., as a difference relative to the previous stream length value). Relative value encoding can further reduce the byte stream size.
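
Relative encoding of stream lengths, and the way aggregated lengths let a decoder locate each payload, can be illustrated with the sketch below. The function names are hypothetical, and the sketch assumes that the first length is coded relative to 0 and that payloads are stored back to back in tile order.

```python
def to_relative(lengths):
    """Encode each stream length as a difference from the previous one."""
    previous = 0
    deltas = []
    for length in lengths:
        deltas.append(length - previous)
        previous = length
    return deltas


def from_relative(deltas):
    """Recover the absolute stream lengths from the differences."""
    total = 0
    lengths = []
    for delta in deltas:
        total += delta
        lengths.append(total)
    return lengths


def payload_offsets(lengths):
    """Start offset of each tile payload within the aggregated payload section."""
    offsets = []
    position = 0
    for length in lengths:
        offsets.append(position)
        position += length
    return offsets
```

When neighboring tiles have similar payload sizes, the differences cluster around zero, which is the kind of distribution an entropy coder such as Huffman coding can represent compactly.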

[0559] In these examples, a method of encoding an enhancement stream is described whereby the enhancement bitstream can be split into portions or chunks representing different spatial parts (i.e., tiles) of the video frames. The data associated with each tile can be received and decoded independently, thereby allowing parallel processing and selective or partial decoding.

[0560] Neural network upsampling

[0561] In some examples, upsampling can be enhanced by using artificial neural networks. For example, a convolutional neural network can be used as part of the upsampling operation to predict upsampled pixel or signal element values. The use of artificial neural networks to enhance upsampling operations is described in WO 2019/111011 A1, which is incorporated herein by reference. A neural network upsampler can be used to implement any of the upsampling components described in the examples herein.

[0562] Figure 22A shows a first example 2200 of a neural network upsampler 2210. The neural network upsampler can be used to convert between signal data at a first level (n-1) and signal data at a second level n. In the context of the present examples, the neural network upsampler can convert between data processed at enhancement level 1 (i.e., level of quality LoQ-1) and data processed at enhancement level 2 (i.e., level of quality LoQ-2). In one case, the first level (n-1) may have a first resolution (e.g., size_1 by size_2 elements), and the second level n may have a second resolution (e.g., size_3 by size_4 elements). The number of elements in each dimension at the second resolution may be a multiple of the number of elements in each dimension at the first resolution (e.g., size_3 = F1 * size_1 and size_4 = F2 * size_2). In the described examples, the multiple may be the same in both dimensions (e.g., F1 = F2 = F, and in some examples, F = 2).

[0563] In some examples, the use of an artificial neural network may involve converting element data (e.g., picture element values, such as the values of a color plane) from one data format to another. For example, the element data (e.g., as input to an upsampler in non-neural cases) may be in the form of 8- or 16-bit integers, while the neural network may operate on floating-point data values (e.g., 32- or 64-bit floating-point values). The element data can therefore be converted from an integer to a floating-point format before upsampling, and/or from a floating-point to an integer format after neural upsampling. This is shown in Figure 22B.

[0564] In Figure 22B, the input to the neural network upsampler 2210 (e.g., the upsampler from Figure 22A) is first processed by a first conversion component 2222. The first conversion component 2222 converts the input data from an integer format to a floating-point format. The floating-point data is then input to the neural network upsampler 2210, which is free to perform floating-point operations. The output from the neural network upsampler 2210 comprises data in a floating-point format. In Figure 22B, this is then processed by a second conversion component 2224, which converts the data from the floating-point format to an integer format. The integer format can be the same as that of the original input data or a different integer format (e.g., the input data can be provided as 8-bit integers, but the output can be provided as 10-, 12-, or 16-bit integers). The output of the second conversion component 2224 can place the output data in a format suitable for operations at the upper enhancement level (e.g., the level 2 enhancement described herein).

[0565] In some examples, as an alternative to or in addition to data format conversion, the first and/or second conversion components 2222 and 2224 may also provide data scaling. Data scaling can place the input data in a form better suited to the application of an artificial neural network architecture. For example, data scaling may include a normalization operation. An example normalization operation is given below:

[0566] norm_value=(input_value-min_int_value) / (max_int_value-min_int_value)

[0567] where input_value is the input value, min_int_value is the minimum integer value, and max_int_value is the maximum integer value. Additional scaling can be applied by multiplying by a scaling divisor (i.e., dividing by a scaling factor) and/or subtracting a scaling offset. The first conversion component 2222 provides the forward data scaling, and the second conversion component 2224 applies the corresponding inverse operation (e.g., inverse normalization). The second conversion component 2224 can also round the values to generate an integer representation.
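
The forward normalization above and a matching inverse can be sketched as follows. The function names are hypothetical, and the clamping to the valid integer range and the round-half-up rule in the inverse are assumptions, since the text only states that values can be rounded.

```python
import math


def normalize(input_value, min_int_value=0, max_int_value=255):
    """Forward scaling: map an integer sample into the range [0.0, 1.0]."""
    return (input_value - min_int_value) / (max_int_value - min_int_value)


def denormalize(norm_value, min_int_value=0, max_int_value=255):
    """Inverse scaling: map a float back to an integer sample."""
    value = norm_value * (max_int_value - min_int_value) + min_int_value
    # Assumption: network outputs outside the valid range are clamped.
    clamped = min(max(value, min_int_value), max_int_value)
    return math.floor(clamped + 0.5)  # round half up
```

Passing different limits to `denormalize` than to `normalize` gives an output bit depth that differs from the input, for example 8-bit input (0 to 255) and 10-bit output (0 to 1023).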

[0568] Figure 22C shows an example architecture 2230 for a simple neural network upsampler 2210. The neural network upsampler 2210 includes two convolutional layers 2232 and 2236 separated by a non-linearity 2234. An optional post-processing operation 2238 is also shown. Keeping the neural network architecture simple allows the upsampling to be enhanced while still permitting real-time video decoding.

[0569] The convolutional layers 2232 and 2236 may comprise two-dimensional convolutions. A convolutional layer may apply one or more filter kernels of a predefined size. In one case, the filter kernels may be 3×3 or 4×4. A convolutional layer may apply a filter kernel defined by a set of weight values, and may also apply a bias. The bias has the same dimensionality as the output of the convolutional layer. In the example of Figure 22C, the two convolutional layers 2232 and 2236 may share a common structure or function but have different parameters (e.g., different filter kernel weight values and different bias values). Each convolutional layer may operate with different dimensions. The parameters of each convolutional layer may be defined as a four-dimensional tensor of size (kernel_size1, kernel_size2, input_size, output_size). The input of each convolutional layer may comprise a three-dimensional tensor of size (input_size_1, input_size_2, input_size). The output of each convolutional layer may comprise a three-dimensional tensor of size (input_size_1, input_size_2, output_size). The first convolutional layer 2232 may have an input_size of 1, i.e., such that it receives a two-dimensional input similar to that of a non-neural upsampler as described herein. Example values for these sizes are as follows: kernel_size1 = kernel_size2 = 3; for the first convolutional layer 2232, input_size = 1 and output_size = 16; and for the second convolutional layer 2236, input_size = 16 and output_size = 4. Other values may be used depending on the implementation and empirical performance. With an output_size of 4 (i.e., four channels output for each input element), the output can be rearranged into a 2×2 block representing the upsampled output for a given picture element.
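
Two consequences of the example sizes can be checked with a short sketch: the number of kernel weights in each layer, and the rearrangement of a four-channel output into 2×2 blocks. The function names are hypothetical, the weight count excludes the bias, and the channel-to-position order (top-left, top-right, bottom-left, bottom-right) is an assumption.

```python
def kernel_weight_count(kernel_size1, kernel_size2, input_size, output_size):
    """Number of entries in the four-dimensional weight tensor (bias excluded)."""
    return kernel_size1 * kernel_size2 * input_size * output_size


def channels_to_blocks(tensor):
    """Rearrange an (H, W, 4) nested list into a (2H, 2W) nested list."""
    height, width = len(tensor), len(tensor[0])
    output = [[0] * (2 * width) for _ in range(2 * height)]
    for y in range(height):
        for x in range(width):
            c0, c1, c2, c3 = tensor[y][x]
            # Assumed channel order: row-major within the 2x2 block.
            output[2 * y][2 * x] = c0
            output[2 * y][2 * x + 1] = c1
            output[2 * y + 1][2 * x] = c2
            output[2 * y + 1][2 * x + 1] = c3
    return output
```

With the example values, the first layer holds 3 × 3 × 1 × 16 = 144 weights and the second 3 × 3 × 16 × 4 = 576, which indicates how small the network is.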

[0570] The input to the first convolutional layer 2232 can be a two-dimensional array, similar to other upsampler implementations described herein. For example, the neural network upsampler 2210 can receive partial and/or complete reconstructed frames (e.g., the base layer plus the decoded output of the level 1 enhancement). The output of the neural network upsampler 2210 can comprise partial and/or complete reconstructed frames at a higher resolution, as in other upsampler implementations described herein. The neural network upsampler 2210 can therefore be used as a modular component, like the other available upsampling methods described herein. In one case, the selection of the neural network upsampler at the decoder can be signaled within the transmitted byte stream, for example within global header information.

[0571] The non-linearity layer 2234 may comprise any known non-linearity, such as a sigmoid function, a hyperbolic tangent function, a rectified linear unit (ReLU), or an exponential linear unit (ELU). Variations of common functions, such as the so-called leaky ReLU or scaled ELU, may also be used. In one example, the non-linearity layer 2234 comprises a leaky ReLU; in this case, the output of the layer is equal to the input for input values greater than or equal to 0, and equal to a predefined proportion of the input, e.g., a * input, for input values less than 0. In one case, a may be set to 0.2.
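
The leaky ReLU described here is small enough to state directly. The function name is hypothetical; the default slope of 0.2 is the example value of a given in the text.

```python
def leaky_relu(value, a=0.2):
    """Return value for value >= 0, and a * value for value < 0."""
    return value if value >= 0 else a * value
```

For example, an input of 3.0 passes through unchanged, while an input of -5.0 is scaled by 0.2.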

[0572] Figure 22D shows an example 2240 with one implementation of the optional post-processing operation 2238 from Figure 22C. In this case, the post-processing operation may comprise an inverse transform operation 2242. The second convolutional layer 2236 may then output a tensor of size (size1, size2, number_of_coefficients), i.e., with the same spatial size as the input but with channels representing each direction within the directional decomposition. The inverse transform operation 2242 may be similar to the inverse transform operation performed in the level 1 enhancement layer. In this case, the second convolutional layer 2236 may be viewed as outputting coefficient estimates for the upsampled coding units (e.g., for a 2×2 coding block, the 4-channel output represents the A, H, V, and D coefficients). The inverse transform step then converts the multi-channel output into a two-dimensional set of picture elements, for example, the [A, H, V, D] vector of each input picture element is converted into a 2×2 block of picture elements at level n.
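
The conversion of an [A, H, V, D] vector into a 2×2 block can be sketched with a Hadamard-style transform pair. The sign pattern and the division by 4 in the forward direction are assumptions chosen so that the pair round-trips exactly; this section does not state the normalization actually used, and the function names are hypothetical.

```python
def forward_dd(block):
    """Assumed forward 2x2 directional decomposition of [[r00, r01], [r10, r11]]."""
    (r00, r01), (r10, r11) = block
    a = (r00 + r01 + r10 + r11) / 4  # average
    h = (r00 - r01 + r10 - r11) / 4  # horizontal difference
    v = (r00 + r01 - r10 - r11) / 4  # vertical difference
    d = (r00 - r01 - r10 + r11) / 4  # diagonal difference
    return [a, h, v, d]


def inverse_dd(coefficients):
    """Convert one [A, H, V, D] vector back into a 2x2 block of picture elements."""
    a, h, v, d = coefficients
    return [[a + h + v + d, a - h + v - d],
            [a + h - v - d, a - h - v + d]]
```

Under these assumptions, a vector in which only A is non-zero produces a flat 2×2 block, while the H, V, and D channels add horizontal, vertical, and diagonal detail.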

[0573] Similar adaptations can be provided for downsampling. The upsampling method applied at the encoder can be repeated at the decoder. Different topologies can be provided based on available processing resources.

[0574] In the above examples, the parameters of the convolutional layers can be trained based on pairs of level (n-1) and level n data. For example, the input during training may comprise reconstructed video data at a first resolution, generated by applying one or more of the encoder and decoder paths, while the ground-truth output used for training may comprise the actual corresponding content from the original signal (e.g., the higher or second resolution video data, rather than upsampled video data). The neural network upsampler is thus trained to predict the input level n video data (e.g., the input video at enhancement level 2) as closely as possible given a lower-resolution representation. If the neural network upsampler can generate an output that is closer to the input video than a comparative upsampler, this has the benefit of reducing the level 2 residuals, which further reduces the number of bits that need to be transmitted for the encoded level 2 enhancement stream. Training can be performed offline on a variety of test media content. The parameters resulting from training can then be used in an online prediction mode. These parameters can be communicated to the decoder as part of the encoded byte stream (e.g., in header information) for a group of pictures and/or during an over-the-air or wired update. In one case, different video types may have different sets of parameters (e.g., film versus live sports). In another case, different parameters may be used for different parts of a video (e.g., action sequences versus relatively static scenes).

[0575] Example encoder and decoder variants

[0576] Graphical example with optional level 0 upscaling

[0577] Figure 23 shows a graphical representation 2300 of the decoding process described in some of the examples herein. Figure 23 shows the various stages of the decoding process from left to right. The example of Figure 23 demonstrates how additional upsampling can be applied after the base image has been decoded. Figures 25 and 26 show an example encoder and an example decoder used to perform this variant.

[0578] At the far left of Figure 23, a decoded base image 2302 is shown. This may comprise the output of a base decoder as described in the examples herein. In the current example, optional upsampling (i.e., upscaling) is performed on the lower-resolution decoded base image 2302. For example, in one case, a further downsampling component can be selectively applied prior to the base encoder 112 or 332 of Figures 1, 3A and 3B. The lower-resolution decoded base image 2302 can be treated as a level 0 or layer 0 signal. Upsampling of the decoded base image can be applied based on a signalled scaling factor.

[0579] Figure 23 shows a first upsampling operation used to generate a preliminary intermediate image 2304. This can be considered to be at the spatial resolution associated with the layer 1 enhancement (e.g., a level 1 or layer 1 signal). In Figure 23, the preliminary intermediate image 2304 is added 2306 to the first-layer decoded residuals 2308 (e.g., as generated by enhancement sublayer 1) to generate a combined intermediate image 2310. The combined intermediate image 2310 can then be upsampled in a second upsampling operation to generate a preliminary output image 2312. The second upsampling operation may be selectively applied depending on a signalled scaling factor (e.g., the second upsampling operation may be omitted, or performed in only one dimension instead of two). The preliminary output image 2312 can be considered to be at the layer 2 spatial resolution. The combined intermediate image 2310 may comprise the output of summing component 220 or 530, and the preliminary output image 2312 may comprise the input to summing component 258 or 558.

[0580] At stage 2314, the preliminary output image 2312 is added to the second-layer decoded residuals 2316 (e.g., as generated by enhancement sublayer 2). The second-layer decoded residuals 2316 are shown with an added 2318 contribution from information stored in a time buffer 2320. The information 2320 may reduce the amount of information required to reconstruct the second-layer residuals 2316. This can be beneficial because more data exists at the second layer (layer 2) due to the increased spatial resolution (e.g., compared with the first-layer, layer 1, resolution). In Figure 23, the output of the final addition is the final combined output image 2322. This can be viewed as a monochrome video, and / or the process can be repeated for multiple color components or planes to generate a color video output.
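The stages of Figure 23 can be summarised in a short Python sketch. It is illustrative only: nearest-neighbour upsampling stands in for whichever upsampler is actually used, the function and parameter names are hypothetical, and the time-buffer contribution is shown as a plain addition to the layer 2 residuals, as the figure depicts it.

```python
def upsample_nearest(img, scale_x=2, scale_y=2):
    """Nearest-neighbour upsampling (an assumed stand-in for the real upsampler)."""
    out = []
    for row in img:
        wide = [p for p in row for _ in range(scale_x)]
        for _ in range(scale_y):
            out.append(list(wide))
    return out


def add(a, b):
    """Element-wise addition of two equally sized 2D lists."""
    return [[pa + pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]


def reconstruct(decoded_base, l1_residuals, l2_residuals, temporal=None,
                base_scale=(1, 1), l2_scale=(2, 2)):
    """Follow Figure 23 from the decoded base image 2302 to the output image 2322."""
    preliminary_intermediate = upsample_nearest(decoded_base, *base_scale)   # 2304
    combined_intermediate = add(preliminary_intermediate, l1_residuals)      # 2310
    preliminary_output = upsample_nearest(combined_intermediate, *l2_scale)  # 2312
    if temporal is not None:
        l2_residuals = add(l2_residuals, temporal)                           # 2318
    return add(preliminary_output, l2_residuals)                             # 2322


output = reconstruct(
    decoded_base=[[10]],
    l1_residuals=[[1]],
    l2_residuals=[[1, 2], [3, 4]],
    temporal=[[1, 1], [1, 1]],
)
```

A scale of (1, 1) skips an upsampling step, and a scale such as (2, 1) upsamples in one dimension only, mirroring the selective upsampling controlled by the scaling factor.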

[0581] Fourth example decoder

[0582] Figure 24 shows a fourth example decoder 2400. The fourth example decoder 2400 can be considered a variant of the other example decoders described herein. Figure 24 illustrates some of the processes described in more detail above and below. The scheme includes an enhancement layer of residual data, which, once processed and decoded, is added to the decoded base layer. The enhancement layer further comprises two sublayers, 1 and 2, each containing a different set of residual data. There is also a temporal layer of data, which contains signaling for predicting some of the residuals at sublayer 2, for example, using a zero-motion-vector algorithm.

[0583] In Figure 24, the decoder 2400 receives a set of headers 2402. These may form part of a received combined bitstream and / or may originate from a cloud control component. The headers 2402 may include decoder configuration information, which is used by a decoder configuration component 2404 to configure the decoder 2400. The decoder configuration component 2404 may be similar to the configuration interface 1434 of Figure 14C.

[0584] Figure 24 also shows a base layer 2410 and an enhancement layer, the enhancement layer consisting of two sublayers: sublayer 1 2420 and sublayer 2 2440. These sublayers may be equivalent to the previously described layers or sublayers (e.g., layers 1 and 2, respectively). The base layer 2410 receives an encoded base 2412. As in other examples, a base decoding process 2414 decodes the encoded base 2412 to generate a layer 1 base image 2416. Without upsampling at the base layer, the layer 1 base image 2416 may comprise the preliminary intermediate image 2304. In certain other examples, the base image 2416 may be upsampled based on scaling information to generate the preliminary intermediate image 2304.

[0585] Sublayer 1 receives a set of layer 1 coefficient layers 2422. For example, the layer 1 coefficient layers 2422 may comprise layers similar to the layers 2130 for LoQ1 2122 in Figures 21A and 21B. Sublayer 2 receives a set of layer 2 coefficient layers 2442. These may comprise layers similar to the layers 2130 for LoQ2 2124 in Figures 21A and 21B. Multiple layers may be received for multiple planes, as shown in Figures 21A and 21B; that is, the process shown in Figure 24 can be applied in parallel to multiple (color) planes. In Figure 24, a temporal layer 2450 is also received. This may comprise temporal signaling such as that described above and shown in Figure 12D. Two or more of the encoded base 2412, the layer 1 coefficient layers 2422, the layer 2 coefficient layers 2442, the temporal layer 2450, and the headers 2402 may be received as a combined bitstream, for example, along the lines shown in Figure 9A or Figures 21A and 21B.

[0586] Turning to sublayer 1 2420, encoded quantized coefficients are received and processed by an entropy decoding component 2423, an inverse quantization component 2424, an inverse transform component 2425, and a smoothing filter 2426. The encoded quantized coefficients can thus be decoded, dequantized, and inverse transformed, and can be further processed by a deblocking filter to generate the decoded residuals of sublayer 1 (e.g., the residuals 2308 of enhancement sublayer 1 in Figure 23). Turning to sublayer 2 2440, encoded quantized coefficients are received for enhancement sublayer 2 and processed by an entropy decoding component 2443, a temporal processing component 2444, an inverse quantization component 2445, and an inverse transform component 2446. The encoded quantized coefficients can thus be decoded, dequantized, and inverse transformed to generate the decoded residuals of sublayer 2 (e.g., the residuals 2316 of enhancement sublayer 2 in Figure 23). Before dequantization, the decoded quantized transform coefficients can be processed by the temporal processing component 2444, which applies a time buffer. The time buffer contains the transformed residuals (i.e., coefficients) of the previous frame. The decision on whether to combine them depends on information received by the decoder indicating whether inter-frame or intra-frame prediction is to be used to reconstruct the coefficients before dequantization and inverse transformation. Here, inter-frame prediction means using the information from the time buffer, together with the additional information received in the decoded quantized transform coefficients, to predict the coefficients to be dequantized and inverse transformed.
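The sublayer 2 chain described above (temporal processing, then dequantization, then inverse transformation) can be sketched for a single 2×2 coding unit as follows. This is a minimal illustration under stated assumptions: the input is taken to be already entropy decoded, dequantization is a plain multiplication by a hypothetical step size, the inverse transform uses an assumed Hadamard-style layout with a 1/4 normalisation, and the time buffer is assumed to be overwritten with the current frame's coefficients. All names are hypothetical.

```python
def temporal_process(decoded, time_buffer, use_inter):
    """Combine decoded quantized coefficients with the time buffer (inter),
    or use them directly (intra)."""
    if use_inter:
        coeffs = [c + t for c, t in zip(decoded, time_buffer)]
    else:
        coeffs = list(decoded)
    # Assumption: the buffer keeps this frame's coefficients for the next frame.
    time_buffer[:] = coeffs
    return coeffs


def dequantize(coeffs, step):
    """Assumed uniform dequantization: scale each coefficient by the step size."""
    return [c * step for c in coeffs]


def inverse_transform_2x2(coeffs):
    """Assumed Hadamard-style inverse for [A, H, V, D] with 1/4 normalisation."""
    a, h, v, d = coeffs
    return [[(a + h + v + d) / 4, (a - h + v - d) / 4],
            [(a + h - v - d) / 4, (a - h - v + d) / 4]]


def decode_sublayer2_unit(decoded, time_buffer, use_inter, step):
    """Temporal processing, then dequantization, then inverse transform."""
    coeffs = temporal_process(decoded, time_buffer, use_inter)
    return inverse_transform_2x2(dequantize(coeffs, step))


time_buffer = [0, 0, 0, 0]
# Frame 1 is intra: its coefficients are used as received.
frame1 = decode_sublayer2_unit([4, 0, 0, 0], time_buffer, use_inter=False, step=2)
# Frame 2 is inter: only the change relative to the buffer is received.
frame2 = decode_sublayer2_unit([0, 2, 0, 0], time_buffer, use_inter=True, step=2)
```

In the inter case only the difference from the buffered coefficients has to be transmitted, which is the reduction in information that the time buffer is intended to provide.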

[0587] As described above, the base layer can be further upsampled (not illustrated) based on scaling information to generate an upsampled base (e.g., the preliminary intermediate image 2304 of Figure 23). In any case, the output of the base layer 2410 at layer 1 resolution is combined with the decoded residuals output by enhancement sublayer 1 at the first summing component 2430 to generate the combined intermediate image (e.g., ...). Figure 23 (231...

Claims

1. A method for encoding an input video into multiple encoded streams such that the encoded streams can be combined to reconstruct the input video, the method comprising: receiving a full-resolution input video; downsampling the full-resolution input video to create a downsampled video; instructing encoding of the downsampled video using a base codec to create a base encoded stream; instructing decoding of the base encoded stream to generate a reconstructed video; comparing the reconstructed video with the downsampled video to create a first residual set; encoding the first residual set to create a first-level encoded stream; decoding the first-level encoded stream to create a decoded first residual set; applying the decoded first residual set to the reconstructed video to generate a corrected reconstructed video; upsampling the corrected reconstructed video to generate an upsampled reconstructed video; comparing the upsampled reconstructed video with the full-resolution input video to create a second residual set; and encoding the second residual set to create a second-level encoded stream, including applying a time buffer to encode the difference between the current frame and the previous frame of the input video; wherein the encoding of the first and second residual sets is performed using an enhanced codec that is different from the base codec; wherein the base encoded stream, the first-level encoded stream, and the second-level encoded stream provide an encoding of the full-resolution input video; and wherein encoding each residual set comprises: applying a transform to the residual set to create a set of coefficients; applying a quantization operation to the set of coefficients to create a set of quantized coefficients; and applying an entropy encoding operation to the set of quantized coefficients.

2. The method according to claim 1, wherein the input video is decomposed into multiple consecutive frames, each frame of the input video having one or more associated color planes; wherein the method is performed frame-by-frame for each of the one or more associated color planes; and wherein encoding each residual set comprises, for a given color plane of a given frame: arranging the residual set, at the resolution of the residual set, as 2x2 or 4x4 coding units associated with a 2x2 or 4x4 set of picture elements; and encoding each coding unit independently of other coding units in the given frame.

3. The method of claim 2, wherein applying the transformation comprises applying a corresponding 2x2 or 4x4 Hadamard transform kernel such that the set of coefficients includes a set of directional components.

4. The method according to any one of claims 1 to 3, wherein applying the quantization operation includes applying a linear quantizer with a variable-size dead zone.

5. The method according to any one of claims 1 to 3, wherein applying the entropy encoding operation includes applying one or more of run-length coding and Huffman coding.

6. The method according to any one of claims 1 to 3, further comprising generating one or more headers for the decoder specifying parameters of the encoding, said parameters including one or more of the following: base codec parameters; transform parameters; and quantization parameters.

7. The method according to any one of claims 1 to 3, wherein applying the time buffer comprises: deriving a set of time coefficients from the time buffer; and subtracting the set of time coefficients from the set of coefficients created by applying the transform to the second residual set.

8. The method according to any one of claims 1 to 3, wherein encoding a given set of residuals from one or more of the first set of residuals and the second set of residuals comprises: classifying the given residual set based on a prior analysis; and selecting a subset of the given residual set to be encoded, the encoding comprising subsequently transforming the given residual set.

9. The method according to any one of claims 1 to 3, wherein upsampling comprises applying a neural network upsampler.

10. The method of claim 9, wherein the neural network upsampler comprises a two-layer convolutional neural network with a non-linearity between the two layers, the convolutional neural network being trained to predict the full-resolution input video given a corresponding reconstructed video.

11. The method according to any one of claims 1 to 3, comprising: obtaining temporal mode metadata for a plurality of coding units; determining, based on the obtained temporal mode metadata, a temporal mode to use in encoding the plurality of coding units; and generating temporal mode signaling data for the plurality of coding units based on the determined temporal mode and the obtained temporal mode metadata.

12. The method according to any one of claims 1 to 3, comprising: setting one or more time refresh parameters for frames, the time refresh parameters being used to signal a refresh of the time buffer to the decoder.

13. The method according to any one of claims 1 to 3, comprising: combining the base encoded stream, the first-level encoded stream, and the second-level encoded stream to generate a combined stream; and sending the combined stream to the decoder.

14. A method for decoding multiple encoded streams into a reconstructed output video, the method comprising: receiving a first base encoded stream; instructing decoding of the first base encoded stream using a base codec to generate a first output video; receiving a first-level encoded stream; decoding the first-level encoded stream using an enhanced codec to generate a first residual set, wherein the enhanced codec is different from the base codec; combining the first residual set with the first output video to generate a first reconstructed video; receiving a second-level encoded stream; decoding the second-level encoded stream using the enhanced codec to generate a second residual set, including applying a time buffer to data derived from the second-level encoded stream to reconstruct the second residual set; upsampling the first reconstructed video to generate an upsampled reconstructed video; and combining the second residual set with the upsampled reconstructed video to generate a second reconstructed video, the second reconstructed video comprising a reconstructed form of the originally encoded full-resolution input video; wherein decoding each of the first-level and second-level encoded streams comprises: applying an entropy decoding operation; applying a dequantization operation to data derived from the entropy decoding operation; and applying an inverse transform operation to data derived from the dequantization operation to generate the residual set.

15. The method of claim 14, further comprising: retrieving multiple decoding parameters from one or more headers associated with one or more of the first-level and second-level encoded streams, the decoding parameters being used to configure the decoding operations.

16. The method of claim 14 or 15, wherein upsampling the first reconstructed video comprises: adding the value of an element in the first residual set, from which a block in the upsampled reconstructed video is derived, to the corresponding block in the upsampled reconstructed video.

17. The method according to claim 14 or 15, wherein the input video is decomposed into multiple consecutive frames, each frame of the input video having one or more associated color planes; wherein the method is performed frame-by-frame for each of the one or more associated color planes; and wherein decoding each residual set comprises, for a given color plane of a given frame: arranging the residual set, at the resolution of the residual set, as 2x2 or 4x4 coding units associated with a 2x2 or 4x4 set of picture elements; and decoding each coding unit independently of the other coding units in the given frame.

18. The method of claim 14 or 15, wherein applying the time buffer comprises: in response to an indication of a second time mode, adding data from the time buffer to the data decoded from the second-level encoded stream.

19. An encoder comprising: an input video interface for receiving an input video; a downsampler for downsampling the input video; a base codec interface for receiving a decoded base encoded stream, the base encoded stream being generated by applying a base codec to the downsampled input video; a first residual generator for generating a first residual set by subtracting the decoded base encoded stream from the downsampled input video; a first-level encoder for encoding the first residual set to generate a first-level encoded stream; a first-level decoder for decoding the first-level encoded stream to obtain a decoded first residual set; a summing component for adding the decoded first residual set to the decoded base encoded stream to generate a corrected reconstructed video; an upsampler for upsampling the corrected reconstructed video to generate an upsampled reconstructed video; a second residual generator for generating a second residual set by subtracting the upsampled reconstructed video from the input video; a second-level encoder for encoding the second residual set to generate a second-level encoded stream; a time selection component for determining whether time prediction is to be applied to the second residual set; and a time buffer for applying time prediction to the second residual set when generating the second-level encoded stream; wherein the first-level encoder and the second-level encoder implement an enhanced codec that is different from the base codec; wherein the base encoded stream, the first-level encoded stream, and the second-level encoded stream provide an encoding of the full-resolution input video; and wherein the first-level encoder and the second-level encoder each include: a transformation component for applying a transform to the residual set to create a set of coefficients; a quantization component for applying a quantization operation to the set of coefficients to create a set of quantized coefficients; and an entropy coding component for applying an entropy coding operation to the set of quantized coefficients.

20. The encoder of claim 19, comprising: a residual mode selection component for instructing one or more of the first-level encoder and the second-level encoder to selectively rank and filter one or more of the first residual set and the second residual set.

21. The encoder according to claim 19 or claim 20, wherein: the transformation component is configured to apply a 2x2 or 4x4 Hadamard transform; and the entropy coding component is configured to apply one or more of run-length coding and Huffman coding.

22. A decoder comprising: a base layer interface for receiving a decoded form of a base encoded stream, the base encoded stream being decoded using a base codec; an enhancement layer interface for receiving a first-level encoded stream and a second-level encoded stream; a first-level decoder for decoding the first-level encoded stream to obtain a first residual set; a first summing component for adding the first residual set to the decoded form of the base encoded stream to generate a first reconstructed video; an upsampler for upsampling the first reconstructed video to generate an upsampled reconstructed video; a second-level decoder for decoding the second-level encoded stream to obtain a second residual set; a time selection component for determining, using header information accompanying one or more of the first-level and second-level encoded streams, whether time prediction is to be applied to the second residual set; a time buffer for applying time prediction to the second residual set when decoding the second-level encoded stream; and a second summing component for adding the second residual set to the upsampled reconstructed video to generate a second reconstructed video, the second reconstructed video comprising a reconstructed form of the originally encoded full-resolution input video; wherein the first-level decoder and the second-level decoder each include: an entropy coding component for applying an entropy decoding operation; a dequantization component for applying a dequantization operation to data derived from the entropy decoding operation; and an inverse transform component for applying an inverse transform operation to data derived from the dequantization operation to generate the residual set.

23. The decoder according to claim 22, wherein: the entropy coding component is configured to apply one or more of run-length coding and Huffman coding; and the inverse transform component is configured to apply a 2x2 or 4x4 inverse Hadamard transform.

24. A non-transitory computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the method according to any one of claims 1 to 18.
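By way of illustration only, the per-coding-unit encoding steps recited in claims 1 and 3 to 5 (transform, quantization, entropy coding) can be sketched in Python as follows. The function names, the un-normalised Hadamard sign layout, the particular dead-zone rule, and the (value, run) output format are assumptions made for this sketch and are not taken from the claims. Huffman coding is omitted.

```python
def hadamard_2x2(block):
    """Forward 2x2 Hadamard-style transform giving [A, H, V, D]
    (assumed un-normalised sign layout)."""
    (r00, r01), (r10, r11) = block
    return [r00 + r01 + r10 + r11,   # A: average-like component
            r00 - r01 + r10 - r11,   # H: horizontal component
            r00 + r01 - r10 - r11,   # V: vertical component
            r00 - r01 - r10 + r11]   # D: diagonal component


def quantize_deadzone(coeffs, step, deadzone):
    """Linear quantizer with a configurable dead zone around zero.

    Assumed rule: magnitudes below `deadzone` map to 0; above it,
    bins of width `step` are numbered from 1.
    """
    out = []
    for c in coeffs:
        mag = abs(c)
        if mag < deadzone:
            out.append(0)
        else:
            q = 1 + int((mag - deadzone) // step)
            out.append(q if c > 0 else -q)
    return out


def run_length_encode(symbols):
    """Collapse repeated symbols into (value, run_length) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1][1] += 1
        else:
            runs.append([s, 1])
    return [tuple(r) for r in runs]


coeffs = hadamard_2x2([[4, 3], [2, 1]])
quantized = quantize_deadzone(coeffs, step=4, deadzone=3)
encoded = run_length_encode([0, 0, 0] + quantized + [0, 0])
```

A wider dead zone sends more small coefficients to zero, which lengthens the zero runs that run-length coding can compress. This is one reason a variable dead zone suits sparse residual data.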