Adaptive selection of entropy coding parameters
Adaptive entropy coding parameters address the inefficiencies of predefined alphabet sizes by dynamically adjusting to bitrate conditions, improving reconstruction quality and efficiency in video and image encoding.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2022-06-30
- Publication Date
- 2026-07-01
Abstract
Description
[Technical Field]
[0001] This disclosure relates to entropy coding and decoding. In particular, this disclosure relates to the adaptive selection of entropy coding parameters.
[Background Art]
[0002] Video coding (video encoding and decoding) is used in a wide range of digital video applications, such as broadcast digital TV, video transmission over the internet and mobile networks, real-time conversation applications like video chat, video conferencing, DVDs and Blu-ray discs, video content acquisition and editing systems, mobile device video recording, and camcorders for security applications.
[0003] Even relatively short videos can require a considerable amount of video data to depict, potentially causing difficulties when streaming or otherwise communicating data across communication networks with limited bandwidth. Given limited network resources and the increasing demand for higher video quality, improvements in compression and decompression techniques that improve compression ratios with little to no sacrifice of image quality are desirable. Video encoding and decoding can be performed by standard video encoders and decoders compatible with, for example, H.264 / AVC, HEVC (H.265), VVC (H.266), or other video coding techniques. Furthermore, video coding, or part thereof, may be performed by neural networks.
[0004] Entropy coding is widely used in video encoding and decoding, and in the coding of other source signals such as still images, pictures, or feature channels in neural networks. The input alphabet of an entropy encoder is finite, and the size of the input alphabet must be known on both the encoder and decoder sides. A coder with a larger input alphabet size can encode a wider symbol range, but is less efficient than the same coder with a smaller input alphabet. Because of this effect, it is optimal to use the smallest possible alphabet. In conventional methods, entropy coding parameters, particularly the input alphabet size, are predefined and used for all possible input signals, resulting in clipping effects under high bitrate conditions and unwarranted bit waste under low bitrate conditions. Consequently, reconstruction quality and coding efficiency deteriorate.
[Summary of the Invention]
[Problem to be Solved by the Invention]
[0005] Embodiments of this disclosure provide an apparatus and method for entropy encoding data into a bitstream and entropy decoding data from a bitstream.
[0006] Embodiments of the present invention are defined by the features of the independent claims, and more advantageous implementations of the embodiments are defined by the features of the dependent claims.
[Means for Solving the Problem]
[0007] According to a first aspect, an embodiment of the present application provides a decoding method implemented by a decoder, the decoding method comprising: receiving a bitstream including encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter.
[0008] In conventional methods, entropy coding parameters are typically predefined; for example, the alphabet size M is usually predefined by selecting it once based on the expected tensor range (or latent tensor range) and using the predefined alphabet size M for all cases. Since the size of the input alphabet of an entropy encoder is the same as the size of the output alphabet of an entropy decoder, in this specification, the alphabet size M represents the size of the input alphabet of an entropy encoder or the size of the output alphabet of an entropy decoder. In such cases, if the actual tensor range is wider than the expected tensor range, the input alphabet size determined based on the expected tensor range is inappropriate, and clipping of the coded tensor values becomes necessary. Such clipping corrupts the signal, especially if the coded tensor range differs significantly from the alphabet size. In this case, the corruption of the coded tensor is a nonlinear distortion that causes unpredictable errors in the reconstructed signal, and therefore the quality of the reconstructed signal can be greatly degraded. One implementation allows for the selection of a very large alphabet size, which can be used in all cases. However, increasing the alphabet size disadvantages compression efficiency under low bitrate conditions. While using a large alphabet size significantly increases the bitrate, it does not improve reconstruction quality.
[0009] In embodiments of this application, the decoder can obtain entropy coding parameters (particularly alphabet size) based on parameters carried in the bitstream, and since the parameters carried in the bitstream can be changed, the encoder can adaptively adjust the entropy coding parameters by changing the parameters carried in the bitstream. Thus, clipping effects can be avoided under high bitrate conditions, and rate overhead caused by an unduly large alphabet size can also be avoided under low bitrate conditions. In other words, due to the adaptability of the entropy coding parameters, particularly the alphabet size, optimal operation of the entropy encoder is possible at low bitrates (corresponding to a narrow range of coding values), resulting in bitrate savings, and clipping effects are eliminated at high bitrates (corresponding to a wide range of coding values), resulting in higher reconstructed signal quality.
[0010] Here, it should be noted that "entropy coder" can be used as a synonym for "entropy coding algorithm," which includes both the encoding and decoding algorithms. The entropy encoder may be a module that is part of the encoder, and the entropy decoder may be another module that is part of the decoder. The parameters of the entropy encoder and entropy decoder should be synchronized for correct operation, and therefore the terms "entropy encoder parameters" or "entropy coding parameters" mean the parameters of both the entropy encoder and the entropy decoder. In other words, "entropy coding parameters" may be equivalent to "parameters of the entropy encoder and entropy decoder." The entropy encoder encodes an alphabetic symbol into one or more bits in a bitstream, and the entropy decoder decodes one or more bits in the bitstream into an alphabetic symbol. On the entropy encoder side, the alphabet means the input alphabet, and on the entropy decoder side, the alphabet means the output alphabet. The size of the input alphabet on the entropy encoder side is equal to the size of the output alphabet on the entropy decoder side.
[0011] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data.
[0012] In one possible embodiment, the entropy coding parameters include at least one of: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0013] Three possible schemes for deriving the alphabet size from the bitstream are conceivable: 1) explicit signaling using a predefined predictor; 2) derivation from the quantization parameter (β); 3) explicit signaling using a predictor that corresponds to the quantization parameter.
[0014] The technique can be applied to any type of coder that uses entropy coding in its pipeline.
[0015] In one possible embodiment, the first parameter is the size of the alphabet, where the step of obtaining an entropy coding parameter based on the first parameter includes using the size of the alphabet as the first parameter.
[0016] In this embodiment, the size of the alphabet is signaled directly in the bitstream using, for example, fixed-length coding, exp-Golomb coding, or some other coding algorithm. Typical values of M can be 256, 512, or 1024. For example, when signaling 1024 using fixed-length coding, 11 bits are required (1024 in decimal is 10000000000 in binary). By contrast, when signaling log2(1024) - 9 = 1, only one bit is needed if only the values 512 and 1024 are allowed, or two bits are needed if four different alphabet sizes such as 512, 1024, 2048, and 4096 are allowed. As a result, direct signaling of M consumes more bits. However, in some exotic cases (e.g., when the alphabet size M is not a power of 2), direct signaling of M may be useful.
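The bit counts quoted in this paragraph can be checked with a short sketch. The helper names fixed_length_bits and index_bits are hypothetical and only illustrate the arithmetic; they are not part of the disclosed method.

```python
def fixed_length_bits(value: int) -> int:
    """Bits needed to write a non-negative integer in plain binary."""
    if value < 0:
        raise ValueError("value must be non-negative")
    return max(1, value.bit_length())


def index_bits(num_allowed_sizes: int) -> int:
    """Bits needed to select one entry from a list of allowed alphabet sizes."""
    if num_allowed_sizes < 1:
        raise ValueError("at least one alphabet size must be allowed")
    return max(1, (num_allowed_sizes - 1).bit_length())


# Signaling M = 1024 directly versus signaling an index into a short list.
direct = fixed_length_bits(1024)   # M written as a plain binary number
two_sizes = index_bits(2)          # only 512 and 1024 allowed
four_sizes = index_bits(4)         # 512, 1024, 2048 and 4096 allowed
print(direct, two_sizes, four_sizes)
```

The comparison makes the trade-off concrete: direct signaling costs more bits but can express sizes that are not powers of 2.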
[0017] In one possible embodiment, the first parameter is p, and the entropy coding parameter includes the size of the alphabet M, where M is a function of p.
[0018] In one possible embodiment, the step of obtaining the entropy coding parameter based on the first parameter includes M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M), and f(M) = p.
[0019] In this embodiment, instead of M itself, the output p of some invertible function f(M) is signaled in the bitstream. Such p can be signaled using fixed-length coding, exp-Golomb coding, or some other coding algorithm. Therefore, on the decoder side, M is derived based on p; specifically, M is derived as M = f^(-1)(p). The advantage of the above embodiment is that it provides greater flexibility in signaling the alphabet size, as any optimal alphabet size selected by the encoder can be signaled. In some embodiments, p is greater than or equal to 0, but in other embodiments, p can be negative. For example, the value p can be in the range [0, 5], and 3 bits are used for signaling. The function f(M) can be negotiated in advance between the encoder and the decoder.
[0020] In one possible embodiment, M satisfies one of the following: M = k^p, where k is a natural number; or M = k^(p+C), where k is a natural number and C is a constant; or M = k^(a*p+C), where k is a natural number and a and C are constants; or M = a*p+b, where a and b are constants; or M = p^2. Note that in any embodiment, A^B denotes A raised to the power B.
[0021] In one possible embodiment, p = log2(M) - 9 and M = f^(-1)(p) = 2^(p+9), where f^(-1)(p) is the inverse function of f(M), and f(M) = log2(M) - 9.
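As a minimal sketch of the mapping in paragraph [0021], the two functions below convert between M and p, assuming M is a power of 2. The function names are hypothetical and chosen only for illustration.

```python
LOG2_OFFSET = 9  # the constant in p = log2(M) - 9


def p_from_alphabet_size(m: int) -> int:
    """Encoder side: p = f(M) = log2(M) - 9, for M a power of two."""
    if m <= 0 or m & (m - 1):
        raise ValueError("M must be a positive power of two")
    return (m.bit_length() - 1) - LOG2_OFFSET


def alphabet_size_from_p(p: int) -> int:
    """Decoder side: M = f^(-1)(p) = 2^(p + 9)."""
    exponent = p + LOG2_OFFSET
    if exponent < 0:
        raise ValueError("p is too small to give an integer alphabet size")
    return 1 << exponent


for m in (256, 512, 1024, 4096):
    p = p_from_alphabet_size(m)
    print(m, p, alphabet_size_from_p(p))
```

A negative p is accepted here (M = 256 gives p = -1), which matches the remark that p can be negative in some embodiments.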
[0022] In one possible embodiment, p is signaled using one of the following: binary code, unary code, truncated unary code, or exp-Golomb code.
[0023] In one possible embodiment, p is signaled using an exp-Golomb code of order 0.
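For illustration, an order-0 exp-Golomb code for a non-negative p can be sketched as follows. Bits are represented as a string of "0" and "1" characters purely for readability; this representation and the function names are assumptions of the sketch, and a real codec would write to a bit buffer.

```python
from typing import Tuple


def exp_golomb0_encode(value: int) -> str:
    """Order-0 exp-Golomb codeword for a non-negative integer."""
    if value < 0:
        raise ValueError("order-0 exp-Golomb as sketched here is unsigned")
    binary = bin(value + 1)[2:]              # binary form of value + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix, then the binary form


def exp_golomb0_decode(bits: str) -> Tuple[int, int]:
    """Decode one codeword; return (value, number of bits consumed)."""
    zeros = 0
    while bits[zeros] == "0":                # count the leading zeros
        zeros += 1
    end = 2 * zeros + 1                      # total codeword length
    return int(bits[zeros:end], 2) - 1, end


for p in range(4):
    print(p, exp_golomb0_encode(p))
```

Small values of p get the shortest codewords, so the most common alphabet sizes cost the fewest bits when p is defined so that typical sizes map to values near 0.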
[0024] In one possible embodiment, the size of the alphabet is signaled, for example, within the parameter set section of the bitstream, for example, within the picture parameter set section of the bitstream.
[0025] In one possible embodiment, the first parameter includes at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weighting coefficient.
[0026] In the embodiments described above, the alphabet size can be derived based on several other parameters. In one exemplary implementation, the alphabet size is derived from a quantization parameter or a rate control parameter; alternatively, the alphabet size can be derived from image resolution, video resolution, frame rate, pixel density in a 3D object, etc. In a trainable codec, the alphabet size can be derived from several parameters of the loss function used during training, e.g., rate / distortion weighting coefficients, or several parameters that affect the selection of the gain vector g. The loss function may include rate and distortion components, where distortion is measured by, for example, peak signal-to-noise ratio (PSNR), multiscale structural similarity index (MS-SSIM), Video Multimethod Assessment Fusion (VMAF), or some other quality metric. For example, the loss function can be loss = beta * distortion + bits, where distortion is measured by PSNR or MS-SSIM or VMAF, bits is the number of bits spent, and beta is a weighting parameter that controls the trade-off between bitrate and reconstruction quality; beta may also be called the rate control parameter. Furthermore, it can be a quantization parameter similar to the quantization parameter (qp) in common codecs such as JPEG, HEVC, and VVC.
[0027] The advantage of the above embodiment is that since the quantization parameters or rate control parameters are already present in the bitstream and used for other procedures, such parameters can be used by the decoder to derive the alphabet size M, eliminating the need for additional signaling for information specifically used to indicate the alphabet size M, and thus saving bitrate.
[0028] In one possible embodiment, the step of obtaining an entropy coding parameter based on the first parameter includes: determining a target subrange in which the first parameter is located, wherein the acceptable range of values for the first parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the first parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter; and either using the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter, or calculating the value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
[0029] In the above embodiment, if such a rate control parameter is β, the range of β is divided into K intervals (K subranges) as follows: [β_0,β_1), [β_1,β_2), ..., [β_(K-1),β_K)
[0030] Each interval (subrange) corresponds to one alphabet size value M_i. Note that the acceptable range of β depends on the codec; for example, some codecs may allow β anywhere in (-∞, ∞), while others may allow β only in [0, ∞). In the context of this embodiment, the original large range of acceptable β values is divided into several subranges, and each subrange has a specific alphabet size value. After obtaining the parameter β from the bitstream, the decoder selects a target interval: it finds the index i for which β_i ≤ β < β_(i+1), selects [β_i, β_(i+1)) as the target interval, and uses the alphabet size value M_i corresponding to this interval as the alphabet size M. In some embodiments, each β_i in the set {β_i} may instead correspond to one alphabet size value M_i, and the alphabet size M for a particular β is calculated from one or more values M_i corresponding to the β_i adjacent to β. The value M may simply be the nearest value M_i, namely the one corresponding to the target interval, or it may be obtained by linear interpolation, bilinear interpolation, or some other interpolation from two or more M_i corresponding to the β_i adjacent to β or to the intervals adjacent to the target interval.
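The interval lookup and the interpolation variant can be sketched as follows. The interval edges and alphabet sizes in the tables are invented for illustration only; the disclosure does not give concrete values, and attaching M_i to the left edge β_i in the interpolating variant is likewise an assumption of this sketch.

```python
from bisect import bisect_right

# Hypothetical tables: K = 4 half-open intervals [β_i, β_(i+1)) and one M_i each.
BETA_EDGES = [0.0, 0.01, 0.05, 0.2, 1.0]
ALPHABET_SIZES = [256, 512, 1024, 2048]


def interval_index(beta: float) -> int:
    """Index i of the target interval with β_i <= beta < β_(i+1)."""
    if not BETA_EDGES[0] <= beta < BETA_EDGES[-1]:
        raise ValueError("beta is outside the acceptable range")
    return bisect_right(BETA_EDGES, beta) - 1


def alphabet_size_from_beta(beta: float) -> int:
    """Piecewise-constant variant: use M_i of the target interval."""
    return ALPHABET_SIZES[interval_index(beta)]


def alphabet_size_interpolated(beta: float) -> int:
    """Variant with M_i attached to β_i and linear interpolation in between."""
    i = interval_index(beta)
    if i == len(ALPHABET_SIZES) - 1:
        return ALPHABET_SIZES[-1]        # no right-hand neighbour to blend with
    t = (beta - BETA_EDGES[i]) / (BETA_EDGES[i + 1] - BETA_EDGES[i])
    return round((1 - t) * ALPHABET_SIZES[i] + t * ALPHABET_SIZES[i + 1])


print(alphabet_size_from_beta(0.03), alphabet_size_interpolated(0.03))
```

Because both encoder and decoder hold the same tables and read the same β, no extra bits are needed to agree on M.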
[0031] In one possible embodiment, the first parameter is D, and the entropy coding parameter includes the size of the alphabet M, where M is obtained based on P and D, and P is a predictor that can be derived by the decoder.
[0032] In the above embodiment, the alphabet size can be derived based on the predictor P and the first parameter signaled in the bitstream. Therefore, when receiving the bitstream, the decoder can derive the predictor P based on a predetermined parameter, analyze the first parameter from the bitstream, and then derive the alphabet size M based on the predictor P and the first parameter. The advantage of the above embodiment is that only the difference between P and M is signaled in the bitstream, so the additional bits consumed are reduced compared to M signaled in the bitstream. Also, the difference between P and M can be selected based on the content or bit rate, improving the flexibility in signaling the alphabet size. Therefore, this embodiment provides flexibility in alphabet size selection while minimizing the additional bits spent on signaling. Even in some rare cases where the alphabet size predicted from β does not function well, the encoder can still signal the difference value between M and P. This incurs a cost of a few bits, but can solve serious problems regarding the clipping effect.
[0033] In one possible embodiment, the step of obtaining the entropy coding parameter based on the first parameter includes M = s^(-1)(D, P), where s^(-1)(D, P) is the inverse function of s(M, P) and s(M, P) = D.
[0034] In one possible embodiment, s(M, P) includes one of the following: s(M, P) = log_k(M) - log_k(P), where k is a natural number; or s(M, P) = log_k(P) - log_k(M), where k is a natural number; or s(M, P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M, P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M, P) = a*log_k(P) - b*log_k(M) - c, where k is a natural number and a, b, and c are constants; or s(M, P) = a*M + b*P + c, where a, b, and c are constants.
[0035] In one possible embodiment, M = 2^(D + log2(P)) and D = s(M, P) = log2(M) - log2(P).
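A minimal sketch of the round trip in paragraph [0035], assuming both M and P are powers of 2 so that D is an integer. The function names are hypothetical.

```python
def log2_exact(n: int) -> int:
    """log2 of a positive power of two, computed without floating point."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a positive power of two")
    return n.bit_length() - 1


def encode_difference(m: int, predictor: int) -> int:
    """Encoder side: D = s(M, P) = log2(M) - log2(P)."""
    return log2_exact(m) - log2_exact(predictor)


def decode_alphabet_size(d: int, predictor: int) -> int:
    """Decoder side: M = 2^(D + log2(P))."""
    exponent = d + log2_exact(predictor)
    if exponent < 0:
        raise ValueError("D is too small for this predictor")
    return 1 << exponent


# With P = 512, a larger or smaller alphabet is a small signed correction.
for m in (256, 512, 2048):
    d = encode_difference(m, 512)
    print(m, d, decode_alphabet_size(d, 512))
```

When the predictor is accurate, D is 0, which is the cheapest value to signal with a variable-length code.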
[0036] Note that the invertible function D = s(M, P) can be regarded as D = s_P(M), and M = s^(-1)(D, P) can be regarded as M = s_P^(-1)(D), where P can be any fixed number; in other words, P acts as a constant coefficient.
[0037] In one possible embodiment, D is signaled using one of the following codes, namely: binary code or unary code or truncated unary code or exp-Golomb code.
[0038] In one possible embodiment, D is signaled using an exp-Golomb code of order 0.
[0039] In one possible embodiment, P can be derived based on at least one parameter other than the first parameter carried in the bitstream.
[0040] In one possible embodiment, at least one parameter other than the first parameter includes at least one of the following: a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weighting coefficient.
[0041] In one possible embodiment, P is derived based on at least one parameter, including: obtaining a rate control parameter beta (β) from a bitstream; determining a target subrange in which the obtained β is located, wherein the acceptable range of values for the rate control parameter β is [β_0, β_K], the acceptable range [β_0, β_K] comprises several subranges, the target subrange is one of the several subranges, each of the several subranges contains at least one value of β, and each of the several subranges corresponds to one value of P; selecting a value corresponding to the target subrange as the value of P; or calculating a value of P based on one or more values corresponding to one or more subranges adjacent to the target subrange.
[0042] In one possible embodiment, the decoding method further includes the step of parsing the bitstream to obtain a flag, which is used to indicate whether the entropy coding parameter is carried directly within the bitstream.
[0043] In the above embodiment, a flag can be introduced into the bitstream to indicate a switch between the three embodiments, in which case this flag may require two bits. In another possible embodiment, the flag can be used to indicate a switch between two embodiments, in which case only one bit is required.
[0044] In one possible embodiment, when the flag is equal to a first value, it is specified that the entropy coding parameter is carried in the bitstream, in which case the first parameter is either the entropy coding parameter or the result of a transformation of the entropy coding parameter; or when the flag is equal to a second value, it is specified that the entropy coding parameter is not carried in the bitstream, and the entropy coding parameter can be derived by the decoder.
[0045] Such a solution offers a balance between bit saving and flexibility, where in most cases only one bit is used for indication if the derived entropy parameter is appropriate, while in some specific cases the entropy parameter may be explicitly signaled.
[0046] In one possible embodiment, when the flag is equal to a third value, it is specified that the difference between M and P, or the result of the conversion of the difference between M and P, is carried in the bitstream, in which case the first parameter is the difference between M and P, or the result of the conversion of the difference between M and P, where M is the size of the input alphabet and P is a predictor that can be derived by the decoder.
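The three flag values can be read as a decoder-side dispatch. The sketch below assumes the numeric values 0, 1 and 2 for the first, second and third flag values, the transform p = log2(M) - 9 for the explicit case, and D = log2(M) - log2(P) for the predicted case; all of these are illustrative choices, and the predictor is passed in as if it had already been derived from other bitstream parameters.

```python
from enum import IntEnum
from typing import Optional


class SizeMode(IntEnum):
    EXPLICIT = 0    # first value: transformed M (here p) is carried in the bitstream
    DERIVED = 1     # second value: nothing extra is carried, decoder derives M
    PREDICTED = 2   # third value: difference D against predictor P is carried


def resolve_alphabet_size(flag: int, first_parameter: Optional[int],
                          predictor: int) -> int:
    """Return M for the signaled mode; predictor is assumed to be a power of two."""
    mode = SizeMode(flag)
    if mode is SizeMode.DERIVED:
        return predictor                      # use the derived value as-is
    if first_parameter is None:
        raise ValueError("this mode needs a parameter parsed from the bitstream")
    if mode is SizeMode.EXPLICIT:
        return 1 << (first_parameter + 9)     # M = 2^(p + 9)
    # PREDICTED: M = P * 2^D, with D possibly negative
    if first_parameter >= 0:
        return predictor << first_parameter
    return predictor >> -first_parameter


print(resolve_alphabet_size(1, None, 512))
print(resolve_alphabet_size(2, 2, 512))
```

With three modes the flag needs two bits; dropping one of the modes reduces it to a single bit, as noted in paragraph [0043].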
[0047] In one possible embodiment, the entropy coder is an arithmetic coder, a range coder, or an ANS (Asymmetric Numerical Systems) coder.
[0048] In one possible embodiment, the step of reconstructing at least a portion of an input signal based on an entropy coding parameter includes the step of obtaining at least one probabilistic model, wherein the probabilistic model of the output symbol is used to indicate the probability of each possible value of the output symbol; the step of entropy decoding one or more bits in the bitstream using at least one probabilistic model and an entropy coding parameter to obtain one or more output symbols; and the step of reconstructing at least a portion of an input signal based on one or more output symbols.
[0049] In one possible embodiment, the method further includes a step of updating the probability model. For example, the probability model is updated after each output symbol, so that each output symbol has its own probability distribution of possible values. Note that the probability model may also be called a probability distribution.
[0050] In one possible embodiment, the probability model depends on the entropy coding parameter. For example, the symbol probability is distributed according to a normal distribution N(μ,σ), where N(μ,σ) denotes a Gaussian distribution with mean equal to μ and variance equal to σ^2. However, the actual probability model (as well as the mathematical or theoretical model), such as a quantized histogram, depends on the alphabet size and the probability precision within the entropy coding engine or entropy coder. That is, the entropy coding parameters can affect the histogram construction inside the entropy encoder. Basically, the alphabet size is the number of possible symbol values, so if the alphabet size is equal to 4, for example, larger values, such as the value 7, cannot be encoded or decoded. The histogram used in the entropy coder consists of the quantized probabilities of each symbol value. For example, if the alphabet is {0, 1, 2, 3}, the corresponding probabilities may be {7/16, 7/16, 1/16, 1/16}: each probability is nonzero, the probabilities sum to 1, and each probability is no smaller than the minimum probability (probability precision) supported by the entropy coding engine (1/16 in this example). If the probabilities of some symbols are lower than the minimum probability supported by the entropy coding engine, the probabilities of at least some symbols must be adjusted to ensure that the probability of each symbol is no smaller than that minimum.
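One simple way to build such a quantized histogram is sketched below. The adjustment rule (raise every count to at least 1, then correct the total on the most probable symbol) is an assumption chosen for brevity, not the procedure of the disclosure, and the function name is hypothetical.

```python
from typing import List, Sequence


def quantize_probabilities(probs: Sequence[float], precision_bits: int) -> List[int]:
    """Integer counts that sum to 2^precision_bits with every count >= 1."""
    total = 1 << precision_bits
    if len(probs) > total:
        raise ValueError("alphabet is too large for this probability precision")
    # Clamp so that no symbol falls below the minimum probability 1 / total.
    counts = [max(1, round(p * total)) for p in probs]
    diff = total - sum(counts)
    while diff != 0:
        step = 1 if diff > 0 else -1
        # Adjust the currently largest count; it stays >= 1 because the
        # alphabet size does not exceed total.
        largest = max(range(len(counts)), key=lambda j: counts[j])
        counts[largest] += step
        diff -= step
    return counts


# Alphabet {0, 1, 2, 3} with 4-bit precision (minimum probability 1/16).
print(quantize_probabilities([0.45, 0.45, 0.07, 0.03], 4))
```

The last symbol would round to a count of 0 and is raised to 1, which is the adjustment the paragraph describes for symbols below the minimum supported probability.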
[0051] According to a second aspect, an embodiment of the present application provides a decoding method for entropy decoding a bitstream, the method comprising: receiving a bitstream containing encoded data of an input signal; parsing the bitstream to obtain a flag, the flag being used to indicate whether an entropy coding parameter is carried directly within the bitstream; obtaining the entropy coding parameter based on the flag; and reconstructing at least a portion of the input signal based on the entropy coding parameter.
[0052] In the above embodiment, a flag can be introduced into the bitstream to indicate a switch between the three embodiments, in which case this flag may require two bits. In another possible embodiment, the flag can be used to indicate a switch between two embodiments, in which case only one bit is required. Such a solution provides a balance between bit saving and flexibility: in most cases only one bit is used for indication if the derived entropy parameter is appropriate, while in some specific cases the entropy parameter may be explicitly signaled.
[0053] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder.
[0054] In one possible embodiment, when the flag is equal to a first value, it is specified that the entropy coding parameter is carried in the bitstream, or that the result of the conversion of the entropy coding parameter is carried in the bitstream; or when the flag is equal to a second value, it is specified that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder.
[0055] In one possible embodiment, when the flag is equal to a third value, it is specified that the difference between M and P is carried in the bitstream, or the result of converting the difference between M and P is carried in the bitstream, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0056] In one possible embodiment, the step of obtaining an entropy coding parameter based on the flag includes: when the flag is equal to the first value, parsing the bitstream to obtain a first parameter; and either using the first parameter as the entropy coding parameter, where the first parameter is the entropy coding parameter, or obtaining the entropy coding parameter based on the first parameter, where the first parameter is a conversion result of the entropy coding parameter.
[0057] In one possible embodiment, the conversion result of the entropy coding parameter is p = f(M), where M is the entropy coding parameter and f(M) includes one of the following: f(M) = log_k(M), where k is a natural number; or f(M) = a*log_k(M) - C, where k is a natural number and a and C are predetermined constants; or f(M) = a*M + R, where a and R are predetermined constants; or f(M) = sqrt(M). The step of obtaining the entropy coding parameter based on the first parameter includes M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M).
[0058] In one possible embodiment, the first parameter is p = log2(M) - 9.
[0059] In one possible embodiment, the step of obtaining an entropy coding parameter based on the flag includes: when the flag is equal to the second value, parsing the bitstream to obtain a second parameter, the second parameter including at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weighting coefficient; and deriving the entropy coding parameter based on the second parameter.
[0060] In one possible embodiment, the step of deriving an entropy coding parameter based on the second parameter includes: determining a target subrange in which the second parameter lies, wherein the acceptable range of values for the second parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the second parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter; and either using the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter, or calculating the value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
[0061] In one possible embodiment, the step of obtaining an entropy coding parameter based on a flag includes the step of parsing a bitstream to obtain a third parameter when the flag is equal to a third value, the third parameter being the difference between M and P, or the result of a transformation of the difference between M and P, where M is an entropy coding parameter and P is a predictor derived by the decoder; the step of deriving P based on at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weight coefficient; and the step of obtaining an entropy coding parameter based on the third parameter and P.
[0062] In one possible embodiment, the result of the conversion of the difference between M and P is D = s(M, P), where s(M, P) is an invertible function and includes one of the following: s(M, P) = log_k(M) - log_k(P), where k is a natural number; or s(M, P) = log_k(P) - log_k(M), where k is a natural number; or s(M, P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M, P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M, P) = a*log_k(P) - b*log_k(M) - c, where k is a natural number and a, b, and c are constants; or s(M, P) = a*M + b*P + c, where a, b, and c are constants. The step of obtaining the entropy coding parameter based on the third parameter includes M = s^(-1)(D, P), where s^(-1)(D, P) is the inverse function of s(M, P).
[0063] According to a third aspect, an embodiment of the present application provides an encoding method implemented by an encoder, the method comprising the steps of encoding an input signal and a flag into a bitstream of a first parameter, the first parameter being used to obtain an entropy coding parameter, and transmitting the bitstream to a decoder.
[0064] In embodiments of the present invention, the decoder can obtain entropy coding parameters (particularly alphabet size) based on parameters carried in the bitstream, and since the parameters carried in the bitstream are modifiable, the encoder can adaptively adjust the entropy coding parameters by changing the parameters carried in the bitstream. Thus, clipping effects can be avoided under high bitrate conditions, and rate overhead caused by an unduly large alphabet size can also be avoided under low bitrate conditions. In other words, the adaptability of the entropy coding parameters, particularly the alphabet size, allows for optimal operation of the entropy encoder at low bitrates (corresponding to a narrow range of coding values), resulting in bitrate savings, and at high bitrates (corresponding to a wide range of coding values), clipping effects are eliminated, resulting in higher quality of the reconstructed signal.
[0065] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder.
[0066] In one possible embodiment, the first parameter is the size of the alphabet.
[0067] In one possible embodiment, the first parameter is p, where p is the result of the transformation of M, and M is the entropy coding parameter.
[0068] In one possible embodiment, p = f(M), where f(M) is an invertible function.
[0069] In one possible embodiment, f(M) includes: f(M) = a*log_k(M) - C, where k is a natural number and a and C are predetermined constants; or f(M) = a*M + b, where a and b are constants; or f(M) = sqrt(M).
[0070] In one possible embodiment, p = log2(M) - 9.
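The mapping p = log2(M) - 9 and its inverse can be illustrated with a minimal sketch. The function names are hypothetical, and the restriction to power-of-two alphabet sizes of at least 512 (so that p is a non-negative integer) is an assumption of this example, not a statement from the disclosure.

```python
def alphabet_size_to_p(M: int) -> int:
    """Map an alphabet size M to the signaled value p = log2(M) - 9."""
    # Assumption of this sketch: M is a power of two and M >= 512,
    # so that p is a non-negative integer.
    if M < 512 or (M & (M - 1)) != 0:
        raise ValueError("this sketch assumes M is a power of two >= 512")
    return (M.bit_length() - 1) - 9  # bit_length() - 1 equals log2(M) here


def p_to_alphabet_size(p: int) -> int:
    """Invert the mapping: M = 2 ** (p + 9)."""
    if p < 0:
        raise ValueError("p must be non-negative in this sketch")
    return 1 << (p + 9)
```

Because f is invertible, the decoder side can recover M exactly from p, which is why only the small value p needs to be written into the bitstream.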
[0071] In one possible embodiment, p is signaled using one of the following codes, namely: binary code, or unary code, or truncated unary code, or exp-Golomb code.
[0072] In one possible embodiment, the first parameter includes at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weight coefficient, and the first parameter is used by an entropy decoder to derive an entropy coding parameter.
[0073] In one possible embodiment, the first parameter is D obtained based on P and M, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0074] In one possible embodiment, D = s(M,P), where s(M,P) is an invertible function.
[0075] In one possible embodiment, s(M,P) includes: s(M,P) = log_k(M) - log_k(P), where k is a natural number; or s(M,P) = log_k(P) - log_k(M), where k is a natural number; or s(M,P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M,P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M,P) = a*log_k(P) - b*log_k(M) - c, where k is a natural number and a, b, and c are constants; or s(M,P) = a*M + b*P + c, where a, b, and c are constants.
[0076] In one possible embodiment, D = s(M, P) = log2(P) - log2(M).
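As an illustration of D = log2(P) - log2(M), the following sketch computes D on the encoder side and inverts it on the decoder side as M = 2^(log2(P) - D). The helper names are hypothetical, and the sketch assumes that both M and P are powers of two so that D is an integer.

```python
def log2_exact(x: int) -> int:
    """Return log2(x) for a positive power of two x."""
    if x <= 0 or (x & (x - 1)) != 0:
        raise ValueError("this sketch assumes powers of two")
    return x.bit_length() - 1


def encode_delta(M: int, P: int) -> int:
    """Encoder side: D = log2(P) - log2(M)."""
    return log2_exact(P) - log2_exact(M)


def decode_alphabet_size(D: int, P: int) -> int:
    """Decoder side: invert s(M, P) to recover M = 2 ** (log2(P) - D)."""
    exponent = log2_exact(P) - D
    if exponent < 0:
        raise ValueError("D is too large for the given predictor P")
    return 1 << exponent
```

When the predictor P is close to M, D is a small integer (possibly negative), which is cheaper to signal than M itself.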
[0077] In one possible embodiment, D is signaled using one of the following codes, namely: a binary code, or a unary code, or a truncated unary code, or an exp-Golomb code.
[0078] In one possible embodiment, the encoding method further includes the step of encoding a flag into a bitstream, the flag being used to indicate whether an entropy coding parameter is carried directly within the bitstream.
[0079] In one possible embodiment, when the flag is equal to a first value, it specifies that the entropy coding parameter is carried in the bitstream and that the first parameter is either the entropy coding parameter or the result of a transformation of the entropy coding parameter; or when the flag is equal to a second value, it specifies that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder.
[0080] In one possible embodiment, when the flag is equal to a third value, it is specified that the difference between M and P is carried in the bitstream, or the result of converting the difference between M and P is carried in the bitstream, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0081] In one possible embodiment, several possible solutions are proposed for alphabet selection on the encoder side.
[0082] In one possible embodiment, the method further includes the steps of: obtaining the minimum and maximum values of the latent space elements of an entropy encoder, where the latent space elements are the result of processing the input signal; and obtaining the size of the alphabet according to M = ceil(max{y} - min{y}) or M = 2^(ceil(log2(max{y} - min{y}))), where ceil(x) is the smallest integer greater than or equal to x, max{y} represents the maximum value of the latent space elements, min{y} represents the minimum value of the latent space elements, and M represents the size of the alphabet.
[0083] In this embodiment, the alphabet size is selected as the smallest possible number greater than the range of the coded values. For example, the minimum and maximum values of tensor y are first obtained, and the alphabet size is selected as follows: M = ceil(max{y} - min{y})
[0084] In most entropy coders, the size of the alphabet should be a power of 2, in which case the size of the alphabet can be chosen as M = 2^(ceil(log2(max{y}-min{y}))). Note that in some cases, for example when the absolute values of all y are less than 1, an additional scaling operation can be performed before entropy coding.
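The two selection rules can be written out as a short sketch. The function name and the `power_of_two` switch are hypothetical; the formulas are taken literally from the text, and the additional scaling step mentioned above is not modeled.

```python
import math


def alphabet_size(y, power_of_two=False):
    """Choose the alphabet size from the range of the latent values y.

    power_of_two=False: M = ceil(max{y} - min{y})
    power_of_two=True:  M = 2 ** ceil(log2(max{y} - min{y}))
    """
    value_range = max(y) - min(y)
    if value_range <= 0:
        raise ValueError("this sketch needs at least two distinct values")
    if power_of_two:
        return 2 ** math.ceil(math.log2(value_range))
    return math.ceil(value_range)
```

For example, latent values spanning a range of 10.5 give M = 11 under the first rule and M = 16 under the power-of-two rule.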
[0085] In one possible embodiment, the method comprises the steps of: obtaining at least two values around M0, where M0 = ceil(max{y} - min{y}) or M0 = 2^(ceil(log2(max{y} - min{y}))); calculating a loss function for the at least two values; and selecting the value having the smallest loss function among the at least two values as the size of the alphabet, where ceil(x) is the smallest integer greater than or equal to x, max{y} represents the maximum value of the latent space elements, and min{y} represents the minimum value of the latent space elements.
[0086] The loss function can include rate and distortion components. For example, the loss function can be: Loss = beta * distortion + bits, where distortion is measured by peak signal-to-noise ratio (PSNR), multiscale structural similarity index (MS-SSIM), Video Multimethod Assessment Fusion (VMAF), or other quality metrics, bits is the number of bits used, and beta is a weighting parameter that controls the ratio between bitrate and reconstruction quality. Beta is also called the rate control parameter. Clipping may occur in this approach, but the bitrate savings from using smaller alphabets compensate for a slight increase in distortion.
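The candidate search can be sketched as follows. The `evaluate` callback is hypothetical: it stands for whatever encode-and-measure procedure returns a (distortion, bits) pair for a given alphabet size, and the sketch assumes that a lower distortion value means better quality.

```python
def select_alphabet_size(candidates, evaluate, beta):
    """Return the candidate M that minimizes loss = beta * distortion + bits.

    evaluate(M) is a hypothetical callback returning (distortion, bits)
    for alphabet size M; lower distortion is assumed to be better.
    """
    best_M, best_loss = None, float("inf")
    for M in candidates:
        distortion, bits = evaluate(M)
        loss = beta * distortion + bits
        if loss < best_loss:
            best_M, best_loss = M, loss
    return best_M, best_loss


# Toy measurements, invented purely for illustration.
toy_results = {8: (4.0, 100), 16: (1.0, 120), 32: (1.0, 150)}
chosen_high_quality, _ = select_alphabet_size(toy_results, toy_results.get, beta=10.0)
chosen_low_rate, _ = select_alphabet_size(toy_results, toy_results.get, beta=1.0)
```

With the toy numbers, a large beta favors the alphabet that avoids clipping, while a small beta favors the smaller alphabet despite its higher distortion.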
[0087] According to a fourth aspect, an embodiment of the present application provides an encoding method implemented by an encoder, the method comprising: encoding an input signal and a flag into a bitstream, the flag being used to indicate whether an entropy coding parameter is directly carried in the bitstream; and transmitting the bitstream to a decoder.
[0088] In the above embodiment, a flag can be introduced into the bitstream to indicate a switch between the three embodiments, in which case the flag may require two bits. In another possible embodiment, the flag can be used to indicate a switch between two embodiments, in which case only one bit is required. Such a solution balances bit saving against flexibility: in most cases, when the derived entropy coding parameter is appropriate, only one bit is used for the indication, while in some specific cases the entropy coding parameter can be signaled explicitly.
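A decoder-side reading of such a flag might look like the sketch below. The numeric flag values, the `Mode` and `resolve_alphabet_size` names, and the reuse of the decoder-derived value as the predictor P are all assumptions made for illustration. The transforms p = log2(M) - 9 and D = log2(P) - log2(M) are borrowed from other embodiments in this description.

```python
from enum import IntEnum


class Mode(IntEnum):
    # Hypothetical numbering; the text only speaks of a first, second and third value.
    EXPLICIT = 0  # the parameter (here its transform p) is carried in the bitstream
    DERIVED = 1   # nothing is carried; the decoder derives the parameter
    DELTA = 2     # a difference D relative to a predictor P is carried


def resolve_alphabet_size(mode, payload, derived_M):
    """Return the alphabet size M for a parsed flag and its optional payload.

    derived_M is the power-of-two value the decoder derives on its own;
    this sketch also uses it as the predictor P in DELTA mode.
    """
    if mode == Mode.EXPLICIT:
        return 1 << (payload + 9)        # inverse of p = log2(M) - 9
    if mode == Mode.DERIVED:
        return derived_M
    if mode == Mode.DELTA:
        log2_P = derived_M.bit_length() - 1
        return 1 << (log2_P - payload)   # inverse of D = log2(P) - log2(M)
    raise ValueError(f"unknown mode: {mode}")
```

Three modes need a two-bit flag, whereas dropping DELTA leaves a one-bit choice between the explicit and derived modes.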
[0089] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder.
[0090] In one possible embodiment, when the flag is equal to a first value, it is specified that the entropy coding parameter is carried in the bitstream, or that the result of the conversion of the entropy coding parameter is carried in the bitstream; or when the flag is equal to a second value, it is specified that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder.
[0091] In one possible embodiment, when the flag is equal to a third value, it is specified that the difference between M and P is carried in the bitstream, or the result of converting the difference between M and P is carried in the bitstream, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0092] In one possible embodiment, the method further includes the step of encoding a first parameter into a bitstream when a flag is equal to a first value, wherein the first parameter is an entropy coding parameter or the result of a transformation of an entropy coding parameter.
[0093] In one possible embodiment, the result of the entropy coding parameter transformation is p = f(M), where M is the entropy coding parameter, and f(M) can be defined as follows: f(M) = log k (M), where k is a natural number, or f(M) = a * log k (M)-C, where k is a natural number and a and C are predetermined constants, or f(M) = aM + R, where a and R are constants, or f(M) = sqrt(M).
[0094] In one possible embodiment, the first parameter is p = log2(M) - 9.
[0095] In one possible embodiment, p is signaled using one of the following codes, namely: binary code, or unary code, or truncated unary code, or exp-Golomb code.
[0096] In one possible embodiment, p is signaled using an exp-Golomb code of order 0.
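For reference, an order-0 exp-Golomb code writes a non-negative integer v as the binary form of v + 1, preceded by as many zero bits as that binary form has digits after its leading one. The sketch below uses character strings of '0' and '1' instead of a real bit writer, which is a simplification chosen for readability, and the function names are hypothetical.

```python
def exp_golomb0_encode(value: int) -> str:
    """Encode a non-negative integer as an order-0 exp-Golomb bit string."""
    if value < 0:
        raise ValueError("order-0 exp-Golomb as sketched here is unsigned")
    code = bin(value + 1)[2:]             # binary digits of value + 1
    return "0" * (len(code) - 1) + code   # zero prefix, then the digits


def exp_golomb0_decode(bits: str, pos: int = 0):
    """Decode one codeword starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":       # count the zero prefix
        zeros += 1
    end = pos + 2 * zeros + 1             # prefix + leading one + suffix
    value = int(bits[pos + zeros:end], 2) - 1
    return value, end
```

Small values of p receive short codewords (p = 0 costs a single bit), which suits a parameter that usually stays near the bottom of its range.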
[0097] In one possible embodiment, the method further includes the step of encoding a third parameter into a bitstream when the flag is equal to a third value, where the third parameter is either the difference between M and P, or the result of converting the difference between M and P, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0098] In one possible embodiment, the result of the conversion of the difference between M and P is D = s(M,P), where s(M,P) is an invertible function and includes: s(M,P) = log_k(M) - log_k(P), where k is a natural number; or s(M,P) = log_k(P) - log_k(M), where k is a natural number; or s(M,P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M,P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M,P) = a*log_k(P) - b*log_k(M) - c, where k is a natural number and a, b, and c are constants; or s(M,P) = a*M + b*P + c, where a, b, and c are constants. The step of obtaining the entropy coding parameter based on the third parameter includes obtaining M = s^(-1)(D,P), where s^(-1)(D,P) is the inverse function of s(M,P).
[0099] In one possible embodiment, D is signaled using one of the following codes, namely: a binary code, or a unary code, or a truncated unary code, or an exp-Golomb code.
[0100] In one possible embodiment, D is signaled using an exp-Golomb code of order 0.
[0101] According to a fifth aspect, an embodiment of the present application provides a decoding device comprising: a receiver configured to receive a bitstream including encoded data of an input signal and a first parameter; an analysis unit configured to analyze the bitstream and obtain the first parameter; an acquisition unit configured to obtain an entropy coding parameter based on the first parameter; and a reconstruction unit configured to reconstruct at least a portion of the input signal based on the entropy coding parameter.
[0102] This device offers the advantages of the method described above.
[0103] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data.
[0104] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0105] In one possible embodiment, the first parameter is the size of an alphabet, where the acquisition unit is further configured to use the first parameter as the size of an alphabet.
[0106] In one possible embodiment, the first parameter is p, and the entropy coding parameter includes the size of the alphabet M, where M is a function of p.
[0107] In one possible embodiment, the acquisition unit is further configured to obtain M as M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M), and f(M) = p.
[0108] In one possible embodiment, the acquisition unit is further configured to: determine a target subrange in which a first parameter lies, where the acceptable range of values for the first parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the first parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter; use the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter; or calculate the value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
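The subrange lookup can be sketched with a sorted list of boundaries and a table with one entry per subrange. The boundary values, the table contents, and the function name are hypothetical, and only the direct lookup is shown, not the variant that computes a value from adjacent subranges.

```python
from bisect import bisect_right


def lookup_parameter(value, boundaries, table):
    """Return the entropy coding parameter of the subrange containing value.

    boundaries: sorted inner boundaries b0 < b1 < ... splitting the value range
    table:      one parameter per subrange, so len(table) == len(boundaries) + 1
    Subrange i covers boundaries[i-1] <= value < boundaries[i].
    """
    if len(table) != len(boundaries) + 1:
        raise ValueError("need exactly one table entry per subrange")
    return table[bisect_right(boundaries, value)]


# Hypothetical example: a rate control parameter mapped to an alphabet size.
BOUNDARIES = [0.01, 0.1, 1.0]
ALPHABET_SIZES = [256, 512, 1024, 2048]
```

Because encoder and decoder share the same table, no alphabet size has to be transmitted when the first parameter is already known to both sides.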
[0109] According to a sixth aspect, an embodiment of the present application provides a decoding device that includes functional units for implementing the decoding method of the second aspect or any one of the possible embodiments of the second aspect.
[0110] This device offers the advantages of the method described above.
[0111] According to a seventh aspect, an embodiment of the present application provides an encoding device comprising: an encoding unit configured to encode an input signal and a first parameter into a bitstream, the first parameter being used to obtain an entropy coding parameter; and a transmitting unit configured to transmit the bitstream to a decoder. The encoding device further includes other functional units for implementing the encoding method in any one of the possible embodiments of the third aspect.
[0112] According to the eighth aspect, an embodiment of the present application provides an encoding device comprising an encoding unit configured to encode an input signal and a flag into a bitstream, the flag being used to indicate whether an entropy coding parameter is directly carried in the bitstream, and a transmitting unit configured to transmit the bitstream to a decoder. The encoding device further comprises other functional units for implementing the encoding method in any one of the possible embodiments of the fourth aspect.
[0113] According to a ninth aspect, an embodiment of the present application provides a decoding device comprising a processing circuit configured to perform the decoding method described in the first aspect or any one of the possible embodiments of the first aspect.
[0114] According to a tenth aspect, an embodiment of the present application provides a decoding device comprising a processing circuit configured to perform the decoding method described in the second aspect or any one of the possible embodiments of the second aspect.
[0115] According to an eleventh aspect, an embodiment of the present application provides an encoding device comprising a processing circuit configured to perform the encoding method described in the third aspect or any one of the possible embodiments of the third aspect.
[0116] According to a twelfth aspect, an embodiment of the present application provides an encoding device comprising a processing circuit configured to perform the encoding method described in the fourth aspect or any one of the possible embodiments of the fourth aspect.
[0117] According to a thirteenth aspect, an embodiment of the present application provides a decoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the program, when executed by the one or more processors, configures the decoder to perform the decoding method described in the first aspect or any one of the possible embodiments of the first aspect.
[0118] According to a fourteenth aspect, an embodiment of the present application provides a decoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the program, when executed by the one or more processors, configures the decoder to perform the method described in the second aspect or any one of the possible embodiments of the second aspect.
[0119] According to a fifteenth aspect, an embodiment of the present application provides an encoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the program, when executed by the one or more processors, configures the encoder to perform the method described in the third aspect or any one of the possible embodiments of the third aspect.
[0120] According to a sixteenth aspect, an embodiment of the present application provides an encoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the program, when executed by the one or more processors, configures the encoder to perform the method described in the fourth aspect or any one of the possible embodiments of the fourth aspect.
[0121] According to a seventeenth aspect, an embodiment of the present application provides a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to perform the method described in the first aspect or any one of the possible embodiments of the first aspect.
[0122] According to an eighteenth aspect, an embodiment of the present application provides a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to perform the method described in the second aspect or any one of the possible embodiments of the second aspect.
[0123] According to a nineteenth aspect, an embodiment of the present application provides a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to perform the method described in the third aspect or any one of the possible embodiments of the third aspect.
[0124] According to a twentieth aspect, an embodiment of the present application provides a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to perform the method described in the fourth aspect or any one of the possible embodiments of the fourth aspect.
[0125] According to a twenty-first aspect, an embodiment of the present application provides a non-transitory storage medium including a bitstream encoded by the method described in the third aspect or any one of the possible embodiments of the third aspect.
[0126] According to a twenty-second aspect, an embodiment of the present application provides a non-transitory storage medium including a bitstream encoded by the method described in the fourth aspect or any one of the possible embodiments of the fourth aspect.
[0127] According to a twenty-third aspect, an embodiment of the present application provides a computer program that includes code instructions stored in a non-transitory medium and that, when executed on one or more processors, causes the steps of the method according to any one of the preceding aspects or any one of the possible embodiments of the preceding aspects to be performed.
[0128] According to the 24th aspect, an embodiment of the present application is a system for distributing a bitstream, comprising: at least one storage medium configured to store at least one bitstream generated by an encoding method described in the third aspect or any one of the possible embodiments of the third aspect; and a video streaming device configured to retrieve a bitstream from one of the at least one storage medium and transmit the bitstream to a terminal device, the video streaming device including a content server or a content distribution server.
[0129] In one possible embodiment, the system further includes: one or more processors configured to perform an encryption process on at least one bitstream to obtain at least one encrypted bitstream, and at least one storage medium configured to store the encrypted bitstream; or one or more processors configured to convert a bitstream of a first format into a bitstream of a second format, and at least one storage medium configured to store the bitstream of the second format.
[0130] In one possible embodiment, the system further includes a receiver configured to receive a first operation request, one or more processors configured to determine a target bitstream in at least one storage medium in response to the first operation request, and a transmitter configured to transmit the target bitstream to a terminal device.
[0131] In one possible embodiment, one or more processors are further configured to encapsulate a bitstream to obtain a transport stream in a first format, and a transmitter is further configured to transmit the transport stream in the first format to a terminal device for display or to transmit the transport stream in the first format to a storage space for storage.
[0132] The present invention can be implemented in hardware (HW) and / or software (SW), or any combination thereof. Furthermore, a hardware-based implementation may be combined with a software-based implementation.
[0133] Details of one or more embodiments are described in the accompanying drawings and the following description. Other features, purposes, and advantages will become apparent from the description, drawings, and claims. [Brief explanation of the drawing]
[0134] Embodiments of the present invention will be described in more detail below with reference to the attached drawings.
[0135] [Figure 1] This is a schematic diagram showing the channels processed by the layers of a neural network.
[Figure 2] This is a schematic diagram illustrating the autoencoder type of neural network.
[Figure 3] This is a schematic diagram showing an exemplary network architecture for the encoder and decoder sides, including a hyperprior model.
[Figure 4] This is a schematic diagram showing an exemplary network architecture for the encoder and decoder sides, including a hyperprior model.
[Figure 5] This is a block diagram showing the structure of a cloud-based solution for machine-based tasks such as machine vision tasks.
[Figure 6A] This is a block diagram showing a neural network-based end-to-end video compression framework.
[Figure 6B] This is a block diagram showing some illustrative details of the application of neural networks for motion field compression.
[Figure 6C] This is a block diagram showing some illustrative details of the application of neural networks for motion compensation.
[Figure 7] This is a schematic diagram showing a general scheme for an entropy coder.
[Figure 8] This is a schematic diagram illustrating a common scheme for entropy coding used in autoencoder-based coders.
[Figure 9] This is a schematic diagram showing a general scheme for an autoencoder-based coder having an entropy coder and a gain unit.
[Figure 10] This is a schematic diagram showing a rate distortion curve accompanied by an abnormal decrease in PSNR at high rates.
[Figure 11] This is a schematic diagram showing how the β range is divided into intervals.
[Figure 12] This is a schematic diagram illustrating the decoding method.
[Figure 13] This is a schematic diagram illustrating the decoding method.
[Figure 14] This is a schematic diagram illustrating the encoding method.
[Figure 15] This is a schematic diagram illustrating an exemplary method for determining the size of the alphabet in an entropy encoder.
[Figure 16] This is a schematic diagram illustrating an exemplary method for determining the size of the alphabet in an entropy encoder.
[Figure 17] This is a schematic diagram illustrating an exemplary method for determining the size of the alphabet in an entropy encoder.
[Figure 18] This is a schematic diagram illustrating the encoding method.
[Figure 19] This is a schematic diagram showing a multicore encoder that encodes the channels of input data into substreams and concatenates the substreams into a bitstream.
[Figure 20] This is a schematic diagram showing an exemplary encoder configured to implement the technology of this application.
[Figure 21] This is a schematic diagram showing an example of a decoder configured to implement the technology of this application.
[Figure 22] This is a schematic diagram showing an example of a coding system configured to implement embodiments of the present invention.
[Figure 23] This is a schematic diagram showing an example of an encoding or decoding device.
[Figure 24] This is a schematic diagram showing an example of a coding device.
[Figure 25] This is a schematic diagram showing an example of a coding system, encoding device, or decoding device.
[Figure 26] This is a schematic diagram showing an exemplary structure of a content supply system 3100 that realizes a content distribution service.
[Figure 27] This is a schematic diagram illustrating an example structure of a terminal device.
[Modes for carrying out the invention]
[0136] In the following description, reference will be made to the accompanying drawings, which form part of this disclosure and illustrate specific aspects of embodiments of the present invention, or specific aspects in which embodiments of the present invention may be used. It will be understood that embodiments of the present invention may be used in other aspects and may include structural or logical modifications not shown in the drawings. Accordingly, the following detailed description should not be construed as restrictive, and the scope of the present invention is defined by the appended claims.
[0137] For example, disclosures relating to a described method are understood to also apply to a corresponding device or system configured to perform that method, and vice versa. For example, if one or more steps of a particular method are described, the corresponding device may include one or more units, e.g., functional units (e.g., one unit that performs one or more steps, or multiple units, each performing one or more of the steps), even if such one or more units are not explicitly described or illustrated in the drawings, in order to perform the steps of the described one or more methods. On the other hand, for example, if a particular device is described based on one or more units, e.g., functional units, the corresponding method may include one step that performs the functionality of one or more units (e.g., one step that performs the functionality of one or more units, or multiple steps, each performing the functionality of one or more of the units), even if such one or more steps are not explicitly described or illustrated. Furthermore, it is understood that the various exemplary embodiments and / or features of the aspects described herein may be combined with each other unless otherwise specified.
[0138] The following provides an overview of the technical terms used and some of the frameworks in which embodiments of this disclosure may be utilized.
[0139] Video coding typically refers to the processing of a series of pictures that make up a video or video sequence. In the field of video coding, the terms frame or image are sometimes used as synonyms for picture. Video coding consists of two parts: video encoding and video decoding. Video encoding is performed on the source side and typically involves processing the original video picture (e.g., by compression) to reduce the amount of data required to represent the video picture (for more efficient storage and / or transmission). Video decoding is performed on the destination side and typically involves the reverse processing compared to the encoder in order to reconstruct the video picture. Embodiments that refer to "coding" a video picture (or a general picture, as will be discussed later) should be understood to relate to both "encoding" and "decoding" of the video picture. The combination of the encoding and decoding parts is also called a CODEC (Coding and Decoding).
[0140] Artificial neural networks
Artificial neural networks (ANNs), or connectionist systems, are computing systems vaguely inspired by the biological neural networks that make up animal brains. Such systems generally "learn" to perform tasks by considering examples, without being programmed with task-specific rules. For example, in image recognition, they may learn to identify images containing cats by analyzing example images manually labeled as "cat" or "no cat," and using the results to identify cats in other images. They do this without any prior knowledge of cats, such as having fur, a tail, whiskers, or a cat-like face. Instead, they automatically generate discriminative characteristics from the examples they process.
[0141] ANNs are based on a collection of connected units or nodes called artificial neurons, which roughly model the neurons of a biological brain. Each connection can transmit signals to other neurons, similar to synapses in a biological brain. The artificial neuron that receives the signal can then process it and send signals to the neurons it is connected to.
[0142] In an ANN implementation, the "signals" in a connection are real numbers, and the output of each neuron is calculated by some nonlinear function of the sum of its inputs. These connections are called edges. Neurons and edges typically have weights that adjust as learning progresses. The weights increase or decrease the strength of the signal in the connection. Neurons may have thresholds, and a signal is transmitted only when the aggregated signal exceeds that threshold. Typically, neurons are aggregated into layers. Different layers may perform different transformations on their inputs. Signals travel from the first layer (input layer) to the last layer (output layer), possibly passing through multiple layers.
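The computation performed by a single artificial neuron can be written out as a minimal sketch. The function name, the bias term, and the choice of the sigmoid as the nonlinear function are assumptions of this example; the description does not prescribe a particular nonlinearity.

```python
import math


def neuron_output(inputs, weights, bias=0.0):
    """Apply a nonlinear function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Sigmoid chosen here only as an example of a nonlinear function.
    return 1.0 / (1.0 + math.exp(-weighted_sum))
```

During learning, the weights (and the bias, if used) are the quantities that are adjusted, which changes how strongly each incoming signal contributes to the output.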
[0143] The initial goal of the ANN approach was to solve problems in the same way the human brain does. Over time, attention shifted to performing specific tasks, leading to deviations from biology. ANNs have been used in a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games, medical diagnosis, and even activities traditionally thought to be limited to humans, such as painting.
[0144] The name "Convolutional Neural Network" (CNN) indicates that this network uses a mathematical operation called convolution. Convolution is a special type of linear operation. A convolutional network is a neural network that uses convolution instead of general matrix multiplication in at least one of its layers.
[0145] Figure 1 schematically illustrates the general concept of processing by a neural network such as a CNN. A convolutional neural network consists of an input layer, an output layer, and several hidden layers. The input layer is the layer to which the input (such as a portion of an image as shown in Figure 1) is provided for processing. The hidden layers of a CNN typically consist of a series of convolutional layers that convolve their input by multiplication or another dot product. The result of a layer is one or more feature maps (f. maps in Figure 1), sometimes called channels. Some or all of the layers may involve subsampling. As a result, the feature maps may be smaller, as shown in Figure 1. The activation function in a CNN is usually a ReLU (Rectified Linear Unit) layer, followed by additional layers such as pooling layers, fully connected layers, and normalization layers. These layers are called hidden layers because their inputs and outputs are masked by the activation function and the final convolution. The layers are colloquially called convolutions, but this is merely a convention. Mathematically, the operation is a sliding dot product, or cross-correlation. This matters for the indices in the matrix in that it affects how the weight is determined at a specific index point.
[0146] When programming a CNN to process images, the input is a tensor with shape (number of images) x (width of images) x (height of images) x (depth of images), as shown in Figure 1. It should be noted that the depth of an image may consist of the channels of the image. After passing through the convolutional layer, the image is abstracted into a feature map with shape (number of images) x (width of feature map) x (height of feature map) x (channels of feature map). The convolutional layer in a neural network should have the following attributes: a convolutional kernel defined by width and height (hyperparameters); the number of input and output channels (hyperparameters); and the depth of the convolutional filter (input channel) must be equal to the number of channels (depth) of the input feature map.
[0147] Convolutional layers are the core building blocks of a CNN. The layer parameters consist of a set of learnable filters (kernels, as described above), which have a small receptive field but extend across the entire depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, calculating the dot product between the entries of the filter and the input to generate a two-dimensional activation map of that filter. As a result, the network learns filters that become active when they detect a particular type of feature at a given spatial location in the input.
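The sliding dot product described above can be sketched as follows. This is a minimal single-channel illustration in Python, not the implementation of this disclosure: the function name cross_correlate2d, the 4x4 input, and the 2x2 filter values are hypothetical, and a real convolutional layer would additionally sum over the full depth of the input volume.

```python
def cross_correlate2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and
    return the map of dot products, i.e. the activation map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            # Dot product between the kernel entries and the image patch.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x4 single-channel input and 2x2 filter.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [[1, 0], [0, 1]]
activation_map = cross_correlate2d(image, kernel)
print(activation_map)  # [[7, 9, 11], [15, 17, 19], [23, 25, 27]]
```

The kernel is applied without flipping, which is why the operation is a cross-correlation rather than a convolution in the strict mathematical sense.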
[0148] Another important concept in CNNs is pooling, which is a form of nonlinear downsampling. There are several nonlinear functions for implementing pooling, the most common of which is max pooling. It divides the input image into a set of non-overlapping rectangles and outputs the maximum value for each such sub-region. The pooling layer works to progressively reduce the spatial size of the representation, reduce the number of parameters in the network, the memory footprint and the amount of computation, and thus control overfitting. In CNN architectures, it is common to periodically insert pooling layers between consecutive convolutional layers. The pooling operation provides another form of transformation invariance.
[0149] Pooling layers operate independently on every depth slice of the input and resize it spatially. The most common form is a pooling layer with 2x2 filters applied with a stride of 2 along both the width and the height of each depth slice of the input, which discards 75% of the activations. Because this reduces the size of the representation so aggressively, there is a recent trend toward using smaller filters or discarding pooling layers altogether. Region of interest pooling (also known as ROI pooling) is a variation of max pooling in which the output size is fixed and the input rectangle is a parameter. Pooling is a key component of convolutional neural networks for object detection based on the Fast R-CNN architecture.
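As an illustration of the 2x2, stride-2 case described above, the following sketch applies max pooling to a single depth slice. The function name max_pool_2x2 and the sample values are hypothetical, and the sketch assumes even width and height.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 on one depth slice."""
    h, w = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, w - 1, 2)
        ]
        for r in range(0, h - 1, 2)
    ]


# Hypothetical 4x4 depth slice.
feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 0],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]
pooled = max_pool_2x2(feature_map)
kept = len(pooled) * len(pooled[0])
total = len(feature_map) * len(feature_map[0])
discarded_fraction = 1 - kept / total
print(pooled)              # [[6, 4], [7, 9]]
print(discarded_fraction)  # 0.75
```

Only one of every four activations survives, which corresponds to the 75% figure stated above.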
[0150] The ReLU mentioned above stands for rectified linear unit, and it applies a non-saturating activation function. By setting negative values to zero, it effectively removes negative values from the activation map. This increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolutional layer.
[0151] After several convolutional and max pooling layers, high-level inference in a neural network is performed via fully connected layers. Neurons within fully connected layers have connections to all activations in the previous layer, as seen in typical (non-convolutional) artificial neural networks. Thus, their activations can be computed as affine transformations involving matrix multiplication followed by bias offsets (vector addition of learned or fixed bias terms).
[0152] The "loss layer" (which includes the calculation of the loss function) specifies how to penalize the deviation between the predicted (output) label and the true label during training, and is usually the final layer of a neural network. Various loss functions suitable for different tasks may be used. Softmax loss is used to predict a single class out of K mutually exclusive classes. Sigmoid cross-entropy loss is used to predict K independent probability values in [0,1]. Euclidean loss is used for regression to real-valued labels.
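The three loss functions named above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, scaling conventions for the Euclidean loss vary between frameworks (a plain sum of squared differences is assumed here), and the sigmoid version is not hardened against very large logits.

```python
import math


def softmax_cross_entropy(logits, true_class):
    """Softmax loss for one sample with K mutually exclusive classes."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[true_class]


def sigmoid_cross_entropy(logits, targets):
    """Sum of K independent binary cross-entropies, targets in [0, 1]."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def euclidean_loss(prediction, target):
    """Sum of squared differences for regression to real-valued labels."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))


print(softmax_cross_entropy([0.0, 0.0], 0))   # log(2), about 0.693
print(sigmoid_cross_entropy([0.0], [1.0]))    # log(2), about 0.693
print(euclidean_loss([1.0, 2.0], [1.0, 4.0])) # 4.0
```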
[0153] In summary, Figure 1 shows the data flow in a typical convolutional neural network. First, the input image passes through a convolutional layer and is abstracted into a feature map containing several channels, corresponding to some filters in the set of learnable filters in that layer. The feature map is then subsampled, for example, using a pooling layer that reduces the dimension of each channel in the feature map. Next, the data comes to another convolutional layer, which may have a different number of output channels. As mentioned earlier, the number of input and output channels are hyperparameters of the layer. To establish network connectivity, these parameters must be synchronized between two connected layers so that the number of input channels in the current layer is equal to the number of output channels in the previous layer. For the first layer processing input data, e.g., an image, the number of input channels is usually equal to the number of channels in the data representation, e.g., 3 channels for an RGB or YUV representation of an image or video, or 1 channel for a grayscale image or video representation.
Autoencoders and unsupervised learning
[0154] An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. A schematic diagram is shown in Figure 2. The purpose of an autoencoder is to learn a representation (encoding) of a set of data, typically for dimensionality reduction, by training the network to ignore the "noise" in the signal. Along with the reduction side, a reconstruction side is learned, in which the autoencoder attempts to produce, from the reduced encoding, a representation that is as close as possible to its original input, hence its name. In its simplest case, given one hidden layer, the encoder stage of the autoencoder takes the input x and maps it to h: h = σ(Wx + b)
[0155] This image h is usually called the code, latent variable, or latent representation. Here, σ is an element-wise activation function such as a sigmoid function or a rectified linear unit, W is the weight matrix, and b is the bias vector. The weights and biases are usually initialized randomly and then iteratively updated during training through backpropagation. The decoder stage of the autoencoder then maps h to a reconstruction x' with the same shape as x: x' = σ'(W'h + b'). Here, σ', W', and b' of the decoder do not need to be related to the corresponding σ, W, and b of the encoder.
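The encoder and decoder mappings of a single-hidden-layer autoencoder can be sketched as follows. The weights below are arbitrary hypothetical values (no training is performed), the sigmoid is assumed for both σ and σ', and the helper names affine, encode, and decode are illustrative only.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, vector, bias):
    """Compute W @ vector + bias for a matrix given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def encode(x, W, b):
    """Encoder stage: h = sigma(W x + b)."""
    return [sigmoid(v) for v in affine(W, x, b)]


def decode(h, W_dec, b_dec):
    """Decoder stage: x' = sigma'(W' h + b')."""
    return [sigmoid(v) for v in affine(W_dec, h, b_dec)]


# Hypothetical, untrained parameters: 3 inputs -> 2 latent values -> 3 outputs.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b = [0.0, 0.1]
W_dec = [[0.4, -0.6], [0.7, 0.2], [-0.3, 0.9]]
b_dec = [0.0, 0.0, 0.0]

x = [0.2, 0.9, 0.4]
h = encode(x, W, b)                 # latent representation, 2 values
x_rec = decode(h, W_dec, b_dec)     # reconstruction, same shape as x
print(len(h), len(x_rec))           # 2 3
```

The latent h has fewer elements than x, which is the dimensionality reduction mentioned above; training would adjust W, b, W', and b' so that x' approaches x.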
[0156] Recent advances in the field of artificial neural networks, particularly convolutional neural networks, have led researchers to become interested in applying neural network-based techniques to image and video compression tasks. For example, end-to-end optimized image compression using networks based on variational autoencoders has been proposed.
[0157] Data compression is a fundamental and well-studied problem in engineering, generally formulated with the aim of designing codes with minimum entropy for a given discrete data ensemble. The solution relies heavily on knowledge of the probabilistic structure of the data, and thus the problem is closely related to probabilistic source modeling. However, since all practical codes must have finite entropy, continuous-valued data (such as vectors of image pixel intensities) must be quantized into a finite set of discrete values, which introduces errors.
[0158] In this context, known as the lossy compression problem, there is a trade-off between two competing costs: the entropy (rate) of the discrete representation and the error (distortion) resulting from quantization. Different compression applications, such as data storage and transmission over channels with limited capacity, require different rate and distortion trade-offs.
[0159] For example, JPEG uses a discrete cosine transform for blocks of pixels, while JPEG2000 uses multiscale orthogonal wavelet decomposition. Typically, the three components of a transform coding method—namely, the transform, the quantizer, and the entropy code—are optimized separately (often by manual parameter tuning). Modern video compression standards such as HEVC, VVC, and EVC also use transform representations to code the residual signal after prediction. For this purpose, several transforms are used, such as discrete cosine and sine transforms (DCT, DST), as well as low-frequency non-separable manually optimized transforms (LFNST).
Variational image compression
[0160] The variational autoencoder (VAE) framework can be thought of as a nonlinear transform coding model. This is illustrated in Figure 3, which shows the VAE framework: Encoder 101 maps the input image x to a latent representation (denoted by y) via the function y = f(x). This latent representation may also be referred to below as a part of, or a point within, the “latent space”. The function f() is a transformation function that transforms the input signal x into a more compressible representation y. Quantizer 102 transforms the latent representation y into the quantized latent representation y^.
[0161] The latent space can be understood as a compressed representation of data in which similar data points lie close to each other. The latent space is useful for learning data features and finding simpler representations of the data for analysis. (Quantized latent representation T, y^, hyperprior side information)
[0162] In Figure 3, component AE105 is an arithmetic coding module, which converts samples of the quantized latent representation y^ and side information z^ into a binary representation bitstream 1. The samples of y^ and z^ may consist of integers or floating-point numbers, for example. One purpose of the arithmetic coding module is to convert the sample values (via a binarization process) into a binary digit string (which is then included in a bitstream that may contain further parts or further side information corresponding to the coded image).
[0163] Arithmetic decoding (AD) 106 is the process of reversing the binarization, i.e., converting the binary digits back into sample values. Arithmetic decoding is provided by the arithmetic decoding module 106.
[0164] Please note that this disclosure is not limited to this particular framework. Furthermore, this disclosure is not limited to image or video compression and can similarly be applied to object detection, image generation, and recognition systems.
[0165] Figure 3 shows two interconnected subnetworks. In this context, a subnetwork is a logical division between parts of an entire network. For example, in Figure 3, modules 101, 102, 104, 105, and 106 are called the "encoder / decoder" subnetwork. The "encoder / decoder" subnetwork is responsible for encoding (generating) and decoding (parsing) the first bitstream, "bitstream 1". The second subnetwork in Figure 3, which includes modules 103, 108, 109, 110, and 107, is called the "hyperencoder / decoder" subnetwork. The second subnetwork is responsible for generating the second bitstream, "bitstream 2".
[0166] The first subnetwork is responsible for the following:
• Converting the input image x into its latent representation y (which is easier to compress) 101.
• Quantizing the latent representation y into the quantized latent representation y^ 102.
• Compressing the quantized latent representation y^ using AE with the arithmetic coding module 105 to obtain the bitstream "bitstream 1".
• Parsing bitstream 1 via AD using the arithmetic decoding module 106.
• Reconstructing the reconstructed image (x^) using the parsed data.
[0167] The purpose of the second subnetwork is to obtain the statistical properties of the samples in "bitstream 1" (e.g., mean, variance, and correlation between samples in bitstream 1) so that the compression of bitstream 1 by the first subnetwork becomes more efficient. The second subnetwork generates a second bitstream, "bitstream 2," which contains the aforementioned information (e.g., mean, variance, and correlation between samples in bitstream 1).
[0168] The second subnetwork includes converting the quantized latent representation y^ to side information z 103, quantizing the side information z to quantized side information z^, and encoding (e.g., binarizing) the quantized side information z^ to bitstream 2 109. In this example, binarization is performed by arithmetic coding (AE). The decoding part of the second subnetwork includes arithmetic decoding (AD) 110, which converts the input bitstream 2 to the decoded quantized side information.
[0169] Figure 3 shows an example of a VAE (variational autoencoder), and its details may differ in different implementations.
[0170] Most deep learning (DL)-based image / video compression systems reduce the dimensionality of a signal before converting it to binary digits (bits). For example, in the VAE framework, the encoder, which is a nonlinear transformation, maps the input image x to y, where y has smaller width and height than x. Since y has smaller width and height and is therefore smaller in size, the dimensionality (size) of the signal is reduced, and thus the signal y is easier to compress. It should be noted that, in general, an encoder does not necessarily have to reduce the size of both (or generally all) dimensions. Rather, some exemplary implementations may provide an encoder that reduces the size in only one (or generally a subset thereof) dimension.
[0171] An example of such a VAE framework is shown in Figure 4, which utilizes six downsampling layers marked 401 through 406. The network architecture includes a hyperprior model. The left side (g_a, g_s) shows the image autoencoder architecture, and the right side (h_a, h_s) corresponds to the autoencoder that implements the hyperprior. The factorized-prior model uses the same architecture for the analysis and synthesis transforms g_a and g_s. Q represents quantization, and AE and AD represent the arithmetic encoder and arithmetic decoder, respectively. The encoder applies g_a to the input image x to generate the responses y (latent representation) with spatially varying standard deviations. The encoding g_a includes multiple convolutional layers with subsampling and, as the activation function, generalized divisive normalization (GDN).
[0172] The responses are fed into h_a, which summarizes the distribution of standard deviations in z. z is then quantized, compressed, and transmitted as side information. The encoder then uses the quantized vector z^ to [...]
[0173] Layers that include downsampling are indicated by a downward arrow in the layer description. The layer description "Conv Nx5x5 / 2↓" means that the layer is a convolutional layer with N channels and a convolutional kernel size of 5x5. Here, 2↓ means that downsampling by a factor of 2 is performed in this layer. As a result of downsampling by a factor of 2, a dimension of the input signal is halved in the output. In Figure 4, 2↓ indicates that both the width and the height of the input image are reduced by a factor of 2. Since there are six downsampling layers, if the width and height of the input image 414 (denoted by x) are given by w and h, the output signal z^ 413 will have a width and height equal to w / 64 and h / 64, respectively. The modules indicated by AE and AD are the arithmetic encoder and arithmetic decoder, which are described with reference to Figure 3. The arithmetic encoder and arithmetic decoder are concrete implementations of entropy coding. AE and AD can be replaced by other means of entropy coding. In information theory, entropy coding is a lossless data compression scheme, a reversible process used to convert the values of symbols into binary representations. Furthermore, "Q" in the figure corresponds to a quantization operation; the quantization operation and the corresponding quantization unit as part of component 413 or 415 are not necessarily required and / or can be replaced by another unit.
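The size reduction across the six downsampling layers can be checked with a short calculation. The sketch below is illustrative only: the function name downsampled_size and the 1920x1088 input are hypothetical, and integer division is assumed at each layer, whereas real layers may pad their inputs.

```python
def downsampled_size(width, height, num_layers=6, factor=2):
    """Spatial size after num_layers layers that each downsample by factor."""
    for _ in range(num_layers):
        width //= factor
        height //= factor
    return width, height


# Six 2x layers give an overall factor of 2**6 = 64 in each dimension.
print(downsampled_size(1920, 1088))  # (30, 17)
```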
Cloud solutions for machine tasks
[0174] Video coding for machines (VCM) is another popular direction in computer science today. The main idea behind this approach is to transmit coded representations of image or video information that are targeted for further processing by computer vision (CV) algorithms such as object segmentation, detection, and recognition. In contrast to traditional image and video coding that targets human perception, the quality characteristic is not reconstruction quality, but rather performance on computer vision tasks such as object detection accuracy. This is illustrated in Figure 5.
[0175] Video coding for machines, also known as collaborative intelligence, is a relatively new paradigm for the efficient deployment of deep neural networks across mobile cloud infrastructure. By dividing the network between mobile and cloud, it is possible to distribute computational workloads so that the overall energy and / or latency of the system is minimized. In general, collaborative intelligence is a paradigm in which the processing of a neural network is distributed among two or more different computing nodes, e.g., between devices, but generally among any functionally defined nodes. Here, the term “node” does not mean the neural network nodes described above. Rather, a (computational) node here refers to a separate device / module (physically or at least logically) that implements a part of the neural network. Such devices may be different servers, different end-user devices, different intelligent vehicles or in-vehicle devices, a mix of servers and / or user devices and / or clouds and / or processors, etc. In other words, computing nodes may be considered nodes that belong to the same neural network and communicate with each other to transmit coded data within / for the neural network. For example, to enable the execution of complex calculations, one or more layers may be run on a first device and one or more layers on another device. However, the distribution may be finer, and a single layer may be run on multiple devices. In this disclosure, the term “multiple” refers to two or more. In some existing solutions, part of the neural network functionality is run on a device (such as a user device or edge device) or multiple such devices, and the output (feature map) is then passed to the cloud. The cloud is a collection of processing or computing systems outside the devices running part of the neural network. The concept of collaborative intelligence has also been extended to model training. 
In this case, data flows in both directions: from the cloud to the mobile device during backpropagation in training, and from the mobile device to the cloud during the forward pass in training as well as during inference.
[0176] Several studies have proposed semantic image compression by encoding deep features and reconstructing the input image from them. Compression based on uniform quantization, followed by context-based adaptive binary arithmetic coding (CABAC) from H.264, has been demonstrated. In some scenarios, it may be more efficient to send the output of a hidden layer (deep feature map) from the mobile part to the cloud than to send the compressed natural image data to the cloud and perform object detection using the reconstructed image. Efficient compression of feature maps is beneficial for image and video compression and reconstruction for both human perception and machine vision. Entropy coding methods, such as arithmetic coding, are common approaches to compressing deep features (i.e., feature maps).
[0177] Currently, video content accounts for over 80% of internet traffic, and this percentage is expected to increase further. Therefore, it is crucial to build efficient video compression systems that can produce higher-quality frames within a given bandwidth budget. In addition, most video-related computer vision tasks, such as video object detection or video object tracking, are sensitive to the quality of compressed video, and efficient video compression can benefit other computer vision tasks. Furthermore, video compression techniques are also useful for motion recognition and model compression.
End-to-end image or video compression
[0178] DNN-based image compression methods can utilize large-scale end-to-end training and advanced nonlinear transformations not available in conventional approaches. However, directly applying these techniques to build an end-to-end learning system for video compression is not straightforward. Firstly, learning how to generate and compress motion information tailored to video compression remains an unresolved problem. Video compression methods rely heavily on motion information to reduce temporal redundancy in video sequences.
[0179] A simple solution is to represent motion information using learning-based optical flow. However, current learning-based optical flow approaches aim to generate the most accurate flow field possible. Accurate optical flow is often not optimal for specific video tasks. In addition, the amount of data for optical flow is significantly larger compared to motion information in conventional compression systems, and directly applying existing compression approaches to compress optical flow values significantly increases the number of bits required to store motion information. Secondly, it is not clear how to build a DNN-based video compression system by minimizing rate distortion-based goals for both residual and motion information. Rate distortion optimization (RDO) aims to achieve higher quality (i.e., less distortion) of reconstructed frames given the number of bits (or bitrate) for compression. RDO is critical to video compression performance. To leverage the power of end-to-end training for learning-based compression systems, an RDO strategy that optimizes the entire system is required.
[0180] In "DVC: An End-to-end Deep Video Compression Framework" by Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao, Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 11006-11015, the authors proposed an end-to-end deep video compression (DVC) model that collaboratively learns motion estimation, motion compression, and residual coding.
[0181] Such an encoder is shown in FIG. 6A. Specifically, FIG. 6A shows the overall structure of an end-to-end trainable video compression framework. To compress motion information, a CNN is designed to convert the optical flow into a corresponding representation better suited for compression. Specifically, an autoencoder-style network is used to compress the optical flow. The motion vector (MV) compression network is shown in FIG. 6B. The network architecture is somewhat similar to g_a / g_s in FIG. 4. Specifically, it involves a series of convolutional operations and non-linear transformations, including GDN and IGDN. The number of output channels of each convolution (deconvolution) is 128, except for the last deconvolution layer, whose number of output channels is 2. Given an optical flow of size M×N×2, the MV encoder generates a motion representation of size M / 16×N / 16×128. Next, the motion representation is quantized, entropy-coded, and sent to the bitstream. The MV decoder receives the quantized representation and reconstructs the motion information.
[0182] FIG. 6C shows the structure of the motion compensation unit. Here, using the previous reconstructed frame x_{t-1} and the reconstructed motion information, the warping unit generates a warped frame (usually using an interpolation filter such as a bilinear interpolation filter). Then, a separate CNN with three inputs generates the predicted picture. The architecture of the motion compensation CNN is also shown in FIG. 6C.
[0183] The residual information between the original frame and the predicted frame is encoded by the residual encoder network. A highly non-linear neural network is used to convert the residual into a corresponding latent representation. Compared with the discrete cosine transform in conventional video compression systems, this approach can better utilize the power of non-linear transformation and achieve higher compression efficiency.
[0184] From the above overview, it can be seen that CNN-based architectures can be applied to both image and video compression, taking into account different parts of the video framework, including motion estimation, motion compensation, and residual coding. Entropy coding is a common method used for data compression, which is widely adopted in the industry and can also be applied to feature map compression for either human perception or computer vision tasks.
[0185] In lossless video coding, the reconstructed video picture has the same quality as the original video picture (assuming no transmission errors or other data loss during storage or transmission). In lossy video coding, further compression is performed, for example by quantization, to reduce the amount of data representing the video picture, but the video picture cannot be fully reconstructed by the decoder; that is, the quality of the reconstructed video picture is lower or worse than the quality of the original video picture.
Arithmetic coding
[0186] Entropy coding is typically used as lossless coding. Arithmetic coding is a type of entropy coding that encodes a message as a binary real number within an interval (range) representing the message. Here, the term message refers to a sequence of symbols. Symbols are selected from a predefined alphabet of symbols. For example, the alphabet may consist of two values, 0 and 1. A message using such an alphabet is a sequence of bits. The symbols (0 and 1) may appear with different frequencies within the message. In other words, the symbol probabilities may be non-uniform. In fact, the more non-uniform the distribution, the higher the compression achievable by entropy codes in general and arithmetic codes in particular. Arithmetic coding utilizes an a priori known probability model that specifies the probability of each symbol in the alphabet. The alphabet does not have to be binary. Rather, the alphabet may consist of M values, for example, from 0 to M-1. In general, an alphabet of any size may be used. Typically, the alphabet is given by the range of values in the coded data.
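The interval narrowing that underlies arithmetic coding can be illustrated with exact rational arithmetic. The sketch below is an idealized textbook illustration, not the coder of this disclosure: it omits the binarization of the final number and any renormalization, and the function names and the probability model (symbol 0 with probability 3/4, symbol 1 with probability 1/4) are hypothetical.

```python
from fractions import Fraction


def cumulative_ranges(probs):
    """Map each symbol to its sub-interval [low, high) of [0, 1)."""
    ranges, acc = {}, Fraction(0)
    for symbol, p in probs.items():
        ranges[symbol] = (acc, acc + p)
        acc += p
    return ranges


def encode_interval(message, probs):
    """Narrow [0, 1) once per symbol; any number in the result identifies the message."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        width = high - low
        s_low, s_high = ranges[symbol]
        low, high = low + width * s_low, low + width * s_high
    return low, high


def decode_value(value, probs, length):
    """Recover `length` symbols from a number inside the final interval."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        width = high - low
        for symbol, (s_low, s_high) in ranges.items():
            if low + width * s_low <= value < low + width * s_high:
                decoded.append(symbol)
                low, high = low + width * s_low, low + width * s_high
                break
    return decoded


probs = {0: Fraction(3, 4), 1: Fraction(1, 4)}  # a priori known model
message = [0, 0, 1, 0]
low, high = encode_interval(message, probs)
print(low, high)                                # 27/64 135/256
print(decode_value(low, probs, len(message)))   # [0, 0, 1, 0]
```

The width of the final interval equals the product of the symbol probabilities (27/256 here), so more probable messages map to wider intervals, which can be identified with fewer bits.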
[0187] A practically improved variant of the arithmetic coder is called a range coder, which does not use the interval [0,1) but instead uses a finite range of integers, for example, from 0 to 255. This range is divided according to the probability of each alphabet symbol. If the remaining range becomes too small to describe all alphabet symbols according to their probabilities, the range may be renormalized.
[0188] One of the main types of entropy encoders assigns a unique code to each unique symbol that occurs in the input. These entropy encoders compress data by replacing each fixed-length input symbol with a corresponding variable-length output codeword. For data streams with certain entropy characteristics, simple static codes may be useful. These static codes include universal codes (such as Elias gamma coding or Fibonacci coding) and Golomb codes (such as unary coding or Rice coding). For general data streams, codes can be constructed based on the following rule: the length of each codeword is approximately proportional to the negative logarithm of the probability of the corresponding symbol. Therefore, the most common symbols use the shortest codes. Based on the constructed code table, the coder compresses data by replacing each fixed-length input symbol with a corresponding variable-length, prefix-free output codeword. An example of such coding is Huffman coding. The main problem with such coding is that at least one bit is required for each input symbol, even if its probability is close to 1. To speed up arithmetic coders, the asymmetric numeral systems (ANS) family of entropy coding techniques was invented. Such coders combine a compression ratio close to that of arithmetic coding with a processing cost similar to that of Huffman coding.
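The relation between symbol probabilities and codeword lengths, and the one-bit-per-symbol floor of Huffman coding, can be illustrated with a short sketch. The function name huffman_code_lengths and the two probability tables are hypothetical; only the codeword lengths are computed, not the codewords themselves.

```python
import heapq
import math


def huffman_code_lengths(probs):
    """Return the Huffman codeword length (in bits) of each symbol."""
    # Heap entries: (probability, tie_breaker, symbols in this subtree).
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths


dyadic = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
print(huffman_code_lengths(dyadic))  # {'a': 1, 'b': 2, 'c': 3, 'd': 3}

skewed = {"x": 0.99, "y": 0.01}
print(huffman_code_lengths(skewed))  # {'x': 1, 'y': 1}
print(-math.log2(0.99))              # ideal cost of 'x' is far below 1 bit
```

For the dyadic table the lengths match the negative base-2 logarithms of the probabilities exactly. For the skewed table the frequent symbol still costs one full bit, which is the inefficiency that arithmetic and ANS coders avoid.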
[0189] An entropy encoder encodes symbols of the input alphabet A, of size M, into symbols of the alphabet B, of size R, using a number of output symbols that decreases as the probability of the coded symbol increases. Typically, the probability p_i of a symbol a_i from alphabet A represents the probability that a_i occurs in any sequence of symbols from alphabet A. In other words, the probability p_i represents the probability of the event that a received symbol y is equal to a_i. Unequal, non-uniform probabilities of the different symbols of the alphabet make compression possible. If all symbols of the alphabet have the same probability p_i = 1 / M, where M is the size of the alphabet A, compression is impossible.
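The statement that uniform probabilities leave no room for compression can be checked with the Shannon entropy, which gives the minimum average number of bits per symbol for a memoryless source. The sketch below uses hypothetical probability tables for an alphabet of size M = 4.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits per symbol of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


M = 4
uniform = [1 / M] * M
non_uniform = [0.5, 0.25, 0.125, 0.125]

print(entropy_bits(uniform))      # 2.0, equal to log2(M): no gain over fixed-length codes
print(entropy_bits(non_uniform))  # 1.75, below log2(M): compression is possible
```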
[0190] A general scheme for an entropy coder is shown in Figure 7. In most cases, the output alphabet is {0,1}, the size of the output alphabet is usually equal to 2, the symbols of the output binary alphabet are called bits, and the sequence of bits corresponding to the sequence of coded symbols from the input alphabet is called a bitstream. As seen in Figure 7, the output symbols of the entropy encoder, called the bitstream, are the inputs of the entropy decoder. The output of the entropy decoder is the same alphabet as the inputs of the entropy encoder. The output of the entropy decoder can also be called the decoded symbols. In addition, the alphabet means a set of symbols, and the alphabet size means the number of symbols in the input alphabet. Here, the input symbols are shown as 0,1,2,...,M-1, so the input alphabet has a total of M different symbols. Note that in this application, the alphabet size M always means the size of the input alphabet.
[0191] In autoencoder-based coding schemes, an entropy coder is used to compress latent space symbols. Distribution estimation can be performed in advance (using a pre-trained histogram) or using some additional information from the bitstream and / or information from adjacent latents. A general scheme of entropy coding used in an autoencoder-based coder is shown in Figure 8: an input signal x is transformed into a feature (or latent) tensor y. Here, "x" denotes the input signal corresponding to the image data, and "y" denotes the latent space tensor. This transformation process can be called feature extraction. The latent space tensor contains latent space elements, which are quantized and fed to the entropy encoder. In some possible implementations, the latent space elements are first processed by a gain unit as shown in Figure 9, then quantized, and then fed to the entropy encoder. The tensor y contains real numbers rather than integers (e.g., floating-point values), and the range of these numbers is unknown in advance. Since the entropy coder can only operate with a finite input alphabet, the tensor y is transformed into an integer tensor y^ containing values from 0 to M-1, where M is the alphabet size of the entropy coder. Such a transformation from a continuous set of values to a discrete set is called quantization. The transformation can include clamping, rounding, and scaling operations. In one exemplary implementation, first, y is limited to a range [...]
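One way to map real-valued latents to symbols in 0 to M-1 using rounding and clamping is sketched below. The exact procedure of the exemplary implementation is not recoverable from the text above, so the details here are assumptions made for illustration only: rounding half up, an offset of M // 2 to centre the alphabet around zero, and the function names quantize and dequantize.

```python
import math


def quantize(values, alphabet_size):
    """Map real values to integer symbols in [0, alphabet_size - 1]."""
    offset = alphabet_size // 2  # assumed centring of the alphabet around 0
    symbols = []
    for v in values:
        q = math.floor(v + 0.5) + offset                     # round, then shift
        symbols.append(min(max(q, 0), alphabet_size - 1))    # clamp
    return symbols


def dequantize(symbols, alphabet_size):
    """Undo the offset; rounding and clamping errors cannot be undone."""
    offset = alphabet_size // 2
    return [s - offset for s in symbols]


latent = [-3.7, -0.2, 0.4, 2.6, 9.1]
symbols = quantize(latent, alphabet_size=8)
print(symbols)                 # [0, 4, 4, 7, 7]
print(dequantize(symbols, 8))  # [-4, 0, 0, 3, 3]
```

The value 9.1 lies outside the range representable with M = 8 and is reconstructed as 3; this is the kind of clipping error that a too-small alphabet introduces.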
[0192] As shown in Figure 9, a gain unit can be added to the coding scheme to control the quantization error and bitstream size (bitrate). Note that a higher bitrate allows for better quality compression of the data, and a lower bitrate allows for lower quality compression. The process of adjusting the compression system parameters to achieve a desired ratio between bitrate and quality (or to achieve a desired bitrate) is called rate control. The parameters used during rate control are called rate control parameters β. The gain unit is used for rate control. The gain unit is calculated by multiplying the gain vector g by the latent space tensor y:
y_g = g * y
[0193] Figure 10 shows the rate-distortion curve for one of the autoencoder-based coders with a gain unit. The horizontal axis represents the bitrate in bits per pixel (bpp), and the vertical axis represents the peak signal-to-noise ratio (PSNR). As can be seen, when the bitrate is increased above 0.5 bpp, the PSNR of the reconstructed signal does not increase but rather decreases. This is caused by a very large clipping error in y_g, the output of the gain unit.
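The effect described here can be illustrated with a small sketch. It assumes a per-channel gain (one gain value per channel of y) and a symmetric clamping range [-M/2, M/2 - 1]; both choices, as well as the helper names `apply_gain` and `clipping_error`, are hypothetical and only serve the illustration.

```python
def apply_gain(y, g):
    """Multiply each channel of y by its gain (assumed: one gain per channel)."""
    return [[gain * v for v in channel] for channel, gain in zip(y, g)]


def clipping_error(values, M):
    """Largest absolute error caused by clamping values to [-M/2, M/2 - 1]."""
    half = M // 2
    return max(abs(v - max(-half, min(half - 1, v))) for v in values)


y = [[1.0, -2.0], [0.5, 3.0]]
y_g = apply_gain(y, g=[2.0, 100.0])
print(clipping_error(y_g[0], M=256))  # small gain: no clipping
print(clipping_error(y_g[1], M=256))  # large gain: values exceed the alphabet
```

A larger gain widens the range of the values to be coded, so a fixed alphabet size that was adequate at low bitrates starts to clip at high bitrates.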
[0194] One possible solution is to select a very large alphabet size and use it in all cases. However, increasing the alphabet size penalizes compression efficiency under some conditions; for example, a large alphabet size is not required at low bitrates. There, using a large alphabet size can significantly increase the bitrate without improving the reconstruction quality.
[0195] In conventional methods, entropy coding parameters are usually predefined; for example, the alphabet size M is typically selected once based on the expected tensor range (or latent tensor range), and this predefined alphabet size M is used for all cases. In such cases, if the actual tensor range is wider than the expected tensor range, the input alphabet size determined from the expected tensor range becomes inappropriate, and the coded tensor values must be clipped. Such clipping degrades the signal, especially when the coded tensor range differs significantly from the alphabet size. In this case, the corruption of the coded tensor is a nonlinear distortion that causes unpredictable errors in the reconstructed signal, and therefore the quality of the reconstructed signal can be greatly reduced. A very large alphabet size could instead be selected and used in all cases, but this penalizes compression efficiency under low bitrate conditions: the bitrate can increase significantly while the reconstruction quality does not improve.
[0196] To solve the above problems, embodiments of this application propose content-adaptive and bitrate-adaptive selection of entropy coding parameters. In particular, the entropy coding parameter can be the input alphabet size, so that clipping effects at high rates can be avoided without the rate overhead caused by an unduly large alphabet size at low rates. The adaptability of the entropy coding parameter, especially the alphabet size, enables optimal operation of the entropy encoder at low rates (narrow range of coded values), resulting in bitrate savings, while at high rates (wide range of coded values) there is no clipping effect, resulting in high reconstructed signal quality.
[0197] The basic idea of this solution is to make the entropy coding parameters, especially the alphabet size, adaptive to the bitrate and the content. For proper operation of entropy coding, all parameters must match between the encoder and the decoder, and thus two basic problems need to be solved: 1. How to select an appropriate alphabet size on the encoder side? 2. How to derive the alphabet size (selected on the encoder side) on the decoder side?
[0198] For the alphabet selection on the encoder side, several possible solutions are proposed.
[0199] In one possible implementation, the alphabet size can be selected as the smallest possible number higher than the range of the coded values. For example, the minimum and maximum values of the tensor y are first obtained, and the alphabet size is then selected as follows: M = ceil(max{y} - min{y}) Here, ceil(x) is the smallest integer greater than x. In most entropy encoders, the alphabet size should be a power of 2, and in this case, the alphabet size can be selected as M = 2^(ceil(log2(max{y} - min{y}))). Here, {y} means the latent space elements in the latent space, and the latent space elements are the result of processing the input signal. The process of converting the input signal into the latent space tensor y is sometimes called feature extraction. Generally, an input signal such as an input image is converted into a latent space (feature space), and the latent space elements are quantized and then encoded by an entropy encoder. The latent space can also be additionally processed before quantization (for example, multiplied by a gain vector). Note that in some cases, for example when the modulus of all y values is less than 1, an additional scaling operation can be performed before entropy coding.
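The two selection rules above can be sketched as follows. The function name `select_alphabet_size` and the `power_of_two` switch are illustrative assumptions; the sketch uses Python's `math.ceil`, which returns the smallest integer not less than its argument, and it rejects a non-positive range rather than applying the additional scaling mentioned in the text.

```python
import math


def select_alphabet_size(y, power_of_two=False):
    """Pick M from the range of the values to be coded.

    M = ceil(max{y} - min{y}), or the next power of two
    M = 2^(ceil(log2(max{y} - min{y}))) when power_of_two is True.
    """
    value_range = max(y) - min(y)
    if value_range <= 0:
        raise ValueError("the range of y must be positive")
    if power_of_two:
        return 2 ** math.ceil(math.log2(value_range))
    return math.ceil(value_range)


latents = [-100.5, 3.0, 200.25]
print(select_alphabet_size(latents))                     # 301
print(select_alphabet_size(latents, power_of_two=True))  # 512
```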
[0200] In another possible embodiment, the alphabet size can be selected based on a rate-distortion optimization process. First, several values of M around M0 = ceil(max{y} - min{y}) are tried, and the loss function is calculated for each of these values. The alphabet size M_i that minimizes the loss function is selected. The loss function may include a rate component and a distortion component, where the distortion can be measured by PSNR, multiscale structural similarity index (MS-SSIM), video multimethod assessment fusion (VMAF), or other quality metrics. In this approach, clipping may occur, but the bitrate savings from using a smaller alphabet compensate for the slight increase in distortion. For example, the loss function can be Loss = Beta * Distortion + Bits, where Distortion is measured by PSNR, MS-SSIM, or VMAF, Bits is the number of bits spent, and Beta is a weighting parameter that controls the trade-off between bitrate and reconstruction quality; Beta may also be called the rate control parameter.
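A minimal sketch of this rate-distortion selection is given below. It is a toy model under stated assumptions: distortion is the mean squared clipping and rounding error (a stand-in for the PSNR, MS-SSIM, or VMAF measures named above), the rate is approximated as log2(M) bits per symbol, and the names `clip_round` and `select_alphabet_size_rdo` are hypothetical.

```python
import math


def clip_round(v, M):
    """Round v and clamp it to [-M/2, M/2 - 1] (assumed quantizer)."""
    half = M // 2
    return max(-half, min(half - 1, round(v)))


def select_alphabet_size_rdo(y, candidates, beta):
    """Return the candidate M that minimizes Loss = beta * Distortion + Bits."""
    best_M, best_loss = None, float("inf")
    for M in candidates:
        # Toy distortion: mean squared error after quantization.
        distortion = sum((v - clip_round(v, M)) ** 2 for v in y) / len(y)
        # Toy rate model: log2(M) bits per coded symbol.
        bits = len(y) * math.log2(M)
        loss = beta * distortion + bits
        if loss < best_loss:
            best_M, best_loss = M, loss
    return best_M


y = [0.0, 1.0, -1.0, 40.0]
print(select_alphabet_size_rdo(y, [4, 16, 64, 128], beta=1.0))   # quality matters
print(select_alphabet_size_rdo(y, [4, 16, 64, 128], beta=0.01))  # rate matters
```

With a large beta the outlier 40.0 must be preserved, so the largest candidate wins; with a small beta the clipping error is tolerated in exchange for a smaller alphabet.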
[0201] Several possible solutions are proposed for deriving the alphabet size on the decoder side.
[0202] Embodiment 1 In one possible implementation, the alphabet size can be explicitly signaled in the bitstream. In one embodiment, the alphabet size can be directly signaled in the bitstream using, for example, fixed-length coding, exp-Golomb coding, or some other coding algorithm. Typical values of M can be 256, 512, or 1024. For example, signaling 1024 directly with fixed-length coding requires 11 bits (1024 in decimal is 10000000000 in binary). By contrast, when signaling log2(1024) - 9 = 1, only one bit is needed if only the values 512 and 1024 are allowed, or two bits are needed if four different alphabet sizes such as 512, 1024, 2048, and 4096 are allowed. Direct signaling of M therefore consumes more bits. However, in some exotic cases (e.g., when the alphabet size M is not a power of 2), direct signaling of M may be useful.
[0203] In an alternative embodiment, instead of M itself, the output p of some invertible function f(M) can be signaled in the bitstream, and this output p can be called the first instruction information. Such p can be signaled using fixed-length coding, exp-Golomb coding, or some other coding algorithm. On the decoder side, M is then derived based on the first instruction information; specifically, M is derived as M = f^(-1)(p). Examples of such invertible functions f(M) are as follows: 1. f(M) = log_k(M), where k is a natural number, for example k may be equal to 2; 2. f(M) = log_k(M) - C, where C is an integer used as a predictor, for example C may be equal to 9; 3. f(M) = M + R, where R is an integer used as the predictor; 4. f(M) = sqrt(M)
[0204] A preferred method is to signal p = f(M) = log2(M) - 9.
[0205] In some implementations, p is non-negative, but in others, p can be negative. For example, the value of p can be within the range [0,5], and 3 bits are used for signaling. The function f(M) is pre-negotiated between the encoder and decoder.
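The preferred mapping p = log2(M) - 9 with 3-bit fixed-length signaling can be sketched as follows. The function names `encode_p` and `decode_p`, the bit-string representation, and the error handling are assumptions made for illustration; the range [0, 5] and the 3-bit width are the example values given above.

```python
def encode_p(M, C=9, num_bits=3, p_max=5):
    """Return p = log2(M) - C as a fixed-length bit string."""
    if M <= 0 or M & (M - 1):
        raise ValueError("M must be a positive power of two")
    p = (M.bit_length() - 1) - C          # exact log2 for powers of two
    if not 0 <= p <= p_max:
        raise ValueError("p is outside the allowed range")
    return format(p, "0{}b".format(num_bits))


def decode_p(bits, C=9):
    """Inverse mapping M = 2^(p + C)."""
    return 1 << (int(bits, 2) + C)


code = encode_p(1024)
print(code, decode_p(code))  # 3 bits instead of the 11 needed for 1024 itself
```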
[0206] In one possible embodiment, the size of the alphabet is signaled in a parameter set section of the bitstream, for example, in the picture parameter set section.
[0207] The advantage of Embodiment 1 described above is that it can signal any optimal alphabet size selected on the encoder side, thus improving the flexibility of signaling the alphabet size. The only drawback is that the bitstream size increases slightly because a few bits are used for signaling.
[0208] Embodiment 2 In one possible embodiment, the alphabet size can be derived based on several other parameters. In one exemplary implementation, the alphabet size is derived from quantization parameters or rate control parameters, but alternatively, the alphabet size can also be derived from image resolution, video resolution, frame rate, pixel density in a 3D object, etc. In a trainable codec, the alphabet size can be derived from several parameters of the loss function used during training, such as rate / distortion weighting coefficients, or several parameters that affect the selection of the gain vector g. Alternatively, it can be a quantization parameter, such as the quantization parameter (qp) in common codecs like JPEG, HEVC, and VVC. For example, the loss function can be Loss = Beta * Distortion + Bitrate, where Beta is the weighting coefficient.
[0209] In one exemplary implementation, if such a rate control parameter is β, the range of β is divided into K intervals (K subranges) as follows: [β_0, β_1), [β_1, β_2), ..., [β_(K-1), β_K)
[0210] Each interval (subrange) corresponds to one alphabet size value M_i. Note that β_0 can be equal to -∞ and β_K can be equal to +∞. Note also that each codec has an acceptable range of β values; for example, some codecs may allow β to lie anywhere in (-∞, ∞), while others may only allow β in [0, ∞). In any case, there is some large range of acceptable β (beta) values. In the context of this embodiment, this original large range of acceptable β values is divided into several subranges, and each subrange has a specific alphabet size value. One specific division of β values into intervals is shown in Figure 11.
[0211] In this case, after obtaining the parameter β from the bitstream, the decoder can select a target interval based on the β value. Specifically, the decoder determines i such that β_i ≤ β < β_(i+1), selects the interval [β_i, β_(i+1)) as the target interval, and derives the corresponding alphabet size M_i as the input alphabet size M on the decoder side.
[0212] In some embodiments, each β_i in a set {β_i} of β values can correspond to one alphabet size M_i, and the alphabet size M corresponding to a particular β is calculated based on one or more M_i corresponding to the β_i adjacent to β. M can simply be the value M_i corresponding to the target interval, or it can be obtained by linear interpolation, bilinear interpolation, or some other interpolation from two or more M_i corresponding to β_i adjacent to β, or from two or more M_i corresponding to intervals adjacent to the target interval.
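The interval lookup of Embodiment 2 can be sketched as below, assuming half-open intervals [β_i, β_(i+1)) with β_0 = -∞ and β_K = +∞. The boundary values and alphabet sizes in the example are invented for illustration, as is the function name `alphabet_size_from_beta`; interpolation between neighboring values is not shown.

```python
import bisect


def alphabet_size_from_beta(beta, inner_boundaries, sizes):
    """Return M_i for the interval [β_i, β_(i+1)) that contains beta.

    inner_boundaries holds β_1 ... β_(K-1) in ascending order;
    β_0 = -inf and β_K = +inf are implicit. sizes holds M_0 ... M_(K-1).
    """
    if len(sizes) != len(inner_boundaries) + 1:
        raise ValueError("need exactly one alphabet size per interval")
    i = bisect.bisect_right(inner_boundaries, beta)
    return sizes[i]


boundaries = [0.01, 0.1, 1.0]     # hypothetical β_1, β_2, β_3
sizes = [256, 512, 1024, 2048]    # hypothetical M_0 ... M_3
for beta in (0.005, 0.01, 0.5, 5.0):
    print(beta, alphabet_size_from_beta(beta, boundaries, sizes))
```

Because both the encoder and the decoder hold the same table, no bits are spent on signaling M in this scheme.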
[0213] The advantage of Embodiment 2 is that quantization parameters or rate control parameters already present in the bitstream used for other procedures can be used on the decoder side to derive the alphabet size M, thus eliminating the need for additional signaling for information specifically used to indicate the alphabet size M, and thus saving bitrate. The disadvantage of Embodiment 2 is its lack of flexibility, meaning that if the derived alphabet size is not optimal for any reason, the encoder and decoder must use it despite the lower compression efficiency.
[0214] Embodiment 3 In one possible implementation, the alphabet size can be derived based on a predictor P and second instruction information, which is signaled within the bitstream and indicates the difference between P and M. The predictor P can be derived by the decoder based on one of the techniques described in Embodiment 2 above, using, for example, quantization parameters, rate control parameters, parameters of the loss function used during training of the trainable codec, or several parameters that affect the selection of the gain vector g. The parameters used to derive the predictor P can be selected by the encoder or predefined by a standard. Thus, upon receiving a bitstream, the decoder can derive the predictor P based on predetermined parameters, parse the second instruction information from the bitstream, and then derive the alphabet size M based on the predictor P and the second instruction information.
[0215] In one embodiment, the difference between P and M can be directly signaled within the bitstream using, for example, fixed-length coding, exp-Golomb coding, or some other coding algorithm. In an alternative embodiment, the output D of some invertible function s(M, P) can be signaled within the bitstream. Such D can be signaled with fixed-length coding, exp-Golomb coding, or some other coding algorithm. In this case, M is derived on the decoder side as M = s^(-1)(D, P). Examples of such invertible functions s(M, P) can be as follows: 1. s(M, P) = log_k(M) - log_k(P), where k is a natural number, for example, k may be equal to 2; 2. s(M, P) = log_k(P) - log_k(M), where k is a natural number, for example, k may be equal to 2; 3. s(M, P) = log_k(M) - log_k(P) - C, where C is an integer; 4. s(M, P) = log_k(P) - log_k(M) - C, where C is an integer; 5. s(M, P) = a * log_k(P) + b * log_k(M) - c, where a, b, and c are constants; 6. s(M, P) = a * M + b * P + c, where a, b, and c are constants.
[0216] Note here that A * B means A times B or A multiplies B.
[0217] A preferred method is to signal D = s(M, P) = log2(P) - log2(M).
[0218] Correspondingly, in this case, M is one of the following, namely
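M = P * k^D, or M = P / k^D, or M = P * k^(D + C), or M = P / k^(D + C), or M = k^((D + c - a * log_k(P)) / b), or M = (D - b * P - c) / a, these being the inverses of the functions s(M, P) listed in [0215]. For the preferred method, M = P / 2^D.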
[0219] Since only the difference between P and M is signaled within the bitstream, the additional bits consumed are reduced compared to M being signaled within the bitstream. In addition, the difference between P and M can be selected based on content or bitrate, improving the flexibility of signaling the alphabet size. Thus, Embodiment 3 combines the advantages of Embodiments 1 and 2 to provide flexibility in alphabet size selection while minimizing the additional bits consumed for signaling. Even in some rare cases where the alphabet size predicted from β does not work well, the encoder can still signal the difference between M and P. This costs a few bits, but can solve serious problems related to clipping effects.
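The preferred difference signaling D = log2(P) - log2(M) can be sketched as follows, assuming that both M and P are powers of two so that the logarithms are exact integers. The function names `encode_difference` and `decode_from_predictor` are hypothetical.

```python
def encode_difference(M, P):
    """Encoder side: D = log2(P) - log2(M), for powers of two M and P."""
    for value in (M, P):
        if value <= 0 or value & (value - 1):
            raise ValueError("M and P must be positive powers of two")
    return (P.bit_length() - 1) - (M.bit_length() - 1)


def decode_from_predictor(D, P):
    """Decoder side: M = P / 2^D (a negative D enlarges the alphabet)."""
    return P >> D if D >= 0 else P << -D


P = 1024                                # predictor derived as in Embodiment 2
for M in (512, 1024, 4096):
    D = encode_difference(M, P)
    print(M, D, decode_from_predictor(D, P))
```

When the predictor is usually right, D is usually 0, which a variable-length code can represent with very few bits.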
[0220] In one possible implementation, a flag can be introduced into the bitstream to indicate a switch between Embodiment 1, Embodiment 2, and Embodiment 3, in which case the flag may require 2 bits. In another possible embodiment, a flag can be used to indicate a switch between Embodiment 1 and Embodiment 2, in which case only 1 bit is required. Such solutions offer a balance between bit saving and flexibility: in most cases, when the derived entropy coding parameter is appropriate, only 1 bit is used for the indication, while in some specific cases the entropy coding parameter can be explicitly signaled.
[0221] One possible implementation specifies that when the flag is equal to a first value, Embodiment 1 is used, and the entropy coding parameter or the result of the transformation of the entropy coding parameter is carried in the bitstream. When the flag is equal to a second value, Embodiment 2 is used, and the entropy coding parameter is not carried in the bitstream, but the entropy coding parameter can be derived by the decoder. When the flag is equal to a third value, Embodiment 3 is used, and the difference between M and P, or the result of the transformation of the difference between M and P, is carried in the bitstream, where M is the size of the input alphabet and P is a predictor that can be derived by the decoder.
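The flag-controlled switch between the three embodiments can be sketched on the decoder side as follows. The concrete flag values, the payload conventions (p = log2(M) - 9 for Embodiment 1 and D = log2(P) - log2(M) for Embodiment 3), and the function name are assumptions chosen for the example; the text above only requires three distinguishable flag values.

```python
# Hypothetical flag values; the disclosure only speaks of a first,
# second, and third value.
FLAG_EXPLICIT, FLAG_DERIVED, FLAG_DIFFERENCE = 0, 1, 2


def alphabet_size_from_flag(flag, predictor, payload=None):
    """Derive M on the decoder side according to the signaled flag.

    predictor is P, derived by the decoder (e.g., from β);
    payload is the value parsed from the bitstream, if any.
    """
    if flag == FLAG_EXPLICIT:        # Embodiment 1: payload is p
        return 1 << (payload + 9)
    if flag == FLAG_DERIVED:         # Embodiment 2: nothing is signaled
        return predictor
    if flag == FLAG_DIFFERENCE:      # Embodiment 3: payload is D
        return predictor >> payload if payload >= 0 else predictor << -payload
    raise ValueError("unknown flag value")


print(alphabet_size_from_flag(FLAG_EXPLICIT, 1024, payload=2))
print(alphabet_size_from_flag(FLAG_DERIVED, 1024))
print(alphabet_size_from_flag(FLAG_DIFFERENCE, 1024, payload=1))
```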
[0222] In addition to the embodiments described above, alternative signaling schemes can also be considered. Furthermore, the alphabet size M can be derived from a given value by using an interpolation or extrapolation process. For example, p is signaled using one of the following codes: a binary code, a unary code, a truncated unary code, or an exp-Golomb code. In one possible embodiment, p is signaled using an exp-Golomb code of order 0.
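For reference, an order-0 exp-Golomb code for a non-negative value can be sketched as follows. The bit-string representation and the function names are choices made for the illustration, and negative values of p would need an additional signed mapping that is not shown.

```python
def exp_golomb_encode(value):
    """Order-0 exp-Golomb code of a non-negative integer, as a bit string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    code = bin(value + 1)[2:]              # binary representation of value + 1
    return "0" * (len(code) - 1) + code    # prefix of len(code) - 1 zeros


def exp_golomb_decode(bits):
    """Decode one codeword from the start of bits.

    Returns (value, number_of_bits_consumed).
    """
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    length = 2 * zeros + 1
    return int(bits[zeros:length], 2) - 1, length


for p in range(6):
    print(p, exp_golomb_encode(p))
```

Small values of p get short codewords (a single bit for p = 0), which suits the case where the smallest allowed alphabet size is also the most common one.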
[0223] The above embodiments can be applied to different entropy coders, such as arithmetic coders, range coders, or ANS (Asymmetric Numeral Systems) coders.
[0224] In some possible implementations, further entropy coding parameters can be adaptively selected based on at least the content or the bitrate. For example, the entropy coding parameters may include: the minimum symbol probability supported by the entropy coder; the probability precision supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc. Note that here, “entropy coder” is used as a synonym for “entropy coding algorithm,” which includes both the encoding algorithm and the decoding algorithm. The entropy encoder is a module that is part of the encoder, and the entropy decoder is another module that is part of the decoder. The parameters of the entropy encoder and entropy decoder must be synchronized for correct operation; therefore, the terms “entropy coder parameters” and “entropy coding parameters” mean the parameters of both the entropy encoder and the entropy decoder. In other words, “entropy coding parameters” can be considered equivalent to “entropy encoder and entropy decoder parameters.” An entropy encoder encodes alphabet symbols into one or more bits in a bitstream, and an entropy decoder decodes one or more bits in the bitstream back into alphabet symbols. On the entropy encoder side, the alphabet is the input alphabet, and on the entropy decoder side, the alphabet is the output alphabet. The size of the input alphabet on the entropy encoder side is equal to the size of the output alphabet on the entropy decoder side.
[0225] Figure 12 is a flowchart illustrating an exemplary decoding method implemented by a decoding device, which includes the following:
[0226] 1201: Receive a bitstream containing encoded data of the input signal and the first parameter.
[0227] 1202: Parse the bitstream and obtain the first parameter.
[0228] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data. The encoded data represents the encoded result of the input signal and consists of multiple bits. The entropy coding parameters include: the size of the alphabet of the entropy coder, where the size of the alphabet is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0229] 1203: Obtain the entropy coding parameter based on the first parameter.
[0230] 1204: Reconstruct at least a portion of the input signal based on the entropy coding parameters and encoded data.
[0231] In embodiments of this application, the decoder can obtain entropy coding parameters (particularly alphabet size) based on parameters carried in the bitstream, and since the parameters carried in the bitstream can be changed, the encoder can adaptively adjust the entropy coding parameters by changing the parameters carried in the bitstream. Thus, clipping effects can be avoided under high bitrate conditions, and rate overhead caused by an unreasonably large alphabet size can also be avoided under low bitrate conditions. In other words, due to the adaptability of the entropy coding parameters, particularly the alphabet size, optimal operation of the entropy encoder is possible at low bitrates (corresponding to a narrow range of coding values), resulting in bitrate savings, and clipping effects are eliminated at high bitrates (corresponding to a wide range of coding values), resulting in higher reconstructed signal quality.
[0232] In one possible embodiment, the step of reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data includes the steps of: obtaining at least one probabilistic model, wherein the probabilistic model of an output symbol indicates the probability of each possible value of the output symbol; obtaining one or more output symbols by entropy decoding one or more bits in the bitstream using the at least one probabilistic model and the entropy coding parameter; and reconstructing at least a portion of the input signal based on the one or more output symbols.
[0233] In one possible embodiment, the method further includes the step of updating the probability model. For example, the probability model is updated after each output symbol so that each output symbol has its own probability distribution of possible values. Note that the probability model is also called a probability distribution.
[0234] In one possible embodiment, the probability model is selected according to the entropy coding parameter. For example, the symbol probability is distributed according to a normal distribution N(μ, σ), where N(μ, σ) denotes a Gaussian distribution with mean μ and variance σ^2. However, the actual probability models that implement such a mathematical or theoretical model, such as quantized histograms, depend on the alphabet size and on the probability precision within the entropy coding engine or entropy coder. The probability precision can be the minimum probability supported by the entropy coding engine. In other words, the entropy coding parameters can affect the histogram construction within the entropy coder. Basically, the alphabet size is the number of possible symbol values, so if the alphabet size is equal to 4, for example, larger values, such as the value "7", cannot be encoded or decoded.
[0235] The histogram used by the entropy coder consists of the quantized probabilities of each symbol value. For example, if the alphabet is {0, 1, 2, 3}, the corresponding probabilities can be {7/16, 7/16, 1/16, 1/16}, where each probability is nonzero, the probabilities sum to 1, and each probability is at least the minimum probability (probability precision) supported by the entropy coding engine (1/16 in this example). If the probabilities of some symbols are lower than the minimum probability supported by the entropy coding engine, then the probabilities of at least some symbols must be adjusted so that the probability of each symbol is at least that minimum. Suppose the alphabet size is instead 8, i.e., the alphabet is {0, 1, 2, 3, 4, 5, 6, 7}, while two symbol values still have a probability of 7/16. The histogram {7/16, 7/16, 1/16, 1/16, 0/16, 0/16, 0/16, 0/16} is not allowed, because every probability should be 1/16 or greater. Therefore, the probabilities of symbols "0" and "1" must be reduced from 7/16 to 5/16 in this model, giving, for example, {5/16, 5/16, 1/16, 1/16, 1/16, 1/16, 1/16, 1/16}. Essentially, this is one reason why entropy coders with larger alphabets are less efficient: if there are many possible symbol values, each of them must have a probability greater than or equal to the minimum probability supported by the entropy coder. Therefore, even if the probability of a single symbol is very high, such as 0.99999, in the quantized histogram it becomes at most 1 - (M-1) * p_min, where M is the size of the alphabet and p_min is the minimum probability supported by the entropy coder. The maximum probability in the model thus depends on the size of the alphabet and on the minimum probability supported by the entropy coder, i.e., p_max = 1 - (M-1) * p_min.
Since p_min is tied to computational precision, it cannot in practice be made very small. For example, with p_min = 1/256 and an alphabet size M equal to 128, the maximum possible probability is
p_max = 1 - (128 - 1) / 256 = 129/256 ≈ 0.504
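The histogram adjustment described above can be sketched as follows. The redistribution rule used here (repeatedly taking one unit from, or giving one unit to, the currently largest entry) is an assumption chosen because it reproduces the example in the text; real entropy coders may normalize differently. The function names are hypothetical.

```python
def quantize_histogram(probs, total=16):
    """Quantize probabilities to integer counts out of total, each >= 1.

    A count of 1 corresponds to the minimum probability 1/total.
    """
    if len(probs) > total:
        raise ValueError("alphabet is too large for this precision")
    counts = [max(1, round(p * total)) for p in probs]
    while sum(counts) > total:                 # take from the largest entry
        counts[counts.index(max(counts))] -= 1
    while sum(counts) < total:                 # give to the largest entry
        counts[counts.index(max(counts))] += 1
    return counts


def max_probability(M, p_min):
    """p_max = 1 - (M - 1) * p_min."""
    return 1 - (M - 1) * p_min


print(quantize_histogram([7/16, 7/16, 1/16, 1/16]))               # alphabet of 4
print(quantize_histogram([7/16, 7/16, 1/16, 1/16, 0, 0, 0, 0]))   # alphabet of 8
print(max_probability(128, 1 / 256))
```

The same source statistics yield counts of 7 for the two frequent symbols with an alphabet of 4 but only 5 with an alphabet of 8, which is the efficiency penalty of an oversized alphabet.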
[0236] In one possible embodiment, the first parameter is the size of the alphabet, where the step of obtaining an entropy coding parameter based on the first parameter includes using the first parameter as the size of the alphabet.
[0237] In another possible embodiment, the first parameter is the output p of some invertible function f(M) instead of M itself, for example, the first parameter is p = f(M). In this case, the entropy coding parameter is obtained as M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M).
[0238] In one possible embodiment, f(M) can be one of the following: f(M) = log_k(M), where k is a natural number; or f(M) = log_k(M) - C, where k is a natural number and C is an integer; or f(M) = M + R, where R is an integer; or f(M) = sqrt(M).
[0239] In one possible embodiment, p = f(M) = log2(M) - 9.
[0240] Correspondingly, in this case, M is one of the following: M = k^p, where k is a natural number; or M = k^(p + C), where k is a natural number and C is an integer; or M = k^(a*p + b), where k is a natural number and a and b are constants; or M = a*p + b, where a and b are constants; or M = p^2.
[0241] Note that in any one of the embodiments, A^B means A raised to the power B.
[0242] In one possible embodiment, p = log2(M) - 9 and M = f^(-1)(p) = 2^(p+9), where f^(-1)(p) is the inverse function of f(M), and f(M) = log2(M) - 9.
[0243] In one possible embodiment, p is signaled using one of the following codes, namely: binary code, or unary code, or truncated unary code, or exp-Golomb code.
[0244] In one possible embodiment, p is signaled using an exp-Golomb code of order 0.
[0245] In one possible embodiment, the alphabet size is signaled in a parameter set section of the bitstream, for example, in the picture parameter set section.
[0246] In one possible embodiment, the first parameter can be one of several other parameters, such as a rate control parameter, image resolution, video resolution, frame rate, pixel density in a 3D object, parameters of a loss function used during training of a trainable codec (e.g., rate / distortion weighting coefficients), or parameters that affect the selection of the gain vector g. The loss function may include a rate component and a distortion component, where the distortion can be measured by peak signal-to-noise ratio (PSNR), multiscale structural similarity index (MS-SSIM), video multimethod assessment fusion (VMAF), or some other quality metric. For example, the loss function can be loss = beta * distortion + bits, where distortion is measured by PSNR, MS-SSIM, or VMAF, bits is the number of bits spent, and beta is a weighting parameter that controls the trade-off between bitrate and reconstruction quality; beta may also be called the rate control parameter. The first parameter can also be a quantization parameter, such as the quantization parameter (qp) in common codecs such as JPEG, HEVC, and VVC. In this case, the entropy coding parameters can be derived on the decoder side based on the other parameters mentioned above.
[0247] The advantage of the above embodiment is that since the quantization parameters or rate control parameters are already present in the bitstream and used for other procedures, such parameters can be used by the decoder to derive the alphabet size M, eliminating the need for additional signaling for information specifically used to indicate the alphabet size M, and thus saving bitrate.
[0248] In one possible embodiment, the step of obtaining an entropy coding parameter based on a first parameter includes the steps of determining a target subrange in which the first parameter is located, wherein the acceptable range of values for the first parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the first parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter, the steps of using the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter, or calculating a value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
[0249] In one possible embodiment, the first parameter is D, and the entropy coding parameter includes the size of the alphabet M, where M is obtained based on P and D, and P is a predictor that can be derived by the decoder.
[0250] In one possible embodiment, the first parameter may be the difference between M and P, where M is the size of the input alphabet and P is a predictor that can be derived by the decoder using one of the techniques described in Embodiment 2 above.
[0251] The advantage of the above embodiment is that, since only the difference between P and M is signaled in the bitstream, the number of additional bits used is reduced compared to M being signaled in the bitstream. Furthermore, the difference between P and M can be selected based on content or bitrate, improving the flexibility of signaling the alphabet size. Thus, this embodiment provides flexibility in alphabet size selection while minimizing the additional bits spent on signaling. In some rare cases where the alphabet size predicted from β does not work well, the encoder can still signal the difference value between M and P. This costs a few bits, but it can solve serious problems related to clipping effects.
[0252] In one possible embodiment, the first parameter is a value obtained by processing the difference between M and P, for example, the first parameter is D = s(M, P), where s(M, P) is an invertible function, and s(M, P) can be: s(M, P) = log_k(M) - log_k(P), where k is a natural number; or s(M, P) = log_k(P) - log_k(M), where k is a natural number; or s(M, P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M, P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M, P) = a*M + b*P + c, where a, b, and c are constants.
[0253] In one possible embodiment, D = s(M, P) = log2(P) - log2(M).
[0254] In one possible embodiment, the entropy coding parameter can be obtained as M = s^(-1)(D, P), where s^(-1)(D, P) is the inverse function of s(M, P).
[0255] Note that the invertible function D = s(M, P) can be regarded as D = s_P(M), and M = s^(-1)(D, P) can be regarded as M = s_P^(-1)(D), where P is an arbitrary fixed number, i.e., P is a constant coefficient.
[0256] In one possible embodiment, D is signaled using one of the following codes, namely: a binary code, or a unary code, or a truncated unary code, or an exp-Golomb code.
[0257] In one possible embodiment, D is signaled using an exp-Golomb code of order 0.
[0258] In one possible embodiment, P can be derived based on at least one parameter other than the first parameter carried in the bitstream.
[0259] In one possible embodiment, at least one parameter other than the first parameter includes at least one of the following: a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weighting coefficient.
[0260] In one possible embodiment, P is derived based on at least one parameter, including: obtaining a rate control parameter beta (β) from a bitstream; determining a target subrange in which the obtained β lies, wherein the acceptable range of values for the rate control parameter β is [β_0, β_K], the acceptable range [β_0, β_K] is divided into a plurality of subranges, the target subrange being one of the plurality of subranges, each of the plurality of subranges containing at least one value of β, and each of the plurality of subranges corresponding to one value of P; selecting a value corresponding to the target subrange as the value of P; or calculating a value of P based on one or more values corresponding to one or more subranges adjacent to the target subrange.
[0261] In one possible embodiment, the entropy coder is an arithmetic coder, a range coder, or an ANS (Asymmetric Numerical Systems) coder.
[0262] Optionally, the method further includes: 1205: parsing the bitstream to obtain a flag, where the flag is used to indicate whether the entropy coding parameter is directly carried within the bitstream.
[0263] In the above embodiment, a flag can be introduced into the bitstream to indicate a switch between the three embodiments, in which case this flag may require two bits. In another possible embodiment, the flag can be used to indicate a switch between two embodiments, in which case only one bit is required.
[0264] In one possible embodiment, when the flag is equal to a first value, it is specified that the entropy coding parameter is carried in the bitstream, in which case the first parameter is either the entropy coding parameter or the result of a transformation of the entropy coding parameter; or when the flag is equal to a second value, it is specified that the entropy coding parameter is not carried in the bitstream, and the entropy coding parameter can be derived by the decoder.
[0265] Such a solution offers a balance between bit saving and flexibility, where in most cases only one bit is used for indication if the derived entropy parameter is appropriate, while in some specific cases the entropy parameter may be explicitly signaled.
[0266] In one possible embodiment, when a flag is equal to a third value, it is specified that the difference between M and P, or the result of the conversion of the difference between M and P, is carried in the bitstream, in which case the first parameter is the difference between M and P, or the result of the conversion of the difference between M and P, where M is the size of the input alphabet and P is a predictor that can be derived by the decoder.
[0267] Figure 13 is a flowchart illustrating an exemplary decoding method implemented by the decoding device, which includes the following: 1301: Receive a bitstream containing encoded data of the input signal and a flag; 1302: Parse the bitstream to obtain the flag, which is used to indicate whether the entropy coding parameter is directly carried within the bitstream; 1303: Obtain the entropy coding parameter based on the flag; 1304: Reconstruct at least a portion of the input signal based on the entropy coding parameter and the encoded data.
[0268] In the above embodiment, a flag can be introduced into the bitstream to indicate a switch between the three embodiments, in which case this flag may require two bits. In another possible embodiment, the flag can be used to indicate a switch between two embodiments, in which case only one bit is required. Such a solution provides a balance between bit saving and flexibility, where in most cases only one bit is used for indication if the derived entropy parameter is appropriate, on the other hand, in some specific cases the entropy parameter may be explicitly signaled.
[0269] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0270] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data.
[0271] In one possible embodiment, when the flag is equal to a first value, it specifies that the entropy coding parameter is carried in the bitstream, or that the result of the entropy coding parameter is carried in the bitstream. Note that the result of the entropy coding parameter means the result obtained by processing the entropy coding parameter, e.g., a value. When the flag is equal to a second value, it specifies that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder. In this case, the flag is used to indicate a switch between Embodiment 1 and Embodiment 2 above, and only one bit is required.
[0272] Such a solution offers a balance between bit saving and flexibility, where in most cases only one bit is used for indication if the derived entropy parameter is appropriate, while in some specific cases the entropy parameter may be explicitly signaled.
[0273] Optionally, a flag may be used to indicate a switch between Embodiments 1, 2, and 3 described above, in which case the flag has three possible values, two bits are required, and when the flag is equal to the third value, it specifies that the difference value between M and P is carried in the bitstream, or the result of the conversion of the difference value between M and P is carried in the bitstream, where M is the entropy coding parameter and P is a predictor that can be derived by the decoder. Note that the result of the conversion of the difference value between M and P means the result of processing the difference value between M and P.
[0274] In one possible embodiment, the step of obtaining an entropy coding parameter based on a flag includes the steps of: parsing a bitstream to obtain a first parameter when the flag is equal to a first value, the first parameter being an entropy coding parameter; using the first parameter as an entropy coding parameter; or the first parameter being a conversion result of an entropy coding parameter, and obtaining an entropy coding parameter based on the first parameter.
[0275] In one possible embodiment, the result of the entropy coding parameter transformation is p = f(M), where M is the entropy coding parameter and f(M) includes one of the following: f(M) = log_k(M), where k is a natural number; or f(M) = log_k(M) - C, where k is a natural number and C is an integer; or f(M) = M + R, where R is an integer; or f(M) = sqrt(M). The step of obtaining the entropy coding parameter based on the first parameter includes M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M).
[0276] Correspondingly, M satisfies one of the following conditions: M = k^p, where k is a natural number; or M = k^(p+C), where k is a natural number and C is an integer; or M = a*p + b, where a and b are constants; or M = p^2.
[0277] In one possible embodiment, k=2.
[0278] In one possible embodiment, the first parameter is p = log2(M) - 9.
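A minimal Python sketch of the transformation p = log2(M) - 9 and its inverse M = 2^(p+9) is given below for illustration. It assumes that M is an exact power of two so that p is an integer; the function names and the offset argument are hypothetical.

```python
def alphabet_size_to_param(m, offset=9):
    """Return p = log2(M) - offset for M an exact power of two."""
    if m <= 0 or m & (m - 1):
        raise ValueError("M must be a positive power of two")
    # For a power of two, bit_length() - 1 equals log2(M) exactly.
    return (m.bit_length() - 1) - offset


def param_to_alphabet_size(p, offset=9):
    """Inverse mapping: M = 2 ** (p + offset)."""
    if p + offset < 0:
        raise ValueError("p is too small for the given offset")
    return 1 << (p + offset)
```

Under these assumptions an alphabet size of M = 512 maps to p = 0, so the smallest sizes are represented by the smallest signaled values.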
[0279] In one possible embodiment, the step of obtaining an entropy coding parameter based on a flag includes, when the flag is equal to a second value, the step of analyzing the bitstream to obtain a second parameter, the second parameter including at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weight coefficient, and the step of deriving an entropy coding parameter based on the second parameter.
[0280] In one possible embodiment, the step of deriving an entropy coding parameter based on a second parameter includes the steps of determining a target subrange in which the second parameter lies, wherein the acceptable range of values for the second parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the second parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter, the steps of using the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter, or calculating a value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
[0281] In one possible embodiment, the step of obtaining an entropy coding parameter based on a flag includes the step of parsing a bitstream to obtain a third parameter when the flag is equal to a third value, the third parameter being the difference between M and P, or the result of a transformation of the difference between M and P, where M is an entropy coding parameter and P is a predictor that can be derived by a decoder; the step of deriving P based on at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weight coefficient; and the step of obtaining an entropy coding parameter based on the third parameter and P.
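The three flag-controlled ways of obtaining the entropy coding parameter can be summarized by the following Python sketch. It is only an illustration: the flag values 0, 1 and 2, the callback names, and the default inverse functions (which correspond to the untransformed cases p = M and D = M - P) are assumptions made for the example.

```python
EXPLICIT, DERIVED, DIFFERENTIAL = 0, 1, 2  # hypothetical flag values


def resolve_entropy_param(flag, read_value, derive_value,
                          inverse_f=lambda p: p,
                          inverse_s=lambda d, pred: pred + d):
    """Return the entropy coding parameter M selected by the flag.

    read_value():   parses the next signaled value from the bitstream.
    derive_value(): derives a value on the decoder side, e.g. from a rate
                    control parameter or a quantization parameter.
    inverse_f:      maps a signaled p back to M (M = f^(-1)(p)).
    inverse_s:      maps a signaled D and predictor P back to M.
    """
    if flag == EXPLICIT:
        # First value: the parameter, or its transform, is in the bitstream.
        return inverse_f(read_value())
    if flag == DERIVED:
        # Second value: nothing is signaled; the decoder derives M.
        return derive_value()
    if flag == DIFFERENTIAL:
        # Third value: D is signaled and combined with the derived predictor P.
        d = read_value()
        return inverse_s(d, derive_value())
    raise ValueError(f"unsupported flag value: {flag}")
```

Passing the inverse functions as arguments keeps the dispatch independent of which transformation f or s a particular embodiment uses.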
[0282] In one possible embodiment, the result of the conversion of the difference between M and P is D = s(M,P), where s(M,P) is an invertible function that includes one of the following: s(M,P) = log_k(M) - log_k(P), where k is a natural number; or s(M,P) = log_k(P) - log_k(M), where k is a natural number; or s(M,P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M,P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M,P) = a*log_k(P) + b*log_k(M) - c, where a, b, and c are constants; or s(M,P) = a*M - b*P + c, where a, b, and c are constants. The step of obtaining the entropy coding parameter based on the third parameter includes M = s^(-1)(D,P), where s^(-1)(D,P) is the inverse function of s(M,P). Note that A*B here means A multiplied by B.
[0283] In one possible embodiment, M satisfies one of the following conditions, namely: [equation image not reproduced in the text]
[0284] Figure 14 is a flowchart illustrating an exemplary coding method implemented by an encoding device, which includes the following: 1401: The input signal and the first parameter are encoded into a bitstream, where the first parameter is used to obtain the entropy coding parameter.
[0285] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data, and the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0286] 1402: Send the bitstream to the decoder.
[0287] In one possible embodiment, the first parameter is the size of the alphabet.
[0288] In one possible embodiment, the first parameter is p, where p is the result of the transformation of M, and M is the entropy coding parameter.
[0289] In one possible embodiment, p = f(M), where f(M) is an invertible function.
[0290] In one possible embodiment, f(M) includes one of the following: f(M) = log_k(M), where k is a natural number; or f(M) = log_k(M) - C, where k is a natural number and C is an integer; or f(M) = a*M + b, where a and b are constants; or f(M) = sqrt(M).
[0291] In one possible embodiment, p = log2(M) - 9.
[0292] In one possible embodiment, p is signaled using one of the following codes, namely: binary code, or unary code, or truncated unary code, or exp-Golomb code.
[0293] In one possible embodiment, the first parameter includes at least one of a rate control parameter, a quantization parameter (qp), image resolution, video resolution, frame rate, pixel density in a 3D object, or a rate distortion weight coefficient, and the first parameter is used by an entropy decoder to derive an entropy coding parameter.
[0294] In one possible embodiment, the first parameter is D, which is obtained based on P and M, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0295] In one possible embodiment, D = s(M,P), where s(M,P) is an invertible function.
[0296] In one possible embodiment, s(M,P) includes one of the following: s(M,P) = log_k(M) - log_k(P), where k is a natural number; or s(M,P) = log_k(P) - log_k(M), where k is a natural number; or s(M,P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M,P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M,P) = a*M - b*P + c, where a, b, and c are constants.
[0297] In one possible embodiment, D = s(M, P) = log2(P) - log2(M).
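For illustration, the differential signaling D = log2(P) - log2(M) and its inverse M = P / 2^D can be sketched in Python as follows. The sketch assumes that both M and P are exact powers of two, so that D is an integer; the function names are hypothetical.

```python
def log2_exact(value):
    """Return log2(value) for a positive power of two, else raise."""
    if value <= 0 or value & (value - 1):
        raise ValueError("value must be a positive power of two")
    return value.bit_length() - 1


def encode_difference(m, predictor):
    """Encoder side: D = log2(P) - log2(M)."""
    return log2_exact(predictor) - log2_exact(m)


def decode_difference(d, predictor):
    """Decoder side: M = P / 2**D, computed on exponents to stay in integers."""
    exponent = log2_exact(predictor) - d
    if exponent < 0:
        raise ValueError("D is too large for the given predictor")
    return 1 << exponent
```

When the predictor is accurate, D is zero or close to zero, which is the case that short codewords are meant to cover.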
[0298] In one possible embodiment, D is signaled using one of the following codes, namely: a binary code, or a unary code, or a truncated unary code, or an exp-Golomb code.
[0299] In one possible embodiment, the encoding method further includes a step of encoding a flag into the bitstream, where the flag is used to indicate whether the entropy coding parameter is carried directly within the bitstream.
[0300] In one possible embodiment, when the flag is equal to a first value, it specifies that the entropy coding parameter is carried in the bitstream and that the first parameter is either the entropy coding parameter or the result of a transformation of the entropy coding parameter; or when the flag is equal to a second value, it specifies that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder.
[0301] In one possible embodiment, when the flag is equal to a third value, it is specified that the difference between M and P is carried in the bitstream, or the result of converting the difference between M and P is carried in the bitstream, where M is an entropy coding parameter and P is a predictor that can be derived by the decoder.
[0302] In one possible embodiment, several possible solutions are proposed for alphabet selection on the encoder side.
[0303] In one possible embodiment, before encoding the first parameter into a bitstream, the encoding method further includes determining the size of the alphabet of the entropy encoder based on at least one of the bitrate or coding value of the image data.
[0304] Figure 15 is a flowchart illustrating an exemplary method for determining the size of the alphabet in an entropy encoder, which includes the following: 1501: Obtain the minimum and maximum values of the latent space elements; 1502: Obtain the size of the input alphabet as M = ceil(max{y} - min{y}), where ceil(x) is the smallest integer greater than x, max{y} represents the maximum value of the latent space elements, min{y} represents the minimum value of the latent space elements, and M represents the size of the alphabet.
[0305] Figure 16 is a flowchart illustrating an exemplary method for determining the size of the alphabet in an entropy encoder, which includes the following: 1601: Obtain the minimum and maximum values of the latent space elements; 1602: Obtain the size of the input alphabet as M = 2^(ceil(log2(max{y} - min{y}))), where ceil(x) is the smallest integer greater than x, max{y} represents the maximum value of the latent space elements, min{y} represents the minimum value of the latent space elements, and M represents the size of the alphabet. In most entropy coders, the alphabet size should be a power of 2, in which case the alphabet size can be chosen as M = 2^(ceil(log2(max{y} - min{y}))). Note that in some cases, for example when the absolute value of every y is less than 1, an additional scaling operation can be performed before entropy coding.
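The two range-based rules of Figures 15 and 16 can be illustrated with the short Python sketch below. It assumes that ceil denotes the usual ceiling function as provided by math.ceil and that the latent space elements are available as a flat sequence of numbers; the function names are hypothetical.

```python
import math


def alphabet_size(latents):
    """Figure 15 rule: M = ceil(max{y} - min{y})."""
    return math.ceil(max(latents) - min(latents))


def alphabet_size_pow2(latents):
    """Figure 16 rule: M = 2 ** ceil(log2(max{y} - min{y}))."""
    span = max(latents) - min(latents)
    if span <= 0:
        raise ValueError("latents must span a positive range")
    # A span below 1 would give a negative exponent; this sketch does not
    # handle that case and expects the latents to be scaled beforehand.
    return 2 ** math.ceil(math.log2(span))
```

For latent values spanning a range of 10.75, the first rule gives M = 11 and the second rounds up to the next power of two, M = 16.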
[0306] Figure 17 is a flowchart illustrating an exemplary method for determining the size of the alphabet in an entropy encoder, which includes the following: 1701: Take at least two values around M_0, where M_0 = ceil(max{y} - min{y}) or M_0 = 2^(ceil(log2(max{y} - min{y}))); 1702: Calculate the loss function for at least two values; 1703: Select the value with the smallest loss function among at least two values as the size of the input alphabet. Here, ceil(x) is the smallest integer greater than x, max{y} represents the maximum value of the latent space element, and min{y} represents the minimum value of the latent space element.
[0307] For example, the loss function can be given as Loss = Beta * Distortion + Bits, where distortion is measured by PSNR, MS-SSIM, or VMAF, bits is the number of bits used, and beta is a weighting parameter that controls the ratio of bitrate to reconstruction quality, and beta is sometimes also called the rate control parameter. In this approach, clipping may occur, but the bitrate savings from using a smaller alphabet compensate for the slight increase in distortion.
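The loss-based selection of Figure 17 could be sketched in Python as shown below. This is a hedged illustration: evaluate is a hypothetical callback that encodes the signal with a given alphabet size and returns a (distortion, bits) pair, and the distortion is assumed to be expressed so that lower values are better.

```python
def select_alphabet_size(candidates, evaluate, beta):
    """Return the candidate M that minimizes Loss = beta * distortion + bits.

    candidates: at least two alphabet sizes taken around M_0.
    evaluate:   hypothetical callback, evaluate(m) -> (distortion, bits).
    beta:       weighting (rate control) parameter.
    """
    if len(candidates) < 2:
        raise ValueError("at least two candidate sizes are required")

    def loss(m):
        distortion, bits = evaluate(m)
        return beta * distortion + bits

    return min(candidates, key=loss)
```

A caller might, for example, pass the candidates M_0/2, M_0 and 2*M_0; with a small beta the bit count dominates and a smaller alphabet tends to be chosen, even though some clipping occurs.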
[0308] Figure 18 is a flowchart illustrating an exemplary encoding method implemented by an encoding device, which includes the following: 1801: Encode the input signal and flag into a bitstream, where the flag is used to indicate whether the entropy coding parameter is directly carried within the bitstream; 1802: Send the bitstream to the decoder.
[0309] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data, and the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder.
[0310] In one possible embodiment, a flag is used to indicate a switch between Embodiment 1 and Embodiment 2 described above, in which case only one bit is required. When the flag is equal to a first value, it specifies that the entropy coding parameter is carried in the bitstream, or that the result of the conversion of the entropy coding parameter is carried in the bitstream; or when the flag is equal to a second value, it specifies that the entropy coding parameter is not carried in the bitstream, but that the entropy coding parameter can be derived by the decoder.
[0311] Such a solution offers a balance between bit saving and flexibility, where in most cases only one bit is used for indication if the derived entropy parameter is appropriate, while in some specific cases the entropy parameter may be explicitly signaled.
[0312] Optionally, a flag can be used to indicate a switch between Embodiments 1, 2, and 3 described above, in which case the flag has three possible values and requires 2 bits. When the flag is equal to the third value, it specifies that the difference between M and P is carried in the bitstream, or the converted result of the difference between M and P is carried in the bitstream, where M is the entropy coding parameter and P is a predictor that can be derived by the decoder.
[0313] In one possible embodiment, when the flag is equal to a first value, the first parameter is encoded into a bitstream, where the first parameter is either an entropy coding parameter or the result of a transformation of an entropy coding parameter.
[0314] In one possible embodiment, the result of the entropy coding parameter transformation is p = f(M), where M is the entropy coding parameter, and f(M) can be defined as follows: f(M) = log_k(M), where k is a natural number; or f(M) = log_k(M) - C, where k is a natural number and C is an integer; or f(M) = a*M + R, where a and R are constants; or f(M) = sqrt(M).
[0315] In one possible embodiment, the first parameter is p = log2(M) - 9.
[0316] In one possible embodiment, p is signaled using one of the following codes, namely: binary code, or unary code, or truncated unary code, or exp-Golomb code.
[0317] In one possible embodiment, p is signaled using an exp-Golomb code of order 0.
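For reference, an order-0 exp-Golomb code for a non-negative integer p can be sketched in Python as follows. Bits are represented as a string of '0' and '1' characters purely for readability, and the function names are hypothetical; a signed value would additionally need a mapping to non-negative integers, which is not shown here.

```python
def exp_golomb0_encode(value):
    """Return the order-0 exp-Golomb codeword of a non-negative integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]          # binary form of value + 1
    return "0" * (len(binary) - 1) + binary  # zero prefix, then the binary form


def exp_golomb0_decode(bits, pos=0):
    """Decode one codeword starting at bits[pos]; return (value, next_pos)."""
    zeros = 0
    while pos + zeros < len(bits) and bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    if end > len(bits):
        raise ValueError("truncated exp-Golomb codeword")
    return int(bits[pos + zeros:end], 2) - 1, end
```

With this code the value p = 0 is written as the single bit 1, and p = 3 is written as 00100, so small values of p cost the fewest bits.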
[0318] In one possible embodiment, the method further includes the step of encoding a third parameter into a bitstream when the flag is equal to a third value, the third parameter being the difference between M and P, or the result of converting the difference between M and P.
[0319] In one possible embodiment, the result of the conversion of the difference between M and P is D = s(M,P), where s(M,P) is an invertible function that includes one of the following: s(M,P) = log_k(M) - log_k(P), where k is a natural number; or s(M,P) = log_k(P) - log_k(M), where k is a natural number; or s(M,P) = log_k(M) - log_k(P) - C, where k is a natural number and C is an integer; or s(M,P) = log_k(P) - log_k(M) - C, where k is a natural number and C is an integer; or s(M,P) = a*log_k(P) - b*log_k(M) - c, where a, b, and c are constants; or s(M,P) = a*M - b*P + c, where a, b, and c are constants. The step of obtaining the entropy coding parameter based on the third parameter includes M = s^(-1)(D,P), where s^(-1)(D,P) is the inverse function of s(M,P).
[0320] In one possible embodiment, D is signaled using one of the following codes, namely: a binary code, or a unary code, or a truncated unary code, or an exp-Golomb code.
[0321] In one possible embodiment, D is signaled using an exp-Golomb code of order 0.
[0322] Embodiments of this application provide a decoding device comprising: a receiver configured to receive a bitstream including encoded data of an input signal and a first parameter; an analysis unit configured to analyze the bitstream and obtain the first parameter; an acquisition unit configured to obtain an entropy coding parameter based on the first parameter; and a reconstruction unit configured to reconstruct at least a portion of the input signal based on the entropy coding parameter.
[0323] This device offers the advantages of the method described above.
[0324] In one possible embodiment, the input signal includes video data, image data, point cloud data, motion flow, motion vectors, or any other type of media data.
[0325] In one possible embodiment, the entropy coding parameter includes at least one of the following: the size of the alphabet of the entropy coder, where the size of the alphabet of the entropy coder is the size of the input alphabet of the entropy encoder or the size of the output alphabet of the entropy decoder; or the minimum symbol probability supported by the entropy coder; or the renormalization period of the entropy coder. In some embodiments, the renormalization period can be 8 bits, 16 bits, etc.
[0326] In one possible embodiment, the first parameter is the size of an alphabet, where the acquisition unit is further configured to use the first parameter as the size of an alphabet.
[0327] In one possible embodiment, the first parameter is p, and the entropy coding parameter includes the size of the alphabet M, where M is a function of p.
[0328] In one possible embodiment, the acquisition unit is further configured to obtain M as M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M), and f(M) = p.
[0329] In one possible embodiment, the acquisition unit is further configured to: determine a target subrange in which a first parameter lies, where the acceptable range of values for the first parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the first parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter; use the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter; or calculate the value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
[0330] Embodiments of this application provide a decoding device that includes functional units for implementing any one of the decoding methods described above.
[0331] Embodiments of this application provide an encoding device comprising: an encoding unit configured to encode an input signal and a first parameter into a bitstream, where the first parameter is used to obtain an entropy coding parameter; and a transmitting unit configured to transmit the bitstream to a decoder. The encoding device further includes other functional units for implementing any one of the above encoding methods.
[0332] Embodiments of this application provide a decoding device including a processing circuit configured to perform any one of the above decoding methods.
[0333] Embodiments of this application provide an encoding device that includes a processing circuit configured to perform any one of the above encoding methods.
[0334] Embodiments of this application provide a decoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the decoder is configured such that, when the program is executed by the one or more processors, it performs one of the above decoding methods.
[0335] Embodiments of this application provide an encoder comprising one or more processors and a non-transitory computer-readable storage medium coupled to the one or more processors, wherein the storage medium stores a program for execution by the one or more processors, and the encoder is configured such that, when the program is executed by the one or more processors, it performs one of the above encoding methods.
[0336] Embodiments of this application provide a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to execute one of the above-described encoding methods.
[0337] Embodiments of this application provide a non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to execute one of the above-described decoding methods.
[0338] Embodiments of this application provide a non-transitory storage medium containing a bitstream encoded by any one of the above encoding methods.
[0339] Embodiments of this application provide a computer program stored in a non-transitory medium, which includes code instructions and which, when executed on one or more processors, causes the steps of one of the above encoding methods to be performed.
[0340] Embodiments of this application provide a computer program stored in a non-transitory medium, which includes code instructions and which, when executed on one or more processors, causes the steps of one of the above decoding methods to be performed.
[0341] Embodiments of this application provide a system for distributing a bitstream, the system comprising: at least one storage medium configured to store at least one bitstream generated by any one of the above encoding methods; and a video streaming device configured to retrieve a bitstream from one of the at least one storage medium and transmit the bitstream to a terminal device, wherein the video streaming device includes a content server or a content distribution server.
[0342] In one possible embodiment, the system further includes one or more processors configured to perform an encryption process on at least one bitstream to obtain at least one encrypted bitstream; at least one storage medium configured to store the encrypted bitstream, or one or more processors configured to convert a bitstream of a first format to a bitstream of a second format; and at least one storage medium configured to store the bitstream of a second format.
[0343] In one possible embodiment, the system further includes a receiver configured to receive a first operation request, one or more processors configured to determine a target bitstream in at least one storage medium in response to the first operation request, and a transmitter configured to transmit the target bitstream to a terminal device.
[0344] In one possible embodiment, one or more processors are further configured to encapsulate a bitstream to obtain a transport stream in a first format, and a transmitter is further configured to transmit the transport stream in the first format to a terminal device for display or to transmit the transport stream in the first format to a storage space for storage.
[0345] In one possible embodiment, an exemplary method for storing a bitstream is provided, which includes: obtaining a bitstream according to any one of the encoding methods described above; and storing the bitstream in a storage medium.
[0346] Optionally, the method includes: performing an encryption process on the bitstream to obtain an encrypted bitstream; and storing the encrypted bitstream in a storage medium.
[0347] It should be understood that any known encryption method may be used.
[0348] Optionally, the method further includes: performing segmentation on the bitstream to obtain multiple bitstream segments; and storing the multiple bitstream segments in a storage medium.
[0349] Optionally, the method further includes making at least one backup of the bitstream and storing the at least one backup on a storage medium. It should be understood that the at least one backup of the bitstream may be stored on a storage medium different from the storage medium that stores the original bitstream.
[0350] Optionally, the method further includes: receiving multiple bitstreams generated according to any one of the encoding methods described above; individually assigning address information or identification information to the multiple bitstreams; and storing the multiple bitstreams at corresponding locations according to the address information or identification information corresponding to the multiple bitstreams.
[0351] Optionally, the method further includes: classifying the bitstream to obtain at least two bitstreams, where the at least two bitstreams include a first bitstream and a second bitstream; and storing the first bitstream in a first memory space and the second bitstream in a second memory space.
[0352] Optionally, the method further includes sending, by a video streaming device, the bitstream to a terminal device, where the video streaming device may be a content server or a content distribution server.
[0353] In one possible embodiment, an exemplary system for storing a bitstream is provided, which includes: a receiver configured to receive a bitstream generated by any one of the encoding methods described above; a processor configured to perform encryption on the bitstream to obtain an encrypted bitstream; and a computer-readable storage medium configured to store the encrypted bitstream.
[0354] Optionally, the system may include several storage media, which can be deployed in different locations. Furthermore, multiple bitstreams may be distributed and stored across different storage media. For example, some storage media may include a first storage medium configured to store a first bitstream and a second storage medium configured to store a second bitstream.
[0355] Optionally, the system includes a video streaming device, which may be a content server or a content distribution server, and is configured to retrieve a bitstream from one of the storage media and transmit the bitstream to a terminal device.
[0356] In one possible embodiment, an exemplary method for converting the format of a bitstream is provided, which includes: receiving a bitstream of a first format generated by any one of the encoding methods described above; converting the bitstream of the first format to a bitstream of a second format; and storing the bitstream of the second format in a storage medium.
[0357] Optionally, the method further includes sending the stored bitstream of the second format to a terminal device in response to an access request from the terminal device.
[0358] In one possible embodiment, an exemplary system for converting the format of a bitstream is provided, which includes: a receiver configured to receive a bitstream of a first format generated by any one of the encoding methods described above; a processor configured to convert the bitstream of the first format to a bitstream of a second format, the processor being further configured to store the bitstream of the second format in a storage medium; the storage medium, configured to store the bitstream of the second format; and a transmitter configured to send the stored bitstream of the second format to a terminal device in response to an access request from the terminal device.
[0359] In one possible embodiment, an exemplary method for processing a bitstream is provided, the method including: a step of receiving a transport stream containing a video stream and an audio stream, wherein the video stream is generated by one of the encoding methods described above; a step of demultiplexing the transport stream to separate the video stream and the audio stream; a step of decoding the video stream using a video decoder to obtain video data; and a step of decoding the audio stream using an audio decoder to obtain audio data.
[0360] Optionally, the method further includes: a step of synchronizing the audio data and the video data; and a step of outputting the synchronization result to a player for playback.
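The demultiplexing and synchronization steps can be illustrated with a minimal sketch. The packet layout (a stream label, a presentation timestamp, and a payload) and the function names are hypothetical and are not taken from this disclosure or from any real transport stream format; the decoding steps are omitted, so the payloads stand in for decoded data.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Packet(NamedTuple):
    stream: str     # hypothetical label: "video" or "audio"
    pts: int        # hypothetical presentation timestamp
    payload: bytes  # stand-in for coded (or decoded) data


def demultiplex(transport_stream: List[Packet]) -> Tuple[List[Packet], List[Packet]]:
    """Separate a mixed packet list into a video stream and an audio stream."""
    video = [p for p in transport_stream if p.stream == "video"]
    audio = [p for p in transport_stream if p.stream == "audio"]
    return video, audio


def synchronize(video: List[Packet], audio: List[Packet]) -> List[Packet]:
    """Interleave both streams in presentation-timestamp order for playback."""
    by_pts = lambda p: p.pts
    return list(heapq.merge(sorted(video, key=by_pts),
                            sorted(audio, key=by_pts),
                            key=by_pts))


ts = [Packet("video", 0, b"v0"), Packet("audio", 0, b"a0"),
      Packet("audio", 20, b"a1"), Packet("video", 40, b"v1")]
video, audio = demultiplex(ts)
playback_order = synchronize(video, audio)
```

A real player would align decoded frames and audio samples against a common clock; ordering by timestamp is only the simplest form of that idea.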
[0361] Optionally, the method further includes: a step of decoding the bitstream to obtain video data or image data; a step of performing at least one of luminance mapping, chroma mapping, resolution adjustment, or format conversion on the video data or image data; and a step of transmitting the video data or image data to a display.
[0362] In one possible embodiment, an exemplary method is provided for transmitting a bitstream based on a user operation request, the method including: a step of receiving a first operation request from an end-side device, wherein the first operation request is used to request playback of a target video; a step of determining, in response to the first operation request, a target bitstream corresponding to the target video in a storage medium, wherein the target bitstream is generated according to one of the encoding methods described above; and a step of transmitting the target bitstream to the end-side device.
[0363] Optionally, the method further includes: a step of encapsulating the bitstream to obtain a transport stream in a first format; and a step of sending the transport stream in the first format to a terminal device for display, or a step of sending the transport stream in the first format to a storage space for storage.
[0364] In one possible embodiment, an exemplary system is provided for transmitting a bitstream based on a user operation request, the system including: a storage medium configured to store a bitstream, wherein the bitstream is generated according to one of the encoding methods described above; a receiver configured to receive a first operation request; a processor configured to determine a target bitstream in the storage medium in response to the first operation request; and a transmitter configured to transmit the target bitstream to a terminal device.
[0365] Optionally, the processor is further configured to encapsulate the bitstream to obtain a transport stream in a first format, and the system further includes a transmitter configured to send the transport stream in the first format to the terminal device for display, or to send the transport stream in the first format to a storage space for storage.
[0366] In one possible embodiment, an exemplary method for downloading a bitstream is provided, the method including: a step of obtaining a bitstream from a storage medium, wherein the bitstream is generated according to one of the encoding methods described above; a step of decoding the bitstream to obtain a streaming media file; a step of splitting the streaming media file into multiple streaming media segments; and a step of downloading the multiple streaming media segments individually.
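The splitting step can be sketched as follows. Fixed-size byte segmentation and the function name are assumptions made purely for illustration; this disclosure does not state how segment boundaries are chosen, and practical streaming formats usually cut segments on time or key-frame boundaries.

```python
from typing import List


def split_into_segments(media: bytes, segment_size: int) -> List[bytes]:
    """Split a media file into consecutive segments of at most segment_size bytes."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    return [media[i:i + segment_size] for i in range(0, len(media), segment_size)]


# Each segment can then be requested and downloaded individually.
segments = split_into_segments(b"0123456789", 4)
```

Concatenating the segments in order restores the original file, which is the property a downloader relies on when fetching segments separately.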
[0367] In one possible embodiment, an exemplary system for downloading a bitstream is provided, the system including: an acquisition unit configured to acquire a bitstream from a storage medium, wherein the bitstream is generated according to one of the encoding methods described above; a decoder configured to decode the bitstream to obtain a streaming media file; and a processor configured to split the streaming media file into multiple streaming media segments and to download the multiple streaming media segments individually. However, the present invention is not limited to any of these exemplary implementations.
[0368] Arithmetic decoding may be performed in parallel, for example, by a multicore decoder. In addition, only a portion of the arithmetic decoding may be performed in parallel. The arithmetic decoding method may be implemented as range coding.
[0369] The arithmetic coding of this disclosure can be readily applied to the coding of feature maps in neural networks, or to the coding and decoding of classical pictures (still images or moving images). Neural networks may be used for any purpose, in particular for the coding and decoding of pictures (still images or moving images), or for the coding and decoding of picture-related data such as motion flow or motion vectors or other parameters. Neural networks may also be used in computer vision applications such as image classification, deep detection, segmentation map determination, and object recognition.
[0370] Entropy decoding may be performed in parallel, for example, by a multicore decoder. Alternatively, only a portion of the entropy decoding may be performed in parallel. Figure 19 shows an exemplary scheme of a parallel (e.g., multicore) encoder 620. Each of the input data channels 610 may be encoded into an individual substream containing coding bits 630-633 and trailing bits 640-643. The length of the substreams 650 is signaled. In a parallel processing implementation, the bitstream consists of several substreams, which are concatenated in the final step. Because the substreams are encoded independently of each other, the encoding (and therefore the decoding) of one substream does not require the prior encoding (or decoding) of any other substream. For the same reason, each of the substreams needs to be finalized.
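As a minimal sketch of how signaled substream lengths allow a decoder to locate every substream without decoding the others, the following example concatenates substreams behind a simple length header. The header layout (a 16-bit count followed by 32-bit lengths) and the function names are hypothetical; the actual signaling syntax of this disclosure is not reproduced here.

```python
import struct
from typing import List


def pack_substreams(substreams: List[bytes]) -> bytes:
    """Concatenate finalized substreams, preceded by their signaled lengths."""
    header = struct.pack(">H", len(substreams))
    header += b"".join(struct.pack(">I", len(s)) for s in substreams)
    return header + b"".join(substreams)


def unpack_substreams(bitstream: bytes) -> List[bytes]:
    """Recover each substream from the signaled lengths alone."""
    (count,) = struct.unpack_from(">H", bitstream, 0)
    lengths = struct.unpack_from(f">{count}I", bitstream, 2)
    offset = 2 + 4 * count
    substreams = []
    for length in lengths:
        substreams.append(bitstream[offset:offset + length])
        offset += length
    return substreams


# Stand-ins for four independently coded channels of different lengths.
channels = [b"\x01\x02\x03", b"\xaa", b"", b"\x10\x20"]
packed = pack_substreams(channels)
recovered = unpack_substreams(packed)
```

Once the boundaries are known, each recovered substream could be handed to a separate decoder core, since no substream depends on another.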
[0371] The input data channel may refer to a channel obtained by processing several data by a neural network. For example, the input data may be an output channel or a feature channel such as the latent representation channel of the neural network. In exemplary implementations, the neural network may be a deep neural network and / or a convolutional neural network, etc. The neural network may be trained to process pictures (still images or moving images). This processing may be for picture coding and reconstruction, or for computer vision such as object recognition, classification, segmentation, etc. In general, this disclosure is not limited to any particular type of task or neural network. Rather, this disclosure is applicable to coding any type of data coming from multiple channels, which should generally be understood as any data source. Furthermore, the channels may be provided by preprocessing of the source data.
[0372] Implementation within picture coding. One possible deployment is shown in Figures 20 and 21.
[0373] Figure 20 shows a schematic block diagram of an exemplary encoder 20 configured to implement the technology of the present application. In the example of Figure 20, the encoder 20 includes an input 201 (or input interface 201), a residual calculation unit 204, a transform unit 206, a quantization unit 208, an inverse quantization unit 210, an inverse transform unit 212, a reconstruction unit 214, a loop filter unit 220, a decoded picture buffer (DPB) 230, a mode selection unit 260, an entropy coding unit 270, and an output 272 (or output interface 272). The entropy coding unit 270 may implement the arithmetic coding method or apparatus described above.
[0374] The mode selection unit 260 may include an inter prediction unit 244, an intra prediction unit 254, and a partitioning unit 262. The inter prediction unit 244 may include a motion estimation unit and a motion compensation unit (not shown). The encoder 20 shown in Figure 20 is sometimes called a hybrid encoder or an encoder according to a hybrid video / image codec.
[0375] The encoder 20 may be configured to receive, for example via input 201, a picture 17 (or picture data 17, point cloud data, motion flow, or other types of media data), for example a picture from a sequence of pictures that form a video or video sequence. The received picture or picture data may also be a pre-processed picture 19 (or pre-processed picture data 19). For simplicity, the following description refers to picture 17. Picture 17 may also be called the current picture or the picture to be coded (particularly in video coding, to distinguish the current picture from other pictures, for example, previously encoded and / or decoded pictures of the same video sequence, i.e., the video sequence which also includes the current picture).
[0376] A (digital) picture is, or can be considered as, a two-dimensional array or matrix of samples having intensity values. Samples within an array are sometimes called pixels (a shortened form of picture element) or pels. The number of samples in the horizontal and vertical directions (or axes) of an array or picture defines the size and / or resolution of the picture. For color representation, typically three color components are used; that is, a picture may be represented by or contain three sample arrays. In the RGB format or color space, a picture contains corresponding red, green, and blue sample arrays. However, in video coding, each pixel is typically represented in a luminance and chrominance format or color space, e.g., YCbCr, which contains a luminance component represented by Y (sometimes L is used instead) and two chrominance components represented by Cb and Cr. The luminance (or, in short, luma) component Y represents brightness or gray level intensity (for example, as in a grayscale picture), while the two chrominance (or, in short, chroma) components Cb and Cr represent chromaticity or color information components. Therefore, a picture in YCbCr format includes a luminance sample array of luminance sample values (Y) and two chrominance sample arrays of chrominance values (Cb and Cr). A picture in RGB format may be converted to YCbCr format, and vice versa; this process is also known as color transformation or color conversion. If the picture is monochrome, it may contain only the luminance sample array. Therefore, a picture may be, for example, an array of luma samples in a monochrome format, or an array of luma samples and two corresponding chroma sample arrays in a 4:2:0, 4:2:2, or 4:4:4 color format.
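The color conversion and the chroma formats mentioned above can be illustrated with a short sketch. It assumes 8-bit samples and the full-range BT.601 (JFIF) conversion matrix, which is only one of several matrices in common use; this disclosure does not prescribe a particular one. The function names are hypothetical.

```python
from typing import Tuple


def rgb_to_ycbcr(r: int, g: int, b: int) -> Tuple[int, int, int]:
    """Convert one 8-bit RGB sample to YCbCr (assumed full-range BT.601)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b

    def clip(v: float) -> int:
        # Round, then clamp to the 8-bit sample range.
        return max(0, min(255, int(round(v))))

    return clip(y), clip(cb), clip(cr)


def chroma_plane_size(width: int, height: int, fmt: str) -> Tuple[int, int]:
    """Size of each chroma sample array for a given luma size and color format."""
    if fmt == "4:4:4":
        return width, height                        # no subsampling
    if fmt == "4:2:2":
        return (width + 1) // 2, height             # halved horizontally
    if fmt == "4:2:0":
        return (width + 1) // 2, (height + 1) // 2  # halved in both directions
    raise ValueError(f"unsupported color format: {fmt}")


white = rgb_to_ycbcr(255, 255, 255)
chroma_420 = chroma_plane_size(1920, 1080, "4:2:0")
```

For a 1920×1080 picture in 4:2:0, each chroma array holds a quarter as many samples as the luma array, which is why this format is common in video coding.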
[0377] Embodiments of the encoder 20 may include a picture partitioning unit (not shown in Figure 20) configured to divide a picture 17 into multiple (typically non-overlapping) picture blocks 203. These blocks may also be called root blocks, macroblocks (H.264 / AVC), coding tree blocks (CTBs), or coding tree units (CTUs) (H.265 / HEVC and VVC). The picture partitioning unit may use the same block size and corresponding grid defining the block size for all pictures in the video sequence, or it may change the block size between pictures or between subsets or groups of pictures to partition each picture into a corresponding block. The abbreviation AVC stands for Advanced Video Coding.
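The partitioning of a picture into non-overlapping blocks on a fixed grid can be sketched as follows. The function name and the 128-sample block size are illustrative assumptions; blocks at the right and bottom borders are simply cropped here, which is one possible way of handling pictures whose size is not a multiple of the block size.

```python
from typing import List, Tuple


def partition_into_blocks(width: int, height: int,
                          block_size: int) -> List[Tuple[int, int, int, int]]:
    """Return (x, y, block_width, block_height) for each block in raster order."""
    blocks = []
    for y in range(0, height, block_size):
        for x in range(0, width, block_size):
            blocks.append((x, y,
                           min(block_size, width - x),
                           min(block_size, height - y)))
    return blocks


blocks = partition_into_blocks(1920, 1080, 128)
```

For a 1920×1080 picture this yields 15 columns and 9 rows of blocks, with the last row only 56 samples high.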
[0378] In further embodiments, the encoder 20 may be configured to directly receive blocks 203 of picture 17, for example, one, some, or all of the blocks that make up picture 17. Picture blocks 203 may also be referred to as the current picture block or the picture block to be coded.
[0379] Similar to picture 17, picture block 203 is smaller in dimensions than picture 17 but is or can be considered as a two-dimensional array or matrix of samples having intensity values (sample values). In other words, block 203 may include, for example, one sample array (e.g., a luma array in the case of a monochrome picture 17, or a luma or chroma array in the case of a color picture), or three sample arrays (e.g., one luma and two chroma arrays in the case of a color picture 17), or any other number and / or type of arrays depending on the applied color format. The number of samples in the horizontal and vertical directions (or axes) of block 203 defines the size of block 203. Thus, the block may be, for example, an M×N (M columns × N rows) array of samples or an M×N array of transform coefficients.
[0380] An embodiment of the encoder 20 shown in Figure 20 may be configured to encode the picture 17 block by block, for example, encoding and prediction may be performed for each block 203.
[0381] Embodiments of encoder 20, as shown in Figure 20, may be further configured to partition and / or encode a picture using slices (also called video slices), where the picture may be partitioned into, or encoded using, one or more (typically non-overlapping) slices, and each slice may contain one or more blocks (e.g., CTUs).
[0382] Embodiments of encoder 20, as shown in Figure 20, may be further configured to partition and / or encode a picture using tile groups (also called video tile groups) and / or tiles (also called video tiles), wherein the picture may be partitioned into, or encoded using, one or more (typically non-overlapping) tile groups, and each tile group may contain one or more blocks (e.g., CTUs) or one or more tiles, where each tile may be, for example, rectangular in shape and may contain one or more blocks (e.g., CTUs), for example, complete or partial blocks.
[0383] Figure 21 shows an example of a decoder 30 configured to implement the technology of the present application. The decoder 30 is configured to receive encoded picture data 21 (e.g., encoded bitstream 21) encoded by, for example, the encoder 20, and to obtain a decoded picture 331. The encoded picture data or bitstream includes information for decoding the encoded picture data, such as data representing picture blocks and associated syntax elements of an encoded slice (and / or tile group or tile or subpicture).
[0384] The entropy decoding unit 304 is configured to parse the bitstream 21 (or generally the encoded picture data 21) and perform, for example, entropy decoding on the encoded picture data 21 to obtain, for example, quantized coefficients 309 and / or decoded coding parameters (not shown in Figure 21), for example any or all of inter-prediction parameters (e.g., reference picture index and motion vector), intra-prediction parameters (e.g., intra-prediction mode or index), transform parameters, quantization parameters, loop filter parameters, and / or other syntax elements. The entropy decoding unit 304 may be configured to apply a decoding algorithm or scheme corresponding to the coding scheme described with respect to the entropy coding unit 270 of the encoder 20. The entropy decoding unit 304 may be further configured to provide the inter-prediction parameters, intra-prediction parameters, and / or other syntax elements to the mode application unit 360 and other parameters to other units of the decoder 30. The decoder 30 may receive syntax elements at the video slice level and / or video block level. In addition to or as an alternative to slices and their respective syntax elements, tile groups and / or tiles and their respective syntax elements may be received and / or used. Entropy decoding may implement any of the arithmetic decoding methods or devices described above.
[0385] The reconstruction unit 314 (for example, an adder or summer 314) may be configured to add the reconstructed residual block 313 to the prediction block 365, for example by adding the sample values of the reconstructed residual block 313 to the sample values of the prediction block 365, thereby obtaining the reconstructed block 315 in the sample domain.
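The sample-wise addition performed by the reconstruction unit can be sketched as follows. Clipping the sum to the valid sample range and the default 8-bit depth are assumptions added for illustration; the paragraph above only describes the addition itself. The function name is hypothetical.

```python
from typing import List

Block = List[List[int]]


def reconstruct_block(prediction: Block, residual: Block, bit_depth: int = 8) -> Block:
    """Add residual samples to prediction samples and clip to the sample range."""
    max_val = (1 << bit_depth) - 1
    return [
        [max(0, min(max_val, p + r)) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


prediction = [[100, 250], [5, 128]]
residual = [[3, 10], [-9, 0]]
reconstructed = reconstruct_block(prediction, residual)
```

In this example two of the four sums fall outside the 8-bit range and are clamped to 255 and 0 respectively.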
[0386] Embodiments of the decoder 30 shown in Figure 21 may be configured to partition and / or decode a picture using slices (also called video slices), where the picture may be partitioned into, or decoded using, one or more (typically non-overlapping) slices, and each slice may contain one or more blocks (e.g., CTUs).
[0387] Embodiments of the decoder 30 shown in Figure 21 may be configured to partition and / or decode a picture using tile groups (also called video tile groups) and / or tiles (also called video tiles), wherein the picture may be partitioned into, or decoded using, one or more (typically non-overlapping) tile groups, and each tile group may include, for example, one or more blocks (e.g., CTUs) or one or more tiles, where each tile may be, for example, rectangular in shape and may include one or more blocks (e.g., CTUs), for example, complete or partial blocks.
[0388] The encoded picture data 21 can be decoded using other variations of the decoder 30. For example, the decoder 30 can generate an output video stream without the loop filter unit 320. For example, a non-transform-based decoder 30 can directly dequantize the residual signal for a particular block or frame without using the inverse transform processing unit 312. In another implementation, the decoder 30 may have the inverse quantization unit 310 and the inverse transform processing unit 312 combined into a single unit.
[0389] In the encoder 20 and decoder 30, the processing result of the current step may be further processed and output to the next step. For example, after interpolation filtering, motion vector derivation, or loop filtering, further operations such as clipping or shifting may be performed on the processing result of interpolation filtering, motion vector derivation, or loop filtering.
[0390] Implementation in hardware and software. Several further implementations in hardware and software are described below.
[0391] Referring to Figures 22 to 25, any of the above-described encoding devices may provide means for performing the above-described encoding and decoding methods. In particular, the processing circuit in any of these exemplary devices is configured to perform the above-described encoding and decoding methods.
[0392] In the following, embodiments of the coding system 10, the encoder 20, and the decoder 30 are described based on Figures 22 and 23, with reference to Figures 20 and 21 described above or to other encoders and decoders, such as neural network-based encoders and decoders.
[0393] Figure 22 is a schematic block diagram illustrating an exemplary coding system 10, such as a video coding system 10 or a picture coding system 10, that may utilize the technology of the present application. The encoder 20 and decoder 30 of the coding system 10 represent examples of devices that may be configured to perform the technology described in the various examples of the present application.
[0394] As shown in Figure 22, the coding system 10 includes a source device 12 configured to provide encoded picture data 21, for example, to a destination device 14 for decoding the encoded picture data.
[0395] The source device 12 includes an encoder 20 and, optionally, a picture source 16, a preprocessor (or preprocessing unit) 18, such as an image preprocessor 18, and a communication interface or communication unit 22. The source device 12 can be a cloud server, a content server, or a content distribution server.
[0396] The picture source 16 includes, or may include, any type of picture capture device, e.g., a camera for capturing real-world pictures, and / or any type of picture generation device, e.g., a computer graphics processor for generating computer-animated pictures, or any other type of device for acquiring and / or providing real-world pictures, computer-generated pictures (e.g., screen content or virtual reality (VR) pictures), and / or any combination thereof (e.g., augmented reality (AR) pictures). The picture source may also include any type of memory or storage for storing any of the aforementioned pictures.
[0397] To distinguish it from the result of the processing performed by the preprocessor 18 (or preprocessing unit 18), the picture or picture data 17 may also be called the raw picture or raw picture data 17.
[0398] The preprocessor 18 is configured to receive (raw) picture data 17, perform preprocessing on the picture data 17, and obtain a preprocessed picture 19 or preprocessed picture data 19. The preprocessing performed by the preprocessor 18 may include, for example, cropping, color format conversion (e.g., RGB to YCbCr), color correction, or noise reduction. It can be understood that the preprocessing unit 18 may be an optional component.
[0399] The encoder 20 is configured to receive pre-processed picture data 19 and provide encoded picture data 21 (further details are described above, for example, based on Figure 20).
[0400] The communication interface 22 of the source device 12 may be configured to receive encoded picture data 21 and transmit the encoded picture data 21 (or any further processed version thereof) via the communication channel 13 to another device, such as the destination device 14 or any other device, for storage or direct reconstruction.
[0401] The destination device 14 includes a decoder 30 and, optionally, a communication interface or communication unit 28, a post-processor 32 (or post-processing unit 32), and a display device 34.
[0402] The communication interface 28 of the destination device 14 is configured to receive encoded picture data 21 (or any further processed version thereof) from, for example, the source device 12 directly, or from any other source, such as a storage device, such as an encoded picture data storage device, and to provide the encoded picture data 21 to the decoder 30.
[0403] Communication interfaces 22 and 28 may be configured to transmit or receive encoded picture data 21 or encoded data 13 via a direct communication link between the source device 12 and the destination device 14, for example, via a direct wired or wireless connection, or via any type of network, for example, a wired or wireless network or any combination thereof, or any type of private and public network or any combination thereof.
[0404] The communication interface 22 may be configured to package the encoded picture data 21 into a suitable format, such as a packet, and / or to process the encoded picture data using any kind of transmission encoding or processing for transmission over a communication link or communication network or transmission medium. The communication interface 22 may be configured to encapsulate the encoded picture data to obtain a transport stream in a first format, and to transmit the transport stream to a terminal device for display, or to transmit the transport stream in a first format to a storage area for storage.
[0405] The communication interface 28 may form a counterpart to the communication interface 22 and be configured, for example, to receive transmitted data and process the transmitted data using any kind of corresponding transmit decoding or processing and / or depackaging to obtain encoded picture data 21.
[0406] Communication interfaces 22 and 28 may both be configured as unidirectional communication interfaces, as indicated by the arrow for the communication channel 13 in Figure 22 pointing from the source device 12 to the destination device 14, or as bidirectional communication interfaces, and may be configured, for example, to send and receive messages, for example to set up a connection, and to acknowledge and exchange any other information relating to the communication link and / or data transmission, for example the transmission of encoded picture data.
[0407] The decoder 30 is configured to receive encoded picture data 21 and provide decoded picture data 31 or decoded picture 31 (further details are described above, for example, based on Figure 21).
[0408] The post-processor 32 of the destination device 14 is configured to post-process the decoded picture data 31 (also called reconstructed picture data), for example, the decoded picture 31, to obtain post-processed picture data 33, for example, the post-processed picture 33. The post-processing performed by the post-processing unit 32 may include, for example, color format conversion (e.g., YCbCr to RGB), color correction, cropping or resampling, or any other processing to prepare the decoded picture data 31 for display by, for example, the display device 34.
[0409] The display device 34 of the destination device 14 is configured to receive the post-processed picture data 33 in order to display the picture, for example, to a user or viewer. The display device 34 may be or include any type of display for displaying the reconstructed picture, such as an integrated or external display or monitor. The display may include, for example, a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, a plasma display, a projector, a microLED display, a liquid crystal on silicon (LCoS) display, a digital light processor (DLP), or any other type of display.
[0410] Although Figure 22 shows the source device 12 and the destination device 14 as separate devices, embodiments of devices may also include both devices or both functionalities, that is, the source device 12 or its corresponding functionality and the destination device 14 or its corresponding functionality. In such embodiments, the source device 12 or its corresponding functionality and the destination device 14 or its corresponding functionality may be implemented using the same hardware and / or software, by separate hardware and / or software, or by any combination thereof.
[0411] As will become apparent to those skilled in the art based on the description, the functionality of different units, or the presence and (exact) division of functionality within the source device 12 and / or destination device 14 as shown in Figure 22, may vary depending on the actual device and application.
[0412] The encoder 20 or the decoder 30, or both the encoder 20 and the decoder 30, may be implemented via processing circuits such as one or more microprocessors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), discrete logic, hardware, dedicated video coding hardware, or any combination thereof, as shown in Figure 23. The encoder 20 may be implemented via processing circuit 46 to embody the various modules described with respect to the encoder 20 in Figure 20 and / or any other encoder system or subsystem described herein. The decoder 30 may be implemented via processing circuit 46 to embody the various modules described with respect to the decoder 30 in Figure 21 and / or any other decoder system or subsystem described herein. The processing circuits may be configured to perform various operations, as will be described later. If the technology is partially implemented in software, as shown in Figure 25, the device may store instructions for the software in a suitable non-transitory computer-readable storage medium and execute the instructions in hardware using one or more processors to perform the technology of this disclosure. Either the encoder 20 or the decoder 30 may be integrated as part of a combined encoder / decoder (CODEC) within a single device, for example, as shown in Figure 23.
[0413] The source device 12 and destination device 14 may include any of a wide range of devices, including any type of handheld or fixed device such as a notebook or laptop computer, mobile phone, smartphone, tablet or tablet computer, camera, desktop computer, set-top box, television, display device, digital media player, video game console, video streaming device (such as a content service server or content distribution server), broadcast receiver device, broadcast transmitter device, etc., and may use no operating system or any kind of operating system. In some cases, the source device 12 and destination device 14 may be equipped for wireless communication. Therefore, the source device 12 and destination device 14 may be wireless communication devices.
[0414] In some cases, the video coding system 10 shown in Figure 22 is merely an example, and the technology of this application may be applied to coding configurations (e.g., video / image encoding or video / image decoding) that do not necessarily involve data communication between an encoding device and a decoding device. In other examples, data is retrieved from local memory, streamed over a network, etc. The encoding device may encode the data and store it in memory, and / or the decoding device may retrieve the data from memory and decode it. In some examples, encoding and decoding are performed by devices that do not communicate with each other, but simply encode data to memory and / or retrieve data from memory and decode it.
[0415] For convenience of explanation, embodiments of the present invention are described herein by reference to, for example, High-Efficiency Video Coding (HEVC) or the reference software of Versatile Video Coding (VVC), the next-generation video coding standard developed by the Joint Collaboration Team on Video Coding (JCT-VC) of the ITU-T Video Coding Experts Group (VCEG) and the ISO / IEC Motion Picture Experts Group (MPEG). Those skilled in the art will understand that embodiments of the present invention are not limited to HEVC or VVC.
[0416] Figure 24 is a schematic diagram of a coding device (video coding device or image coding device) 400 according to an embodiment of the present invention. The coding device 400 is suitable for implementing the disclosed embodiments described herein. In one embodiment, the coding device 400 may be a decoder such as the decoder 30 in Figure 22 or an encoder such as the encoder 20 in Figure 22.
[0417] The coding device 400 includes an ingress port 410 (or input port 410) and a receiver unit (Rx) 420 for receiving data; a processor, logic unit, or central processing unit (CPU) 430 for processing data; a transmitter unit (Tx) 440 and an egress port 450 (or output port 450) for transmitting data; and memory 460 for storing data. The coding device 400 may also include optical-to-electrical (OE) components and electrical-to-optical (EO) components coupled to the ingress port 410, receiver unit 420, transmitter unit 440, and egress port 450 for input or output of optical or electrical signals.
[0418] The processor 430 is implemented by hardware and software. The processor 430 may be implemented as one or more CPU chips, cores (e.g., as a multi-core processor), FPGAs, ASICs, and DSPs. The processor 430 communicates with the ingress port 410, the receiver unit 420, the transmitter unit 440, the egress port 450, and the memory 460. The processor 430 includes a coding module 470. The coding module 470 implements the embodiments disclosed above. For example, the coding module 470 implements, processes, prepares, or provides various coding operations. Thus, the inclusion of the coding module 470 provides a substantial improvement to the functionality of the video coding device 400 and effects a transformation of the video coding device 400 to a different state. Alternatively, the coding module 470 is implemented as instructions stored in the memory 460 and executed by the processor 430.
[0419] Memory 460 may include one or more disks, tape drives, and solid-state drives, and may be used as an overflow data storage device to store programs when such programs are selected for execution, and to store instructions and data that are read during program execution. Memory 460 may be, for example, volatile and / or non-volatile, and may be read-only memory (ROM), random access memory (RAM), ternary content-addressable memory (TCAM), and / or static random access memory (SRAM).
[0420] Figure 25 is a simplified block diagram of a device 500 that can be used as either or both of the source device 12 and destination device 14 from Figure 22, according to an exemplary embodiment.
[0421] The processor 502 within the device 500 may be a central processing unit. Alternatively, the processor 502 may be any other type of device, or multiple devices, currently existing or developed in the future, that are capable of manipulating or processing information. Although the disclosed implementation can be carried out with a single processor as illustrated, such as the processor 502, advantages in speed and efficiency may be achieved by using two or more processors.
[0422] The memory 504 within the device 500 may, in one implementation, be a read-only memory (ROM) device or a random access memory (RAM) device. Any other suitable type of storage device may be used as memory 504. Memory 504 may contain code and data 506 accessed by the processor 502 using the bus 512. Memory 504 may further include an operating system 508 and an application program 510, the application program 510 including at least one program that enables the processor 502 to perform the methods described herein. For example, the application program 510 may include applications 1 through N, which further include video coding applications that perform the methods described herein, including encoding and decoding using the arithmetic coding described above.
[0423] The device 500 may also include one or more output devices, such as a display 518. In one example, the display 518 may be a touch-sensitive display, where the display is combined with a touch-sensing element that can operate to sense touch input. The display 518 may be coupled to the processor 502 via the bus 512.
[0424] Although depicted here as a single bus, the bus 512 of device 500 can be composed of multiple buses. Furthermore, the secondary storage 514 can be directly coupled to other components of device 500 or accessed via a network, and may include a single integrated unit such as a memory card, or multiple units such as multiple memory cards. Thus, device 500 can be implemented in a wide variety of configurations.
[0425] It should be noted that embodiments of the coding system 10, encoder 20 and decoder 30 (and corresponding system 10), as well as other embodiments described herein, may also be configured for still image processing or coding, i.e., for processing or coding an individual picture independently of any preceding or consecutive picture, as in video coding. Generally, when picture processing or coding is limited to a single picture 17, only the inter-prediction units 244 (encoder) and 344 (decoder) are unavailable. All other functionalities (also called tools or techniques) of encoder 20 and decoder 30 may be used equally for still image processing, e.g., residual calculation 204 / 304, transformation 206, quantization 208, inverse quantization 210 / 310, (inverse) transformation 212 / 312, segmentation 262 / 362, intra-prediction 254 / 354, and / or loop filtering 220, 320, as well as entropy coding 270 and entropy decoding 304.
[0426] For example, embodiments of the encoder 20 and decoder 30, and functions described herein in relation to the encoder 20 and decoder 30, may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on a computer-readable medium or transmitted as one or more instructions or code over a communication medium and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium corresponding to a tangible medium such as a data storage medium, or a communication medium including any medium that facilitates the transfer of computer programs from one location to another, for example, according to a communication protocol. Thus, the computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. The data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code and / or data structures for implementation of the technology described herein. A computer program product may include a computer-readable medium.
[0427] Figure 27 is a block diagram showing a content supply system 3100 for realizing a content distribution service. This content supply system 3100 includes a capture device 3102 and a terminal device 3106, and optionally includes a display 3126. The capture device 3102 communicates with the terminal device 3106 via a communication link 3104. The communication link may include the communication channel 13 described above. The communication link 3104 includes, but is not limited to, Wi-Fi, Ethernet, cable, wireless (3G / 4G / 5G), USB, or any combination thereof.
[0428] The capture device 3102 may generate data and encode the data using the encoding method shown in the above embodiment. Alternatively, the capture device 3102 may distribute the data to a streaming server (not shown), which encodes the data and transmits the encoded data to the terminal device 3106. The capture device 3102 includes, but is not limited to, a camera, a smartphone or tablet, a computer or laptop, a video conferencing system, a PDA, an in-vehicle device, or any combination thereof. For example, the capture device 3102 may include the source device 12 as described above. When the data includes video, the video encoder 20 included in the capture device 3102 may actually perform the video encoding process. When the data includes audio (i.e., voice), the audio encoder included in the capture device 3102 may actually perform the audio encoding process. In some practical scenarios, the capture device 3102 distributes the encoded video data and audio data by multiplexing them together. In other practical scenarios, for example in a video conferencing system, the encoded audio data and encoded video data are not multiplexed. The capture device 3102 distributes encoded audio data and encoded video data separately to the terminal device 3106.
[0429] In the content supply system 3100, the terminal device 3106 receives and plays back encoded data. The terminal device 3106 can be a data receiving and retrieval device such as a smartphone or tablet 3108, a computer or laptop 3110, a network video recorder (NVR) / digital video recorder (DVR) 3112, a TV 3114, a set-top box (STB) 3116, a video conferencing system 3118, a video surveillance system 3120, a personal digital assistant (PDA) 3122, an in-vehicle device 3124, or any combination thereof, or similar devices capable of decoding the encoded data described above. For example, the terminal device 3106 may include the destination device 14 as described above. When the encoded data includes video, the video decoder 30 included in the terminal device takes priority in performing video decoding. When the encoded data includes audio, the audio decoder included in the terminal device takes priority in performing audio decoding.
[0430] In the case of terminal devices having a display, such as a smartphone or tablet 3108, a computer or laptop 3110, a network video recorder (NVR) / digital video recorder (DVR) 3112, a TV 3114, a personal digital assistant (PDA) 3122, or an in-vehicle device 3124, the terminal device can supply the decoded data to its display. In the case of terminal devices without a display, such as an STB 3116, a video conferencing system 3118, or a video surveillance system 3120, an external display 3126 is connected thereto to receive and display the decoded data.
[0431] When each device in this system performs encoding or decoding, it can use a picture encoding device or a picture decoding device as shown in the embodiments described above.
[0432] Figure 27 shows the structure of an example terminal device 3106. After the terminal device 3106 receives a stream from the capture device 3102, the protocol processing unit 3202 analyzes the transmission protocol of the stream. The protocol includes, but is not limited to, Real-Time Streaming Protocol (RTSP), Hypertext Transfer Protocol (HTTP), HTTP Live Streaming Protocol (HLS), MPEG-DASH, Real-Time Transport Protocol (RTP), Real-Time Messaging Protocol (RTMP), or any combination thereof.
[0433] After the protocol processing unit 3202 processes the stream, a stream file is generated. The file is output to the demultiplexing unit 3204. The demultiplexing unit 3204 can separate the multiplexed data into encoded audio data and encoded video data. As described above, in some real-world scenarios, such as in a video conferencing system, the encoded audio data and encoded video data are not multiplexed. In this situation, the encoded data is transmitted to the video decoder 3206 and audio decoder 3208 without going through the demultiplexing unit 3204.
[0434] Through this demultiplexing process, a video elementary stream (ES), an audio ES, and optionally subtitles are generated. The video decoder 3206, which includes the video decoder 30 described in the above embodiment, decodes the video ES using the decoding method shown in the above embodiment to generate video frames and supplies this data to the synchronization unit 3212. The audio decoder 3208 decodes the audio ES to generate audio frames and supplies this data to the synchronization unit 3212. Alternatively, the video frames may be stored in a buffer (not shown in Figure 27) before being supplied to the synchronization unit 3212. Similarly, the audio frames may be stored in a buffer (not shown in Figure 27) before being supplied to the synchronization unit 3212.
[0435] The synchronization unit 3212 synchronizes video frames and audio frames and supplies video / audio to the video / audio display 3214. For example, the synchronization unit 3212 synchronizes the presentation of video and audio information. The information may be coded in the syntax using timestamps for the presentation of coded audio and visual data and timestamps for the delivery of the data stream itself.
[0436] If subtitles are included in the stream, the subtitle decoder 3210 decodes the subtitles, synchronizes them with the video and audio frames, and supplies the video / audio / subtitles to the video / audio / subtitle display 3216.
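As a non-limiting illustration of timestamp-based synchronization, the following Python sketch merges decoded video and audio frames into a single presentation order. The tuple layout `(timestamp_ms, kind)` and the function name `presentation_order` are assumptions made for this example only; the disclosure does not specify how the synchronization unit 3212 is implemented.

```python
import heapq


def presentation_order(video_frames, audio_frames):
    """Merge two timestamp-sorted frame lists into one presentation-ordered list.

    Each frame is a hypothetical (timestamp_ms, kind) tuple; both inputs are
    assumed to be already sorted by timestamp, as a decoder would emit them.
    """
    return list(heapq.merge(video_frames, audio_frames, key=lambda frame: frame[0]))


# Example input: 25 fps video (40 ms spacing) and audio frames about 21 ms apart.
video = [(0, "video"), (40, "video"), (80, "video")]
audio = [(10, "audio"), (31, "audio"), (52, "audio")]
timeline = presentation_order(video, audio)
```

A subtitle stream could be passed as a third sorted list in the same way, since `heapq.merge` accepts any number of sorted inputs.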
[0437] The present invention is not limited to the above system, and either the picture encoding device or the picture decoding device in the above embodiment can be incorporated into other systems, such as in-vehicle systems.
[0438] Such computer-readable storage media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other media that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also appropriately called a computer-readable medium. For example, if instructions are transmitted from a website, server or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of a medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but rather are directed toward non-transitory tangible storage media. As used herein, "disk" and "disc" include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), floppy disks, and Blu-ray discs, where a disk typically reproduces data magnetically, while a disc reproduces data optically using a laser. Any combination of the above should also be included within the scope of computer-readable media.
[0439] Instructions may be executed by one or more processors, such as digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Therefore, as used herein, the term “processor” may refer to any of the aforementioned structures or any other structure suitable for implementing the technology described herein. In addition, in some embodiments, the functionality described herein may be provided within dedicated hardware and / or software modules configured for encoding and decoding, or incorporated into a combined codec. Furthermore, the technology can be fully implemented in one or more circuits or logic elements.
[0440] The technology disclosed herein may be implemented in a wide variety of devices or apparatus, including wireless handsets, cloud servers, application servers, integrated circuits (ICs) or sets of ICs (e.g., chipsets). While various components, modules, or units are described in this disclosure to highlight the functional aspects of devices configured to perform the disclosed technology, implementation by different hardware units is not necessarily required. Rather, as described above, various units may be combined within a codec hardware unit, or a set of interoperable hardware units including one or more processors as described above may be provided together with appropriate software and / or firmware.
Claims
1. A decoding method implemented by a decoder, comprising: receiving a bitstream containing encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data, wherein the first parameter is p, and the entropy coding parameter includes the size M of the alphabet of the entropy coder, where M is a function of p.
2. The decoding method according to claim 1, wherein obtaining the entropy coding parameter based on the first parameter includes M = f^(-1)(p), where f^(-1)(p) is the inverse function of f(M), and f(M) = p.
3. The decoding method according to claim 2, wherein M satisfies one of the following: M = k^(a*p + C), where k is a natural number and a and C are constants; or M = a*p + b, where a and b are constants; or M = p^2.
4. The decoding method according to claim 3, wherein p = log2(M) - 9 and M = f^(-1)(p) = 2^(p + 9), where f^(-1)(p) is the inverse function of f(M), and f(M) = log2(M) - 9.
5. The decoding method according to any one of claims 1 to 4, wherein p is signaled using one of the following codes: a binary code, a unary code, a truncated unary code, or an exp-Golomb code.
6. The decoding method according to claim 5, wherein p is signaled using an exp-Golomb code of order 0.
7. The decoding method according to any one of claims 1 to 4, wherein the entropy coder is an arithmetic coder, a range coder, or an asymmetric numeral systems (ANS) coder.
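As a non-limiting illustration of claims 4 and 6, the following Python sketch maps an alphabet size M to p = log2(M) - 9, writes p as an order-0 exp-Golomb codeword, and recovers M = 2^(p + 9) on the decoder side. The function names and the use of a '0'/'1' character string as the bitstream are assumptions made for this example; the sketch also assumes that M is a power of two not smaller than 512, so that p is a non-negative integer.

```python
def alphabet_size_to_p(M):
    """Encoder side: p = f(M) = log2(M) - 9 (assumes M is a power of two, M >= 512)."""
    if M < 512 or M & (M - 1):
        raise ValueError("this sketch assumes M is a power of two >= 512")
    return (M.bit_length() - 1) - 9


def p_to_alphabet_size(p):
    """Decoder side: M = f^(-1)(p) = 2^(p + 9)."""
    return 1 << (p + 9)


def exp_golomb0_encode(value):
    """Order-0 exp-Golomb codeword of a non-negative integer, as a '0'/'1' string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]          # binary representation of value + 1
    return "0" * (len(binary) - 1) + binary  # prefix of len - 1 zeros


def exp_golomb0_decode(bits, pos=0):
    """Decode one order-0 exp-Golomb codeword at pos; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":      # count the leading zeros
        zeros += 1
    end = pos + 2 * zeros + 1            # codeword length is 2 * zeros + 1 bits
    return int(bits[pos + zeros:end], 2) - 1, end


# Round trip for a hypothetical alphabet size of 2048 symbols.
M = 2048
codeword = exp_golomb0_encode(alphabet_size_to_p(M))
p, _ = exp_golomb0_decode(codeword)
recovered_M = p_to_alphabet_size(p)
```

Because p grows only logarithmically with M, small codewords suffice: in this sketch an alphabet of 512 symbols is signaled with a single bit.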
8. A decoding method implemented by a decoder, comprising: receiving a bitstream containing encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data, wherein obtaining the entropy coding parameter based on the first parameter comprises: determining a target subrange in which the first parameter is located, wherein the allowed range of values of the first parameter includes a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges includes at least one value of the first parameter, and each of the plurality of subranges corresponds to one value of the entropy coding parameter; and using the value of the entropy coding parameter corresponding to the target subrange as the value of the entropy coding parameter, or calculating the value of the entropy coding parameter based on one or more values of the entropy coding parameter corresponding to one or more subranges adjacent to the target subrange.
9. The decoding method according to claim 8, wherein the first parameter includes at least one of the following: a rate control parameter, a quantization parameter (qp), an image resolution, a video resolution, a frame rate, a pixel density in a 3D object, or a rate-distortion weighting coefficient.
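As a non-limiting illustration of the first alternative of claim 8, the following Python sketch looks up the alphabet size from the subrange in which a quantization parameter falls. The boundary values, the alphabet sizes, and the names `BOUNDARIES`, `ALPHABET_SIZES`, and `alphabet_size_for` are hypothetical and chosen only for this example; the second alternative (calculating the value from adjacent subranges) is not shown.

```python
import bisect

# Hypothetical partition of the allowed QP range [0, 51] into four subranges.
# Subrange i covers BOUNDARIES[i] <= qp < BOUNDARIES[i + 1].
BOUNDARIES = [0, 13, 26, 39, 52]
# Hypothetical alphabet size associated with each subrange
# (a lower QP means a higher bitrate, hence a larger alphabet in this example).
ALPHABET_SIZES = [4096, 2048, 1024, 512]


def alphabet_size_for(qp):
    """Return the alphabet size of the target subrange that contains qp."""
    if not BOUNDARIES[0] <= qp < BOUNDARIES[-1]:
        raise ValueError("qp is outside the allowed range")
    target = bisect.bisect_right(BOUNDARIES, qp) - 1  # index of the target subrange
    return ALPHABET_SIZES[target]
```

Since the table is known to both encoder and decoder in this sketch, only the first parameter needs to be carried in the bitstream.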
10. A decoding method implemented by a decoder, comprising: receiving a bitstream containing encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data, wherein the first parameter is D, and the entropy coding parameter includes the size M of the alphabet of the entropy coder, where M is obtained based on P and D, and P is a predictor derived by the decoder.
11. The decoding method according to claim 10, wherein obtaining the entropy coding parameter based on the first parameter includes M = s^(-1)(D, P), where s^(-1)(D, P) is the inverse function of s(M, P), and s(M, P) = D.
12. The decoding method according to claim 11, wherein M satisfies one of the following: [Math 1], where a, b, and C are predetermined constants; or M = a1*D + b1*P + c1, where a1, b1, and c1 are predetermined constants.
13. [Math 2] The decoding method according to claim 12.
14. The decoding method according to any one of claims 10 to 13, wherein D is signaled using one of the following codes: a binary code, a unary code, a truncated unary code, or an exp-Golomb code.
15. The decoding method according to any one of claims 10 to 13, wherein P is derived based on at least one parameter, other than the first parameter, that is carried in the bitstream.
16. The decoding method according to claim 15, wherein the at least one parameter other than the first parameter includes at least one of the following: a rate control parameter, a quantization parameter (qp), an image resolution, a video resolution, a frame rate, a pixel density in a 3D object, or a rate-distortion weighting coefficient.
17. The decoding method according to claim 16, wherein P is derived based on the at least one parameter by: obtaining a rate control parameter beta (β) from the bitstream; determining a target subrange in which the obtained β is located, wherein the allowed range of values of the rate control parameter β is [β_0, β_K], the allowed range [β_0, β_K] is divided into a plurality of subranges, the target subrange is one of the plurality of subranges, each of the plurality of subranges contains at least one value of β, and each of the plurality of subranges corresponds to one value of P; and selecting the value corresponding to the target subrange as the value of P, or calculating the value of P based on one or more values corresponding to one or more subranges adjacent to the target subrange.
18. The decoding method according to any one of claims 10 to 13, wherein the entropy coder is an arithmetic coder, a range coder, or an asymmetric numeral systems (ANS) coder.
19. A decoding method implemented by a decoder, comprising: receiving a bitstream containing encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data, wherein the decoding method further comprises parsing the bitstream to obtain a flag, the flag indicating whether the entropy coding parameter is carried directly in the bitstream, and wherein, when the flag is equal to a third value, the difference between M and P, or a result of converting the difference between M and P, is carried in the bitstream, and the first parameter is the difference between M and P, or the result of converting the difference between M and P, where M is the size of the input alphabet and P is a predictor derived by the decoder.
20. The decoding method according to claim 19, wherein, when the flag is equal to a first value, the entropy coding parameter is carried in the bitstream, and the first parameter is either the entropy coding parameter or a result of converting the entropy coding parameter; or, when the flag is equal to a second value, the entropy coding parameter is not carried in the bitstream and is derived by the decoder.
21. A decoding method implemented by a decoder, comprising: receiving a bitstream containing encoded data of an input signal and a first parameter; parsing the bitstream to obtain the first parameter; obtaining an entropy coding parameter based on the first parameter; and reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data, wherein reconstructing at least a portion of the input signal based on the entropy coding parameter and the encoded data comprises: obtaining at least one probability model, wherein the probability model of an output symbol indicates the probability of each possible value of the output symbol; obtaining one or more output symbols by entropy decoding one or more bits of the encoded data using the at least one probability model and the entropy coding parameter; and reconstructing at least a portion of the input signal based on the one or more output symbols.
22. The decoding method according to claim 21, wherein the probability model depends on the entropy coding parameter.
23. An encoding method implemented by an encoder, comprising: encoding an input signal and a first parameter into a bitstream, wherein the first parameter is used to obtain an entropy coding parameter; and transmitting the bitstream to a decoder, wherein the first parameter is p, p is a result of a transformation of M, and M is the entropy coding parameter.
24. The encoding method according to claim 23, wherein p = f(M), where f(M) is an invertible function.
25. The encoding method according to claim 24, wherein f(M) is one of the following: f(M) = a * log_k(M) + b, where k is a natural number and a and b are constants; or f(M) = a*M + b, where a and b are constants; or f(M) = sqrt(M).
26. The encoding method according to claim 25, wherein p = log2(M) - 9.
27. The encoding method according to any one of claims 23 to 26, wherein p is signaled using one of the following codes: a binary code, a unary code, a truncated unary code, or an exp-Golomb code.
28. An encoding method implemented by an encoder, comprising: encoding an input signal and a first parameter into a bitstream, wherein the first parameter is used to obtain an entropy coding parameter; and transmitting the bitstream to a decoder, wherein the first parameter is D, D is obtained based on P and M, M is the entropy coding parameter, and P is a predictor derived by the decoder.
29. The encoding method according to claim 28, wherein D = s(M, P), where s(M, P) is an invertible function.
30. The encoding method according to claim 29, wherein s(M, P) is one of the following: s(M, P) = a * log_k(P) + b * log_k(M) - c, where k is a natural number and a, b, and c are constants; or s(M, P) = a*M + b*P + c, where a, b, and c are constants.
31. The encoding method according to claim 30, wherein D = s(M, P) = log2(P) - log2(M).
32. The encoding method according to any one of claims 28 to 31, wherein D is signaled using one of the following codes: a binary code, a unary code, a truncated unary code, or an exp-Golomb code.
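As a non-limiting illustration of claim 31, the following Python sketch computes D = log2(P) - log2(M) on the encoder side and inverts it on the decoder side as M = 2^(log2(P) - D). The function names are assumptions made for this example, and the sketch assumes that both M and P are powers of two so that D is an integer. D can be negative when M exceeds P, so an actual bitstream would need a signed mapping before a non-negative code such as exp-Golomb is applied; that mapping is not shown.

```python
def _log2_exact(n):
    """Integer log2 of n; this sketch only accepts powers of two."""
    if n <= 0 or n & (n - 1):
        raise ValueError("this sketch assumes a positive power of two")
    return n.bit_length() - 1


def difference_for(M, P):
    """Encoder side: D = s(M, P) = log2(P) - log2(M)."""
    return _log2_exact(P) - _log2_exact(M)


def alphabet_size_from(D, P):
    """Decoder side: invert s for a given predictor P, M = 2^(log2(P) - D)."""
    exponent = _log2_exact(P) - D
    if exponent < 0:
        raise ValueError("D is inconsistent with the predictor P")
    return 1 << exponent
```

When the predictor is accurate, D stays close to zero, which is what makes signaling the difference cheaper than signaling M itself.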
33. An encoding method implemented by an encoder, comprising: encoding an input signal and a first parameter into a bitstream, wherein the first parameter is used to obtain an entropy coding parameter; and transmitting the bitstream to a decoder, wherein the encoding method further comprises encoding a flag into the bitstream, the flag indicating whether the entropy coding parameter is carried directly in the bitstream, and wherein, when the flag is equal to a third value, the difference between M and P, or a result of converting the difference between M and P, is carried in the bitstream, where M is the entropy coding parameter and P is a predictor derived by the decoder.
34. The encoding method according to claim 33, wherein, when the flag is equal to a first value, the entropy coding parameter is carried in the bitstream, and the first parameter is either the entropy coding parameter or a result of converting the entropy coding parameter; and, when the flag is equal to a second value, the entropy coding parameter is not carried in the bitstream and is derived by the decoder.
35. An encoding method implemented by an encoder, comprising: encoding an input signal and a first parameter into a bitstream, wherein the first parameter is used to obtain an entropy coding parameter; and transmitting the bitstream to a decoder, wherein the encoding method further comprises: obtaining the minimum and maximum values of the latent space elements of an entropy encoder, wherein the latent space elements are a result of processing the input signal; and obtaining the size of the alphabet of the entropy coder according to M = ceil(max{y} - min{y}) or M = 2^(ceil(log2(max{y} - min{y}))), where ceil(x) is the smallest integer greater than or equal to x, max{y} represents the maximum value of the latent space elements, min{y} represents the minimum value of the latent space elements, and M represents the size of the alphabet.
36. An encoding method implemented by an encoder, comprising: encoding an input signal and a first parameter into a bitstream, wherein the first parameter is used to obtain an entropy coding parameter; and transmitting the bitstream to a decoder, wherein the encoding method further comprises: obtaining at least two values around M_0, where M_0 = ceil(max{y} - min{y}) or M_0 = 2^(ceil(log2(max{y} - min{y}))); calculating a loss function for each of the at least two values; and selecting, from the at least two values, the value having the smallest loss function as the size of the alphabet of the entropy coder, where ceil(x) is the smallest integer greater than or equal to x, max{y} represents the maximum value of the latent space elements, and min{y} represents the minimum value of the latent space elements.
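As a non-limiting illustration of claims 35 and 36, the following Python sketch derives an initial alphabet size from the range of the latent space elements and then selects, among a few candidates around that value, the one with the smallest loss. The function names, the choice of candidates (half, same, double, or minus one, same, plus one), and the caller-supplied `loss` function are assumptions made for this example; the claims do not fix how the candidates or the loss function are chosen.

```python
import math


def initial_alphabet_size(latents, power_of_two=False):
    """M = ceil(max{y} - min{y}), or M = 2^(ceil(log2(max{y} - min{y})))."""
    span = max(latents) - min(latents)
    if span <= 0:
        raise ValueError("this sketch assumes max{y} > min{y}")
    if power_of_two:
        return 2 ** math.ceil(math.log2(span))
    return math.ceil(span)


def select_alphabet_size(latents, loss, power_of_two=True):
    """Evaluate candidates around M_0 and keep the one with the smallest loss.

    `loss` is a hypothetical callable mapping a candidate alphabet size to a
    cost, e.g. a rate-distortion cost measured by a trial encode.
    """
    m0 = initial_alphabet_size(latents, power_of_two)
    if power_of_two:
        candidates = [m0 // 2, m0, m0 * 2]
    else:
        candidates = [m0 - 1, m0, m0 + 1]
    candidates = [m for m in candidates if m >= 1]  # discard degenerate sizes
    return min(candidates, key=loss)


# Hypothetical latent values spanning 30.5 units.
y = [-10.5, 0.0, 3.25, 20.0]
m_linear = initial_alphabet_size(y)                 # ceil(30.5)
m_pow2 = initial_alphabet_size(y, power_of_two=True)  # 2^ceil(log2(30.5))
```

A smaller candidate may clip extreme latent values while a larger one spends more bits per symbol, which is the trade-off a loss function would be expected to capture.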
37. A decoder comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the one or more processors and storing a program for execution by the one or more processors, wherein the program, when executed by the one or more processors, configures the decoder to perform the method according to any one of claims 1 to 4, the method according to claim 8 or 9, the method according to any one of claims 10 to 13, the method according to claim 19 or 20, or the method according to claim 21 or 22.
38. An encoder comprising: one or more processors; and a non-transitory computer-readable storage medium coupled to the one or more processors and storing a program for execution by the one or more processors, wherein the program, when executed by the one or more processors, configures the encoder to perform the method according to any one of claims 23 to 26, the method according to any one of claims 28 to 31, the method according to claim 33 or 34, the method according to claim 35, or the method according to claim 36.
39. A non-transitory computer-readable medium carrying computer instructions that, when executed by a computer device or one or more processors, cause the computer device or the one or more processors to execute the method according to any one of claims 1 to 4, the method according to claim 8 or 9, the method according to any one of claims 10 to 13, the method according to claim 19 or 20, the method according to claim 21 or 22, the method according to any one of claims 23 to 26, the method according to any one of claims 28 to 31, the method according to claim 33 or 34, the method according to claim 35, or the method according to claim 36.
40. A computer program stored in a non-transitory storage medium, comprising code instructions, wherein, when executed on one or more processors, the code instructions cause the one or more processors to execute the method according to any one of claims 1 to 4, the method according to claim 8 or 9, the method according to any one of claims 10 to 13, the method according to claim 19 or 20, the method according to claim 21 or 22, the method according to any one of claims 23 to 26, the method according to any one of claims 28 to 31, the method according to claim 33 or 34, the method according to claim 35, or the method according to claim 36.
41. A system for distributing bitstreams, comprising: at least one storage medium configured to store at least one bitstream generated by the method according to any one of claims 23 to 26, the method according to any one of claims 28 to 31, the method according to claim 33 or 34, the method according to claim 35, or the method according to claim 36; and a video streaming device configured to acquire a bitstream from one of the at least one storage medium and transmit the bitstream to a terminal device, wherein the video streaming device includes a content server or a content distribution server.
42. The system according to claim 41, further comprising one or more processors configured to perform an encryption process on the at least one bitstream to obtain at least one encrypted bitstream, wherein the at least one storage medium is configured to store the encrypted bitstream; or wherein the one or more processors are configured to convert a bitstream of a first format into a bitstream of a second format, and the at least one storage medium is configured to store the bitstream of the second format.
43. The system according to claim 41, further comprising: a receiver configured to receive a first operation request; one or more processors configured to determine a target bitstream in the at least one storage medium in response to the first operation request; and a transmitter configured to transmit the target bitstream to a terminal device.
44. The system according to claim 43, wherein the one or more processors are further configured to encapsulate the bitstream to obtain a transport stream of a first format, and the transmitter is further configured to transmit the transport stream of the first format to the terminal device for display, or to send the transport stream of the first format to a storage area for storage.