Apparatus, method, and computer program for encoding and decoding video.
The enhanced method for selecting reference samples in video coding improves accuracy and efficiency by rounding motion vectors and expanding reference areas, addressing delays and signal quality issues in modern video coding standards.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- NOKIA TECHNOLOGIES OY
- Filing Date
- 2024-05-13
- Publication Date
- 2026-06-26
AI Technical Summary
Modern video coding standards face issues with reference sample selection in inter coding, particularly due to the use of additional tools like weighted prediction, bidirectional prediction, and local illumination compensation, which can introduce delays and affect signal characteristics, especially when block sizes are small.
The enhanced method selects reference samples by determining image blocks in multiple color channels, rounding motion vectors to full sample accuracy, expanding the reference area symmetrically or asymmetrically, and omitting samples outside the reference image, in order to improve cross-component prediction.
This approach enhances the accuracy and efficiency of video encoding and decoding by improving the selection of reference samples, reducing delays, and maintaining signal quality, especially in small block sizes.
Description
[Technical Field]
[0001] The present invention relates to an apparatus, method, and computer program for encoding and decoding video.
[Background Art]
[0002] Inter coding in the latest video coding standards (such as H.266) uses single-tree motion-compensated prediction, which copies a region from a reference frame or a set of reference frames to the target frame using sub-pixel interpolation. Additional tools such as weighted prediction (WP), bi-prediction with CU-level weights (BCW), combined inter / intra prediction (CIIP), and local illumination compensation (LIC) are used to account for changes in lighting and color differences between the reference and target frames.
[0003] However, the use of such adjustment tools can negatively impact the reference samples used. For example, in affine prediction, motion compensation applies a transformation to the reference block to generate predictions, which can be done independently of the calculation of model parameters. Also, some processes, such as LIC, can have undesirable effects on the calculation of model parameters, and interpolation filters can affect signal characteristics, such as the loss of some high-frequency components. In general, all additional processing introduces delays in the pipeline that calculates the model. Furthermore, if the block size is small, the prediction block may not contain enough samples, and it may be undesirable to use samples near the current block because it creates a dependency on the reconstructed samples of neighboring blocks.
[0004] Therefore, the method for selecting reference samples needs to be improved.
[Summary of the Invention]
[0005] Therefore, to mitigate the above-mentioned problems, an enhanced method for improving the selection of reference samples is proposed herein.
[0006] The scope of protection sought for various embodiments of the present invention is defined by the independent claims. Any embodiments and features described herein that are not included in the scope of the independent claims should be interpreted as useful examples for understanding various embodiments of the present invention.
[0007] An apparatus according to a first aspect comprises: means for determining an image block unit of a frame, the image block unit including samples in at least first and second color channels; means for determining at least one reference image, a first reference region in the at least one reference image for predicting target samples in at least the first color channel of the image block unit, and a second reference region in the at least one reference image for predicting target samples in at least the second color channel of the image block unit; means for reconstructing samples in the first color channel of the image block unit; means for deriving, from reference samples in the first and second reference regions, reference samples for generating a cross-component prediction model; and means for predicting target samples in at least the second color channel of the image block unit using the reconstructed samples in at least the first color channel as input to the cross-component prediction model.
[0008] According to one embodiment, the first and second color channels include at least one luminance channel and at least one chrominance channel.
[0009] According to one embodiment, the apparatus comprises means for rounding the luminance and / or chrominance motion vectors determined for a block in the current image to full sample accuracy, and means for determining the position of a reference region in a reference image using the rounded motion vectors.
[0010] According to one embodiment, the apparatus comprises means for rounding the motion vectors of one color channel to full sample accuracy by removing the fractional part of the motion vectors, and for determining the sample values corresponding to the other color channels by interpolating samples taking into account the offset between the color channels.
[0011] According to one embodiment, the apparatus comprises means for rounding the motion vectors of one color channel to full sample accuracy and for determining the sample values corresponding to the other color channels by interpolating samples taking into account the offset between the color channels.
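By way of illustration only, the rounding described in [0009] to [0011] might be sketched in Python as follows. The 1/16-sample motion vector unit, the function names, and the choice between dropping the fractional part and rounding to the nearest full sample are assumptions made for this sketch and are not taken from the embodiments.

```python
MV_FRAC_BITS = 4  # assumption: motion vectors are stored in 1/16-sample units


def round_mv_floor(mv):
    """Drop the fractional bits of each component (rounds toward minus infinity)."""
    return tuple((c >> MV_FRAC_BITS) << MV_FRAC_BITS for c in mv)


def round_mv_nearest(mv):
    """Round each component to the nearest full-sample position."""
    half = 1 << (MV_FRAC_BITS - 1)
    return tuple(((c + half) >> MV_FRAC_BITS) << MV_FRAC_BITS for c in mv)


def reference_origin(block_x, block_y, rounded_mv):
    """Top-left full-sample position of the reference region in the reference image."""
    return (block_x + (rounded_mv[0] >> MV_FRAC_BITS),
            block_y + (rounded_mv[1] >> MV_FRAC_BITS))


mv = (37, -21)  # 2 + 5/16 samples to the right, 1 + 5/16 samples up
print(round_mv_floor(mv))
print(round_mv_nearest(mv))
print(reference_origin(64, 32, round_mv_nearest(mv)))
```

Because the rounded vector points at a full-sample position, the reference region can be read directly from the reference image without running an interpolation filter for that color channel.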
[0012] According to one embodiment, the apparatus comprises means for expanding the area used for the selection of reference samples beyond the reference area.
[0013] According to one embodiment, the apparatus comprises means for symmetrically expanding the area used for the selection of reference samples upward, downward, leftward, and rightward with respect to the area determined by the size and shape of the prediction unit.
[0014] According to one embodiment, the apparatus comprises means for asymmetrically expanding the area used for the selection of reference samples with respect to the area determined by the size and shape of the prediction unit.
[0015] According to one embodiment, the apparatus comprises means for expanding the area used for the selection of reference samples in the motion direction in response to the number of samples in the prediction block being less than a predetermined threshold.
[0016] According to one embodiment, the apparatus comprises means for omitting reference samples located outside the reference image in the first and second reference areas when generating a cross-component prediction model.
[0017] According to one embodiment, the shape of the expanded reference area is non-rectangular.
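By way of illustration only, the region handling of [0012] to [0016] might be sketched as follows. The rectangle representation, the margin and threshold parameters, and all function names are hypothetical, and the sketch covers rectangular regions only.

```python
def expand_region(x, y, w, h, left=0, right=0, up=0, down=0):
    """Return (x0, y0, x1, y1), exclusive at x1 / y1, grown by the given margins."""
    return (x - left, y - up, x + w + right, y + h + down)


def expand_along_motion(x, y, w, h, mv, margin, min_samples):
    """Grow the region in the motion direction only when the block has few samples."""
    if w * h >= min_samples:
        return expand_region(x, y, w, h)
    return expand_region(
        x, y, w, h,
        left=margin if mv[0] < 0 else 0,
        right=margin if mv[0] > 0 else 0,
        up=margin if mv[1] < 0 else 0,
        down=margin if mv[1] > 0 else 0,
    )


def positions_inside(region, pic_w, pic_h):
    """List the sample positions of the region that lie inside the reference image."""
    x0, y0, x1, y1 = region
    return [(px, py)
            for py in range(max(y0, 0), min(y1, pic_h))
            for px in range(max(x0, 0), min(x1, pic_w))]


symmetric = expand_region(-1, 0, 4, 4, left=2, right=2, up=2, down=2)
print(symmetric)
print(len(positions_inside(symmetric, 64, 64)))
print(expand_along_motion(8, 8, 4, 4, mv=(5, -3), margin=2, min_samples=64))
```

Passing equal margins gives the symmetric case, unequal margins the asymmetric case, and `positions_inside` simply leaves out positions beyond the image boundary rather than padding them.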
[0018] According to one embodiment, the means is implemented in an encoder.
[0019] According to one embodiment, the means is implemented in the decoder.
[0020] According to a second aspect, an apparatus is provided, comprising at least one processor and at least one memory, the at least one memory storing code which, when executed by the at least one processor, causes the apparatus to perform at least: determine an image block unit of a frame, wherein the image block unit includes samples in at least first and second color channels; determine at least one reference image, a first reference region in the at least one reference image for predicting target samples in at least the first color channel of the image block unit, and a second reference region in the at least one reference image for predicting target samples in at least the second color channel of the image block unit; reconstruct samples in the first color channel of the image block unit; derive, from reference samples in the first and second reference regions, reference samples for generating a cross-component prediction model; and predict target samples in at least the second color channel of the image block unit using the reconstructed samples in at least the first color channel as input to the cross-component prediction model.
[0021] A method according to a third aspect includes determining an image block unit of a frame, wherein the image block unit includes samples of at least a first and a second color channel; determining at least one reference image, wherein the reference image includes a first reference region for predicting target samples of at least a first color channel of the image block unit, and a second reference region for predicting target samples of at least a second channel of the image block unit; reconstructing samples of the first color channel of the image block unit; deriving reference samples from the reference samples in the first and second reference regions for generating a cross-component prediction model; and predicting target samples of at least a second color channel of the image block unit using at least the reconstructed samples of the first color channel as input to the cross-component prediction model.
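By way of illustration only, the deriving and predicting steps of the method might be sketched with a simple linear model. The model form chroma = a * luma + b, the least-squares fit, the one-to-one pairing of luminance and chrominance samples (as in 4:4:4 sampling), and all names are assumptions made for this sketch; the embodiments do not prescribe a particular cross-component prediction model here.

```python
def fit_linear_model(luma_ref, chroma_ref):
    """Least-squares fit of chroma ~ a * luma + b over paired reference samples."""
    n = len(luma_ref)
    mean_l = sum(luma_ref) / n
    mean_c = sum(chroma_ref) / n
    var_l = sum((l - mean_l) ** 2 for l in luma_ref)
    if var_l == 0:
        return 0.0, mean_c  # flat luminance: fall back to the mean chrominance
    cov = sum((l - mean_l) * (c - mean_c) for l, c in zip(luma_ref, chroma_ref))
    a = cov / var_l
    return a, mean_c - a * mean_l


def predict_chroma(recon_luma, model, bit_depth=8):
    """Apply the model to reconstructed luminance samples and clip to the sample range."""
    a, b = model
    max_val = (1 << bit_depth) - 1
    return [min(max(int(round(a * l + b)), 0), max_val) for l in recon_luma]


# Reference samples taken from the first (luminance) and second (chrominance) reference regions
luma_ref = [100, 120, 140, 160]
chroma_ref = [60, 70, 80, 90]
model = fit_linear_model(luma_ref, chroma_ref)
print(model)
print(predict_chroma([110, 150], model))
```

The model parameters depend only on samples of the reference image, so in this sketch they can be computed without waiting for reconstructed samples of neighboring blocks in the current image.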
[0022] The above-described apparatus and computer-readable storage medium on which the code is recorded are configured to perform the above method and one or more embodiments related thereto.
[Brief Description of the Drawings]
[0023] To better understand the present invention, it will be described below by way of example with reference to the attached drawings.
[Figure 1] Figure 1 schematically shows an electronic device employing an embodiment of the present invention.
[Figure 2] Figure 2 schematically shows a user device suitable for employing an embodiment of the present invention.
[Figure 3] Figure 3 further schematically shows electronic devices employing embodiments of the present invention, connected using wireless and wired network connections.
[Figure 4a] Figure 4a schematically shows an encoder and decoder suitable for carrying out an embodiment of the present invention.
[Figure 4b] Figure 4b schematically shows an encoder and decoder suitable for carrying out an embodiment of the present invention.
[Figure 5] Figure 5 shows the sample locations used to derive the parameters of the cross-component linear model (CCLM).
[Figure 6a] Figure 6a shows an example of classifying luminance samples into two classes in the sample domain and the spatial domain.
[Figure 6b] Figure 6b shows an example of classifying luminance samples into two classes in the sample domain and the spatial domain.
[Figure 7] Figure 7 shows an example of a co-located reference sample region, consisting of reconstructed luminance and chrominance samples and defined for both luminance and chrominance, for a convolutional cross-component model (CCCM).
[Figure 8] Figure 8 shows various examples of filter kernel dimensions in CCCM.
[Figure 9] Figure 9 shows an example of four reference lines adjacent to a prediction block.
[Figure 10] Figure 10 shows the matrix-weighted intra-prediction process.
[Figure 11] Figure 11 shows an example of combined inter / intra prediction (CIIP) processing.
[Figure 12] Figure 12 shows a flowchart illustrating a method for providing improved reference sample selection according to an embodiment of the present invention.
[Figure 13a] Figure 13a shows examples of reference region extension according to various embodiments of the present invention.
[Figure 13b] Figure 13b shows examples of reference region extension according to various embodiments of the present invention.
[Figure 13c] Figure 13c shows examples of reference region extension according to various embodiments of the present invention.
[Figure 14a] Figure 14a shows an example of using an IBC reference region for parameter calculation according to an embodiment of the present invention.
[Figure 14b] Figure 14b shows an example where the IBC reference region overlaps with the current CTU and adjacent CTUs.
[Figure 15] Figure 15 shows a schematic diagram of an exemplary multimedia communication system in which various embodiments can be implemented.
[Modes for Carrying Out the Invention]
[0024] The following describes in more detail suitable apparatuses and possible mechanisms for using a distortion metric in inter-coding and intra-coding. In this regard, we first refer to Figures 1 and 2. Figure 1 shows a schematic block diagram of a video coding system according to an exemplary embodiment, as an exemplary apparatus or electronic device 50 into which a codec according to an embodiment of the present invention can be incorporated. Figure 2 shows the layout of the apparatus according to the embodiment. The components of Figures 1 and 2 are described below.
[0025] The electronic device 50 may be, for example, a mobile terminal or user device of a wireless communication system. However, it should be understood that embodiments of the present invention may be implemented in any electronic device or apparatus that encodes and decodes video, or requires encoding or decoding.
[0026] The device 50 may include a housing 30 for housing and protecting the device. The device 50 may further include a display 32 in the form of a liquid crystal display. In other embodiments of the present invention, the display may be any suitable display technology suitable for displaying images or videos. The device 50 may further include a keypad 34. In other embodiments of the present invention, any suitable data input mechanism or user interface mechanism may be employed. For example, the user interface may be implemented as a virtual keyboard or data input system as part of a touch-sensitive display.
[0027] The device may be equipped with a microphone 36 or any suitable audio input means, which may be a digital signal input or an analog signal input. The device 50 may further be equipped with an audio output device, which in embodiments of the present invention may be an earpiece 38, a speaker, or either an analog audio output connection or a digital audio output connection. The device 50 may also be equipped with a battery (or, in other embodiments of the present invention, the device may be powered by any suitable portable energy source such as a solar cell, a fuel cell, or a spring-driven generator). The device may further be equipped with a camera capable of recording or capturing images and / or video. The device 50 may further be equipped with an infrared port for short-range line-of-sight communication with other devices. In other embodiments, the device 50 may further be equipped with any suitable short-range communication means, such as a Bluetooth® wireless connection or a USB / FireWire wired connection.
[0028] The device 50 may include a controller 56, a processor, or a processor circuit for controlling the device 50. The controller 56 is connected to a memory 58. In embodiments of the present invention, the memory 58 stores data in image data and audio data formats, and can also store instructions executed on the controller 56. The controller 56 may further be connected to a codec circuit 54 suitable for performing encoding and decoding of audio and / or video data, or may be configured to assist encoding and decoding performed by the controller.
[0029] The device 50 may further include a card reader 48 and a smart card 46, for example, a UICC and a UICC reader. These are suitable for providing user information and authentication information for user authentication and authorization on a network.
[0030] The device 50 may include a radio interface circuit 52 that is connected to the controller and is suitable for generating radio communication signals for communication with, for example, a cellular communication network, a wireless communication system, or a wireless local area network. The device 50 may further include an antenna 44 connected to the radio interface circuit 52. This antenna is suitable for transmitting radio frequency signals generated by the radio interface circuit 52 to other devices and for receiving radio frequency signals from other devices.
[0031] The device 50 may include a camera capable of recording or detecting individual frames, and the detected frames are passed to a codec 54 or controller for processing. The device may receive and process video data from other devices before transmission and / or storage. The device 50 may also receive images for encoding / decoding by either a wireless or wired connection. The components of the device 50 described above are examples of means for performing their respective functions.
[0032] Figure 3 shows an example of a system that can utilize embodiments of the present invention. System 10 includes a plurality of communication devices that can communicate over one or more networks. System 10 may include any combination of wired and wireless networks. This includes, but is not limited to, wireless cellular networks (GSM, UMTS, CDMA networks, etc.), wireless local area networks (WLANs) as defined by any of the IEEE 802.x standards, Bluetooth® personal area networks, Ethernet® local area networks, Token Ring local area networks, wide area networks (WANs), the Internet, etc.
[0033] System 10 may include both wired and wireless communication devices and / or device 50 suitable for realizing embodiments of the present invention.
[0034] For example, the system shown in Figure 3 is a schematic diagram of a mobile telephone network 11 and the Internet 28. Connection to the Internet 28 may include long-range wireless connections, short-range wireless connections, and various wired connections, including but not limited to telephone lines, cable lines, power lines, and similar communication paths.
[0035] Examples of communication devices shown in System 10 include, but are not limited to, the electronic device 50, a combination of a personal digital assistant (PDA) and a mobile phone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, and the like. The device 50 may be stationary, or mobile when carried by an individual who is moving. The device 50 may also be installed in a means of transport including, but not limited to, automobiles, trucks, taxis, buses, trains, ships, aircraft, bicycles, motorcycles, or similar appropriate means of transport.
[0036] The embodiments can also be implemented in set-top boxes (i.e., digital television receivers that may or may not have a display and wireless capabilities), tablet devices, (laptop) personal computers (PCs) (which may be equipped with hardware implementations, software implementations, or a combination thereof of encoders / decoders), various operating systems, and chipsets, processors, DSPs, and / or embedded systems that provide hardware / software-based encoding.
[0037] Some of the devices, or further devices, may connect to a base station 24 via a wireless connection 25 to make and receive calls and messages and communicate with a service provider. The base station 24 may be connected to a network server 26 that enables communication between the mobile telephone network 11 and the Internet 28. The system may include additional communication devices and various types of communication devices.
[0038] The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), Global System for Mobile Communications (GSM), Universal Mobile Telecommunications System (UMTS), time division multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol / Internet protocol (TCP / IP), short message service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth®, IEEE 802.11, and any similar wireless communication technologies. Communication devices involved in implementing various embodiments of the present invention may communicate using a variety of media, including, but not limited to, wireless, infrared, laser, cable connections, and other suitable connections.
[0039] In telecommunications and data networks, a channel may refer to either a physical channel or a logical channel. A physical channel refers to a physical transmission medium such as a wire, while a logical channel may refer to a logical connection on a multiplexed medium capable of transmitting multiple logical channels. A channel may be used to transmit information signals, such as a bitstream, from one or more senders (or transmitters) to one or more receivers.
[0040] The MPEG-2 transport stream (TS), as defined in ISO / IEC 13818-1 or ITU-T Recommendation H.222.0, is a format for transmitting audio, video, other media, and program metadata and other metadata as a multiplexed stream. Packet identifiers (PIDs) are used to identify elementary streams (also known as packetized elementary streams) within a TS. Therefore, logical channels within an MPEG-2 TS can be considered to correspond to specific PID values.
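By way of illustration only, grouping transport stream packets into logical channels by PID might be sketched as follows. The sketch relies on the fixed 188-byte packet size, the 0x47 sync byte, and the 13-bit PID field in the second and third header bytes; the function names are hypothetical, and the sketch neither resynchronizes on corrupted input nor parses adaptation fields.

```python
TS_PACKET_SIZE = 188  # bytes per transport stream packet
SYNC_BYTE = 0x47      # first byte of every packet


def packet_pid(packet):
    """Return the 13-bit PID from the header of one transport stream packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid transport stream packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def split_by_pid(stream):
    """Group the packets of a byte stream into one list per PID (logical channel)."""
    channels = {}
    for off in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        pkt = stream[off:off + TS_PACKET_SIZE]
        channels.setdefault(packet_pid(pkt), []).append(pkt)
    return channels


# Two synthetic packets with PID 0x100 and one with PID 0x101 (payload bytes are zero)
video = bytes([0x47, 0x41, 0x00, 0x10]) + bytes(184)
audio = bytes([0x47, 0x01, 0x01, 0x10]) + bytes(184)
channels = split_by_pid(video + audio + video)
print({pid: len(pkts) for pid, pkts in channels.items()})
```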
[0041] Available media file format standards include the ISO base media file format (ISO / IEC 14496-12, abbreviated as ISOBMFF) and the file format for NAL unit structured video (ISO / IEC 14496-15), which is derived from ISOBMFF.
[0042] A video codec consists of an encoder, which converts input video into a compressed representation suitable for storage / transmission, and a decoder, which can decompress (decode) the compressed video representation into a viewable form. The video encoder and / or video decoder may be independent of each other and do not necessarily constitute a codec. Typically, the encoder discards some information from the original video sequence and represents the video in a more compact form (i.e., at a lower bitrate).
[0043] Typical hybrid video encoders, such as many encoder implementations of ITU-T H.263 and H.264, encode video information in two stages. First, the pixel values of a given image region (or "block") are predicted. This prediction can be made, for example, by motion compensation (i.e., by finding and indicating a region approximating the block to be encoded from one of the previously encoded video frames) or by spatial means (by utilizing the pixel values surrounding the block to be encoded in a predetermined manner). Second, the prediction error, i.e., the difference between the predicted pixel block and the original pixel block, is encoded. This is usually done by transforming the difference in pixel values using a predetermined transformation (e.g., discrete cosine transform (DCT) or a variation thereof), quantizing the coefficients, and entropy encoding the quantized coefficients. By changing the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (image quality) and the size of the resulting encoded video representation (file size and transmission bitrate).
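By way of illustration only, the second stage (transform, quantization, and reconstruction of the prediction error) might be sketched for a single row of samples. The one-dimensional orthonormal DCT-II, the uniform quantization step `qstep`, and the function names are assumptions made for this sketch; real codecs use two-dimensional integer transforms and entropy-code the resulting levels.

```python
import math


def _scale(k, n):
    """Normalization factor of the orthonormal DCT-II basis function k."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(x):
    """Orthonormal one-dimensional DCT-II."""
    n = len(x)
    return [_scale(k, n) * sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                               for i, v in enumerate(x))
            for k in range(n)]


def idct(coeffs):
    """Inverse of dct()."""
    n = len(coeffs)
    return [sum(_scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]


def encode_block(original, prediction, qstep):
    """Transform the prediction error and quantize the coefficients to integer levels."""
    residual = [o - p for o, p in zip(original, prediction)]
    return [round(c / qstep) for c in dct(residual)]


def decode_block(levels, prediction, qstep):
    """Dequantize, inverse-transform, and add the prediction back."""
    residual = idct([lv * qstep for lv in levels])
    return [p + round(r) for p, r in zip(prediction, residual)]


original = [52, 55, 61, 66, 70, 61, 64, 73]
prediction = [54, 54, 60, 64, 68, 64, 64, 70]
for qstep in (1, 8):
    levels = encode_block(original, prediction, qstep)
    print(qstep, levels, decode_block(levels, prediction, qstep))
```

A larger `qstep` turns more coefficients into zero levels, which lowers the bitrate at the cost of a less accurate reconstruction.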
[0044] In temporal prediction, the source of the prediction is a previously decoded image (also known as a reference image). In intra block copy (IBC, also known as intra block copy prediction), the prediction is applied similarly to temporal prediction, but the reference image is the current image, and the prediction process can only refer to previously decoded samples. Inter-layer prediction or inter-view prediction can also be applied similarly to temporal prediction, but the reference images are decoded images from different scalable layers or decoded images from different viewpoints. In some cases, inter prediction may refer only to temporal prediction, and in other cases, inter prediction may refer collectively to temporal prediction, intra block copy prediction, inter-layer prediction, or inter-view prediction, as long as it is performed using the same or similar processing as temporal prediction. Inter prediction or temporal prediction is often referred to as motion compensation or motion-compensated prediction.
[0045] Motion compensation can be performed with either full-sample accuracy or sub-sample accuracy. In full-sample accuracy motion compensation, motion is represented as a motion vector with horizontal and vertical displacements expressed as integer values, and the motion compensation process uses these displacements to substantially copy the sample from the reference image. In sub-sample accuracy motion compensation, the motion vector is expressed with horizontal and vertical components as fractional or decimal values. If the motion vector points to a non-integer position in the reference image, sub-sample interpolation is usually performed to calculate a predicted sample value based on the reference sample and the selected sub-sample position. Sub-sample interpolation typically consists of horizontal filtering to compensate for the horizontal offset relative to the full sample position, followed by vertical filtering to compensate for the vertical offset relative to the full sample position. However, depending on the environment, the vertical processing may be performed before the horizontal processing.
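By way of illustration only, separable sub-sample interpolation (a horizontal pass followed by a vertical pass) might be sketched with a two-tap bilinear filter. The 1/16-sample fractional unit, the bilinear filter, and the function name are assumptions made for this sketch; practical codecs use longer multi-tap filters.

```python
FRAC_BITS = 4          # assumption: fractional positions in 1/16-sample units
ONE = 1 << FRAC_BITS


def interpolate(ref, x_int, y_int, x_frac, y_frac):
    """Bilinear sample value at (x_int + x_frac/16, y_int + y_frac/16).

    ref is a list of rows; the sample to the right of and below (x_int, y_int)
    must exist, even when the fractional parts are zero.
    """
    # Horizontal pass over the two rows involved (results scaled by 16)
    rows = []
    for dy in (0, 1):
        a = ref[y_int + dy][x_int]
        b = ref[y_int + dy][x_int + 1]
        rows.append(a * (ONE - x_frac) + b * x_frac)
    # Vertical pass over the two intermediate values (result scaled by 256)
    value = rows[0] * (ONE - y_frac) + rows[1] * y_frac
    # Remove the scaling with rounding
    return (value + (1 << (2 * FRAC_BITS - 1))) >> (2 * FRAC_BITS)


ref = [[10, 20],
       [30, 40]]
print(interpolate(ref, 0, 0, 0, 0))   # full-sample position
print(interpolate(ref, 0, 0, 8, 8))   # half-sample position in both directions
```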
[0046] Inter prediction, also known as temporal prediction, motion compensation, or motion-compensated prediction, reduces temporal redundancy. In inter prediction, the prediction source is a previously decoded image. Intra prediction utilizes the fact that adjacent pixels within the same image are likely to be correlated. Intra prediction can be performed in the spatial or transform domain, meaning that either sample values or transform coefficients can be predicted. Intra prediction is typically used in intra coding, in which inter prediction is not applied.
[0047] The result of the encoding process is a set of coding parameters, such as motion vectors and quantized transform coefficients. Many parameters can be entropy-coded more efficiently if they are first predicted from spatially or temporally adjacent parameters. For example, a motion vector can be predicted from spatially adjacent motion vectors, and only the difference relative to the motion vector predictor is encoded. Prediction of coding parameters and intra prediction can be collectively called in-picture prediction.
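By way of illustration only, coding a motion vector as a difference from a predictor might be sketched as follows. The component-wise median of three neighboring motion vectors is used here as an assumed predictor, and the function names are hypothetical; the text above does not prescribe a particular predictor.

```python
def median_predictor(neighbours):
    """Component-wise median of three neighbouring motion vectors."""
    xs = sorted(mv[0] for mv in neighbours)
    ys = sorted(mv[1] for mv in neighbours)
    return (xs[1], ys[1])


def encode_mv(mv, neighbours):
    """Return the motion vector difference that would be entropy-coded."""
    pred = median_predictor(neighbours)
    return (mv[0] - pred[0], mv[1] - pred[1])


def decode_mv(mvd, neighbours):
    """Rebuild the motion vector from the decoded difference and the same predictor."""
    pred = median_predictor(neighbours)
    return (mvd[0] + pred[0], mvd[1] + pred[1])


neighbours = [(4, -2), (6, 0), (5, -1)]
mvd = encode_mv((6, -1), neighbours)
print(mvd)
print(decode_mv(mvd, neighbours))
```

The decoder forms the same predictor from already decoded neighbors, so only the typically small difference has to be transmitted.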
[0048] Figures 4a and 4b show an encoder and a decoder suitable for employing embodiments of the present invention. A video codec consists of an encoder that converts the input video into a compressed representation suitable for storage / transmission, and a decoder that can decompress (decode) the compressed video representation into a viewable form again. Typically, the encoder discards or loses some information in the original video sequence in order to represent the video in a more compact form (i.e., at a lower bitrate). Figure 4a shows an example of the encoding process. Figure 4a shows the image to be encoded (I_n), the predicted representation of the image block (P'_n), the prediction error signal (D_n), the reconstructed prediction error signal (D'_n), the provisional reconstructed image (I'_n), the final reconstructed image (R'_n), transform (T) and inverse transform (T^-1), quantization (Q) and inverse quantization (Q^-1), entropy encoding (E), reference frame memory (RFM), inter prediction (P_inter), intra prediction (P_intra), mode selection (MS), and filtering (F).
[0049] An example of the decoding process is shown in Figure 4b. Figure 4b shows the predicted representation of the image block (P'_n), the reconstructed prediction error signal (D'_n), the provisional reconstructed image (I'_n), the final reconstructed image (R'_n), inverse transform (T^-1), inverse quantization (Q^-1), entropy decoding (E^-1), reference frame memory (RFM), prediction (P, either inter or intra), and filtering (F).
[0050] Many hybrid video encoders encode video information in two stages. First, the pixel values of a given image region (or "block") are predicted. This is done, for example, by motion compensation (i.e., by searching for and specifying a region that approximates the block to be encoded from one of the previously encoded video frames) or by spatial means (by using the pixel values surrounding the block to be encoded in a predetermined way). Second, the prediction error, i.e., the difference between the predicted pixel block and the original pixel block, is encoded. This is usually done by transforming the difference in pixel values using a predetermined transform (e.g., the discrete cosine transform (DCT) or a variation thereof), quantizing the coefficients, and entropy encoding the quantized coefficients. By changing the fidelity of the quantization process, the encoder can control the balance between the accuracy of the pixel representation (image quality) and the size of the resulting encoded video representation (file size and transmission bitrate). Video codecs may also offer a transform skip mode, which the encoder can select to use. In transform skip mode, the prediction error is encoded in the sample domain. For example, a sample-wise difference value relative to a specific adjacent sample is derived, and that sample-wise difference value is encoded using an entropy encoder.
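By way of illustration only, the sample-wise differencing mentioned for the skip mode might be sketched as follows. Taking the left neighbor as the adjacent sample, working on one row of the prediction error without quantization, and the function names are all assumptions made for this sketch.

```python
def sample_differences(residual_row):
    """Replace each prediction error sample by its difference from the left neighbour."""
    diffs, left = [], 0
    for r in residual_row:
        diffs.append(r - left)
        left = r
    return diffs


def restore_residual(diffs):
    """Invert sample_differences() by accumulating the differences."""
    row, left = [], 0
    for d in diffs:
        left += d
        row.append(left)
    return row


row = [12, 13, 13, 15, 14]
diffs = sample_differences(row)
print(diffs)
print(restore_residual(diffs))
```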
[0051] Entropy coding / decoding can be performed in various ways. For example, context-based coding / decoding may be applied, where the context state of the coding parameters is modified in both the encoder and decoder based on previously coded / decoded coding parameters. Context-based coding includes, for example, context-adaptive binary arithmetic coding (CABAC), context-adaptive variable-length coding (CAVLC), or any similar entropy coding. Alternatively or additionally, entropy coding / decoding may be performed using variable-length coding schemes such as Huffman coding / decoding or Exp-Golomb coding / decoding. The process of decoding coding parameters from an entropy-coded bitstream or codeword is sometimes called parsing.
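By way of illustration only, unsigned order-0 Exp-Golomb coding and its parsing might be sketched as follows. Representing the bitstream as a string of "0" and "1" characters and the function names are assumptions made for readability.

```python
def ue_encode(value):
    """Unsigned order-0 Exp-Golomb codeword for a non-negative integer, as a bit string."""
    code = bin(value + 1)[2:]            # binary representation of value + 1
    return "0" * (len(code) - 1) + code  # prefix of leading zeros, then the code itself


def ue_decode(bits, pos=0):
    """Parse one codeword starting at pos and return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


stream = "".join(ue_encode(v) for v in (3, 0, 7))
print(stream)
pos, values = 0, []
while pos < len(stream):
    value, pos = ue_decode(stream, pos)
    values.append(value)
print(values)
```

The number of leading zeros tells the parser how long each codeword is, so codewords can be concatenated without separators.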
[0052] The expression "along the bitstream" (e.g., indicating along the bitstream) may be defined as referring to out-of-band transmission, signaling, or storage in a manner that associates the out-of-band data with the bitstream. Expressions such as "decode along the bitstream" may refer to the decoding of out-of-band data associated with the bitstream (which can be retrieved from out-of-band transmission, signaling, or storage). For example, "instructions along the bitstream" may refer to metadata within a container file that encapsulates the bitstream.
[0053] The H.264 / AVC standard was developed by the Joint Video Team (JVT) of the Video Coding Experts Group (VCEG) of the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T) and the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) / International Electrotechnical Commission (IEC). The H.264 / AVC standard is published by both parent standardization bodies and is known as ITU-T Recommendation H.264 and ISO / IEC International Standard 14496-10 (also known as MPEG-4 Part 10 Advanced Video Coding (AVC)). Multiple versions of the H.264 / AVC standard exist, incorporating new extensions and features. These extensions include Scalable Video Coding (SVC) and Multiview Video Coding (MVC).
[0054] Version 1 of the High Efficiency Video Coding standard (H.265 / HEVC, also known as HEVC) was developed by the Joint Collaborative Team on Video Coding (JCT-VC) of VCEG and MPEG. This standard is published by both parent standardization bodies and is known as ITU-T Recommendation H.265 and ISO / IEC International Standard 23008-2 (also known as MPEG-H Part 2 High Efficiency Video Coding (HEVC)). Subsequent versions of H.265 / HEVC include Scalable Extensions, Multiview Extensions, Fidelity Range Extensions, 3D Extensions, and Screen Content Coding Extensions, which may be abbreviated as SHVC, MV-HEVC, REXT, 3D-HEVC, and SCC, respectively.
[0055] Versatile Video Coding (VVC) (MPEG-I Part 3), also known as ITU-T H.266, is a video compression standard developed by the Joint Video Experts Team (JVET) of the Moving Picture Experts Group (MPEG) (officially known as ISO / IEC JTC1 SC29 WG11) and the Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU). It is the successor to HEVC / H.265.
[0056] This section describes some key definitions, bitstreams, and coding structures and concepts in H.264 / AVC and HEVC as examples of video encoders, decoders, encoding methods, decoding methods, and bitstream structures. These specific examples can be implemented as embodiments of the present invention. Some of the key definitions, bitstreams, and coding structures and concepts of H.264 / AVC are identical to those of HEVC and will therefore be described together below. While aspects of the present invention are not limited to H.264 / AVC or HEVC, they will be described as one possible basis for realizing some or all of the present invention.
[0057] Like many earlier video coding standards, H.264 / AVC and HEVC specify the syntax and semantics of the bitstream, as well as the decoding process for error-free bitstreams. The encoding process is not specified, but the encoder must generate a conforming bitstream. The conformance of the bitstream and the decoder can be verified using the Hypothetical Reference Decoder (HRD). While these standards include coding tools that help cope with transmission errors and losses, their use in encoding is optional, and no decoding process is specified for erroneous bitstreams.
[0058] The basic unit in the input to an H.264 / AVC or HEVC encoder, and in the output from an H.264 / AVC or HEVC decoder, is the picture. The picture given as input to the encoder is also called the source picture, and the picture decoded by the decoder is also called the decoded picture.
[0059] The source picture and the decoded picture each consist of one or more sample arrays, for example one of the following sets of sample arrays: • Luminance (Y) only (monochrome) • Luminance and two chrominance components (YCbCr or YCgCo) • Green, blue, and red (GBR, also known as RGB) • Arrays representing other unspecified monochrome or tristimulus color samplings (e.g., YZX, also known as XYZ).
[0060] In H.264 / AVC and HEVC, a picture is either a frame or a field. A frame consists of a matrix of luminance samples, and possibly corresponding chrominance samples. A field is a set of alternating sample rows of a frame, and may be used as the input to an encoder when the input signal is interlaced. A chrominance sample array may not exist (and therefore monochrome sampling is used), or the chrominance sample array may be subsampled compared to the luminance sample array. The chrominance format can be summarized as follows: In monochrome sampling, there is only one sample sequence, which is nominally considered to be a luminance sequence. In 4:2:0 sampling, each of the two color difference arrays has half the height and half the width of the luminance array. In 4:2:2 sampling, each of the two color difference arrays has the same height as the luminance array, but half the width. When using 4:4:4 sampling and not using independent color planes, the two color difference arrays each have the same height and width as the luminance array.
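The chrominance array dimensions listed above can be sketched in a few lines. This is an illustrative sketch only: the function name `chroma_array_size` and the string labels for the formats are assumptions made for this example, and the luminance dimensions are assumed to be even.

```python
def chroma_array_size(luma_width, luma_height, chroma_format):
    """Return (width, height) of each chrominance array, or None for monochrome.

    Hypothetical helper; the format labels are chosen for this example.
    """
    if chroma_format == "monochrome":
        return None  # only the luminance array exists
    if chroma_format == "4:2:0":
        return luma_width // 2, luma_height // 2  # half width, half height
    if chroma_format == "4:2:2":
        return luma_width // 2, luma_height       # half width, same height
    if chroma_format == "4:4:4":
        return luma_width, luma_height            # same width and height
    raise ValueError(f"unknown chroma format: {chroma_format}")


for fmt in ("monochrome", "4:2:0", "4:2:2", "4:4:4"):
    print(fmt, chroma_array_size(1920, 1080, fmt))
```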
[0061] In H.264 / AVC and HEVC, it is possible to encode the sample sequence as separate color planes into the bitstream and decode each of the separately encoded color planes from the bitstream. When separate color planes are used, each color plane is processed individually (by the encoder and / or decoder) as a monochrome sampled picture.
[0062] Partitioning is defined as dividing a set into multiple subsets such that each element of the set is contained within one of the subsets.
[0063] In describing the operation of HEVC encoding and / or decoding, the following terms may be used: A coding block is defined as an N×N block of samples for some value of N such that the division of a coding tree block into coding blocks is a partitioning. A coding tree block (CTB) is defined as an N×N block of samples for some value of N such that the division of a component into coding tree blocks is a partitioning. A coding tree unit (CTU) may be defined as a coding tree block of luminance samples and two corresponding coding tree blocks of chrominance samples of a picture that has three sample arrays, or a coding tree block of samples of a monochrome picture or a picture coded using three separate color planes, together with the syntax structures used to code the samples. A coding unit (CU) may be defined as a coding block of luminance samples and two corresponding coding blocks of chrominance samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture coded using three separate color planes, together with the syntax structures used to code those samples. The CU with the largest allowed size is called the LCU (largest coding unit) or coding tree unit (CTU), and video pictures are divided into non-overlapping LCUs.
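As a rough illustration of dividing a picture into non-overlapping LCUs / CTUs, the sketch below computes the CTU grid and the top-left luminance sample of a CTU addressed in raster-scan order. The function names are hypothetical, and the handling of pictures whose size is not a multiple of the CTB size (partial CTUs at the right and bottom edges) is a simplifying assumption of this example.

```python
def ctu_grid(pic_width, pic_height, ctb_size):
    """Number of CTU columns and rows needed to cover the picture."""
    cols = -(-pic_width // ctb_size)   # ceiling division
    rows = -(-pic_height // ctb_size)
    return cols, rows


def ctu_origin(ctu_addr, pic_width, ctb_size):
    """Top-left luminance sample (x, y) of the CTU at a raster-scan address."""
    cols = -(-pic_width // ctb_size)
    return (ctu_addr % cols) * ctb_size, (ctu_addr // cols) * ctb_size


cols, rows = ctu_grid(1920, 1080, 64)
print(cols, rows, cols * rows)      # grid size and total CTU count
print(ctu_origin(31, 1920, 64))     # second CTU of the second CTU row
```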
[0064] A CU consists of one or more prediction units (PUs) that define the prediction process for samples within the CU, and one or more transform units (TUs) that define the prediction error coding process for samples within the CU. Typically, a CU consists of a square block of samples whose size is selectable from a predefined set of possible CU sizes. Each PU and TU can be further subdivided into smaller PUs and TUs to increase the granularity of the prediction and prediction error coding processes, respectively. Each PU is associated with prediction information that defines the type of prediction to be applied to the pixels within that PU (e.g., motion vector information for inter-predicted PUs, and intra-prediction directionality information for intra-predicted PUs).
[0065] Each TU can be associated with information describing the prediction error decoding process (e.g., including DCT coefficient information) for the samples within that TU. Typically, whether or not prediction error coding is applied to each CU is signaled at the CU level. If there is no prediction error residual associated with a CU, it can be assumed that there is no TU for that CU. The division of an image into CUs, and the division of CUs into PUs and TUs, is usually signaled within the bitstream, so that the decoder can reproduce the intended structure of these units.
[0066] In HEVC, a picture can be divided into tiles. A tile is rectangular and contains an integer number of LCUs. In HEVC, the division into tiles forms a regular grid, and the heights and widths of the tiles differ from each other by at most one LCU. In HEVC, a slice is defined as an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit. In HEVC, a slice segment is defined as an integer number of coding tree units that are ordered consecutively in the tile scan and contained in a single NAL unit. The division of each picture into slice segments is a partitioning. In HEVC, an independent slice segment is defined as a slice segment for which the values of the syntactic elements in the slice segment header are not inferred from the values of a preceding slice segment. A dependent slice segment is defined as a slice segment for which the values of some syntactic elements in the slice segment header are inferred from the values of the preceding independent slice segment in decoding order. In HEVC, a slice header is defined as the slice segment header of the independent slice segment that is the current slice segment, or of the independent slice segment that precedes the current dependent slice segment. A slice segment header is defined as the part of a coded slice segment that contains the data elements pertaining to the first or all coding tree units represented in the slice segment. CUs are scanned in the raster scan order of LCUs within a tile if tiles are used, or within a picture if tiles are not used. Within an LCU, CUs have a specific scan order.
[0067] The decoder reconstructs the output video by applying prediction means similar to those used by the encoder. Specifically, it uses motion or spatial information generated and stored in a compressed representation by the encoder to form a predicted representation of the pixel block, and performs prediction error decoding (the inverse of prediction error coding, which restores the quantized prediction error signal in the spatial pixel region). After applying the prediction means and prediction error decoding means, the decoder adds the predicted signal and the prediction error signal (pixel value) to generate an output video frame. The decoder (and encoder) may also apply additional filtering means to improve the quality of the output video. This is done before display and / or before storing it as a prediction reference for subsequent frames in the video sequence.
[0068] Filtering may include one or more of the following: deblocking, sample adaptive offset (SAO), and adaptive loop filtering (ALF). H.264 / AVC includes deblocking, while HEVC includes both deblocking and SAO.
[0069] In a typical video codec, motion information is represented by motion vectors associated with motion-compensated image blocks, such as prediction units. Each of these motion vectors represents the displacement between the image block in the picture being encoded (encoder side) or decoded (decoder side) and the prediction source block in one of the previously encoded or decoded pictures. To represent motion vectors efficiently, they are typically coded as the difference from a block-specific predicted motion vector. In a typical video codec, the predicted motion vector is generated in a predetermined way, for example, by calculating the median of the encoded or decoded motion vectors of adjacent blocks. Another way to generate motion vector predictions is to build a list of prediction candidates from adjacent blocks and / or co-located blocks in temporal reference pictures, and to signal the selected candidate as the motion vector predictor. In addition to predicting motion vector values, it is also possible to predict which reference picture is used for motion-compensated prediction; this prediction information may be represented, for example, by the reference index of a previously encoded / decoded picture. The reference index is typically predicted from adjacent blocks and / or co-located blocks in a temporal reference picture. Furthermore, typical high-efficiency video codecs employ an additional motion information encoding / decoding mechanism, often called merge mode, in which all motion field information, including the motion vector and corresponding reference picture index for each available reference picture list, is predicted and used without modification / correction. Similarly, the motion field information is predicted using the motion field information of adjacent blocks and / or co-located blocks in temporal reference pictures, and the motion field information used is signaled as an index into a list of motion field candidates filled with the motion field information of the available adjacent / co-located blocks.
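The median-based predictor and differential motion vector coding described in paragraph [0069] can be sketched as follows. This is a minimal illustration, not the derivation of any particular standard: the function names, the use of exactly three neighbouring vectors, and the integer vector units are assumptions of this example.

```python
def median_mv_predictor(neighbor_mvs):
    """Component-wise median of the neighbouring motion vectors (odd count assumed)."""
    xs = sorted(mv[0] for mv in neighbor_mvs)
    ys = sorted(mv[1] for mv in neighbor_mvs)
    mid = len(neighbor_mvs) // 2
    return xs[mid], ys[mid]


def encode_mvd(mv, predictor):
    """Encoder side: only the difference to the predictor is written."""
    return mv[0] - predictor[0], mv[1] - predictor[1]


def decode_mv(mvd, predictor):
    """Decoder side: the same predictor plus the difference restores the vector."""
    return mvd[0] + predictor[0], mvd[1] + predictor[1]


neighbors = [(4, -2), (6, 0), (5, 3)]   # hypothetical left, above, above-right vectors
pred = median_mv_predictor(neighbors)
mvd = encode_mvd((7, 1), pred)
print(pred, mvd, decode_mv(mvd, pred))
```

Because the encoder and decoder derive the same predictor from already coded blocks, only the (usually small) difference needs to be transmitted.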
[0070] In typical video codecs, the predicted residuals after motion compensation are first transformed by a transformation kernel (such as DCT) and then encoded. This is because the residuals often still contain correlations, and the transformation reduces these correlations, often resulting in more efficient encoding.
[0071] Video coding standards and specifications may allow encoders to divide a coded picture into coded slices or similar units. In-picture prediction is typically disabled across slice boundaries, so slices can be regarded as a means of dividing a coded picture into independently decodable portions. In H.264 / AVC and HEVC, in-picture prediction may likewise be disabled across slice boundaries, and slices are therefore often treated as the basic unit for transmission. Often, encoders can indicate in the bitstream which types of in-picture prediction are disabled across slice boundaries, and decoders take this information into account, for example, when determining which prediction sources are available. For example, if adjacent CUs belong to another slice, samples from those adjacent CUs may be considered unavailable for intra prediction.
[0072] The fundamental unit in the output of an H.264 / AVC or HEVC encoder and the input of an H.264 / AVC or HEVC decoder is the Network Abstraction Layer (NAL) unit. For transmission over packet-oriented networks or storage in structured files, NAL units may be encapsulated in packets or similar structures. H.264 / AVC and HEVC specify a byte-stream format for transmission and storage environments that do not provide framing structures. In the byte-stream format, NAL units are separated from each other by a start code placed before each NAL unit. To avoid false detection of NAL unit boundaries, the encoder runs a byte-oriented start code emulation prevention algorithm, which adds an emulation prevention byte to the NAL unit payload wherever a start code would otherwise occur. To facilitate gateway operation between packet-oriented and stream-oriented systems, start code emulation prevention may always be performed, regardless of whether the byte-stream format is used. A NAL unit can be defined as a syntactic structure that includes an indication of the type of data to follow and bytes containing that data in the form of an RBSP (Raw Byte Sequence Payload), interspersed as necessary with emulation prevention bytes. An RBSP can be defined as a syntactic structure consisting of an integer number of bytes encapsulated within a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntactic elements, followed by an RBSP stop bit and then zero or more bits equal to 0.
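A minimal sketch of start code emulation prevention is given below. The text above does not give the byte values; the rule used here (insert a 0x03 byte whenever two consecutive zero bytes would be followed by a byte of value 0x03 or less) is the one commonly described for H.264 / AVC and HEVC and is stated as an assumption of this example. The function names are hypothetical, and special handling of trailing zero bytes is omitted.

```python
def add_emulation_prevention(rbsp: bytes) -> bytes:
    """Insert 0x03 so that 00 00 00, 00 00 01, 00 00 02, 00 00 03 never appear."""
    out = bytearray()
    zeros = 0  # number of consecutive zero bytes just written
    for b in rbsp:
        if zeros >= 2 and b <= 0x03:
            out.append(0x03)  # emulation prevention byte
            zeros = 0
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


def remove_emulation_prevention(payload: bytes) -> bytes:
    """Decoder side: drop each 0x03 that follows two zero bytes."""
    out = bytearray()
    zeros = 0
    for b in payload:
        if zeros >= 2 and b == 0x03:
            zeros = 0
            continue  # skip the emulation prevention byte
        out.append(b)
        zeros = zeros + 1 if b == 0 else 0
    return bytes(out)


raw = b"\x12\x00\x00\x01\x34\x00\x00\x00\x00"
protected = add_emulation_prevention(raw)
print(protected.hex(" "))
print(remove_emulation_prevention(protected) == raw)
```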
[0073] A NAL unit consists of a header and a payload. In H.264 / AVC and HEVC, the NAL unit header indicates the type of NAL unit.
[0074] In HEVC, a 2-byte NAL unit header is used for all specified NAL unit types. The NAL unit header contains 1 reserved bit, a 6-bit NAL unit type indication, a 3-bit nuh_temporal_id_plus1 indication of the temporal level (which may be required to be 1 or greater), and a 6-bit nuh_layer_id syntax element. The nuh_temporal_id_plus1 syntax element can be considered the temporal identifier of the NAL unit, and a zero-based TemporalId variable can be derived as TemporalId = nuh_temporal_id_plus1 - 1. The abbreviation TID can be used interchangeably with the TemporalId variable. A TemporalId of 0 corresponds to the lowest temporal level. The value of nuh_temporal_id_plus1 must be non-zero to avoid start code emulation involving the 2-byte NAL unit header. A bitstream generated by excluding all VCL NAL units with a TemporalId greater than or equal to a selected value and including all other VCL NAL units remains conforming. Therefore, a picture whose TemporalId is equal to tid_value does not use any picture whose TemporalId is greater than tid_value as an inter prediction reference. A sublayer, or temporal sublayer, is defined as a temporal scalable layer (or temporal layer, TL) of a temporally scalable bitstream and consists of the VCL NAL units with a specific TemporalId value and the associated non-VCL NAL units. nuh_layer_id can be understood as the scalability layer identifier.
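The header fields and the temporal sub-bitstream property described above can be illustrated with a short parser. The bit order used here (reserved bit, then NAL unit type, then nuh_layer_id, then nuh_temporal_id_plus1) and the treatment of NAL unit types below 32 as VCL types follow the HEVC specification as commonly documented; neither is stated in the text above, so both are assumptions of this sketch, as are the function names.

```python
def parse_hevc_nal_header(nal: bytes) -> dict:
    """Split the 2-byte header into its four fields (assumed bit layout 1 + 6 + 6 + 3)."""
    if len(nal) < 2:
        raise ValueError("an HEVC NAL unit header is 2 bytes long")
    value = int.from_bytes(nal[:2], "big")
    temporal_id_plus1 = value & 0x7
    if temporal_id_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must be non-zero")
    return {
        "reserved_bit": value >> 15,
        "nal_unit_type": (value >> 9) & 0x3F,
        "nuh_layer_id": (value >> 3) & 0x3F,
        "temporal_id": temporal_id_plus1 - 1,  # TemporalId
    }


def drop_high_temporal_layers(nal_units, tid_limit):
    """Keep non-VCL units and VCL units whose TemporalId is below tid_limit."""
    kept = []
    for nal in nal_units:
        hdr = parse_hevc_nal_header(nal)
        is_vcl = hdr["nal_unit_type"] < 32  # assumption: types 0..31 are VCL
        if is_vcl and hdr["temporal_id"] >= tid_limit:
            continue
        kept.append(nal)
    return kept


units = [b"\x40\x01", b"\x02\x01", b"\x02\x02"]  # headers only, payloads omitted
for u in units:
    print(parse_hevc_nal_header(u))
print(len(drop_high_temporal_layers(units, 1)))
```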
[0075] NAL units can be classified into video coding layer (VCL) NAL units and non-VCL NAL units. VCL NAL units are typically coded slice NAL units. In HEVC, a VCL NAL unit contains syntactic elements representing one or more CUs.
[0076] Non-VCL NAL units may include, for example, sequence parameter sets, picture parameter sets, supplemental enhancement information (SEI) NAL units, access unit delimiters, end-of-sequence NAL units, end-of-bitstream NAL units, or filler data NAL units. Parameter sets may be required for the reconstruction of decoded pictures, while many other non-VCL NAL units are not essential for the reconstruction of the decoded sample values.
[0077] Parameters that do not change throughout the encoded video sequence may be included in the sequence parameter set. In addition to the parameters required for decoding, the sequence parameter set may optionally include video usability information (VUI). This includes parameters that are important for buffering, image output timing, rendering, and resource reservation. In HEVC, the sequence parameter set (RBSP) includes parameters that may be referenced by one or more picture parameter sets (RBSPs) or one or more SEI NAL units that include buffering period SEI messages. The picture parameter set includes parameters that are unlikely to change across multiple encoded pictures. The picture parameter set (RBSP) may include parameters that may be referenced from the encoded slice NAL units of one or more encoded pictures.
[0078] In HEVC, the Video Parameter Set (VPS) can be defined as a syntactic structure containing syntactic elements that apply to zero or more encoded video sequences, determined by syntactic elements in the SPS, which are referenced by syntactic elements in the P
[0079] A video parameter set (VPS) may contain parameters that can be referenced from one or more sequence parameter sets (SPS).
[0080] The relationships and hierarchy between video parameter sets (VPS), sequence parameter sets (SPS), and picture parameter sets (PPS) can be described as follows: VPS is one level above SPS in the parameter set hierarchy and in the context of scalability and / or 3D video. VPS may contain parameters common to all slices within all (scalability or view) layers across the entire encoded video sequence. SPS contains parameters common to all slices within a particular (scalability or view) layer across the entire encoded video sequence and may be shared across multiple (scalability or view) layers. PPS contains parameters common to all slices within a particular layer representation (a representation of a scalability or view layer in a single access unit) and is likely to be shared by all slices across multiple layer representations.
[0081] The VPS can provide information about the dependencies between layers within the bitstream, and much other information applicable to all slices across all (scalability or view) layers in the entire encoded video sequence. The VPS can be considered to consist of two parts: a base VPS and a VPS extension, where the VPS extension is optional.
[0082] Out-of-band transmission, signaling, and storage may be used additionally or alternatively for purposes other than tolerance to transmission errors (such as ease of access or session negotiation). For example, sample entries for tracks in a file conforming to the ISO Base Media File Format may include parameter sets, while the coded data of the bitstream may be stored elsewhere in the file or in a separate file. Expressions such as "along the bitstream" (e.g., "indicating along the bitstream") or "along a coding unit of the bitstream" (e.g., "indicating along a coded tile") may be used in the claims or described embodiments to refer to out-of-band transmission, signaling, or storage in a manner in which the out-of-band data is associated with the bitstream or the coding unit. Expressions such as "decode along the bitstream" or "decode along a coding unit of the bitstream" may refer to the decoding of out-of-band data associated with the bitstream or the coding unit, respectively (which may be obtained by out-of-band transmission, signaling, or storage).
[0083] An SEI NAL unit may contain one or more SEI messages. These are not necessary for decoding the output image, but they can assist in related processes such as image output timing, rendering, error detection, error concealment, and resource reservation.
[0084] An encoded image is an encoded representation of an image.
[0085] In HEVC, an encoded image may be defined as the encoded representation of an image that includes all the encoding tree units of that image. In HEVC, an access unit (AU) may be defined as a set of NAL units that are related to each other according to a predetermined classification rule, are consecutive in the decoding order, and contain at most one image with a specific nuh_layer_id value. An access unit may include not only VCL NAL units of the encoded image but also non-VCL NAL units. A predetermined classification rule may, for example, associate images with the same output time or image output count value with the same access unit.
[0086] A bitstream can be defined as a sequence of bits in the form of a NAL unit stream or byte stream, which forms a representation of encoded images and associated data that make up one or more encoded video sequences. A first bitstream may be followed by a second bitstream within the same logical channel, such as within the same file or within the same connection of a communication protocol. An elementary stream (in the context of video encoding) is defined as a sequence of one or more bitstreams. The end of a first bitstream is indicated by a specific NAL unit, called the End of Bitstream (EOB) NAL unit, which is the last NAL unit of the bitstream. HEVC and its current draft extensions require that the nuh_layer_id of the EOB NAL unit be 0.
[0087] In H.264 / AVC, an encoded video sequence is defined as a sequence of consecutive access units in a decoding order, from the one containing the IDR access unit to the one excluding the next IDR access unit, or to the end of the bitstream, whichever comes first.
[0088] In HEVC, an encoded video sequence (CVS) is defined, for example, as a sequence of access units that consists, in decoding order, of an IRAP access unit with NoRaslOutputFlag equal to 1, followed by zero or more access units that are not IRAP access units with NoRaslOutputFlag equal to 1, including all subsequent access units up to, but not including, the next access unit that is an IRAP access unit with NoRaslOutputFlag equal to 1. An IRAP access unit is defined as an access unit whose base layer picture is an IRAP picture. NoRaslOutputFlag is equal to 1 for each IDR picture, each BLA picture, and each IRAP picture that is the first picture of its layer in the bitstream in decoding order. It is also equal to 1 for the first IRAP picture that follows, in decoding order, an end-of-sequence NAL unit with the same nuh_layer_id value. Means may be provided for an external entity (e.g., a player or receiver controlling the decoder) to supply the value of HandleCraAsBlaFlag to the decoder. HandleCraAsBlaFlag may be set to 1, for example, by a player that seeks to a new position in the bitstream, or that tunes into a broadcast, and starts decoding from a CRA picture. If HandleCraAsBlaFlag is equal to 1 for a CRA picture, that CRA picture is handled and decoded as if it were a BLA picture.
[0089] In HEVC, an encoded video sequence may be specified, in addition to or instead of the above specifications, to terminate when a specific NAL unit (sometimes called an End of Sequence (EOS) NAL unit) appears in the bitstream and nuh_layer_id is 0.
[0090] A Group of Pictures (GOP) and its characteristics are defined as follows: A GOP can be decoded regardless of whether any preceding pictures have been decoded. An open GOP is a group of pictures in which pictures that precede the initial intra picture in output order may not be correctly decodable when decoding starts from the initial intra picture of the open GOP. In other words, pictures in an open GOP may refer (in inter prediction) to pictures belonging to a preceding GOP. An HEVC decoder can recognize an intra picture that starts an open GOP because a specific NAL unit type (the CRA NAL unit type) is used for its coded slices. A closed GOP is a group of pictures in which all pictures can be correctly decoded when decoding starts from the initial intra picture of that closed GOP. In other words, no picture in a closed GOP refers to a picture belonging to a preceding GOP. In H.264 / AVC and HEVC, a closed GOP can start from an IDR picture. In HEVC, a closed GOP can also start from a BLA_W_RADL picture or a BLA_N_LP picture. Open GOP coding structures offer greater flexibility in the selection of reference pictures, potentially resulting in higher compression efficiency than closed GOP coding structures.
[0091] A decoded picture buffer (DPB) may be used in encoders and / or decoders. There are two reasons for buffering decoded pictures: for reference in inter prediction, and for reordering decoded pictures into output order. Because H.264 / AVC and HEVC offer high flexibility in both reference picture marking and output reordering, having separate buffers for reference pictures and output pictures can waste memory resources. Therefore, the DPB may include a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture may be removed from the DPB when it is no longer used as a reference and is no longer needed for output.
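The removal rule at the end of paragraph [0091] can be sketched as a simple filter over a unified buffer. The class and field names below are hypothetical, and real decoders track these states through the reference marking and output processes of the relevant standard rather than through plain flags.

```python
from dataclasses import dataclass


@dataclass
class DecodedPicture:
    poc: int                  # picture order count, used here only as a label
    used_for_reference: bool  # still needed for inter prediction
    needed_for_output: bool   # decoded but not yet output


def prune_dpb(dpb):
    """Keep a picture while it serves at least one of the two buffering purposes."""
    return [p for p in dpb if p.used_for_reference or p.needed_for_output]


dpb = [
    DecodedPicture(poc=0, used_for_reference=True, needed_for_output=False),
    DecodedPicture(poc=4, used_for_reference=False, needed_for_output=True),
    DecodedPicture(poc=2, used_for_reference=False, needed_for_output=False),
]
print([p.poc for p in prune_dpb(dpb)])
```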
[0092] In many coding modes of H.264 / AVC and HEVC, the reference picture for inter prediction is indicated by an index into a reference picture list. This index can be coded with variable-length coding, so a smaller index usually corresponds to a shorter code for the corresponding syntactic element. In H.264 / AVC and HEVC, two reference picture lists (reference picture list 0 and reference picture list 1) are generated for each bi-predictive (B) slice, and one reference picture list (reference picture list 0) is formed for each inter-coded (P) slice.
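To illustrate why a smaller index tends to cost fewer bits under variable-length coding, the sketch below uses the unsigned Exp-Golomb code, a variable-length code that appears in H.264 / AVC and HEVC syntax. Whether a given reference index element is actually coded this way depends on the standard and the entropy coding mode, so the choice of code here is an assumption made purely for illustration; the function name is hypothetical.

```python
def exp_golomb_ue(value: int) -> str:
    """Unsigned Exp-Golomb codeword for a non-negative integer, as a bit string."""
    if value < 0:
        raise ValueError("value must be non-negative")
    code = bin(value + 1)[2:]                  # binary form of value + 1
    return "0" * (len(code) - 1) + code        # prefix of leading zeros, then the code


for ref_idx in range(5):
    word = exp_golomb_ue(ref_idx)
    print(ref_idx, word, len(word))
```

Placing the most frequently used reference pictures at the start of the list therefore keeps the average index cost low.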
[0093] Many coding standards, including H.264 / AVC and HEVC, have a decoding process for deriving a reference picture index into a reference picture list. This index indicates which of the reference pictures is used for inter prediction of a particular block. In some inter coding modes, the reference picture index may be encoded into the bitstream by the encoder, while in other inter coding modes, it may be derived (by the encoder and decoder) using, for example, adjacent blocks.
[0094] The types of motion parameters or motion information include, but are not limited to, one or more of the following:
• An indication of the prediction type (e.g., intra prediction, uni-prediction, bi-prediction) and / or the number of reference pictures.
• An indication of the prediction direction, such as inter (temporal) prediction, inter-layer prediction, inter-view prediction, view synthesis prediction (VSP), or inter-component prediction (may be indicated per reference picture and / or per prediction type; in some embodiments, inter-view prediction and view synthesis prediction may be considered together as a single prediction direction).
• An indication of the reference picture type, such as short-term reference picture, long-term reference picture, or inter-layer reference picture (may be indicated for each reference picture).
• A reference index into a reference picture list and / or another identifier of the reference picture (may be indicated, for example, for each reference picture; its type may depend on the prediction direction and / or the reference picture type, and it may be accompanied by related information such as the reference picture list to which the reference index applies).
• A horizontal motion vector component (may be indicated, for example, per prediction block or per reference index).
• A vertical motion vector component (may be indicated, for example, per prediction block, per reference index, or in similar units).
• One or more parameters (e.g., the picture order count difference and / or the relative camera spacing between the picture containing or associated with the motion parameters and its reference picture) that may be used to scale the horizontal and / or vertical motion vector components in one or more motion vector prediction processes (may be indicated, for example, for each reference picture or for each reference index).
• The coordinates of the block to which the motion parameters and / or motion information apply, for example, the coordinates of the top-left sample of the block in luminance sample units.
• The extent of the block to which the motion parameters and / or motion information apply (e.g., width and height).
[0095] Compared to earlier video coding standards, Versatile Video Coding (H.266 / VVC) introduces several new coding tools, as follows:
• Intra prediction: 67 intra modes with wide-angle mode extension; block size and mode-dependent 4-tap interpolation filter; position-dependent intra prediction combination (PDPC); cross-component linear model intra prediction (CCLM); multi-reference line intra prediction; intra sub-partitions; weighted intra prediction using matrix multiplication.
• Inter-picture prediction: block motion copy with spatial, temporal, history-based, and pairwise average merge candidates; affine motion inter prediction; sub-block-based temporal motion vector prediction; adaptive motion vector resolution; 8x8 block-based motion compression for temporal motion prediction; high-precision (1 / 16 pixel) motion vector storage and motion compensation using an 8-tap interpolation filter for luminance components and a 4-tap interpolation filter for chrominance components; triangular partitions; combined intra and inter prediction; merge with MVD (MMVD); symmetric MVD coding; bidirectional optical flow; decoder-side motion vector refinement; bi-prediction with CU-level weights.
• Transform, quantization, and coefficient coding: multiple primary transform selection using DCT2, DST7, and DCT8; secondary transform for the low-frequency zone; sub-block transform for inter-predicted residuals; dependent quantization with the maximum QP increased from 51 to 63; transform coefficient coding with sign data hiding; transform skip residual coding.
• Entropy coding: arithmetic coding engine with adaptive dual-window probability update.
• In-loop filter: in-loop reshaping; deblocking filter with strong, longer filters; sample adaptive offset; adaptive loop filter.
• Screen content coding: current picture referencing with reference region restriction.
• 360-degree video coding: horizontal wrap-around motion compensation.
• High-level syntax and parallel processing: reference picture management with direct signaling of the reference picture list; tile groups with rectangular-shaped tile groups.
[0096] In VVC, partitioning is performed similarly to HEVC. That is, each picture is partitioned into coding tree units (CTUs). Pictures may also be partitioned into slices, tiles, bricks, or subpictures. A CTU may be split into smaller CUs using a quadtree structure, and each CU may be further partitioned using a quadtree and a nested multi-type tree that includes ternary and binary splits. However, there are specific rules for inferring the partitioning at picture boundaries, and redundant split patterns are prohibited in nested multi-type tree partitioning.
[0097] Among the new coding tools described above, the cross-component linear model (CCLM) prediction mode is used in VVC to reduce cross-component redundancy. In this method, the following linear model is used to predict chrominance samples based on reconstructed luminance samples of the same CU.
$$\mathrm{pred}_C(i,j) = \alpha \cdot \mathrm{rec}_L'(i,j) + \beta$$
[0098] Here, $\mathrm{pred}_C(i,j)$ represents the predicted chrominance samples within the CU, and $\mathrm{rec}_L'(i,j)$ represents the downsampled reconstructed luminance samples of the same CU.
[0099] Alternatively, the following formula can be used for CCLM.
number
[0100] The CCLM parameters (α and β) are derived from up to four adjacent chrominance samples and their corresponding downsampled luminance samples. Assuming the dimensions of the current luminance block are W×H, W' and H' are set as follows: • When LM mode is applied: W' = W, H' = H • When LM-A mode is applied: W' = W + H • When LM-L mode is applied: H' = H + W
[0101] Here, LM-A mode refers to linear model_above and uses only the upper template (i.e., sample values from the upper adjacent position of the CU) to calculate the linear model coefficients. To obtain more samples, the upper template is extended to (W+H). LM-L mode refers to linear model_left and calculates the linear model coefficients using only the left template (sample values from the left adjacent position of the CU). To obtain more samples, the left template is extended to (H+W). For non-square blocks, the upper template is extended to W+W and the left template is extended to H+H.
[0102] The adjacent positions on the upper side are denoted S[0, -1] … S[W'-1, -1], and the adjacent positions on the left side are denoted S[-1, 0] … S[-1, H'-1]. The four samples are then selected as follows.
• When LM mode is applied and both the upper and left adjacent samples are available: S[W' / 4, -1], S[3*W' / 4, -1], S[-1, H' / 4], S[-1, 3*H' / 4]
• When LM-A mode is applied, or when only the upper adjacent samples are available: S[W' / 8, -1], S[3*W' / 8, -1], S[5*W' / 8, -1], S[7*W' / 8, -1]
• When LM-L mode is applied, or when only the left adjacent samples are available: S[-1, H' / 8], S[-1, 3*H' / 8], S[-1, 5*H' / 8], S[-1, 7*H' / 8]
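The selection rules above can be illustrated with a minimal Python sketch. The function name, the mode strings "LM", "LM_A" and "LM_L", and the availability flags are hypothetical names chosen for this illustration; integer division is assumed for the position arithmetic, and an empty list is returned when no usable neighbour exists.

```python
def select_cclm_positions(mode, w, h, above_ok=True, left_ok=True):
    """Return the four template positions S[x, y] as (x, y) tuples.

    mode is "LM", "LM_A" or "LM_L" (hypothetical labels for this sketch).
    """
    use_above = above_ok and mode in ("LM", "LM_A")
    use_left = left_ok and mode in ("LM", "LM_L")

    if use_above and use_left:
        # LM mode with both templates available: W' = W, H' = H
        return [(w // 4, -1), (3 * w // 4, -1), (-1, h // 4), (-1, 3 * h // 4)]
    if use_above:
        # LM-A extends the upper template to W' = W + H
        w_ext = w + h if mode == "LM_A" else w
        return [(k * w_ext // 8, -1) for k in (1, 3, 5, 7)]
    if use_left:
        # LM-L extends the left template to H' = H + W
        h_ext = h + w if mode == "LM_L" else h
        return [(-1, k * h_ext // 8) for k in (1, 3, 5, 7)]
    return []  # assumption: nothing to select when no template is usable


print(select_cclm_positions("LM", 8, 8))
print(select_cclm_positions("LM_A", 8, 4))
```

For an 8×8 block in LM mode this yields two positions on the upper template and two on the left template; in LM-A mode all four positions lie on the extended upper template.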
[0103] The four adjacent luminance samples at the selected locations are downsampled and compared four times to find the two smaller values $x^0_A$ and $x^1_A$ and the two larger values $x^0_B$ and $x^1_B$. The corresponding chrominance sample values are denoted $y^0_A$, $y^1_A$, $y^0_B$, and $y^1_B$. Subsequently, $X_a$, $X_b$, $Y_a$, and $Y_b$ are derived as follows.
$$X_a = (x^0_A + x^1_A + 1) \gg 1;\quad X_b = (x^0_B + x^1_B + 1) \gg 1;\quad Y_a = (y^0_A + y^1_A + 1) \gg 1;\quad Y_b = (y^0_B + y^1_B + 1) \gg 1 \qquad \text{(Formula 2)}$$
[0104] Finally, the parameters α and β of the linear model are obtained according to the following equations.
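$$\alpha = \frac{Y_a - Y_b}{X_a - X_b} \qquad \text{(Formula 3)}$$
$$\beta = Y_b - \alpha \cdot X_b \qquad \text{(Formula 4)}$$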
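As an illustration of the derivation in paragraphs [0103] and [0104], the following Python sketch computes α and β from four luminance / chrominance sample pairs. It assumes the linear model α = (Ya − Yb) / (Xa − Xb), β = Yb − α·Xb, uses exact fractions rather than the table-based integer division of a real codec, and sorts the samples instead of performing the four explicit comparisons. The function name and the flat-luminance fallback are assumptions made for this sketch.

```python
from fractions import Fraction


def derive_cclm_params(luma, chroma):
    """luma: four downsampled luminance values; chroma: co-located chrominance values."""
    order = sorted(range(4), key=lambda k: luma[k])
    small, large = order[:2], order[2:]

    # Formula 2: rounded averages of the two smaller and two larger samples
    x_a = (luma[small[0]] + luma[small[1]] + 1) >> 1
    x_b = (luma[large[0]] + luma[large[1]] + 1) >> 1
    y_a = (chroma[small[0]] + chroma[small[1]] + 1) >> 1
    y_b = (chroma[large[0]] + chroma[large[1]] + 1) >> 1

    if x_a == x_b:
        # Assumption of this sketch: flat luminance gives a constant model
        return Fraction(0), Fraction(y_b)

    alpha = Fraction(y_a - y_b, x_a - x_b)
    beta = y_b - alpha * x_b
    return alpha, beta


alpha, beta = derive_cclm_params([10, 20, 30, 40], [15, 25, 35, 45])
print(alpha, beta)  # 1 5
```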
[0105] Figure 5 shows examples of the left and upper sample locations involved in the CCLM mode, as well as the sample locations in the current block.
[0106] The division operation to calculate the parameter α is implemented using a lookup table. To reduce the memory required for table storage, the difference value (the difference between the maximum and minimum values) and the parameter α are expressed in exponential notation. For example, the difference is approximated by a 4-bit significand and an exponent. As a result, the table of 1 / difference is reduced to 16 elements, one for each of the 16 possible values of the significand, as follows: DivTable[ ] = {0, 7, 6, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1, 1, 1, 0} (Formula 5)
[0107] This has the advantage of reducing both the complexity of the calculations and the amount of memory required to store the necessary tables.
[0108] To match the color difference sample positions in a 4:2:0 video sequence, two types of downsampling filters are applied to the luminance samples to achieve a 2:1 downsampling ratio in both the horizontal and vertical directions. The selection of the downsampling filter is defined by the SPS level flag. The two types of downsampling filters are as follows, corresponding to "Type 0" and "Type 2" content, respectively.
[Formula: Type 0 downsampling filter]
[Formula: Type 2 downsampling filter]
[0109] Note that if the upper reference line is located at the CTU boundary, only one luminance line (a typical line buffer in intra prediction) is used to generate the downsampled luminance sample.
[0110] This parameter calculation is performed as part of the decoding process and is not simply an encoder search process. As a result, no syntax is used to communicate the values of α and β to the decoder.
[0111] In chrominance intra-mode coding, a total of eight intra-modes are permitted. These modes include five conventional intra-modes and three cross-component linear model modes (CCLM, LM_A, LM_L). The signaling and derivation processes for chrominance modes are shown in Table 1. The coding of chrominance modes directly depends on the intra-prediction mode of the corresponding luminance block. In I-slices, separate block partitioning structures are permitted for luminance and chrominance components, so one chrominance block may correspond to multiple luminance blocks. Therefore, in chrominance DM modes, the intra-prediction mode of the corresponding luminance block covering the center position of the current chrominance block is directly inherited. [Table 1]
[0112] As shown in Table 2, a single binarization table is used regardless of the value of sps_cclm_enabled_flag. [Table 2]
[0113] In Table 2, the first bin indicates whether the mode is a normal mode (0) or an LM mode (1). If it is an LM mode, the next bin indicates whether it is LM_CHROMA (0) or not (1). If it is not LM_CHROMA, the next bin indicates whether it is LM_L (0) or LM_A (1). When sps_cclm_enabled_flag is 0, the first bin of the corresponding intra_chroma_pred_mode binarization table can be discarded prior to entropy coding. In other words, the first bin is inferred to be 0 and therefore not coded. This single binarization table is used both when sps_cclm_enabled_flag is 0 and when it is 1. The first two bins in Table 2 are context-coded, each with its own context model, and the remaining bins are bypass-coded.
[0114] Furthermore, to reduce the delay between luminance and chrominance in dual trees, if a 64x64 luminance coding tree node is not split (and does not use an intra-subpartition (ISP) for 64x64 CUs) or is split by QT, the chrominance CUs in a 32x32 / 32x16 chrominance coding tree node are permitted to use CCLM in the following manner:
• If a 32x32 chrominance node is not split or is split using QT splitting, all chrominance CUs within the 32x32 node can use CCLM.
• If a 32x32 chrominance node is split by horizontal BT and the 32x16 child node is not split or uses vertical BT splitting, all chrominance CUs within the 32x16 chrominance node can use CCLM.
[0115] For all other luminance and chrominance coding tree partitioning conditions, CCLM is not permitted for chrominance CUs.
[0116] Multi-model LM (MMLM)
[0117] The CCLM included in VVC is extended by adding three multi-model linear model (MMLM) modes. In each MMLM mode, reconstructed neighboring samples are classified into two classes using the average value of the reconstructed luminance neighboring samples as a threshold. The linear model for each class is derived using the least mean squares (LMS) method. The LMS method is also used to derive the linear model in the CCLM mode. Figures 6a and 6b show two luminance-to-chrominance models obtained when the luminance (Y) threshold is set to 17 in the sample and spatial domains. Each luminance-to-chrominance model has its own linear model parameters α and β. As can be seen from Figure 6b, each luminance-to-chrominance model corresponds to the spatial segmentation of the content (i.e., to different objects or textures in the scene).
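The two-class modelling described above can be sketched as follows. This is a minimal floating-point illustration, not a codec-accurate implementation: the function names are hypothetical, ordinary least squares is used for each class, samples equal to the threshold are assigned to the lower class, and an empty class falls back to a single model fitted on all samples. The last two choices are assumptions of this sketch.

```python
def fit_line(pairs):
    """Least-squares fit chroma ~ alpha * luma + beta over (luma, chroma) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    sxy = sum(x * y for x, y in pairs)
    denom = n * sxx - sx * sx
    if denom == 0:
        return 0.0, sy / n  # flat luminance: constant model
    alpha = (n * sxy - sx * sy) / denom
    beta = (sy - alpha * sx) / n
    return alpha, beta


def fit_mmlm(luma, chroma):
    """Classify neighbouring samples by the mean luminance and fit one model per class."""
    threshold = sum(luma) / len(luma)
    pairs = list(zip(luma, chroma))
    low = [p for p in pairs if p[0] <= threshold]
    high = [p for p in pairs if p[0] > threshold]
    fallback = fit_line(pairs)
    models = [fit_line(c) if c else fallback for c in (low, high)]
    return threshold, models


def predict_mmlm(rec_luma, threshold, models):
    """Predict one chrominance sample with the model of the class rec_luma falls into."""
    alpha, beta = models[0] if rec_luma <= threshold else models[1]
    return alpha * rec_luma + beta


luma = [10, 12, 14, 30, 32, 34]
chroma = [21, 25, 29, 70, 68, 66]
threshold, models = fit_mmlm(luma, chroma)
print(threshold, models)
print(predict_mmlm(12, threshold, models), predict_mmlm(32, threshold, models))
```

In this example the darker samples follow one line and the brighter samples another, so a single linear model would fit neither group well.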
[0118] Convolutional Cross-Component Model (CCCM)
[0119] CCCM, an improved version of cross-component prediction, derives a luminance-to-chrominance model using a two-dimensional filter kernel. Filter coefficients are derived on the decoder side using reconstructed input data and chrominance samples. To derive the filter coefficients, co-located reference sample regions consisting of reconstructed luminance samples and reconstructed chrominance samples are defined for both luminance and chrominance, as shown in Figure 7. A commonly used 4:2:0 chrominance downsampling is applied here. The reference sample region for a given block can be, for example, the six lines on the top and left sides as shown in Figure 7, but any reference lines available to both the encoder and decoder can be used. Generally, the reference samples can include any chrominance and luminance samples reconstructed by both the encoder and decoder. Once the reference samples are determined, the filter coefficients can be derived using various linear regression tools, such as ordinary least squares estimation, orthogonal matching pursuit, optimized orthogonal matching pursuit, ridge regression, and the least absolute shrinkage and selection operator (LASSO).
[0120] The dimensions of the filter kernel can be any dimension, such as 1x3 (1D vertical), 3x1 (1D horizontal), 3x3, or 7x7, and it can also be formed into any shape, such as a cross or a rhombus (as shown in Figure 8), by selecting only a subset of the possible kernel positions. When referring to samples within the filter kernel, a notation is used in which the letters N, E, S, W, and C represent north (top), east (right), south (bottom), west (left), and center, respectively, as shown in Figure 8.
[0121] The overall method for reconstructing chrominance samples by convolution of the filter kernel obtained on the decoder side with the input dataset is referred to here as a convolutional cross-component model (CCCM). The following steps can be applied to perform the CCCM process.
1) Define a reference region whose position is aligned between the luminance component and the chrominance component.
2) Downsample the luminance samples to match the chrominance grid (optional).
3) Scan the luminance and chrominance samples within the reference region and collect the available statistics (such as autocorrelation matrices and cross-correlation vectors) based on the filter shape.
4) Based on the available statistics (such as autocorrelation matrices and cross-correlation vectors), determine the filter coefficients by minimizing the squared error (or another evaluation metric).
5) Calculate the predicted chrominance block by convolving the downsampled luminance samples with the filter kernel.
[0122] We define the luminance samples (which may have been downsampled) as a two-dimensional array Y(x,y) indexed by the horizontal x-coordinate and vertical y-coordinate. Similarly, we define the color difference samples at the same positions as a two-dimensional array C(x,y), and the filter kernel (coefficients) as a 3x3 array F(i,j). At the sample level, we define the convolution of Y and F as follows:
[Formula]
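The CCCM steps listed in paragraph [0121] can be sketched in Python as below. This is an illustrative floating-point sketch under several assumptions not taken from the text: a cross-shaped 5-tap kernel (C, N, E, S, W) is used instead of the full 3x3 array, the tap offset convention is chosen for this example, the coefficients are obtained from the normal equations with a small ridge term, and all function names are hypothetical. The luminance array is assumed to be already on the chrominance grid.

```python
import random

# Cross-shaped kernel in the C/N/E/S/W notation, as (dx, dy) offsets (assumed convention)
TAPS = [(0, 0), (0, -1), (1, 0), (0, 1), (-1, 0)]


def taps_at(luma, x, y):
    """Luminance values under the kernel centred at (x, y)."""
    return [luma[y + dy][x + dx] for dx, dy in TAPS]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        acc = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - acc) / m[r][r]
    return x


def derive_filter(luma, chroma, positions, ridge=1e-6):
    """Steps 3 and 4: collect autocorrelation / cross-correlation, then solve."""
    n = len(TAPS)
    auto = [[0.0] * n for _ in range(n)]
    cross = [0.0] * n
    for x, y in positions:
        v = taps_at(luma, x, y)
        for i in range(n):
            cross[i] += v[i] * chroma[y][x]
            for j in range(n):
                auto[i][j] += v[i] * v[j]
    for i in range(n):
        auto[i][i] += ridge  # keeps the system solvable for flat content
    return solve(auto, cross)


def predict(luma, coeffs, x, y):
    """Step 5: convolve the luminance samples with the derived kernel."""
    return sum(c * v for c, v in zip(coeffs, taps_at(luma, x, y)))


# Synthetic check: chrominance generated by a known kernel is recovered
random.seed(0)
luma = [[random.randint(0, 255) for _ in range(10)] for _ in range(10)]
true_f = [0.5, 0.125, 0.125, 0.125, 0.125]
chroma = [[0.0] * 10 for _ in range(10)]
inner = [(x, y) for y in range(1, 9) for x in range(1, 9)]
for x, y in inner:
    chroma[y][x] = predict(luma, true_f, x, y)
reference = [(x, y) for x, y in inner if y < 5]  # rows 1..4 act as the reference region
coeffs = derive_filter(luma, chroma, reference)
print([round(c, 4) for c in coeffs])
```

The ridge term is a design choice of this sketch; any of the regression tools named in paragraph [0119] could take its place.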
[0123] When using other data terms, such as a nonlinear square root term, the added convolution would be as follows:
[Formulas]
[0124] Multiple Reference Line (MRL) Intra-Prediction
[0125] Multiple Reference Line (MRL) intra-prediction uses more reference lines for intra-prediction. Figure 9 shows an example with four reference lines, where samples from segments A and F are not taken from reconstructed adjacent samples, but are padded by the nearest samples from segments B and E, respectively. HEVC intra-image prediction uses the nearest reference line (i.e., reference line 0). MRL uses two additional lines (reference line 1 and reference line 3).
[0126] The index of the selected reference line (mrl_idx) is signaled and used to generate the intra-predictor. For reference line indices greater than 0, only the additional reference line modes are included in the MPM list, and only the MPM index is signaled, without the remaining modes. The reference line index is signaled before the intra-prediction mode, and if a non-zero reference line index is signaled, the planar mode is excluded from the intra-prediction modes.
[0127] MRL is disabled for the first line of a block within the CTU. This prevents the use of extended reference samples outside the current CTU line. Additionally, PDPC is disabled if additional lines are used. In MRL mode, the derivation of DC values in DC intra-prediction mode when the reference line index is non-zero is consistent with the derivation when the reference line index is 0. MRL requires the storage of three adjacent luminance reference lines in the CTU for prediction generation. The CCLM tool also requires three adjacent luminance reference lines for the downsampling filter. To reduce the storage capacity requirements in the decoder, the definition of MRL is consistent with CCLM to use the same three lines.
[0128] Intra-subpartition (ISP)
[0129] An intra-subpartition (ISP) divides a luminance intra-prediction block into two or four subpartitions vertically or horizontally, depending on the block size. For example, the minimum block size for an ISP is 4x8 (or 8x4). If the block size is larger than 4x8 (or 8x4), the corresponding block is divided into four subpartitions. It has been noted that Mx128 (M≦64) and 128×N (N≦64) ISP blocks can cause potential problems with respect to 64×64 VDPUs. For example, an Mx128 CU in a single-tree configuration has an Mx128 luminance TB and two corresponding M / 2×64 chrominance TBs. If the CU uses an ISP, the luminance TB is divided into four M×32 TBs (horizontal division only is possible), each smaller than a 64×64 block. However, in current ISP designs, the chrominance block is not divided. Therefore, both chrominance components are larger than a 32×32 block. Similarly, the same situation can occur with 128×N CUs using ISP. Therefore, these two cases pose a problem for 64×64 decoder pipelines. For this reason, the maximum CU size that can use ISP is limited to 64×64. All subpartitions satisfy the condition that they have at least 16 samples.
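The size rules in this paragraph can be summarised with a short Python sketch. The function name and the convention of returning an empty list when ISP is unavailable are assumptions for illustration; block dimensions are assumed to be powers of two of at least 4.

```python
def isp_split(width, height, horizontal):
    """Return the (width, height) of each ISP subpartition, or [] if ISP is not usable."""
    if width * height < 32 or width > 64 or height > 64:
        return []  # below the 4x8 / 8x4 minimum, or above the 64x64 limit
    # 4x8 and 8x4 blocks are split in two, larger blocks in four
    count = 2 if width * height == 32 else 4
    if horizontal:
        return [(width, height // count)] * count
    return [(width // count, height)] * count


print(isp_split(4, 8, True))    # two 4x4 subpartitions
print(isp_split(8, 8, False))   # four 2x8 subpartitions
print(isp_split(128, 32, True)) # ISP not permitted above 64x64
```

Under these rules every returned subpartition contains at least 16 samples, which matches the condition stated at the end of the paragraph.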
[0130] Matrix-weighted intra-prediction (MIP)
[0131] Matrix-weighted intra-prediction (MIP) is a new intra-prediction technique added to VVC. To predict samples in a rectangular block of width W and height H, MIP uses one line consisting of H reconstructed neighboring boundary samples located on the left side of the block and one line consisting of W reconstructed neighboring boundary samples located on the top side of the block as input. If reconstructed samples are unavailable, they are generated in the same way as in conventional intra-prediction. The generation of the prediction signal is based on three steps: averaging, matrix-vector multiplication, and linear interpolation, as shown in Figure 10.
[0132] Inter prediction in VVC
[0133] The merge list may include the following candidates:
1) Spatial MVP from spatially neighboring CUs
2) Temporal MVP from collocated CUs
3) History-based MVP from a FIFO table
4) Pairwise average MVP (using candidates from the list)
5) Zero MVP
[0134] Merge mode with motion vector difference (MMVD) is a method that first signals the merge candidate, and then signals the MVD and a resolution index.
[0135] In symmetric MVD, for bidirectional prediction, the motion information in List 1 is derived from the motion information in List 0.
[0136] In affine prediction, multiple motion vectors are signaled for different corners of a block, and these are used to derive motion vectors for subblocks. In affine merging, the affine motion information of a block is generated based on the normal or affine motion information of adjacent blocks.
[0137] In subblock-based temporal motion vector prediction, the motion vectors of the current block's subblocks are predicted from the corresponding subblocks in the reference frame, which are indicated by the motion vectors (if available) of spatially adjacent blocks.
[0138] Adaptive Motion Vector Resolution (AMVR) indicates the accuracy of the MVD for each CU.
[0139] In CU-level weighted bidirectional prediction, an index is used that represents the weight values of the weighted average of two prediction blocks.
[0140] Bidirectional Optical Flow (BDOF) refines motion vectors in bidirectional prediction. BDOF generates two prediction blocks using signaled motion vectors. Then, to minimize the error between the two prediction blocks, motion refinement is calculated using their gradient values. The final prediction block is refined using motion refinement and gradient values.
[0141] CU-level weighted bidirectional prediction (BCW) and weighted prediction (WP)
[0142] In HEVC, the bidirectional prediction signal is generated by averaging two prediction signals obtained from different reference images and / or using two different motion vectors. In VVC, the bidirectional prediction mode is extended beyond simple averaging to allow for a weighted average of the two prediction signals.
$$P_{\text{bi-pred}} = \big((8 - w) \cdot P_0 + w \cdot P_1 + 4\big) \gg 3$$
[0143] Weighted-average bidirectional prediction allows five weights (w∈{-2,3,4,5,10}). For each bidirectionally predicted CU, the weight w is determined in one of the following ways: 1) For non-merged CUs, the weight index is signaled after the motion vector difference. 2) For merged CUs, the weight index is inferred from adjacent blocks based on the merge candidate index. BCW is applied only to CUs with 256 or more luminance samples (i.e., CU width × CU height is 256 or more). All five weights are used for low-delay images. Only three weights (w∈{3,4,5}) are used for non-low-delay images. On the encoder side, fast search algorithms are applied to find the weight index without significantly increasing encoder complexity. An overview of these algorithms is provided below. For further details, please refer to the VTM software and document JVET-L0646.
• When combined with AMVR, unequal weights are conditionally checked only for 1-pel and 4-pel motion vector accuracy if the current image is a low-delay image.
• When combined with affine mode, affine ME is performed for unequal weights only if affine mode is selected as the current best mode.
• In bidirectional prediction, if the two reference images are identical, unequal weights are checked only conditionally.
• Unequal weights are not searched if certain conditions are met, depending on the POC distance between the current image and its reference images, the coding QP, and the temporal level.
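The weighted average can be sketched in Python as follows, assuming the blending rule P = ((8 − w)·P0 + w·P1 + 4) >> 3 with the five weights listed above. The function names are hypothetical, and the sketch operates on plain sample lists without the intermediate bit-depth handling or clipping of a real decoder.

```python
BCW_WEIGHTS = (-2, 3, 4, 5, 10)


def bcw_allowed(cu_width, cu_height):
    """BCW applies only to CUs with at least 256 luminance samples."""
    return cu_width * cu_height >= 256


def bcw_blend(p0, p1, w):
    """Blend two prediction signals sample by sample with weight w out of 8."""
    if w not in BCW_WEIGHTS:
        raise ValueError("unsupported BCW weight")
    return [((8 - w) * a + w * b + 4) >> 3 for a, b in zip(p0, p1)]


print(bcw_blend([100], [200], 4))   # equal weights -> [150]
print(bcw_blend([100], [200], 10))  # extrapolating weight -> [225]
```

A weight of 4 reproduces the plain average, while the weights -2 and 10 extrapolate beyond one of the two prediction signals.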
[0144] The BCW weight index is encoded using one context-coded bin followed by a bypass-coded bin. The first context-coded bin indicates whether equal weights are used or not. If unequal weights are used, the additional bin is signaled by bypass coding to indicate which unequal weights are used.
[0145] Weighted Prediction (WP) is a coding technique supported by the H.264 / AVC and HEVC standards for efficiently encoding video content with fading. WP support has also been added to the VVC standard. With WP, weighting parameters (weights and offsets) can be signaled for each reference image in each of the reference image lists L0 and L1. Then, during motion compensation, the weights and offsets of the corresponding reference images are applied. WP and BCW are designed for different types of video content. To avoid interactions between WP and BCW that would complicate VVC decoder design, when a CU uses WP, the BCW weight index is not signaled, and w is inferred to be 4 (i.e., equal weights are applied). In the case of merged CUs, the weight index is inferred from adjacent blocks based on the merge candidate index. This applies to both normal merge mode and inherited affine merge mode. In constructed affine merge mode, affine motion information is constructed based on motion information from up to three blocks. The BCW index of a CU using constructed affine merge mode is simply set equal to the BCW index of the first control point MV.
[0146] In VVC, CIIP and BCW cannot be used simultaneously for the same CU. When a CU is encoded in CIIP mode, the BCW index of that CU is set to 2 (equal weight).
[0147] Combined Inter / Intra Prediction (CIIP)
[0148] In VVC, when a CU is encoded in merge mode, an additional flag is signaled to indicate whether the Combined Inter / Intra Prediction (CIIP) mode is applied to the current CU, provided that the CU contains at least 64 luminance samples (i.e., CU width × CU height is 64 or greater) and both the CU width and CU height are less than 128 luminance samples. As the name suggests, CIIP prediction combines the inter-prediction signal and the intra-prediction signal. The inter-prediction signal $P_{inter}$ in CIIP mode is derived using the same inter-prediction process that is applied to the normal merge mode. The intra-prediction signal $P_{intra}$ is derived according to the normal intra-prediction process using planar mode. The intra-prediction signal and the inter-prediction signal are then combined using a weighted average. The weight value is calculated as follows, depending on the coding modes of the blocks above and to the left of the current block, as shown in Figure 11.
• If the block above exists and is intra-encoded, set isIntraTop to 1. Otherwise, set isIntraTop to 0.
• If the block to the left exists and is intra-encoded, set isIntraLeft to 1. Otherwise, set isIntraLeft to 0.
• If (isIntraLeft + isIntraTop) is equal to 2, set wt to 3.
• Otherwise, if (isIntraLeft + isIntraTop) is equal to 1, set wt to 2.
• Otherwise, set wt to 1.
[0149] The CIIP prediction is formed as follows:
$$P_{CIIP} = \big((4 - wt) \cdot P_{inter} + wt \cdot P_{intra} + 2\big) \gg 2$$
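The weight derivation and blending for CIIP can be sketched in Python as below, assuming the blending rule P = ((4 − wt)·P_inter + wt·P_intra + 2) >> 2. The function names are hypothetical and the sketch works on plain sample lists.

```python
def ciip_allowed(cu_width, cu_height):
    """CIIP flag is signaled for merge CUs with >= 64 luminance samples and sides < 128."""
    return cu_width * cu_height >= 64 and cu_width < 128 and cu_height < 128


def ciip_weight(is_intra_top, is_intra_left):
    """Derive wt from the coding modes of the top and left neighbouring blocks."""
    count = int(is_intra_top) + int(is_intra_left)
    if count == 2:
        return 3
    if count == 1:
        return 2
    return 1


def ciip_blend(p_inter, p_intra, wt):
    """Weighted average of the inter and intra prediction signals."""
    return [((4 - wt) * a + wt * b + 2) >> 2 for a, b in zip(p_inter, p_intra)]


wt = ciip_weight(is_intra_top=True, is_intra_left=False)
print(wt, ciip_blend([100], [200], wt))  # 2 [150]
```

The more intra-coded neighbours there are, the more the intra-prediction signal contributes to the final prediction.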
[0150] Local Illumination Compensation (LIC)
[0151] LIC is an inter-prediction technique that models local illumination variations between the current block and its predicted block as a function of variations between the current block template and the reference block template. The parameters of this function are expressed in terms of scale α and offset β, forming a linear equation α*p[x]+β that compensates for illumination changes, where p[x] is the reference sample pointed to by the MV at position x on the reference image. Since α and β can be derived based on the current block template and the reference block template, no signal transmission overhead is required for them. However, in AMVP mode, an LIC flag is transmitted to indicate the use of LIC.
[0152] The local illumination compensation proposed in JVET-O0066 is used in the ECM for uni-prediction inter CUs with the following modifications.
• Intra-neighbor samples can be used to derive the LIC parameters.
• LIC is disabled for blocks with fewer than 32 luminance samples.
• In both non-subblock and affine modes, the LIC parameters are derived based on the template block samples corresponding to the current CU, not on the partial template block samples corresponding to the first 16x16 unit in the upper left corner.
• The samples of the reference block template are generated using MC with the block MV, without rounding the MV to integer pixel precision.
[0153] From the above, it can be concluded that intercoding in modern video encoding standards (such as H.266) uses a single-tree motion-compensated prediction. That is, regions from a reference frame or set of frames are copied to the target frame using subpixel interpolation. Furthermore, additional tools such as weighted prediction (WP), CU-level weighted bidirectional prediction (BCW), a combination of inter- and intra-prediction (CIIP), and local illumination compensation (LIC) are used to compensate for changes in illumination and color differences between the reference and target frames.
[0154] In single-tree coding, one set of partitioning information is encoded for each block, and this partitioning information is shared between the luminance and chrominance components. Due to the scarcity of dedicated chrominance coding tools, luminance blocks often take precedence over chrominance blocks in intercoding. Even when dedicated chrominance tools such as CCCM and CCLM are used, they are not very effective because chrominance blocks in the reference frame are usually of higher quality than inter-predicted cross-component blocks. To leverage the advantages of cross-component tools in inter-prediction, an improved approach is desirable, primarily using motion-compensated luminance and chrominance reference samples at the same location to model chrominance samples.
[0155] Typically, predicted samples are generated by motion-compensated prediction using a reference frame and sub-sample (fractional) motion vectors. As mentioned earlier, it is also possible to further refine the predicted samples using tools such as WP, BCW, CIIP, GPM, and LIC. However, the use of such refinement tools can negatively impact the reference samples used. For example, in affine prediction, motion compensation generates predictions by applying transformations to the reference block, which may be independent of the model parameter calculation. Processing such as LIC can have undesirable effects on the model parameter calculation, and interpolation filters can affect signal characteristics, such as through the loss of high-frequency components. In general, additional processing introduces delays in the model calculation pipeline. Furthermore, if the block size is small, the predicted block may not contain enough samples, and a dependency on reconstructed samples from neighboring blocks may arise, making it undesirable to use samples from the vicinity of the current block.
[0156] An improved method for selecting reference samples is described below.
[0157] Figure 12 shows a method according to one embodiment. The method includes determining an image block unit of a frame, the image block unit containing samples of color channels (1200); determining at least one reference image and a first reference region within the at least one reference image for predicting target samples of at least a first color channel of the image block unit (1202), and similarly determining a second reference region within the at least one reference image for predicting target samples of at least a second color channel of the image block unit; reconstructing samples of the first color channel of the image block unit (1204); deriving reference samples for creating a cross-component prediction model from the reference samples in the first and second reference regions (1206); and using the reconstructed samples of at least the first color channel as input to the cross-component prediction model to predict target samples of at least the second color channel of the image block unit (1208).
[0158] Therefore, this method omits the sample prediction process, and reference sample values for cross-component model creation can be derived directly from the determined reference image. The reference image can be determined, for example, from any syntax element describing the motion of the coding unit or prediction unit. Thus, this method can be applied to reference sample selection in cross-component prediction model generation for inter-predicted or intra-block copy (IBC) predicted blocks. The position of the reference samples in the reference image can be calculated using fractional sample precision or full sample precision.
[0159] According to one embodiment, the first and second color channels include at least one luminance channel and at least one chrominance channel.
[0160] According to one embodiment, this method includes rounding the luminance and / or chrominance motion vectors determined for a block in the current image to full sample accuracy, and using the rounded motion vectors to determine the position of a reference region in a reference image.
[0161] According to one embodiment, this method includes truncating a motion vector to full sample accuracy by removing the fractional part of the motion vector.
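The two ways of bringing a motion vector to full sample accuracy can be sketched in Python as follows. The sketch assumes motion vectors stored as integers in 1/16-sample units (MV_FRAC_BITS = 4), which is an assumption of this illustration rather than something stated in the text; the function names are hypothetical. Truncation is implemented by dropping the fractional bits, which rounds toward negative infinity for negative components.

```python
MV_FRAC_BITS = 4  # assumption: motion vectors are stored in 1/16-sample units


def round_mv(mv, frac_bits=MV_FRAC_BITS):
    """Round each component to the nearest full sample (ties rounded up)."""
    half = 1 << (frac_bits - 1)
    return tuple(((c + half) >> frac_bits) << frac_bits for c in mv)


def truncate_mv(mv, frac_bits=MV_FRAC_BITS):
    """Remove the fractional bits of each component."""
    return tuple((c >> frac_bits) << frac_bits for c in mv)


def reference_origin(block_x, block_y, full_sample_mv, frac_bits=MV_FRAC_BITS):
    """Top-left corner of the reference region for a full-sample motion vector."""
    return (block_x + (full_sample_mv[0] >> frac_bits),
            block_y + (full_sample_mv[1] >> frac_bits))


mv = (21, -5)                    # (1 + 5/16, -5/16) samples
print(round_mv(mv))              # (16, 0)
print(truncate_mv(mv))           # (16, -16)
print(reference_origin(64, 32, truncate_mv(mv)))  # (65, 31)
```

With a full-sample vector the reference region lies on the integer sample grid, so its samples can be read from the reference image without interpolation.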
[0162] According to one embodiment, this method includes rounding the motion vector of one color channel to full sample accuracy and determining the sample values corresponding to other color channels by interpolating the samples while taking into account the offset between color channels.
[0163] This embodiment addresses situations where a fractional offset exists between the luminance and chrominance sampling grids. This offset is also called the sampling offset or sampling position offset. As an example, in a 4:2:0 subsampled image, the chrominance motion vector can be rounded to full sample accuracy, and the corresponding luminance motion vector can be determined to be twice the rounded chrominance motion vector. In this example, the following filters can be used to determine the luminance reference value corresponding to the chrominance value at chrominance position (x, y).
[Formula]
[0164] The reference region can have the same size and shape as the coding unit or prediction unit. Its position can also be determined using motion vector information associated with the coding unit or prediction unit.
[0165] According to one embodiment, this method includes extending the region used for selecting the reference sample beyond the reference region.
[0166] For example, when the coding unit or prediction unit is relatively small, it may be beneficial to extend the initially defined reference region to generate a more representative cross-component model. The reference region can be extended, for example, by expanding it in all directions by a predetermined amount or signaled amount. For example, the reference region may be determined to include a region with dimensions equal to the current prediction unit's region plus four additional sample rows and columns on all sides (top, bottom, left, and right). Another example is that the reference region may be determined to include two additional sample rows and columns around a region determined by the size and shape of the prediction unit.
[0167] The following discloses various embodiments relating to the selection and / or expansion of the region (also called the prediction region) from which the reference sample is selected.
[0168] According to one embodiment, this method involves symmetrically expanding the region used for selecting a reference sample (prediction region) in the up, down, left, and right directions, based on a region determined by the size and shape of the prediction unit. Figure 13a shows an example of expanding the prediction region for both luminance and chrominance samples.
[0169] According to one embodiment, this method includes extending the region used for selecting a reference sample (prediction region) asymmetrically with respect to a region determined by the size and shape of the prediction unit.
[0170] According to one embodiment, this method includes extending the region used for selecting a reference sample (prediction region) asymmetrically with respect to the region determined by the size and shape of the prediction unit, based on the direction of the prediction vector, its magnitude, or both.
[0171] According to one embodiment, this method includes expanding the region used for selecting reference samples (prediction region) in the direction of movement when the number of samples in the prediction block is less than a predetermined threshold. Therefore, if the number of samples in the prediction block is insufficient for model generation, the reference sample selection region (prediction region) can be expanded in the direction of movement, for example, in the lower right direction in the example of Figure 13a.
[0172] According to one embodiment, the method includes excluding reference samples within first and second reference regions located outside the reference image from model generation. Therefore, it is possible to exclude samples from padding regions during model generation.
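The region handling described in paragraphs [0165] to [0172] can be sketched in Python as follows. Regions are represented as (x, y, width, height) tuples; the function names, the margin layout (left, top, right, bottom), and the convention that positive motion vector components point right and down are assumptions made for this illustration.

```python
def expand_region(region, margin):
    """Grow a region by (left, top, right, bottom) sample columns / rows."""
    x, y, w, h = region
    left, top, right, bottom = margin
    return (x - left, y - top, w + left + right, h + top + bottom)


def symmetric_margin(amount):
    """Same extension on all four sides."""
    return (amount, amount, amount, amount)


def directional_margin(mv, amount):
    """Extend only toward the direction in which the motion vector points."""
    mvx, mvy = mv
    return (amount if mvx < 0 else 0, amount if mvy < 0 else 0,
            amount if mvx > 0 else 0, amount if mvy > 0 else 0)


def positions_inside_picture(region, pic_width, pic_height):
    """Keep only sample positions inside the reference image (no padding samples)."""
    x, y, w, h = region
    return [(cx, cy)
            for cy in range(max(y, 0), min(y + h, pic_height))
            for cx in range(max(x, 0), min(x + w, pic_width))]


base = (10, 10, 4, 4)
print(expand_region(base, symmetric_margin(2)))            # (8, 8, 8, 8)
print(expand_region(base, directional_margin((5, 3), 4)))  # (10, 10, 8, 8)
print(len(positions_inside_picture((-2, -2, 4, 4), 16, 16)))  # 4
```

The last call shows the exclusion step: of the 16 positions in a region overlapping the top-left picture corner, only the 4 that lie inside the picture are kept for model generation.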
[0173] According to one embodiment, the method includes using reference samples from multiple hypotheses for model parameter calculation in response to applying bidirectional or multi-hypothesis prediction to generate prediction blocks. Thus, bidirectional or multi-hypothesis prediction can be used to generate prediction blocks, and two or more hypotheses, or reference samples from each hypothesis, can be used for model parameter calculation.
[0174] According to one embodiment, this method involves weighting reference samples from multiple hypotheses with their respective weights, or with weights derived from the weights used in bidirectional or multi-hypothesis prediction. Thus, when using bidirectional or multi-hypothesis prediction with equal or different weights, reference samples from each hypothesis are weighted in the model parameter calculation with their respective weights, or with weights derived from the weights used in the bidirectional or multi-hypothesis prediction.
[0175] According to one embodiment, the weight of each hypothesis is calculated based on the reciprocal of the motion vector length of that hypothesis. For example, the weight of one hypothesis may be set to the motion vector length of another hypothesis.
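One possible reading of this weighting for two hypotheses is sketched below in Python. Each hypothesis receives the motion vector length of the other one, which makes the weights proportional to the reciprocals of their own lengths. The normalisation to a sum of 1, the equal-weight fallback for two zero vectors, the weighted least-squares linear model, and all function names are assumptions of this sketch.

```python
import math


def hypothesis_weights(mv0, mv1):
    """Weights proportional to 1 / |mv| for two hypotheses, normalised to sum to 1."""
    len0 = math.hypot(*mv0)
    len1 = math.hypot(*mv1)
    total = len0 + len1
    if total == 0:
        return 0.5, 0.5  # assumption: equal weights when both vectors are zero
    # Each hypothesis takes the motion vector length of the other one
    return len1 / total, len0 / total


def weighted_fit(samples):
    """Weighted least squares for chroma ~ alpha * luma + beta.

    samples: iterable of (luma, chroma, weight) triples.
    """
    sw = sx = sy = sxx = sxy = 0.0
    for x, y, w in samples:
        sw += w
        sx += w * x
        sy += w * y
        sxx += w * x * x
        sxy += w * x * y
    denom = sw * sxx - sx * sx
    if denom == 0:
        return 0.0, sy / sw
    alpha = (sw * sxy - sx * sy) / denom
    beta = (sy - alpha * sx) / sw
    return alpha, beta


w0, w1 = hypothesis_weights((3, 4), (0, 15))
print(w0, w1)  # 0.75 0.25
hyp0 = [(10, 20), (20, 40)]   # (luma, chroma) reference samples of hypothesis 0
hyp1 = [(30, 60), (40, 80)]   # reference samples of hypothesis 1
samples = [(x, y, w0) for x, y in hyp0] + [(x, y, w1) for x, y in hyp1]
alpha, beta = weighted_fit(samples)
print(alpha, beta)
```

Here the hypothesis with the shorter motion vector contributes three times as strongly to the model parameters as the other one.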
[0176] In another embodiment, the weights of each hypothesis are assigned according to the fractional part of the motion vector. For example, a hypothesis in which the luminance motion vector points to an integer sample in the luminance reference frame may be given a higher weight. In another example, a hypothesis in which the chrominance motion vector points to an integer sample in the chrominance reference frame may be given a higher weight.
[0177] According to one embodiment, this method includes generating a cross-component predictive model using samples around the region pointed to by the full-precision motion vector, prior to the motion compensation step. Thus, model generation utilizes samples around the region pointed to by the full-precision motion vector before applying motion compensation, affine transformations, or other processing to the samples.
[0178] According to one embodiment, when prediction samples are used for parameter calculation, the same processing used for prediction generation can be applied to the extended region mentioned in the above embodiment in order to generate more samples to be used for parameter calculation.
[0179] According to one embodiment, this method includes, in response to detecting the use of local illumination compensation (LIC) in prediction generation, skipping signal processing related to local illumination compensation and inferring the relevant syntax element to be false.
[0180] According to one embodiment, the method includes generating a cross-component model using both reference samples from a reference image and reference samples around the current block, in response to detecting the use of combined inter and intra prediction (CIIP) for prediction generation.
[0181] According to one embodiment, this method includes, in response to detecting the use of combined inter and intra prediction (CIIP) for prediction generation, generating a first cross-component prediction model from reference samples in a reference image and generating a second cross-component prediction model from reference samples around the current block.
[0182] This allows for the generation of two separate models: one generated from reference samples within the reference image, and another generated from reference samples around the current block. In the final color difference prediction generation, two blocks utilizing these models are generated and can be blended with equal / unequal weighting, or the models can be switched based on the sample positions. For example, the model generated from reference samples around the current block can be used for the upper left portion of the block.
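The two combination options (weighted blending and position-based switching) can be sketched as follows. The blocks are plain lists of rows, and the function names, the single blending weight, and the rectangular upper-left split are illustrative assumptions.

```python
from typing import List

Block = List[List[float]]


def blend_predictions(pred_ref: Block, pred_nbr: Block, weight_nbr: float = 0.5) -> Block:
    """Blend the block predicted with the reference-image model (pred_ref) and the
    block predicted with the neighbourhood model (pred_nbr), sample by sample."""
    return [
        [(1.0 - weight_nbr) * r + weight_nbr * n for r, n in zip(row_ref, row_nbr)]
        for row_ref, row_nbr in zip(pred_ref, pred_nbr)
    ]


def switch_by_position(pred_ref: Block, pred_nbr: Block, split_x: int, split_y: int) -> Block:
    """Use the neighbourhood model in the upper-left part, the other model elsewhere."""
    return [
        [
            pred_nbr[y][x] if (x < split_x and y < split_y) else pred_ref[y][x]
            for x in range(len(row))
        ]
        for y, row in enumerate(pred_ref)
    ]


pred_ref = [[100.0, 100.0], [100.0, 100.0]]
pred_nbr = [[200.0, 200.0], [200.0, 200.0]]
blended = blend_predictions(pred_ref, pred_nbr, weight_nbr=0.25)
switched = switch_by_position(pred_ref, pred_nbr, split_x=1, split_y=1)
```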
[0183] Note that blocks belonging to the reference sample region may already contain cross-component models derived using intra-based CCCM / CCLM methods or inter-based CCRM methods. In this case, existing models can be reused.
[0184] According to one embodiment, this method involves mixing an existing cross-component model with a newly created cross-component model during regression using a weighted average of the statistics used to acquire each model.
[0185] According to one embodiment, this method includes mixing an existing cross-component model and a newly created cross-component model during regression using a weighted average of the coefficients of both cross-component models.
[0186] Therefore, existing cross-component models are mixed with newly derived cross-component models either by implicit mixing during regression using a weighted average of the statistics used to acquire the models (i.e., co-regression), or by mixing using a weighted average of the coefficients of the two cross-component models (i.e., separate regression).
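The difference between the two mixing strategies can be shown with a minimal sketch, again assuming a linear model chroma ≈ a · luma + b whose regression statistics are the usual sums over the reference samples. The class, function, and parameter names are hypothetical.

```python
from dataclasses import astuple, dataclass
from typing import Sequence, Tuple


@dataclass
class RegressionStats:
    """Sufficient statistics for fitting chroma ~ a * luma + b."""
    n: float = 0.0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    @classmethod
    def from_samples(cls, luma: Sequence[float], chroma: Sequence[float]) -> "RegressionStats":
        stats = cls()
        for x, y in zip(luma, chroma):
            stats.n += 1.0
            stats.sx += x
            stats.sy += y
            stats.sxx += x * x
            stats.sxy += x * y
        return stats

    def solve(self) -> Tuple[float, float]:
        if self.n == 0.0:
            return 0.0, 0.0
        denom = self.n * self.sxx - self.sx * self.sx
        if abs(denom) < 1e-12:
            return 0.0, self.sy / self.n  # offset-only fallback
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        return a, (self.sy - a * self.sx) / self.n


def mix_statistics(existing: RegressionStats, new: RegressionStats,
                   w_existing: float) -> Tuple[float, float]:
    """Co-regression: average the statistics, then solve a single model."""
    mixed = RegressionStats(*(
        w_existing * e + (1.0 - w_existing) * n
        for e, n in zip(astuple(existing), astuple(new))
    ))
    return mixed.solve()


def mix_coefficients(existing_model: Tuple[float, float], new_model: Tuple[float, float],
                     w_existing: float) -> Tuple[float, ...]:
    """Separate regression: each model is solved on its own, then coefficients are averaged."""
    return tuple(
        w_existing * e + (1.0 - w_existing) * n
        for e, n in zip(existing_model, new_model)
    )
```

Mixing statistics requires the existing model's statistics to have been stored, whereas mixing coefficients only needs the stored coefficients; which trade-off is preferable is a design choice not settled here.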
[0187] According to one embodiment, this method involves modifying one or more parameters of an existing model obtained from a reference region to create a modified model that better represents the characteristics of the current block. For example, the bias coefficient or slope coefficient of the model may be recalculated based on one or more samples reconstructed from the neighborhood of the current block.
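As a hedged illustration of recalculating only the bias coefficient, the sketch below keeps the slope of an existing linear model (chroma ≈ a · luma + b) and refits the offset so that the model matches the mean of reconstructed neighbouring samples. The linear model form and the mean-matching rule are assumptions chosen for simplicity.

```python
from typing import Sequence, Tuple


def refit_offset(
    model: Tuple[float, float],
    neighbor_luma: Sequence[float],
    neighbor_chroma: Sequence[float],
) -> Tuple[float, float]:
    """Keep the slope of (a, b) and recompute b from the current block's neighbourhood."""
    slope, offset = model
    if not neighbor_luma or len(neighbor_luma) != len(neighbor_chroma):
        return slope, offset  # nothing to adapt to; keep the existing model
    mean_luma = sum(neighbor_luma) / len(neighbor_luma)
    mean_chroma = sum(neighbor_chroma) / len(neighbor_chroma)
    return slope, mean_chroma - slope * mean_luma


# Existing model (a=0.5, b=10) adapted to brighter chroma around the current block.
adapted = refit_offset((0.5, 10.0), [100, 120], [70, 80])
```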
[0188] According to one embodiment, the shape of the extended reference region is non-rectangular. Therefore, the reference sample region does not need to be a set of four rectangular sides, and can exhibit other shapes, as shown in Figure 13b. In this case, the extension can be considered to have a triangular shape with respect to the block's contour. Another possibility is a circular extension. For example, using a triangular or elliptical reference region extension can cover more reference samples while simultaneously minimizing the spatial distance of the reference samples to the block. As another example, hexagonal, octagonal, or circular reference regions can be used. In general, any shape can be used as a reference region extension.
[0189] According to one embodiment, this method includes segmenting the reference samples with respect to luminance and / or chrominance intensity to obtain a reference region extension of arbitrary shape. In this case, the reference samples are selected so as to minimize the difference between their sample value intensity and a threshold, where the threshold is, for example, a statistical value derived from the block samples (such as the mean value of luminance and / or chrominance).
[0190] According to one embodiment, the shape of the reference region extension is signaled by an encoder or estimated by a decoder.
[0191] According to one embodiment, the method involves modifying the geometry of the reference region extension based on a motion model. As shown in the example in Figure 13c, affine motion compensation can apply an affine transformation to the reference region extension to compensate for potential rotations and zooms of the reference region.
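A minimal sketch of modifying the extension geometry with an affine motion model is given below. The extension is represented by a list of sample positions, and the six parameters (a, b, c, d, tx, ty) are a generic 2D affine parameterization assumed for illustration; how they would be derived from the block's affine motion is not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def affine_transform_points(
    points: Sequence[Point],
    a: float, b: float, c: float, d: float,
    tx: float = 0.0, ty: float = 0.0,
) -> List[Point]:
    """Map each (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Corner offsets of a reference region extension, relative to the block centre.
corners = [(-6.0, -6.0), (6.0, -6.0), (6.0, 6.0), (-6.0, 6.0)]
# A pure zoom by a factor of 2 followed by a shift of (16, 8) samples.
zoomed = affine_transform_points(corners, 2.0, 0.0, 0.0, 2.0, tx=16.0, ty=8.0)
```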
[0192] According to one embodiment, when the encoding mode is intra block copy (IBC) or block copy based on template matching, the reference region is determined using the IBC reference region pointed to by the associated block vector (BV) in IBC mode. The block vector may be derived based on luminance blocks or chrominance blocks. The block vector may be scaled before being used for reference region determination. Figure 14a shows an example of using the IBC reference region for parameter calculation.
[0193] In one embodiment, the IBC tool uses two or more blocks to compute a prediction, and the final prediction is a weighted combination of multiple blocks. In this case, one or more of the blocks used in the IBC prediction may be used as reference blocks for computing a cross-component model.
[0194] According to one embodiment, the block includes an IBC reference sample located within the same CTU as the current block, or outside the current CTU, or both.
[0195] For example, to reduce local memory access from the current frame, this method may use only IBC reference samples located outside the current CTU. Figure 14b shows an example where the IBC reference region overlaps with the current CTU and adjacent CTUs. In this case, reference samples from the current CTU may be excluded from the cross-component model derivation to mitigate memory access issues.
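The exclusion of reference samples that fall inside the current CTU can be sketched as a simple position filter. The CTU size of 128 samples and the function name are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int]


def samples_outside_current_ctu(
    positions: Sequence[Position],
    block_x: int,
    block_y: int,
    ctu_size: int = 128,  # assumed CTU size in luma samples
) -> List[Position]:
    """Keep only the IBC reference positions that lie outside the CTU of the current block."""
    current_ctu = (block_x // ctu_size, block_y // ctu_size)
    return [
        (x, y)
        for x, y in positions
        if (x // ctu_size, y // ctu_size) != current_ctu
    ]


# The current block starts at x=130, so its CTU covers x in [128, 256).
# The IBC reference row below straddles the boundary between two CTUs.
reference_row = [(126, 10), (127, 10), (128, 10), (129, 10)]
usable = samples_outside_current_ctu(reference_row, block_x=130, block_y=10)
```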
[0196] In the above embodiment, if some or all of the IBC reference samples belong to the same CTU as the current block, co-located block samples of the IBC reference region from the reference frame can be used for cross-component model derivation.
[0197] The extension of the reference domain can also be applied to intra-cross-component prediction (CCLM, CCCM, GL-CCCM, etc.) through the extension of the intra-reference template according to any of the preceding embodiments.
[0198] According to one embodiment, this method involves individually deriving a cross-component model for each partition by utilizing the respective reference sample region and its extension, in response to the use of multiple motion vectors, as in geometric partitioning mode (GPM).
[0199] According to one embodiment, the method includes using only a subset of the reference samples in response to the use of multiple motion vectors and / or multiple reference images. For example, only one of the motion vectors can be used to extract reference samples for a cross-component model, or only one of the reference images can be used.
[0200] It should be noted that the methods and embodiments disclosed above are applicable to both the encoder and decoder sides unless otherwise stated, either expressly or implicitly. In principle, the operations performed on both sides are identical.
[0201] One embodiment of the apparatus comprises means for determining an image block unit of a frame, wherein the image block unit includes samples of at least first and second color channels; means for determining at least one reference image and a first reference region within at least one reference image for predicting target samples of at least first color channels of the image block unit, and a second reference region within at least one reference image for predicting target samples of at least second color channels of the image block unit; means for reconstructing samples of first color channels of the image block unit; means for deriving reference samples from reference samples in the first and second reference regions for generating a cross-component prediction model; and means for predicting target samples of at least second color channels of the image block unit using reconstructed samples of at least first color channels as input to the cross-component prediction model.
[0202] According to one embodiment, the first and second color channels include at least one luminance channel and at least one chrominance channel.
[0203] According to one embodiment, the apparatus comprises means for rounding motion vectors of luminance and / or chrominance determined for a block in the current image to full sample accuracy, and means for determining the position of a reference region in a reference image using the rounded motion vectors.
[0204] According to one embodiment, the apparatus includes means for clipping a motion vector to full sample accuracy by removing the fractional part of the motion vector.
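The difference between rounding a motion vector to full sample accuracy and removing its fractional part can be sketched as follows. The sketch assumes motion vectors stored as integers in 1/16-sample units (`frac_bits=4`); the storage format and function names are assumptions. Removing the fractional bits by masking rounds toward negative infinity, which is one possible interpretation of clipping.

```python
from typing import Tuple

MotionVector = Tuple[int, int]


def round_mv_to_full_sample(mv: MotionVector, frac_bits: int = 4) -> MotionVector:
    """Round each component to the nearest full-sample position (result stays in 1/16 units)."""
    half = 1 << (frac_bits - 1)
    return (
        ((mv[0] + half) >> frac_bits) << frac_bits,
        ((mv[1] + half) >> frac_bits) << frac_bits,
    )


def clip_mv_to_full_sample(mv: MotionVector, frac_bits: int = 4) -> MotionVector:
    """Drop the fractional bits of each component (floors toward negative infinity)."""
    return (
        (mv[0] >> frac_bits) << frac_bits,
        (mv[1] >> frac_bits) << frac_bits,
    )


# 23/16 = 1.4375 samples and -9/16 = -0.5625 samples.
rounded = round_mv_to_full_sample((23, -9))   # nearest full samples: (1, -1)
clipped = clip_mv_to_full_sample((23, -9))    # floor: (1, -1)
```

The position of the reference region in the reference image would then be the block position displaced by the resulting vector divided by 16.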
[0205] According to one embodiment, the apparatus includes means for rounding the motion vector of one color channel to full sample accuracy, and means for determining sample values corresponding to other color channels by interpolating samples while taking into account offsets between color channels.
[0206] According to one embodiment, the apparatus includes means for extending the area used for the selection of the reference sample beyond the reference area.
[0207] According to one embodiment, the apparatus includes means for symmetrically extending the area used for the selection of the reference sample upward, downward, leftward, and rightward with respect to the area determined by the size and shape of the prediction unit.
[0208] According to one embodiment, the apparatus includes means for asymmetrically extending the area used for the selection of the reference sample with respect to the area determined by the size and shape of the prediction unit.
[0209] According to one embodiment, the apparatus includes means for extending the area used for the selection of the reference sample in the motion direction in response to the number of samples in the prediction block being less than a predetermined threshold.
[0210] According to one embodiment, the apparatus includes means for omitting reference samples from outside the reference area when generating a cross-component prediction model.
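The extension options described in the preceding paragraphs can be sketched together. Regions are axis-aligned rectangles with exclusive right and bottom edges; the margin sizes, the threshold value, and all names are hypothetical. The last helper assumes that omission refers to positions falling outside the reference picture boundaries.

```python
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def extend(region: Region, up: int = 0, down: int = 0, left: int = 0, right: int = 0) -> Region:
    """Extend a region; equal margins give a symmetric extension, unequal ones an asymmetric one."""
    return Region(region.left - left, region.top - up, region.right + right, region.bottom + down)


def extend_along_motion(region: Region, mv: Tuple[int, int], margin: int, min_samples: int) -> Region:
    """Extend only toward the motion direction, and only for blocks below the sample threshold."""
    width = region.right - region.left
    height = region.bottom - region.top
    if width * height >= min_samples:
        return region
    mvx, mvy = mv
    return extend(
        region,
        up=margin if mvy < 0 else 0,
        down=margin if mvy > 0 else 0,
        left=margin if mvx < 0 else 0,
        right=margin if mvx > 0 else 0,
    )


def positions_inside_picture(region: Region, pic_width: int, pic_height: int) -> List[Tuple[int, int]]:
    """List the sample positions of a region, omitting those outside the picture."""
    return [
        (x, y)
        for y in range(max(region.top, 0), min(region.bottom, pic_height))
        for x in range(max(region.left, 0), min(region.right, pic_width))
    ]


block = Region(8, 8, 12, 12)                      # a 4x4 prediction block
symmetric = extend(block, up=2, down=2, left=2, right=2)
asymmetric = extend(block, up=4, left=1)
directional = extend_along_motion(block, mv=(5, -3), margin=2, min_samples=64)
```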
[0211] According to one embodiment, the shape of the extended reference area is non-rectangular.
[0212] According to one embodiment, the means is implemented in an encoder.
[0213] According to one embodiment, the means is implemented in a decoder.
[0214] In yet another embodiment, there is provided an apparatus comprising at least one processor and at least one memory, the memory of which stores code, which, when executed by the at least one processor, causes the apparatus to perform at least: determining an image block unit of a frame, wherein the image block unit includes samples in at least first and second color channels; determining at least one reference image and a first reference region in at least one reference image for predicting target samples in at least first color channels of the image block unit, and a second reference region in at least one reference image for predicting target samples in at least second color channels of the image block unit; reconstructing samples in first color channels of image block units; deriving reference samples from reference samples in first and second reference regions for generating a cross-component prediction model; and predicting target samples in at least second color channels of image block units using the reconstructed samples in at least first color channels as input to the cross-component prediction model.
[0215] According to one embodiment, the first and second color channels include at least one luminance channel and at least one chrominance channel.
[0216] According to one embodiment, the device includes code that causes the device to round the luminance and / or chrominance motion vectors determined for a block in the current image to full sample accuracy, and to determine the position of a reference region in a reference image using the rounded motion vectors.
[0217] According to one embodiment, the device includes code that causes the device to operate in such a way that it rounds the motion vectors and clips them to full sample accuracy.
[0218] According to one embodiment, the device includes code that causes the device to round the motion vector of one channel to full sample accuracy and determine the sample values corresponding to other color channels by interpolating the samples, taking into account the offset between color channels.
[0219] According to one embodiment, the device includes code that causes the device to extend the region used for selecting a reference sample beyond the reference region.
[0220] According to one embodiment, the device includes code that causes the device to perform an expansion of the region used for selecting a reference sample symmetrically upward, downward, leftward, and rightward relative to the region determined by the size and shape of the prediction unit.
[0221] According to one embodiment, the device includes code that causes the device to perform an asymmetrical expansion of the region used for selecting a reference sample relative to the region determined by the size and shape of the prediction unit.
[0222] According to one embodiment, the device includes code that causes the device to expand the region used for selecting a reference sample in the motion direction in response to the number of samples in the prediction block being less than a predetermined threshold.
[0223] According to one embodiment, the device includes code that causes the device to omit reference samples from outside the reference region when creating a cross-component predictive model.
[0224] Such a device may, for example, include the functional units disclosed in any of Figures 1, 2, 4a, and 4b for carrying out the embodiment.
[0225] Such a device further comprises code stored in at least one memory and executed by at least one processor, which causes the device to perform one or more embodiments disclosed herein.
[0226] Figure 15 illustrates an example of a multimedia communication system in which various embodiments can be implemented. The data source 1510 provides a source signal in analog, uncompressed digital, compressed digital, or any combination thereof. The encoder 1520 may include, or be connected to, preprocessing functions such as data format conversion and filtering of the source signal. The encoder 1520 converts the source signal into an encoded media bitstream. It should be noted that the bitstream to be decoded may be received directly or indirectly from remote devices located within virtually any type of network. Furthermore, the bitstream may be received from local hardware or software. The encoder 1520 may be capable of encoding multiple media types, such as audio and video. Alternatively, multiple encoders 1520 may be required to encode different media types of the source signal. The encoder 1520 may also accept synthetically generated inputs, such as graphics or text. Alternatively, it may be capable of generating encoded bitstreams of synthetic media. For simplicity, the following discussion will only consider the processing of a single encoded media bitstream for a single media type. However, it should be noted that a typical real-time broadcast service consists of multiple streams (usually at least audio, video, and subtitle streams). Furthermore, while a system may include multiple encoders, the diagram shows only one encoder 1520 for the sake of generality and to simplify the explanation. Additionally, even if the descriptions and examples contained herein describe a specific encoding process, those skilled in the art should understand that the same concepts and principles apply to the corresponding decoding process, and vice versa.
[0227] The encoded media bitstream may be transferred to the storage device 1530. The storage device 1530 may consist of any kind of mass storage device for storing the encoded media bitstream. The format of the encoded media bitstream in the storage device 1530 may be a basic self-contained bitstream format. Alternatively, one or more encoded media bitstreams may be encapsulated in a container file. Or, the encoded media bitstream may be encapsulated in a segment format suitable for DASH (or a similar streaming system) and stored as a series of segments. If one or more media bitstreams are encapsulated in a container file, a file generator (not shown in the diagram) can be used to save one or more media bitstreams to a file and create metadata for the file format. This metadata may also be stored in a file. The encoder 1520 or the storage device 1530 may constitute a file generator. Alternatively, the file generator may be operationally connected to either the encoder 1520 or the storage device 1530. Some systems operate "live," meaning that the storage device is omitted and the encoded media bitstream is transferred directly from the encoder 1520 to the sender 1540. The encoded media bitstream is transferred to the sender 1540 (also known as the server) as needed. The format used for transmission may be a basic self-contained bitstream format, a packet stream format, a segment format suitable for DASH (or a similar streaming system), or a format in which one or more encoded media bitstreams are encapsulated in a container file. The encoder 1520, storage device 1530, and server 1540 may reside on the same physical device or be contained in separate devices. 
The encoder 1520 and server 1540 may handle live real-time content, in which case the encoded media bitstream is not typically stored permanently but is briefly buffered within the content encoder 1520 and / or server 1540 to smooth out processing delays, transfer delays, and fluctuations in the encoded media bitrate.
[0228] The server 1540 transmits encoded media bitstreams using a communication protocol stack. This stack may include, but is not limited to, one or more of Real-Time Transport Protocol (RTP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), Transmission Control Protocol (TCP), and Internet Protocol (IP). If the communication protocol stack is packet-oriented, the server 1540 encapsulates the encoded media bitstream into packets. For example, when using RTP, the server 1540 encapsulates the encoded media bitstream in RTP packets according to the RTP payload format. Typically, each media type has its own dedicated RTP payload format. A system may contain multiple servers 1540, but for simplicity, only one server 1540 will be considered in the following description.
[0229] If media content is encapsulated in a container file for the storage device 1530 or for data input to the sender 1540, the sender 1540 may comprise, or be operationally connected to, a “transmit file parser” (not shown). In particular, if the container file is not transmitted as is, but at least one of the encoded media bitstreams it contains is encapsulated for transport over a communication protocol, the transmit file parser identifies the appropriate portion of the encoded media bitstream to be transmitted over the communication protocol. The transmit file parser may also help create the correct format for the communication protocol, such as packet headers and payloads. The multimedia container file may contain encapsulation instructions, such as an ISOBMFF hint track, for encapsulating at least one of the contained media bitstreams over the communication protocol.
[0230] The server 1540 may or may not be connected to a gateway 1550 via a communication network. This communication network may be, for example, a combination of a CDN, the Internet, and / or one or more access networks. The gateway may sometimes be referred to as a middlebox. In the case of DASH, the gateway may be an edge server (of the CDN) or a web proxy. The system may generally include any number of gateways or similar components, but for simplicity, only one gateway 1550 will be considered in the following description. The gateway 1550 may perform different types of functions, such as converting a packet stream based on one communication protocol stack to another communication protocol stack, merging and forking data streams, and manipulating data streams according to downlink and / or receiver capabilities (e.g., controlling the bitrate of the forwarded stream according to the state of the downlink network). The gateway 1550 may be a server entity in various embodiments.
[0231] The system typically includes one or more receivers 1560 that can receive, demodulate, and decapsulate a transmitted signal into an encoded media bitstream. The encoded media bitstream is transferred to a storage device 1570. The storage device 1570 may consist of any kind of mass storage for storing the encoded media bitstream. The storage device 1570 may, instead or additionally, include computational memory such as random access memory. The format of the encoded media bitstream in the storage device 1570 may be a basic self-contained bitstream format, or one or more encoded media bitstreams may be encapsulated in a container file. If there are multiple encoded media bitstreams that are related to each other, such as audio and video streams, a container file is usually used, and the receiver 1560 comprises or is connected to a container file generator that generates a container file from the input streams. Some systems operate "live," that is, they omit the storage device 1570 and transfer the encoded media bitstream directly from the receiver 1560 to the decoder 1580. In some systems, the storage device 1570 retains only the most recent portion of the recorded stream, for example, only the most recent 10 minutes, and discards any older recorded data.
[0232] The encoded media bitstream may be transferred from the storage device 1570 to the decoder 1580. If there are multiple encoded media bitstreams that are related to each other and encapsulated in a container file, such as audio and video streams, or if a single media bitstream is encapsulated in a container file (for example, to facilitate access), a file parser (not shown in the diagram) is used to decapsulate each encoded media bitstream from the container file. The storage device 1570 or the decoder 1580 may comprise the file parser, or the file parser may be connected to either the storage device 1570 or the decoder 1580. The system may also contain multiple decoders, but for the sake of simplicity and without loss of generality, only one decoder 1580 is discussed here.
[0233] The encoded media bitstream is further processed by the decoder 1580, and its output is one or more uncompressed media streams. Finally, the renderer 1590 plays the uncompressed media streams, for example, using a speaker or display. The receiver 1560, storage device 1570, decoder 1580, and renderer 1590 may reside in the same physical device or may be contained in separate devices.
[0234] The transmitter 1540 and / or gateway 1550 may be configured to perform switching between different representations (e.g., switching between different viewports of 360-degree video content, view switching, bitrate adaptation, fast startup). The transmitter 1540 and / or gateway 1550 may also be configured to select which representation to transmit. Switching between different representations can occur for several reasons, such as responding to requests from the receiver 1560 or depending on circumstances such as the throughput of the network through which the bitstream is transmitted. In other words, the receiver 1560 may initiate switching between representations. Requests from the receiver may include, for example, requests for segments or subsegments from a different representation than the previous one, requests for changes to the scalability layer and / or sublayers being transmitted, or requests for changes to rendering devices with different capabilities compared to the previous ones. Segment requests may be HTTP GET requests. Subsegment requests may be HTTP GET requests with specified byte ranges. Furthermore, or instead, bitrate adjustment or bitrate adaptation may be used, for example, to achieve so-called fast startup in streaming services. In this case, the bitrate of the transmission stream is set lower than the channel bitrate to enable playback immediately after the start of streaming or random access and to achieve a buffer occupancy level that can withstand occasional packet delays and retransmissions. Bitrate adaptation may involve up-switching and down-switching operations of multiple representations or layers in various orders.
[0235] Decoder 1580 may be configured to switch between different representations, for example, to switch between different viewports of 360-degree video content, view switching, bitrate adaptation, and / or for faster startup. Decoder 1580 may also be configured to select the representation to be transmitted. Switching between different representations may be done for several reasons. For example, to achieve faster decoding operations, or to adapt the transmitted bitstream to current conditions such as bitrate and the throughput of the network through which the bitstream is transmitted. For example, if a device including decoder 1580 is performing multitasking and using computing resources for purposes other than decoding the video bitstream, faster decoding operation may be required. Another example is when content is played at a faster pace than normal playback speed, for example, at two or three times the conventional real-time playback speed, which may require faster decoding operation.
[0236] The above describes several embodiments using the terms HEVC and / or VVC. It should be understood that these embodiments can be similarly implemented using any video encoder and / or video decoder.
[0237] Where an exemplary embodiment is described above with reference to an encoder, it should be understood that the resulting bitstream and decoder may include corresponding elements. Similarly, where an exemplary embodiment is described with reference to a decoder, it should be understood that the encoder may include a structure and / or computer program for generating the bitstream to be decoded by the decoder. For example, a relevant embodiment is described in which prediction blocks are generated as part of the encoding. The embodiment can be similarly realized by generating prediction blocks as part of the decoding, the difference being that encoding parameters such as horizontal and vertical offsets are decoded from the bitstream rather than being determined by the encoder.
[0238] The embodiments of the invention described above describe the codec as a separate encoder and decoder device to aid in understanding the related processes. However, it should be understood that the device, structure, and operation can be implemented as a single encoder-decoder device / structure / operation. Furthermore, it is possible that the encoder and decoder share some or all common elements.
[0239] While the above examples illustrate embodiments of the present invention operating within a codec in an electronic device, it will be understood that the present invention as defined in the claims can be implemented as part of any video codec. Therefore, for example, embodiments of the present invention can be implemented in a video codec that performs video encoding over a fixed or wired communication path.
[0240] Therefore, the user device may include a video codec as described in the embodiments of the invention described above. It should be understood that the term "user device" is intended to encompass any suitable type of wireless user device, such as a mobile phone, a portable data processing device, or a portable web browser.
[0241] Furthermore, components of a public land mobile network (PLMN) may also include the video codecs mentioned above.
[0242] In general, various embodiments of the present invention can be implemented in hardware or dedicated circuitry, software, logic, or any combination thereof. For example, some features may be implemented in hardware, while others may be implemented in firmware or software executed by a controller, microprocessor, or other computing device, but the present invention is not limited to these. Various aspects of the present invention can be described using block diagrams, flowcharts, or other illustrative representations, but it is well known that these blocks, devices, systems, techniques, or methods described herein can be implemented, in non-limiting examples, in hardware, software, firmware, dedicated circuitry or logic, general-purpose hardware or controllers or other computing devices, or a combination thereof.
[0243] Embodiments of the present invention may be implemented by computer software executable in a data processing device (e.g., a processor entity) of a mobile device, by hardware, or by a combination of software and hardware. In this regard, it should be noted that any block in the illustrated logic flow may represent a program step, an interconnected logic circuit block function, or a combination of a program step and a logic circuit block function. The software may be stored on a physical medium such as a memory chip, a memory block implemented in a processor, a magnetic medium such as a hard disk or floppy disk, or an optical medium such as a DVD or its data variant, or a CD.
[0244] Memory can be of any type suitable for the local technological environment and can be implemented using any appropriate data storage technology, including semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory, and removable memory. Data processing devices can be of any type suitable for the local technological environment. Non-limiting examples include one or more general-purpose computers, dedicated computers, microprocessors, digital signal processors (DSPs), and processors based on multi-core processor architectures.
[0245] Embodiments of the present invention can be implemented in various components, such as integrated circuit modules. Designing integrated circuits is generally a highly automated process. Complex and powerful software tools are available to translate logic-level designs into semiconductor circuit designs that can be etched and formed on semiconductor substrates.
[0246] Programs like those offered by Synopsys in Mountain View, California, and Cadence Design in San Jose, California, use established design rules and libraries of pre-saved design modules to automatically route conductors and place components on semiconductor chips. Once the semiconductor circuit design is complete, the resulting design is sent to a semiconductor manufacturing facility (fab) in a standardized electronic format (e.g., Opus, GDSII) for manufacturing.
[0247] The above description, through illustrative and non-limiting examples, has provided a complete and useful explanation of embodiments of the present invention. A person skilled in the art will recognize, upon reading the above description in conjunction with the accompanying drawings and claims, that various modifications and adaptations are possible. All such and similar modifications to the teachings of the present invention nevertheless remain within the scope of the invention.
Claims
1. An apparatus comprising: means for determining an image block unit of a frame, wherein the image block unit includes samples in at least a first and a second color channel; means for determining at least one reference image and a first reference region within the at least one reference image for predicting target samples of at least the first color channel in the image block unit, and for determining a second reference region within the at least one reference image for predicting target samples of at least the second color channel in the image block unit; means for reconstructing the samples of the first color channel in the image block unit; means for deriving reference samples for generating a cross-component prediction model from the reference samples in the first and second reference regions; and means for predicting the target samples of at least the second color channel in the image block unit using at least the reconstructed samples of the first color channel as input to the cross-component prediction model.
2. The apparatus according to claim 1, wherein the first and second color channels include at least one luminance channel and at least one chrominance channel.
3. The apparatus according to claim 1 or 2, comprising: means for rounding a luminance and/or chrominance motion vector determined for a block in the image to full sample accuracy; and means for determining the position of the reference region within the reference image using the rounded motion vector.
4. The apparatus according to any one of claims 1 to 3, comprising: means for clipping the motion vector to full sample accuracy by removing the fractional part of the motion vector.
5. The apparatus according to any one of claims 1 to 4, comprising: means for rounding the motion vector of one color channel to full sample accuracy; and means for determining sample values corresponding to the other color channels by interpolating samples while taking into account the offset between the color channels.
6. The apparatus according to any one of claims 1 to 5, comprising: means for extending the region used for selecting a reference sample beyond the reference region.
7. The apparatus according to claim 6, comprising: means for extending the region used for selecting the reference sample symmetrically upward, downward, leftward, and rightward with respect to the region determined by the size and shape of the prediction unit.
8. The apparatus according to claim 6, comprising: means for extending the region used for selecting the reference sample asymmetrically with respect to the region determined by the size and shape of the prediction unit.
9. The apparatus according to claim 6, comprising: means for extending the region used for selecting the reference sample in the direction of motion in response to the number of samples in the prediction block being less than a predetermined threshold.
10. The apparatus according to any one of claims 1 to 9, further comprising: means for omitting reference samples located outside the reference image within the first and second reference regions when generating the cross-component prediction model.
11. The apparatus according to any one of claims 1 to 10, wherein the shape of the extended reference region is non-rectangular.
12. The apparatus according to any one of claims 1 to 11, wherein the means are implemented in an encoder.
13. The apparatus according to any one of claims 1 to 11, wherein the means are implemented in a decoder.
14. A method comprising: determining an image block unit of a frame, wherein the image block unit includes samples in at least a first and a second color channel; determining at least one reference image and a first reference region within the at least one reference image for predicting target samples of at least the first color channel in the image block unit, and determining a second reference region within the at least one reference image for predicting target samples of at least the second color channel in the image block unit; reconstructing the samples of the first color channel in the image block unit; deriving, from the reference samples in the first and second reference regions, reference samples for generating a cross-component prediction model; and predicting the target samples of at least the second color channel in the image block unit using at least the reconstructed samples of the first color channel as input to the cross-component prediction model.
15. The method according to claim 14, wherein the first and second color channels include at least one luminance channel and at least one chrominance channel.
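Illustrative sketch (not part of the claims). The operations recited in claims 3, 4, 6 to 8, and 10 can be illustrated with a minimal, non-limiting Python sketch. Everything in it is an assumption made for illustration only: the function names are hypothetical, motion vectors are assumed to be stored in 1/16-sample units, both color channels are assumed to have the same resolution (4:4:4), and a least-squares linear model stands in for whichever cross-component prediction model an embodiment actually uses.

```python
from typing import List, Tuple

MV_FRAC_BITS = 4  # assumption: motion vectors are in 1/16-sample units


def round_mv_to_full_sample(mv: int, frac_bits: int = MV_FRAC_BITS) -> int:
    """Round one motion vector component to the nearest full sample."""
    half = 1 << (frac_bits - 1)
    return ((mv + half) >> frac_bits) << frac_bits


def truncate_mv_to_full_sample(mv: int, frac_bits: int = MV_FRAC_BITS) -> int:
    """Remove the fractional bits (floors toward negative infinity)."""
    return (mv >> frac_bits) << frac_bits


def collect_reference_pairs(
    luma: List[List[int]],
    chroma: List[List[int]],
    x0: int, y0: int, width: int, height: int,
    margin: Tuple[int, int, int, int] = (0, 0, 0, 0),
) -> List[Tuple[int, int]]:
    """Gather co-located (luma, chroma) reference samples.

    The region at (x0, y0) of size width x height is extended by
    margin = (left, top, right, bottom); equal values give a symmetric
    extension, unequal values an asymmetric one. Positions outside the
    reference picture are omitted rather than padded.
    """
    left, top, right, bottom = margin
    pic_h, pic_w = len(luma), len(luma[0])
    pairs = []
    for y in range(y0 - top, y0 + height + bottom):
        for x in range(x0 - left, x0 + width + right):
            if 0 <= x < pic_w and 0 <= y < pic_h:
                pairs.append((luma[y][x], chroma[y][x]))
    return pairs


def fit_linear_model(pairs: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Least-squares fit of chroma ~= a * luma + b (illustrative model only)."""
    n = len(pairs)
    if n == 0:
        return 0.0, 0.0
    sum_l = sum(l for l, _ in pairs)
    sum_c = sum(c for _, c in pairs)
    sum_ll = sum(l * l for l, _ in pairs)
    sum_lc = sum(l * c for l, c in pairs)
    denom = n * sum_ll - sum_l * sum_l
    if denom == 0:  # all luma samples equal: fall back to the chroma mean
        return 0.0, sum_c / n
    a = (n * sum_lc - sum_l * sum_c) / denom
    b = (sum_c - a * sum_l) / n
    return a, b


if __name__ == "__main__":
    # Toy 4x4 reference picture in which chroma = 2 * luma + 3.
    luma = [[10 * y + x for x in range(4)] for y in range(4)]
    chroma = [[2 * v + 3 for v in row] for row in luma]
    # 2x2 region at (2, 2), extended by one sample on every side; the
    # samples that fall outside the 4x4 picture are dropped.
    pairs = collect_reference_pairs(luma, chroma, 2, 2, 2, 2, (1, 1, 1, 1))
    print(len(pairs), fit_linear_model(pairs))
```

In this sketch the rounded motion vector would give the integer position (x0, y0) of the reference region, and the fitted parameters would then be applied to the reconstructed first-channel samples of the current block to predict its second-channel samples.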