Method and apparatus for adaptive sorting of reference frames
Adaptive reordering of reference frames using template matching enhances video coding efficiency by optimizing reference frame usage, leading to improved compression ratios and reduced bandwidth.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- TENCENT AMERICA LLC
- Filing Date
- 2024-09-25
- Publication Date
- 2026-07-01
Abstract
Description
[Technical Field]
[0001] [References]
[0002] This application claims priority to U.S. Provisional Application No. 63/247,088, filed September 22, 2021, and to U.S. Non-Provisional Application No. 17/932,333, filed September 15, 2022, both entitled “METHOD AND APPARATUS FOR ADAPTIVE REORDERING FOR REFERENCE FRAMES,” each of which is incorporated herein by reference in its entirety.
[Technical field]
[0003] This disclosure describes a set of advanced video coding techniques. More specifically, the disclosed techniques include adaptively reordering reference frames.
[Background technology]
[0004] The background art description provided herein is for the general purpose of presenting the context of this disclosure. The work of the presently named inventors, to the extent it is described in this background art section, as well as aspects of the description that did not otherwise qualify as prior art at the time of filing of this application, is neither expressly nor impliedly admitted as prior art against this disclosure.
[0005] Video coding and decoding can be performed using inter-picture prediction with motion compensation. Uncompressed digital video can contain a series of pictures, each picture having spatial dimensions of, for example, 1920×1080 luminance samples and associated full or subsampled chrominance samples. The series of pictures can have a fixed or variable picture rate (alternatively called frame rate) of, for example, 60 pictures/second or 60 frames/second. Uncompressed video has specific bitrate requirements for streaming or data processing. For example, video with a pixel resolution of 1920×1080, a frame rate of 60 frames/second, and 4:2:0 chroma subsampling at 8 bits per pixel per color channel requires a bandwidth of nearly 1.5 Gbit/second. One hour of such video would require more than 600 gigabytes of storage space.
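The bandwidth and storage figures above can be checked with a short calculation. The sketch below assumes that 4:2:0 subsampling halves each chroma plane in both dimensions and that decimal units are used (1 Gbit = 10^9 bits, 1 GB = 10^9 bytes); the variable names are illustrative only.

```python
# Raw bitrate of uncompressed 1920x1080, 60 frames/second, 4:2:0, 8-bit video.
WIDTH, HEIGHT = 1920, 1080
FPS = 60
BITS_PER_SAMPLE = 8

luma_samples = WIDTH * HEIGHT
# Assumption: 4:2:0 halves both chroma planes horizontally and vertically,
# so the two chroma planes together hold half as many samples as luma.
chroma_samples = 2 * (WIDTH // 2) * (HEIGHT // 2)

bits_per_picture = (luma_samples + chroma_samples) * BITS_PER_SAMPLE
bits_per_second = bits_per_picture * FPS
bytes_per_hour = bits_per_second * 3600 // 8

print(f"{bits_per_second / 1e9:.2f} Gbit/s")      # about 1.49 Gbit/s
print(f"{bytes_per_hour / 1e9:.0f} GB per hour")  # about 672 GB
```

Under these assumptions the result is consistent with "nearly 1.5 Gbit/second" and "more than 600 gigabytes" per hour.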
[0006] One objective of video coding and decoding may be to reduce redundancy in the uncompressed input video signal through compression. Compression can help reduce the aforementioned bandwidth and/or storage space requirements, in some cases by more than two orders of magnitude. Both lossless and lossy compression, or combinations thereof, can be employed. Lossless compression refers to a technique in which an exact copy of the original signal can be reconstructed from the compressed signal through the decoding process. Lossy compression refers to a coding/decoding process in which the original video information is not fully preserved during coding and is not fully recoverable during decoding. When using lossy compression, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is made small enough that, despite some information loss, the reconstructed signal is useful for the intended application. For video, lossy compression is widely adopted in many applications. The amount of acceptable distortion depends on the application. For example, users of certain consumer video streaming applications may tolerate higher distortion than users of movie or television broadcast applications. The compression ratio achievable by a particular coding algorithm can be selected or adjusted to reflect varying distortion tolerances. Generally, a larger acceptable level of distortion allows for coding algorithms that result in higher loss and higher compression ratios.
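The difference between lossless and lossy reconstruction can be illustrated with a minimal sketch. It assumes uniform scalar quantization as the lossy step; the function names, step sizes, and sample values are hypothetical and are not taken from this disclosure.

```python
def quantize(samples, step):
    """Map each non-negative sample to an integer level (round to nearest)."""
    return [(s + step // 2) // step for s in samples]

def dequantize(levels, step):
    """Reconstruct sample values from quantization levels."""
    return [level * step for level in levels]

original = [13, 37, 41, 200]

# Step size 1 keeps every value: the round trip is lossless.
lossless = dequantize(quantize(original, 1), 1)

# Step size 8 needs fewer distinct levels but introduces distortion.
lossy = dequantize(quantize(original, 8), 8)
max_error = max(abs(a - b) for a, b in zip(original, lossy))

print(lossless)          # [13, 37, 41, 200]
print(lossy, max_error)  # [16, 40, 40, 200] 3
```

A larger step size trades more distortion for fewer levels to code, which mirrors the trade-off between acceptable distortion and compression ratio described above.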
[0007] Video encoders and decoders can utilize techniques from several broad categories and steps, including, for example, motion compensation, Fourier transform, quantization, and entropy coding.
[0008] Video codec techniques may include a technique known as intra-coding. In intra-coding, sample values are represented without referencing samples or other data from a previously reconstructed reference picture. In some video codecs, the picture is spatially subdivided into blocks of samples. When all blocks of samples are coded in intra-mode, the picture may be called an intra-picture. Intra-pictures and their derivatives, such as independent decoder refresh pictures, can be used to reset the decoder state and thus can be used as the first picture in a coded video bitstream and video session, or as a still image. The samples of the intra-predicted blocks can then undergo a transformation into the frequency domain, and the resulting transformation coefficients can be quantized before entropy coding. Intra-prediction represents a technique that minimizes the sample values in the pre-transformation domain. In some cases, smaller post-transformation DC values and smaller AC coefficients result in fewer bits being required at a given quantization step size to represent the block after entropy coding.
[0009] For example, traditional intra-coding, as known from MPEG-2 generation coding techniques, does not use intra-prediction. However, some newer video compression techniques include methods that attempt to code/decode a block based on surrounding sample data and/or metadata acquired during the encoding and/or decoding of spatially adjacent blocks that precede, in decoding order, the block of data being intra-coded or decoded. Such techniques are hereafter referred to as “intra-prediction” techniques. It should be noted that, at least in some cases, intra-prediction uses only reference data from the current picture being reconstructed, and not reference data from other reference pictures.
[0010] Intra-prediction can take many different forms. When two or more such techniques are available in a given video coding technique, the technique in use may be called an intra-prediction mode. A particular codec may offer one or more intra-prediction modes. In particular, a mode may have submodes and / or be associated with various parameters, and mode / submode information and intra-coding parameters for a block of video may be coded individually or collectively contained in a mode codeword. Which codeword should be used for a given combination of mode, submode, and / or parameters may affect the coding efficiency gain by intra-prediction, and therefore may also affect the entropy coding technique used to convert the codeword into a bitstream.
[0011] Certain modes of intra-prediction were introduced in H.264, improved in H.265, and further refined in newer coding techniques such as the joint exploration model (JEM), versatile video coding (VVC), and benchmark set (BMS). Generally, in intra-prediction, predictor blocks can be formed using available neighboring sample values. For example, available values of a specific set of neighboring samples along a particular direction and / or line may be copied into the predictor block. References to the direction in use can be coded in the bitstream or may be predicted themselves.
[0012] Referring to Figure 1A, the lower right shows a subset of nine predictor directions from the 33 possible intra-predictor directions specified in H.265 (corresponding to the 33 angular modes of the 35 intra-modes specified in H.265). The point where the arrows converge (101) represents the sample being predicted. The arrows indicate the direction from which adjacent samples are used to predict the sample at (101). For example, arrow (102) indicates that sample (101) is predicted from one or more adjacent samples to the upper right, at an angle of 45 degrees from the horizontal. Similarly, arrow (103) indicates that sample (101) is predicted from one or more adjacent samples to the lower left, at an angle of 22.5 degrees from the horizontal.
[0013] Continuing with Figure 1A, a 4x4 sample square block (104) (indicated by a thick dashed line) is shown in the upper left. The square block (104) contains 16 samples, each labeled with "S" and its position in the Y dimension (e.g., row index) and its position in the X dimension (e.g., column index). For example, sample S21 is the second sample (from the top) in the Y dimension and the first sample (from the left) in the X dimension. Similarly, sample S44 is the fourth sample in block (104) in both the Y and X dimensions. Since the block is 4x4 samples in size, S44 is in the lower right. Further exemplary reference samples following a similar numbering scheme are shown. The reference samples are labeled with R and their Y position (e.g., row index) and X position (column index) relative to block (104). In both H.264 and H.265, predicted samples adjacent to the block being reconstructed are used.
[0014] Intra-picture prediction of block (104) may begin by copying reference sample values from adjacent samples according to the signaled prediction direction. For example, suppose the coded video bitstream includes signaling that, for this block (104), indicates the prediction direction of arrow (102), i.e., samples are predicted from one or more prediction samples to the upper right, at an angle of 45 degrees from the horizontal. In such a case, samples S41, S32, S23, and S14 are predicted from the same reference sample R05, and sample S44 is predicted from reference sample R08.
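The copy rule in this example can be sketched as follows. The index rule used here, that sample S(y, x) copies reference sample R0(x + y), is inferred from the two cases stated above (S41, S32, S23, and S14 from R05; S44 from R08) and is an assumption for illustration; the function name and reference values are hypothetical.

```python
def predict_diagonal_45(top_refs, size=4):
    """Predict a size x size block from the upper-right diagonal.

    top_refs[k] holds reference sample R0k (row 0, column k).
    Sample S(y, x), with 1-based row y and column x, copies R0(x + y).
    """
    return [[top_refs[x + y] for x in range(1, size + 1)]
            for y in range(1, size + 1)]

# Hypothetical reference row R00..R08 (arbitrary values).
top_refs = [10, 20, 30, 40, 50, 60, 70, 80, 90]
block = predict_diagonal_45(top_refs)

# block[y - 1][x - 1] is sample S(y, x).
print(block[3][0], block[2][1], block[1][2], block[0][3])  # all copy R05
print(block[3][3])                                         # copies R08
```

Every sample on the same anti-diagonal receives the same reference value, which is why S41, S32, S23, and S14 share R05.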
[0015] In certain cases, especially when the direction is not evenly divisible by 45 degrees, multiple reference sample values may be combined, for example, by interpolation, to calculate the reference sample.
[0016] As video coding technology continues to evolve, the number of possible directions has increased. For example, in H.264 (2003), nine different directions are available for intra-prediction. This increased to 33 in H.265 (2013), and as of the present disclosure, JEM/VVC/BMS can support up to 65 directions. Experimental research has been conducted to help identify the most appropriate intra-prediction directions, and certain techniques in entropy coding may be used to encode those most appropriate directions with a small number of bits, while accepting a certain bit penalty for other directions. Furthermore, the directions themselves may also be predicted from the directions used for intra-prediction of adjacent blocks that have already been decoded.
[0017] Figure 1B shows a schematic diagram (180) of 65 intra-prediction directions by JEM to illustrate the increasing number of prediction directions in various coding techniques developed over a long period of time.
[0018] The method for mapping bits representing intra-prediction directions to prediction directions in a coded video bitstream can vary depending on the video coding technique, ranging from simple direct mapping of prediction directions to intra-prediction modes to complex adaptive schemes involving codewords, most probable modes, and similar techniques. However, in all cases, there may be specific directions for intra-prediction that are statistically less likely to occur in video content than certain other directions. Since the goal of video compression is to reduce redundancy, these less likely directions may be represented by more bits than more likely directions in a well-designed video coding technique.
[0019] Inter-picture prediction, or inter-prediction, may be based on motion compensation. In motion compensation, sample data from a previously reconstructed picture or a portion of it (the reference picture) may be spatially shifted in the direction indicated by a motion vector (MV) and then used to predict the newly reconstructed picture or a portion of that picture (e.g., a block). In some cases, the reference picture may be the same as the picture currently being reconstructed. The MV may have two dimensions, X and Y, or three dimensions, the third of which indicates the reference picture being used (akin to a time dimension).
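A minimal sketch of this block fetch is given below. It assumes integer-sample motion vectors and clamps coordinates at the picture boundary as a simple stand-in for padding; the function name, the (dx, dy) convention, and the test picture are hypothetical.

```python
def motion_compensate(reference, top, left, height, width, mv):
    """Fetch the prediction block from `reference` at the block position
    (top, left) displaced by the integer motion vector mv = (dx, dy)."""
    dx, dy = mv
    rows, cols = len(reference), len(reference[0])
    pred = []
    for y in range(height):
        # Clamp so the fetch stays inside the picture (assumed padding rule).
        ry = min(max(top + y + dy, 0), rows - 1)
        row = []
        for x in range(width):
            rx = min(max(left + x + dx, 0), cols - 1)
            row.append(reference[ry][rx])
        pred.append(row)
    return pred

# Hypothetical 8x8 reference picture whose sample value encodes its position.
reference = [[r * 10 + c for c in range(8)] for r in range(8)]
pred = motion_compensate(reference, top=2, left=2, height=2, width=2, mv=(1, -1))
print(pred)  # [[13, 14], [23, 24]]
```

A third MV component selecting the reference picture would simply choose which `reference` array is passed in.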
[0020] In some video compression techniques, the current motion vector (MV) applicable to a particular area of sample data can be predicted from other MVs, for example, from MVs that relate to other areas of sample data that are spatially adjacent to the area being reconstructed and that precede the current MV in decoding order. Doing so can substantially reduce the overall amount of data required to code the MVs by removing redundancy among correlated MVs, thereby increasing compression efficiency. MV prediction can work effectively because, for example, when coding an input video signal derived from a camera (known as natural video), there is a statistical likelihood that an area larger than the area to which a single MV is applicable moves in a similar direction in the video sequence and can therefore, in some cases, be predicted using a similar motion vector derived from the MVs of adjacent areas. As a result, the actual MV of a given area will be similar or identical to the MV predicted from the surrounding MVs. Such an MV may be represented, after entropy coding, with fewer bits than would be used if the MV were coded directly rather than predicted from adjacent MVs. In some cases, MV prediction can be an example of lossless compression of a signal (i.e., the MVs) derived from the original signal (i.e., the sample stream). In other cases, for example, due to rounding errors when calculating the predictor from several surrounding MVs, the MV prediction itself may be lossy.
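The bit saving from MV prediction can be sketched with a toy predictor. The component-wise median used below is only one possible predictor and is an assumption for illustration; it is not stated to be the mechanism of H.265 or of this disclosure, and the function names and vectors are hypothetical.

```python
from statistics import median_low

def predict_mv(neighbor_mvs):
    """Component-wise median of neighboring MVs (an assumed predictor)."""
    xs = [mv[0] for mv in neighbor_mvs]
    ys = [mv[1] for mv in neighbor_mvs]
    return (median_low(xs), median_low(ys))

def encode_mv(actual_mv, neighbor_mvs):
    """Return the MV difference that would be entropy coded."""
    px, py = predict_mv(neighbor_mvs)
    return (actual_mv[0] - px, actual_mv[1] - py)

def decode_mv(mvd, neighbor_mvs):
    """Rebuild the MV from the same predictor plus the coded difference."""
    px, py = predict_mv(neighbor_mvs)
    return (mvd[0] + px, mvd[1] + py)

neighbors = [(4, -2), (5, -2), (4, -1)]  # MVs of already decoded areas
actual = (5, -2)
mvd = encode_mv(actual, neighbors)
print(mvd)                        # (1, 0): small values cost few bits
print(decode_mv(mvd, neighbors))  # (5, -2): reconstruction is exact here
```

Because encoder and decoder derive the same predictor from previously decoded MVs, only the small difference needs to be transmitted.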
[0021] H.265 / HEVC (ITU-T Rec. H.265, "High Efficiency Video Coding", December 2016) describes various MV prediction mechanisms. Of the many MV prediction mechanisms specified by H.265, the one described below is a technique called "spatial merging".
[0022] Specifically, referring to FIG. 2, the current block (201) contains samples that the encoder found, during the motion search process, to be predictable from a previous block of the same size that has been spatially shifted. Instead of directly coding that MV, the MV can be derived from metadata associated with one or more reference pictures, for example, from the most recent reference picture (in decoding order), using the MV associated with any one of five surrounding samples shown as A0, A1, and B0, B1, B2 (202 to 206, respectively). In H.265, MV prediction can use predictors from the same reference picture that adjacent blocks use.
[0023] AOMedia Video 1 (AV1) is an open video coding format designed for video transmission over the Internet. It was developed as a successor to VP9 by building on VP9's codebase and incorporating additional techniques. The AV1 bitstream specification includes a reference video codec, such as the H.265 or High Efficiency Video Coding (HEVC) standard or Versatile Video Coding (VVC).
SUMMARY OF THE INVENTION
[0024] Embodiments of the present disclosure provide a method and apparatus for adaptively reordering reference frames in video coding. Template matching (TM) can be used to reorder the reference frames or pairs of reference frames for each block by comparing the template of the current block with the template of a reference block, where the reference block is indicated by spatial reference motion information (or spatial motion vectors) and/or temporal reference motion information (or temporal motion vectors). The difference between the template of the current block and the template of the reference block is calculated for each item of spatial reference motion information and/or temporal reference motion information and can be recorded as a score value for the associated reference frame or pair of reference frames. The available reference frames or pairs of reference frames are ranked based on the score values.
[0025] In one embodiment, a method is provided for reordering reference frames for each block by template matching (TM), the method comprising the steps of: comparing the template of the current block with the template of a reference block for motion information, wherein the motion information includes spatial reference motion information or temporal reference motion information; calculating the difference between the template of the current block and the template of the reference block; determining the score value of the associated reference frame based on the calculated difference; and reordering the reference frames based on the determined scores. The reference frames further include reference frame pairs. TM includes decoder-side motion vector derivation for refining the motion information of the current block. The reordering step further includes ranking the available reference frames based on the score value of each reference frame. When the score values of multiple reference frames are equal, the ranking follows the scanning order of the spatial reference motion information or temporal reference motion information. Alternatively, when the score values of multiple reference frames are equal, the ranking is based on how frequently these reference frames are used in the spatial reference motion information or temporal reference motion information. The calculating step uses at least one of the following: Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD), Mean Squared Error (MSE), or Sum of Absolute Transformed Differences (SATD). The template includes neighboring blocks above or to the left. Spatial reference motion information includes one or more spatial motion vectors. Temporal reference motion information includes one or more temporal motion vectors. When multiple motion vectors point to one of the reference frames, the one with the smallest difference is used to determine the score value. When no motion vector points to one of the reference frames, the score value is set to the maximum allowable value. The allowed unidirectional and bidirectional compound reference frames are ranked together by using TM, and the index of the reference frame for the current block is signaled in the bitstream. The allowed single reference frames are ranked together by using TM, and the index of the reference frame for the current block is signaled in the bitstream.
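The scoring and ranking steps of this embodiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: SAD is chosen as the difference measure, a smaller score is assumed to rank first, ties follow scanning order, templates are flattened into lists, and all names (`rank_reference_frames`, `MAX_SCORE`, the frame labels) are hypothetical.

```python
MAX_SCORE = float("inf")  # stand-in for the "maximum allowable value"

def sad(a, b):
    """Sum of Absolute Differences between two equally sized templates."""
    return sum(abs(p - q) for p, q in zip(a, b))

def rank_reference_frames(cur_template, candidates, allowed_frames):
    """Rank reference frames by template-matching score.

    candidates: (ref_frame, ref_template) pairs, one per spatial or temporal
    motion vector, listed in scanning order.
    """
    scores = {frame: MAX_SCORE for frame in allowed_frames}
    first_seen = {}
    for order, (frame, ref_template) in enumerate(candidates):
        if frame not in scores:
            continue
        first_seen.setdefault(frame, order)
        # Several MVs may point to one frame: keep the smallest difference.
        scores[frame] = min(scores[frame], sad(cur_template, ref_template))
    # Assumed ordering: lower score first, ties broken by scanning order.
    return sorted(allowed_frames,
                  key=lambda f: (scores[f], first_seen.get(f, len(candidates))))

cur_template = [100, 102, 98, 101]
candidates = [
    ("ref0", [100, 100, 100, 100]),  # SAD 5
    ("ref2", [90, 95, 99, 100]),     # SAD 19
    ("ref0", [101, 102, 98, 101]),   # SAD 1, becomes the score of ref0
    ("ref3", [99, 102, 98, 101]),    # SAD 1, ties with ref0
]
ranking = rank_reference_frames(cur_template, candidates,
                                ["ref0", "ref1", "ref2", "ref3"])
print(ranking)  # ['ref0', 'ref3', 'ref2', 'ref1']
```

Here `ref1` has no motion vector pointing to it and therefore receives the maximum score and ranks last. The frequency-based tie-break described above could replace `first_seen` with a count of how often each frame appears in `candidates`.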
[0026] In some other embodiments, a device for processing video information is disclosed. The device may include circuitry configured to execute any one of the above method implementations.
[0027] Embodiments of the present disclosure also provide a non-transitory computer-readable medium storing instructions that, when executed by a computer for video decoding and / or encoding, cause the computer to execute a method for video decoding and / or encoding.
Brief Description of the Drawings
[0028] Further features, properties, and various advantages of the disclosed subject matter will become more apparent from the following detailed description and the accompanying drawings.
[Figure 1A] A schematic diagram of an exemplary subset of intra-prediction direction modes is shown.
[Figure 1B] A diagram of exemplary intra-prediction directions is shown.
[Figure 2] A schematic diagram of a current block and its surrounding spatial merge candidates for motion vector prediction in one example is shown.
[Figure 3] A simplified block diagram of a communication system (300) according to an exemplary embodiment is shown.
[Figure 4] A simplified block diagram of a communication system (400) according to an exemplary embodiment is shown.
[Figure 5] A simplified block diagram of a video decoder according to an exemplary embodiment is shown.
[Figure 6] A simplified block diagram of a video encoder according to an exemplary embodiment is shown.
[Figure 7] A block diagram of a video encoder according to another exemplary embodiment is shown.
[Figure 8] A block diagram of a video decoder according to another exemplary embodiment is shown.
[Figure 9] A coding block partitioning scheme according to an exemplary embodiment of this disclosure is shown.
[Figure 10] Another coding block partitioning scheme according to an exemplary embodiment of this disclosure is shown.
[Figure 11] Another coding block partitioning scheme according to an exemplary embodiment of this disclosure is shown.
[Figure 12] An exemplary partitioning of a base block into coding blocks is shown.
[Figure 13] An example of a ternary partitioning scheme is shown.
[Figure 14] An example of a quadtree binary tree coding block partitioning scheme is shown.
[Figure 15] A scheme for partitioning a coding block into multiple transform blocks, and a coding order of the transform blocks, according to an exemplary embodiment of this disclosure is shown.
[Figure 16] Another scheme for partitioning a coding block into multiple transform blocks, and a coding order of the transform blocks, according to an exemplary embodiment of this disclosure is shown.
[Figure 17] Another scheme for partitioning a coding block into multiple transform blocks according to an exemplary embodiment of this disclosure is shown.
[Figure 18] An exemplary partition tree for block partitioning is shown.
[Figure 19] Exemplary partitions and trees for a quadtree plus binary tree (QTBT) structure are shown.
[Figure 20] An exemplary ternary tree partitioning is shown.
[Figure 21] Exemplary template matching (TM) on a search area around the initial motion vector (MV) is shown.
[Figure 22] An example of decoder-side motion vector refinement is shown.
[Figure 23] An exemplary spatial motion vector search pattern is shown.
[Figure 24] A flowchart of the method according to an exemplary embodiment of this disclosure is shown.
[Figure 25] A schematic diagram of a computer system according to an exemplary embodiment of the present disclosure is shown.
[Modes for carrying out the invention]
[0029] Throughout this specification and the claims, terms may have nuances implied or suggested in context beyond their expressly stated meanings. As used herein, “in one embodiment” or “in some embodiments” does not necessarily refer to the same embodiment, and as used herein, “in another embodiment” or “in other embodiments” does not necessarily refer to a different embodiment. Similarly, as used herein, “in one implementation” or “in some implementations” does not necessarily refer to the same implementation, and as used herein, “in another implementation” or “in other implementations” does not necessarily refer to a different implementation. For example, the claimed subject matter is intended to include, in whole or in part, exemplary combinations of embodiments / implementations.
[0030] In general, terms can be understood at least partially from their usage in context. For example, terms such as “and,” “or,” or “and / or” as used herein may have a variety of meanings that may at least partially depend on the context in which such terms are used. Typically, when “or” is used to relate a list such as A, B, or C, it is intended to mean A, B, and C (used here in an inclusive sense) as well as A, B, or C (used here in an exclusive sense). In addition, the terms “one or more” or “at least one” as used herein may be used at least partially contextually to describe any feature, structure, or characteristic in a singular sense, or to describe a combination of features, structures, or characteristics in a plural sense. Similarly, terms such as “a,” “an,” or “the” may, in this case as well, be understood to convey a singular usage or a plural usage, at least partially contextually. Furthermore, the terms "based on" or "determined by" are not necessarily intended to convey an exclusive set of factors; rather, it should be understood that, in this case as well, they may allow for the presence of additional factors that are not necessarily explicitly stated, at least in part depending on the context.
[0031] Figure 3 shows a simplified block diagram of a communication system (300) according to one embodiment of the present disclosure. The communication system (300) includes a plurality of terminal devices that can communicate with each other, for example, over a network (350). For example, the communication system (300) includes a first pair of terminal devices (310) and (320) interconnected over the network (350). In the example of Figure 3, the first pair of terminal devices (310) and (320) may perform unidirectional transmission of data. For example, terminal device (310) may code video data (for example, a stream of video pictures captured by terminal device (310)) for transmission to the other terminal device (320) over the network (350). The encoded video data may be transmitted in the form of one or more encoded video bitstreams. The terminal device (320) can receive coded video data from the network (350), decode the coded video data to restore a video picture, and display the video picture according to the restored video data. Unidirectional data transmission can be implemented in media serving applications and the like.
[0032] In another example, the communication system (300) includes a second pair of terminal devices (330) and (340) that perform bidirectional transmission of coded video data, which may be implemented, for example, in a video conferencing application. For bidirectional transmission of data, in one example, each terminal device of terminal devices (330) and (340) may code video data (e.g., a stream of video pictures captured by the terminal device) for transmission to the other terminal device of terminal devices (330) and (340) over the network (350). Each terminal device of terminal devices (330) and (340) may also receive coded video data transmitted by the other terminal device of terminal devices (330) and (340), decode the coded video data to restore video pictures, and display the video pictures on an accessible display device according to the restored video data.
[0033] In the example in Figure 3, terminal devices (310), (320), (330), and (340) may be implemented as servers, personal computers, and smartphones, but the applicability of the fundamental principles of this disclosure is not limited thereto. Embodiments of this disclosure may be implemented in desktop computers, laptop computers, tablet computers, media players, wearable computers, dedicated video conferencing equipment, and / or similar devices. Network (350) represents any number or type of network that transmits coded video data between terminal devices (310), (320), (330), and (340), including, for example, wireline (wired) communication networks and / or wireless communication networks. Communication network (350) may exchange data via circuit switching, packet switching, and / or other types of channels. Typical networks include telecommunication networks, local area networks, wide area networks, and / or the Internet. For the purposes of this description, the architecture and topology of network (350) may not be important to the operation of this disclosure unless expressly described herein.
[0034] Figure 4 shows an example of an application of the disclosed subject matter, illustrating the arrangement of a video encoder and video decoder in a video streaming environment. The disclosed subject matter may also be equally applicable to other video applications, such as video conferencing, digital television broadcasting, games, virtual reality, and the storage of compressed video on digital media including CDs, DVDs, and memory sticks.
[0035] A video streaming system may include a video capture subsystem (413), which may include a video source (401), such as a digital camera, for creating a stream (402) of uncompressed video pictures or images. In one example, the stream (402) of video pictures includes samples recorded by the digital camera of the video source (401). The stream (402) of video pictures, shown in bold to emphasize its high data volume compared to encoded video data (404) (or encoded video bitstream), can be processed by an electronic device (420) which includes a video encoder (403) coupled to the video source (401). The video encoder (403) may include hardware, software, or a combination thereof to enable or implement aspects of the disclosed subject matter, as will be described in more detail below. The encoded video data (404) (or encoded video bitstream (404)), shown in thin lines to emphasize its lower data volume compared to the uncompressed video picture stream (402), may be stored on a streaming server (405) for future use or sent directly to a downstream video device (not shown). One or more streaming client subsystems, such as client subsystems (406) and (408) in Figure 4, can access the streaming server (405) to retrieve copies (407) and (409) of the encoded video data (404). The client subsystem (406) may include, for example, a video decoder (410) within an electronic device (430). The video decoder (410) decodes the incoming copy (407) of the encoded video data to create an outgoing stream (411) of video pictures that is uncompressed and can be rendered on a display (412) (e.g., a display screen) or other rendering device (not shown). The video decoder (410) may be configured to perform some or all of the various functions described herein. In some streaming systems, encoded video data (404), (407), and (409) (e.g., video bitstreams) may be encoded according to specific video coding/compression standards. Examples of these standards include ITU-T Recommendation H.265. For example, a video coding standard under development is informally known as Versatile Video Coding (VVC). The disclosed subject matter may be used in the context of VVC and other video coding standards.
[0036] It should be noted that electronic devices (420) and (430) may include other components (not shown). For example, electronic device (420) may include a video decoder (not shown), and electronic device (430) may also include a video encoder (not shown).
[0037] Figure 5 shows a block diagram of a video decoder (510) according to any embodiment of the present disclosure described below. The video decoder (510) may be included in an electronic device (530). The electronic device (530) may include a receiver (531) (e.g., a receiving circuit). The video decoder (510) can be used in place of the video decoder (410) in the example of Figure 4.
[0038] The receiver (531) may receive one or more coded video sequences to be decoded by the video decoder (510). In the same or another embodiment, one coded video sequence may be decoded at a time, where the decoding of each coded video sequence is independent of other coded video sequences. Each video sequence may be associated with multiple video frames or images. Coded video sequences may be received from a channel (501), which may be a storage device storing coded video data or a hardware / software link to a streaming source transmitting coded video data. The receiver (531) may receive coded video data together with other data, such as coded audio data and / or auxiliary data streams, which may be transferred to their respective processing circuits (not shown). The receiver (531) may isolate coded video sequences from other data. To counteract network jitter, a buffer memory (515) may be placed between the receiver (531) and the entropy decoder / parser (520) (hereinafter, "Parser (520)"). In certain applications, the buffer memory (515) may be implemented as part of the video decoder (510). In other applications, it may be outside the video decoder (510) and separate from it (not shown). In yet other applications, for example, to counter network jitter, a buffer memory (not shown) may exist outside the video decoder (510), and for example, to handle playback timing, another additional buffer memory (515) may exist inside the video decoder (510). When the receiver (531) is receiving data from a storage / transfer device with sufficient bandwidth and controllability, or from an isosynchronous network, the buffer memory (515) may not be needed or may be small. 
For use over best-effort packet networks such as the Internet, a sufficiently sized buffer memory (515) may be required, and its size may be relatively large. Such buffer memory may be implemented with an adaptive size and may be implemented at least partially in an operating system or similar element (not shown) outside the video decoder (510).
[0039] The video decoder (510) may include a parser (520) for reconstructing symbols (521) from the coded video sequence. The categories of these symbols include information used to manage the operation of the video decoder (510) and, optionally, information for controlling a rendering device such as a display (512) (e.g., a display screen), which may or may not be an integral part of the electronic device (530) but may be coupled to it, as shown in Figure 5. The control information for the rendering device(s) may take the form of supplemental enhancement information (SEI) messages or video usability information (VUI) parameter set fragments (not shown). The parser (520) may parse / entropy decode the coded video sequence that it receives. The entropy coding of the coded video sequence may follow a video coding technique or standard and may follow various principles, including variable-length coding, Huffman coding, and arithmetic coding with or without context sensitivity. The parser (520) may extract from the coded video sequence a set of subgroup parameters for at least one of the subgroups of pixels in the video decoder, based on at least one parameter corresponding to the subgroup. Subgroups may include groups of pictures (GOPs), pictures, tiles, slices, macroblocks, coding units (CUs), blocks, transform units (TUs), and prediction units (PUs). The parser (520) may also extract from the coded video sequence information such as transform coefficients (e.g., Fourier transform coefficients), quantizer parameter values, and motion vectors.
[0040] The parser (520) may perform an entropy decoding / parsing operation on the video sequence received from the buffer memory (515) in order to create the symbols (521).
[0041] Reconstruction of the symbols (521) may involve several different processing or functional units, depending on the type of the coded video picture or part thereof (such as inter and intra picture, inter and intra block) and other factors. Which units are involved, and how, may be controlled by subgroup control information parsed from the coded video sequence by the parser (520). The flow of such subgroup control information between the parser (520) and the processing or functional units described below is not shown for simplicity.
[0042] Beyond the functional blocks already described, the video decoder (510) can be conceptually subdivided into several functional units, as described below. In actual implementations operating under commercial constraints, many of these functional units can interact closely with each other and integrate with each other at least partially. However, for the purpose of clearly illustrating the various functions of the disclosed subject, the conceptual subdivision into functional units is adopted in the following disclosure.
[0043] The first unit may include a scaler / inverse transform unit (551). The scaler / inverse transform unit (551) may receive quantized transform coefficients as well as control information, which may include information indicating which type of inverse transform to use, block size, quantization factors / parameters, quantization scaling matrices, and the like, as symbols (521) from the parser (520). The scaler / inverse transform unit (551) may output blocks containing sample values that can be input to the aggregator (555).
[0044] In some cases, the output samples of the scaler / inverse transform unit (551) may relate to intracoded blocks, i.e., blocks that do not use prediction information from previously reconstructed pictures but can use prediction information from previously reconstructed portions of the current picture. Such prediction information can be provided by the intrapicture prediction unit (552). In some cases, the intrapicture prediction unit (552) may generate a block of the same size and shape as the block being reconstructed, using surrounding block information that has already been reconstructed and is stored in the current picture buffer (558). The current picture buffer (558) buffers, for example, the partially reconstructed current picture and / or the fully reconstructed current picture. In some implementations, the aggregator (555) may, sample by sample, add the prediction information generated by the intrapicture prediction unit (552) to the output sample information provided by the scaler / inverse transform unit (551).
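By way of non-limiting illustration, the sample-by-sample addition performed by the aggregator (555) for an intracoded block may be sketched in Python as follows. The DC-style predictor (mean of the already reconstructed neighboring samples), the rounding rule, the 8-bit clipping range, and the names `dc_predict` and `aggregate` are assumptions made for this sketch only; the disclosure does not limit the intrapicture prediction unit (552) to this mode.

```python
def dc_predict(top, left, size):
    """Predict a size x size block as the rounded mean of reconstructed neighbours.

    `top` and `left` are the neighbouring samples above and to the left of the
    block (assumed to come from the current picture buffer).
    """
    neighbours = list(top) + list(left)
    dc = (sum(neighbours) + len(neighbours) // 2) // len(neighbours)
    return [[dc] * size for _ in range(size)]


def aggregate(prediction, residual, bit_depth=8):
    """Add prediction and residual sample by sample, clipping to the sample range."""
    max_value = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


prediction = dc_predict(top=[10, 10], left=[20, 20], size=2)
reconstructed = aggregate(prediction, [[1, -1], [0, 250]])
```

In this sketch the residual block stands in for the output of the scaler / inverse transform unit (551), and the clipping step keeps reconstructed samples within the assumed bit depth.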
[0045] In other cases, the output samples of the scaler / inverse transform unit (551) may be associated with an intercoded and possibly motion-compensated block. In such cases, the motion-compensated prediction unit (553) can access the reference picture memory (557) to fetch samples to be used for interpicture prediction. After motion-compensating the fetched samples according to the symbols (521) associated with the block, these samples may be added by the aggregator (555) to the output of the scaler / inverse transform unit (551) (the output of unit (551) may be called residual samples or a residual signal) to generate output sample information. The addresses in the reference picture memory (557) from which the motion-compensated prediction unit (553) fetches the prediction samples can be controlled by motion vectors, available to the motion-compensated prediction unit (553) in the form of symbols (521) that may have, for example, X and Y components (shift) and a reference picture component (time). Motion compensation may also involve interpolation of sample values fetched from the reference picture memory (557) when sub-sample-accurate motion vectors are used, and may be associated with motion vector prediction mechanisms, etc.
[0046] The output samples of the aggregator (555) can undergo various loop filtering techniques in the loop filter unit (556). The video compression technique may include in-loop filtering techniques controlled by parameters contained in the coded video sequence (also called coded video bitstream) and made available to the loop filter unit (556) as symbols (521) from the parser (520), but may also respond to metadata obtained during decoding of previous parts (in decoding order) of the coded picture or coded video sequence, and may also respond to previously reconstructed and loop-filtered sample values. Several types of loop filters may be included as part of the loop filter unit (556) in various orders, as will be described in more detail below.
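As a simplified, non-limiting sketch of the motion-compensated fetch described above, the following Python example copies a prediction block from a reference picture at a position displaced by an integer motion vector and adds the residual. The function names, the restriction to integer-sample vectors (no interpolation), and the clamping of out-of-range coordinates to the picture edge are assumptions of this sketch and are not taken from the disclosure.

```python
def fetch_prediction(reference, x, y, motion_vector, width, height):
    """Fetch a width x height block from `reference` displaced by (dx, dy).

    Coordinates outside the picture are clamped to the nearest edge sample
    (an assumed padding rule for this sketch).
    """
    dx, dy = motion_vector
    rows, cols = len(reference), len(reference[0])
    block = []
    for j in range(height):
        ref_y = min(max(y + dy + j, 0), rows - 1)
        row = []
        for i in range(width):
            ref_x = min(max(x + dx + i, 0), cols - 1)
            row.append(reference[ref_y][ref_x])
        block.append(row)
    return block


def add_residual(prediction, residual):
    """Sample-by-sample sum of the prediction block and the residual block."""
    return [
        [p + r for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


# A 4x4 reference picture whose sample at row r, column c equals 4*r + c.
reference_picture = [[4 * r + c for c in range(4)] for r in range(4)]
predicted = fetch_prediction(reference_picture, 0, 0, (1, 1), 2, 2)
output = add_residual(predicted, [[1, 0], [0, -1]])
```

Here the motion vector (1, 1) shifts the fetch position one sample right and one sample down, so the prediction block is taken from rows 1 to 2 and columns 1 to 2 of the reference picture.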
[0047] The output of the loop filter unit (556) may be a sample stream that is output to the rendering device (512) and can also be stored in the reference picture memory (557) for use in future interpicture prediction.
[0048] A particular coded picture, once fully reconstructed, can be used as a reference picture for future interpicture prediction. For example, once the coded picture corresponding to the current picture is fully reconstructed and has been identified as a reference picture (e.g., by the parser (520)), the current picture buffer (558) can become part of the reference picture memory (557), and a fresh current picture buffer can be allocated before the reconstruction of the next coded picture begins.
[0049] The video decoder (510) may perform decoding operations according to a predetermined video compression technique adopted in standards such as ITU-T Recommendation H.265. The coded video sequence may conform to the syntax specified by the video compression technique or standard being used, in the sense that the coded video sequence conforms to both the syntax of the video compression technique or standard and the profiles documented in the video compression technique or standard. Specifically, a profile may select certain tools from all the tools available in the video compression technique or standard as the only tools available for use under that profile. To conform to the standard, the complexity of the coded video sequence may be within the range defined by the level of the video compression technique or standard. In some cases, the level limits the maximum picture size, maximum frame rate, maximum reconstruction sample rate (e.g., measured in megasamples / second), maximum reference picture size, etc. The limitations set by the level may, in some cases, be further restricted through the Hypothetical Reference Decoder (HRD) specification and metadata for HRD buffer management signaled in the coded video sequence.
[0050] In some exemplary embodiments, the receiver (531) may receive additional (redundant) data along with the encoded video. The additional data may be included as part of the encoded video sequence(s). The additional data may be used by the video decoder (510) to properly decode the data and / or to more accurately reconstruct the original video data. The additional data may take the form of, for example, a temporal, spatial, or signal-to-noise ratio (SNR) enhancement layer, redundant slices, redundant pictures, or forward error correction codes.
[0051] Figure 6 shows a block diagram of a video encoder (603) according to an exemplary embodiment of the present disclosure. The video encoder (603) may be included in an electronic device (620). The electronic device (620) may further include a transmitter (640) (e.g., a transmitting circuit). The video encoder (603) can be used in place of the video encoder (403) in the example of Figure 4.
[0052] The video encoder (603) may receive video samples from a video source (601) (not part of the electronic device (620) in the example in Figure 6) that can capture one or more video images to be coded by the video encoder (603). In another example, the video source (601) may be implemented as part of the electronic device (620).
[0053] The video source (601) may provide a source video sequence to be coded by the video encoder (603) in the form of a digital video sample stream, which may be of any suitable bit depth (e.g., 8-bit, 10-bit, 12-bit, ...), any color space (e.g., BT.601 YCrCb, RGB, XYZ, ...), and any suitable sampling structure (e.g., YCrCb 4:2:0, YCrCb 4:4:4). In a media serving system, the video source (601) may be a storage device capable of storing previously prepared video. In a video conferencing system, the video source (601) may be a camera that captures local image information as a video sequence. The video data may be provided as a series of individual pictures or images that give motion when viewed sequentially. The picture itself may be organized as a spatial array of pixels, where each pixel may contain one or more samples depending on the sampling structure, color space, etc., used. Those skilled in the art will readily understand the relationship between pixels and samples. The following description will focus on samples.
[0054] According to some exemplary embodiments, the video encoder (603) may encode pictures of a source video sequence in real time or under any other time constraints required by the application and compress them into an encoded video sequence (643). Enforcing an appropriate coding speed constitutes one function of the controller (650). In some embodiments, the controller (650) may be functionally coupled to and control other functional units, as described below. The couplings are not shown for simplicity. Parameters set by the controller (650) may include rate control-related parameters (picture skip, quantizer, lambda value of rate-distortion optimization techniques, ...), picture size, group of pictures (GOP) layout, maximum motion vector search range, etc. The controller (650) may be configured to have other appropriate functions related to the video encoder (603) optimized for a particular system design.
[0055] In some exemplary embodiments, the video encoder (603) may be configured to operate in a coding loop. As a highly simplified explanation, in one example, the coding loop may include a source coder (630) (responsible for creating symbols, such as a symbol stream, based, for example, on an input picture to be coded and reference picture(s)) and a (local) decoder (633) embedded in the video encoder (603). The decoder (633) reconstructs the symbols to create sample data in a manner similar to that of a (remote) decoder, even though the embedded decoder (633) processes the coded video stream from the source coder (630) without entropy coding (since any compression between the symbols and the coded video bitstream in entropy coding can be lossless in the video compression techniques considered in the disclosed subject matter). The reconstructed sample stream (sample data) is input to a reference picture memory (634). Since the decoding of the symbol stream yields bit-exact results regardless of the decoder's location (local or remote), the contents of the reference picture memory (634) are also bit-exact between the local encoder and the remote decoder. In other words, the prediction portion of the encoder "sees" as reference picture samples exactly the same sample values that the decoder would "see" when using prediction during decoding. This fundamental principle of reference picture synchronization (and the resulting drift if synchronization cannot be maintained due to, for example, channel errors) is used to improve coding quality.
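The reference picture synchronization principle described above may be illustrated with a deliberately reduced, one-dimensional Python sketch. Each sample is predicted from the previously reconstructed sample (not from the original sample), so the encoder's embedded reconstruction and the decoder's reconstruction remain identical. The uniform quantizer, the step size, and the function names are assumptions chosen for this sketch and do not appear in the disclosure.

```python
def encode_closed_loop(samples, step):
    """Quantize prediction errors against the locally reconstructed reference."""
    levels = []
    reference = 0  # both sides are assumed to start from the same reference value
    for sample in samples:
        level = round((sample - reference) / step)
        levels.append(level)
        # Embedded "local decoder": update the reference exactly as a decoder would.
        reference += level * step
    return levels


def decode_levels(levels, step):
    """Remote decoder: rebuild samples from the quantized prediction errors."""
    reconstructed = []
    reference = 0
    for level in levels:
        reference += level * step
        reconstructed.append(reference)
    return reconstructed


source = [10, 13, 19, 18, 40, 41]
step_size = 4
decoded = decode_levels(encode_closed_loop(source, step_size), step_size)
# Because prediction uses reconstructed values, the error does not accumulate:
# every decoded sample stays within half a quantization step of its source sample.
max_error = max(abs(s - d) for s, d in zip(source, decoded))
```

Had the encoder predicted from the original samples instead, the quantization errors would add up from sample to sample at the decoder, which corresponds to the drift mentioned above.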
[0056] The operation of the “local” decoder (633) may be the same as that of a “remote” decoder, such as the video decoder (510), which has already been described in detail above in relation to Figure 5. However, also briefly referring to Figure 5, since symbols are available and the encoding / decoding of symbols to the encoded video sequence by the entropy coder (645) and parser (520) may be lossless, the entropy decoding portion of the video decoder (510), including the buffer memory (515) and parser (520), may not be fully implemented within the local decoder (633) in the encoder.
[0057] An observation that can be made at this point is that any decoder technique, excluding parsing / entropy decoding which can only exist within the decoder, must also necessarily exist within the corresponding encoder in substantially the same functional form. For this reason, the subject matter disclosed may sometimes focus on the decoder operation related to the decoding portion of the encoder. Therefore, the description of encoder techniques can be omitted, as it is the inverse of the comprehensively described decoder techniques. A more detailed description of encoders is provided below only in specific areas or embodiments.
[0058] In operation, in some exemplary implementations, the source coder (630) may perform motion-compensated predictive coding, which predictively codes an input picture by referencing one or more previously coded pictures from the video sequence that are designated as “reference pictures”. In this way, the coding engine (632) codes the difference (or residual) in the color channels between the pixel blocks of the input picture and the pixel blocks of the reference picture(s) that may be selected as prediction reference(s) for the input picture. The term “residue” and its adjective form “residual” may be used interchangeably.
[0059] The local video decoder (633) can decode the coded video data of a picture that may be designated as a reference picture based on symbols created by the source coder (630). The operation of the coding engine (632) can, advantageously, be a lossy process. When the coded video data can be decoded by a video decoder (not shown in Figure 6), the reconstructed video sequence may typically be a replica of the source video sequence with some errors. The local video decoder (633) can replicate the decoding process that may be performed by the video decoder on the reference picture and store the reconstructed reference picture in the reference picture cache (634). In this way, the video encoder (603) can locally store a copy of the reconstructed reference picture that has content common with the reconstructed reference picture obtained by the far-end (remote) video decoder (without transmission errors).
[0060] The predictor (635) may perform prediction searches for the coding engine (632). That is, for a new picture to be coded, the predictor (635) may search the reference picture memory (634) for sample data (as candidate reference pixel blocks) or specific metadata such as reference picture motion vectors, block shapes, etc., which can serve as an appropriate prediction reference for the new picture. The predictor (635) may operate on a block-by-block basis to find appropriate prediction references. In some cases, the input picture may have prediction references drawn from multiple reference pictures stored in the reference picture memory (634), as determined by the search results obtained by the predictor (635).
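One way such a prediction search could be carried out is an exhaustive block-matching search, sketched below in Python under stated assumptions. The sum of absolute differences (SAD) cost, the square search window, the skipping of candidates that fall outside the reference picture, and all function names are choices made for this illustration; the disclosure does not prescribe a particular search strategy for the predictor (635).

```python
def block_at(picture, x, y, size):
    """Return the size x size block whose top-left sample is at column x, row y."""
    return [row[x:x + size] for row in picture[y:y + size]]


def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def motion_search(current, reference, x, y, size, search_range):
    """Exhaustively test integer displacements and keep the lowest-SAD candidate."""
    target = block_at(current, x, y, size)
    best_vector, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ref_x, ref_y = x + dx, y + dy
            # Skip candidates that are not fully inside the reference picture.
            if ref_x < 0 or ref_y < 0:
                continue
            if ref_x + size > len(reference[0]) or ref_y + size > len(reference):
                continue
            cost = sad(target, block_at(reference, ref_x, ref_y, size))
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost
```

The returned pair is the displacement (dx, dy) of the best-matching candidate reference block and its SAD cost; a practical encoder would typically weigh this cost against the bits needed to signal the vector.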
[0061] The controller (650) may manage the coding operations of the source coder (630), including, for example, setting parameters and subgroup parameters used to encode video data.
[0062] The outputs of all the aforementioned functional units can undergo entropy coding in the entropy coder (645). The entropy coder (645) converts the symbols generated by the various functional units into coded video sequences by lossless compression of the symbols according to techniques such as Huffman coding, variable-length coding, and arithmetic coding.
[0063] The transmitter (640) may buffer the coded video sequence(s) created by the entropy coder (645) in preparation for transmission over a communication channel (660), which may be a hardware / software link to a storage device that stores coded video data. The transmitter (640) may merge the coded video data from the video encoder (603) with other data to be transmitted, such as coded audio data and / or auxiliary data streams (sources not shown).
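As a non-limiting illustration of lossless symbol compression, the following Python sketch builds a Huffman code for a short symbol sequence and shows that decoding recovers the symbols exactly. It is a textbook construction using the standard `heapq` and `collections` modules, not the entropy coder (645) of any particular standard; the function names and the representation of the bitstream as a string of "0" and "1" characters are assumptions of the sketch.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Return a dict mapping each distinct symbol to a prefix-free bit string."""
    frequencies = Counter(symbols)
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    # Heap entries: (weight, tie-breaker, partial code table for that subtree).
    heap = [(count, index, {symbol: ""})
            for index, (symbol, count) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_index = len(heap)
    while len(heap) > 1:
        weight0, _, table0 = heapq.heappop(heap)
        weight1, _, table1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in table0.items()}
        merged.update({s: "1" + code for s, code in table1.items()})
        heapq.heappush(heap, (weight0 + weight1, next_index, merged))
        next_index += 1
    return heap[0][2]


def huffman_encode(symbols, code):
    return "".join(code[s] for s in symbols)


def huffman_decode(bits, code):
    inverse = {bit_string: symbol for symbol, bit_string in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first match is the symbol
            decoded.append(inverse[current])
            current = ""
    return decoded


symbol_stream = [0, 0, 0, 0, 1, 1, 2, -1]
code_table = build_huffman_code(symbol_stream)
bitstream = huffman_encode(symbol_stream, code_table)
```

More frequent symbols receive shorter code words, and because the code is prefix-free the decoder can recover the original symbol stream without loss.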
[0064] The controller (650) may manage the operation of the video encoder (603). During coding, the controller (650) may assign a specific coded picture type to each coded picture, which may affect the coding technique that can be applied to each picture. For example, a picture may often be assigned as one of the following picture types:
[0065] An intra-picture (I-picture) may be one that can be coded and decoded without using other pictures in the sequence as a source of prediction. Some video codecs enable different types of intra-pictures, including, for example, independent decoder refresh ("IDR") pictures. Those skilled in the art are familiar with these variants of I-pictures, as well as their respective uses and characteristics.
[0066] A predictive picture (P-picture) may be one that can be coded and decoded using intra-prediction or inter-prediction, which uses at most one motion vector and a reference index to predict the sample value of each block.
[0067] A bidirectional predictive picture (B-picture) may be one that can be coded and decoded using intra-prediction or inter-prediction, which uses at most two motion vectors and reference indices to predict the sample values of each block. Similarly, a multiple predictive picture may use more than two reference pictures and associated metadata for the reconstruction of a single block.
[0068] A source picture can generally be spatially subdivided into multiple sample coding blocks (e.g., blocks of 4x4, 8x8, 4x8, or 16x16 samples each) and coded on a block-by-block basis. Blocks can be predictively coded by referencing other (already coded) blocks, as determined by the coding assignment applied to the blocks' respective pictures. For example, a block of an I-picture can be coded non-predictively or predictively by referencing already coded blocks of the same picture (spatial prediction or intra-prediction). A pixel block of a P-picture can be predictively coded via spatial prediction or temporal prediction by referencing one previously coded reference picture. A block of a B-picture can be predictively coded via spatial prediction or temporal prediction by referencing one or two previously coded reference pictures. A source picture or an intermediate picture can be subdivided into other types of blocks for other purposes. The subdivision of coding blocks and other types of blocks may or may not follow the same method, as will be described in more detail below.
[0069] The video encoder (603) may perform coding operations in accordance with a specified video coding technique or standard, such as ITU-T Recommendation H.265. In this operation, the video encoder (603) may perform various compression operations, including predictive coding operations that utilize temporal and spatial redundancy in the input video sequence. Thus, the coded video data may conform to the syntax specified by the video coding technique or standard being used.
[0070] In some exemplary embodiments, the transmitter (640) may transmit additional data along with the encoded video. The source coder (630) may include such data as part of the encoded video sequence. The additional data may include temporal / spatial / SNR enhancement layers, other forms of redundant data such as redundant pictures and slices, SEI messages, VUI parameter set fragments, and the like.
[0071] Video can be captured as multiple source pictures (video pictures) in a time series. Intra-picture prediction (often abbreviated as intra-prediction) utilizes spatial correlations within a given picture, while inter-picture prediction utilizes temporal or other correlations between pictures. For example, a particular picture being encoded / decoded, called the current picture, may be partitioned into blocks. A block in the current picture, when it is similar to a reference block in a previously coded and still-buffered reference picture in the video, may be coded by a vector called a motion vector. The motion vector points to the reference block in the reference picture and may have a third dimension to identify the reference picture if multiple reference pictures are used.
[0072] In some exemplary embodiments, a biprediction technique can be used for interpicture prediction. According to such a biprediction technique, two reference pictures are used, such as a first reference picture and a second reference picture, both of which precede the current picture in the video in decoding order (but may be past or future in display order, respectively). A block in the current picture may be coded by a first motion vector pointing to a first reference block in the first reference picture and a second motion vector pointing to a second reference block in the second reference picture. The block may be jointly predicted by a combination of the first and second reference blocks.
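A minimal Python sketch of one possible combination of the first and second reference blocks is given below. The equal-weight average with upward rounding and the name `bi_predict` are assumptions made for illustration; the disclosure states only that the block may be jointly predicted by a combination of the two reference blocks, and other weightings are possible.

```python
def bi_predict(first_reference_block, second_reference_block):
    """Combine two reference blocks by a rounded, equal-weight average."""
    return [
        [(a + b + 1) >> 1 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first_reference_block, second_reference_block)
    ]


joint_prediction = bi_predict([[10, 20], [30, 40]], [[20, 21], [30, 43]])
```

Each predicted sample is the integer mean of the two co-located reference samples, with the added 1 before the shift rounding halves upward.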
[0073] Furthermore, merge mode techniques may be used in interpicture prediction to improve coding efficiency.
[0074] According to some exemplary embodiments of this disclosure, predictions such as interpicture prediction and intrapicture prediction are performed in block units. For example, pictures in a sequence of video pictures are partitioned into coding tree units (CTUs) for compression, and the CTUs in a picture may have the same size, such as 64x64 pixels, 32x32 pixels, or 16x16 pixels. Generally, a CTU may include three parallel coding tree blocks (CTBs), i.e., one luma CTB and two chroma CTBs. Each CTU may be recursively quadtree-partitioned into one or more coding units (CUs). For example, a 64x64 pixel CTU may be divided into one 64x64 pixel CU or four 32x32 pixel CUs. Each of one or more of the 32x32 blocks may be further divided into four 16x16 pixel CUs. In some exemplary embodiments, each CU may be analyzed during coding to determine the prediction type of that CU from among various prediction types, such as inter-prediction type or intra-prediction type. A CU may be divided into one or more prediction units (PUs) depending on its temporal and / or spatial predictability. Generally, each PU includes one luma prediction block (PB) and two chroma PBs. In one embodiment, the prediction operation during coding (encoding / decoding) is performed in units of prediction blocks. The division of a CU into PUs (or PBs for different color channels) can be performed in various spatial patterns. For example, a luma or chroma PB may include a matrix of sample values (e.g., luma values), such as 8x8 samples, 16x16 samples, 8x16 samples, or 16x8 samples.
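The recursive quadtree partitioning of a CTU into CUs described above may be sketched in Python as follows. The split decision is supplied by the caller as a hypothetical callback `should_split`, and the minimum CU size of 16 is taken from the 64x64 / 32x32 / 16x16 example in the text; both, along with the function name, are assumptions of this sketch. How an encoder actually decides to split is outside the scope of the example.

```python
def quadtree_partition(x, y, size, should_split, min_size=16):
    """Return the list of CUs as (x, y, size) tuples for one CTU.

    A block is divided into four equal quadrants whenever it is larger than
    `min_size` and the caller-provided `should_split(x, y, size)` returns True.
    """
    if size > min_size and should_split(x, y, size):
        half = size // 2
        coding_units = []
        for offset_y in (0, half):
            for offset_x in (0, half):
                coding_units.extend(
                    quadtree_partition(x + offset_x, y + offset_y, half,
                                       should_split, min_size)
                )
        return coding_units
    return [(x, y, size)]


# Example: split the 64x64 CTU once, then split only its top-left 32x32 block.
def example_rule(x, y, size):
    return size == 64 or (size == 32 and x == 0 and y == 0)


cus = quadtree_partition(0, 0, 64, example_rule)
```

With this example rule the CTU yields four 16x16 CUs in its top-left quadrant and three 32x32 CUs elsewhere, and the CU areas together cover the full 64x64 CTU.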
[0075] Figure 7 shows a diagram of a video encoder (703) according to another exemplary embodiment of the present disclosure. The video encoder (703) is configured to receive a processing block (e.g., a prediction block) of sample values in the current video picture in a sequence of video pictures, and to encode the processing block into a coded picture which is part of a coded video sequence. The exemplary video encoder (703) may be used in place of the video encoder (403) in the example of Figure 4.
[0076] For example, the video encoder (703) receives a matrix of sample values for a processing block, such as an 8x8 sample prediction block. The video encoder (703) then determines, for example using rate-distortion optimization (RDO), which of intra-mode, inter-mode, or bi-prediction mode best codes the processing block. When it is determined that the processing block is to be coded in intra-mode, the video encoder (703) may encode the processing block into a coded picture using an intra-prediction technique; when it is determined that the processing block is to be coded in inter-mode or bi-prediction mode, the video encoder (703) may encode the processing block into a coded picture using an inter-prediction technique or a bi-prediction technique, respectively. In some exemplary embodiments, merge mode may be used as a submode of inter-picture prediction, in which the motion vector is derived from one or more motion vector predictors without the benefit of a coded motion vector component outside those predictors. In some other exemplary embodiments, there may be motion vector components applicable to the target block. Accordingly, the video encoder (703) may include components not explicitly shown in Figure 7, such as a mode determination module, to determine the prediction mode of the processing block.
[0077] In the example shown in Figure 7, the video encoder (703) includes an interencoder (730), an intraencoder (722), a residual calculator (723), a switch (726), a residual encoder (724), a general controller (721), and an entropy encoder (725), all coupled together as shown in the exemplary configuration of Figure 7.
[0078] The interencoder (730) is configured to receive samples of the current block (e.g., a processing block), compare that block to one or more reference blocks in reference pictures (e.g., blocks in previous and subsequent pictures in display order), generate interprediction information (e.g., a description of redundant information according to the intercoding technique, motion vectors, merge mode information), and compute an interprediction result (e.g., a predicted block) based on the interprediction information using any appropriate technique. In some examples, the reference pictures are decoded reference pictures that are decoded based on encoded video information using a decoding unit (633) embedded in the exemplary encoder (620) in Figure 6 (shown as a residual decoder (728) in Figure 7, as will be described in more detail below).
[0079] The intraencoder (722) is configured to receive samples of the current block (e.g., a processing block), compare the block to already coded blocks in the same picture, generate quantized coefficients after the transform, and, in some cases, also generate intraprediction information (e.g., intraprediction direction information according to one or more intracoding techniques). The intraencoder (722) may compute an intraprediction result (e.g., a predicted block) based on the intraprediction information and reference blocks in the same picture.
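A common way to express a rate-distortion decision of this kind is to minimize a Lagrangian cost J = D + λR over the candidate modes, where D is the distortion, R is the rate in bits, and λ is a weighting factor. The Python sketch below assumes this formulation; the function name, the mode labels, and the distortion and rate figures are made up for illustration and are not values from the disclosure.

```python
def choose_mode(candidates, lagrange_multiplier):
    """Return the mode with the smallest cost J = D + lambda * R.

    `candidates` maps a mode name to a (distortion, rate_in_bits) pair.
    """
    def cost(mode):
        distortion, rate = candidates[mode]
        return distortion + lagrange_multiplier * rate

    return min(candidates, key=cost)


# Hypothetical measurements for one processing block.
measurements = {
    "intra": (100, 40),
    "inter": (120, 10),
    "bi-prediction": (90, 60),
}
selected = choose_mode(measurements, lagrange_multiplier=1.0)
```

A larger multiplier favors modes that spend fewer bits, while a multiplier near zero favors the mode with the lowest distortion regardless of rate.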
[0080] The general controller (721) may be configured to determine general control data and control other components of the video encoder (703) based on the general control data. For example, the general controller (721) determines the prediction mode of a block and provides control signals to the switch (726) based on the prediction mode. For example, if the prediction mode is intra-mode, the general controller (721) controls the switch (726) to select the intra-mode result for use by the residual calculator (723) and controls the entropy encoder (725) to select the intra-prediction information and include it in the bitstream. If the prediction mode of a block is inter-mode, the general controller (721) controls the switch (726) to select the inter-prediction result for use by the residual calculator (723) and controls the entropy encoder (725) to select the inter-prediction information and include it in the bitstream.
[0081] The residual calculator (723) may be configured to calculate the difference (residual data) between the received block and the prediction result for the block selected from the intraencoder (722) or the interencoder (730). The residual encoder (724) may be configured to encode the residual data to generate transform coefficients. For example, the residual encoder (724) may be configured to convert the residual data from the spatial domain to the frequency domain to generate transform coefficients. The transform coefficients are then subjected to a quantization process to obtain quantized transform coefficients. In various exemplary embodiments, the video encoder (703) also includes a residual decoder (728). The residual decoder (728) is configured to perform an inverse transform to produce decoded residual data. The decoded residual data can be appropriately used by the intraencoder (722) and the interencoder (730). For example, the interencoder (730) can generate a decoded block based on the decoded residual data and the interprediction information, and the intraencoder (722) can generate a decoded block based on the decoded residual data and the intraprediction information. The decoded blocks are appropriately processed to generate a decoded picture, which is buffered in a memory circuit (not shown) and can be used as a reference picture.
[0082] The entropy encoder (725) may be configured to format a bitstream to include the coded block and to perform entropy coding. The entropy encoder (725) may be configured to include various types of information in the bitstream. For example, the entropy encoder (725) may be configured to include general control data, selected prediction information (e.g., intra-prediction information or inter-prediction information), residual information, and other appropriate information in the bitstream. When a block is coded in the merge submode of either inter-mode or bi-prediction mode, residual information may be absent.
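The residual path described above (difference, transform, quantization, and the inverse steps performed by the residual decoder (728)) can be sketched in one dimension in Python as follows. The orthonormal DCT-II, the uniform quantizer with a single step size, and all function names are assumptions made for this sketch; the disclosure states only that the residual data is converted from the spatial domain to the frequency domain and then quantized.

```python
import math


def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def forward_dct(values):
    """Orthonormal 1-D DCT-II (spatial domain to frequency domain)."""
    n = len(values)
    return [
        _scale(k, n) * sum(values[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                           for i in range(n))
        for k in range(n)
    ]


def inverse_dct(coefficients):
    """Inverse of forward_dct (frequency domain back to spatial domain)."""
    n = len(coefficients)
    return [
        sum(_scale(k, n) * coefficients[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n))
        for i in range(n)
    ]


def encode_residual(block, prediction, step):
    """Residual calculator followed by transform and uniform quantization."""
    residual = [b - p for b, p in zip(block, prediction)]
    return [round(c / step) for c in forward_dct(residual)]


def decode_residual(levels, step):
    """Residual decoder: de-quantize, inverse transform, round to integers."""
    return [round(v) for v in inverse_dct([level * step for level in levels])]


quantized = encode_residual([14, 14, 14, 14], [10, 10, 10, 10], step=2)
decoded_residual = decode_residual(quantized, step=2)
```

For this flat residual all energy lands in the first (DC) coefficient, so the decoded residual matches the original difference of 4 per sample; for general content the quantization step makes the round trip lossy.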
[0083] Figure 8 shows a diagram of an exemplary video decoder (810) according to another embodiment of the present disclosure. The video decoder (810) is configured to receive an encoded picture which is part of an encoded video sequence, decode the encoded picture, and produce a reconstructed picture. In one example, the video decoder (810) may be used instead of the video decoder (410) in the example of Figure 4.
[0084] In the example shown in Figure 8, the video decoder (810) includes an entropy decoder (871), an interdecoder (880), a residual decoder (873), a reconstruction module (874), and an intradecoder (872), all coupled together as shown in the exemplary configuration of Figure 8.
[0085] The entropy decoder (871) may be configured to reconstruct, from the coded picture, specific symbols that represent the syntax elements constituting the coded picture. Such symbols may include, for example, the mode in which the block is coded (e.g., intra-mode, inter-mode, bi-prediction mode, merge submode, or another submode), prediction information (e.g., intra-prediction information or inter-prediction information) that can identify specific samples or metadata used for prediction by the intradecoder (872) or the interdecoder (880), and residual information in the form of, for example, quantized transform coefficients. For example, if the prediction mode is inter-mode or bi-prediction mode, the inter-prediction information is provided to the interdecoder (880), and if the prediction type is the intra-prediction type, the intra-prediction information is provided to the intradecoder (872). The residual information may undergo inverse quantization and be provided to the residual decoder (873).
[0086] The interdecoder (880) may be configured to receive interprediction information and generate interprediction results based on the interprediction information.
[0087] The intra decoder (872) may be configured to receive intra prediction information and generate prediction results based on the intra prediction information.
[0088] The residual decoder (873) may be configured to perform inverse quantization to extract de-quantized transform coefficients, and to process the de-quantized transform coefficients to convert the residual from the frequency domain to the spatial domain. The residual decoder (873) may also utilize certain control information (including the quantizer parameter (QP)), which may be provided by the entropy decoder (871) (the data path is not illustrated as this may be low-data-volume control information only).
[0089] The reconstruction module (874) may be configured to combine, in the spatial domain, the residuals output by the residual decoder (873) and the prediction results (output by the inter-prediction module or the intra-prediction module, as the case may be) to form a reconstructed block that forms part of the reconstructed picture, which in turn is part of the reconstructed video. Note that other appropriate operations, such as deblocking operations, may also be performed to improve visual quality.
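As a minimal sketch of the combining step attributed to the reconstruction module (874), the following Python fragment adds a spatial-domain residual block to a prediction block sample by sample. The function name `reconstruct_block`, the 8-bit default sample range, and the clipping step are assumptions made for illustration only; none of them is specified in this disclosure.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add a spatial-domain residual to a prediction block (hypothetical helper).

    Both inputs are lists of rows of integers with identical dimensions.
    Clipping to the valid sample range is an assumption, not taken from the text.
    """
    max_val = (1 << bit_depth) - 1
    return [
        [min(max(p + r, 0), max_val) for p, r in zip(p_row, r_row)]
        for p_row, r_row in zip(prediction, residual)
    ]


# Illustrative 2x2 block; the values are made up.
pred = [[100, 102], [250, 3]]
resid = [[5, -2], [10, -7]]
recon = reconstruct_block(pred, resid)
print(recon)
```

Post-processing such as deblocking, which the paragraph above mentions, would operate on the output of a step like this and is not modeled here.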
[0090] It should be noted that the video encoders (403), (603), and (703), as well as the video decoders (410), (510), and (810), can be implemented using any suitable technique. In some exemplary embodiments, the video encoders (403), (603), and (703), as well as the video decoders (410), (510), and (810), can be implemented using one or more integrated circuits. In another embodiment, the video encoders (403), (603), and (703), as well as the video decoders (410), (510), and (810), can be implemented using one or more processors that execute software instructions.
[0091] Turning to block partitioning for coding and decoding, general partitioning can start from a base block and follow a predefined set of rules, a specific pattern, a partition tree, or any partition structure or scheme. Partitioning can be hierarchical and recursive. After dividing or partitioning the base block according to one of the exemplary partitioning procedures or other procedures described below, or a combination thereof, a final set of partitions or coding blocks may be obtained. Each of these partitions can be at one of the various partitioning levels in the partitioning hierarchy and can be of various shapes. Each partition can be called a coding block (CB). For the various exemplary partitioning implementations described further below, each resulting CB can be of any of the allowed sizes and partitioning levels. Such partitions are called coding blocks because they can form units for which several basic coding / decoding decisions can be made and coding / decoding parameters can be optimized, determined, and signaled in the coded video bitstream. The highest or deepest level in the final partitions represents the depth of the coding block partitioning tree structure. A coding block can be a luma coding block or a chroma coding block. The CB tree structure for each color can be called a coding block tree (CBT).
[0092] The coding blocks for all color channels may be collectively called coding units (CUs). The hierarchical structure of all color channels may be collectively called coding tree units (CTUs). The partitioning patterns or structures for the different color channels within a CTU may or may not be the same.
[0093] In some implementations, the partition tree scheme or structure used for luma and chroma channels may not need to be the same. In other words, luma and chroma channels may have separate coding tree structures or patterns. Furthermore, whether luma and chroma channels use the same coding partition tree structure or different coding partition tree structures, and the actual coding partition tree structure to be used, may depend on whether the slice being coded is a P slice, a B slice, or an I slice. For example, in the case of an I slice, chroma and luma channels may have separate coding partition tree structures or coding partition tree structure modes, but in the case of a P or B slice, luma and chroma channels may share the same coding partition tree scheme. When separate coding partition tree structures or modes are applied, a luma channel may be partitioned into CBs by one coding partition tree structure, and a chroma channel may be partitioned into chroma CBs by another coding partition tree structure.
[0094] In some exemplary implementations, a predetermined partitioning pattern may be applied to the base block. As shown in Figure 9, an exemplary 4-way partition tree may start at a first predefined level (e.g., a 64x64 block level or other size as the base block size), and the base block may be hierarchically partitioned down to a predefined lowest level (e.g., a 4x4 level). For example, the base block may accept four predefined partitioning options or patterns indicated by 902, 904, 906, and 908, and partitions designated as R may be recursively partitioned in such a way that the same partitioning options shown in Figure 9 may be repeated at a lower scale down to the lowest level (e.g., a 4x4 level). In some implementations, additional restrictions may apply to the partitioning scheme in Figure 9. In the implementation of Figure 9, rectangular partitions (e.g., 1:2 / 2:1 rectangular partitions) may be allowed, but they may not be allowed to be recursive, whereas square partitions may be allowed to be recursive. Partitioning according to Figure 9 with recursion generates a final set of coding blocks as needed. A coding tree depth may be further defined to indicate the partitioning depth from the root node or root block. For example, the coding tree depth for the root node or root block, e.g., a 64x64 block, may be set to 0, and after the root block is partitioned one more time according to Figure 9, the coding tree depth is increased by 1. The maximum or deepest level from the 64x64 base block to the smallest 4x4 partition is 4 in the above scheme (starting from level 0). Such a partitioning scheme may be applied to one or more of the color channels. 
Each color channel may be partitioned independently according to the scheme in Figure 9 (for example, partitioning patterns or options among the predefined patterns may be determined independently for each color channel at each hierarchy level). Alternatively, two or more of the color channels may share the same hierarchical pattern tree as in Figure 9 (for example, the same partitioning pattern or option among the predefined patterns may be selected for two or more color channels at each hierarchical level).
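The depth arithmetic of the Figure 9 scheme can be checked with a short Python sketch. It assumes that each recursive square split halves both the width and the height, which is consistent with the stated depth of 4 from a 64x64 base block to a 4x4 partition; the function names are hypothetical.

```python
def square_sizes_by_depth(base_size, min_size):
    """List the square block size at each coding tree depth, starting at depth 0.

    Assumes every recursive square split halves the block side (an assumption
    drawn from the 64x64 -> 4x4 example, not a rule stated for all schemes).
    """
    sizes = [base_size]
    while sizes[-1] > min_size:
        sizes.append(sizes[-1] // 2)
    return sizes


def max_tree_depth(base_size, min_size):
    """Deepest coding tree depth reachable, with the root block at depth 0."""
    return len(square_sizes_by_depth(base_size, min_size)) - 1


sizes = square_sizes_by_depth(64, 4)
depth = max_tree_depth(64, 4)
print(sizes, depth)
```

For a 64x64 base block and a 4x4 lowest level, the sizes run 64, 32, 16, 8, 4, which gives the maximum depth of 4 quoted above.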
[0095] Figure 10 shows another exemplary predefined partitioning pattern that allows recursive partitioning to form a partitioning tree. As shown in Figure 10, an exemplary 10-way partitioning structure or pattern can be predefined. The root block can start at a predefined level (e.g., from a base block at a 128x128 level or a 64x64 level). The exemplary partitioning structure in Figure 10 includes various 2:1 / 1:2 and 4:1 / 1:4 rectangular partitions. Partition types with three subpartitions, shown as 1002, 1004, 1006, and 1008 in the second row of Figure 10, are sometimes called "T-shaped" partitions. The "T-shaped" partitions 1002, 1004, 1006, and 1008 may be called left T-shaped, top T-shaped, right T-shaped, and bottom T-shaped, respectively. In some exemplary implementations, none of the rectangular partitions in Figure 10 are allowed to be further subdivided. To indicate the partitioning depth from the root node or root block, a coding tree depth may be further defined. For example, the coding tree depth for the root node or root block, e.g., a 128x128 block, may be set to 0, and after the root block is partitioned one more time according to Figure 10, the coding tree depth is increased by 1. In some implementations, only the all-square partitions in 1010 may be recursively partitioned to the next level of the partitioning tree according to the pattern in Figure 10. In other words, recursive partitioning may not be allowed for the square partitions within the T-shaped patterns 1002, 1004, 1006, and 1008. The partitioning procedure according to Figure 10 with recursion generates a final set of coding blocks as needed. Such a scheme may be applied to one or more of the color channels. In some implementations, more flexibility may be added to the use of partitions below the 8x8 level. For example, 2x2 chroma inter prediction may be used in certain cases.
[0096] In several other exemplary implementations for coding block partitioning, a quadtree structure may be used to divide a base block or intermediate block into quadtree partitions. Such quadtree partitioning can be applied hierarchically and recursively to any square partition. Whether the base block or intermediate block or partition is further quadtree partitioned may depend on various local characteristics of the base block or intermediate block / partition. Quadtree partitioning at picture boundaries may be further adapted. For example, implicit quadtree partitioning may be performed at picture boundaries so that a block maintains its quadtree partitioning until its size fits the picture boundary.
[0097] In some other exemplary implementations, hierarchical binary partitioning from a base block may be used. In such a scheme, the base block or intermediate level block may be partitioned into two partitions. Binary partitioning can be either horizontal or vertical. For example, horizontal binary partitioning may divide the base block or intermediate block into equal right and left partitions. Similarly, vertical binary partitioning may divide the base block or intermediate block into equal upper and lower partitions. Such binary partitioning can be hierarchical and recursive. In each of the base block or intermediate block, a decision may be made as to whether the binary partitioning scheme should continue, and if so, whether horizontal or vertical binary partitioning should be used. In some implementations, further partitioning may stop at a predefined minimum partition size (in one or both dimensions). Alternatively, further partitioning may stop when a predefined partitioning level or depth from the base block is reached. In some implementations, the aspect ratio of the partitions may be restricted. For example, the aspect ratio of a partition will not be smaller than 1:4 (or larger than 4:1). Therefore, a vertical strip partition with a 4:1 vertical-to-horizontal aspect ratio can only be further binary partitioned vertically into upper and lower partitions, each having a 2:1 vertical-to-horizontal aspect ratio.
[0098] In several other examples, as shown in Figure 13, a ternary partitioning scheme may be used to partition a base block or any intermediate block. The ternary pattern can be implemented vertically, as shown in 1302 of Figure 13, or horizontally, as shown in 1304 of Figure 13. The exemplary partition ratios in Figure 13 are shown as 1:2:1 vertically or horizontally, but other ratios may be predefined. In some implementations, two or more different ratios may be predefined. Such a ternary partitioning scheme can be used to complement quadtree or binary tree partitioning structures: quadtree and binary tree partitioning always split along the block center and therefore divide an object located at the block center into separate partitions, whereas ternary partitioning can capture such an object within a single contiguous partition. In some implementations, the width and height of the partitions in the exemplary ternary tree are always powers of 2 to avoid additional transforms.
[0099] The above partitioning schemes can be combined in any way at different partitioning levels. For example, the quadtree and binary partitioning schemes described above can be combined to partition a base block into a quadtree binary tree (QTBT) structure. In such a scheme, the base block or intermediate block / partition can be either quadtree partitioned or binary partitioned, if specified, according to a set of predefined conditions. A specific example is shown in Figure 14. In the example in Figure 14, the base block is first quadtree partitioned into four partitions, as shown by 1402, 1404, 1406, and 1408. Each of the resulting partitions is then either quadtree partitioned into four further partitions (e.g., 1408), binary partitioned into two further partitions at the next level (e.g., horizontally or vertically, such as 1402 or 1406, both symmetrical), or not partitioned at all (e.g., 1404). Binary partitioning or quadtree partitioning can be recursively enabled for square partitions, as shown by the overall exemplary partitioning pattern in 1410 and the corresponding tree structure / representation in 1420, where solid lines represent quadtree partitioning and dashed lines represent binary partitioning. A flag may be used for each binary partition node (non-leaf binary partition) to indicate whether the binary partition is horizontal or vertical. For example, as shown in 1420, consistent with the partitioning structure in 1410, the flag "0" may represent horizontal binary partitioning and the flag "1" may represent vertical binary partitioning. In the case of quadtree partitioned partitions, there is no need to indicate the partition type, as quadtree partitioning always divides a block or partition both horizontally and vertically to produce four subblocks / partitions of equal size. In some other implementations, the flag "1" may represent horizontal binary partitioning and the flag "0" may represent vertical binary partitioning.
[0100] In some exemplary implementations of QTBT, the quadtree and binary partitioning rule sets can be represented by the following predefined parameters and their corresponding functions.
- CTU size: the size of the root node of the quadtree (the size of the base block)
- MinQTSize: the minimum allowable quadtree leaf node size
- MaxBTSize: the maximum allowable binary tree root node size
- MaxBTDepth: the maximum allowable binary tree depth
- MinBTSize: the minimum allowable binary tree leaf node size
In some exemplary implementations of the QTBT partitioning structure, the CTU size may be set as 128x128 luma samples with two corresponding 64x64 blocks of chroma samples (when exemplary chroma subsampling is considered and used), MinQTSize may be set as 16x16, MaxBTSize may be set as 64x64, MinBTSize (both width and height) may be set as 4x4, and MaxBTDepth may be set as 4. Quadtree partitioning may initially be applied to the CTU to generate quadtree leaf nodes. A quadtree leaf node can have a size from its minimum allowable size of 16x16 (i.e., MinQTSize) to 128x128 (i.e., CTU size). If a node is 128x128, it is not initially partitioned by a binary tree because its size exceeds MaxBTSize (i.e., 64x64). Otherwise, nodes that do not exceed MaxBTSize may be partitioned by a binary tree. In the example in Figure 14, the base block is 128x128. The base block can only be partitioned by a quadtree according to the predefined set of rules. The base block has a partitioning depth of 0. Each of the four resulting partitions is 64x64, does not exceed MaxBTSize, and can be further partitioned by a quadtree or binary tree at level 1. The process continues. When the binary tree depth reaches MaxBTDepth (i.e., 4), no further partitioning will be considered. When a binary tree node has a width equal to MinBTSize (i.e., 4), no further horizontal partitioning will be considered. Similarly, when the height of a binary tree node is equal to MinBTSize, no further vertical partitioning will be considered.
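The rule set above can be sketched as a small Python check that lists which splits remain available for a node. The class and function names are hypothetical, and the split labels describe the effect on the block (halving the width or the height) rather than using "horizontal" or "vertical". The sketch also assumes that quadtree splitting is no longer offered once a node has been binary partitioned, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QTBTParams:
    """Example parameter values quoted in the paragraph above."""
    ctu_size: int = 128
    min_qt_size: int = 16
    max_bt_size: int = 64
    max_bt_depth: int = 4
    min_bt_size: int = 4


def allowed_splits(width, height, bt_depth, params=QTBTParams()):
    """Return the split options still permitted for a node (illustrative only)."""
    splits = []
    # Quadtree split: square node above MinQTSize that has not been
    # binary split yet (the last condition is an assumption).
    if bt_depth == 0 and width == height and width > params.min_qt_size:
        splits.append("QT")
    # Binary splits: node must fit within MaxBTSize and MaxBTDepth.
    if max(width, height) <= params.max_bt_size and bt_depth < params.max_bt_depth:
        if width > params.min_bt_size:
            splits.append("BT_HALVE_WIDTH")
        if height > params.min_bt_size:
            splits.append("BT_HALVE_HEIGHT")
    return splits


print(allowed_splits(128, 128, 0))  # 128x128 exceeds MaxBTSize
print(allowed_splits(64, 64, 0))
print(allowed_splits(4, 8, 2))      # width already at MinBTSize
```

With the example parameters, a 128x128 node can only be quadtree partitioned, while each resulting 64x64 node may be quadtree or binary partitioned, matching the walk-through of Figure 14.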
[0101] In some exemplary implementations, the above QTBT scheme may be configured to support flexibility for luma and chroma to have the same QTBT structure or separate QTBT structures. For example, in the case of P slices and B slices, the luma CTB and chroma CTBs in one CTU may share the same QTBT structure. However, in the case of I slices, the luma CTB may be partitioned into CBs by one QTBT structure, and the chroma CTBs may be partitioned into chroma CBs by another QTBT structure. This means that a CU may refer to different color channels in an I slice; for example, a CU in an I slice may consist of a coding block of the luma component or coding blocks of the two chroma components, whereas a CU in a P slice or B slice may consist of coding blocks of all three color components.
[0102] In some other implementations, the QTBT scheme can be supplemented by the ternary partitioning scheme described above. Such implementations may be called multi-type tree (MTT) structures. For example, in addition to binary partitioning of nodes, one of the ternary partitioning patterns in Figure 13 may be selected. In some implementations, only square nodes may be subject to ternary partitioning. Additional flags may be used to indicate whether the ternary partitioning is horizontal or vertical.
[0103] The design of two-level or multi-level trees, such as QTBT implementations and QTBT implementations supplemented by ternary partitioning, can be primarily motivated by complexity reduction. Theoretically, the complexity of traversing a tree is T^D (T raised to the power D), where T is the number of partition types and D is the depth of the tree. A trade-off can be made by using multiple types (T) while reducing the depth (D).
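Reading the complexity expression above as T raised to the power D, the trade-off can be illustrated numerically. The values of T and D below are hypothetical and chosen only to show how the two factors interact; they are not taken from this disclosure.

```python
def traversal_complexity(num_split_types, depth):
    """Candidate count under the T**D model (T split types, tree depth D)."""
    return num_split_types ** depth


# Hypothetical (T, D) pairs.
few_types_deep = traversal_complexity(4, 6)
more_types_shallow = traversal_complexity(6, 4)
print(few_types_deep, more_types_shallow)
```

With these assumed values, adding two partition types while removing two levels of depth lowers the count from 4096 to 1296, because the depth enters as the exponent.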
[0104] In some implementations, the CB may be further partitioned. For example, the CB may be further partitioned into multiple prediction blocks (PBs) for the purpose of intra-frame or inter-frame prediction during the coding and decoding process. In other words, the CB may be further divided into different subpartitions in which individual prediction decisions / configurations may be made. In parallel, the CB may be further partitioned into multiple transformation blocks (TBs) for the purpose of delineating the levels at which transformations or inverse transformations of the video data are performed. The partitioning schemes for the CB into PBs and TBs may be the same or different. For example, each partitioning scheme may be performed using its own procedure, for example, based on various characteristics of the video data. The PB and TB partitioning schemes may be independent in some exemplary implementations. The PB and TB partitioning schemes and boundaries may be correlated in some other exemplary implementations. In some implementations, for example, the TBs may be partitioned after the PB partitions; in particular, each PB, once determined by the partitioning of the coding block, may be further partitioned into one or more TBs. For example, in some implementations, a PB can be divided into one, two, four, or other numbers of TBs.
[0105] In some implementations, luma channels and chroma channels may be treated differently in order to partition the base block into coding blocks, and further into prediction and / or transformation blocks. For example, in some implementations, partitioning coding blocks into prediction and / or transformation blocks may be permitted for luma channels, but such partitioning of coding blocks into prediction and / or transformation blocks may not be permitted for chroma channels. In such implementations, transformation and / or prediction of luma blocks may thus only be performed at the coding block level. In another example, the minimum transformation block size for luma channels and chroma channels may differ; for example, coding blocks for luma channels may be permitted to be partitioned into smaller transformation and / or prediction blocks than chroma channels. As yet another example, the maximum partitioning depth of coding blocks into transformation and / or prediction blocks may differ between luma channels and chroma channels. For example, coding blocks for luma channels may be allowed to be partitioned into deeper transformation and / or prediction blocks than chroma channels. In a specific example, a luma coding block may be partitioned into transformation blocks of multiple sizes, which can be represented by recursive partitions up to two levels down, with transformation block shapes such as square, 2:1 / 1:2, and 4:1 / 1:4, as well as transformation block sizes from 4x4 to 64x64, being acceptable. However, in the case of chroma blocks, only the largest possible transformation block specified for the luma block may be allowed.
[0106] In some exemplary implementations for partitioning coding blocks into PBs, the depth, shape, and / or other properties of the PB partitioning may depend on whether the PB is intra-coded or inter-coded.
[0107] Partitioning of coding blocks (or prediction blocks) into transformation blocks can be implemented in a variety of exemplary ways, including, but not limited to, recursive or non-recursive quadtree partitioning and predefined pattern partitioning, along with additional consideration for the transformation blocks at the boundaries of the coding or prediction blocks. In general, the resulting transformation blocks may be at different partitioning levels, may not be the same size, and may not have to be square in shape (for example, they may be rectangles with some acceptable size and aspect ratio). Further examples are described in more detail below with respect to Figures 15, 16, and 17.
[0108] However, in some other implementations, a CB obtained through any of the above partitioning schemes can be used as a basic or minimal coding block for prediction and / or transformation. In other words, no further partitioning is performed for inter-prediction / intra-prediction and / or transformation purposes. For example, a CB obtained from the above QTBT scheme can be used directly as a unit for performing predictions. Specifically, such a QTBT structure eliminates the concept of multiple partition types; i.e., it eliminates the separation of CU, PU, and TU, and supports more flexibility for CU / CB partition shapes as described above. In such a QTBT block structure, the CU / CB can have either a square or rectangular shape. The leaf nodes of such a QTBT are used as units for prediction and transformation processing without further partitioning. This means that the CU, PU, and TU have the same block size in such an exemplary QTBT coding block structure.
[0109] The various CB partitioning schemes described above, as well as further partitioning of CBs into PBs and / or TBs (including no PB / TB partitioning), can be combined in any way. The following specific implementations are provided as non-limiting examples.
[0110] Specific exemplary implementations of coding block and transform block partitioning are described below. In such exemplary implementations, the base block may be partitioned into coding blocks using recursive quadtree partitioning or the predefined partitioning patterns described above (such as those in Figures 9 and 10). At each level, whether further quadtree partitioning of a particular partition should be continued may be determined by local video data characteristics. The resulting CBs can be at various quadtree partitioning levels and of various sizes. The decision on whether to code a picture area using inter-picture (temporal) or intra-picture (spatial) prediction may be made at the CB level (or at the CU level for all three color channels). Each CB may be further partitioned into one, two, four, or another number of PBs according to a predefined PB partitioning type. Within a single PB, the same prediction process may be applied, and relevant information may be sent to the decoder for each PB. After obtaining residual blocks by applying the prediction process based on the PB partitioning type, the CB may be partitioned into TBs according to another quadtree structure similar to the coding tree for the CB. In this particular implementation, the CB or TB may be, but need not be, limited to a square shape. Furthermore, in this particular example, the PB can be square or rectangular for inter prediction, and only square for intra prediction. A coding block can be divided, for example, into four square TBs. Each TB can be recursively divided (using quadtree partitioning) into smaller TBs; the resulting structure is called a residual quadtree (RQT).
[0111] Further exemplary implementations for partitioning a base block into CBs, PBs, and / or TBs are described below. For example, a quadtree with a nested multitype tree using binary and ternary partitioning segmentation structures (e.g., QTBT or QTBT with ternary partitioning as described above) may be used, rather than using multiple partition unit types such as those shown in Figure 9 or Figure 10. The separation of CBs, PBs, and TBs (i.e., partitioning CBs into PBs and / or TBs, and PBs into TBs) may be abandoned except when necessary for CBs that are too large for the maximum transformation length, and such CBs may require further partitioning. This exemplary partitioning scheme may be designed to support more flexibility for CB partition shapes so that both prediction and transformation can be performed at the CB level without further partitioning. In such a coding tree structure, CBs may have either a square or rectangular shape. Specifically, a coding tree block (CTB) may first be partitioned by a quadtree structure. Next, the quadtree leaf nodes can be further partitioned by a nested multitype tree structure. An example of a nested multitype tree structure using binary partitioning or ternary partitioning is shown in Figure 11. Specifically, the exemplary multitype tree structure in Figure 11 includes four partitioning types called vertical binary partitioning (SPLIT_BT_VER)(1102), horizontal binary partitioning (SPLIT_BT_HOR)(1104), vertical ternary partitioning (SPLIT_TT_VER)(1106), and horizontal ternary partitioning (SPLIT_TT_HOR)(1108). The CB corresponds to the leaf of the multitype tree. In this exemplary implementation, this segmentation is used for both prediction and transformation processing without further partitioning, as long as the CB is not too large for the maximum transformation length. 
This means that in most cases, the CB, PB, and TB have the same block size in a quadtree with a nested multitype tree coding block structure. An exception occurs when the maximum supported transform length is smaller than the width or height of a color component of the CB. In some implementations, in addition to binary or ternary partitioning, the nested patterns in Figure 11 may further include quadtree partitioning.
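The four multitype tree partitioning types named above can be sketched as a function that returns the sub-block sizes produced by each type. Two assumptions are made here and should be checked against Figure 11: a "vertical" split is taken to use a vertical dividing line, so that the width is divided, and the ternary split uses the 1:2:1 ratio given as exemplary for Figure 13. The function name is hypothetical.

```python
def split_children(width, height, mode):
    """Return (width, height) of each sub-block for a multitype tree split.

    Assumes "VER" divides the width and "HOR" divides the height, and that
    ternary splits use a 1:2:1 ratio; both are illustrative assumptions.
    """
    if mode == "SPLIT_BT_VER":
        return [(width // 2, height)] * 2
    if mode == "SPLIT_BT_HOR":
        return [(width, height // 2)] * 2
    if mode == "SPLIT_TT_VER":
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "SPLIT_TT_HOR":
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(f"unknown split mode: {mode}")


for mode in ("SPLIT_BT_VER", "SPLIT_BT_HOR", "SPLIT_TT_VER", "SPLIT_TT_HOR"):
    print(mode, split_children(32, 32, mode))
```

For power-of-two block sizes of at least 4 in the divided dimension, every sub-block dimension stays a power of two, and the sub-block areas sum to the area of the parent block.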
[0112] Figure 12 shows one specific example for a quadtree having a nested multitype tree coding block structure of block partitions (including quadtree, binary, and ternary partitioning options) for one base block. More specifically, Figure 12 shows that base block 1200 is quadtree partitioned into four square partitions 1202, 1204, 1206, and 1208. The decision to further use the multitype tree structure and quadtrees of Figure 11 for further partitioning is made for each of the quadtree-partitioned partitions. In the example of Figure 12, partition 1204 is not further partitioned. Partitions 1202 and 1208 each adopt a different quadtree partitioning. For partition 1202, the second-level quadtree-partitioned upper-left, upper-right, lower-left, and lower-right partitions adopt a third-level partitioning of a quadtree, horizontal binary partition 1104 in Figure 11, unpartitioned, and horizontal ternary partition 1108 in Figure 11, respectively. Partition 1208 employs a different quadtree partitioning pattern, and the second-level quadtree-partitioned upper-left, upper-right, lower-left, and lower-right partitions employ third-level partitioning patterns: vertical ternary partitioning 1106, unpartitioned, unpartitioned, and horizontal binary partitioning 1104 in Figure 11, respectively. Two of the subpartitions of the third-level upper-left partition of 1208 are further partitioned according to horizontal binary partitioning 1104 and horizontal ternary partitioning 1108 in Figure 11, respectively. Partition 1206 is divided into two partitions by adopting a second-level partitioning pattern according to vertical binary partitioning 1102 in Figure 11, and these partitions are further partitioned at the third level according to horizontal ternary partitioning 1108 and vertical binary partitioning 1102 in Figure 11. A fourth-level partitioning pattern is further applied to one of them according to horizontal binary partitioning 1104 in Figure 11.
[0113] In the specific example above, the maximum luma transform size may be 64x64, and the maximum supported chroma transform size may differ from that of luma, for example, 32x32. The exemplary CBs in Figure 12 are generally not further divided into smaller PBs and / or TBs, but when the width or height of a luma coding block or chroma coding block is greater than the maximum transform width or height, the luma coding block or chroma coding block may be automatically divided horizontally and / or vertically to satisfy the transform size limitation in that direction.
[0114] In a specific example of partitioning the above base block into CBs, as explained above, the coding tree scheme can support the ability for luma and chroma to have separate block tree structures. For example, in the case of P and B slices, the luma CTB and chroma CTBs in one CTU may share the same coding tree structure. In the case of I slices, for example, luma and chroma may have separate coding block tree structures. When separate block tree structures are applied, a luma CTB may be partitioned into luma CBs by one coding tree structure, and the chroma CTBs may be partitioned into chroma CBs by another coding tree structure. This means that a CU in an I slice may consist of a coding block of the luma component or coding blocks of the two chroma components, and unless the video is monochrome, a CU in a P or B slice will always consist of coding blocks of all three color components.
[0115] When a coding block is further partitioned into multiple transformation blocks, these transformation blocks can be ordered in the bitstream according to various orders or scanning methods. Exemplary implementations for partitioning coding or prediction blocks into transformation blocks, and the coding order of the transformation blocks, are described in more detail below. In some exemplary implementations, as described above, transformation partitioning can support transformation blocks of multiple shapes, e.g., 1:1 (square), 1:2 / 2:1, and 1:4 / 4:1, with transformation block sizes ranging, for example, from 4x4 to 64x64. In some implementations, if the coding block is smaller than or equal to 64x64, transformation block partitioning may be applied only to the luma component, and therefore, for chroma blocks, the transformation block size is the same as the coding block size. Otherwise, if the coding block width or height is greater than 64, both the luma coding block and the chroma coding block may be implicitly divided into multiples of min(W,64) × min(H,64) and min(W,32) × min(H,32) transformation blocks, respectively.
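The implicit division into min(W, max) × min(H, max) transformation blocks can be sketched as a simple tiling. The function name, the (x, y, width, height) tuple layout, and the raster ordering of the tiles are assumptions for illustration; the sketch also assumes block dimensions that are exact multiples of the tile size, as is the case for power-of-two sizes.

```python
def implicit_transform_blocks(width, height, max_size):
    """Tile a coding block into min(W, max) x min(H, max) transform blocks.

    Returns (x, y, w, h) tuples in raster order (ordering is an assumption).
    """
    tb_w = min(width, max_size)
    tb_h = min(height, max_size)
    return [
        (x, y, tb_w, tb_h)
        for y in range(0, height, tb_h)
        for x in range(0, width, tb_w)
    ]


luma_tbs = implicit_transform_blocks(128, 64, 64)    # luma limit of 64
chroma_tbs = implicit_transform_blocks(64, 32, 32)   # chroma limit of 32
print(luma_tbs)
print(chroma_tbs)
```

In this example a 128x64 luma coding block yields two 64x64 transformation blocks, and a 64x32 chroma coding block yields two 32x32 transformation blocks.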
[0116] In some exemplary implementations of transform block partitioning, both intra-coded and inter-coded blocks may be further partitioned into multiple transform blocks with a partitioning depth of up to a predefined number of levels (e.g., two levels). The transform block partitioning depth and size may be related. For some exemplary implementations, the mapping from the transform size at the current depth to the transform size at the next depth is shown in Table 1 below. [Table 1]
[0117] Based on the illustrative mapping in Table 1, for a 1:1 square block, the next level of transformation partitioning can create four 1:1 square subtransformation blocks. The transformation partition can stop at, for example, 4x4. Therefore, a transformation size of 4x4 at the current depth corresponds to the same size of 4x4 at the next depth. In the example in Table 1, for a 1:2 / 2:1 non-square block, the next level of transformation partitioning can create two 1:1 square subtransformation blocks, while for a 1:4 / 4:1 non-square block, the next level of transformation partitioning can create two 1:2 / 2:1 subtransformation blocks.
[0118] In some exemplary implementations, additional restrictions may be applied to the luma component of intra-coded blocks with respect to transform block partitioning. For example, for each level of transform partitioning, all sub-transform blocks may be restricted to having equal sizes. For example, for a 32x16 coding block, a level 1 transform partition creates two 16x16 sub-transform blocks, and a level 2 transform partition creates eight 8x8 sub-transform blocks. In other words, the second-level partition must be applied to all first-level sub-blocks to keep the transform units of equal size. An example of transform block partitioning for an intra-coded square block according to Table 1 is shown in Figure 15, along with the coding order indicated by arrows. Specifically, 1502 shows a square coding block. The first-level partition into four equally sized transform blocks according to Table 1 is shown in 1504, along with the coding order indicated by arrows. The second-level partition of all the first-level blocks into 16 equally sized transform blocks according to Table 1 is shown in 1506, along with the coding order indicated by arrows.
[0119] In some exemplary implementations, the above restrictions for intra coding may not apply to the luma component of an inter-coded block. For example, after the first level of transform partitioning, any one of the sub-transform blocks may be independently partitioned by one more level. Thus, the resulting transform blocks may or may not be the same size. An exemplary partitioning of an inter-coded block into transform blocks, along with their coding order, is shown in Figure 16. In the example in Figure 16, the inter-coded block 1602 is partitioned into transform blocks at two levels according to Table 1. At the first level, the inter-coded block is partitioned into four transform blocks of equal size. Then, as shown by 1604, only one of the four transform blocks (but not all of them) is further partitioned into four sub-transform blocks, resulting in a total of seven transform blocks of two different sizes. The exemplary coding order of these seven transform blocks is indicated by the arrows in 1604 of Figure 16.
[0120] In some exemplary implementations, several additional restrictions may apply to the chroma component(s) for the transformation block. For example, for the chroma component(s), the transformation block size can be the same as the coding block size, but it cannot be smaller than a predefined size, e.g., 8x8.
[0121] In some other exemplary implementations, for coding blocks where either the width (W) or height (H) is greater than 64, both the luma coding block and the chroma coding block may be implicitly divided into multiples of min(W,64) × min(H,64) and min(W,32) × min(H,32) transform units, respectively. Here, in this disclosure, "min(a,b)" returns the smaller of a and b.
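As a minimal sketch of this implicit split, the following Python computes the transform unit size and the resulting number of units. The function names are hypothetical, and the sketch assumes that W and H are given in samples of the component being split.

```python
def implicit_transform_unit_size(width, height, is_luma):
    """Transform unit size for a large coding block: min(W,64) x min(H,64)
    for luma and min(W,32) x min(H,32) for chroma (hypothetical helper)."""
    cap = 64 if is_luma else 32
    return (min(width, cap), min(height, cap))


def implicit_transform_unit_count(width, height, is_luma):
    """Number of equally sized transform units covering the coding block."""
    unit_w, unit_h = implicit_transform_unit_size(width, height, is_luma)
    return (width // unit_w) * (height // unit_h)
```

Under these assumptions, a 128x64 luma coding block is covered by two 64x64 transform units.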
[0122] Figure 17 further illustrates another alternative exemplary method for partitioning coding blocks or prediction blocks into transformation blocks. As shown in Figure 17, instead of using recursive transformation partitioning, a predefined set of partitioning types can be applied to coding blocks according to the transformation type of the coding block. In the particular example shown in Figure 17, one of six exemplary partitioning types may be applied to divide the coding block into a varying number of transformation blocks. Such a method for generating transformation block partitioning can be applied to either coding blocks or prediction blocks.
[0123] More specifically, the partitioning scheme in Figure 17 provides up to six exemplary partition types for any given transform type (where "transform type" refers to the type of primary transform, such as ADST). In this scheme, every coding block or prediction block can be assigned a transform partition type based, for example, on rate-distortion cost. In one example, the transform partition type assigned to a coding block or prediction block may be determined based on the transform type of the coding block or prediction block. A particular transform partition type can correspond to a transform block partition size and pattern, as shown by the six transform partition types in Figure 17. The correspondence between various transform types and various transform partition types can be predefined. An example is shown below, using capitalized labels to indicate the transform partition types that can be assigned to a coding block or prediction block based on rate-distortion cost:
• PARTITION_NONE: Assigns a transform size equal to the block size.
• PARTITION_SPLIT: Assigns a transform size that is half the width and half the height of the block size.
• PARTITION_HORZ: Assigns a transform size that has the same width as the block size and half the height of the block size.
• PARTITION_VERT: Assigns a transform size that is half the width of the block size and the same height as the block size.
• PARTITION_HORZ4: Assigns a transform size that has the same width as the block size and 1 / 4 the height of the block size.
• PARTITION_VERT4: Assigns a transform size that is 1 / 4 the width of the block size and the same height as the block size.
[0124] In the example above, as shown in Figure 17, all transform partition types use a uniform transform size for the partitioned transform blocks. This is merely an example, not a limitation. In some other implementations, mixed transform block sizes may be used for the partitioned transform blocks in a particular partition type (or pattern).
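The six partition types listed above can be expressed as a lookup from partition type to width and height divisors. The following Python sketch is illustrative only; the table name and helper functions are assumptions, while the partition type labels are taken from the list above.

```python
# (width divisor, height divisor) for each transform partition type.
TRANSFORM_PARTITION_DIVISORS = {
    "PARTITION_NONE": (1, 1),
    "PARTITION_SPLIT": (2, 2),
    "PARTITION_HORZ": (1, 2),
    "PARTITION_VERT": (2, 1),
    "PARTITION_HORZ4": (1, 4),
    "PARTITION_VERT4": (4, 1),
}


def transform_size(block_width, block_height, partition_type):
    """Uniform transform size produced by the given partition type."""
    div_w, div_h = TRANSFORM_PARTITION_DIVISORS[partition_type]
    return (block_width // div_w, block_height // div_h)


def transform_block_count(partition_type):
    """Number of transform blocks the partition type produces."""
    div_w, div_h = TRANSFORM_PARTITION_DIVISORS[partition_type]
    return div_w * div_h
```

For a 32x32 block, `PARTITION_HORZ4` gives four 32x8 transform blocks and `PARTITION_SPLIT` gives four 16x16 transform blocks.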
[0125] The PB (or CB, also called PB if not further partitioned into prediction blocks) obtained from any of the above partitioning schemes can then become individual blocks for coding via either intra-prediction or inter-prediction. In the case of inter-prediction on the current PB, a residual is generated between the current block and the prediction block, which can be coded and included in the coded bitstream.
[0126] Inter-prediction can be implemented, for example, in single-reference mode or composite-reference mode. In some implementations, a skip flag may first be included in the bitstream for the current block (or at a higher level) to indicate whether the current block is inter-coded and not skipped. If the current block is inter-coded, another flag may be further signaled in the bitstream to indicate whether single-reference mode or composite-reference mode is used for predicting the current block. In single-reference mode, one reference block may be used to generate the prediction block for the current block. In composite-reference mode, two or more reference blocks may be used, for example, to generate the prediction block by weighted averaging. Composite-reference mode is sometimes called more-than-one-reference mode, two-reference mode, or multiple-reference mode. The one or more reference blocks may be identified using one or more reference frame indices, together with one or more corresponding motion vectors indicating the shift(s) in position between the reference block(s) and the current block, for example, in horizontal and vertical pixels. For example, in single-reference mode, the inter-prediction block for the current block may be generated from a single reference block identified by one motion vector in the reference frame, whereas in composite-reference mode, the prediction block may be generated by a weighted average of two reference blocks in two reference frames, indicated by two reference frame indices and two corresponding motion vectors. The motion vector(s) can be coded in various ways and included in the bitstream.
[0127] In some implementations, the encoding or decoding system may maintain a decoded picture buffer (DPB). Some images / pictures are maintained in the DPB awaiting display (in the decoding system), and some images / pictures in the DPB may be used as reference frames to enable inter-prediction (in the decoding or encoding system). In some implementations, reference frames in the DPB may be tagged as either short-term or long-term references for the current image being encoded or decoded. For example, short-term reference frames may include frames used for inter-prediction of blocks in the current frame or in a predefined number (e.g., 2) of subsequent video frames closest to the current frame in decoding order. Long-term reference frames may include frames in the DPB that can be used to predict image blocks in frames that are more than the predefined number of frames away from the current frame in decoding order. Information about such tags for short-term and long-term reference frames may be called a reference picture set (RPS) and may be added to the header of each frame in the encoded bitstream. Each frame in the encoded bitstream may be identified by a picture order count (POC), which is numbered according to the playback sequence either in an absolute manner or relative to a group of pictures, for example, starting from an I-frame.
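As a rough illustration of the composite-reference case, the sketch below forms a prediction block as a weighted average of two reference blocks. The function name, the equal default weights, and the round-half-up rounding are assumptions made for the example; actual codecs define their own weights and integer arithmetic.

```python
def compound_prediction(ref_block_a, ref_block_b, weight_a=0.5):
    """Weighted average of two equally sized reference blocks.

    Blocks are lists of rows of non-negative sample values. The weights and
    the rounding rule are illustrative assumptions, not a codec definition.
    """
    weight_b = 1.0 - weight_a
    return [
        [int(weight_a * a + weight_b * b + 0.5) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(ref_block_a, ref_block_b)
    ]
```

Setting `weight_a` to 1.0 reduces the sketch to single-reference prediction from the first block.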
[0128] In some exemplary implementations, one or more reference picture lists, including the identification of short-term and long-term reference frames for inter-prediction, may be formed based on information in the RPS. For example, for unidirectional inter-prediction, a single picture reference list, designated as L0 reference (or reference list 0), may be formed, and for bidirectional inter-prediction, two picture reference lists, designated as L0 (or reference list 0) and L1 (or reference list 1), may be formed for each of the two prediction directions. The reference frames contained in the L0 and L1 lists may be ordered in various predetermined ways. The lengths of the L0 and L1 lists may be signaled in the video bitstream. Unidirectional inter-prediction can be either a single-reference mode or a composite-reference mode where multiple references for generating prediction blocks by weighted averaging in composite prediction mode are on the same side of the block to be predicted. Bidirectional inter-prediction can be a composite mode only in that bidirectional inter-prediction involves at least two reference blocks.
[0129] Block partitioning
[0130] Figure 18 shows an exemplary partition tree for block partitioning. VP9 uses a 4-way partition tree starting at the 64x64 level and going down to the 4x4 level, with some additional restrictions for blocks of 8x8 and below. Partitions designated as R can be recursive in that the same partition tree is repeated at a lower scale until the lowest 4x4 level is reached. AV1 not only extends the partition tree to a 10-way structure, as shown in the figure, but also increases the largest size (called a superblock in VP9 / AV1 terminology) so that it starts from 128x128. This includes 4:1 / 1:4 rectangular partitions, which were not present in VP9. None of the rectangular partitions can be further subdivided. In addition, AV1 adds further flexibility to the use of partitions below the 8x8 level. For example, 2x2 chroma inter prediction becomes possible in certain cases.
[0131] In HEVC, a coding tree unit (CTU) is divided into coding units (CUs) by using a quadtree structure, represented as a coding tree, to adapt to various local characteristics. A CU can also be regarded as a block containing either a prediction block or a coding block. The decision of whether to code a picture area using inter-picture (temporal) prediction or intra-picture (spatial) prediction is made at the CU level. Each CU can be further divided into one, two, or four prediction units (PUs) according to the PU partition type. Within a single PU, the same prediction process is applied, and the relevant information is sent to the decoder on a PU basis. After obtaining the residual block by applying the prediction process based on the PU partition type, a CU can be partitioned into transform units (TUs) according to another quadtree structure similar to the coding tree for the CU. The HEVC structure has multiple partition concepts, including CUs, PUs, and TUs. In HEVC, a CU or TU can only be square, while a PU can be square or rectangular for an inter-predicted block. In HEVC, a single coding block can be further divided into four square sub-blocks, and the transform is performed on each sub-block (i.e., TU). Each TU can be further divided recursively into smaller TUs (using quadtree partitioning), which is called a residual quadtree (RQT). At picture boundaries, HEVC employs implicit quadtree partitioning, so that a block keeps being quadtree split until its size fits the picture boundary.
[0132] Block partitioning structures can exist that use a quadtree (QT) plus a binary tree (BT). In HEVC, a CTU can be partitioned into CUs by using a quadtree structure, represented as a coding tree, to adapt to various local characteristics. The decision of whether to code a picture area using inter-picture (temporal) prediction or intra-picture (spatial) prediction is made at the CU level. In some embodiments, each CU can be further partitioned into one, two, or four PUs according to the PU partition type. Within a single PU, the same prediction process may be applied, and the relevant information is sent to the decoder on a PU basis. After obtaining the residual block by applying the prediction process based on the PU partition type, the CU can be partitioned into transform units (TUs) according to another quadtree structure similar to the coding tree for the CU. The HEVC structure can have multiple partition concepts, including CUs, PUs, and TUs.
[0133] Figure 19 shows exemplary partitions and trees for a quadtree-plus-binary-tree (QTBT) structure. A QTBT structure may dispense with the concept of multiple partition types and may remove the separation of the CU, PU, and TU concepts. A QTBT structure may support more flexibility for CU partition shapes. In some embodiments of a QTBT block structure, the CU may have a square or rectangular shape.
[0134] As shown in Figure 19, a coding tree unit (CTU) is first partitioned by a quadtree structure. The quadtree leaf nodes are further partitioned by a binary tree structure. There can be two types of binary tree partitioning: symmetric horizontal partitioning and symmetric vertical partitioning. The binary tree leaf nodes are called coding units (CUs), and that segmentation can be used for prediction and transform processing without further partitioning. Thus, the CU, PU, and TU may have the same block size in the QTBT coding block structure. In JEM, a CU may contain coding blocks (CBs) of different color components (e.g., for P slices and B slices in a 4:2:0 chroma format, one CU contains one luma CB and two chroma CBs). In other embodiments, a CU may contain CBs of a single component (e.g., in the case of an I slice, one CU contains either one luma CB or two chroma CBs).
[0135] The following parameters may be defined for the QTBT partitioning scheme:
• CTU size: The size of the root node of a quadtree, the same concept as in HEVC
• MinQTSize: Minimum allowable quadtree leaf node size
• MaxBTSize: Maximum allowable binary tree root node size
• MaxBTDepth: Maximum allowable binary tree depth
• MinBTSize: Minimum allowable binary tree leaf node size
[0136] In one embodiment of the QTBT partitioning structure, the CTU size may be set as 128×128 luma samples with two corresponding 64×64 blocks of chroma samples. In one embodiment, MinQTSize may be set as 16×16, MaxBTSize may be set as 64×64, MinBTSize (for both width and height) may be set as 4×4, and MaxBTDepth may be set as 4. Quadtree partitioning may be applied to the CTU first to generate quadtree leaf nodes. The quadtree leaf nodes may have sizes ranging from 16×16 (i.e., MinQTSize) to 128×128 (i.e., CTU size). If a quadtree leaf node is 128×128 in size, it will not be further partitioned by a binary tree because it exceeds MaxBTSize (i.e., 64×64). Otherwise, the quadtree leaf node may be further partitioned by a binary tree. A quadtree leaf node may therefore also be the root node of a binary tree, with a binary tree depth of 0. Once the binary tree depth reaches MaxBTDepth (i.e., 4), further partitioning will not be considered. When a binary tree node has a width equal to MinBTSize (i.e., 4), further horizontal partitioning will not be considered. Similarly, when the height of a binary tree node is equal to MinBTSize, further vertical partitioning will not be considered. Binary tree leaf nodes can be further processed by prediction and transform processing without further partitioning. In JEM, the maximum CTU size can be 256×256 luma samples.
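The constraints in this embodiment can be sketched as a function that lists which splits remain available for a node. The `QtbtParams` class and the `allowed_splits` function are hypothetical names introduced for this sketch; the width and height conditions follow the wording of the paragraph above literally, and sizes are in luma samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QtbtParams:
    """Example parameter values from the embodiment described above."""
    ctu_size: int = 128
    min_qt_size: int = 16
    max_bt_size: int = 64
    max_bt_depth: int = 4
    min_bt_size: int = 4


def allowed_splits(width, height, bt_depth, in_binary_tree, params=QtbtParams()):
    """Return the split types still available for a node (illustrative only)."""
    splits = []
    # Quadtree splitting applies only before any binary split and stops
    # once the node reaches MinQTSize.
    if not in_binary_tree and width == height and width > params.min_qt_size:
        splits.append("quad")
    # Binary splitting needs the node to fit MaxBTSize and MaxBTDepth.
    if max(width, height) <= params.max_bt_size and bt_depth < params.max_bt_depth:
        # Width equal to MinBTSize: no further horizontal partitioning.
        if width > params.min_bt_size:
            splits.append("horizontal")
        # Height equal to MinBTSize: no further vertical partitioning.
        if height > params.min_bt_size:
            splits.append("vertical")
    return splits
```

With these example parameters, a 128×128 quadtree node can only be quadtree split, because it exceeds MaxBTSize.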
[0137] Figure 19 shows an example of block partitioning using QTBT (left side of Figure 19) and the corresponding tree representation (right side of Figure 19). Solid lines represent quadtree partitioning, and dashed lines represent binary tree partitioning. At each partition (i.e., non-leaf) node of the binary tree, one flag is signaled to indicate which partition type (i.e., horizontal or vertical) is used: 0 indicates horizontal partitioning and 1 indicates vertical partitioning. In the case of quadtree partitioning, there is no need to indicate the partition type because quadtree partitioning divides the block both horizontally and vertically to produce four subblocks of equal size.
[0138] In addition, the QTBT scheme supports the flexibility of having separate QTBT structures for luma and chroma. For example, in the case of P and B slices, the luma CTB and chroma CTBs in one CTU share the same QTBT structure. However, in the case of I slices, the luma CTB may be partitioned into CUs by a QTBT structure, and the chroma CTBs may be partitioned into chroma CUs by a separate QTBT structure. In this example, a CU in an I slice contains a coding block of the luma component or coding blocks of the two chroma components, while a CU in a P or B slice contains coding blocks of all three color components. In HEVC, inter-prediction may be restricted for small blocks to reduce the memory access of motion compensation; as a result, bi-prediction is not supported for 4x8 and 8x4 blocks, and inter-prediction is not supported for 4x4 blocks. QTBT, as implemented in JEM-7.0, can remove these restrictions.
[0139] Block partitioning can be based on a ternary tree (TT) structure. Figure 20 shows an exemplary ternary tree partitioning. In VVC, a multi-type tree (MTT) structure can add horizontal and vertical center-side ternary trees on top of a QTBT, as shown in Figure 20 (a and b, respectively). Ternary tree partitioning can complement quadtree and binary tree partitioning. While quadtrees and binary trees always split along the center of a block, ternary tree partitioning can capture objects located at the center of a block. With ternary tree partitioning, the width and height of the resulting partitions can be powers of 2, so that no additional transforms are required. The design of a two-level tree can be motivated by complexity reduction, where the complexity of traversing the tree is $T^D$, where T is the number of partition types and D is the depth of the tree.
[0140] Template Matching
[0141] Figure 21 shows an example of template matching (TM) on a search area around an initial motion vector (MV). Template matching can be a decoder-side MV derivation method for refining the motion information of a current coding unit (CU) or block by finding the closest match between a template in the current picture (i.e., the adjacent blocks above and / or to the left of the current CU) and a block of the same size as the template in a reference picture. Motion information may include spatial reference motion information (or spatial motion vectors) and / or temporal reference motion information (or temporal motion vectors). As shown in Figure 21, a better MV is searched around the initial MV of the current CU within a [-8,+8]-pel search range.
[0142] In AMVP mode, the MVP candidate with the smallest TM error between the current block template and the reference block template is selected on both the encoder and decoder sides. TM is then performed on this MVP candidate for MV refinement. In merge mode, a similar search method may be applied to merge candidates indicated by the merge index.
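The template matching search described above can be sketched in plain Python as follows. The function names, the one-sample template thickness, the use of SAD as the matching cost, and the exhaustive integer-sample search are assumptions made for this illustration; pictures are lists of rows of sample values.

```python
def template_positions(x, y, width, height, thickness=1):
    """(x, y) coordinates of the L-shaped template: rows above and columns
    to the left of the block whose top-left corner is at (x, y)."""
    above = [(x + i, y - j) for j in range(1, thickness + 1) for i in range(width)]
    left = [(x - j, y + i) for j in range(1, thickness + 1) for i in range(height)]
    return above + left


def template_sad(cur, ref, positions, mv):
    """SAD between the current template and the reference template at mv."""
    dx, dy = mv
    return sum(abs(cur[py][px] - ref[py + dy][px + dx]) for px, py in positions)


def refine_mv(cur, ref, block, init_mv, search_range=8):
    """Search around init_mv for the MV with the smallest template SAD."""
    x, y, width, height = block
    positions = template_positions(x, y, width, height)
    rows, cols = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            mv = (init_mv[0] + dx, init_mv[1] + dy)
            # Skip candidates whose template falls outside the reference picture.
            if any(not (0 <= px + mv[0] < cols and 0 <= py + mv[1] < rows)
                   for px, py in positions):
                continue
            cost = template_sad(cur, ref, positions, mv)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = mv, cost
    return best_mv, best_cost
```

Because the template consists only of already reconstructed neighboring samples, the same search can be run on both the encoder and decoder sides without signaling the refined MV.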
[0143] Decoder-side motion vector refinement (DMVR)
[0144] Figure 22 shows an example of decoder-side motion vector refinement. To improve the accuracy of the merge-mode MV, bilateral matching (BM)-based decoder-side motion vector refinement can be applied in VVC. In bi-prediction operation, a refined MV is searched around the initial MVs in reference picture list L0 and reference picture list L1. The BM method calculates the distortion between the two candidate blocks in reference picture list L0 and list L1. As shown in Figure 22, the SAD between the two blocks (indicated by MV0' and MV1') is calculated for each MV candidate around the initial MV. The MV candidate with the lowest SAD becomes the refined MV and is used to generate the bi-predicted signal.
[0145] The refined MV derived by the DMVR process is used to generate the inter-prediction samples for the current block and is also used in temporal motion vector prediction for future picture coding. Meanwhile, the original MV is used in the deblocking process and in spatial motion vector prediction for future CU coding within the current frame. In DMVR, the search points surround the initial MV, and the MV offset may obey the MV difference mirroring rule. In other words, any point checked by DMVR, denoted by a candidate MV pair (MV0', MV1'), may obey the following two equations:
$$MV0' = MV0 + MV_{offset} \tag{1}$$
$$MV1' = MV1 - MV_{offset} \tag{2}$$
Here, $MV_{offset}$ represents the refinement offset between the initial MV (MV0, MV1) and the refined MV (MV0', MV1') in one of the reference pictures. The refinement search range is two integer luma samples from the initial MV. The search includes an integer sample offset search stage and a fractional sample refinement stage.
[0146] In one embodiment, a 25-point full search is applied to the integer sample offset search. First, the SAD of the initial MV pair is calculated. If the SAD of the initial MV pair is smaller than the threshold, the integer sample stage of DMVR ends. Otherwise, the SAD of the remaining 24 points is calculated and checked in raster scan order. The point with the minimum SAD is selected as the output of the integer sample offset search stage. To reduce the penalty of the uncertainty of DMVR refinement, the original MV during the DMVR process can be favored. The SAD between the reference blocks referred to by the initial motion vectors is reduced by 1 / 4 of the SAD value.
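A minimal sketch of the integer sample offset search, combining the mirroring rule of Equations (1) and (2) with the 25-point full search, might look as follows. The callback `sad_fn`, the function names, and the choice to return only the offset are assumptions; the sketch models "reduced by 1 / 4" as subtracting a quarter of the initial SAD.

```python
def mirrored_pair(mv0, mv1, offset):
    """Apply the mirroring rule: MV0' = MV0 + offset, MV1' = MV1 - offset."""
    return ((mv0[0] + offset[0], mv0[1] + offset[1]),
            (mv1[0] - offset[0], mv1[1] - offset[1]))


def dmvr_integer_search(sad_fn, mv0, mv1, threshold, search_range=2):
    """Return the integer MV offset chosen by a 25-point full search.

    sad_fn(mv0, mv1) is a hypothetical callback returning the SAD between
    the L0 and L1 candidate blocks addressed by the two MVs.
    """
    initial_sad = sad_fn(mv0, mv1)
    if initial_sad < threshold:
        # Early termination: the initial MV pair is already good enough.
        return (0, 0)
    # Favor the original MV by reducing its SAD by 1/4 of its value.
    best_offset, best_cost = (0, 0), initial_sad - initial_sad / 4
    # Check the remaining 24 points in raster scan order.
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            if dx == 0 and dy == 0:
                continue
            cand0, cand1 = mirrored_pair(mv0, mv1, (dx, dy))
            cost = sad_fn(cand0, cand1)
            if cost < best_cost:
                best_offset, best_cost = (dx, dy), cost
    return best_offset
```

With a search range of two samples, the two nested loops visit 5 × 5 = 25 points including the center, which matches the 25-point search of the embodiment.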
[0147] Fractional sample refinement may follow the integer sample search. To reduce the computational complexity, the fractional sample refinement is derived by using a parametric error surface equation instead of an additional search with SAD comparison. In parametric error surface-based sub-pixel offset estimation, the cost at the center position and the costs at the four integer positions adjacent to the center are used to fit a 2D parabolic error surface equation of the following form:
$$E(x,y) = A(x - x_{\min})^2 + B(y - y_{\min})^2 + C \tag{3}$$
Here, $(x_{\min}, y_{\min})$ corresponds to the fractional position with the minimum cost, and $C$ corresponds to the minimum cost value. By solving the above equation using the cost values of the five search points, $(x_{\min}, y_{\min})$ is calculated as follows:
$$x_{\min} = \frac{E(-1,0) - E(1,0)}{2\,(E(-1,0) + E(1,0) - 2E(0,0))} \tag{4}$$
$$y_{\min} = \frac{E(0,-1) - E(0,1)}{2\,(E(0,-1) + E(0,1) - 2E(0,0))} \tag{5}$$
[0148] The values of $x_{\min}$ and $y_{\min}$ are automatically constrained to lie between -8 and 8, since all cost values are positive and the minimum value is E(0,0). This corresponds to a half-pixel offset with 1 / 16-pel MV precision in VVC. The calculated fractional offset $(x_{\min}, y_{\min})$ is added to the integer-distance refined MV to obtain the final refined delta MV with sub-pixel accuracy.
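Equations (4) and (5) can be evaluated directly from the five cost values. The sketch below uses floating-point arithmetic in units of integer samples and converts to 1 / 16-pel units separately; the function names and the zero-denominator guard are assumptions for illustration, and a real codec would use integer arithmetic.

```python
def error_surface_offset(e_center, e_left, e_right, e_up, e_down):
    """Fractional offset (x_min, y_min) from the parametric error surface.

    e_center = E(0,0), e_left = E(-1,0), e_right = E(1,0),
    e_up = E(0,-1), e_down = E(0,1). Result is in integer-sample units.
    """
    def axis_offset(e_neg, e_pos):
        denominator = 2 * (e_neg + e_pos - 2 * e_center)
        if denominator == 0:
            # Flat surface along this axis: keep the integer position.
            return 0.0
        return (e_neg - e_pos) / denominator

    return axis_offset(e_left, e_right), axis_offset(e_up, e_down)


def to_sixteenth_pel(offset):
    """Express an offset in integer samples in 1/16-pel units."""
    return int(round(16 * offset))
```

When E(0,0) is the smallest of the five costs, each component lies within half a sample of the center, which is the range of -8 to 8 in 1 / 16-pel units.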
[0149] Reference frame coding in AV1
[0150] In AV1, for each coded block in an inter frame, if the mode of the current block is an inter-coded mode and not skip mode, a flag may be signaled to indicate whether single-reference mode or compound-reference mode is used for the current block. In single-reference mode, a prediction block is generated by a single motion vector, while in compound-reference mode, a prediction block is generated by a weighted average of two prediction blocks derived from two motion vectors. In the single-reference case, a reference frame index with a value from 1 to 7 is signaled in the bitstream to indicate which reference frame is used for the current block. Reference frame indices 1 to 4 may be designated for reference frames that precede the current frame in display order, while indices 5 to 7 are for reference frames that follow the current frame in display order.
[0151] In the case of compound references, another flag may be signaled to indicate whether it is a unidirectional compound reference mode or a bidirectional compound reference mode. In unidirectional compound reference mode, both referenced frames are either before or after the current frame in display order. In bidirectional compound reference mode, one referenced frame is before the current frame and the other referenced frame is after the current frame in display order. In addition, the allowed combinations of unidirectional referenced frames are limited to only four possible pairs, namely (1,2), (1,3), (1,4), and (5,7), while in the bidirectional case, all 12 combinations are supported. The rationale behind this is that if the number of referenced frames on both sides of the current frame in display order is balanced, the bidirectional reference prediction is likely to provide a better prediction. When most referenced frames are on one side of the current frame, extrapolation including the closest one may be more relevant to the current frame.
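The allowed compound reference pairs described above can be enumerated as follows. The constant and function names are hypothetical, and the sketch assumes that the 12 bidirectional combinations are all pairings of one index from 1 to 4 with one index from 5 to 7.

```python
PAST_REFS = (1, 2, 3, 4)      # precede the current frame in display order
FUTURE_REFS = (5, 6, 7)       # follow the current frame in display order

# Only four unidirectional pairs are permitted.
UNIDIRECTIONAL_PAIRS = ((1, 2), (1, 3), (1, 4), (5, 7))

# Every past/future pairing is permitted: 4 x 3 = 12 combinations.
BIDIRECTIONAL_PAIRS = tuple((p, f) for p in PAST_REFS for f in FUTURE_REFS)


def is_allowed_compound_pair(ref_a, ref_b):
    """Check whether two reference frame indices form a permitted pair."""
    pair = (min(ref_a, ref_b), max(ref_a, ref_b))
    return pair in UNIDIRECTIONAL_PAIRS or pair in BIDIRECTIONAL_PAIRS
```

Under these assumptions there are 16 permitted pairs in total, so a pair such as (2, 3) or (5, 6) is not available in compound reference mode.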
[0152] Motion vector reference scheme in AV1 and CWG-B049
[0153] The bits for signaling motion vectors account for a significant portion of the total bitrate. Modern video codecs employ predictive coding for motion vectors and may use entropy coding to code the differences. Prediction accuracy has a significant impact on coding efficiency. AV1 employs a dynamic motion vector referencing scheme that retrieves candidate motion vectors from spatial and temporal neighbors and ranks them for efficient entropy coding. In one embodiment, for spatial motion vector referencing, a coding block searches its spatial neighbors in units of 8x8 luma samples to find those with the same reference frame index as the current block. In the case of composite inter-prediction mode, this may mean the same reference frame pair. Figure 23 shows an exemplary spatial motion vector search pattern. As shown in Figure 23, the search region includes three 8x8 block rows above the current block and three 8x8 block columns to the left of the current block, and the search order is indicated by the indices. In CWG-B049, the number of 8x8 block rows above can be reduced from 3 to 1, while the number of columns to the left remains 3.
[0154] In another embodiment, for temporal motion vector referencing, a motion trajectory between the current frame and a previously coded frame is first constructed by utilizing motion vectors from the previously coded frame, through either linear interpolation or extrapolation. Next, a motion field can be formed between the current frame and a given reference frame by extending the motion trajectory from the current frame toward the reference frame.
[0155] In the current design of the signaling method for reference frames, the order of reference frames in reference frame list 0 and reference frame list 1 is fixed for all blocks within a single frame, regardless of reference frames in adjacent blocks. Therefore, spatial and / or temporal correlations of reference frame selection are not utilized.
[0156] In composite reference mode, the two reference frames for a given motion vector pair have the same direction if the POCs of both reference frames are greater than the POC of the current frame or both are less than the POC of the current frame. Otherwise, if the POC of one reference frame is greater than the POC of the current frame and the POC of the other reference frame is smaller than the POC of the current frame, the directions of the two reference frames are different.
[0157] The proposed embodiments may be used separately or combined in any order. Furthermore, each of the methods (or embodiments), encoders, and decoders may be implemented by processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, the one or more processors execute a program stored in a non-transitory computer-readable medium. Hereinafter, the term "block" may be interpreted as a prediction block, a coding block, or a coding unit, i.e., CU. The direction of a reference frame may be determined by whether the reference frame is before or after the current frame in display order.
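The POC comparison above amounts to a two-line check. The function name is hypothetical, and a reference frame whose POC equals that of the current frame is treated here as matching neither direction, which is an assumption the text does not address.

```python
def same_direction(poc_ref_a, poc_ref_b, poc_current):
    """True if both reference frames lie on the same side of the current frame."""
    both_after = poc_ref_a > poc_current and poc_ref_b > poc_current
    both_before = poc_ref_a < poc_current and poc_ref_b < poc_current
    return both_after or both_before
```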
[0158] Figure 24 shows a flowchart of the method according to an exemplary embodiment of the present disclosure. For single or composite reference modes, a template matching (TM) method may be employed to sort the reference frames or reference frame pairs for each block. In block 2402, the template of the current block is compared with the template of a reference block. The comparison may concern the difference between the template of the current block (i.e., the adjacent blocks above and / or to the left) and the template of the reference block referenced by the spatial reference motion information (or spatial motion vector) and / or temporal reference motion information (or temporal motion vector). Based on this comparison, the difference between the templates is calculated in block 2404. The calculation may be based on the sum of absolute differences (SAD), the sum of squared differences (SSD), the mean squared error (MSE), or the sum of absolute transformed differences (SATD). In one embodiment, for each piece of spatial reference motion information and / or temporal reference motion information, the difference between the template of the current block and the template of the reference block is calculated and marked as the score value of the associated reference frame or reference frame pair, as shown in block 2406. In this way, a score is determined for each reference frame based on the calculated difference. Then, in block 2408, all available reference frames are ranked based on their score values. This ranking can be used to sort the frames in block 2410.
[0159] Although the description refers to reference frames, it may equally apply to reference frame pairs. In one embodiment, if multiple motion vectors point to one reference frame or one reference frame pair, the smallest difference value among those obtained for these motion vectors may be marked as the score value of that reference frame or reference frame pair. In another embodiment, if there are no motion vectors pointing to one reference frame or one reference frame pair, the score value of that reference frame or reference frame pair is marked as the maximum allowable value. In another embodiment, if multiple reference frames or multiple reference frame pairs have the same score value, the ranking order for these reference frames or reference frame pairs is the same as the scanning order of the spatial reference motion information and / or temporal reference motion information. In another embodiment, if multiple reference frames or multiple reference frame pairs have the same score value, the ranking order for these reference frames or reference frame pairs depends on the frequency of occurrence of these reference frames or reference frame pairs in the spatial reference motion information and / or temporal reference motion information.
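The scoring and ranking steps of blocks 2402 to 2410, together with the tie-breaking embodiments above, can be sketched as follows. The function names, the flat-list template representation, the use of SAD, and the use of infinity as the "maximum allowable value" are assumptions made for this illustration; a reference identifier may be a single frame index or a tuple representing a frame pair.

```python
MAX_SCORE = float("inf")  # stand-in for the "maximum allowable value"


def sad(template_a, template_b):
    """Sum of absolute differences between two equally sized templates."""
    return sum(abs(a - b) for a, b in zip(template_a, template_b))


def rank_reference_frames(current_template, candidates, available_refs,
                          tie_break="scan_order"):
    """Rank reference frames (or pairs) by template matching score.

    candidates lists (ref_id, reference_template) in the scanning order of
    the spatial / temporal reference motion information.
    """
    scores = {ref: MAX_SCORE for ref in available_refs}
    first_seen, frequency = {}, {}
    for position, (ref, ref_template) in enumerate(candidates):
        if ref not in scores:
            continue  # ignore motion information for unavailable references
        # Several MVs may point to one reference: keep the smallest difference.
        scores[ref] = min(scores[ref], sad(current_template, ref_template))
        first_seen.setdefault(ref, position)
        frequency[ref] = frequency.get(ref, 0) + 1

    def sort_key(ref):
        if tie_break == "frequency":
            secondary = -frequency.get(ref, 0)  # more frequent ranks first
        else:
            secondary = first_seen.get(ref, len(candidates))  # scan order
        return (scores[ref], secondary)

    return sorted(available_refs, key=sort_key)
```

In such a sketch, the value signaled for the current block would be the position of its reference frame (or pair) in the returned list, for example `ranking.index(chosen_ref)`, and references with lower template cost receive smaller indices.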
[0160] In one embodiment, all permitted unidirectional and bidirectional composite reference frame pairs are ranked together by using the TM method, and the index of the reference frame pair for the current block in this ranking order is signaled in the bitstream. In another embodiment, all permitted single reference frames are ranked together by using the TM method, and the index of the reference frame for the current block in this ranking order is signaled in the bitstream.
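The index signaling described here can be sketched as a round trip between encoder and decoder. The sketch assumes that both sides derive the same ranking from the same template scores, so that only the position in the ranking needs to be transmitted; the function names, frame labels, and score values are hypothetical, and the entropy coding of the index is omitted.

```python
from typing import Dict, List, Tuple

RefPair = Tuple[str, str]  # hypothetical labels for a reference frame pair


def rank_pairs(scores: Dict[RefPair, int]) -> List[RefPair]:
    """Rank all permitted pairs together by ascending TM score."""
    return sorted(scores, key=lambda pair: scores[pair])


def encode_pair_index(ranking: List[RefPair], chosen: RefPair) -> int:
    """Encoder side: the value to signal is the position in the ranking."""
    return ranking.index(chosen)


def decode_pair_index(ranking: List[RefPair], index: int) -> RefPair:
    """Decoder side: map the signaled index back to a reference frame pair."""
    return ranking[index]


# Illustrative scores only; in practice they would come from template matching.
pair_scores = {
    ("LAST", "ALTREF"): 120,
    ("LAST", "LAST2"): 45,
    ("GOLDEN", "ALTREF"): 80,
}
ranking = rank_pairs(pair_scores)
signaled = encode_pair_index(ranking, ("GOLDEN", "ALTREF"))
recovered = decode_pair_index(ranking, signaled)
print(signaled, recovered)
```

The same pattern applies to the single-reference embodiment by ranking individual frames instead of pairs. Placing well-matching candidates at small indices is presumably what allows the index to be coded with fewer bits, although the disclosure does not spell out the entropy coding here.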
[0161] In one embodiment, a method is provided for sorting reference frames for each block by template matching (TM), the method comprising the steps of: comparing the template of the current block with the template of a reference block for motion information, wherein the motion information includes spatial reference motion information or temporal reference motion information; calculating the difference between the template of the current block and the template of the reference block; determining the score value of the associated reference frame based on the calculated difference; and sorting the reference frames based on the determined score values. The reference frames further include reference frame pairs. TM includes decoder-side motion vector derivation for refining the motion information of the current block. The sorting step further includes ranking the available reference frames based on the score value of each of the reference frames. When the score values of multiple reference frames are equal, the ranking follows the scanning order of the spatial reference motion information or temporal reference motion information. Alternatively, when the score values of multiple reference frames are equal, the ranking is based on the frequency of occurrence of these reference frames in the spatial reference motion information or temporal reference motion information. The calculating step uses at least one of the following: sum of absolute differences (SAD), sum of squared differences (SSD), mean squared error (MSE), or sum of absolute transformed differences (SATD). The template includes adjacent blocks above or to the left. Spatial reference motion information includes one or more spatial motion vectors. Temporal reference motion information includes one or more temporal motion vectors. When multiple motion vectors point to one of the reference frames, the smallest difference among them is used to determine the score value. When no motion vector points to one of the reference frames, the score value is determined to be the maximum allowable value. Allowed unidirectional and bidirectional composite reference frames are ranked together by using TM, and the index of the reference frame for the current block is signaled in the bitstream. Allowed single reference frames are ranked together by using TM, and the index of the reference frame for the current block is signaled in the bitstream.
[0162] The embodiments in this disclosure may be used separately or combined in any order. Furthermore, each of the methods (or embodiments), encoders, and decoders may be implemented by processing circuitry (e.g., one or more processors or one or more integrated circuits). In one example, the one or more processors execute a program stored on a non-transitory computer-readable medium. The term "block" may refer to a prediction block, a coding block, or a coding unit (CU). The embodiments in this disclosure may be applied to luma blocks or chroma blocks.
[0163] The techniques described above can be implemented as computer software using computer-readable instructions and physically stored on one or more computer-readable media. For example, Figure 25 shows a computer system (2500) suitable for implementing a particular embodiment of the subject matter disclosed.
[0164] Computer software can be coded using any suitable machine code or computer language that may follow assembly, compilation, linking, or similar mechanisms to create code that contains instructions that can be executed directly by one or more computer central processing units (CPUs), graphics processing units (GPUs), etc., or through interpretation, microcode execution, etc.
[0165] Instructions can be executed on various types of computers or their components, including, for example, personal computers, tablet computers, servers, smartphones, gaming devices, and Internet of Things devices.
[0166] The components shown in Figure 25 for the computer system (2500) are illustrative in nature and are not intended to imply any limitation on the scope of use or functionality of computer software implementing embodiments of the present disclosure. Furthermore, the configuration of the components should not be construed as having any dependence or requirement on any one or combination of components shown in the exemplary embodiments of the computer system (2500).
[0167] The computer system (2500) may include certain human interface input devices. Such human interface input devices may respond to input from one or more human users, for example, through tactile input (keystrokes, swipes, data glove movements, etc.), audio input (voice, clapping, etc.), visual input (gestures, etc.), and olfactory input (not shown). The human interface devices may also be used to capture certain media that are not necessarily directly related to conscious human input, such as audio (speech, music, ambient sounds, etc.), images (scanned images, photographic images taken from still image cameras, etc.), and video (2D video, 3D video including stereoscopic video, etc.).
[0168] Input human interface devices may include one or more of the following (only one of each is shown): keyboard (2501), mouse (2502), trackpad (2503), touchscreen (2510), data glove (not shown), joystick (2505), microphone (2506), scanner (2507), and camera (2508).
[0169] The computer system (2500) may also include certain human interface output devices. Such human interface output devices may stimulate the senses of one or more human users, for example, through tactile output, sound, light, and smell / taste. Such human interface output devices may include tactile output devices (e.g., tactile feedback via a touchscreen (2510), data glove (not shown), or joystick (2505), although there may also be tactile feedback devices that do not function as input devices), audio output devices (e.g., speakers (2509), headphones (not shown)), visual output devices (e.g., screens (2510), including CRT screens, LCD screens, plasma screens, and OLED screens, each with or without touchscreen input functionality and each with or without tactile feedback functionality, some of which may be capable of two-dimensional visual output or more than three-dimensional output via means such as stereographic output; virtual reality glasses (not shown); holographic displays; and smoke tanks (not shown)), and printers (not shown).
[0170] The computer system (2500) may also include human-accessible storage devices and their associated media, such as optical media including CD / DVD ROM / RW (2520) with media such as CD / DVD (2521), thumb drives (2522), removable hard drives or solid-state drives (2523), legacy magnetic media such as tapes and floppy disks (not shown), and dedicated ROM / ASIC / PLD-based devices such as security dongles (not shown).
[0171] Those skilled in the art should also understand that the term “computer-readable medium” as used in connection with the presently disclosed subject matter does not encompass transmission media, carrier waves, or other transitory signals.
[0172] The computer system (2500) may also include an interface (2554) to one or more communication networks (2555). The networks may be, for example, wireless, wireline, or optical. Networks may further be local, wide-area, metropolitan, automotive and industrial, real-time, latency-tolerant, and so on. Examples of networks include local area networks such as Ethernet® and wireless LANs; cellular networks including GSM®, 3G, 4G, 5G, and LTE; TV wireline or wireless wide-area digital networks including cable TV, satellite TV, and terrestrial broadcast TV; and automotive and industrial networks including CAN bus. Certain networks generally require an external network interface adapter attached to a specific general-purpose data port or peripheral bus (2549) (e.g., a USB port on the computer system (2500)), while others are generally integrated into the core of the computer system (2500) by attachment to a system bus, as described below (e.g., an Ethernet interface in a PC computer system or a cellular network interface in a smartphone computer system). Using any of these networks, the computer system (2500) can communicate with other entities. Such communication may be unidirectional and receive-only (e.g., broadcast television), unidirectional and transmit-only (e.g., CAN bus to specific CAN bus devices), or bidirectional, for example, to other computer systems using local or wide-area digital networks. Specific protocols and protocol stacks may be used on each of these networks and network interfaces, as described above.
[0173] The aforementioned human interface devices, human-accessible memory devices, and network interfaces can be attached to the core (2540) of the computer system (2500).
[0174] The core (2540) may include one or more central processing units (CPUs) (2541), graphics processing units (GPUs) (2542), dedicated programmable processing units in the form of field-programmable gate arrays (FPGAs) (2543), hardware accelerators for specific tasks (2544), graphics adapters (2550), and the like. These devices, along with read-only memory (ROM) (2545), random access memory (RAM) (2546), and internal mass storage devices such as internal hard drives and SSDs (2547) that are not accessible to the user, may be connected via a system bus (2548). In some computer systems, the system bus (2548) may be accessible in the form of one or more physical plugs to allow expansion with additional CPUs, GPUs, etc. Peripheral devices may be connected directly to the core's system bus (2548) or via a peripheral bus (2549). For example, a screen (2510) may be connected to a graphics adapter (2550). Peripheral bus architectures include PCI, USB, and others.
[0175] The CPU (2541), GPU (2542), FPGA (2543), and accelerator (2544) can, in combination, execute certain instructions that constitute the aforementioned computer code. This computer code can be stored in ROM (2545) or RAM (2546). Temporary data can also be stored in RAM (2546), while persistent data can be stored, for example, in internal mass storage (2547). Fast storage and retrieval to any of the memory devices may be enabled through the use of cache memory, which may be closely associated with one or more CPUs (2541), GPUs (2542), mass storage (2547), ROM (2545), RAM (2546), etc.
[0176] A computer-readable medium may contain computer code for performing various computer implementation operations. The medium and computer code may be specifically designed and constructed for the purposes of this disclosure, or they may be of a type that is well known and available to those skilled in the computer software art.
[0177] As a non-limiting example, a computer system having the architecture (2500), and specifically the core (2540), can provide functionality as a result of one or more processors (including CPUs, GPUs, FPGAs, accelerators, etc.) executing software embodied in one or more tangible computer-readable media. Such computer-readable media may be not only media associated with user-accessible mass storage devices as described above, but also specific storage devices of the core (2540) that are non-transitory in nature, such as mass storage devices (2547) or ROM (2545) within the core. Software implementing various embodiments of this disclosure may be stored in such devices and executed by the core (2540). The computer-readable media may include one or more memory devices or chips, depending on the specific needs. The software may cause the core (2540), and specifically the processors within it (including CPUs, GPUs, FPGAs, etc.), to execute specific processes or specific parts of specific processes described herein, including defining data structures stored in RAM (2546) and modifying such data structures according to processes defined by the software. In addition, or as an alternative, a computer system may provide functionality as a result of logic hardwired or otherwise embodied in circuits (e.g., accelerators (2544)) that can operate in place of or together with software to perform a particular process or a particular part of a particular process described herein. References to software may, as appropriate, include logic, and vice versa. References to computer-readable media may, where appropriate, include circuits that store software for execution (such as integrated circuits (ICs)), circuits that embody logic for execution, or both. This disclosure encompasses any appropriate combination of hardware and software.
[0178] While this disclosure has described several exemplary embodiments, there are many modifications, substitutions, and alternative equivalents that fall within the scope of this disclosure. Therefore, those skilled in the art will understand that numerous systems and methods not expressly shown or described herein can be devised to embody the principles of this disclosure and thus fall within the spirit and scope of this disclosure.
Appendix A: Acronyms
ALF: Adaptive Loop Filter
AMVP: Advanced Motion Vector Prediction
APS: Adaptation Parameter Set
ASIC: Application-Specific Integrated Circuit
AV1: AOMedia Video 1
AV2: AOMedia Video 2
BCW: Bi-prediction with CU-level Weights
BM: Bilateral Matching
BMS: Benchmark Set
CANBus: Controller Area Network Bus
CC-ALF: Cross-Component Adaptive Loop Filter
CCSO: Cross-Component Sample Offset
CD: Compact Disc
CDEF: Constrained Directional Enhancement Filter
CDF: Cumulative Density Function
CfL: Chroma from Luma
CIIP: Combined Intra-Inter Prediction
CPUs: Central Processing Units
CRT: Cathode Ray Tube
CTBs: Coding Tree Blocks
CTU: Coding Tree Unit
CTUs: Coding Tree Units
CU: Coding Unit
DMVR: Decoder-side Motion Vector Refinement
DPB: Decoded Picture Buffer
DPS: Decoding Parameter Set
DVD: Digital Video Disc
FPGA: Field Programmable Gate Array
GBI: Generalized Bi-prediction
GOPs: Groups of Pictures
GPUs: Graphics Processing Units
GSM: Global System for Mobile communications
HDR: High Dynamic Range
HEVC: High Efficiency Video Coding
HRD: Hypothetical Reference Decoder
IBC (or IntraBC): Intra Block Copy
IC: Integrated Circuit
ISP: Intra Sub-Partitions
JEM: Joint Exploration Model
JVET: Joint Video Exploration Team
LAN: Local Area Network
LCD: Liquid-Crystal Display
LR: Loop Restoration Filter
LSO: Local Sample Offset
LTE: Long-Term Evolution
MMVD: Merge Mode with Motion Vector Difference
MPM: Most Probable Mode
MV: Motion Vector
MVD: Motion Vector Difference
MVP: Motion Vector Predictor
OLED: Organic Light-Emitting Diode
PBs: Prediction Blocks
PCI: Peripheral Component Interconnect
PDPC: Position Dependent Prediction Combination
PLD: Programmable Logic Device
POC: Picture Order Count
PPS: Picture Parameter Set
PU: Prediction Unit
PUs: Prediction Units
RAM: Random Access Memory
ROM: Read-Only Memory
RPS: Reference Picture Set
SAD: Sum of Absolute Differences
SAO: Sample Adaptive Offset
SB: Super Block
SCC: Screen Content Coding
SDP: Semi-Decoupled Partitioning
SDR: Standard Dynamic Range
SDT: Semi-Decoupled Tree
SEI: Supplemental Enhancement Information
SNR: Signal-to-Noise Ratio
SPS: Sequence Parameter Set
SSD: Solid-State Drive
SST: Semi-Separate Tree
TM: Template Matching
TU: Transform Unit
TUs: Transform Units
USB: Universal Serial Bus
VPS: Video Parameter Set
VUI: Video Usability Information
VVC: Versatile Video Coding
WAIP: Wide-Angle Intra Prediction
Claims
1. A method performed by an encoder for sorting reference frames for each block by template matching (TM), the method comprising: comparing a template of a current block with a template of a reference block for motion information, wherein the motion information includes spatial reference motion information or temporal reference motion information; calculating a difference between the template of the current block and the template of the reference block; determining a score value of the associated reference frame based on the calculated difference; sorting the reference frames based on the determined score values; and signaling, in a bitstream, an index indicating the reference frame for the current block.
2. The method according to claim 1, wherein the reference frame further comprises a pair of reference frames.
3. The method according to claim 1, wherein TM includes deriving a decoder-side motion vector for refining the motion information of the current block.
4. The method according to claim 1, wherein the sorting step further comprises: ranking the available reference frames based on the score value of each of the reference frames.
5. The method according to claim 4, wherein when the score values of multiple reference frames are equivalent, the ranking step corresponds to the scanning order of the spatial reference motion information or the temporal reference motion information.
6. The method according to claim 4, wherein when the score values of multiple reference frames are equivalent, the ranking step is based on the frequency of occurrence of these reference frames used in the spatial reference motion information or the temporal reference motion information.
7. The method according to claim 1, wherein the calculating step includes at least one of the following: sum of absolute differences (SAD), sum of squared differences (SSD), mean squared error (MSE), or sum of absolute transformed differences (SATD).
8. The method according to claim 1, wherein the template includes an adjacent block above or to the left.
9. The method according to claim 1, wherein the spatial reference motion information includes one or more spatial motion vectors.
10. The method according to claim 1, wherein the temporal reference motion information includes one or more temporal motion vectors.
11. The method according to claim 1, wherein when multiple motion vectors point to one of the reference frames, the motion vector having the smallest difference is used to determine the score value.
12. The method according to claim 1, wherein when there is no motion vector pointing to one of the reference frames, the score value is determined to be the maximum allowable value.
13. The method according to claim 1, wherein permitted unidirectional and bidirectional composite reference frames are ranked together by using the TM.
14. The method according to claim 1, wherein the allowed single reference frames are ranked together by using the TM.
15. A device for encoding a video bitstream, comprising: a memory for storing instructions; and a processor in communication with the memory, wherein, when the processor executes the instructions, the processor causes the device to perform the method according to any one of claims 1 to 14.
16. A computer program that causes a computer to perform the method described in any one of claims 1 to 14.