Bitstream Generator
The bitstream generation device improves compression efficiency and reduces processing load by using fractional pixel precision interpolation and motion correction values, addressing the limitations of existing video encoding standards like HEVC.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA
- Filing Date
- 2026-04-21
- Publication Date
- 2026-07-02
Abstract
Description
Technical Field
[0001] The present disclosure relates to encoding and decoding of images using inter prediction.
Background Art
[0002] A video encoding standard called HEVC (High-Efficiency Video Coding) has been standardized by JCT-VC (Joint Collaborative Team on Video Coding).
Prior Art Documents
Non-Patent Documents
[0003]
Non-Patent Document 1
Summary of the Invention
Problems to be Solved by the Invention
[0004] In such encoding and decoding techniques, further improvement in compression efficiency and reduction in processing load are required.
[0005] Therefore, the present disclosure provides a bitstream generation device capable of achieving further improvement in compression efficiency and reduction in processing load.
Means for Solving the Problems
[0006] A bitstream generation device according to one aspect of the present disclosure generates a bitstream and comprises a processor and a memory. Using the memory, the processor generates a bitstream containing motion information indicating a reference picture used in inter prediction. The inter prediction includes: obtaining two predicted images by interpolating to fractional pixel precision using two reference pictures associated with a block to be encoded, which is included in a picture to be encoded, for bidirectional prediction; obtaining, using the pixel values of multiple first pixels included in the two predicted images, multiple vertical gradient values corresponding to multiple second pixels included in a subblock obtained by dividing the block to be encoded; deriving a motion correction value for the subblock based on the multiple vertical gradient values; and, as the final step of the inter prediction that uses the multiple vertical gradient values, generating an output predicted image corresponding to the subblock using the motion correction value for the subblock. The two predicted images are identified using two motion vectors. The reference range for the interpolation is included in the normal reference range, that is, the range referenced to obtain a fractional pixel precision predicted image corresponding to the block to be encoded in normal inter prediction that does not use the multiple vertical gradient values. An 8-tap filter is used in the interpolation to fractional pixel precision.
[0007] These general or specific aspects may be implemented as a system, method, integrated circuit, computer program, or recording medium such as a computer-readable CD-ROM, or as any combination of a system, method, integrated circuit, computer program, and recording medium.
Effects of the Invention
[0008] This disclosure can provide an encoding device, a decoding device, an encoding method, or a decoding method that can achieve further improvements in compression efficiency and reductions in processing load.
Brief Description of the Drawings
[0009]
[Figure 1] Figure 1 is a block diagram showing the functional configuration of the encoding device according to Embodiment 1.
[Figure 2] Figure 2 shows an example of block division in Embodiment 1.
[Figure 3] Figure 3 is a table showing the transformation basis functions corresponding to each transformation type.
[Figure 4A] Figure 4A shows an example of the filter shape used in ALF.
[Figure 4B] Figure 4B shows another example of the filter shape used in ALF.
[Figure 4C] Figure 4C shows another example of the filter shape used in ALF.
[Figure 5A] Figure 5A shows the 67 intra-prediction modes in intra-prediction.
[Figure 5B] Figure 5B is a flowchart illustrating an overview of the predicted image correction process using OBMC processing.
[Figure 5C] Figure 5C is a conceptual diagram illustrating an overview of the predicted image correction process using OBMC processing.
[Figure 5D] Figure 5D shows an example of FRUC.
[Figure 6] Figure 6 illustrates pattern matching (bilateral matching) between two blocks along a motion trajectory.
[Figure 7] Figure 7 illustrates pattern matching (template matching) between a template in the current picture and a block in the reference picture.
[Figure 8] Figure 8 is a diagram illustrating a model that assumes uniform linear motion.
[Figure 9A] Figure 9A is a diagram illustrating the derivation of subblock-level motion vectors based on the motion vectors of multiple adjacent blocks.
[Figure 9B] Figure 9B is a diagram illustrating an overview of the motion vector derivation process using merge mode.
[Figure 9C] Figure 9C is a conceptual diagram illustrating an overview of the DMVR process.
[Figure 9D] Figure 9D is a diagram illustrating an overview of a predicted image generation method using luminance correction by the LIC process.
[Figure 10] Figure 10 is a block diagram showing the functional configuration of the decoding device according to Embodiment 1.
[Figure 11] Figure 11 is a flowchart showing inter prediction in Embodiment 2.
[Figure 12] Figure 12 is a conceptual diagram illustrating inter prediction in Embodiment 2.
[Figure 13] Figure 13 is a conceptual diagram illustrating an example of the reference range of the motion compensation filter and the gradient filter in Embodiment 2.
[Figure 14] Figure 14 is a conceptual diagram illustrating an example of the reference range of the motion compensation filter in Modification 1 of Embodiment 2.
[Figure 15] Figure 15 is a conceptual diagram illustrating an example of the reference range of the gradient filter in Modification 1 of Embodiment 2.
[Figure 16] Figure 16 is a diagram showing an example of the pattern of pixels referred to in the derivation of the local motion estimation value in Modification 2 of Embodiment 2.
[Figure 17] Figure 17 is an overall configuration diagram of a content supply system that realizes a content distribution service.
[Figure 18] Figure 18 is a diagram showing an example of an encoding structure during scalable encoding.
[Figure 19] Figure 19 is a diagram showing an example of an encoding structure during scalable encoding.
[Figure 20] Figure 20 is a diagram showing an example of a display screen of a web page.
[Figure 21] Figure 21 is a diagram showing an example of a display screen of a web page.
[Figure 22] Figure 22 is a diagram showing an example of a smartphone.
[Figure 23] Figure 23 is a block diagram showing a configuration example of a smartphone.
Modes for Carrying Out the Invention
[0010] The embodiments will be described in detail below with reference to the drawings.
[0011] The embodiments described below are all comprehensive or specific examples. The numerical values, shapes, materials, components, arrangement and connection configurations of components, steps, and the order of steps shown in the following embodiments are examples only and are not intended to limit the scope of the claims. Furthermore, among the components in the following embodiments, those not described in the independent claim representing the highest-level concept will be described as optional components.
[0012] (Embodiment 1) First, an overview of Embodiment 1 will be given as an example of an encoding and decoding device to which the processes and / or configurations described in each aspect of this disclosure, described later, can be applied. However, Embodiment 1 is merely an example of an encoding and decoding device to which the processes and / or configurations described in each aspect of this disclosure can be applied, and the processes and / or configurations described in each aspect of this disclosure can also be implemented in encoding and decoding devices different from Embodiment 1.
[0013] When applying the processes and / or configurations described in each aspect of this disclosure to Embodiment 1, for example, one of the following may be performed:
[0014]
(1) With respect to the encoding or decoding device of Embodiment 1, among the multiple components constituting the device, replacing the component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(2) With respect to the encoding or decoding device of Embodiment 1, making any modification, such as adding, replacing, or deleting functions or processes, to some of the multiple components constituting the device, and then replacing the component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(3) With respect to the method performed by the encoding or decoding device of Embodiment 1, adding processes and / or making any modification, such as replacing or deleting, to some of the multiple processes included in the method, and then replacing the process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(4) Combining some of the multiple components constituting the encoding or decoding device of Embodiment 1 with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(5) Combining a component that has some of the functions of some of the multiple components constituting the encoding or decoding device of Embodiment 1, or a component that performs some of the processing performed by some of those components, with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(6) With respect to the method performed by the encoding or decoding device of Embodiment 1, among the multiple processes included in the method, replacing the process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(7) Performing some of the multiple processes included in the method performed by the encoding or decoding device of Embodiment 1 in combination with a process described in each aspect of this disclosure.
[0015] The methods of implementing the processes and / or configurations described in each aspect of this disclosure are not limited to the examples above. For example, they may be implemented in a device used for a purpose other than the video / image encoding device or video / image decoding device disclosed in Embodiment 1, or the processes and / or configurations described in each embodiment may be implemented individually. Furthermore, the processes and / or configurations described in different embodiments may be implemented in combination.
[0016] [Overview of the Encoding Device] First, an overview of the encoding device according to Embodiment 1 will be described. Figure 1 is a block diagram showing the functional configuration of the encoding device 100 according to Embodiment 1. The encoding device 100 is a video / image encoding device that encodes video / images in block units.
[0017] As shown in Figure 1, the encoding device 100 encodes an image in block units and comprises a splitting unit 102, a subtraction unit 104, a transformation unit 106, a quantization unit 108, an entropy encoding unit 110, an inverse quantization unit 112, an inverse transformation unit 114, an addition unit 116, a block memory 118, a loop filter unit 120, a frame memory 122, an intra prediction unit 124, an inter prediction unit 126, and a prediction control unit 128.
[0018] The encoding device 100 can be implemented, for example, by a general-purpose processor and memory. In this case, when a software program stored in the memory is executed by the processor, the processor functions as the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128. Alternatively, the encoding device 100 may be implemented as one or more dedicated electronic circuits corresponding to each of these units.
[0019] The following describes each component included in the encoding device 100.
[0020] [Splitting Unit] The splitting unit 102 divides each picture contained in the input video into multiple blocks and outputs each block to the subtraction unit 104. For example, the splitting unit 102 first divides the picture into blocks of a fixed size (e.g., 128x128). These fixed-size blocks are sometimes called coding tree units (CTUs). Then, based on recursive quadtree and / or binary tree block partitioning, the splitting unit 102 divides each of the fixed-size blocks into blocks of a variable size (e.g., 64x64 or less). These variable-size blocks are sometimes called coding units (CUs), prediction units (PUs), or transform units (TUs). In this embodiment, CUs, PUs, and TUs do not need to be distinguished, and some or all of the blocks in the picture may serve as the processing units for CUs, PUs, and TUs.
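As a rough illustration of the first splitting step, the sketch below tiles a picture into fixed-size blocks. The function name `ctu_grid`, the 1920x1080 picture size, and the clipping of blocks at the right and bottom picture edges are assumptions made for this example; the text above only specifies the fixed block size (e.g., 128x128).

```python
def ctu_grid(width, height, ctu_size=128):
    """Return (x, y, w, h) for each fixed-size block, in raster order.

    Blocks on the right/bottom edge are clipped to the picture
    (an assumption; the text does not describe edge handling).
    """
    blocks = []
    for y in range(0, height, ctu_size):
        for x in range(0, width, ctu_size):
            w = min(ctu_size, width - x)
            h = min(ctu_size, height - y)
            blocks.append((x, y, w, h))
    return blocks


# Hypothetical picture size chosen for illustration only.
ctus = ctu_grid(1920, 1080)
print(len(ctus), ctus[0], ctus[-1])
```

Each returned block would then be handed to the recursive variable-size partitioning described above.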
[0021] Figure 2 shows an example of block partitioning in Embodiment 1. In Figure 2, solid lines represent block boundaries due to quadtree block partitioning, and dashed lines represent block boundaries due to binary tree block partitioning.
[0022] Here, block 10 is a 128x128 pixel square block (128x128 block). This 128x128 block 10 is first divided into four 64x64 square blocks (quadtree block partitioning).
[0023] The top-left 64x64 block is further divided vertically into two rectangular 32x64 blocks, and the left 32x64 block is further divided vertically into two rectangular 16x64 blocks (binary tree block partitioning). As a result, the top-left 64x64 block is divided into two 16x64 blocks 11 and 12 and a 32x64 block 13.
[0024] The 64x64 block in the upper right is horizontally divided into two rectangular 64x32 blocks, 14 and 15 (binary tree block division).
[0025] The bottom-left 64x64 block is divided into four square 32x32 blocks (quadtree block partitioning). Of the four 32x32 blocks, the top-left and bottom-right blocks are further divided. The top-left 32x32 block is vertically divided into two rectangular 16x32 blocks, and the right 16x32 block is further horizontally divided into two 16x16 blocks (binary tree block partitioning). The bottom-right 32x32 block is horizontally divided into two 32x16 blocks (binary tree block partitioning). As a result, the bottom-left 64x64 block is divided into 16x32 block 16, two 16x16 blocks 17 and 18, two 32x32 blocks 19 and 20, and two 32x16 blocks 21 and 22.
[0026] The bottom-right 64x64 block 23 is not divided.
[0027] As described above, in Figure 2, block 10 is divided into 13 variable-sized blocks 11-23 based on recursive quad-tree and binary tree block partitioning. Such partitioning is sometimes called QTBT (quad-tree plus binary tree) partitioning.
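The partitioning of Figure 2 described above can be reproduced with a small recursive sketch. The tree encoding used here (`None` for a leaf, `"Q"`, `"V"`, `"H"` for quadtree, vertical binary, and horizontal binary splits) and the names `split` and `FIGURE_2` are hypothetical conventions chosen for this example, not part of the disclosure.

```python
def split(x, y, w, h, tree):
    """Expand a partition tree into leaf blocks (x, y, w, h).

    tree is None for a leaf, or (mode, children) where mode is
    "Q" (quadtree), "V" (binary split by a vertical line) or
    "H" (binary split by a horizontal line).
    """
    if tree is None:
        return [(x, y, w, h)]
    mode, children = tree
    if mode == "Q":
        hw, hh = w // 2, h // 2
        parts = [(x, y, hw, hh), (x + hw, y, hw, hh),
                 (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    elif mode == "V":
        parts = [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
    elif mode == "H":
        parts = [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    else:
        raise ValueError("unknown split mode: %r" % (mode,))
    leaves = []
    for (px, py, pw, ph), child in zip(parts, children):
        leaves += split(px, py, pw, ph, child)
    return leaves


# Partition tree for block 10 as described for Figure 2.
FIGURE_2 = ("Q", [
    ("V", [("V", [None, None]), None]),          # top-left: blocks 11, 12, 13
    ("H", [None, None]),                         # top-right: blocks 14, 15
    ("Q", [("V", [None, ("H", [None, None])]),   # bottom-left: blocks 16, 17, 18
           None, None,                           # blocks 19, 20
           ("H", [None, None])]),                # blocks 21, 22
    None,                                        # bottom-right: block 23
])

leaves = split(0, 0, 128, 128, FIGURE_2)
sizes = [(w, h) for _, _, w, h in leaves]
print(len(leaves), sizes)
```

The leaves come out in the order of blocks 11 to 23, and their areas sum to the area of the 128x128 block, which is a quick check that the partition covers the block without overlap.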
[0028] In Figure 2, one block is divided into four or two blocks (quadtree or binary tree block partitioning), but the partitioning is not limited to these. For example, one block may be divided into three blocks (ternary tree block partitioning). Partitioning that includes such ternary tree block partitioning is sometimes called MBT (multi-type tree) partitioning.
[0029] [Subtraction Unit] The subtraction unit 104 subtracts the predicted signal (predicted sample) from the original signal (original sample) in units of the blocks divided by the splitting unit 102. In other words, the subtraction unit 104 calculates the prediction error (also called the residual) of the block to be encoded (hereinafter referred to as the current block). The subtraction unit 104 then outputs the calculated prediction error to the transformation unit 106.
[0030] The original signal is the input signal to the encoding device 100, and is a signal representing the image of each picture that makes up the moving image (for example, a luminance (luma) signal and two color difference (chroma) signals). In the following, a signal representing an image may also be called a sample.
[0031] [Transformation Unit] The transformation unit 106 transforms the prediction error in the spatial domain into transformation coefficients in the frequency domain and outputs the transformation coefficients to the quantization unit 108. Specifically, the transformation unit 106 performs, for example, a predetermined discrete cosine transform (DCT) or discrete sine transform (DST) on the prediction error in the spatial domain.
[0032] The transformation unit 106 may also adaptively select a transformation type from among several transformation types and use a transformation basis function corresponding to the selected transformation type to convert the prediction error into transformation coefficients. Such a transformation is sometimes called an EMT (explicit multiple core transform) or an AMT (adaptive multiple transform).
[0033] Multiple transformation types include, for example, DCT-II, DCT-V, DCT-VIII, DST-I, and DST-VII. Figure 3 is a table showing the transformation basis functions corresponding to each transformation type. In Figure 3, N represents the number of input pixels. The selection of a transformation type from among these multiple transformation types may depend, for example, on the type of prediction (intra-prediction and inter-prediction) or on the intra-prediction mode.
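To make the idea of a transformation basis function concrete, the following sketch builds two of the listed transformation types for N input samples. Because the table of Figure 3 is not reproduced in this text, the formulas below are the commonly published definitions of DCT-II and DST-VII and are only assumed to match the figure; the helper names `dct2_basis`, `dst7_basis`, `transform`, and `inverse` are hypothetical.

```python
import math


def dct2_basis(n):
    """DCT-II: T_i(j) = w_i * sqrt(2/N) * cos(pi * i * (2j + 1) / (2N)),
    with w_0 = sqrt(1/2) and w_i = 1 otherwise."""
    rows = []
    for i in range(n):
        w = math.sqrt(0.5) if i == 0 else 1.0
        rows.append([w * math.sqrt(2.0 / n)
                     * math.cos(math.pi * i * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows


def dst7_basis(n):
    """DST-VII: T_i(j) = sqrt(4/(2N+1)) * sin(pi * (2i + 1) * (j + 1) / (2N + 1))."""
    scale = math.sqrt(4.0 / (2 * n + 1))
    return [[scale * math.sin(math.pi * (2 * i + 1) * (j + 1) / (2 * n + 1))
             for j in range(n)]
            for i in range(n)]


def transform(basis, samples):
    """Project the samples onto each basis function."""
    return [sum(row[j] * samples[j] for j in range(len(samples))) for row in basis]


def inverse(basis, coeffs):
    """Both bases are orthonormal, so the inverse is the transpose."""
    n = len(coeffs)
    return [sum(basis[i][j] * coeffs[i] for i in range(n)) for j in range(n)]


residual = [3.0, 5.0, 1.0, 7.0]   # hypothetical prediction error samples
dct_coeffs = transform(dct2_basis(4), residual)
dst_coeffs = transform(dst7_basis(4), residual)
```

An encoder that selects adaptively among transformation types would, under this sketch, compare the coefficients produced by each basis and pick the type that is cheapest to encode.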
[0034] Information indicating whether or not to apply such EMT or AMT (e.g., called an AMT flag) and information indicating the selected transformation type are signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).
[0035] Furthermore, the transformation unit 106 may retransform the transformation coefficients (transformation results). Such retransformation is sometimes called AST (adaptive secondary transform) or NSST (non-separable secondary transform). For example, the transformation unit 106 performs retransformation for each subblock (e.g., 4x4 subblock) contained in the block of transformation coefficients corresponding to the intra-prediction error. Information indicating whether or not to apply NSST and information regarding the transformation matrix used for NSST are signaled at the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).
[0036] Here, a separable transformation is a method in which the input is transformed separately along each direction, as many times as the input has dimensions, while a non-separable transformation is a method in which, when the input is multidimensional, two or more dimensions are treated together as one dimension and transformed at once.
[0037] For example, one example of a non-separable transformation is to treat a 4x4 block as a single array with 16 elements and then perform a transformation on that array using a 16x16 transformation matrix.
[0038] Similarly, the Hypercube Givens Transform, which treats a 4x4 input block as a single array with 16 elements and then performs multiple Givens rotations on that array, is another example of a non-separable transformation.
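The two examples above can be sketched as follows. The identity matrix stands in for a real 16x16 transformation matrix, and the choice of element pair and rotation angle for the Givens rotation is arbitrary; both are placeholders for illustration, since the text does not specify the matrix entries or how rotations are arranged in the Hypercube Givens Transform. All function names are hypothetical.

```python
import math


def flatten(block):
    """Treat a 4x4 block as a single 16-element array (row-major order)."""
    return [v for row in block for v in row]


def non_separable_transform(block, matrix):
    """Multiply the flattened block by a 16x16 transformation matrix."""
    vec = flatten(block)
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def givens_rotate(vec, p, q, theta):
    """Rotate elements p and q of vec by angle theta; other elements are unchanged."""
    c, s = math.cos(theta), math.sin(theta)
    out = list(vec)
    out[p] = c * vec[p] - s * vec[q]
    out[q] = s * vec[p] + c * vec[q]
    return out


block = [[r * 4 + c for c in range(4)] for r in range(4)]
identity = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]

coeffs = non_separable_transform(block, identity)    # placeholder matrix only
rotated = givens_rotate(coeffs, 0, 5, math.pi / 6)   # one rotation on an arbitrary pair
```

A single Givens rotation mixes only two elements and preserves the total energy of the array, which is why a sequence of such rotations can stand in for a full matrix multiplication.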
[0039] [Quantization Unit] The quantization unit 108 quantizes the transformation coefficients output from the transformation unit 106. Specifically, the quantization unit 108 scans the transformation coefficients of the current block in a predetermined scanning order and quantizes each transformation coefficient based on the quantization parameter (QP) corresponding to the scanned transformation coefficient. The quantization unit 108 then outputs the quantized transformation coefficients of the current block (hereinafter referred to as quantization coefficients) to the entropy encoding unit 110 and the inverse quantization unit 112.
[0040] The predetermined scanning order is the order for quantization / inverse quantization of the transformation coefficients. For example, the predetermined scanning order is defined as ascending frequency (from low frequency to high frequency) or descending frequency (from high frequency to low frequency).
[0041] Quantization parameters are parameters that define the quantization step (quantization width). For example, if the value of the quantization parameter increases, the quantization step also increases. In other words, if the value of the quantization parameter increases, the quantization error increases.
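The relationship between the quantization parameter, the quantization step, and the quantization error can be sketched as below. The mapping from QP to step size used here (the step doubles every 6 QP values, as in H.264 and HEVC) is an assumption brought in for illustration; the text above only states that the step grows with the quantization parameter. The function names and sample coefficients are hypothetical.

```python
def quant_step(qp):
    """Assumed mapping: the step doubles every 6 QP values (HEVC-style)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeff, qp):
    """Map a transformation coefficient to an integer level."""
    return round(coeff / quant_step(qp))


def dequantize(level, qp):
    """Scale the level back; the result generally differs from the input."""
    return level * quant_step(qp)


coeffs = [100.0, -37.5, 12.0, 3.0]   # hypothetical transformation coefficients
for qp in (22, 28, 34):
    levels = [quantize(c, qp) for c in coeffs]
    restored = [dequantize(l, qp) for l in levels]
    worst = max(abs(c - r) for c, r in zip(coeffs, restored))
    print(qp, quant_step(qp), levels, worst)
```

With rounding to the nearest level, the error of each coefficient is bounded by half a step, so a larger quantization parameter permits a larger quantization error.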
[0042] [Entropy Encoding Unit] The entropy encoding unit 110 generates an encoded signal (encoded bitstream) by variable-length encoding the quantization coefficients input from the quantization unit 108. Specifically, the entropy encoding unit 110, for example, binarizes the quantization coefficients and arithmetically encodes the binary signal.
[0043] [Inverse Quantization Unit] The inverse quantization unit 112 inversely quantizes the quantization coefficients input from the quantization unit 108. Specifically, the inverse quantization unit 112 inversely quantizes the quantization coefficients of the current block in a predetermined scanning order. Then, the inverse quantization unit 112 outputs the inversely quantized transformation coefficients of the current block to the inverse transformation unit 114.
[0044] [Inverse Transformation Unit] The inverse transformation unit 114 restores the prediction error by inversely transforming the transformation coefficients input from the inverse quantization unit 112. Specifically, the inverse transformation unit 114 restores the prediction error of the current block by performing, on the transformation coefficients, an inverse transformation that corresponds to the transformation by the transformation unit 106. The inverse transformation unit 114 then outputs the restored prediction error to the addition unit 116.
[0045] Note that the restored prediction error does not match the prediction error calculated by the subtraction unit 104 because information is lost due to quantization. In other words, the restored prediction error includes quantization errors.
[0046] [Addition Unit] The addition unit 116 reconstructs the current block by adding the prediction error input from the inverse transformation unit 114 and the prediction sample input from the prediction control unit 128. The addition unit 116 then outputs the reconstructed block to the block memory 118 and the loop filter unit 120. The reconstructed block is sometimes called the local decoded block.
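The path from the subtraction unit to the addition unit can be sketched in a few lines. To keep the example short, the transformation step is omitted and a single flat quantization step is applied directly to the prediction error; both simplifications, as well as the function names and sample values, are assumptions made for this sketch.

```python
def encode_block(original, prediction, step):
    """Subtract the prediction, then quantize the prediction error."""
    residual = [o - p for o, p in zip(original, prediction)]
    return [round(r / step) for r in residual]


def reconstruct_block(levels, prediction, step):
    """Inverse-quantize the levels and add the prediction back."""
    restored = [l * step for l in levels]
    return [p + r for p, r in zip(prediction, restored)]


original = [53, 55, 61, 67]     # hypothetical original samples
prediction = [50, 50, 60, 60]   # hypothetical prediction samples

levels = encode_block(original, prediction, step=4)
recon = reconstruct_block(levels, prediction, step=4)
print(levels, recon)
```

The reconstructed samples differ from the original ones because of quantization, but a decoder that receives the same levels and forms the same prediction obtains exactly the same reconstruction, which is why the encoder keeps the local decoded block as its reference.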
[0047] [Block Memory] The block memory 118 is a storage unit for storing blocks within the picture to be encoded (hereinafter referred to as the current picture) that are referenced in intra prediction. Specifically, the block memory 118 stores the reconstructed blocks output from the addition unit 116.
[0048] [Loop Filter Unit] The loop filter unit 120 applies a loop filter to the block reconstructed by the addition unit 116 and outputs the filtered reconstructed block to the frame memory 122. A loop filter is a filter used within the encoding loop (in-loop filter), and includes, for example, a deblocking filter (DF), sample adaptive offset (SAO), and adaptive loop filter (ALF).
[0049] In ALF, a least-squares error filter is applied to remove coding distortion. For example, for each 2x2 subblock within the current block, one filter selected from several filters is applied based on the direction and activity of the local gradient.
[0050] Specifically, first, subblocks (e.g., 2x2 subblocks) are classified into multiple classes (e.g., 15 or 25 classes). The classification of subblocks is based on the direction and activity of the gradient. For example, a classification value C (e.g., C = 5D + A) is calculated using the gradient direction value D (e.g., 0-2 or 0-4) and the gradient activity value A (e.g., 0-4). Then, based on the classification value C, the subblocks are classified into multiple classes (e.g., 15 or 25 classes).
[0051] The gradient direction value D is derived, for example, by comparing gradients in multiple directions (e.g., horizontal, vertical, and two diagonal directions). The gradient activity value A is derived, for example, by adding the gradients in multiple directions and quantizing the sum.
[0052] Based on the results of this classification, a filter for the subblock is determined from among multiple filters.
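The classification value from the example above, C = 5D + A, can be sketched as a simple index computation. The function names and the use of a plain list as the filter table are assumptions for this example; how D and A are derived from the gradients is not modeled here.

```python
def alf_class(direction, activity):
    """Classification value C = 5D + A, with D in 0-4 and A in 0-4."""
    if not 0 <= direction <= 4:
        raise ValueError("gradient direction value D must be in 0-4")
    if not 0 <= activity <= 4:
        raise ValueError("gradient activity value A must be in 0-4")
    return 5 * direction + activity


def select_filter(filters, direction, activity):
    """Pick the filter for a subblock from a table indexed by class."""
    return filters[alf_class(direction, activity)]


# D in 0-2 yields 15 classes; D in 0-4 yields 25 classes.
classes_15 = {alf_class(d, a) for d in range(3) for a in range(5)}
classes_25 = {alf_class(d, a) for d in range(5) for a in range(5)}
print(len(classes_15), len(classes_25))
```

Because A takes five values, multiplying D by 5 guarantees that every (D, A) pair maps to a distinct class.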
[0053] For example, a circularly symmetric shape is used as the filter shape in ALF. Figures 4A to 4C show several examples of filter shapes used in ALF. Figure 4A shows a 5x5 diamond-shaped filter, Figure 4B shows a 7x7 diamond-shaped filter, and Figure 4C shows a 9x9 diamond-shaped filter. Information indicating the filter shape is signaled at the picture level. However, the signaling of information indicating the filter shape is not limited to the picture level and may be at other levels (e.g., sequence level, slice level, tile level, CTU level, or CU level).
[0054] The on / off status of ALF is determined, for example, at the picture level or CU level. For instance, the decision to apply ALF to luminance is made at the CU level, and the decision to apply ALF to color difference is made at the picture level. Information indicating whether ALF is on or off is signaled at the picture level or CU level. However, the signaling of information indicating whether ALF is on or off is not limited to the picture level or CU level, but may be at other levels (e.g., sequence level, slice level, tile level, or CTU level).
[0055] The coefficient sets of multiple selectable filters (e.g., up to 15 or 25 filters) are signaled at the picture level. However, the signaling of the coefficient sets is not limited to the picture level; it may be at other levels (e.g., sequence level, slice level, tile level, CTU level, CU level, or subblock level).
[0056] [Frame Memory] The frame memory 122 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 122 stores the reconstructed blocks filtered by the loop filter unit 120.
[0057] [Intra Prediction Unit] The intra-prediction unit 124 generates a prediction signal (intra-prediction signal) by performing intra-prediction (also called in-screen prediction) of the current block by referring to the block in the current picture stored in the block memory 118. Specifically, the intra-prediction unit 124 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, color difference values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 128.
[0058] For example, the intra-prediction unit 124 performs intra-prediction using one of a predetermined set of intra-prediction modes. The set of intra-prediction modes includes one or more non-directional prediction modes and multiple directional prediction modes.
[0059] One or more non-directional prediction modes include, for example, the Planar prediction mode and DC prediction mode as defined in the H.265 / HEVC (High-Efficiency Video Coding) standard (Non-Patent Document 1).
[0060] Multiple directional prediction modes include, for example, the 33 directional prediction modes defined in the H.265 / HEVC standard. Note that multiple directional prediction modes may also include 32 additional directional prediction modes (a total of 65 directional prediction modes). Figure 5A shows 67 intra-prediction modes (2 non-directional prediction modes and 65 directional prediction modes) in intra-prediction. Solid arrows represent the 33 directions defined in the H.265 / HEVC standard, and dashed arrows represent the additional 32 directions.
[0061] Furthermore, in the intra-prediction of a color difference block, a luminance block may be referenced. That is, the color difference component of the current block may be predicted based on the luminance component of the current block. Such intra-prediction is sometimes called CCLM (cross-component linear model) prediction. Such an intra-prediction mode for a color difference block that references a luminance block (e.g., called the CCLM mode) may be added as one of the intra-prediction modes for a color difference block.
[0062] The intra-prediction unit 124 may correct the pixel values after intra-prediction based on the gradient of the horizontal / vertical reference pixels. Intra-prediction with such correction is sometimes called PDPC (position dependent intra-prediction combination). Information indicating whether or not PDPC is applied (for example, called a PDPC flag) is signaled at, for example, the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (for example, sequence level, picture level, slice level, tile level, or CTU level).
[0063] [Inter Prediction Unit] The inter-prediction unit 126 generates a prediction signal (inter-prediction signal) by performing inter-prediction (also called inter-screen prediction) of the current block by referring to a reference picture stored in the frame memory 122 that is different from the current picture. Inter-prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 126 performs motion estimation within the reference picture for the current block or sub-block. Then, the inter-prediction unit 126 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) obtained from the motion estimation. Finally, the inter-prediction unit 126 outputs the generated inter-prediction signal to the prediction control unit 128.
[0064] The motion information used for motion compensation is signaled. A motion vector predictor may be used to signal the motion vector. In other words, the difference between the motion vector and the motion vector predictor may be signaled.
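Signaling a motion vector as a difference from a predictor can be sketched as follows. How the predictor is chosen is not described at this point in the text, so the predictor value below is simply a hypothetical input (for instance, a motion vector taken from a neighboring block), and the function names are illustrative.

```python
def encode_mv(mv, predictor):
    """Encoder side: signal only the difference from the predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])


def decode_mv(mvd, predictor):
    """Decoder side: add the signaled difference back to the same predictor."""
    return (mvd[0] + predictor[0], mvd[1] + predictor[1])


mv = (13, -7)          # motion vector found by motion estimation (hypothetical)
predictor = (12, -8)   # motion vector predictor (hypothetical)

mvd = encode_mv(mv, predictor)
print(mvd, decode_mv(mvd, predictor))
```

When the predictor is close to the actual motion vector, the difference is small and therefore cheaper to encode than the motion vector itself.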
[0065] Furthermore, an inter-prediction signal may be generated using not only the motion information of the current block obtained through motion search, but also the motion information of adjacent blocks. Specifically, an inter-prediction signal may be generated for each sub-block within the current block by weighted addition of a prediction signal based on motion information obtained through motion search and a prediction signal based on the motion information of adjacent blocks. Such inter-prediction (motion compensation) is sometimes called OBMC (overlapped block motion compensation).
[0066] In this OBMC mode, information indicating the size of the subblock for OBMC (e.g., called the OBMC block size) is signaled at the sequence level. Information indicating whether or not to apply OBMC mode (e.g., called the OBMC flag) is signaled at the CU level. Note that the signaling levels for this information are not limited to the sequence and CU levels; other levels (e.g., picture level, slice level, tile level, CTU level, or subblock level) may also be used.
[0067] The OBMC mode will now be described in more detail. Figures 5B and 5C are, respectively, a flowchart and a conceptual diagram illustrating an overview of the predicted image correction process using OBMC processing.
[0068] First, a predicted image (Pred) is obtained using normal motion compensation with the motion vector (MV) assigned to the block to be encoded.
[0069] Next, the motion vector (MV_L) of the encoded left adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_L), and the first correction of the predicted image is performed by superimposing the predicted image and Pred_L with weights.
[0070] Similarly, the motion vector (MV_U) of the encoded upper adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_U). The predicted image is then corrected a second time by weighting the first corrected predicted image and Pred_U, and this is used as the final predicted image.
[0071] While this explanation describes a two-stage correction method using the left adjacent block and the upper adjacent block, it is also possible to use the right adjacent block and the lower adjacent block to perform corrections more than two times.
[0072] Furthermore, the area to be superimposed need not be the entire pixel area of the block; it may be only a portion of the area near the block boundary.
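The two-stage correction described above can be sketched as follows. This is only an illustrative example: the weights, the number of boundary rows and columns that are blended, and the function names are assumptions and are not taken from this disclosure.

```python
def blend(a, b, w_b):
    """Weighted superposition of two sample values; w_b is the weight given to b."""
    return (1.0 - w_b) * a + w_b * b


def obmc(pred, pred_l, pred_u, weights=(0.25, 0.125)):
    """Two-stage OBMC correction near the left and upper block boundaries.

    weights[d] is the assumed weight of the neighbor-based prediction at
    distance d from the boundary; samples farther away are left unchanged.
    """
    h, w = len(pred), len(pred[0])
    out = [row[:] for row in pred]
    # First correction: blend Pred with Pred_L in the columns nearest the left boundary.
    for y in range(h):
        for d, wt in enumerate(weights):
            if d < w:
                out[y][d] = blend(out[y][d], pred_l[y][d], wt)
    # Second correction: blend the result with Pred_U in the rows nearest the upper boundary.
    for d, wt in enumerate(weights):
        if d < h:
            for x in range(w):
                out[d][x] = blend(out[d][x], pred_u[d][x], wt)
    return out


pred = [[100.0] * 4 for _ in range(4)]     # normal motion compensation (Pred)
pred_l = [[120.0] * 4 for _ in range(4)]   # prediction using MV_L (Pred_L)
pred_u = [[80.0] * 4 for _ in range(4)]    # prediction using MV_U (Pred_U)
final = obmc(pred, pred_l, pred_u)
```

In this sketch only the two columns and two rows closest to the boundaries are modified, which corresponds to superimposing only a portion of the area near the block boundary.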
[0073] Although this explanation describes the predictive image correction process using a single reference picture, the process is similar when correcting predictive images from multiple reference pictures. After obtaining corrected predictive images from each reference picture, the resulting predictive images are superimposed to create the final predictive image.
[0074] The processing target block may be a prediction block unit, or it may be a sub-block unit obtained by further dividing the prediction block.
[0075] One method for determining whether or not to apply OBMC processing is to use an obmc_flag signal, which indicates whether or not to apply OBMC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region with complex motion. If it belongs to a region with complex motion, the obmc_flag is set to a value of 1 and OBMC processing is applied to perform encoding. If it does not belong to a region with complex motion, the obmc_flag is set to a value of 0 and encoding is performed without applying OBMC processing. On the other hand, in a decoding device, the obmc_flag written in the stream is decoded, and the device switches whether or not to apply OBMC processing depending on its value and performs decoding.
[0076] Furthermore, motion information may be derived on the decoding device side without being signaled. For example, the merge mode specified in the H.265 / HEVC standard may be used. Alternatively, motion information may be derived by performing a motion search on the decoding device side. In this case, the motion search is performed without using the pixel values of the current block.
[0077] Here, we will explain the mode in which motion estimation is performed on the decoding device side. This mode is sometimes called PMMVD (pattern matched motion vector derivation) mode or FRUC (frame rate up-conversion) mode.
[0078] An example of FRUC processing is shown in Figure 5D. First, a list of multiple candidates, each having a predicted motion vector, is generated by referring to the motion vectors of encoded blocks that are spatially or temporally adjacent to the current block (this list may be the same as the merge list). Next, the best candidate MV is selected from among the multiple candidate MVs registered in the candidate list. For example, an evaluation value is calculated for each candidate included in the candidate list, and one candidate is selected based on the evaluation value.
[0079] Then, based on the motion vector of the selected candidate, a motion vector for the current block is derived. Specifically, for example, the motion vector of the selected candidate (best candidate MV) is used as is as the motion vector for the current block. Alternatively, for example, the motion vector for the current block may be derived by performing pattern matching in the area surrounding the position in the reference picture corresponding to the motion vector of the selected candidate. That is, the area surrounding the best candidate MV is searched in the same manner, and if an MV with a better evaluation value is found, the best candidate MV may be updated to this MV and used as the final MV for the current block. It is also possible to configure the system so that this search is not performed.
[0080] The same processing method can be used when processing at the subblock level.
[0081] The evaluation value is calculated by pattern matching, that is, by determining the difference between the reconstructed image of the region in the reference picture corresponding to the motion vector and the reconstructed image of a predetermined region. The evaluation value may also be calculated using information other than the difference value.
[0082] For pattern matching, either first pattern matching or second pattern matching is used. First pattern matching and second pattern matching are sometimes called bilateral matching and template matching, respectively.
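The candidate selection and the optional search around the best candidate MV can be sketched as below. The sketch assumes that a lower evaluation value is better and uses a toy cost function; the function names, the search range, and the cost function are hypothetical.

```python
def select_best_candidate(candidates, cost):
    """Return the candidate MV with the lowest evaluation value."""
    return min(candidates, key=cost)


def refine(mv, cost, search_range=1):
    """Search the area around mv and return a better MV if one is found."""
    best = mv
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand = (mv[0] + dx, mv[1] + dy)
            if cost(cand) < cost(best):
                best = cand
    return best


def cost(mv):
    """Toy evaluation value: distance to an assumed true motion of (5, -3)."""
    return abs(mv[0] - 5) + abs(mv[1] + 3)


candidates = [(0, 0), (4, -2), (8, 1)]          # hypothetical candidate list
best = select_best_candidate(candidates, cost)  # best candidate MV
final_mv = refine(best, cost)                   # final MV after the surrounding search
```

In an actual codec the evaluation value would come from pattern matching on reconstructed images rather than from a known target motion.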
[0083] In the first pattern matching, pattern matching is performed between two blocks in two different reference pictures that are aligned with the motion trajectory of the current block. Therefore, in the first pattern matching, a region in another reference picture aligned with the motion trajectory of the current block is used as a predetermined region for calculating the evaluation value of the candidate described above.
[0084] Figure 6 illustrates an example of pattern matching (bilateral matching) between two blocks along a motion trajectory. As shown in Figure 6, in the first pattern matching, two motion vectors (MV0, MV1) are derived by searching for the best-matching pair of two blocks within two different reference pictures (Ref0, Ref1) that are along the motion trajectory of the current block. Specifically, for the current block, the difference between the reconstructed image at a specified position in the first encoded reference picture (Ref0) specified by the candidate MV and the reconstructed image at a specified position in the second encoded reference picture (Ref1) specified by the symmetric MV obtained by scaling the candidate MV by the display time interval is derived, and an evaluation value is calculated using the obtained difference value. It is preferable to select the candidate MV with the best evaluation value among multiple candidate MVs as the final MV.
[0085] Under the assumption of a continuous motion trajectory, the motion vector (MV0, MV1) pointing to two reference blocks is proportional to the temporal distance (TD0, TD1) between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1). For example, if the current picture is temporally located between the two reference pictures and the temporal distances from the current picture to the two reference pictures are equal, then the first pattern matching derives a mirror-symmetric bidirectional motion vector.
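The proportionality between the motion vectors and the temporal distances can be illustrated with the following sketch. It assumes that the current picture lies temporally between Ref0 and Ref1 and that TD0 and TD1 are positive distances; the function name is hypothetical.

```python
from fractions import Fraction


def symmetric_mv(mv0, td0, td1):
    """Derive MV1 from MV0 under the continuous motion trajectory assumption.

    td0 and td1 are the positive temporal distances from the current picture
    to Ref0 and Ref1, assumed to lie on opposite sides of the current picture.
    """
    scale = Fraction(-td1, td0)
    return (mv0[0] * scale, mv0[1] * scale)


mv1_equal = symmetric_mv((8, -4), td0=1, td1=1)   # equal distances: mirror-symmetric MV
mv1_near = symmetric_mv((8, -4), td0=2, td1=1)    # Ref1 is half as far away as Ref0
```

With equal temporal distances the derived vector is the mirror image of MV0, as stated above.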
[0086] In the second pattern matching, pattern matching is performed between the template in the current picture (blocks adjacent to the current block in the current picture (e.g., blocks above and / or to the left)) and the blocks in the reference picture. Therefore, in the second pattern matching, the blocks adjacent to the current block in the current picture are used as a predetermined area for calculating the evaluation value of the candidates mentioned above.
[0087] Figure 7 illustrates an example of pattern matching (template matching) between a template in the current picture and a block in the reference picture. As shown in Figure 7, in the second pattern matching, the motion vector of the current block is derived by searching in the reference picture (Ref0) for the block that best matches the block adjacent to the current block (Cur block) in the current picture (Cur Pic). Specifically, for the current block, the difference is derived between the reconstructed image of the encoded region of both or either of the left adjacent and upper adjacent regions and the reconstructed image at the equivalent position in the encoded reference picture (Ref0) specified by the candidate MV. An evaluation value is calculated using the obtained difference value, and the candidate MV with the best evaluation value among multiple candidate MVs is selected as the best candidate MV.
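A minimal sketch of the evaluation value for template matching follows. It assumes that the difference is measured as a sum of absolute differences (SAD) over a template one sample thick above and to the left of the current block; the picture contents, function names, and template thickness are assumptions for illustration.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized 2-D sample arrays."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def region(pic, x, y, w, h):
    """Extract a w-by-h region whose top-left sample is at (x, y)."""
    return [row[x:x + w] for row in pic[y:y + h]]


def template_cost(cur, ref, bx, by, size, mv, t=1):
    """Evaluation value of candidate mv over the upper and left templates."""
    dx, dy = mv
    upper = sad(region(cur, bx, by - t, size, t),
                region(ref, bx + dx, by - t + dy, size, t))
    left = sad(region(cur, bx - t, by, t, size),
               region(ref, bx - t + dx, by + dy, t, size))
    return upper + left


def pix(x, y):
    """Synthetic picture content used only for this example."""
    return x * x + 10 * y


ref = [[pix(x, y) for x in range(12)] for y in range(12)]
# The current picture is the reference picture displaced by (2, 1).
cur = [[pix(x + 2, y + 1) for x in range(8)] for y in range(8)]
candidates = [(0, 0), (2, 1), (1, 2)]
best_mv = min(candidates, key=lambda mv: template_cost(cur, ref, 2, 2, 4, mv))
```

Only samples outside the current block are read from the current picture, so the same computation is available on the decoding side.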
[0088] Information indicating whether or not to apply such a FRUC mode (e.g., called the FRUC flag) is signaled at the CU level. Furthermore, if the FRUC mode is applied (e.g., the FRUC flag is true), information indicating the pattern matching method (first pattern matching or second pattern matching) (e.g., called the FRUC mode flag) is signaled at the CU level. Note that the signaling of this information is not limited to the CU level; it may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).
[0089] Here, we will explain the mode for deriving motion vectors based on a model that assumes uniform linear motion. This mode is sometimes called BIO (bi-directional optical flow) mode.
[0090] Figure 8 is a diagram illustrating a model assuming uniform linear motion. In Figure 8, (vx, vy) represents the velocity vector, and τ0 and τ1 represent the temporal distance between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1), respectively. (MVx0, MVy0) represents the motion vector corresponding to reference picture Ref0, and (MVx1, MVy1) represents the motion vector corresponding to reference picture Ref1.
[0091] Under the assumption of uniform linear motion of the velocity vector (vx, vy), (MVx0, MVy0) and (MVx1, MVy1) can be expressed as (vxτ0, vyτ0) and (-vxτ1, -vyτ1), respectively, and the following optical flow equation (1) holds.
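[0092]
$$\frac{\partial I^{(k)}}{\partial t} + v_x \frac{\partial I^{(k)}}{\partial x} + v_y \frac{\partial I^{(k)}}{\partial y} = 0 \qquad (1)$$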
[0093] Here, I(k) represents the luminance value of the reference image k (k=0,1) after motion compensation. This optical flow equation shows that the sum of (i) the time derivative of the luminance value, (ii) the product of the horizontal velocity and the horizontal component of the spatial gradient of the reference image, and (iii) the product of the vertical velocity and the vertical component of the spatial gradient of the reference image is equal to zero. Based on this optical flow equation and Hermite interpolation, block-level motion vectors obtained from merge lists, etc., are corrected on a pixel-by-pixel basis.
[0094] Furthermore, motion vectors may be derived on the decoding side using a method different from that used for deriving motion vectors based on a model that assumes uniform linear motion. For example, motion vectors may be derived on a sub-block basis based on the motion vectors of multiple adjacent blocks.
[0095] Here, we will describe a mode in which motion vectors are derived at the sub-block level based on the motion vectors of multiple adjacent blocks. This mode is sometimes called the affine motion compensation prediction mode.
[0096] Figure 9A illustrates the derivation of subblock-level motion vectors based on the motion vectors of multiple adjacent blocks. In Figure 9A, the current block contains 16 4x4 subblocks. Here, the motion vector v0 of the upper left corner control point of the current block is derived based on the motion vectors of the adjacent blocks, and the motion vector v1 of the upper right corner control point of the current block is derived based on the motion vectors of the adjacent subblocks. Then, using the two motion vectors v0 and v1, the motion vector (vx, vy) of each subblock within the current block is derived by equation (2) below.
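[0097]
$$\begin{cases} v_x = \dfrac{v_{1x} - v_{0x}}{w}\,x - \dfrac{v_{1y} - v_{0y}}{w}\,y + v_{0x} \\[2ex] v_y = \dfrac{v_{1y} - v_{0y}}{w}\,x + \dfrac{v_{1x} - v_{0x}}{w}\,y + v_{0y} \end{cases} \qquad (2)$$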
[0098] Here, x and y represent the horizontal and vertical positions of the subblock, respectively, and w represents a predetermined weighting coefficient.
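As an illustration of deriving sub-block motion vectors from the two control-point motion vectors v0 and v1, the following sketch uses the commonly cited four-parameter affine model. The exact form of the model, the choice of the block width as w, and the use of sub-block centers as positions are assumptions made for this example.

```python
def affine_subblock_mv(v0, v1, x, y, w):
    """Motion vector at position (x, y) from control-point MVs v0 and v1.

    v0 is the MV at the upper-left corner and v1 the MV at the upper-right
    corner; w is the assumed weighting coefficient (here the block width).
    """
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    vx = a * x - b * y + v0[0]
    vy = b * x + a * y + v0[1]
    return (vx, vy)


v0, v1, w = (2.0, 1.0), (4.0, 3.0), 16
# One MV per 4x4 sub-block of a 16x16 block, evaluated at the sub-block centers.
centers = (2, 6, 10, 14)
sub_mvs = [affine_subblock_mv(v0, v1, x, y, w) for y in centers for x in centers]
```

Under this model the vector at the upper-left corner equals v0 and the vector at the upper-right corner equals v1, and the 16 sub-blocks each receive an interpolated vector.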
[0099] Such affine motion compensation prediction modes may include several modes in which the motion vectors of the upper-left and upper-right corner control points are derived. Information indicating such affine motion compensation prediction modes (e.g., called affine flags) is signaled at the CU level. Note that the signaling of this information indicating affine motion compensation prediction modes is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).
[0100] [Prediction Control Unit] The prediction control unit 128 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the subtraction unit 104 and the addition unit 116.
[0101] Here, we will explain an example of deriving the motion vector of a picture to be encoded using merge mode. Figure 9B is a diagram illustrating the overview of the motion vector derivation process using merge mode.
[0102] First, a list of predicted MVs is generated, containing registered candidates for predicted MVs. Candidates for predicted MVs include spatially adjacent predicted MVs, which are the MVs of multiple encoded blocks located spatially around the block to be encoded; temporally adjacent predicted MVs, which are the MVs of nearby blocks projected onto the location of the block to be encoded in the encoded reference picture; combined predicted MVs, which are generated by combining the MV values of spatially adjacent predicted MVs and temporally adjacent predicted MVs; and zero predicted MVs, which are MVs with a value of zero.
[0103] Next, one predicted MV is selected from the multiple predicted MVs registered in the predicted MV list to determine it as the MV for the block to be encoded.
[0104] Furthermore, the variable-length coding unit encodes the merge_idx signal, which indicates which predicted MV was selected, by writing it to a stream.
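The construction of the predicted MV list and the selection by merge_idx can be sketched as follows. The duplicate removal, the list size, and in particular the way the combined predicted MV is formed (a simple average here) are assumptions for illustration and stand in for whatever combination rule an actual codec uses.

```python
def build_merge_list(spatial, temporal, max_candidates=5):
    """Register spatial, temporal, combined, and zero predicted MVs in order."""
    candidates = []

    def register(mv):
        # Assumed behavior: skip duplicates and stop when the list is full.
        if mv not in candidates and len(candidates) < max_candidates:
            candidates.append(mv)

    for mv in spatial:
        register(mv)
    for mv in temporal:
        register(mv)
    if spatial and temporal:
        s, t = spatial[0], temporal[0]
        # Assumed combination rule: average of one spatial and one temporal MV.
        register(((s[0] + t[0]) // 2, (s[1] + t[1]) // 2))
    register((0, 0))  # zero predicted MV
    return candidates


merge_list = build_merge_list(spatial=[(4, 0), (4, 0), (2, -2)], temporal=[(6, 2)])
merge_idx = 2                    # index that would be written to the stream
mv = merge_list[merge_idx]       # MV determined for the block to be encoded
```

Because the decoding side can build the same list from already decoded blocks, only merge_idx needs to be written to the stream.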
[0105] Note that the predicted MVs registered in the predicted MV list explained in Figure 9B are just an example, and the number of predicted MVs may differ from the number shown in the figure, the configuration may not include some of the types of predicted MVs shown in the figure, or it may include predicted MVs other than those shown in the figure.
[0106] Alternatively, the final MV may be determined by performing the DMVR process described later using the MV of the target block to be encoded derived by merge mode.
[0107] Here, we will explain an example of determining the MV using DMVR processing.
[0108] Figure 9C is a conceptual diagram illustrating the overview of DMVR processing.
[0109] First, the optimal MVP set for the block to be processed is used as a candidate MV. According to the candidate MV, reference pixels are obtained from the first reference picture, which is a processed picture in the L0 direction, and the second reference picture, which is a processed picture in the L1 direction, and a template is generated by taking the average of each reference pixel.
[0110] Next, using the template, the surrounding regions of candidate MVs for the first and second reference pictures are searched, and the MV with the lowest cost is determined as the final MV. The cost value is calculated using the difference between each pixel value of the template and each pixel value of the search region, as well as the MV value, etc.
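The template generation and the surrounding search can be sketched as below. The sketch uses integer-pel MVs, a search range of one sample, and a cost consisting only of the SAD against the template (the MV-dependent cost term mentioned above is omitted); these choices and the function names are assumptions.

```python
def block(pic, x, y, size):
    """Extract a size-by-size block whose top-left sample is at (x, y)."""
    return [row[x:x + size] for row in pic[y:y + size]]


def sad(a, b):
    """Sum of absolute differences between two equally sized 2-D arrays."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def dmvr(ref0, ref1, pos, mv0, mv1, size, search=1):
    """Refine (mv0, mv1) by searching around each against an averaged template."""
    x, y = pos
    b0 = block(ref0, x + mv0[0], y + mv0[1], size)
    b1 = block(ref1, x + mv1[0], y + mv1[1], size)
    # Template: average of the reference pixels from the L0 and L1 directions.
    template = [[(p + q) / 2 for p, q in zip(r0, r1)] for r0, r1 in zip(b0, b1)]

    def search_around(ref, mv):
        cands = [(mv[0] + dx, mv[1] + dy)
                 for dy in range(-search, search + 1)
                 for dx in range(-search, search + 1)]
        return min(cands,
                   key=lambda m: sad(template, block(ref, x + m[0], y + m[1], size)))

    return search_around(ref0, mv0), search_around(ref1, mv1)


pic = [[i * i + 10 * j for i in range(10)] for j in range(10)]
refined = dmvr(pic, pic, pos=(3, 3), mv0=(1, 1), mv1=(1, 1), size=4)
```

In this toy case both reference pictures are identical, so the candidate MVs already match the template exactly and are kept unchanged; with differing reference pictures the search may move each MV by up to the search range.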
[0111] Note that the general outline of the processing described here is basically the same for both the encoding and decoding devices.
[0112] Note that any process that can explore the vicinity of a candidate MV and derive the final MV may be used instead of the exact process described here.
[0113] Here, we will explain the mode for generating predictive images using LIC processing.
[0114] Figure 9D is a diagram illustrating the outline of a predictive image generation method using brightness correction processing by LIC processing.
[0115] First, a motion vector (MV) is derived in order to obtain, from a reference picture that is an encoded picture, the reference image corresponding to the block to be encoded.
[0116] Next, for the block to be encoded, information indicating how the luminance values have changed between the reference picture and the picture to be encoded is extracted using the luminance pixel values of the left-adjacent and top-adjacent encoded surrounding reference regions, and the luminance pixel values at the equivalent positions in the reference picture specified by MV, and a luminance correction parameter is calculated.
[0117] By performing brightness correction processing on the reference image within the reference picture specified by the MV using the brightness correction parameter, a predicted image for the block to be encoded is generated.
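One way to realize the luminance correction parameter is a linear model fitted by least squares over the surrounding reference regions. This disclosure does not specify the model in this passage, so the linear form cur ≈ a * ref + b, the function names, and the sample values below are assumptions.

```python
def lic_params(cur_neighbors, ref_neighbors):
    """Least-squares fit of cur = a * ref + b over the surrounding reference region."""
    n = len(cur_neighbors)
    sx = sum(ref_neighbors)
    sy = sum(cur_neighbors)
    sxx = sum(r * r for r in ref_neighbors)
    sxy = sum(r * c for r, c in zip(ref_neighbors, cur_neighbors))
    denom = n * sxx - sx * sx
    if denom == 0:
        # Flat neighborhood: fall back to a pure offset.
        return 1.0, (sy - sx) / n
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b


def apply_lic(ref_block, a, b):
    """Apply the brightness correction to the reference image samples."""
    return [a * p + b for p in ref_block]


ref_neighbors = [100, 110, 120, 130]   # reference picture, positions specified by the MV
cur_neighbors = [180, 200, 220, 240]   # encoded left and upper neighbors of the current block
a, b = lic_params(cur_neighbors, ref_neighbors)
predicted = apply_lic([105, 125], a, b)
```

Since the parameters are computed only from already encoded samples, the decoding side can derive the same values without additional signaling.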
[0118] Note that the shape of the surrounding reference region in Figure 9D is just one example, and other shapes may be used.
[0119] Furthermore, while this explanation describes the process of generating a predicted image from a single reference picture, the process is similar when generating predicted images from multiple reference pictures. Brightness correction processing is performed in the same manner on the reference image obtained from each reference picture before the predicted image is generated.
[0120] One method for determining whether or not to apply LIC processing is to use a signal called lic_flag, which indicates whether or not to apply LIC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region where brightness changes are occurring. If it belongs to a region where brightness changes are occurring, the value of lic_flag is set to 1 and LIC processing is applied and encoding is performed. If it does not belong to a region where brightness changes are occurring, the value of lic_flag is set to 0 and encoding is performed without applying LIC processing. On the other hand, in a decoding device, the lic_flag written in the stream is decoded, and the device switches whether or not to apply LIC processing according to its value and performs decoding.
[0121] Another way to determine whether to apply LIC processing is, for example, by checking whether LIC processing has been applied to surrounding blocks. A specific example is that if the block to be encoded is in merge mode, during the MV derivation in merge mode processing, it is determined whether the surrounding encoded blocks selected were encoded with LIC processing. Based on this result, the application of LIC processing is switched, and encoding is performed accordingly. In this example, the decoding process is exactly the same.
[0122] [Overview of the Decoding Device] Next, an overview of a decoding device capable of decoding the encoded signal (encoded bitstream) output from the above-mentioned encoding device 100 will be described. Figure 10 is a block diagram showing the functional configuration of the decoding device 200 according to Embodiment 1. The decoding device 200 is a video / image decoding device that decodes video / images in block units.
[0123] As shown in Figure 10, the decoding device 200 includes an entropy decoding unit 202, an inverse quantization unit 204, an inverse transform unit 206, an adder unit 208, a block memory 210, a loop filter unit 212, a frame memory 214, an intra prediction unit 216, an inter prediction unit 218, and a prediction control unit 220.
[0124] The decoding device 200 can be implemented, for example, by a general-purpose processor and memory. In this case, when the software program stored in memory is executed by the processor, the processor functions as the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the adder unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220. Alternatively, the decoding device 200 may be implemented as one or more dedicated electronic circuits corresponding to the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the adder unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220.
[0125] The following describes each component included in the decoding device 200.
[0126] [Entropy Decoding Unit] The entropy decoding unit 202 entropy-decodes the encoded bitstream. Specifically, the entropy decoding unit 202 arithmetically decodes the encoded bitstream into a binary signal, for example. Then, the entropy decoding unit 202 debinarizes the binary signal. As a result, the entropy decoding unit 202 outputs the quantization coefficients in block units to the inverse quantization unit 204.
[0127] [Inverse Quantization Unit] The inverse quantization unit 204 inversely quantizes the quantization coefficients of the block to be decoded (hereinafter referred to as the current block), which are input from the entropy decoding unit 202. Specifically, for each quantization coefficient of the current block, the inverse quantization unit 204 inversely quantizes the quantization coefficient based on the quantization parameter corresponding to that quantization coefficient. The inverse quantization unit 204 then outputs the inversely quantized quantization coefficients (i.e., transform coefficients) of the current block to the inverse transform unit 206.
[0128] [Inverse Transform Unit] The inverse transform unit 206 restores the prediction error by inversely transforming the transform coefficients, which are input from the inverse quantization unit 204.
[0129] For example, if the information decoded from the encoded bitstream indicates that EMT or AMT should be applied (e.g., the AMT flag is true), the inverse transform unit 206 inversely transforms the transform coefficients of the current block based on the decoded information indicating the transform type.
[0130] For example, if the information decoded from the encoded bitstream indicates that NSST should be applied, the inverse transform unit 206 applies an inverse secondary transform to the transform coefficients.
[0131] [Adder Unit] The adder unit 208 reconstructs the current block by adding the prediction error, which is input from the inverse transform unit 206, and the prediction sample, which is input from the prediction control unit 220. The adder unit 208 then outputs the reconstructed block to the block memory 210 and the loop filter unit 212.
[0132] [Block memory] The block memory 210 is a storage unit for storing blocks that are referenced in intra prediction and are located within the picture to be decoded (hereinafter referred to as the current picture). Specifically, the block memory 210 stores the reconstructed blocks output from the adder unit 208.
[0133] [Loop Filter Unit] The loop filter unit 212 applies a loop filter to the block reconstructed by the adder unit 208 and outputs the filtered reconstructed block to the frame memory 214 and the display device, etc.
[0134] If the information indicating ALF on / off decoded from the encoded bitstream indicates that ALF is on, one filter is selected from among several filters based on the direction and activity of the local gradient, and the selected filter is applied to the reconstructed block.
[0135] [Frame memory] The frame memory 214 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 214 stores the reconstructed blocks filtered by the loop filter unit 212.
[0136] [Intra Prediction Unit] The intra-prediction unit 216 generates a prediction signal (intra-prediction signal) by performing intra-prediction based on the intra-prediction mode decoded from the encoded bitstream, and by referring to the blocks in the current picture stored in the block memory 210. Specifically, the intra-prediction unit 216 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, chrominance values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 220.
[0137] Furthermore, if an intra-prediction mode that references a luminance block is selected in the intra-prediction of a chrominance block, the intra-prediction unit 216 may predict the chrominance component of the current block based on the luminance component of the current block.
[0138] Furthermore, if the information decoded from the encoded bitstream indicates the application of PDPC, the intra-prediction unit 216 corrects the pixel value after intra-prediction based on the gradient of the reference pixels in the horizontal / vertical directions.
[0139] [Inter Prediction Unit] The inter-prediction unit 218 predicts the current block by referring to a reference picture stored in the frame memory 214. Prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 218 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) decoded from the encoded bitstream, and outputs the inter-prediction signal to the prediction control unit 220.
[0140] Furthermore, if the information decoded from the encoded bitstream indicates that OBMC mode should be applied, the inter-prediction unit 218 generates an inter-prediction signal using not only the motion information of the current block obtained by motion search, but also the motion information of adjacent blocks.
[0141] Furthermore, if the information decoded from the encoded bitstream indicates that FRUC mode should be applied, the inter-prediction unit 218 derives motion information by performing a motion search according to the pattern matching method (bilateral matching or template matching) decoded from the encoded bitstream. Then, the inter-prediction unit 218 performs motion compensation using the derived motion information.
[0142] Furthermore, when the BIO mode is applied, the inter-prediction unit 218 derives motion vectors based on a model that assumes uniform linear motion. Also, if the information decoded from the encoded bitstream indicates that the affine motion compensation prediction mode should be applied, the inter-prediction unit 218 derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks.
[0143] [Prediction Control Unit] The prediction control unit 220 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the adder unit 208.
[0144] (Embodiment 2) Next, Embodiment 2 will be described. This embodiment relates to inter prediction in so-called BIO mode. This embodiment differs from Embodiment 1 in that the block-level motion vector is corrected at the sub-block level, rather than at the pixel level. The following description of this embodiment will focus on the differences from Embodiment 1.
[0145] Since the configuration of the encoding device and decoding device according to this embodiment is substantially the same as that of Embodiment 1, the illustrations and descriptions are omitted.
[0146] [Inter Prediction] Figure 11 is a flowchart of inter prediction in Embodiment 2. Figure 12 is a conceptual diagram illustrating inter prediction in Embodiment 2. The following processes are performed by the inter-prediction unit 126 of the encoding device 100 or the inter-prediction unit 218 of the decoding device 200.
[0147] As shown in Figure 11, first, a block-by-block loop process is performed on multiple blocks within the picture to be encoded / decoded (current picture 1000) (S101~S111). In Figure 12, the block to be encoded / decoded is selected from among the multiple blocks as the current block 1001.
[0148] In the block-by-block loop processing, loop processing is performed on a per-reference-picture basis for the first reference picture 1100 (L0) and the second reference picture 1200 (L1), which are processed pictures (S102~S106).
[0149] In the loop processing on a reference picture basis, first, block-level motion vectors are derived or obtained to acquire a predicted image from the reference picture (S103). In Figure 12, the first motion vector 1110 (MV_L0) is derived or obtained for the first reference picture 1100, and the second motion vector 1210 (MV_L1) is derived or obtained for the second reference picture 1200. Methods for deriving motion vectors include the normal inter-prediction mode, merge mode, and FRUC mode. For example, in the normal inter-prediction mode, the encoding device 100 derives motion vectors by motion search, and the decoding device 200 obtains motion vectors from the bitstream.
[0150] Next, a predicted image is obtained from the reference picture by performing motion compensation using the derived or acquired motion vectors (S104). In Figure 12, the first predicted image 1140 is obtained from the first reference picture 1100 by performing motion compensation using the first motion vector 1110. Also, the second predicted image 1240 is obtained from the second reference picture 1200 by performing motion compensation using the second motion vector 1210.
[0151] In motion compensation, a motion compensation filter is applied to the reference picture. A motion compensation filter is an interpolation filter used to obtain a predicted image with fractional pixel precision. In the first reference picture 1100 in Figure 12, the motion compensation filter applied to the first prediction block 1120, specified by the first motion vector 1110, references pixels in the first interpolation reference range 1130, which includes the pixels of the first prediction block 1120 and its surrounding pixels. Similarly, in the second reference picture 1200, the motion compensation filter applied to the second prediction block 1220, specified by the second motion vector 1210, references pixels in the second interpolation reference range 1230, which includes the pixels of the second prediction block 1220 and its surrounding pixels.
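The extent of an interpolation reference range can be illustrated with the sketch below, which assumes a separable 8-tap filter as mentioned in the summary of this disclosure. The split of three samples before and four samples after the block, as well as the function name, are assumptions borrowed from common practice rather than from this passage.

```python
def interpolation_reference_range(x, y, width, height, taps=8):
    """Return (left, top, right, bottom), inclusive, of the integer samples read.

    (x, y) is the integer position of the top-left sample of the prediction
    block specified by the motion vector.
    """
    before = taps // 2 - 1   # samples needed before the first block sample
    after = taps // 2        # samples needed after the last block sample
    return (x - before, y - before,
            x + width - 1 + after, y + height - 1 + after)


left, top, right, bottom = interpolation_reference_range(16, 16, 8, 8)
range_width = right - left + 1
range_height = bottom - top + 1
```

For an 8x8 prediction block this gives a 15x15 reference range, that is, the block plus seven surrounding samples in each dimension.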
[0152] The first interpolation reference range 1130 and the second interpolation reference range 1230 are included in the first and second normal reference ranges referenced for motion compensation of the current block 1001 in normal inter prediction where processing using local motion estimates is not performed. The first normal reference range is included in the first reference picture 1100, and the second normal reference range is included in the second reference picture 1200. In normal inter prediction, for example, motion vectors are derived in block units by motion search, motion compensation is performed in block units using the derived motion vectors, and the motion-compensated image is adopted as the final prediction image. In other words, local motion estimates are not used in normal inter prediction. The first interpolation reference range 1130 and the second interpolation reference range 1230 may coincide with the first and second normal reference ranges.
[0153] Next, a gradient image corresponding to the predicted image is obtained from the reference picture (S105). Each pixel in the gradient image has a gradient value that indicates the spatial gradient of luminance or chrominance. The gradient value is obtained by applying a gradient filter to the reference picture. In the first reference picture 1100 in Figure 12, the gradient filter for the first prediction block 1120 references pixels in the first gradient reference range 1135, which includes the pixels of the first prediction block 1120 and its surrounding pixels. This first gradient reference range 1135 is included in the first interpolation reference range 1130. Similarly, in the second reference picture 1200, the gradient filter references pixels in the second gradient reference range 1235, which includes the pixels of the second prediction block 1220 and its surrounding pixels. This second gradient reference range 1235 is included in the second interpolation reference range 1230.
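A gradient image can be sketched as follows using a simple central-difference filter. The actual gradient filter is not specified in this passage, so the filter choice and the function name are assumptions; the point of the example is that such a filter only reads the block and a small margin of surrounding pixels.

```python
def gradients(pic, x, y, w, h):
    """Horizontal and vertical central-difference gradients for a w-by-h block.

    Reads one sample beyond the block on each side, so (x, y) must be at
    least one sample away from the picture boundary.
    """
    gx = [[(pic[j][i + 1] - pic[j][i - 1]) / 2 for i in range(x, x + w)]
          for j in range(y, y + h)]
    gy = [[(pic[j + 1][i] - pic[j - 1][i]) / 2 for i in range(x, x + w)]
          for j in range(y, y + h)]
    return gx, gy


# Synthetic reference picture with a constant slope of 3 horizontally and 5 vertically.
pic = [[3 * i + 5 * j for i in range(8)] for j in range(8)]
gx, gy = gradients(pic, 2, 2, 4, 4)
```

With this filter the gradient reference range extends one sample around the prediction block, which fits inside the margin already required by a longer interpolation filter.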
[0154] Once the acquisition of the predicted image and gradient image from the first and second reference pictures is complete, the loop processing for each reference picture is terminated (S106). Subsequently, loop processing is performed for each subblock, which is a further division of the block (S107-S110). Each of the subblocks has a size less than or equal to the current block (for example, a 4x4 pixel size).
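The division of the current block into subblocks can be sketched as follows. This is a minimal Python illustration, assuming a rectangular block whose sides are multiples of the subblock size; the function name subblock_origins and the raster visiting order are hypothetical and are not taken from the flowchart.

```python
def subblock_origins(block_w, block_h, sub_size=4):
    """Return the top-left (x, y) of each subblock, in raster order."""
    return [(x, y)
            for y in range(0, block_h, sub_size)
            for x in range(0, block_w, sub_size)]


# An 8x8 current block divided into 4x4 subblocks yields four subblocks.
print(subblock_origins(8, 8))  # [(0, 0), (4, 0), (0, 4), (4, 4)]
```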
[0155] In the subblock-level loop processing, first, the local motion estimate 1300 of the subblock is derived using the first prediction image 1140 and the second prediction image 1240, and the first gradient image 1150 and the second gradient image 1250 obtained from the first reference picture 1100 and the second reference picture 1200 (S108). For example, in each of the first prediction image 1140 and the second prediction image 1240, and the first gradient image 1150 and the second gradient image 1250, a single local motion estimate 1300 is derived for the subblock by referring to the pixels contained in the prediction subblock. The prediction subblock is the region within the first prediction block 1120 and the second prediction block 1220 corresponding to the subblock in the current block 1001. The local motion estimate is sometimes called a corrected motion vector.
[0156] Next, the final predicted image 1400 of the subblock is generated using the pixel values of the first predicted image 1140 and the second predicted image 1240, the gradient values of the first gradient image 1150 and the second gradient image 1250, and the local motion estimate 1300 (S109). Once the generation of the final predicted image is complete for each subblock included in the current block, the final predicted image of the current block is generated, and the loop processing for each subblock is completed (S110).
[0157] Furthermore, once the loop processing for each block is completed (S111), the process shown in Figure 11 is terminated.
[0158] Furthermore, by directly assigning the block-level motion vector of the current block to each sub-block, it is also possible to acquire predicted images and gradient images on a sub-block basis.
[0159] [Reference range for motion compensation filters and gradient filters] Here, we will explain the reference range for motion compensation filters and gradient filters.
[0160] Figure 13 is a conceptual diagram illustrating an example of the reference range of the motion compensation filter and gradient filter in Embodiment 2.
[0161] In Figure 13, each of the multiple circles represents a pixel. Also, in Figure 13, the current block size is 8x8 pixels and the sub-block size is 4x4 pixels as an example.
[0162] Reference range 1131 indicates the reference range (for example, an 8x8 pixel rectangular range) of the motion compensation filter applied to the top-left pixel 1122 of the first prediction block 1120. Reference range 1231 indicates the reference range (for example, an 8x8 pixel rectangular range) of the motion compensation filter applied to the top-left pixel 1222 of the second prediction block 1220.
[0163] Furthermore, reference range 1132 indicates the reference range (for example, a 6x6 pixel rectangular area) of the gradient filter applied to the upper-left pixel 1122 of the first prediction block 1120. Reference range 1232 indicates the reference range (for example, a 6x6 pixel rectangular area) of the gradient filter applied to the upper-left pixel 1222 of the second prediction block 1220.
[0164] For other pixels within the first prediction block 1120 and the second prediction block 1220, motion compensation filters and gradient filters are applied while referencing pixels of the same size within reference ranges corresponding to the positions of each pixel. As a result, pixels in the first interpolation reference range 1130 and the second interpolation reference range 1230 are referenced to obtain the first prediction image 1140 and the second prediction image 1240. In addition, pixels in the first gradient reference range 1135 and the second gradient reference range 1235 are referenced to obtain the first gradient image 1150 and the second gradient image 1250.
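The relationship between the block size, the per-pixel filter range, and the resulting reference range can be checked with a short calculation. The following Python sketch assumes a square block and a square per-pixel filter range, as in the example of Figure 13; the function name reference_range_size is hypothetical. A size comparison alone is only a necessary condition for one range to be included in another, since inclusion also depends on how the ranges are positioned.

```python
def reference_range_size(block_size, filter_extent):
    """Side length of the region read when a filter spanning `filter_extent`
    pixels is applied at every pixel of a square block."""
    return block_size + filter_extent - 1


interp = reference_range_size(8, 8)  # 8x8 block, 8x8 motion compensation filter range
grad = reference_range_size(8, 6)    # 8x8 block, 6x6 gradient filter range
print(interp, grad)  # 15 13
```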
[0165] [Effects, etc.] As described above, the encoding and decoding devices according to this embodiment can derive local motion estimates on a subblock basis. Therefore, while reducing prediction errors by using local motion estimates on a subblock basis, it is possible to reduce the processing load or processing time compared to when local motion estimates are derived on a pixel basis.
[0166] Furthermore, according to the encoding and decoding devices of this embodiment, the interpolation reference range can be included in the normal reference range. Therefore, in generating the final predicted image using local motion estimates at the subblock level, it is not necessary to load new pixel data from the frame memory for motion compensation, thereby suppressing increases in memory capacity and memory bandwidth.
[0167] Furthermore, according to the encoding and decoding devices of this embodiment, the gradient reference range can be included in the interpolation reference range. Therefore, it is not necessary to load new pixel data from the frame memory to acquire the gradient image, and an increase in memory capacity and memory bandwidth can be suppressed.
[0168] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.
[0169] (Modification 1 of Embodiment 2) Next, we will specifically describe modified versions of the motion compensation filter and gradient filter with reference to the drawings. In Modification 1 below, the processing for the second predicted image is similar to the processing for the first predicted image, so the explanation will be omitted or simplified as appropriate.
[0170] [Motion compensation filter] First, let's explain the motion compensation filter. Figure 14 is a conceptual diagram illustrating an example of the reference range of the motion compensation filter in Modification 1 of Embodiment 2.
[0171] Here, as an example, we will explain the case in which a motion compensation filter with 1/4-pixel precision horizontally and 1/2-pixel precision vertically is applied to the first prediction block 1120. The motion compensation filter is a so-called 8-tap filter and is represented by the following equation (3).
[0172]
$$I^k[x,y] = \sum_{j=-3}^{4} w_{0.5}[j] \sum_{i=-3}^{4} w_{0.25}[i]\, I_0^k[x+i,\,y+j] \tag{3}$$
[0173] Here, I^k[x,y] represents the pixel values of the first predicted image with fractional pixel precision when k is 0, and the pixel values of the second predicted image with fractional pixel precision when k is 1. A pixel value is the value that a pixel has, such as a luminance value or a chrominance value in a predicted image. w_{0.25} and w_{0.5} represent the weighting coefficients for 1/4-pixel precision and 1/2-pixel precision, respectively. I_0^k[x,y] represents the pixel values of the first predicted image with integer pixel precision when k is 0, and the pixel values of the second predicted image with integer pixel precision when k is 1.
[0174] For example, when the motion compensation filter of equation (3) is applied to the upper left pixel 1122 in Figure 14, the values of pixels arranged horizontally within the reference range 1131A are weighted and added together in each row, and the sum of the results of multiple rows is further weighted and added together.
[0175] Thus, in this modified example, the motion compensation filter for the top-left pixel 1122 refers to the pixels in reference range 1131A. The reference range 1131A is a rectangular area extending 3 pixels to the left, 4 pixels to the right, 3 pixels above, and 4 pixels below the top-left pixel 1122.
[0176] Such motion compensation filters are applied to all pixels in the first prediction block 1120. Therefore, the motion compensation filter for the first prediction block 1120 references pixels in the first interpolation reference range 1130A.
[0177] The motion compensation filter is applied to the second prediction block 1220 in the same way as to the first prediction block 1120. That is, for the top-left pixel 1222, pixels in reference range 1231A are referenced, and for the entire second prediction block 1220, pixels in the second interpolation reference range 1230A are referenced.
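As a minimal sketch of the 8-tap filter of equation (3), the following Python code weights and adds the pixels of each row horizontally with the 1/4-pixel weights, and then weights and adds the row results vertically with the 1/2-pixel weights. The coefficient values are an assumption: this description does not state them, so the HEVC luma interpolation taps are borrowed purely for illustration. The function name interpolate_pixel and the list-of-rows image layout are also hypothetical.

```python
# Assumed weights (not given in this description): HEVC luma interpolation taps.
# Index 0 corresponds to offset -3 and index 7 to offset +4; each set sums to 64.
W_QUARTER = [-1, 4, -10, 58, 17, -5, 1, 0]   # 1/4-pixel weights, horizontal
W_HALF = [-1, 4, -11, 40, 40, -11, 4, -1]    # 1/2-pixel weights, vertical


def interpolate_pixel(ref, x, y):
    """Fractional-precision value at (x, y) from integer-precision pixels ref[row][col]."""
    total = 0
    for j in range(-3, 5):  # 3 rows above to 4 rows below
        row_sum = sum(W_QUARTER[i + 3] * ref[y + j][x + i] for i in range(-3, 5))
        total += W_HALF[j + 3] * row_sum
    return total / (64 * 64)  # normalize by the product of the two weight sums


flat = [[100] * 16 for _ in range(16)]
print(interpolate_pixel(flat, 4, 4))  # a flat region stays at 100.0
```

The loop bounds reproduce the reference range described above: 3 pixels to the left, 4 to the right, 3 above, and 4 below the target pixel.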
[0178] [Gradient filter] Next, we will explain the gradient filter. Figure 15 is a conceptual diagram illustrating an example of the reference range of the gradient filter in Modification 1 of Embodiment 2.
[0179] The gradient filter in this modified example is a so-called 5-tap filter, and is represented by the following equations (4) and (5).
[0180]
$$I_x^k[x,y] = \sum_{i=-2}^{2} w[i]\, I_0^k[x+i,\,y] \tag{4}$$
[0181]
$$I_y^k[x,y] = \sum_{j=-2}^{2} w[j]\, I_0^k[x,\,y+j] \tag{5}$$
[0182] Here, I_x^k[x,y] represents the horizontal gradient value of each pixel in the first gradient image when k is 0, and the horizontal gradient value of each pixel in the second gradient image when k is 1. I_y^k[x,y] represents the vertical gradient value of each pixel in the first gradient image when k is 0, and the vertical gradient value of each pixel in the second gradient image when k is 1. w represents the weighting coefficient.
[0183] For example, when the gradient filters of equations (4) and (5) are applied to the top-left pixel 1122 in Figure 15, the horizontal gradient value is calculated by weighting and adding the pixel values of five pixels arranged horizontally, including the top-left pixel 1122, which are integer-precision predicted image pixel values. Similarly, the vertical gradient value is calculated by weighting and adding the pixel values of five pixels arranged vertically, including the top-left pixel 1122, which are integer-precision predicted image pixel values. In this case, the weight coefficients have values where the sign is reversed for pixels above and below or to the left and right of the top-left pixel 1122, with the top-left pixel 1122 being the point of symmetry.
[0184] Thus, in this modified example, the gradient filter for the top-left pixel 1122 refers to the pixels of the reference range 1132A. The reference range 1132A has a cross shape extending two pixels in the up, down, left, and right directions from the top-left pixel 1122.
[0185] Such a gradient filter is applied to all pixels in the first prediction block 1120. Therefore, the gradient filter for the first prediction block 1120 references pixels in the first gradient reference range 1135A.
[0186] The gradient filter is applied to the second prediction block 1220 in the same way as to the first prediction block 1120. That is, for the top-left pixel 1222, pixels in reference range 1232A are referenced, and for the entire second prediction block 1220, pixels in the second gradient reference range 1235A are referenced.
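The 5-tap gradient filter can be sketched in Python as follows. The weight values are an assumption, since this description does not state them: a common 5-tap central-difference stencil is used, whose signs are reversed about the center tap as the text requires. The function name gradients and the list-of-rows image layout are hypothetical.

```python
# Assumed weights (not given in this description): a 5-tap central-difference
# stencil. Signs are reversed about the center tap, as the text requires.
W_GRAD = [1, -8, 0, 8, -1]   # offsets -2..+2
W_GRAD_SCALE = 12


def gradients(img, x, y):
    """Return (horizontal, vertical) gradient at (x, y) of img[row][col]."""
    gx = sum(W_GRAD[i + 2] * img[y][x + i] for i in range(-2, 3)) / W_GRAD_SCALE
    gy = sum(W_GRAD[j + 2] * img[y + j][x] for j in range(-2, 3)) / W_GRAD_SCALE
    return gx, gy


# Test plane with slope 3 per pixel horizontally and 2 per pixel vertically.
plane = [[3 * c + 2 * r for c in range(8)] for r in range(8)]
print(gradients(plane, 3, 3))  # (3.0, 2.0)
```

Only the five pixels in the same row and the five pixels in the same column are read, which corresponds to the cross-shaped reference range described above.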
[0187] If the motion vector specifying the reference range indicates a fractional pixel position, the pixel values in the reference ranges 1132A and 1232A of the gradient filter may be converted to pixel values with fractional pixel precision, and the gradient filter may be applied to the converted pixel values. Alternatively, a gradient filter whose coefficient values are obtained by convolving the coefficient values for conversion to fractional pixel precision with the coefficient values for deriving the gradient value may be applied to pixel values with integer pixel precision. In this case, the gradient filter differs for each fractional pixel position.
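The second alternative, in which the two sets of coefficients are convolved into a single filter, can be sketched as follows in Python. Both coefficient sets are assumptions used only for illustration (HEVC 1/2-pixel luma taps and a 5-tap central-difference stencil), and the function name convolve is hypothetical.

```python
def convolve(a, b):
    """Full 1-D convolution of two coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, av in enumerate(a):
        for j, bv in enumerate(b):
            out[i + j] += av * bv
    return out


W_HALF = [-1, 4, -11, 40, 40, -11, 4, -1]  # assumed 1/2-pixel interpolation taps
W_GRAD = [1, -8, 0, 8, -1]                 # assumed gradient taps

combined = convolve(W_HALF, W_GRAD)
print(len(combined))  # 12 taps
```

With these assumed coefficients the combined filter is 12 taps long, so it reads a wider range than either filter alone. A different interpolation coefficient set for another fractional position gives a different combined filter, which is why the gradient filter differs for each fractional pixel position.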
[0188] [Derivation of local motion estimates at the subblock level] Next, we will explain how to derive local motion estimates at the subblock level. Specifically, we will explain, as an example, the derivation of the local motion estimate for the top-left subblock among the multiple subblocks contained in the current block.
[0189] In this modified example, the estimated horizontal local motion u and vertical local motion v of the subblock are derived based on the following equation (6).
[0190]
(Equation (6) is not reproduced in this text.)
[0191] Here, sGxGy, sGx2, sGy2, sGxdI, and sGydI are values calculated on a subblock basis, and are calculated based on the following formula (7).
[0192]
(Equation (7) is not reproduced in this text.)
[0193] Here, Ω is the set of coordinates of all pixels contained in the prediction subblock, which is the region corresponding to a subblock within the prediction block. G_x[i,j] represents the sum of the horizontal gradient values of the first gradient image and the second gradient image, and G_y[i,j] represents the sum of the vertical gradient values of the first gradient image and the second gradient image. ΔI[i,j] represents the difference between the first and second prediction images. w[i,j] represents a weighting coefficient that depends on the pixel position within the prediction subblock. For example, the same weighting coefficient may be used for all pixels within the prediction subblock.
[0194] Specifically, G_x[i,j], G_y[i,j], and ΔI[i,j] are expressed by the following equation (8).
[0195]
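$$G_x[i,j] = I_x^0[i,j] + I_x^1[i,j], \quad G_y[i,j] = I_y^0[i,j] + I_y^1[i,j], \quad \Delta I[i,j] = I^0[i,j] - I^1[i,j] \tag{8}$$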
[0196] As described above, local motion estimates are calculated at the sub-block level.
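The exact expressions of equations (6) and (7) are not reproduced in this text, so the following Python sketch is only one possible reading of the derivation, not the formulas of the disclosure. It accumulates the five weighted sums over the pixels of the prediction subblock using the quantities described for equation (8), and then solves for u first and for v given u. The sign applied to the sGxdI and sGydI terms, the closed-form solution, the handling of zero denominators, and all function and parameter names are assumptions.

```python
def local_motion(i0, i1, ix0, ix1, iy0, iy1, coords, weight=lambda i, j: 1):
    """Derive (u, v) for one subblock from prediction and gradient images.

    All images are indexed [row][col]; coords lists the (i, j) = (col, row)
    positions of the prediction subblock that are referenced.
    """
    s_gx2 = s_gy2 = s_gxgy = s_gxdi = s_gydi = 0
    for i, j in coords:
        w = weight(i, j)
        gx = ix0[j][i] + ix1[j][i]   # sum of horizontal gradient values
        gy = iy0[j][i] + iy1[j][i]   # sum of vertical gradient values
        di = i0[j][i] - i1[j][i]     # difference of the two prediction images
        s_gx2 += w * gx * gx
        s_gy2 += w * gy * gy
        s_gxgy += w * gx * gy
        s_gxdi += w * -gx * di       # assumed sign
        s_gydi += w * -gy * di       # assumed sign
    # Assumed closed form: solve for u first, then for v given u.
    u = s_gxdi / s_gx2 if s_gx2 else 0.0
    v = (s_gydi - u * s_gxgy) / s_gy2 if s_gy2 else 0.0
    return u, v


coords = [(i, j) for j in range(4) for i in range(4)]
ones = [[1] * 4 for _ in range(4)]
zeros = [[0] * 4 for _ in range(4)]
i0 = [[104] * 4 for _ in range(4)]
i1 = [[100] * 4 for _ in range(4)]
print(local_motion(i0, i1, ones, ones, zeros, zeros, coords))  # (-2.0, 0.0)
```

A single pair (u, v) is returned for the whole subblock, which is the point of deriving the estimate in subblock units rather than pixel units.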
[0197] [Generating the final predicted image] Next, we will explain how to generate the final predicted image. Each pixel value p[x,y] of the final predicted image is calculated using the pixel values I^0[x,y] of the first predicted image and the pixel values I^1[x,y] of the second predicted image, based on the following equation (9).
[0198]
$$p[x,y] = \left(I^0[x,y] + I^1[x,y] + b[x,y]\right) \gg 1 \tag{9}$$
[0199] Here, b[x,y] represents the correction value for each pixel. In equation (9), the pixel value p[x,y] of the final predicted image is calculated by right-shifting the sum of the pixel value I^0[x,y] of the first predicted image, the pixel value I^1[x,y] of the second predicted image, and the correction value b[x,y] by one bit. The correction value b[x,y] is expressed by the following equation (10).
[0200]
$$b[x,y] = u\left(I_x^0[x,y] - I_x^1[x,y]\right) + v\left(I_y^0[x,y] - I_y^1[x,y]\right) \tag{10}$$
[0201] In equation (10), the correction value b[x,y] is calculated by adding the result of multiplying the difference in horizontal gradient values between the first and second gradient images (I_x^0[x,y] - I_x^1[x,y]) by the estimated horizontal local motion (u) and the result of multiplying the difference in vertical gradient values between the first and second gradient images (I_y^0[x,y] - I_y^1[x,y]) by the estimated vertical local motion (v).
[0202] Note that the arithmetic formulas explained using formulas (6) to (10) are just examples, and any other formula that has a similar effect may be used.
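The per-pixel computation described for equations (9) and (10) can be sketched in Python as follows. The function name final_pixel is hypothetical, and rounding the correction value to an integer before the shift is an assumption made so that the right shift operates on an integer.

```python
def final_pixel(i0, i1, ix0, ix1, iy0, iy1, u, v):
    """Final predicted value for one pixel, following equations (9) and (10)."""
    # Correction value: gradient differences scaled by the local motion estimate.
    b = u * (ix0 - ix1) + v * (iy0 - iy1)
    # Sum of both predictions and the correction, right-shifted by one bit.
    return (i0 + i1 + int(round(b))) >> 1


print(final_pixel(100, 104, 3, 1, 0, 0, u=2, v=0))  # 104
```

When u and v are both 0, the result reduces to the plain average of the two predicted pixel values.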
[0203] [Effects, etc.] As described above, even when using the motion compensation filter and gradient filter according to this modified example, local motion estimates can be derived at the sub-block level. By generating the final predicted image of the current block using the local motion estimates derived at the sub-block level, the same effect as in Embodiment 2 can be obtained.
[0204] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.
[0205] (Modification 2 of Embodiment 2) In Embodiment 2 and its Modification 1 described above, all pixels included in the prediction subblock within the prediction block corresponding to the subblock within the current block were referenced in the derivation of the local motion estimate, but the present disclosure is not limited to this. For example, only some of the multiple pixels included in the prediction subblock may be referenced.
[0206] Therefore, in this modified example, we will explain the case where only some of the pixels included in the prediction subblock are referenced in the derivation of the local motion estimate for each subblock. For example, in equation (7) of Modified Example 1 above, instead of Ω, which is the set of coordinates of all pixels included in the prediction subblock, a set of coordinates of some pixels within the prediction subblock is used. Various patterns can be used as the set of coordinates of some pixels within the prediction subblock.
[0207] Figure 16 shows an example of a pixel pattern referenced in the derivation of local motion estimates in Modification 2 of Embodiment 2. In Figure 16, hatched circles in prediction subblocks 1121 or 1221 indicate referenced pixels, and unhatched circles indicate pixels that are not referenced.
[0208] Each of the seven pixel patterns (a) to (g) in Figure 16 represents a subset of pixels from a group of pixels included in prediction subblock 1121 or 1221. Furthermore, the seven pixel patterns are distinct from one another.
[0209] In Figures 16(a) to (c), only 8 out of 16 pixels in the prediction subblock 1121 or 1221 are referenced. In Figures 16(d) to (g), only 4 out of 16 pixels in the prediction subblock 1121 or 1221 are referenced. In other words, in Figures 16(a) to (c), 8 out of 16 pixels are decimated, and in Figures 16(d) to (g), 12 out of 16 pixels are decimated.
[0210] More specifically, in Figure 16(a), eight pixels are referenced, which are offset by one pixel horizontally/vertically from each other. In Figure 16(b), the left and right pairs of two horizontally aligned pixels are referenced alternately in the vertical direction. In Figure 16(c), the four central pixels and the four corner pixels within prediction subblock 1121 or 1221 are referenced.
[0211] Furthermore, in Figure 16(d) and (e), two pixels each are referenced from the first and third columns from the left. In Figure 16(f), the four corner pixels are referenced. In Figure 16(g), the four central pixels are referenced.
[0212] A pixel pattern may be adaptively selected from a predetermined set of pixel patterns based on two predicted images. For example, a pixel pattern containing a number of pixels corresponding to the representative gradient values of the two predicted images may be selected. Specifically, if the representative gradient value is less than a threshold, a pixel pattern containing 4 pixels (e.g., any of (d) to (g)) may be selected, and otherwise, a pixel pattern containing 8 pixels (e.g., any of (a) to (c)) may be selected.
[0213] When a pixel pattern is selected from among multiple pixel patterns, the local motion estimate of the subblock is derived by referring to the pixels within the prediction subblock indicated by the selected pixel pattern.
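The adaptive selection of a pixel pattern can be sketched in Python as follows. Only the three patterns whose pixel positions are fully stated in the text are included: the four corner pixels, the four central pixels, and their combination. How the representative gradient value is computed is not specified here and is left to the caller; the threshold, the choice of which 4-pixel and 8-pixel pattern to return, and all names are assumptions.

```python
# (col, row) positions within a 4x4 prediction subblock.
CORNERS = [(0, 0), (3, 0), (0, 3), (3, 3)]   # four corner pixels
CENTER = [(1, 1), (2, 1), (1, 2), (2, 2)]    # four central pixels
CENTER_AND_CORNERS = CENTER + CORNERS        # eight pixels


def select_pattern(representative_gradient, threshold):
    """Pick a 4-pixel pattern for small gradients, else an 8-pixel pattern."""
    if representative_gradient < threshold:
        return CENTER
    return CENTER_AND_CORNERS


def place_pattern(pattern, origin):
    """Shift a pattern to the subblock whose top-left pixel is at origin."""
    ox, oy = origin
    return [(ox + x, oy + y) for x, y in pattern]


print(len(select_pattern(1, 10)), len(select_pattern(20, 10)))  # 4 8
```

The coordinates returned by place_pattern take the place of the full set Ω when the sums for the local motion estimate are accumulated.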
[0214] The information indicating the selected pixel pattern may be written to the bitstream. In this case, the decoder can obtain the information from the bitstream and select the pixel pattern based on the obtained information. The information indicating the selected pixel pattern can be written to a header, for example, on a block, slice, picture, or stream basis.
[0215] As described above, the encoding and decoding devices according to this embodiment can derive local motion estimates on a subblock basis by referencing only some of the pixels among the multiple pixels included in the prediction subblock. Therefore, the processing load or processing time can be reduced compared to when all of the multiple pixels are referenced.
[0216] Furthermore, according to the encoding and decoding devices of this embodiment, local motion estimates can be derived in subblock units by referencing only the pixels included in a selected pixel pattern from among multiple pixel patterns. Therefore, by switching pixel patterns, it becomes possible to reference pixels suitable for deriving local motion estimates for subblocks, thereby reducing prediction errors.
[0217] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.
[0218] (Other variations of Embodiment 2) The above describes encoding and decoding devices according to one or more embodiments of the present disclosure based on embodiments and their modifications. However, the present disclosure is not limited to these embodiments and their modifications. Various modifications that a person skilled in the art can conceive of may be applied to these embodiments or their modifications, as long as they do not depart from the spirit of the present disclosure.
[0219] For example, the number of taps in the motion compensation filter in Embodiment 2 and its Modification 1 was 8, but the number of taps is not limited to this. The number of taps in the motion compensation filter may be any other number, as long as the interpolation reference range is included in the normal reference range.
[0220] In Embodiment 2 and its Modification 1 described above, the number of taps in the gradient filter was 6 or 5, but it is not limited to these. Any other number of taps is acceptable as long as the gradient reference range is included in the interpolation reference range.
[0221] In Embodiment 2 and its Modification 1 described above, the first gradient reference range and the second gradient reference range were included in the first interpolation reference range and the second interpolation reference range, respectively, but the present disclosure is not limited to this. For example, the first gradient reference range may coincide with the first interpolation reference range, and the second gradient reference range may coincide with the second interpolation reference range.
[0222] Furthermore, when deriving local motion estimates at the subblock level, the pixel values may be weighted so that the values of the central pixels in the prediction subblock are reflected preferentially. In other words, in deriving local motion estimates, the values of multiple pixels included in the prediction subblock may be weighted and used in each of the first and second prediction blocks, and in that case, the pixels located in the center of the prediction subblock may have a larger weight. More specifically, for example, in Modification 1 of Embodiment 2, the weight coefficient w[i,j] in equation (7) may have a larger value as the coordinate value is closer to the center of the prediction subblock.
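A center-weighted coefficient w[i,j] of this kind can be sketched in Python as follows. The specific weight values (2 for the central pixels, 1 elsewhere) and the function name center_weight are assumptions chosen only to illustrate the idea.

```python
def center_weight(i, j, size=4):
    """Assumed weight: 2 for the central pixels of the subblock, 1 elsewhere."""
    lo, hi = size // 2 - 1, size // 2
    return 2 if lo <= i <= hi and lo <= j <= hi else 1


weights = [[center_weight(i, j) for i in range(4)] for j in range(4)]
for row in weights:
    print(row)
```

For a 4x4 prediction subblock this gives the four central pixels twice the weight of the twelve surrounding pixels.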
[0223] Furthermore, when deriving local motion estimates at the subblock level, pixels in other adjacent prediction subblocks belonging to the same prediction block may also be referenced. In other words, in each of the first and second prediction blocks, local motion estimates at the subblock level may be derived by referencing not only the multiple pixels included in the prediction subblock, but also the pixels included in other prediction subblocks adjacent to that prediction subblock within the same prediction block.
[0224] Note that the reference ranges of the motion compensation filter and gradient filter in Embodiment 2 and its modified example 1 are illustrative and do not need to be limited thereto.
[0225] In the modified example 2 of Embodiment 2 described above, seven pixel patterns were exemplified, but the invention is not limited to these. For example, pixel patterns obtained by rotating each of the seven pixel patterns may be used.
[0226] Note that the weight coefficient values in the modified example 1 of Embodiment 2 are merely examples and are not limited thereto. Also, the block size and subblock size in Embodiment 2 and its various modified examples are merely examples and are not limited to 8x8 pixel size and 4x4 pixel size. Inter prediction can be performed with other sizes as in Embodiment 2 and its various modified examples.
[0227] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.
[0228] (Embodiment 3) In each of the above embodiments, each functional block can typically be implemented by an MPU and memory, etc. Furthermore, the processing performed by each functional block is typically implemented by a program execution unit such as a processor reading and executing software (program) recorded on a recording medium such as ROM. This software may be distributed by download, etc., or it may be recorded on a recording medium such as semiconductor memory and distributed. Of course, it is also possible to implement each functional block by hardware (dedicated circuitry).
[0229] Furthermore, the processing described in each embodiment may be implemented by centralized processing using a single device (system), or by distributed processing using multiple devices. Also, the processor executing the above program may be one or multiple. In other words, centralized processing may be performed, or distributed processing may be performed.
[0230] The embodiments of this disclosure are not limited to those described above, and various modifications are possible, which are also included within the scope of the embodiments of this disclosure.
[0231] Furthermore, here we will describe application examples of the video encoding method (image encoding method) or video decoding method (image decoding method) shown in each of the above embodiments, and a system using the same. The system is characterized by having an image encoding device using the image encoding method, an image decoding device using the image decoding method, and an image encoding and decoding device that includes both. Other configurations in the system can be appropriately modified as needed.
[0232] [Usage example] Figure 17 shows the overall configuration of the content supply system ex100 that realizes the content distribution service. The service area for the communication service is divided into cells of a desired size, and fixed radio stations, base stations ex106, ex107, ex108, ex109, and ex110, are installed in each cell.
[0233] In this content supply system ex100, various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 are connected to the internet ex101 via an internet service provider ex102 or a communication network ex104, and base stations ex106 to ex110. The content supply system ex100 may also connect any combination of the above elements. The devices may be connected to each other directly or indirectly via a telephone network, short-range radio, or the like, without going through the base stations ex106 to ex110, which are fixed radio stations. In addition, the streaming server ex103 is connected to various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 via the internet ex101, etc. Furthermore, the streaming server ex103 is connected to terminals in a hotspot on an airplane ex117 via satellite ex116.
[0234] Note that instead of base stations ex106 to ex110, wireless access points or hotspots may be used. Also, streaming server ex103 may be connected directly to the communication network ex104 without going through the internet ex101 or internet service provider ex102, or it may be connected directly to the airplane ex117 without going through satellite ex116.
[0235] Camera ex113 is a device capable of taking still images and videos, such as a digital camera. Smartphone ex115 is a smartphone, mobile phone, or PHS (Personal Handyphone System) that supports mobile communication systems generally known as 2G, 3G, 3.9G, 4G, and the upcoming 5G.
[0236] Home appliance ex118 refers to appliances such as refrigerators or equipment included in household fuel cell cogeneration systems.
[0237] In the content supply system ex100, live streaming becomes possible when a terminal with a shooting function is connected to the streaming server ex103 via a base station ex106 or the like. In live streaming, the terminal (computer ex111, game console ex112, camera ex113, home appliance ex114, smartphone ex115, and terminal inside an airplane ex117, etc.) performs the encoding process described in each of the above embodiments on still images or video content captured by the user using the terminal, multiplexes the video data obtained by encoding with sound data encoded from the sound corresponding to the video, and transmits the obtained data to the streaming server ex103. In other words, each terminal functions as an image encoding device according to one aspect of this disclosure.
[0238] Meanwhile, the streaming server ex103 streams the content data sent to the requesting client. The client is a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, a smartphone ex115, or a terminal on an airplane ex117, etc., that is capable of decoding the encoded data. Each device that receives the distributed data decodes and plays back the received data. That is, each device functions as an image decoding device according to one aspect of this disclosure.
[0239] [Distributed Processing] Furthermore, the streaming server ex103 may consist of multiple servers or computers that distribute data processing, recording, and distribution. For example, the streaming server ex103 may be implemented using a CDN (Content Delivery Network), where content delivery is achieved through a network connecting numerous edge servers distributed worldwide. In a CDN, the physically closest edge server is dynamically assigned depending on the client. Latency can be reduced by caching and delivering content to the edge server. In addition, if an error occurs or the communication state changes due to an increase in traffic, processing can be distributed among multiple edge servers, the delivery entity can be switched to another edge server, or delivery can be continued by bypassing the failed part of the network, thus enabling high-speed and stable delivery.
[0240] Furthermore, not only the distribution process itself can be decentralized, but the encoding process of the captured data can also be performed on each terminal, on the server side, or shared between them. As an example, generally in the encoding process, the processing loop is performed twice. In the first loop, the complexity of the image or the amount of code is detected for each frame or scene. Also, in the second loop, processing is performed to improve the encoding efficiency while maintaining the image quality. For example, if the terminal performs the first encoding process and the server that receives the content performs the second encoding process, it is possible to improve the quality and efficiency of the content while reducing the processing load on each terminal. In this case, if there is a requirement to receive and decode almost in real time, the already encoded data processed by the terminal can be received and played back by other terminals, enabling more flexible real-time distribution.
[0241] As another example, cameras such as ex113 perform feature extraction from images, compress the data related to the features as metadata, and transmit it to the server. The server performs compression according to the meaning of the image, for example, by judging the importance of the object from the features and switching the quantization accuracy. Feature data is particularly effective in improving the accuracy and efficiency of motion vector prediction during re-compression on the server. Also, simple encoding such as VLC (Variable Length Coding) can be performed on the terminal, and encoding with a large processing load such as CABAC (Context Adaptive Binary Arithmetic Coding) can be performed on the server.
[0242] As yet another example, in a stadium, shopping mall, factory, etc., there may be a plurality of video data in which substantially the same scene is captured by a plurality of terminals. In this case, using the plurality of terminals that performed the shooting, and other terminals and servers that did not shoot as necessary, encoding processes are respectively assigned and decentralized processing is performed, for example, in units of GOP (Group of Pictures), picture, or tiles obtained by dividing the picture. This can reduce the delay and achieve more real-time performance.
[0243] In addition, since the plurality of video data is of substantially the same scene, the server may manage and/or give instructions so that the video data captured by each terminal can be mutually referenced. Alternatively, the server may receive the encoded data from each terminal, change the reference relationship between the plurality of data, or correct or replace the picture itself and re-encode it. Thereby, a stream with enhanced quality and efficiency of each piece of data can be generated.
[0244] In addition, the server may perform transcoding to change the encoding method of the video data and then distribute the video data. For example, the server may convert an MPEG-based encoding method to a VP-based method, or convert H.264 to H.265.
[0245] Thus, the encoding process can be performed by a terminal or one or more servers. Therefore, hereinafter, descriptions such as "server" or "terminal" will be used as the entity performing the process, but part or all of the processes performed by the server may be performed by the terminal, or part or all of the processes performed by the terminal may be performed by the server. Also, regarding these, the same applies to the decoding process.
[0246] [3D, Multi-angle] In recent years, it has also become increasingly common to integrate and utilize different scenes captured by terminals such as a plurality of cameras ex113 and / or smartphones ex115 that are substantially synchronized with each other, or images or videos of the same scene captured from different angles. The videos captured by each terminal are integrated based on the relative positional relationship between the terminals obtained separately, or the regions where the feature points included in the videos match.
[0247] The server may not only encode 2D video but also encode still images automatically based on scene analysis of the video, or at a time specified by the user, and send them to the receiving terminal. Furthermore, if the server can obtain the relative positional relationship between the shooting terminals, it can generate a 3D shape of the scene based not only on 2D video but also on video of the same scene taken from different angles. The server may also separately encode 3D data generated by a point cloud, or it may select or reconstruct video to send to the receiving terminal from video taken by multiple terminals based on the results of recognizing or tracking a person or object using the 3D data.
[0248] In this way, users can enjoy scenes by arbitrarily selecting each video corresponding to each shooting terminal, or they can enjoy content in which video from an arbitrary viewpoint is extracted from 3D data reconstructed using multiple images or videos. Furthermore, just like the video, sound can also be collected from multiple different angles, and the server may multiplex and transmit sound from a specific angle or space in conjunction with the video.
[0249] In recent years, content that links the real world with a virtual world, such as Virtual Reality (VR) and Augmented Reality (AR), has also become popular. In the case of VR images, the server may create separate viewpoint images for the right and left eyes and perform encoding that allows referencing between the viewpoint images using Multi-View Coding (MVC), or it may encode them as separate streams without referencing each other. When decoding the separate streams, it is advisable to synchronize playback so that the virtual 3D space is reproduced according to the user's viewpoint.
[0250] In the case of AR images, the server superimposes virtual object information from the virtual space onto camera information from the real space, based on its three-dimensional position or the user's viewpoint movement. The decoding device may acquire or store the virtual object information and three-dimensional data, generate a two-dimensional image according to the user's viewpoint movement, and create superimposed data by smoothly stitching them together. Alternatively, the decoding device may send the user's viewpoint movement to the server in addition to requesting virtual object information, and the server may create superimposed data from the three-dimensional data held by the server according to the received viewpoint movement, encode the superimposed data, and distribute it to the decoding device. The superimposed data may have an α value indicating transparency in addition to RGB, and the server may set the α value of parts other than the object created from the three-dimensional data to 0, etc., so that those parts are transparent, and encode the data. Alternatively, the server may set a predetermined RGB value to the background, like chroma keying, and generate data in which parts other than the object are the background color.
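The use of an α value, or of a key color, to make the parts other than the object transparent can be illustrated per pixel as follows. This is a minimal Python sketch under assumptions: α is normalized to the range 0 to 1, pixels are 8-bit RGB tuples, and the function names are hypothetical.

```python
# Hypothetical per-pixel sketch of alpha blending and chroma keying.
# Assumes 8-bit RGB tuples and alpha normalized to the range [0, 1].

def composite(object_rgb, alpha, camera_rgb):
    """Blend one virtual-object pixel over one camera pixel."""
    return tuple(
        round(alpha * o + (1.0 - alpha) * c)
        for o, c in zip(object_rgb, camera_rgb)
    )


def chroma_key_alpha(rgb, key_rgb):
    """Treat pixels that equal the predetermined background color as transparent."""
    return 0.0 if rgb == key_rgb else 1.0


camera_pixel = (10, 20, 30)
key_color = (0, 255, 0)

# A pixel outside the object carries alpha 0, so the camera pixel shows through.
print(composite((255, 0, 0), 0.0, camera_pixel))
# A key-colored pixel is mapped to alpha 0 and is also replaced by the camera pixel.
print(composite(key_color, chroma_key_alpha(key_color, key_color), camera_pixel))
```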
[0251] Similarly, the decoding process of the distributed data can be performed on each client terminal, on the server side, or shared between them. For example, one terminal may send a reception request to the server, and another terminal may receive the content corresponding to that request, decode it, and transmit the decoded signal to a device with a display. By distributing the processing and selecting appropriate content regardless of the performance of the communication-capable terminal itself, data with good image quality can be played back. As another example, while large image data is received on a TV or similar device, a partial region of the picture, such as a tile, may be decoded and displayed on a viewer's personal terminal. This allows viewers to share the overall picture while checking their own area of responsibility, or an area they want to examine in more detail, on their own device.
[0252] In the future, it is expected that content will be seamlessly received by switching appropriate data for the connected communication, using distribution system standards such as MPEG-DASH, in situations where multiple short-range, medium-range, or long-range wireless communications are available both indoors and outdoors. This will allow users to freely select and switch in real time between decoding devices or display devices, such as displays installed indoors or outdoors, as well as their own terminals. Furthermore, decoding can be performed while switching between the decoding terminal and the display terminal based on the user's location information. This will make it possible to display map information on the wall or part of the ground of an adjacent building with a displayable device embedded, while traveling to a destination. It will also be possible to switch the bitrate of the received data based on the ease of access to the encoded data on the network, such as when the encoded data is cached on a server that can be accessed quickly from the receiving terminal, or copied to an edge server in the content delivery service.
[0253] [Scalable encoding] Content switching is described using a scalable stream, shown in Figure 18, that is compressed and encoded using the video encoding method described in each of the above embodiments. The server may hold multiple streams with the same content but different qualities as individual streams, but it may instead be configured to switch content by taking advantage of the characteristics of a temporally / spatially scalable stream realized by encoding in layers, as shown in the figure. In other words, the decoding side can freely switch between decoding low-resolution and high-resolution content by deciding which layer to decode up to according to internal factors such as performance and external factors such as the state of the communication bandwidth. For example, if a user wants to continue watching, on a device such as an internet TV after returning home, a video that the user was watching on the smartphone ex115 while traveling, that device only needs to decode the same stream up to a different layer, which reduces the burden on the server.
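The decision about which layer to decode up to can be sketched as follows. This Python example assumes that each layer adds a known bitrate on top of the layers below it and that the device reports the highest layer it can handle; both the parameters and the selection rule are assumptions for illustration only.

```python
# Hypothetical sketch of layer selection on the decoding side.
# layer_bitrates_kbps[i] is the additional bitrate of layer i (0 = base layer).

def select_layer(layer_bitrates_kbps, bandwidth_kbps, max_layer_supported):
    """Return the index of the highest layer that fits both constraints.

    The base layer (index 0) is returned as a fallback even when the
    bandwidth is below its bitrate.
    """
    chosen = 0
    cumulative = 0
    for idx, rate in enumerate(layer_bitrates_kbps):
        cumulative += rate
        # Stop at the first layer that exceeds the internal factor (device
        # capability) or the external factor (available bandwidth).
        if idx > max_layer_supported or cumulative > bandwidth_kbps:
            break
        chosen = idx
    return chosen


layers = [500, 1500, 4000]
print(select_layer(layers, bandwidth_kbps=2500, max_layer_supported=2))
print(select_layer(layers, bandwidth_kbps=10000, max_layer_supported=1))
```

Because every device reads the same stream and differs only in the layer at which it stops, the server does not need to hold a separate stream per quality level in this arrangement.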
[0254] Furthermore, in addition to the configuration described above, in which pictures are encoded for each layer and an enhancement layer exists above the base layer to achieve scalability, the enhancement layer may include metadata based on statistical information of the image, and the decoding side may generate high-quality content by applying super-resolution to the base-layer picture based on the metadata. Super-resolution may refer to either an improvement in the signal-to-noise ratio at the same resolution or an increase in resolution. The metadata may include information for identifying linear or nonlinear filter coefficients used in the super-resolution process, or information for identifying parameter values in the filtering process, machine learning, or least-squares operation used in the super-resolution process.
[0255] Alternatively, the picture may be divided into tiles or similar structures according to the meaning of objects within the image, and the decoding side may select tiles to decode, thereby decoding only a portion of the area. Furthermore, by storing object attributes (people, cars, balls, etc.) and their positions within the image (coordinate positions within the same image, etc.) as metadata, the decoding side can identify the location of a desired object based on the metadata and determine the tile containing that object. For example, as shown in Figure 19, the metadata is stored using a data storage structure different from pixel data, such as the SEI message in HEVC. This metadata indicates, for example, the position, size, or color of the main object.
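Determining the tiles that contain an object from position metadata can be sketched as follows. This Python example assumes a uniform tile grid numbered in raster order and a bounding box given in pixel coordinates; the parameter names and the metadata layout are hypothetical and are not taken from the SEI syntax.

```python
# Hypothetical sketch: map an object's bounding box (from metadata) to the
# indices of the tiles that must be decoded. Assumes a uniform tile grid
# numbered in raster order.

def tiles_for_object(x, y, width, height, tile_w, tile_h, tiles_per_row):
    """Return the raster-order indices of all tiles overlapped by the box."""
    first_col = x // tile_w
    last_col = (x + width - 1) // tile_w
    first_row = y // tile_h
    last_row = (y + height - 1) // tile_h
    return [
        row * tiles_per_row + col
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 1920x1080 picture split into 480x270 tiles has 4 tiles per row.
print(tiles_for_object(400, 200, 200, 100, tile_w=480, tile_h=270, tiles_per_row=4))
```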
[0256] Furthermore, metadata may be stored in units consisting of multiple pictures, such as streams, sequences, or random access units. This allows the decoding side to obtain information such as the time at which a specific person appears in the video, and by combining this with the picture-level information, it can identify the picture in which the object exists and the object's position within that picture.
[0257] [Web page optimization] Figure 20 shows an example of a web page display screen on a computer ex111, etc. Figure 21 shows an example of a web page display screen on a smartphone ex115, etc. As shown in Figures 20 and 21, a web page may contain multiple linked images, which are links to image content, and their appearance will differ depending on the viewing device. When multiple linked images are visible on the screen, the display device (decoder) will display still images or I-pictures from each content as linked images, display video such as a GIF animation using multiple still images or I-pictures, or receive only the base layer and decode and display the video, until the user explicitly selects a linked image, or until the linked image approaches the center of the screen or the entire linked image is within the screen.
[0258] When a linked image is selected by the user, the display device prioritizes decoding the base layer. If the HTML of the web page contains information indicating that the content is scalable, the display device may decode up to the enhancement layer. Furthermore, to ensure real-time performance, before selection or when bandwidth is very limited, the display device can decode and display only forward-referenced pictures (I-pictures, P-pictures, and B-pictures that only use forward references), thereby reducing the delay between the decoding time and display time of the first picture (the delay from the start of content decoding to the start of display). Alternatively, the display device may deliberately ignore the reference relationships between pictures and roughly decode all B-pictures and P-pictures using forward references, then perform normal decoding as time passes and more pictures are received.
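The selection of pictures that use only forward references can be sketched as follows. This Python example assumes that each picture is described by its display-order number and the display-order numbers of the pictures it references, listed in decoding order; this representation is an assumption for illustration. A picture is kept only if every reference precedes it in display order and was itself kept, so the kept subset remains decodable.

```python
# Hypothetical sketch: keep only pictures that can be decoded and displayed
# using forward references alone. Each entry is (display_order, references),
# listed in decoding order.

def forward_only_subset(pictures):
    """Return the display-order numbers of the kept pictures, in decoding order."""
    kept = set()
    result = []
    for display_order, references in pictures:
        if all(ref < display_order and ref in kept for ref in references):
            kept.add(display_order)
            result.append(display_order)
    return result


decoding_order = [
    (0, []),      # I-picture
    (4, [0]),     # P-picture, forward reference only
    (2, [0, 4]),  # B-picture that also references a later picture
    (1, [0]),     # B-picture, forward reference only
    (3, [2]),     # references a picture that was dropped
]
print(forward_only_subset(decoding_order))
```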
[0259] [Autonomous driving] Furthermore, when transmitting and receiving still images or video data such as 2D or 3D map information for autonomous driving or driving assistance of a vehicle, the receiving terminal may receive metadata such as weather or construction information in addition to image data belonging to one or more layers, and decode these in association with each other. The metadata may belong to a layer, or it may simply be multiplexed with the image data.
[0260] In this case, since the vehicle, drone, or airplane containing the receiving terminal is in motion, the receiving terminal can transmit its location information when a reception request is made, enabling seamless reception and decoding while switching between base stations ex106 to ex110. Furthermore, the receiving terminal can dynamically switch how much metadata is received or how much map information is updated, depending on the user's selection, the user's situation, or the state of the communication bandwidth.
[0261] As described above, the content supply system ex100 allows the client to receive, decode, and play back encoded information transmitted by the user in real time.
[0262] [Distribution of personal content] Furthermore, the ex100 content delivery system allows for unicast or multicast distribution of not only high-definition, long-duration content from video distribution companies, but also low-definition, short-duration content from individuals. It is also expected that the amount of such individual content will continue to increase. To improve the quality of individual content, the server may perform editing before encoding. This can be achieved, for example, with the following configuration.
[0263] In real time during shooting, or after shooting once the content has been accumulated, the server performs recognition processing such as shooting error detection, scene search, semantic analysis, and object detection on the original picture or the encoded data. Then, based on the recognition results, the server edits the content manually or automatically, for example by correcting defocus or camera shake, deleting less important scenes such as scenes that are darker or more out of focus than other pictures, emphasizing the edges of objects, or changing the color tone. The server encodes the edited data based on the editing results. It is also known that viewership decreases if the shooting time is too long. Based on the image processing results, the server may therefore automatically clip not only the less important scenes described above but also scenes with little movement, so that the content fits within a specific time range that depends on the shooting time. Alternatively, the server may generate a digest based on the result of semantic analysis of the scenes and encode it.
[0264] Note that personal content may, as it is, include material that infringes copyright, moral rights of the author, portrait rights, or the like, and may be inconvenient for the individual, for example when the content is shared beyond the intended scope. Therefore, for example, the server may deliberately change the faces of people in the peripheral part of the screen, or the inside of a house, to an out-of-focus image before encoding. The server may also recognize whether the face of a person other than a pre-registered person appears in the image to be encoded, and if so, perform processing such as applying a mosaic to the face. Alternatively, as pre-processing or post-processing of encoding, the user may designate a person or background area to be processed from the perspective of copyright or the like, and the server may perform processing such as replacing the designated area with another image or blurring it. For a person, the image of the face can be replaced while the person is tracked in the moving image.
[0265] Furthermore, because there is strong demand for real-time viewing of personal content, which tends to have a small data volume, the decoder first receives, decodes, and plays the base layer with top priority, depending on the bandwidth. During this time, the decoder can receive the enhancement layer, and if playback is looped or the content is otherwise played more than once, it may play the high-quality video including the enhancement layer. With a stream that uses scalable encoding in this way, it is possible to provide an experience in which the video is rough when unselected or at the beginning of viewing, but the stream gradually becomes smarter and the image quality improves. In addition to scalable encoding, a similar experience can be provided even if the rough stream played the first time and a second stream encoded with reference to the first video are configured as a single stream.
[0266] [Other usage examples] Furthermore, these encoding or decoding processes are generally performed by the LSI ex500 included in each terminal. The LSI ex500 may be a single chip or a multi-chip configuration. Alternatively, video encoding or decoding software may be stored on some recording medium (such as a CD-ROM, flexible disk, or hard disk) that can be read by the computer ex111 or the like, and the encoding or decoding process may be performed using that software. In addition, if the smartphone ex115 has a camera, video data acquired by that camera may be transmitted. In this case, the video data is data encoded by the LSI ex500 included in the smartphone ex115.
[0267] The LSI ex500 may also be configured to download and activate application software. In this case, the terminal first determines whether it supports the encoding method of the content or whether it has the capability to perform the specific service. If the terminal does not support the encoding method of the content or does not have the capability to perform the specific service, the terminal downloads the codec or application software, and then acquires and plays the content.
[0268] Furthermore, at least one of the video encoding device (image encoding device) or the video decoding device (image decoding device) of each of the above embodiments can be incorporated not only into the content supply system ex100 via the Internet ex101 but also into a digital broadcasting system. Because a digital broadcasting system transmits and receives multiplexed data, in which video and sound are multiplexed, on broadcast radio waves using a satellite or the like, it is better suited to multicast, whereas the content supply system ex100 has a configuration in which unicast is easy. The encoding and decoding processes, however, can be applied in the same way.
[0269] [Hardware configuration] Figure 22 shows the smartphone ex115. Figure 23 shows an example of the configuration of the smartphone ex115. The smartphone ex115 includes an antenna ex450 for transmitting and receiving radio waves to and from the base station ex110, a camera unit ex465 capable of capturing video and still images, and a display unit ex458 that displays video captured by the camera unit ex465 and data obtained by decoding video received by the antenna ex450. The smartphone ex115 further includes an operation unit ex466, such as a touch panel; an audio output unit ex457, such as a speaker for outputting voice or sound; an audio input unit ex456, such as a microphone for inputting voice; a memory unit ex467 capable of storing encoded or decoded data such as captured video or still images, recorded audio, received video or still images, and emails; and a slot unit ex464, which is an interface to the SIM ex468 for identifying the user and authenticating access to the network and to various data. External memory may be used instead of the memory unit ex467.
[0270] Furthermore, the main control unit ex460, which comprehensively controls the display unit ex458, the operation unit ex466, and the like, is connected via the bus ex470 to the power supply circuit unit ex461, the operation input control unit ex462, the video signal processing unit ex455, the camera interface unit ex463, the display control unit ex459, the modulation / demodulation unit ex452, the multiplexing / demultiplexing unit ex453, the audio signal processing unit ex454, the slot unit ex464, and the memory unit ex467.
[0271] The power supply circuit unit ex461, when the power key is turned on by the user, supplies power from the battery pack to each component, thereby starting up the smartphone ex115 and making it operational.
[0272] The smartphone ex115 performs processing such as calls and data communication under the control of the main control unit ex460, which has a CPU, ROM, RAM, and the like. During a call, the audio signal picked up by the audio input unit ex456 is converted into a digital audio signal by the audio signal processing unit ex454, subjected to spread spectrum processing by the modulation / demodulation unit ex452, subjected to digital-to-analog conversion and frequency conversion by the transmission / reception unit ex451, and then transmitted via the antenna ex450. Received data is amplified, subjected to frequency conversion and analog-to-digital conversion, subjected to inverse spread spectrum processing by the modulation / demodulation unit ex452, converted into an analog audio signal by the audio signal processing unit ex454, and then output from the audio output unit ex457. In data communication mode, text, still image, or video data is sent to the main control unit ex460 via the operation input control unit ex462 in response to operation of the operation unit ex466 or the like of the main body, and transmission and reception processing is performed in the same manner. When transmitting video, still images, or video and audio in data communication mode, the video signal processing unit ex455 compresses and encodes the video signal stored in the memory unit ex467 or the video signal input from the camera unit ex465 using the video encoding method shown in each of the above embodiments, and sends the encoded video data to the multiplexing / demultiplexing unit ex453. The audio signal processing unit ex454 encodes the audio signal picked up by the audio input unit ex456 while the camera unit ex465 is capturing video or still images, and sends the encoded audio data to the multiplexing / demultiplexing unit ex453. The multiplexing / demultiplexing unit ex453 multiplexes the encoded video data and the encoded audio data in a predetermined manner, the modulation / demodulation unit (modulation / demodulation circuit unit) ex452 and the transmission / reception unit ex451 perform modulation processing and conversion processing, and the data is transmitted via the antenna ex450.
[0273] When video attached to an email or chat, or video linked from a web page or the like, is received, the multiplexing / demultiplexing unit ex453 demultiplexes the multiplexed data received via the antenna ex450 in order to decode it, dividing the multiplexed data into a bitstream of video data and a bitstream of audio data. It then supplies the encoded video data to the video signal processing unit ex455 and the encoded audio data to the audio signal processing unit ex454 via the synchronization bus ex470. The video signal processing unit ex455 decodes the video signal using a video decoding method corresponding to the video encoding method shown in each of the above embodiments, and the video or still image contained in the linked video file is displayed on the display unit ex458 via the display control unit ex459. The audio signal processing unit ex454 decodes the audio signal, and audio is output from the audio output unit ex457. Because real-time streaming has become widespread, audio playback may be socially inappropriate depending on the user's circumstances. Therefore, as an initial setting, a configuration that plays only the video data without playing the audio signal is preferable. Audio may be synchronized and played only when the user performs an operation, such as clicking the video data.
[0274] Furthermore, although the smartphone ex115 was used as an example here, there are three possible implementation formats for terminals: a transceiver-type terminal that has both an encoder and a decoder, a transmitting terminal that has only an encoder, and a receiving terminal that has only a decoder. In addition, although it was explained that multiplexed data, in which audio data etc. is multiplexed with video data, is received or transmitted in a digital broadcasting system, the multiplexed data may also include text data related to the video in addition to audio data, or the video data itself may be received or transmitted instead of multiplexed data.
[0275] Although it was explained that the main control unit ex460, including the CPU, controls the encoding or decoding process, terminals often also have a GPU. Therefore, a configuration that leverages the GPU's performance to process a wide area at once using memory shared by the CPU and GPU, or memory whose addresses are managed so that it can be used in common, is also possible. This can shorten the encoding time, ensure real-time performance, and achieve low latency. In particular, it is efficient to perform motion detection, deblocking filters, SAO (Sample Adaptive Offset), and transformation / quantization processes at once on the GPU, rather than on the CPU, in units such as pictures.
Industrial Applicability
[0276] This disclosure can be used, for example, in television receivers, digital video recorders, car navigation systems, mobile phones, digital cameras, or digital video cameras.
Explanation of Symbols
[0277]
100 Encoding device
102 Division unit
104 Subtraction unit
106 Transform unit
108 Quantization unit
110 Entropy encoding unit
112, 204 Inverse quantization unit
114, 206 Inverse transform unit
116, 208 Addition unit
118, 210 Block memory
120, 212 Loop filter unit
122, 214 Frame memory
124, 216 Intra prediction unit
126, 218 Inter prediction unit
128, 220 Prediction control unit
200 Decoding device
202 Entropy decoding unit
1000 Current picture
1001 Current block
1100 First reference picture
1110 First motion vector
1120 First prediction block
1121, 1221 Prediction subblock
1122, 1222 Top-left pixel
1130, 1130A First interpolation reference range
1131, 1131A, 1132, 1132A, 1231, 1231A, 1232, 1232A Reference range
1135, 1135A First gradient reference range
1140 First prediction image
1150 First gradient image
1200 Second reference picture
1210 Second motion vector
1220 Second prediction block
1230, 1230A Second interpolation reference range
1235, 1235A Second gradient reference range
1240 Second prediction image
1250 Second gradient image
1300 Local motion estimate
1400 Final prediction image
Claims
[Claim 1] A bitstream generation device that generates a bitstream, comprising:
a processor; and
a memory,
wherein the processor uses the memory to generate a bitstream containing motion information indicating a reference picture used in inter prediction,
the inter prediction includes:
obtaining, for bidirectional prediction, two predicted images by interpolating to fractional pixel precision using two reference pictures associated with a block to be encoded included in a picture to be encoded;
obtaining, using pixel values of a plurality of first pixels included in the two predicted images, a plurality of vertical gradient values respectively corresponding to a plurality of second pixels included in a subblock obtained by dividing the block to be encoded;
deriving a motion correction value for the subblock based on the plurality of vertical gradient values; and
generating, at the end of the inter prediction that uses the plurality of vertical gradient values, an output prediction image corresponding to the subblock using the motion correction value for the subblock,
the two predicted images are identified using two motion vectors,
the reference range for the interpolation is included in the normal reference range that is referenced to obtain a fractional-pixel-precision prediction image corresponding to the block to be encoded in normal inter prediction that does not use the plurality of vertical gradient values, and
an 8-tap filter is used in the interpolation to fractional pixel precision.