Encoding device, encoding method, decoding device, decoding method, and transmission method

The encoding device and method use an encoding structure built around an IRAP picture, together with supplemental enhancement information, with the aim of improving coding efficiency and image quality and reducing processing load and circuit size in video encoding and decoding.

JP2026104906A · Pending · Publication Date: 2026-06-25 · PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA
Filing Date
2026-04-09
Publication Date
2026-06-25

AI Technical Summary

Technical Problem

Existing video coding technologies face challenges in improving encoding efficiency and image quality, in reducing processing volume and circuit size, and in appropriately selecting elements or operations such as filters, blocks, motion vectors, and reference pictures.

Method used

An encoding device and method that utilize an encoding structure including an IRAP picture and a specific order of encoding trailing and leading pictures, with supplemental enhancement information to indicate field pictures, allowing for efficient encoding and decoding of interlaced content.

Benefits of technology

This approach enhances coding efficiency, simplifies processing, reduces load and circuit size, and improves processing speed by optimizing the selection of encoding and decoding operations.

✦ Generated by Eureka AI based on patent content.


Abstract

[Problem] To provide an encoding device that can improve encoding efficiency. [Solution] The encoding device (100) comprises a circuit (160) and a memory (162) connected to the circuit (160). In operation, the circuit (160) encodes an image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order. The circuit (160) generates a bitstream that includes the image and a flag. When encoding the image, the circuit (160), according to the flag, encodes up to one of the plurality of trailing pictures before the plurality of leading pictures in encoding order, and encodes the plurality of trailing pictures excluding the up to one trailing picture after the plurality of leading pictures in encoding order.

Description

Technical Field

[0001] The present disclosure relates to video coding, for example, systems, components, and methods in video encoding and decoding, etc.

Background Art

[0002] Video coding technologies have advanced from H.261 and MPEG-1 to H.264 / AVC (Advanced Video Coding), MPEG-LA, H.265 / HEVC (High Efficiency Video Coding), and H.266 / VVC (Versatile Video Coding). Along with this progress, there has always been a need to improve and optimize video coding technologies in order to process the continuously increasing amount of digital video data in various applications.

[0003] Non-Patent Document 1 relates to an example of a conventional standard regarding the video coding technologies described above.

Prior Art Documents

Non-Patent Documents

[0004]

Non-Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0005] For encoding methods such as those described above, new methods are desired for improving encoding efficiency, improving image quality, reducing processing volume, reducing circuit scale, or appropriately selecting elements or operations such as filters, blocks, sizes, motion vectors, reference pictures, or reference blocks.

[0006] This disclosure provides a configuration or method that can contribute to one or more of the following: improved coding efficiency, improved image quality, reduced processing load, reduced circuit size, improved processing speed, and appropriate selection of elements or operations. This disclosure may also include configurations or methods that can contribute to other benefits not mentioned above.

Means for Solving the Problem

[0007] An encoding device according to one aspect of the present disclosure is an encoding device for encoding an image, comprising a circuit and a memory connected to the circuit. In operation, the circuit encodes the image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, and generates a bitstream that contains the image and a flag. When encoding the image, the circuit, according to the flag, encodes up to one of the plurality of trailing pictures before the plurality of leading pictures in encoding order, and encodes the plurality of trailing pictures excluding the up to one trailing picture after the plurality of leading pictures in encoding order. The flag indicates that the picture of each access unit in the bitstream is a field picture, and the circuit encodes each access unit in the bitstream so that it contains an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.

[0008] Some implementations of the embodiments in this disclosure may improve coding efficiency, simplify coding / decoding processes, increase coding / decoding speeds, or efficiently select appropriate components / operations used for coding and decoding, such as appropriate filters, block sizes, motion vectors, reference pictures, reference blocks, etc.

[0009] Further advantages and effects of one aspect of this disclosure will be made apparent from the specification and drawings. Such advantages and / or effects may be obtained by several embodiments and features described in the specification and drawings, but not all of them are necessarily provided to obtain one or more advantages and / or effects.

[0010] These general or specific embodiments may be implemented as systems, methods, integrated circuits, computer programs, recording media, or any combination thereof.

Effects of the Invention

[0011] A configuration or method relating to one aspect of this disclosure may contribute to one or more of the following: improved encoding efficiency, improved image quality, reduced processing load, reduced circuit size, improved processing speed, and appropriate selection of elements or operations. A configuration or method relating to one aspect of this disclosure may also contribute to other benefits not mentioned above.

Brief Description of the Drawings

[0012]
[Figure 1] Figure 1 is a block diagram showing the functional configuration of an encoding device according to an embodiment.
[Figure 2] Figure 2 is a flowchart showing an example of the overall encoding process performed by the encoding device.
[Figure 3] Figure 3 is a conceptual diagram showing an example of block division.
[Figure 4A] Figure 4A is a conceptual diagram showing an example of a slice configuration.
[Figure 4B] Figure 4B is a conceptual diagram showing an example of a tile configuration.
[Figure 5A] Figure 5A is a table showing the transform basis functions corresponding to various transform types.
[Figure 5B] Figure 5B is a conceptual diagram showing an example of SVT (Spatially Varying Transform).
[Figure 6A] Figure 6A is a conceptual diagram showing an example of the shape of a filter used in an ALF (adaptive loop filter).
[Figure 6B] Figure 6B is a conceptual diagram showing another example of the shape of a filter used in an ALF.
[Figure 6C] Figure 6C is a conceptual diagram showing another example of the shape of a filter used in an ALF.
[Figure 7] Figure 7 is a block diagram showing an example of the detailed configuration of a loop filter unit that functions as a DBF (deblocking filter).
[Figure 8] Figure 8 is a conceptual diagram showing an example of a deblocking filter having filter characteristics symmetric with respect to a block boundary.
[Figure 9] Figure 9 is a conceptual diagram for explaining a block boundary where deblocking filter processing is performed.
[Figure 10] Figure 10 is a conceptual diagram showing an example of a Bs value.
[Figure 11] Figure 11 is a flowchart showing an example of the processing performed in the prediction processing unit of an encoding device.
[Figure 12] Figure 12 is a flowchart showing another example of the processing performed in the prediction processing unit of an encoding device.
[Figure 13] Figure 13 is a flowchart showing another example of the processing performed in the prediction processing unit of an encoding device.
[Figure 14] Figure 14 is a conceptual diagram showing an example of 67 intra prediction modes in the intra prediction of an embodiment.
[Figure 15] Figure 15 is a flowchart showing an example of the basic processing flow of inter prediction.
[Figure 16] Figure 16 is a flowchart showing an example of motion vector derivation.
[Figure 17] Figure 17 is a flowchart showing another example of motion vector derivation.
[Figure 18] Figure 18 is a flowchart showing another example of motion vector derivation.
[Figure 19] Figure 19 is a flowchart showing an example of inter-mode prediction.
[Figure 20] Figure 20 is a flowchart showing an example of inter prediction using merge mode.
[Figure 21] Figure 21 is a conceptual diagram illustrating an example of motion vector derivation processing using merge mode.
[Figure 22] Figure 22 is a flowchart showing an example of FRUC (frame rate up conversion) processing.
[Figure 23] Figure 23 is a conceptual diagram illustrating an example of pattern matching (bilateral matching) between two blocks along a motion trajectory.
[Figure 24] Figure 24 is a conceptual diagram illustrating an example of pattern matching (template matching) between a template in the current picture and a block in a reference picture.
[Figure 25A] Figure 25A is a conceptual diagram illustrating an example of deriving a subblock-level motion vector based on the motion vectors of multiple adjacent blocks.
[Figure 25B] Figure 25B is a conceptual diagram illustrating an example of deriving motion vectors for subblock units in an affine mode with three control points.
[Figure 26A] Figure 26A is a conceptual diagram illustrating the affine merge mode.
[Figure 26B] Figure 26B is a conceptual diagram illustrating an affine merge mode with two control points.
[Figure 26C] Figure 26C is a conceptual diagram illustrating an affine merge mode with three control points.
[Figure 27] Figure 27 is a flowchart showing an example of processing in affine merge mode.
[Figure 28A] Figure 28A is a conceptual diagram illustrating an affine inter mode with two control points.
[Figure 28B] Figure 28B is a conceptual diagram illustrating an affine inter mode with three control points.
[Figure 29] Figure 29 is a flowchart showing an example of affine inter mode processing.
[Figure 30A] Figure 30A is a conceptual diagram illustrating an affine inter mode in which the current block has three control points and the adjacent block has two control points.
[Figure 30B] Figure 30B is a conceptual diagram illustrating an affine inter mode in which the current block has two control points and the adjacent block has three control points.
[Figure 31A] Figure 31A is a flowchart showing merge modes including DMVR (decoder motion vector refinement).
[Figure 31B] Figure 31B is a conceptual diagram illustrating an example of DMVR processing.
[Figure 32] Figure 32 is a flowchart showing an example of predictive image generation.
[Figure 33] Figure 33 is a flowchart showing another example of predictive image generation.
[Figure 34] Figure 34 is a flowchart showing another example of predictive image generation.
[Figure 35] Figure 35 is a flowchart illustrating an example of predictive image correction processing using OBMC (overlapped block motion compensation).
[Figure 36] Figure 36 is a conceptual diagram illustrating an example of predictive image correction processing using OBMC.
[Figure 37] Figure 37 is a conceptual diagram illustrating the generation of two triangular predicted images.
[Figure 38] Figure 38 is a conceptual diagram illustrating a model that assumes uniform linear motion.
[Figure 39] Figure 39 is a conceptual diagram illustrating an example of a predictive image generation method using brightness correction processing by LIC (local illumination compensation).
[Figure 40] Figure 40 is a block diagram showing an example implementation of an encoding device.
[Figure 41] Figure 41 is a block diagram showing the functional configuration of a decoding device according to an embodiment.
[Figure 42] Figure 42 is a flowchart showing an example of the overall decoding process by the decoding device.
[Figure 43] Figure 43 is a flowchart showing an example of the processing performed in the prediction processing unit of the decoding device.
[Figure 44] Figure 44 is a flowchart showing another example of the processing performed in the prediction processing unit of the decoding device.
[Figure 45] Figure 45 is a flowchart showing an example of inter-mode prediction in a decoding device.
[Figure 46] Figure 46 is a block diagram showing an example of a decoding device implementation.
[Figure 47] Figure 47 shows an example of the encoding structure of interlaced content.
[Figure 48] Figure 48 is a flowchart showing an example of a decoding method when decoding is initiated from an IRAP picture by a decoding device according to the first aspect of the embodiment.
[Figure 49] Figure 49 is a flowchart showing an example of a decoding method when decoding is initiated from an IRAP picture by a decoding device according to the second aspect of the embodiment.
[Figure 50] Figure 50 shows another example of the encoding structure of interlaced content.
[Figure 51] Figure 51 is a block diagram showing an example of an implementation of an encoding device according to an embodiment.
[Figure 52] Figure 52 is a flowchart showing an example of the operation of the encoding device shown in Figure 51.
[Figure 53] Figure 53 is a block diagram showing an example of an implementation of a decoding device according to an embodiment.
[Figure 54] Figure 54 is a flowchart showing an example of the operation of the decoding device shown in Figure 53.
[Figure 55] Figure 55 is a block diagram showing the overall configuration of the content supply system that realizes the content distribution service.
[Figure 56] Figure 56 is a conceptual diagram showing an example of an encoding structure during scalable encoding.
[Figure 57] Figure 57 is a conceptual diagram showing an example of an encoding structure during scalable encoding.
[Figure 58] Figure 58 is a conceptual diagram showing an example of a web page display screen.
[Figure 59] Figure 59 is a conceptual diagram showing an example of a web page display screen.
[Figure 60] Figure 60 is a block diagram showing an example of a smartphone.
[Figure 61] Figure 61 is a block diagram showing an example of a smartphone configuration.

Modes for Carrying Out the Invention

[0013] For example, an encoding device according to one aspect of the present disclosure is an encoding device for encoding an image, comprising a circuit and a memory connected to the circuit, wherein the circuit encodes the image in operation according to an encoding structure that includes an IRAP (Intra Random Access Point) picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, and when encoding the image, up to one of the plurality of trailing pictures is encoded before the plurality of leading pictures in encoding order, and the plurality of trailing pictures, excluding the up to one trailing picture, are encoded after the plurality of leading pictures in encoding order.

[0014] Thus, when encoding randomly accessible pictures, the encoding device may be able to encode them using a more efficient encoding structure. Furthermore, by encoding with a more efficient encoding structure, the encoding device can reduce the processing load of searching for randomly accessible pictures during decoding, thus potentially improving processing efficiency.

[0015] Here, for example, when encoding the image, if a flag indicating whether the content is interlaced content encoded with one field per access unit is 0, the circuit may encode all of the multiple trailing pictures after the multiple leading pictures in encoding order.

[0016] Furthermore, for example, when encoding the image, if a flag indicating whether the content is interlaced with one field per access unit is set to 1, the circuit may encode up to one of the multiple trailing pictures before the multiple leading pictures in encoding order, and encode the other multiple trailing pictures, excluding the up to one trailing picture, after the multiple leading pictures in encoding order.
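The ordering rule in [0015] and [0016] can be illustrated with a minimal sketch. The function name coding_order and the parameter name field_seq_flag are hypothetical and do not appear in this disclosure. The sketch also assumes that the IRAP picture comes first in encoding order and that the trailing picture allowed to move forward is the first one in the list; neither choice is stated in the paragraphs above.

```python
from typing import List


def coding_order(irap: str, leading: List[str], trailing: List[str],
                 field_seq_flag: int) -> List[str]:
    """Return picture labels in encoding order for one IRAP picture.

    field_seq_flag is a hypothetical name for the flag of [0015]-[0016]:
    0 -> every trailing picture follows the leading pictures,
    1 -> at most one trailing picture may precede the leading pictures.
    """
    if field_seq_flag not in (0, 1):
        raise ValueError("flag must be 0 or 1")
    # Assumption: the picture moved forward is the first trailing picture.
    early = trailing[:1] if field_seq_flag == 1 else []
    late = trailing[len(early):]
    # Assumption: the IRAP picture is first in encoding order.
    return [irap] + early + leading + late


if __name__ == "__main__":
    print(coding_order("I", ["L0", "L1"], ["T0", "T1"], 1))
    print(coding_order("I", ["L0", "L1"], ["T0", "T1"], 0))
```

With the flag set to 1 the single trailing picture T0 is placed directly after the IRAP picture; with the flag set to 0 both trailing pictures follow the leading pictures.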

[0017] Furthermore, a decoding device according to one aspect of the present disclosure is a decoding device for decoding an image, comprising a circuit and a memory connected to the circuit, wherein the circuit decodes the image in operation according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, and when decoding the image, up to one of the plurality of trailing pictures is decoded before the plurality of leading pictures in decoding order, and the plurality of trailing pictures, excluding the up to one trailing picture, are decoded after the plurality of leading pictures in decoding order.

[0018] This may allow the decoding device to use a more efficient encoding structure when decoding randomly accessible pictures. Furthermore, by using a more efficient encoding structure, the decoding device can reduce the processing load of searching for randomly accessible pictures during decoding, potentially improving processing efficiency.

[0019] Here, for example, when decoding the image, if a flag indicating whether the content is interlaced with one field per access unit is 0, the circuit may decode all of the multiple trailing pictures after the multiple leading pictures in decoding order.

[0020] Furthermore, for example, when decoding the image, if a flag indicating whether the content is interlaced with one field per access unit is set to 1, the circuit may decode up to one of the multiple trailing pictures before the multiple leading pictures in decoding order, and decode the other multiple trailing pictures, excluding the up to one trailing picture, after the multiple leading pictures in decoding order.
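On the decoding side, the same rule can be read as a check on the decoding order of the pictures that follow an IRAP picture. The following is a minimal sketch under that reading. The function name order_is_allowed, the parameter name field_seq_flag, and the string labels "leading" and "trailing" are hypothetical. The sketch interprets "before the plurality of leading pictures" as before the first leading picture, so a trailing picture placed between two leading pictures is rejected.

```python
from typing import Sequence


def order_is_allowed(kinds: Sequence[str], field_seq_flag: int) -> bool:
    """Check the decoding order of the pictures that follow an IRAP picture.

    kinds lists each picture as "leading" or "trailing" in decoding order.
    field_seq_flag is a hypothetical name for the flag of [0019]-[0020].
    """
    lead_idx = [i for i, k in enumerate(kinds) if k == "leading"]
    if not lead_idx:
        return True  # no leading pictures, nothing to constrain
    first, last = lead_idx[0], lead_idx[-1]
    # Assumption: no trailing picture may sit between two leading pictures.
    if any(k == "trailing" for k in kinds[first:last]):
        return False
    limit = 1 if field_seq_flag == 1 else 0
    early_trailing = sum(1 for k in kinds[:first] if k == "trailing")
    return early_trailing <= limit


if __name__ == "__main__":
    order = ["trailing", "leading", "leading", "trailing"]
    print(order_is_allowed(order, 1), order_is_allowed(order, 0))
```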

[0021] Furthermore, for example, an encoding method according to one aspect of the present disclosure is an encoding method for encoding an image, which encodes the image according to an encoding structure including an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order, and when encoding the image, up to one of the plurality of trailing pictures is encoded before the plurality of leading pictures in encoding order, and the plurality of trailing pictures, excluding the up to one trailing picture, are encoded after the plurality of leading pictures in encoding order.

[0022] This means that the encoding method may be able to encode randomly accessible pictures using a more efficient encoding structure. Furthermore, by encoding with a more efficient encoding structure, the encoding method can reduce the processing load of searching for randomly accessible pictures during decoding, thus potentially improving processing efficiency.

[0023] Furthermore, for example, a decoding method according to one aspect of the present disclosure is a decoding method for decoding an image, which decodes the image according to an encoding structure including an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order, and when decoding the image, up to one of the plurality of trailing pictures is decoded before the plurality of leading pictures in decoding order, and the plurality of trailing pictures, excluding the up to one trailing picture, are decoded after the plurality of leading pictures in decoding order.

[0024] This means that the decoding method may be able to use a more efficient encoding structure when decoding randomly accessible pictures. Furthermore, by decoding with a more efficient encoding structure, the decoding method can reduce the processing load of searching for randomly accessible pictures during decoding, thus potentially improving processing efficiency.

[0025] Furthermore, these general or specific embodiments may be implemented as systems, devices, methods, integrated circuits, computer programs, or non-transitory recording media such as computer-readable CD-ROMs, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.

[0026] The embodiments will be described in detail below with reference to the drawings. Note that the embodiments described below are all general or specific examples. The numerical values, shapes, materials, components, arrangement and connection forms of components, steps, relationships and sequences of steps shown in the following embodiments are examples only and are not intended to limit the scope of the claims.

[0027] Embodiments of encoding and decoding devices are described below. These embodiments are examples of encoding and decoding devices to which the processes and / or configurations described in each aspect of this disclosure can be applied. The processes and / or configurations can also be implemented in encoding and decoding devices different from those in the embodiments. For example, with respect to the processes and / or configurations applicable to the embodiments, one of the following may be implemented:

[0028] (1) Any of the multiple components of the encoding or decoding device of the embodiments described in each aspect of the present disclosure may be replaced or combined with other components described in any of the aspects of the present disclosure.

[0029] (2) In the encoding or decoding device of the embodiment, any modifications such as addition, replacement, or deletion of functions or processes performed by some of the multiple components of the encoding or decoding device may be made. For example, any of the functions or processes may be replaced or combined with other functions or processes described in any of the embodiments of this disclosure.

[0030] (3) In the methods performed by the encoding or decoding apparatus of the embodiment, any modifications, such as additions, replacements, and deletions, may be made to some of the processes included in the method. For example, any of the processes in the method may be replaced with or combined with other processes described in any of the embodiments of this disclosure.

[0031] (4) Some of the multiple components constituting the encoding or decoding device of the embodiment may be combined with components described in any of the embodiments of this disclosure, or with components that have some of the functions described in any of the embodiments of this disclosure, or with components that perform some of the processing performed by the components described in any of the embodiments of this disclosure.

[0032] (5) Components that provide some of the functions of the encoding or decoding device of the embodiment, or components that perform some of the processing of the encoding or decoding device of the embodiment, may be combined with or replaced with components described in any of the aspects of the disclosure and components that provide some of the functions described in any of the aspects of the disclosure, or components that perform some of the processing described in any of the aspects of the disclosure.

[0033] (6) In a method performed by an encoding or decoding device of an embodiment, any of the processes included in the method may be replaced or combined with any of the processes described in any of the embodiments of the present disclosure.

[0034] (7) Some of the processes included in the methods performed by the encoding or decoding device of the embodiment may be combined with the processes described in any of the embodiments of this disclosure.

[0035] (8) The methods of carrying out the processes and / or configurations described in each aspect of the present disclosure are not limited to the encoding or decoding devices of the embodiments. For example, the processes and / or configurations may be carried out in devices used for purposes other than the video encoding or video decoding disclosed in the embodiments.

[0036] [Encoding device] First, the encoding device according to the embodiment will be described. Figure 1 is a block diagram showing the functional configuration of the encoding device 100 according to the embodiment. The encoding device 100 is a video encoding device that encodes video in block units.

[0037] As shown in Figure 1, the encoding device 100 is a device that encodes an image in block units and comprises a splitting unit 102, a subtraction unit 104, a transformation unit 106, a quantization unit 108, an entropy encoding unit 110, an inverse quantization unit 112, an inverse transformation unit 114, an addition unit 116, a block memory 118, a loop filter unit 120, a frame memory 122, an intra prediction unit 124, an inter prediction unit 126, and a prediction control unit 128.

[0038] The encoding device 100 can be implemented, for example, by a general-purpose processor and memory. In this case, when a software program stored in memory is executed by the processor, the processor functions as the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128. Alternatively, the encoding device 100 may be implemented as one or more dedicated electronic circuits corresponding to the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128.

[0039] The following describes the overall processing flow of the encoding device 100, followed by a description of each component included in the encoding device 100.

[0040] [Overall flow of the encoding process] Figure 2 is a flowchart showing an example of the overall encoding process performed by the encoding device 100.

[0041] First, the splitting unit 102 of the encoding device 100 divides each picture contained in the input image, which is a moving image, into multiple fixed-size blocks (for example, 128 x 128 pixels) (step Sa_1). Then, the splitting unit 102 selects a splitting pattern (also called a block shape) for these fixed-size blocks (step Sa_2). In other words, the splitting unit 102 further divides the fixed-size blocks into multiple blocks that constitute the selected splitting pattern. Then, the encoding device 100 performs the processing in steps Sa_3 to Sa_9 for each of these multiple blocks (i.e., the block to be encoded).

[0042] Specifically, the prediction processing unit, which consists of all or part of the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128, generates a prediction signal (also called a prediction block) for the block to be encoded (also called the current block) (step Sa_3).

[0043] Next, the subtraction unit 104 generates the difference between the block to be encoded and the prediction block as the prediction residual (also called the difference block) (step Sa_4).

[0044] Next, the transformation unit 106 and the quantization unit 108 generate multiple quantization coefficients by performing transformation and quantization on the difference block (step Sa_5). A block consisting of multiple quantization coefficients is also called a coefficient block.

[0045] Next, the entropy encoding unit 110 generates an encoded signal by encoding (specifically, entropy encoding) the coefficient block and the prediction parameters related to the generation of the prediction signal (step Sa_6). The encoded signal is also called an encoded bitstream, compressed bitstream, or stream.

[0046] Next, the inverse quantization unit 112 and the inverse transformation unit 114 restore the prediction residuals (i.e., the difference block) by performing inverse quantization and inverse transformation on the coefficient block (step Sa_7).

[0047] Next, the addition unit 116 reconstructs the current block into a reconstructed image (also called a reconstructed block or decoded image block) by adding the prediction block to the restored difference block (step Sa_8). This generates the reconstructed image.

[0048] Once this reconstructed image is generated, the loop filter unit 120 performs filtering on the reconstructed image as needed (step Sa_9).

[0049] Then, the encoding device 100 determines whether or not the encoding of the entire picture is complete (step Sa_10). If it determines that encoding is not complete (No in step Sa_10), it repeats the process from step Sa_2.

[0050] In the example described above, the encoding device 100 selects one division pattern for a fixed-size block and encodes each block according to that division pattern. However, it may also encode each block according to multiple division patterns. In this case, the encoding device 100 may evaluate the cost of each of the multiple division patterns and select, for example, the encoded signal obtained by encoding according to the division pattern with the smallest cost as the output encoded signal.
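The selection in [0050] can be sketched as a minimum-cost choice over candidate division patterns. This disclosure does not define the cost at this point, so the sketch assumes a rate-distortion style cost, distortion plus a multiplier times the number of bits. The function name pick_pattern, the pattern names, and the numbers in the example are all hypothetical.

```python
from typing import Dict, Tuple


def pick_pattern(results: Dict[str, Tuple[float, int]], lam: float) -> str:
    """Return the division pattern with the smallest cost.

    results maps a pattern name to (distortion, bits) measured after
    encoding the block with that pattern.
    Assumption: cost = distortion + lam * bits (not stated in the source).
    """
    def cost(name: str) -> float:
        distortion, bits = results[name]
        return distortion + lam * bits

    return min(results, key=cost)


if __name__ == "__main__":
    trial = {"no_split": (120.0, 40), "quad": (60.0, 90), "binary_v": (80.0, 60)}
    print(pick_pattern(trial, 1.0))
```

The encoded signal produced with the returned pattern would then be the one output, as described in [0050].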

[0051] As shown in the diagram, these steps Sa_1 to Sa_10 are performed sequentially by the encoding device 100. Alternatively, some of these processes may be performed in parallel, or the order of these processes may be changed.
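Steps Sa_4 to Sa_8 for a single block can be sketched as follows. This is a simplified illustration and not the implementation of the encoding device 100: the transform is replaced by the identity, quantization is a plain division by a step size qstep, the prediction block of step Sa_3 is supplied by the caller, and entropy encoding (Sa_6) and loop filtering (Sa_9) are omitted. The names encode_block and qstep are hypothetical.

```python
from typing import List, Tuple

Block = List[List[int]]


def encode_block(current: Block, prediction: Block,
                 qstep: int) -> Tuple[Block, Block]:
    """Run simplified steps Sa_4, Sa_5, Sa_7 and Sa_8 on one block.

    Assumptions: identity transform, uniform scalar quantization.
    Returns (coefficient block, reconstructed block).
    """
    h, w = len(current), len(current[0])
    # Sa_4: prediction residual (difference block)
    residual = [[current[y][x] - prediction[y][x] for x in range(w)]
                for y in range(h)]
    # Sa_5: quantization (the transform is the identity in this sketch)
    coeffs = [[round(r / qstep) for r in row] for row in residual]
    # Sa_7: inverse quantization restores an approximate residual
    restored = [[c * qstep for c in row] for row in coeffs]
    # Sa_8: reconstruction = prediction block + restored difference block
    recon = [[prediction[y][x] + restored[y][x] for x in range(w)]
             for y in range(h)]
    return coeffs, recon


if __name__ == "__main__":
    cur = [[101, 105], [97, 90]]
    pred = [[98, 98], [98, 98]]
    print(encode_block(cur, pred, 4))
```

The reconstructed block differs from the input block only by the quantization error, which is why the encoder keeps the reconstructed image, rather than the input, as the reference for later prediction.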

[0052] [Splitting unit] The splitting unit 102 divides each picture contained in the input video into multiple blocks and outputs each block to the subtraction unit 104. For example, the splitting unit 102 first divides the picture into blocks of a fixed size (e.g., 128x128). Other fixed block sizes may be used. These fixed-size blocks are sometimes called coding tree units (CTUs). Then, the splitting unit 102 divides each of the fixed-size blocks into blocks of a variable size (e.g., 64x64 or less) based on, for example, recursive quadtree and / or binary tree block partitioning. In other words, the splitting unit 102 selects a partitioning pattern. These variable-size blocks are sometimes called coding units (CUs), prediction units (PUs), or transform units (TUs). Note that in various processing examples, CUs, PUs, and TUs do not need to be distinguished, and some or all of the blocks in the picture may serve as the processing units for CUs, PUs, and TUs.

[0053] Figure 3 is a conceptual diagram showing an example of block partitioning in the embodiment. In Figure 3, solid lines represent block boundaries due to quadtree block partitioning, and dashed lines represent block boundaries due to binary tree block partitioning.

[0054] Here, block 10 is a 128x128 pixel square block (128x128 block). This 128x128 block 10 is first divided into four 64x64 square blocks (quadtree block partitioning).

[0055] The top-left 64x64 block is further divided vertically into two rectangular 32x64 blocks, and the left 32x64 block is further divided vertically into two rectangular 16x64 blocks (binary tree block partitioning). As a result, the top-left 64x64 block is divided into two 16x64 blocks 11 and 12 and a 32x64 block 13.

[0056] The 64x64 block in the upper right is horizontally divided into two rectangular 64x32 blocks, 14 and 15 (binary tree block division).

[0057] The bottom-left 64x64 block is divided into four square 32x32 blocks (quadtree block partitioning). Of the four 32x32 blocks, the top-left and bottom-right blocks are further divided. The top-left 32x32 block is vertically divided into two rectangular 16x32 blocks, and the right 16x32 block is further horizontally divided into two 16x16 blocks (binary tree block partitioning). The bottom-right 32x32 block is horizontally divided into two 32x16 blocks (binary tree block partitioning). As a result, the bottom-left 64x64 block is divided into 16x32 block 16, two 16x16 blocks 17 and 18, two 32x32 blocks 19 and 20, and two 32x16 blocks 21 and 22.

[0058] The bottom-right 64x64 block 23 is not divided.

[0059] As described above, in Figure 3, block 10 is divided into 13 variable-sized blocks 11-23 based on recursive quad-tree and binary tree block partitioning. Such partitioning is sometimes called QTBT (quad-tree plus binary tree) partitioning.
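The partitioning of block 10 described above can be reproduced with a small recursive sketch. The tree encoding ("Q", "V", "H") and the function name `leaves` are hypothetical, chosen only for illustration; they are not part of the embodiment.

```python
# Illustrative sketch only: the tree encoding and names are assumptions.
LEAF = None  # a block that is not divided any further


def leaves(node, x, y, w, h):
    """Return (x, y, width, height) for every leaf block under node."""
    if node is LEAF:
        return [(x, y, w, h)]
    kind, children = node
    if kind == "Q":    # quadtree split: four quadrants in raster order
        hw, hh = w // 2, h // 2
        parts = [(x, y, hw, hh), (x + hw, y, hw, hh),
                 (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    elif kind == "V":  # binary split with a vertical boundary (left, right)
        hw = w // 2
        parts = [(x, y, hw, h), (x + hw, y, hw, h)]
    elif kind == "H":  # binary split with a horizontal boundary (top, bottom)
        hh = h // 2
        parts = [(x, y, w, hh), (x, y + hh, w, hh)]
    else:
        raise ValueError(f"unknown split kind: {kind}")
    out = []
    for child, part in zip(children, parts):
        out.extend(leaves(child, *part))
    return out


# Partition tree following the description of Figure 3.
FIG3_TREE = ("Q", [
    ("V", [("V", [LEAF, LEAF]), LEAF]),            # top-left: blocks 11, 12, 13
    ("H", [LEAF, LEAF]),                           # top-right: blocks 14, 15
    ("Q", [("V", [LEAF, ("H", [LEAF, LEAF])]),     # bottom-left: blocks 16, 17, 18
           LEAF, LEAF,                             # blocks 19, 20
           ("H", [LEAF, LEAF])]),                  # blocks 21, 22
    LEAF,                                          # bottom-right: block 23
])

blocks = leaves(FIG3_TREE, 0, 0, 128, 128)
sizes = [(w, h) for _, _, w, h in blocks]
print(len(blocks), sizes)
```

The leaf blocks cover the 128x128 block exactly once, which is a convenient sanity check for any partitioning pattern.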

[0060] In Figure 3, one block is divided into four or two blocks (quadtree or binary tree block partitioning), but partitioning is not limited to these. For example, one block may be divided into three blocks (ternary tree block partitioning). Partitioning that includes such ternary tree block partitioning is sometimes called MBT (multi-type tree) partitioning.

[0061] [Picture composition: slice / tile] To decode pictures in parallel, a picture may be composed of slice units or tile units. A picture consisting of slice units or tile units may be configured by the splitting unit 102.

[0062] A slice is the basic coding unit that makes up a picture. A picture is composed of, for example, one or more slices. A slice consists of one or more consecutive Coding Tree Units (CTUs).

[0063] Figure 4A is a conceptual diagram showing an example of slice configuration. For example, a picture contains 11 × 8 CTUs and is divided into four slices (slices 1-4). Slice 1 consists of 16 CTUs, slice 2 consists of 21 CTUs, slice 3 consists of 29 CTUs, and slice 4 consists of 22 CTUs. Here, each CTU in the picture belongs to one of the slices. Each slice has a shape obtained by dividing the picture horizontally. A slice boundary does not have to be at the edge of the screen and may be at any CTU boundary within the screen. The processing order (encoding order or decoding order) of the CTUs in a slice is, for example, the raster scan order. A slice also contains header information and encoded data. The header information may describe the characteristics of the slice, such as the CTU address at the beginning of the slice and the slice type.
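As a rough illustration of the slice example, the following sketch derives the first CTU address of each slice from the CTU counts given above, assuming that CTU addresses are raster-scan indices starting at 0. The function names are hypothetical.

```python
# Sketch under the assumption that CTU addresses are raster-scan indices from 0.
PIC_WIDTH_CTUS, PIC_HEIGHT_CTUS = 11, 8
SLICE_SIZES = [16, 21, 29, 22]  # CTUs in slices 1-4 of the Figure 4A example


def slice_start_addresses(sizes):
    """First CTU address of each slice when slices hold consecutive CTUs."""
    starts, addr = [], 0
    for n in sizes:
        starts.append(addr)
        addr += n
    return starts


def ctu_position(addr, width=PIC_WIDTH_CTUS):
    """Convert a raster-scan CTU address to (row, column)."""
    return divmod(addr, width)


starts = slice_start_addresses(SLICE_SIZES)
for i, s in enumerate(starts, 1):
    row, col = ctu_position(s)
    print(f"slice {i}: first CTU address {s} (row {row}, column {col})")
```

Slice 2, for instance, begins in the middle of a CTU row, which matches the statement that a slice boundary need not lie at the edge of the screen.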

[0064] A tile is a rectangular area that makes up a picture. Each tile may be assigned a number called a TileId in the order of the raster scan.

[0065] Figure 4B is a conceptual diagram showing an example of tile configuration. For example, a picture contains 11 × 8 CTUs and is divided into four rectangular tiles (tiles 1-4). When tiles are used, the processing order of CTUs is changed compared to when tiles are not used. When tiles are not used, multiple CTUs in a picture are processed in raster scan order. When tiles are used, in each of the multiple tiles, at least one CTU is processed in raster scan order. For example, as shown in Figure 4B, the processing order of multiple CTUs contained in tile 1 is from the left end of the first row of tile 1 to the right end of the first row of tile 1, and then from the left end of the second row of tile 1 to the right end of the second row of tile 1.
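The change in CTU processing order caused by tiles can be sketched as follows. The tile column widths and row heights used here (6 and 5 CTUs wide, 4 and 4 CTUs high) are assumptions made for illustration; the actual layout of Figure 4B is not stated in the text.

```python
def tile_scan_order(col_widths, row_heights):
    """List raster-scan CTU addresses in tile-by-tile processing order."""
    pic_w = sum(col_widths)
    order = []
    y0 = 0
    for th in row_heights:                  # tiles are visited in raster order
        x0 = 0
        for tw in col_widths:
            for y in range(y0, y0 + th):    # raster scan inside one tile
                for x in range(x0, x0 + tw):
                    order.append(y * pic_w + x)
            x0 += tw
        y0 += th
    return order


# Assumed layout: an 11 x 8 CTU picture split into 2 x 2 tiles.
order = tile_scan_order([6, 5], [4, 4])
print(order[:12])
```

Every CTU is still processed exactly once; only the order differs from a plain raster scan of the whole picture.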

[0066] Note that one tile may contain one or more slices, and one slice may contain one or more tiles.

[0067] [Subtraction Unit] The subtraction unit 104 receives the original signal (original samples) from the splitting unit 102 in units of the blocks produced by the splitting unit 102, and subtracts from it the prediction signal (prediction samples input from the prediction control unit 128, described below). In other words, the subtraction unit 104 calculates the prediction error (also called the residual) of the block to be encoded (hereinafter referred to as the current block). The subtraction unit 104 then outputs the calculated prediction error (residual) to the transformation unit 106.

[0068] The original signal is the input signal to the encoding device 100 and represents the image of each picture that makes up the video (for example, a luminance (luma) signal and two chrominance (chroma) signals). In the following, a signal representing an image may also be referred to as a sample.

[0069] [Transformation Unit] The transformation unit 106 transforms the prediction error in the spatial domain into transformation coefficients in the frequency domain and outputs the transformation coefficients to the quantization unit 108. Specifically, the transformation unit 106 applies, for example, a predetermined discrete cosine transform (DCT) or discrete sine transform (DST) to the prediction error in the spatial domain. The DCT or DST to be used may be defined in advance.

[0070] The transformation unit 106 may also adaptively select a transformation type from among several transformation types and use a transformation basis function corresponding to the selected transformation type to convert the prediction error into transformation coefficients. Such a transformation is sometimes called an EMT (explicit multiple core transform) or an AMT (adaptive multiple transform).

[0071] Multiple transformation types include, for example, DCT-II, DCT-V, DCT-VIII, DST-I, and DST-VII. Figure 5A is a table showing transformation basis functions corresponding to example transformation types. In Figure 5A, N represents the number of input pixels. The selection of a transformation type from among these multiple transformation types may depend, for example, on the type of prediction (intra-prediction and inter-prediction) or on the intra-prediction mode.

[0072] Information indicating whether or not to apply EMT or AMT (e.g., called an EMT flag or AMT flag) and information indicating the selected transformation type are typically signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).

[0073] Furthermore, the transformation unit 106 may retransform the transformation coefficients (transformation results). Such retransformation is sometimes called AST (adaptive secondary transform) or NSST (non-separable secondary transform). For example, the transformation unit 106 performs retransformation for each subblock (e.g., 4x4 subblock) contained in the block of transformation coefficients corresponding to the intra-prediction error. Information indicating whether or not to apply NSST and information regarding the transformation matrix used for NSST are usually signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).

[0074] The transformation unit 106 may apply either a separable transformation or a non-separable transformation. A separable transformation is a method in which the input is separated by direction, once for each of its dimensions, and transformed multiple times, while a non-separable transformation is a method in which, when the input is multidimensional, two or more dimensions are treated together as one dimension and transformed at once.

[0075] For example, one example of a non-separable transformation is to treat a 4x4 block as a single array with 16 elements and then perform the transformation on that array using a 16x16 transformation matrix.
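The relationship between the two kinds of transformation can be illustrated with a minimal sketch. It assumes an orthonormal 4-point DCT-II as the one-dimensional transform and row-major flattening of the 4x4 block; all function names are hypothetical. Under these assumptions, a separable transformation is the special case of a non-separable one whose 16x16 matrix is a Kronecker product, whereas a general non-separable transformation may use any 16x16 matrix.

```python
import math

N = 4


def dct2_matrix(n):
    """Orthonormal DCT-II matrix: row i holds the i-th basis function."""
    rows = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * i * (2 * j + 1) / (2 * n))
                     for j in range(n)])
    return rows


def separable_transform(block, a):
    """Transform along one direction, then the other: A * X * A^T."""
    n = len(a)
    tmp = [[sum(a[i][k] * block[k][l] for k in range(n)) for l in range(n)]
           for i in range(n)]
    return [[sum(tmp[i][l] * a[j][l] for l in range(n)) for j in range(n)]
            for i in range(n)]


def kron(a, b):
    """Kronecker product of two n x n matrices, giving an (n*n) x (n*n) matrix."""
    n = len(a)
    return [[a[i][k] * b[j][l] for k in range(n) for l in range(n)]
            for i in range(n) for j in range(n)]


def non_separable_transform(block, m):
    """Treat the block as one 16-element array and multiply by a 16x16 matrix."""
    vec = [v for row in block for v in row]
    return [sum(m_row[c] * vec[c] for c in range(len(vec))) for m_row in m]


residual = [[5, 3, 0, -1], [4, 2, 0, 0], [1, 0, -1, 0], [0, 0, 0, 1]]
A = dct2_matrix(N)
sep = separable_transform(residual, A)
nonsep = non_separable_transform(residual, kron(A, A))
flat_sep = [v for row in sep for v in row]
max_diff = max(abs(x - y) for x, y in zip(flat_sep, nonsep))
print(max_diff)
```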

[0076] Another example of a non-separable transformation is a transformation (Hypercube Givens Transform) in which a 4x4 input block is treated as a single array with 16 elements, and then multiple Givens rotations are performed on that array.

[0077] In the conversion unit 106, the type of basis for conversion to the frequency domain can be switched depending on the region within the CU. One example is SVT (Spatially Varying Transform). In SVT, as shown in Figure 5B, the CU is divided into two equal parts horizontally or vertically, and only one of the regions is converted to the frequency domain. The type of conversion basis can be set for each region; for example, DST7 and DCT8 are used. In this example, only one of the two regions within the CU is converted, and the other is not, but both regions may also be converted. Furthermore, the division method can be made more flexible, not only by dividing into two equal parts, but also by dividing into four equal parts, or by separately encoding information indicating the division and signaling it in the same way as CU division. Note that SVT is sometimes called SBT (Sub-block Transform).

[0078] [Quantization Unit] The quantization unit 108 quantizes the transformation coefficients output from the transformation unit 106. Specifically, the quantization unit 108 scans the transformation coefficients of the current block in a predetermined scan order and quantizes the transformation coefficients based on the quantization parameter (QP) corresponding to the scanned transformation coefficients. The quantization unit 108 then outputs the quantized transformation coefficients of the current block (hereinafter referred to as quantization coefficients) to the entropy coding unit 110 and the inverse quantization unit 112. The predetermined scan order may be set in advance.

[0079] The predetermined scan order is the order used for quantization / inverse quantization of the transformation coefficients. For example, the predetermined scan order may be defined as ascending frequency (from low to high frequency) or descending frequency (from high to low frequency).

[0080] The quantization parameter (QP) is a parameter that defines the quantization step (quantization width). For example, if the value of the quantization parameter increases, the quantization step also increases. In other words, if the value of the quantization parameter increases, the quantization error increases.
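The effect of the quantization parameter can be sketched numerically. The mapping from QP to step size used below, in which the step doubles every 6 QP values, is the relation used in H.264 / H.265 and is an assumption here; the text above states only that the step grows with QP. The function names are hypothetical.

```python
def quant_step(qp):
    """Quantization step for a given QP (assumed H.265-style relation)."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeff, qp):
    """Map a transformation coefficient to an integer level."""
    return int(round(coeff / quant_step(qp)))


def dequantize(level, qp):
    """Map an integer level back to an approximate coefficient."""
    return level * quant_step(qp)


def max_error(coeffs, qp):
    """Largest absolute difference after a quantize / dequantize round trip."""
    return max(abs(c - dequantize(quantize(c, qp), qp)) for c in coeffs)


coeffs = [103.0, -47.5, 12.25, -3.0, 0.75]
for qp in (10, 22, 34):
    print(f"QP={qp} step={quant_step(qp):.2f} max error={max_error(coeffs, qp):.3f}")
```

With this uniform quantizer, the round-trip error of each coefficient is bounded by half the step, so a larger QP permits a larger quantization error.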

[0081] Furthermore, quantization matrices may be used for quantization. For example, several types of quantization matrices may be used corresponding to frequency transformation sizes such as 4x4 and 8x8, prediction modes such as intra-prediction and inter-prediction, and pixel components such as luminance and chrominance. Note that quantization refers to the process of digitizing values sampled at predetermined intervals by associating them with predetermined levels. In this technical field, quantization may also be referred to by other expressions such as rounding or scaling, and rounding or scaling may be employed to perform it. The predetermined intervals and levels may be defined in advance.

[0082] There are two methods for using quantization matrices: using a quantization matrix directly set on the encoding device, and using a default quantization matrix (default matrix). By directly setting the quantization matrix on the encoding device, it is possible to set a quantization matrix that corresponds to the image features. However, in this case, there is a disadvantage that the amount of code increases due to the encoding of the quantization matrix.

[0083] On the other hand, there is also a method that does not use a quantization matrix, and quantizes both the high-frequency and low-frequency components in the same way. This method is equivalent to using a quantization matrix where all coefficients have the same value (a flat matrix).

[0084] The quantization matrix may be specified, for example, in an SPS (Sequence Parameter Set) or a PPS (Picture Parameter Set). An SPS contains the parameters used for a sequence, and a PPS contains the parameters used for a picture. SPS and PPS are sometimes simply referred to as parameter sets.

[0085] [Entropy coding unit] The entropy coding unit 110 generates an encoded signal (encoded bitstream) based on the quantization coefficients input from the quantization unit 108. Specifically, the entropy coding unit 110, for example, binarizes the quantization coefficients, arithmetically encodes the binary signal, and outputs a compressed bitstream or sequence.

[0086] [Dequantization section] The inverse quantization unit 112 inversely quantizes the quantization coefficients input from the quantization unit 108. Specifically, the inverse quantization unit 112 inversely quantizes the quantization coefficients of the current block in a predetermined scanning order. Then, the inverse quantization unit 112 outputs the inversely quantized conversion coefficients of the current block to the inverse conversion unit 114. The predetermined scanning order may be set in advance.

[0087] [Inverse Transformation Unit] The inverse transform unit 114 restores the prediction error (residual) by performing an inverse transform on the transformation coefficients input from the inverse quantization unit 112. Specifically, the inverse transform unit 114 restores the prediction error of the current block by applying, to the transformation coefficients, the inverse transform corresponding to the transformation performed by the transformation unit 106. The inverse transform unit 114 then outputs the restored prediction error to the adder unit 116.

[0088] Furthermore, the recovered prediction error usually does not match the prediction error calculated by the subtraction unit 104 because information is typically lost due to quantization. In other words, the recovered prediction error usually includes quantization errors.

[0089] [Addition section] The adder 116 reconstructs the current block by adding the prediction error input from the inverse transformer 114 and the prediction sample input from the prediction control unit 128. The adder 116 then outputs the reconstructed block to the block memory 118 and the loop filter unit 120. The reconstructed block is sometimes called the local decoded block.

[0090] [Block memory] The block memory 118 is a storage unit for storing, for example, blocks that are referenced in intra prediction and that belong to the picture to be encoded (referred to as the current picture). Specifically, the block memory 118 stores the reconstructed blocks output from the adder 116.

[0091] [Frame memory] The frame memory 122 is a storage unit for storing, for example, reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 122 stores the reconstructed blocks filtered by the loop filter unit 120.

[0092] [Loop Filter Section] The loop filter unit 120 applies a loop filter to the block reconstructed by the adder unit 116 and outputs the filtered reconstructed block to the frame memory 122. A loop filter is a filter used within the encoding loop (in-loop filter), and includes, for example, a deblocking filter (DF or DBF), sample adaptive offset (SAO), and adaptive loop filter (ALF).

[0093] In ALF, a least-squares error filter is applied to remove coding distortion. For example, for each 2x2 subblock within the current block, one filter selected from several filters is applied based on the direction and activity of the local gradient.

[0094] Specifically, first, subblocks (e.g., 2x2 subblocks) are classified into multiple classes (e.g., 15 or 25 classes). The classification of subblocks is based on the direction and activity of the gradient. For example, a classification value C (e.g., C = 5D + A) is calculated using the gradient direction value D (e.g., 0-2 or 0-4) and the gradient activity value A (e.g., 0-4). Then, based on the classification value C, the subblocks are classified into multiple classes.

[0095] The gradient direction value D is derived, for example, by comparing gradients in multiple directions (e.g., horizontal, vertical, and two diagonal directions). The gradient activity value A is derived, for example, by adding the gradients in multiple directions and quantizing the sum.

[0096] Based on the results of this classification, a filter for the subblock is determined from among multiple filters.
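The classification step can be sketched in a few lines. The sketch assumes that one filter is associated with each class and that D and A have already been derived; the function names and the placeholder filter list are hypothetical.

```python
def alf_class(direction, activity, num_directions=5):
    """Classification value C = 5D + A for one subblock."""
    if not 0 <= direction < num_directions:
        raise ValueError("gradient direction value D out of range")
    if not 0 <= activity <= 4:
        raise ValueError("gradient activity value A out of range")
    return 5 * direction + activity


def select_filter(filters, direction, activity):
    """Pick the filter for a subblock (assumes one filter per class)."""
    return filters[alf_class(direction, activity)]


# Placeholder names standing in for 25 coefficient sets.
filters = [f"filter_{c}" for c in range(25)]
print(select_filter(filters, 2, 3))
```

With D in the range 0-4 the formula yields 25 distinct classes, and with D in the range 0-2 it yields 15, matching the class counts mentioned above.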

[0097] For example, a circularly symmetric shape is used as the filter shape in ALF. Figures 6A to 6C show several examples of filter shapes used in ALF. Figure 6A shows a 5x5 diamond-shaped filter, Figure 6B shows a 7x7 diamond-shaped filter, and Figure 6C shows a 9x9 diamond-shaped filter. Information indicating the filter shape is usually signaled at the picture level. However, the signaling of information indicating the filter shape is not limited to the picture level and may be at other levels (e.g., sequence level, slice level, tile level, CTU level, or CU level).

[0098] The on / off status of ALF may be determined, for example, at the picture level or CU level. For example, the decision to apply ALF to luminance may be made at the CU level, and the decision to apply ALF to color difference may be made at the picture level. Information indicating whether ALF is on or off is usually signaled at the picture level or CU level. However, the signaling of information indicating whether ALF is on or off is not limited to the picture level or CU level, but may be at other levels (e.g., sequence level, slice level, tile level, or CTU level).

[0099] The coefficient sets for multiple selectable filters (e.g., up to 15 or 25 filters) are typically signaled at the picture level. However, signaling of the coefficient sets is not limited to the picture level; it may be at other levels (e.g., sequence level, slice level, tile level, CTU level, CU level, or subblock level).

[0100] [Loop Filter Section > Deblocking Filter] In a deblocking filter, the loop filter section 120 reduces distortion at block boundaries by applying a filter to the block boundaries of the reconstructed image.

[0101] Figure 7 is a block diagram showing an example of a detailed configuration of the loop filter section 120, which functions as a deblocking filter.

[0102] The loop filter unit 120 includes a boundary determination unit 1201, a filter determination unit 1203, a filter processing unit 1205, a processing determination unit 1208, a filter characteristic determination unit 1207, and switches 1202, 1204, and 1206.

[0103] The boundary determination unit 1201 determines whether or not a pixel to be deblock-filtered (i.e., a target pixel) is located near a block boundary. The boundary determination unit 1201 then outputs the determination result to the switch 1202 and the processing determination unit 1208.

[0104] If the boundary determination unit 1201 determines that the target pixel is located near a block boundary, switch 1202 outputs the image before filtering to switch 1204. Conversely, if the boundary determination unit 1201 determines that the target pixel is not located near a block boundary, switch 1202 outputs the image before filtering to switch 1206.

[0105] The filter determination unit 1203 determines whether or not to perform deblocking filtering on the target pixel based on the pixel values of at least one surrounding pixel located around the target pixel. The filter determination unit 1203 then outputs the determination result to the switch 1204 and the processing determination unit 1208.

[0106] If the filter determination unit 1203 determines that deblocking filtering should be performed on the target pixel, switch 1204 outputs the pre-filtered image acquired via switch 1202 to the filter processing unit 1205. Conversely, if the filter determination unit 1203 determines that deblocking filtering should not be performed on the target pixel, switch 1204 outputs the pre-filtered image acquired via switch 1202 to switch 1206.

[0107] When the filter processing unit 1205 acquires an image before filtering via switches 1202 and 1204, it performs a deblocking filter process on the target pixel, using the filter characteristics determined by the filter characteristic determination unit 1207. The filter processing unit 1205 then outputs the filtered pixel to switch 1206.

[0108] Switch 1206 selectively outputs pixels that have not undergone deblocking filtering and pixels that have undergone deblocking filtering by the filter processing unit 1205, in accordance with the control by the processing determination unit 1208.

[0109] The processing determination unit 1208 controls the switch 1206 based on the determination results of the boundary determination unit 1201 and the filter determination unit 1203. Specifically, if the boundary determination unit 1201 determines that a target pixel is near a block boundary, and the filter determination unit 1203 determines that the target pixel should undergo deblocking filtering, the processing determination unit 1208 causes the switch 1206 to output the deblock-filtered pixel. In all other cases, the processing determination unit 1208 causes the switch 1206 to output the pixel that has not been deblock-filtered. This output of pixels is repeated, so that the filtered image is output from the switch 1206.
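The routing performed by the switches can be summarized as a small decision function. This is only a sketch of the control flow; the function name and the stand-in filter `halve` are hypothetical and do not represent an actual deblocking filter.

```python
def deblock_output(pixel, near_boundary, should_filter, apply_filter):
    """Mirror the routing performed by switches 1202, 1204, and 1206."""
    if not near_boundary:        # switch 1202 routes the pixel straight to switch 1206
        return pixel
    if not should_filter:        # switch 1204 routes the pixel straight to switch 1206
        return pixel
    return apply_filter(pixel)   # filter processing unit 1205, then switch 1206


def halve(pixel):
    """Stand-in for the real filter so that the routing is visible."""
    return pixel // 2


print(deblock_output(100, True, True, halve))
print(deblock_output(100, True, False, halve))
print(deblock_output(100, False, True, halve))
```

Only the combination "near a block boundary" and "filtering required" yields a filtered pixel; every other combination passes the input through unchanged.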

[0110] Figure 8 is a conceptual diagram showing an example of a deblocking filter with symmetrical filter characteristics with respect to block boundaries.

[0111] In deblocking filtering, for example, one of two deblocking filters with different characteristics, namely a strong filter and a weak filter, is selected using pixel values and quantization parameters. In the strong filter, as shown in Figure 8, if pixels p0 to p2 and pixels q0 to q2 exist on either side of a block boundary, the respective pixel values of pixels q0 to q2 are changed to pixel values q'0 to q'2 by performing, for example, the operations shown in the following equations.

[0112]
q'0 = (p1 + 2×p0 + 2×q0 + 2×q1 + q2 + 4) / 8
q'1 = (p0 + q0 + q1 + q2 + 2) / 4
q'2 = (p0 + q0 + q1 + 3×q2 + 2×q3 + 4) / 8

[0113] In the above equations, p0 to p2 and q0 to q2 are the pixel values of pixels p0 to p2 and pixels q0 to q2, respectively. Also, q3 is the pixel value of pixel q3, which is adjacent to pixel q2 on the side opposite the block boundary. Furthermore, on the right-hand side of each of the above equations, the coefficient by which the pixel value of each pixel used in the deblocking filter process is multiplied is a filter coefficient.

[0114] Furthermore, in the deblocking filter process, clipping may be performed so that the calculated pixel values do not change by more than a threshold. In this clipping process, the pixel values calculated by the above equations are clipped to the range "pre-calculation pixel value ± 2 × threshold", using a threshold determined from the quantization parameter. This prevents excessive smoothing.
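The strong filter and the clipping step can be sketched as follows. The sketch makes three assumptions that the text does not state: the division is integer (floor) division, the clipping range is centered on the pixel value before filtering, and the threshold is passed in directly as a hypothetical parameter `tc` rather than derived from the quantization parameter.

```python
def strong_filter_q(p, q, tc):
    """Filter q0..q2 on one side of a block boundary.

    p = [p0, p1, p2], q = [q0, q1, q2, q3]; tc is the clipping threshold.
    """
    p0, p1 = p[0], p[1]
    q0, q1, q2, q3 = q
    raw = [
        (p1 + 2 * p0 + 2 * q0 + 2 * q1 + q2 + 4) // 8,   # q'0
        (p0 + q0 + q1 + q2 + 2) // 4,                     # q'1
        (p0 + q0 + q1 + 3 * q2 + 2 * q3 + 4) // 8,        # q'2
    ]
    # Clip each result to [original - 2*tc, original + 2*tc].
    return [max(orig - 2 * tc, min(orig + 2 * tc, val))
            for orig, val in zip((q0, q1, q2), raw)]


# A step edge of height 40 across the boundary.
print(strong_filter_q([60, 60, 60], [100, 100, 100, 100], tc=10))
print(strong_filter_q([60, 60, 60], [100, 100, 100, 100], tc=3))
```

A small threshold limits how far each pixel may move, so a genuine edge in the image is smoothed less aggressively than a blocking artifact of similar size would be with a large threshold.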

[0115] Figure 9 is a conceptual diagram illustrating the block boundaries where deblocking filtering is performed. Figure 10 is a conceptual diagram showing an example of a Bs value.

[0116] The block boundaries on which deblocking filtering is performed are, for example, the boundaries of the PU (Prediction Unit) or TU (Transform Unit) of an 8x8 pixel block, as shown in Figure 9. Deblocking filtering can be performed in units of 4 rows or 4 columns. First, for blocks P and Q shown in Figure 9, the Bs (Boundary Strength) value is determined as shown in Figure 10.

[0117] According to the Bs value in Figure 10, it is determined whether block boundaries belonging to the same image are deblock-filtered with different strengths. Deblocking filtering is performed on the chrominance signal when the Bs value is 2. Deblocking filtering is performed on the luminance signal when the Bs value is 1 or greater and predetermined conditions are met. These predetermined conditions may be set in advance. Note that the criteria for determining the Bs value are not limited to those shown in Figure 10 and may be determined based on other parameters.

[0118] [Prediction Processing Unit (Intra Prediction Unit, Inter Prediction Unit, Prediction Control Unit)] Figure 11 is a flowchart showing an example of processing performed in the prediction processing unit of the encoding device 100. The prediction processing unit consists of all or some of the components of the intra-prediction unit 124, the inter-prediction unit 126, and the prediction control unit 128.

[0119] The prediction processing unit generates a predicted image of the current block (step Sb_1). This predicted image is also called a predicted signal or predicted block. Predicted signals include, for example, intra-predicted signals and inter-predicted signals. Specifically, the prediction processing unit generates the predicted image of the current block using a reconstructed image that has already been obtained through generation of a predicted block, generation of a difference block, generation of a coefficient block, restoration of the difference block, and generation of a decoded image block.

[0120] The reconstructed image may be, for example, the image of the reference picture, or it may be the image of the encoded block within the current picture, which is the picture containing the current block. The encoded block within the current picture may be, for example, the adjacent block of the current block.

[0121] Figure 12 is a flowchart showing another example of processing performed in the prediction processing unit of the encoding device 100.

[0122] The prediction processing unit generates a predicted image using a first method (step Sc_1a), a second method (step Sc_1b), and a third method (step Sc_1c). The first, second, and third methods are different methods for generating predicted images, and may be, for example, an inter-prediction method, an intra-prediction method, and another prediction method, respectively. These prediction methods may use the reconstructed images described above.

[0123] Next, the prediction processing unit selects one of the multiple prediction images generated in steps Sc_1a, Sc_1b, and Sc_1c (step Sc_2). This selection of a prediction image, i.e., the selection of a method or mode to obtain the final prediction image, may be based on the cost of each generated prediction image. Alternatively, the selection of the prediction image may be based on the parameters used in the encoding process. The encoding device 100 may signal, in the encoded signal (also called an encoded bitstream), information identifying the selected prediction image, method, or mode. This information may be, for example, a flag. This allows the decoding device to generate a prediction image according to the method or mode selected by the encoding device 100 based on this information. In the example shown in Figure 12, the prediction processing unit selects one of the prediction images after generating a prediction image with each method. However, the prediction processing unit may select a method or mode based on the parameters used in the encoding process described above before generating those prediction images, and then generate the prediction image according to that method or mode.

[0124] For example, the first and second methods are intra-prediction and inter-prediction, respectively, and the prediction processing unit may select the final predicted image for the current block from the predicted images generated according to these prediction methods.

[0125] Figure 13 is a flowchart showing another example of processing performed in the prediction processing unit of the encoding device 100.

[0126] First, the prediction processing unit generates a predicted image using intra-prediction (step Sd_1a), and then generates a predicted image using inter-prediction (step Sd_1b). The predicted image generated by intra-prediction is also called the intra-prediction image, and the predicted image generated by inter-prediction is also called the inter-prediction image.

[0127] Next, the prediction processing unit evaluates the intra-predicted image and the inter-predicted image (step Sd_2). Cost may be used in this evaluation. That is, the prediction processing unit calculates the cost C of the intra-predicted image and the inter-predicted image. This cost C can be calculated using the formula of the RD optimization model, for example, C = D + λ × R. In this formula, D is the coding distortion of the predicted image, which can be expressed as, for example, the sum of the absolute differences between the pixel values of the current block and the pixel values of the predicted image. R is the generated code amount of the predicted image, which specifically is the code amount required to encode motion information and other data for generating the predicted image. λ is, for example, a Lagrange multiplier.

[0128] The prediction processing unit then selects the prediction image with the smallest cost C from the intra-predicted image and inter-predicted image as the final prediction image for the current block (step Sd_3). In other words, a prediction method or mode for generating the prediction image for the current block is selected.
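Steps Sd_2 and Sd_3 can be sketched as a cost comparison. The sketch assumes that D is measured as the sum of absolute differences and that the code amount R of each candidate is already known; the sample values, rates, and function names are hypothetical.

```python
def sad(block, pred):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block, pred)
               for a, b in zip(row_a, row_b))


def rd_cost(block, pred, rate_bits, lam):
    """Cost C = D + lambda * R, with D measured as SAD."""
    return sad(block, pred) + lam * rate_bits


def select_prediction(block, candidates, lam):
    """candidates: list of (name, predicted_block, rate_bits); return the cheapest name."""
    return min(candidates, key=lambda c: rd_cost(block, c[1], c[2], lam))[0]


current = [[10, 12], [14, 16]]
candidates = [
    ("intra", [[10, 10], [14, 14]], 6),    # less accurate, cheap to signal
    ("inter", [[10, 12], [14, 15]], 20),   # more accurate, expensive to signal
]
print(select_prediction(current, candidates, lam=0.1))
print(select_prediction(current, candidates, lam=1.0))
```

The same candidates can lead to different choices depending on λ: a small λ favors low distortion, while a large λ favors a small code amount.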

[0129] [Intra Prediction Unit] The intra-prediction unit 124 generates a prediction signal (intra-prediction signal) by performing intra-prediction (also called in-screen prediction) of the current block by referring to the block in the current picture stored in the block memory 118. Specifically, the intra-prediction unit 124 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, color difference values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 128.

[0130] For example, the intra-prediction unit 124 performs intra-prediction using one of a predetermined set of intra-prediction modes. The set of intra-prediction modes typically includes one or more non-directional prediction modes and multiple directional prediction modes. The set of modes may be defined in advance.

[0131] One or more non-directional prediction modes include, for example, the Planar prediction mode and DC prediction mode as defined in the H.265 / HEVC standard.

[0132] Multiple directional prediction modes include, for example, the 33 directional prediction modes defined in the H.265 / HEVC standard. Note that multiple directional prediction modes may also include 32 additional directional prediction modes (a total of 65 directional prediction modes). Figure 14 is a conceptual diagram showing all 67 intra-prediction modes (2 non-directional prediction modes and 65 directional prediction modes) that can be used in intra-prediction. Solid arrows represent the 33 directions defined in the H.265 / HEVC standard, and dashed arrows represent the additional 32 directions (the 2 non-directional prediction modes are not shown in Figure 14).

[0133] In various processing examples, luminance blocks may be referenced in the intra-prediction of chrominance blocks. That is, the chrominance components of the current block may be predicted based on the luminance components of the current block. Such intra-prediction is sometimes called CCLM (cross-component linear model) prediction. Such an intra-prediction mode for chrominance blocks that references luminance blocks (e.g., called CCLM mode) may be added as one of the intra-prediction modes for chrominance blocks.

[0134] The intra-prediction unit 124 may correct the pixel values ​​after intra-prediction based on the gradient of the horizontal / vertical reference pixels. Intra-prediction with such correction is sometimes called PDPC (position dependent intra-prediction combination). Information indicating whether or not PDPC is applied (for example, called a PDPC flag) is usually signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (for example, sequence level, picture level, slice level, tile level, or CTU level).

[0135] [Inter-Prediction Unit] The inter-prediction unit 126 generates a prediction signal (inter-prediction signal) by performing inter-prediction (also called inter-screen prediction) of the current block with reference to a reference picture that is stored in the frame memory 122 and is different from the current picture. Inter-prediction is performed in units of the current block or a current sub-block within the current block (e.g., a 4x4 block). For example, the inter-prediction unit 126 performs motion estimation within the reference picture for the current block or current sub-block and finds the reference block or sub-block that best matches that current block or current sub-block. Then, the inter-prediction unit 126 obtains motion information (e.g., a motion vector) that compensates for the movement or change from the reference block or sub-block to the current block or sub-block. Based on that motion information, the inter-prediction unit 126 performs motion compensation (or motion prediction) and generates an inter-prediction signal for the current block or sub-block. The inter-prediction unit 126 outputs the generated inter-prediction signal to the prediction control unit 128.

[0136] The motion information used for motion compensation may be signaled as an inter-prediction signal in various forms. For example, the motion vector may be signaled. As another example, the difference between the motion vector and the predicted motion vector (motion vector predictor) may be signaled.

[0137] [Basic flow of inter-prediction] Figure 15 is a flowchart illustrating an example of the basic flow of inter-prediction.

[0138] The inter-prediction unit 126 first generates a predicted image (steps Se_1 to Se_3). Next, the subtraction unit 104 generates the difference between the current block and the predicted image as the prediction residual (step Se_4).

[0139] Here, the inter-prediction unit 126 generates the predicted image by determining the motion vector (MV) of the current block (steps Se_1 and Se_2) and performing motion compensation (step Se_3). The inter-prediction unit 126 determines the MV by selecting candidate motion vectors (candidate MVs) (step Se_1) and deriving the MV (step Se_2). Candidate MVs are selected, for example, by choosing at least one candidate MV from a candidate MV list. In deriving the MV, the inter-prediction unit 126 may further select at least one candidate MV from the at least one selected candidate MV and determine it as the MV of the current block. Alternatively, the inter-prediction unit 126 may determine the MV of the current block by searching the region of the reference picture indicated by each of the at least one selected candidate MV. This search of a reference picture region may be called motion estimation.
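The second alternative, searching the reference picture region indicated by each selected candidate MV, can be sketched as a small block-matching search. This is a minimal sketch under assumptions not stated in the text: integer-pel MVs, a sum of absolute differences (SAD) as the matching cost, and a square search window. The names `sad` and `motion_search` are hypothetical.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between a current block and a reference block."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[by + dy][bx + dx] - ref[ry + dy][rx + dx])
    return total


def motion_search(cur, ref, bx, by, size, candidates, search_range=1):
    """Search around each candidate MV and return (best_mv, best_cost).

    cur, ref: 2-D lists of samples; (bx, by): top-left of the current block.
    Positions that fall outside the reference picture are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for cand_x, cand_y in candidates:
        for oy in range(-search_range, search_range + 1):
            for ox in range(-search_range, search_range + 1):
                mvx, mvy = cand_x + ox, cand_y + oy
                rx, ry = bx + mvx, by + mvy
                if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                    continue
                cost = sad(cur, ref, bx, by, rx, ry, size)
                if best_cost is None or cost < best_cost:
                    best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost
```

A practical encoder would typically add a rate term for the MV to the cost and search at fractional-sample precision; both are omitted here for brevity.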

[0140] Furthermore, in the example described above, steps Se_1 to Se_3 are performed by the inter-prediction unit 126, but processing such as step Se_1 or step Se_2 may be performed by other components included in the encoding device 100.

[0141] [Flowchart for deriving motion vectors] Figure 16 is a flowchart showing an example of motion vector derivation.

[0142] The inter-prediction unit 126 derives the MV of the current block in a mode that encodes motion information (e.g., an MV). In this case, for example, the motion information is encoded as a prediction parameter and signaled. That is, the encoded motion information is included in the encoded signal (also called an encoded bitstream).

[0143] Alternatively, the inter-prediction unit 126 derives the MV in a mode that does not encode motion information. In this case, motion information is not included in the encoded signal.

[0144] Here, the modes for MV derivation may include the normal inter mode, merge mode, FRUC mode, and affine mode, as described later. Among these modes, those that encode motion information include the normal inter mode, merge mode, and affine mode (specifically, the affine inter mode and affine merge mode). Note that the motion information may include not only the MV but also the predicted motion vector selection information, as described later. Modes that do not encode motion information include the FRUC mode and others. The inter-prediction unit 126 selects a mode from these multiple modes for deriving the MV of the current block and derives the MV of the current block using the selected mode.

[0145] Figure 17 is a flowchart showing another example of motion vector derivation.

[0146] The inter-prediction unit 126 derives the MV of the current block in a mode that encodes the differential MV. In this case, for example, the differential MV is encoded as a prediction parameter and signaled. That is, the encoded differential MV is included in the encoded signal. This differential MV is the difference between the MV of the current block and its predicted MV.

[0147] Alternatively, the inter-prediction unit 126 derives the MV in a mode that does not encode the differential MV. In this case, the encoded differential MV is not included in the encoded signal.

[0148] As mentioned above, the modes for MV derivation include the normal inter mode, merge mode, FRUC mode, and affine mode, which will be described later. Of these modes, those that encode the differential MV include the normal inter mode and affine mode (specifically, affine inter mode). Modes that do not encode the differential MV include the FRUC mode, merge mode, and affine mode (specifically, affine merge mode). The inter-prediction unit 126 selects a mode from these multiple modes to derive the MV of the current block, and uses the selected mode to derive the MV of the current block.

[0149] [Flowchart for deriving motion vectors] Figure 18 is a flowchart illustrating another example of motion vector derivation. There are several modes for MV derivation, i.e., inter-prediction modes, which can be broadly divided into modes that encode differential MVs and modes that do not. Modes that do not encode differential MVs include merge mode, FRUC mode, and affine mode (specifically, affine merge mode). Details of these modes will be described later, but simply put, merge mode derives the MV of the current block by selecting motion vectors from surrounding encoded blocks, and FRUC mode derives the MV of the current block by performing a search between encoded regions. Affine mode assumes an affine transformation and derives the motion vectors of each of the multiple subblocks that make up the current block as the MV of the current block.

[0150] Specifically, as shown in the figure, the inter-prediction unit 126 derives the motion vector using merge mode (Sf_2) when the inter-prediction mode information indicates 0 (0 in Sf_1), using FRUC mode (Sf_3) when it indicates 1 (1 in Sf_1), and using affine mode (specifically, affine merge mode) (Sf_4) when it indicates 2 (2 in Sf_1). When the inter-prediction mode information indicates 3 (3 in Sf_1), the inter-prediction unit 126 derives the motion vector using a mode that encodes the differential MV (for example, the normal inter mode) (Sf_5).
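The branch at Sf_1 amounts to a lookup from the mode information value to a derivation mode. The sketch below uses the values 0 to 3 given in the text; the string labels and the function names `select_mv_derivation_mode` and `encodes_differential_mv` are hypothetical and chosen only for illustration.

```python
# Mapping taken from the description of Figure 18 (values 0 to 3 at Sf_1).
MODE_BY_INFO = {
    0: "merge",         # Sf_2
    1: "fruc",          # Sf_3
    2: "affine_merge",  # Sf_4
    3: "normal_inter",  # Sf_5: an example of a mode that encodes the differential MV
}

# Modes that encode a differential MV, per the classification in the text.
MODES_WITH_DIFFERENTIAL_MV = {"normal_inter", "affine_inter"}


def select_mv_derivation_mode(inter_pred_mode_info):
    """Return the MV derivation mode label for a mode information value."""
    if inter_pred_mode_info not in MODE_BY_INFO:
        raise ValueError(f"unsupported mode information: {inter_pred_mode_info}")
    return MODE_BY_INFO[inter_pred_mode_info]


def encodes_differential_mv(mode):
    """True if the mode writes a differential MV into the stream."""
    return mode in MODES_WITH_DIFFERENTIAL_MV
```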

[0151] [MV Derivation > Normal Inter Mode] The normal inter mode is an inter-prediction mode that derives the MV of the current block by finding a block similar to the image of the current block within the region of the reference picture indicated by a candidate MV. In the normal inter mode, the differential MV is also encoded.

[0152] Figure 19 is a flowchart showing an example of inter-prediction using the normal inter mode.

[0153] The inter-prediction unit 126 first obtains multiple candidate MVs for the current block based on information such as the MVs of multiple encoded blocks surrounding the current block in time or space (step Sg_1). In other words, the inter-prediction unit 126 creates a candidate MV list.

[0154] Next, the inter-prediction unit 126 extracts N candidate MVs (where N is an integer greater than or equal to 2) from among the multiple candidate MVs obtained in step Sg_1 as predicted motion vector candidates (also called predicted MV candidates) according to a predetermined priority order (step Sg_2). Note that this priority order may be predetermined for each of the N candidate MVs.

[0155] Next, the inter-prediction unit 126 selects one of the N predicted motion vector candidates as the predicted motion vector (also called predicted MV) for the current block (step Sg_3). At this time, the inter-prediction unit 126 encodes, into a stream, the predicted motion vector selection information for identifying the selected predicted motion vector. The stream is the encoded signal or encoded bitstream described above.

[0156] Next, the inter-prediction unit 126 refers to the encoded reference picture and derives the MV of the current block (step Sg_4). At this time, the inter-prediction unit 126 further encodes the difference between the derived MV and the predicted motion vector into the stream as the differential MV. The encoded reference picture is a picture consisting of multiple blocks that have been reconstructed after encoding.

[0157] Finally, the inter-prediction unit 126 generates a predicted image of the current block by performing motion compensation on the current block using the derived MV and the encoded reference picture (step Sg_5). The predicted image is the inter-prediction signal described above.

[0158] Furthermore, information indicating the inter-prediction mode used to generate the predicted image (the normal inter mode in the example above), which is included in the encoded signal, is encoded, for example, as a prediction parameter.
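The signaling in steps Sg_2 to Sg_4 can be sketched as follows. The sketch assumes that the candidate MV list is already sorted by priority, that the encoder picks the predictor closest to the derived MV (the text does not mandate a selection rule), and that MVs are integer pairs. The names `encode_normal_inter` and `decode_normal_inter` are hypothetical.

```python
def encode_normal_inter(candidate_mvs, derived_mv, n=2):
    """Return (selection index, differential MV) for a derived MV.

    candidate_mvs: list of (x, y) tuples in priority order (step Sg_1).
    n: number of predicted MV candidates kept (step Sg_2), at least 2.
    """
    predictors = candidate_mvs[:n]

    def mvd_magnitude(p):
        return abs(derived_mv[0] - p[0]) + abs(derived_mv[1] - p[1])

    # Assumed rule for step Sg_3: choose the predictor with the smallest difference.
    idx = min(range(len(predictors)), key=lambda i: mvd_magnitude(predictors[i]))
    mvp = predictors[idx]
    # Differential MV written to the stream in step Sg_4.
    mvd = (derived_mv[0] - mvp[0], derived_mv[1] - mvp[1])
    return idx, mvd


def decode_normal_inter(candidate_mvs, idx, mvd, n=2):
    """Rebuild the MV from the selection information and the differential MV."""
    mvp = candidate_mvs[:n][idx]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])
```

Because both sides build the same candidate list, only the index and the differential MV need to be carried in the stream.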

[0159] The candidate MV list may be the same as the list used in other modes. Furthermore, processing related to the candidate MV list may be applied to processing related to lists used in other modes. This processing related to the candidate MV list may include, for example, extracting or selecting candidate MVs from the candidate MV list, rearranging candidate MVs, or deleting candidate MVs.

[0160] [MV Derivation > Merge Mode] Merge mode is an inter-prediction mode that derives the MV of the current block by selecting a candidate MV from a candidate MV list.

[0161] Figure 20 is a flowchart showing an example of inter-prediction using merge mode.

[0162] The inter-prediction unit 126 first obtains multiple candidate MVs for the current block based on information such as the MVs of multiple encoded blocks surrounding the current block in time or space (step Sh_1). In other words, the inter-prediction unit 126 creates a candidate MV list.

[0163] Next, the inter-prediction unit 126 derives the MV of the current block by selecting one candidate MV from among the multiple candidate MVs obtained in step Sh_1 (step Sh_2). At this time, the inter-prediction unit 126 encodes, into a stream, MV selection information for identifying the selected candidate MV.

[0164] Finally, the inter-prediction unit 126 generates a predicted image of the current block by performing motion compensation on the current block using the derived MV and the encoded reference picture (step Sh_3).

[0165] Furthermore, information indicating the inter-prediction mode (the merge mode in the example above) used to generate the predicted image, which is included in the encoded signal, is encoded, for example, as a prediction parameter.

[0166] Figure 21 is a conceptual diagram illustrating an example of the process of deriving the motion vector of the current picture using merge mode.

[0167] First, a predicted MV list is generated in which candidates for the predicted MV are registered. Candidates for the predicted MV include: spatially adjacent predicted MVs, which are the MVs of multiple encoded blocks located spatially around the target block; temporally adjacent predicted MVs, which are the MVs of nearby blocks projected onto the target block's position in the encoded reference picture; combined predicted MVs, which are generated by combining the MV values of spatially adjacent predicted MVs and temporally adjacent predicted MVs; and zero predicted MVs, which are MVs with a value of zero.

[0168] Next, one predicted MV is selected from the multiple predicted MVs registered in the predicted MV list to determine it as the MV for the target block.

[0169] Furthermore, the variable-length coding unit encodes the merge_idx signal, which indicates which predicted MV was selected, by writing it to a stream.

[0170] Note that the predicted MVs registered in the predicted MV list explained in Figure 21 are just examples, and the number of predicted MVs may differ from the number shown in the figure, the configuration may not include some of the types of predicted MVs shown in the figure, or it may include predicted MVs other than those shown in the figure.

[0171] The final MV may be determined by performing the DMVR (decoder motion vector refinement) process described later, using the MV of the target block derived by merge mode.

[0172] The candidates for the predicted MV are the candidate MVs mentioned above, and the predicted MV list is the candidate MV list mentioned above. The candidate MV list may also be referred to as the candidate list. Furthermore, merge_idx is the MV selection information.
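A simplified construction of the predicted MV list and the use of merge_idx can be sketched as follows. The sketch assumes duplicate pruning, a fixed list size of five, and zero-MV padding, none of which are fixed by the text; combined predicted MVs are left out for brevity. The name `build_merge_list` is hypothetical.

```python
def build_merge_list(spatial_mvs, temporal_mvs, max_candidates=5):
    """Build a candidate MV list from spatial and temporal neighbors.

    Unavailable neighbors are passed as None. Duplicates are pruned, and the
    list is padded with zero MVs up to max_candidates (assumed to be >= 1).
    """
    merge_list = []
    for mv in list(spatial_mvs) + list(temporal_mvs):
        if mv is not None and mv not in merge_list:
            merge_list.append(mv)
        if len(merge_list) == max_candidates:
            return merge_list
    while len(merge_list) < max_candidates:
        merge_list.append((0, 0))  # zero predicted MV
    return merge_list


# Usage: the encoder writes merge_idx; the decoder rebuilds the same list
# and reads the MV of the target block directly from it.
merge_list = build_merge_list([(1, 2), None, (1, 2), (3, 4)], [(5, 6)])
merge_idx = 2
selected_mv = merge_list[merge_idx]
```

Since no differential MV is sent in merge mode, the quality of the result depends entirely on how well the list covers the true motion.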

[0173] [MV Derivation > FRUC Mode] Motion information may be derived on the decoding side without being signaled from the encoding side. As mentioned above, the merge mode specified in the H.265 / HEVC standard may be used. Alternatively, motion information may be derived by performing motion search on the decoding side, for example. In this embodiment, motion search is performed on the decoding side without using the pixel values of the current block.

[0174] Here, we will explain the mode in which motion detection is performed on the decoding device side. This mode in which motion detection is performed on the decoding device side is sometimes called PMMVD (pattern matched motion vector derivation) mode or FRUC (frame rate up-conversion) mode.

[0175] An example of FRUC processing is shown in flowchart form in Figure 22. First, a list of multiple candidates, each having a predicted motion vector (MV) (i.e., a candidate MV list, which may be the same as the merge list), is generated by referencing the motion vectors of encoded blocks spatially or temporally adjacent to the current block (step Si_1). Next, the best candidate MV is selected from the multiple candidate MVs registered in the candidate MV list (step Si_2). For example, an evaluation value is calculated for each candidate MV included in the candidate MV list, and one candidate MV is selected based on the evaluation value. Then, a motion vector for the current block is derived based on the motion vector of the selected candidate (step Si_4). Specifically, for example, the motion vector of the selected candidate (best candidate MV) is used directly as the motion vector for the current block. Alternatively, for example, the motion vector for the current block may be derived by performing pattern matching in the area surrounding the position in the reference picture corresponding to the motion vector of the selected candidate. In other words, a search is performed in the area surrounding the best candidate MV using pattern matching and evaluation values in the reference picture. If an MV with a better evaluation value is found, the best candidate MV may be updated to that MV and set as the final MV for the current block. The update to an MV with a better evaluation value may also be omitted.

[0176] Finally, the inter-prediction unit 126 generates a predicted image of the current block by performing motion compensation on the current block using the derived MV and the encoded reference picture (step Si_5).

[0177] The same processing method can be used when processing at the sub-block level.

[0178] The evaluation value may be calculated by various methods. For example, the reconstructed image of a region in a reference picture corresponding to a motion vector may be compared with the reconstructed image of a predetermined region (which may be, for example, a region in another reference picture or a region in an adjacent block of the current picture, as shown below). The predetermined region may be set in advance.

[0179] Furthermore, the difference in pixel values between the two reconstructed images may be calculated and used as the evaluation value for the motion vector. Alternatively, the evaluation value may be calculated using other information in addition to the difference value.

[0180] Next, we will explain in detail an example of pattern matching. First, one candidate MV included in the candidate MV list (e.g., merge list) is selected as the starting point for the search using pattern matching. For example, first pattern matching or second pattern matching may be used. First pattern matching and second pattern matching are sometimes called bilateral matching and template matching, respectively.

[0181] [MV Derivation > FRUC > Bilateral Matching] In the first pattern matching, pattern matching is performed between two blocks in two different reference pictures that are aligned with the motion trajectory of the current block. Therefore, in the first pattern matching, a region in another reference picture aligned with the motion trajectory of the current block is used as a predetermined region for calculating the evaluation value of the candidate described above. This predetermined region may be set in advance.

[0182] Figure 23 is a conceptual diagram illustrating an example of first pattern matching (bilateral matching) between two blocks in two reference pictures along a motion trajectory. As shown in Figure 23, in first pattern matching, two motion vectors (MV0, MV1) are derived by searching for the best-matching pair among pairs of two blocks in two different reference pictures (Ref0, Ref1) that lie along the motion trajectory of the current block. Specifically, for the current block, the difference is derived between the reconstructed image at the position in the first encoded reference picture (Ref0) specified by the candidate MV and the reconstructed image at the position in the second encoded reference picture (Ref1) specified by a symmetric MV obtained by scaling the candidate MV according to the display time interval, and an evaluation value is calculated using the obtained difference value. The candidate MV with the best evaluation value among the multiple candidate MVs may be selected as the final MV, which is likely to yield good results.

[0183] Under the assumption of a continuous motion trajectory, the motion vectors (MV0, MV1) pointing to the two reference blocks are proportional to the temporal distances (TD0, TD1) between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1). For example, if the current picture is temporally located between the two reference pictures and the temporal distances from the current picture to the two reference pictures are equal, then the first pattern matching derives mirror-symmetric bidirectional motion vectors.
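The proportionality between MVs and temporal distances can be turned into an evaluation value as sketched below. The sketch assumes integer-pel MVs, that Ref0 and Ref1 lie on opposite temporal sides of the current picture, and a sum of absolute differences as the difference measure; the names `mirrored_mv` and `bilateral_cost` are hypothetical.

```python
def mirrored_mv(mv0, td0, td1):
    """Scale a candidate MV toward Ref0 into the symmetric MV toward Ref1.

    td0, td1: temporal distances from the current picture to Ref0 and Ref1.
    The sign flip reflects the assumption that the references lie on
    opposite sides of the current picture.
    """
    return (round(-mv0[0] * td1 / td0), round(-mv0[1] * td1 / td0))


def bilateral_cost(ref0, ref1, bx, by, size, mv0, td0, td1):
    """Evaluation value: SAD between the two reference blocks on the trajectory."""
    mv1 = mirrored_mv(mv0, td0, td1)
    x0, y0 = bx + mv0[0], by + mv0[1]
    x1, y1 = bx + mv1[0], by + mv1[1]
    cost = 0
    for dy in range(size):
        for dx in range(size):
            cost += abs(ref0[y0 + dy][x0 + dx] - ref1[y1 + dy][x1 + dx])
    return cost
```

Note that the samples of the current block never enter the cost, which is what allows a decoder to repeat the same search.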

[0184] [MV Derivation > FRUC > Template Matching] In the second pattern matching (template matching), pattern matching is performed between a template in the current picture (blocks adjacent to the current block in the current picture, e.g., the blocks above and / or to the left) and a block in the reference picture. Therefore, in the second pattern matching, the blocks adjacent to the current block in the current picture are used as the predetermined region for calculating the evaluation value of the candidates mentioned above.

[0185] Figure 24 is a conceptual diagram illustrating an example of pattern matching (template matching) between a template in the current picture and a block in the reference picture. As shown in Figure 24, in the second pattern matching, the motion vector of the current block is derived by searching in the reference picture (Ref0) for the block that best matches the block adjacent to the current block in the current picture (Cur Pic). Specifically, for the current block, the difference between the reconstructed image of both or either of the left-adjacent and top-adjacent encoded regions and the reconstructed image at the equivalent position in the encoded reference picture (Ref0) specified by the candidate MV is derived, and an evaluation value is calculated using the obtained difference value. Among the multiple candidate MVs, the candidate MV with the best evaluation value is selected as the best candidate MV.
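The template comparison can be sketched as follows. The sketch assumes a template that is one sample thick (the row above and the column to the left of the block), integer-pel MVs, and a sum of absolute differences; it also assumes that every accessed position lies inside both pictures, since Python would silently wrap negative indices. The names `template_cost` and `best_template_candidate` are hypothetical.

```python
def template_cost(cur, ref, bx, by, size, mv):
    """SAD between the current block's template and the template at the MV position.

    Requires bx >= 1, by >= 1 and an in-picture reference position.
    """
    rx, ry = bx + mv[0], by + mv[1]
    cost = 0
    for i in range(size):
        # Row of reconstructed samples directly above the block.
        cost += abs(cur[by - 1][bx + i] - ref[ry - 1][rx + i])
        # Column of reconstructed samples directly to the left of the block.
        cost += abs(cur[by + i][bx - 1] - ref[ry + i][rx - 1])
    return cost


def best_template_candidate(cur, ref, bx, by, size, candidates):
    """Return the candidate MV with the smallest template cost."""
    return min(candidates, key=lambda mv: template_cost(cur, ref, bx, by, size, mv))
```

Only already reconstructed neighbors are read from the current picture, so the same evaluation is available on the decoding side.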

[0186] Information indicating whether or not to apply such a FRUC mode (e.g., called a FRUC flag) may be signaled at the CU level. Furthermore, if FRUC mode is applied (e.g., the FRUC flag is true), information indicating the applicable pattern matching method (first pattern matching or second pattern matching) may be signaled at the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).

[0187] [MV Derivation > Affine Mode] Next, we will describe an affine mode in which motion vectors are derived at the sub-block level based on the motion vectors of multiple adjacent blocks. This mode is sometimes called the affine motion compensation prediction mode.

[0188] Figure 25A is a conceptual diagram illustrating an example of deriving a subblock-level motion vector based on the motion vectors of multiple adjacent blocks. In Figure 25A, the current block contains 16 4x4 subblocks. Here, the motion vector v0 of the upper left corner control point of the current block is derived based on the motion vectors of the adjacent blocks, and similarly, the motion vector v1 of the upper right corner control point of the current block is derived based on the motion vectors of the adjacent subblocks. Then, the two motion vectors v0 and v1 may be projected by the following equation (1A), and the motion vector (v_x, v_y) of each subblock within the current block may be derived.

[0189]

[Equation (1A)]

[0190] Here, x and y represent the horizontal and vertical positions of the subblock, respectively, and w represents a predetermined weighting coefficient. The predetermined weighting coefficient may be determined in advance.
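For orientation, the widely used four-parameter affine motion model has the form below. This is a sketch rather than a reproduction of equation (1A): it assumes v0 = (v_{0x}, v_{0y}) and v1 = (v_{1x}, v_{1y}), and it assumes that w equals the width of the current block, in which case the model returns v0 at (x, y) = (0, 0) and v1 at (x, y) = (w, 0).

```latex
\begin{aligned}
v_x &= \frac{v_{1x} - v_{0x}}{w}\,x - \frac{v_{1y} - v_{0y}}{w}\,y + v_{0x} \\
v_y &= \frac{v_{1y} - v_{0y}}{w}\,x + \frac{v_{1x} - v_{0x}}{w}\,y + v_{0y}
\end{aligned}
```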

[0191] Information indicating such affine modes (e.g., called an affine flag) may be signaled at the CU level. However, the signaling of this information indicating affine modes is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).

[0192] Furthermore, such affine modes may include several modes in which the method of deriving the motion vectors of the upper-left and upper-right corner control points differs. For example, there are two affine modes: the affine inter (also called the affine normal inter) mode and the affine merge mode.

[0193] [MV Derivation > Affine Mode] Figure 25B is a conceptual diagram illustrating an example of deriving motion vectors for subblock units in an affine mode with three control points. In Figure 25B, the current block contains 16 4x4 subblocks. Here, the motion vector v0 of the upper left corner control point of the current block is derived based on the motion vectors of adjacent blocks. Similarly, the motion vector v1 of the upper right corner control point and the motion vector v2 of the lower left corner control point of the current block are each derived based on the motion vectors of adjacent blocks. Then, the three motion vectors v0, v1 and v2 may be projected by the following equation (1B), and the motion vector (v_x, v_y) of each subblock within the current block may be derived.

[0194]

[Equation (1B)]

[0195] Here, x and y represent the horizontal and vertical positions of the subblock center, respectively, w represents the width of the current block, and h represents the height of the current block.
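For orientation, the common six-parameter affine motion model with three control points has the form below. This is a sketch rather than a reproduction of equation (1B), and sign conventions in the published equation may differ: it assumes v0 = (v_{0x}, v_{0y}), v1 = (v_{1x}, v_{1y}) and v2 = (v_{2x}, v_{2y}), and it is written so that the model returns v0 at (0, 0), v1 at (w, 0), and v2 at (0, h).

```latex
\begin{aligned}
v_x &= \frac{v_{1x} - v_{0x}}{w}\,x + \frac{v_{2x} - v_{0x}}{h}\,y + v_{0x} \\
v_y &= \frac{v_{1y} - v_{0y}}{w}\,x + \frac{v_{2y} - v_{0y}}{h}\,y + v_{0y}
\end{aligned}
```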

[0196] Affine modes with different numbers of control points (e.g., two and three) may be switched and signaled at the CU level. Information indicating the number of control points for the affine mode used at the CU level may also be signaled at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).
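The derivation of subblock motion vectors from two or three control points can be sketched as follows. The sketch assumes the common four- and six-parameter affine models, takes w and h to be the width and height of the current block, and evaluates the model at each subblock center; these are assumptions, and the exact form of equations (1A) and (1B) may differ. The name `affine_subblock_mvs` is hypothetical.

```python
def affine_subblock_mvs(cpmvs, width, height, sub=4):
    """Return a 2-D list of (vx, vy), one entry per sub x sub subblock.

    cpmvs: [v0, v1] (upper left, upper right) or [v0, v1, v2] (plus lower left),
    each an (x, y) tuple.
    """
    (v0x, v0y), (v1x, v1y) = cpmvs[0], cpmvs[1]
    # Horizontal gradient of the motion field.
    ax = (v1x - v0x) / width
    ay = (v1y - v0y) / width
    if len(cpmvs) == 3:
        # Three control points: independent vertical gradient.
        v2x, v2y = cpmvs[2]
        bx = (v2x - v0x) / height
        by = (v2y - v0y) / height
    else:
        # Two control points: vertical gradient implied by rotation and zoom.
        bx, by = -ay, ax
    mvs = []
    for sy in range(0, height, sub):
        row = []
        for sx in range(0, width, sub):
            x, y = sx + sub / 2, sy + sub / 2  # subblock center
            row.append((ax * x + bx * y + v0x, ay * x + by * y + v0y))
        mvs.append(row)
    return mvs


# Usage: a 16x16 block yields the 16 4x4 subblocks of Figures 25A and 25B.
zoom_field = affine_subblock_mvs([(0, 0), (16, 0)], 16, 16)
```

When all control point MVs are equal, every subblock receives the same MV and the model reduces to ordinary translational motion compensation.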

[0197] Furthermore, an affine mode having three control points may include several modes in which the method of deriving the motion vectors of the upper-left, upper-right, and lower-left corner control points differs. For example, there are two affine modes: the affine inter (also called the affine normal inter) mode and the affine merge mode.

[0198] [MV Derivation > Affine Merge Mode] Figures 26A, 26B, and 26C are conceptual diagrams illustrating the affine merge mode.

[0199] In affine merge mode, as shown in Figure 26A, the predicted motion vectors for each control point of the current block are calculated based on multiple motion vectors corresponding to blocks encoded in affine mode among the encoded blocks adjacent to the current block: Block A (left), Block B (top), Block C (upper right), Block D (lower left), and Block E (upper left). Specifically, these blocks are examined in the order of encoded Block A (left), Block B (top), Block C (upper right), Block D (lower left), and Block E (upper left), and the first valid block encoded in affine mode is identified. Based on the multiple motion vectors corresponding to this identified block, the predicted motion vectors for the control points of the current block are calculated.

[0200] For example, as shown in Figure 26B, if block A, which is adjacent to the left of the current block, is encoded in affine mode with two control points, then motion vectors v3 and v4 are derived projected onto the upper-left and upper-right corners of the encoded block containing block A. Then, from the derived motion vectors v3 and v4, the predicted motion vector v0 for the upper-left corner control point and the predicted motion vector v1 for the upper-right corner control point of the current block are calculated.

[0201] For example, as shown in Figure 26C, if block A, which is adjacent to the left of the current block, is encoded in affine mode with three control points, then motion vectors v3, v4, and v5 are derived projected onto the upper left, upper right, and lower left corners of the encoded block containing block A. From these derived motion vectors v3, v4, and v5, the predicted motion vector v0 for the upper left corner control point of the current block, the predicted motion vector v1 for the upper right corner control point, and the predicted motion vector v2 for the lower left corner control point are calculated.

[0202] Furthermore, this method for deriving predicted motion vectors may also be used to derive the predicted motion vectors for each control point of the current block in step Sj_1 of Figure 29, which will be described later.

[0203] Figure 27 is a flowchart showing an example of the affine merge mode.

[0204] In affine merge mode, as shown in the figure, the inter-prediction unit 126 first derives the predicted MV for each control point of the current block (step Sk_1). The control points are the upper left and upper right corners of the current block, as shown in Figure 25A, or the upper left, upper right, and lower left corners of the current block, as shown in Figure 25B.

[0205] In other words, as shown in Figure 26A, the inter-prediction unit 126 examines these blocks in the order of encoded block A (left), block B (top), block C (upper right), block D (lower left), and block E (upper left), and identifies the first valid block encoded in affine mode.

[0206] Then, if block A is identified and block A has two control points, as shown in Figure 26B, the inter-prediction unit 126 calculates the motion vector v0 of the upper left corner control point and the motion vector v1 of the upper right corner control point of the current block from the motion vectors v3 and v4 of the upper left and upper right corners of the encoded block containing block A. For example, the inter-prediction unit 126 calculates the predicted motion vector v0 of the upper left corner control point and the predicted motion vector v1 of the upper right corner control point of the current block by projecting the motion vectors v3 and v4 of the upper left and upper right corners of the encoded block onto the current block.

[0207] Alternatively, if block A is identified and block A has three control points, as shown in Figure 26C, the inter-prediction unit 126 calculates the motion vector v0 of the upper left corner control point, the motion vector v1 of the upper right corner control point, and the motion vector v2 of the lower left corner control point of the current block from the motion vectors v3, v4, and v5 of the upper left, upper right, and lower left corners of the encoded block containing block A. For example, the inter-prediction unit 126 calculates the predicted motion vector v0 of the upper left corner control point, the predicted motion vector v1 of the upper right corner control point, and the predicted motion vector v2 of the lower left corner control point of the current block by projecting the motion vectors v3, v4, and v5 of the upper left, upper right, and lower left corners of the encoded block onto the current block.

[0208] Next, the inter-prediction unit 126 performs motion compensation for each of the multiple subblocks contained in the current block. That is, for each of the multiple subblocks, the inter-prediction unit 126 calculates the motion vector of that subblock as an affine MV using the two predicted motion vectors v0 and v1 and equation (1A) described above, or the three predicted motion vectors v0, v1 and v2 and equation (1B) described above (step Sk_2). Then, the inter-prediction unit 126 performs motion compensation for that subblock using the affine MV and the encoded reference picture (step Sk_3). As a result, motion compensation is performed on the current block, and a predicted image of the current block is generated.
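Step Sk_1 can be sketched in two parts: scanning the neighbors in the order A, B, C, D, E, and projecting the identified block's control point MVs onto the corners of the current block. The projection shown here evaluates the neighbor's affine model at the current block's corner positions, which is one plausible reading of "projecting" and is an assumption. The dictionary layout and the names `first_affine_neighbor` and `project_control_points` are hypothetical.

```python
SCAN_ORDER = ["A", "B", "C", "D", "E"]  # left, top, upper right, lower left, upper left


def first_affine_neighbor(neighbors):
    """Return (name, block) of the first valid neighbor encoded in affine mode."""
    for name in SCAN_ORDER:
        block = neighbors.get(name)
        if block is not None and block.get("affine"):
            return name, block
    return None


def project_control_points(nb, cur_x, cur_y, cur_w, cur_h):
    """Evaluate the neighbor's affine model at the current block's corners.

    nb: dict with keys x, y, w, h (position and size of the encoded block)
    and cpmvs = [v3, v4] or [v3, v4, v5].
    Returns [v0, v1] or [v0, v1, v2] for the current block.
    """
    (v3x, v3y), (v4x, v4y) = nb["cpmvs"][0], nb["cpmvs"][1]
    ax = (v4x - v3x) / nb["w"]
    ay = (v4y - v3y) / nb["w"]
    three_points = len(nb["cpmvs"]) == 3
    if three_points:
        v5x, v5y = nb["cpmvs"][2]
        bx = (v5x - v3x) / nb["h"]
        by = (v5y - v3y) / nb["h"]
    else:
        bx, by = -ay, ax  # four-parameter model

    def model(px, py):
        dx, dy = px - nb["x"], py - nb["y"]
        return (ax * dx + bx * dy + v3x, ay * dx + by * dy + v3y)

    corners = [(cur_x, cur_y), (cur_x + cur_w, cur_y)]
    if three_points:
        corners.append((cur_x, cur_y + cur_h))
    return [model(px, py) for px, py in corners]
```

The resulting control point MVs would then feed the per-subblock calculation of step Sk_2.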

[0209] [MV Derivation > Affine Inter Mode] Figure 28A is a conceptual diagram illustrating an affine inter mode with two control points.

[0210] In this affine inter mode, as shown in Figure 28A, a motion vector selected from the motion vectors of the encoded blocks A, B, and C adjacent to the current block is used as the predicted motion vector v0 for the control point at the upper left corner of the current block. Similarly, a motion vector selected from the motion vectors of the encoded blocks D and E adjacent to the current block is used as the predicted motion vector v1 for the control point at the upper right corner of the current block.

[0211] Figure 28B is a conceptual diagram illustrating an affine inter mode with three control points.

[0212] In this affine inter mode, as shown in Figure 28B, a motion vector selected from the motion vectors of the encoded blocks A, B, and C adjacent to the current block is used as the predicted motion vector v0 for the control point at the upper left corner of the current block. Similarly, a motion vector selected from the motion vectors of the encoded blocks D and E adjacent to the current block is used as the predicted motion vector v1 for the control point at the upper right corner of the current block. Furthermore, a motion vector selected from the motion vectors of the encoded blocks F and G adjacent to the current block is used as the predicted motion vector v2 for the control point at the lower left corner of the current block.

[0213] Figure 29 is a flowchart showing an example of the affine inter mode.

[0214] As shown in the figure, in the affine inter mode, the inter prediction unit 126 first derives the predicted motion vectors (v0, v1) or (v0, v1, v2) for the two or three control points of the current block (step Sj_1). Each control point is located at the upper left corner, upper right corner, or lower left corner of the current block, as shown in Figure 25A or Figure 25B.

[0215] In other words, the inter prediction unit 126 derives the predicted motion vectors (v0, v1) or (v0, v1, v2) of the control points of the current block by selecting the motion vector of one of the encoded blocks near each control point of the current block shown in Figure 28A or Figure 28B. At this time, the inter prediction unit 126 encodes, into the stream, predicted motion vector selection information for identifying the two selected motion vectors.

[0216] For example, the inter prediction unit 126 may determine, using cost evaluation or the like, which block's motion vector to select from the encoded blocks adjacent to the current block as the predicted motion vector for the control point, and may write a flag to the bitstream indicating which predicted motion vector was selected.

[0217] Next, the inter prediction unit 126 performs motion search (steps Sj_3 and Sj_4) while updating the predicted motion vectors selected or derived in step Sj_1 (step Sj_2). That is, the inter prediction unit 126 calculates, as affine MVs, the motion vector of each subblock corresponding to the updated predicted motion vectors, using equation (1A) or equation (1B) described above (step Sj_3). Then, the inter prediction unit 126 performs motion compensation for each subblock using these affine MVs and the encoded reference picture (step Sj_4). As a result of this motion search loop, the inter prediction unit 126 determines, for example, the predicted motion vector that yields the smallest cost as the motion vector of the control point (step Sj_5). At this time, the inter prediction unit 126 further encodes, into the stream, the difference between the determined MV and the predicted motion vector as a difference MV.

[0218] Finally, the inter prediction unit 126 generates a predicted image of the current block by performing motion compensation on the current block using the determined MV and the encoded reference picture (step Sj_6).

[0219] [MV Derivation > Affine Inter Mode] When affine modes with different numbers of control points (e.g., two and three) are switched and signaled at the CU level, the number of control points may differ between the encoded block and the current block. Figures 30A and 30B are conceptual diagrams illustrating how to derive the predicted motion vectors of the control points when the number of control points differs between the encoded block and the current block.

[0220] For example, as shown in Figure 30A, if the current block has three control points at the upper left corner, upper right corner, and lower left corner, and block A adjacent to the left of the current block is encoded in affine mode with two control points, then motion vectors v3 and v4 projected onto the upper left and upper right corner positions of the encoded block containing block A are derived. Then, from the derived motion vectors v3 and v4, the predicted motion vector v0 for the upper left corner control point and the predicted motion vector v1 for the upper right corner control point of the current block are calculated. Furthermore, from the derived motion vectors v0 and v1, the predicted motion vector v2 for the lower left corner control point is calculated.
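As a non-limiting illustration, the last step above, in which v2 is calculated from v0 and v1, may be sketched as follows. The sketch assumes that the two-control-point affine model is the commonly used four-parameter form, evaluated at the lower left corner (0, h) of a block of width w and height h. The function name `derive_v2` is an assumption introduced only for this example.

```python
from typing import Tuple

MV = Tuple[float, float]


def derive_v2(v0: MV, v1: MV, w: float, h: float) -> MV:
    """Extrapolate the lower left control-point MV from v0 and v1.

    ASSUMPTION: a four-parameter affine model, i.e.
      vx(x, y) = a*x - b*y + v0x,  vy(x, y) = b*x + a*y + v0y
    with a = (v1x - v0x) / w and b = (v1y - v0y) / w, evaluated at (0, h).
    """
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    return (v0[0] - b * h, v0[1] + a * h)
```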

[0221] For example, as shown in Figure 30B, if the current block has two control points at the upper left corner and the upper right corner, and block A adjacent to the left of the current block is encoded in affine mode with three control points, then motion vectors v3, v4, and v5 projected onto the upper left corner, upper right corner, and lower left corner positions of the encoded block containing block A are derived. Then, from the derived motion vectors v3, v4, and v5, the predicted motion vector v0 for the upper left corner control point of the current block and the predicted motion vector v1 for the upper right corner control point are calculated.

[0222] This method for deriving predicted motion vectors may also be used to derive the predicted motion vectors for each control point of the current block in step Sj_1 of Figure 29.

[0223] [MV Derivation > DMVR] Figure 31A is a flowchart showing the relationship between merge mode and DMVR.

[0224] The inter prediction unit 126 derives the motion vector of the current block in merge mode (step Sl_1). Next, the inter prediction unit 126 determines whether or not to perform a motion vector search, i.e., a motion search (step Sl_2). If the inter prediction unit 126 determines not to perform a motion search (No in step Sl_2), it determines the motion vector derived in step Sl_1 as the final motion vector for the current block (step Sl_4). In other words, in this case, the motion vector of the current block is determined in merge mode.

[0225] On the other hand, if it is determined in step Sl_2 to perform a motion search (Yes in step Sl_2), the inter prediction unit 126 derives the final motion vector for the current block by searching the surrounding region of the reference picture indicated by the motion vector derived in step Sl_1 (step Sl_3). In other words, in this case, the motion vector of the current block is determined by DMVR.

[0226] Figure 31B is a conceptual diagram illustrating an example of DMVR processing for determining MV.

[0227] First, the optimal MVP set for the current block (for example, in merge mode) is designated as the candidate MV. Then, according to the candidate MV(L0), reference pixels are identified from the first reference picture (L0), which is an encoded picture in the L0 direction. Similarly, according to the candidate MV(L1), reference pixels are identified from the second reference picture (L1), which is an encoded picture in the L1 direction. A template is generated by taking the average of these reference pixels.

[0228] Next, using the template, the peripheral regions of the candidate MVs in the first reference picture (L0) and the second reference picture (L1) are each searched, and the MV with the minimum cost is determined as the final MV. Note that the cost value may be calculated using, for example, the difference value between each pixel value of the template and each pixel value of the search region, the candidate MV value, and the like.
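As a non-limiting illustration, the template generation and peripheral search described above may be sketched as follows. The sketch assumes integer-pixel MVs, a sum of absolute differences (SAD) as the only cost term (the candidate MV value is not included in the cost), and a search window that stays inside the picture. The names `block_at`, `sad`, `refine`, and `dmvr`, and the `radius` parameter, are assumptions introduced only for this example.

```python
from typing import List, Tuple

Picture = List[List[int]]
MV = Tuple[int, int]


def block_at(pic: Picture, x: int, y: int, w: int, h: int) -> Picture:
    """Extract the w-by-h block whose upper left pixel is (x, y)."""
    return [row[x:x + w] for row in pic[y:y + h]]


def sad(a: Picture, b: Picture) -> int:
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def refine(pic: Picture, template: Picture, x: int, y: int,
           mv: MV, w: int, h: int, radius: int = 1) -> MV:
    """Search within +/-radius of the candidate MV for the minimum-cost MV."""
    best_mv = mv
    best_cost = sad(template, block_at(pic, x + mv[0], y + mv[1], w, h))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            cand = (mv[0] + dx, mv[1] + dy)
            cost = sad(template, block_at(pic, x + cand[0], y + cand[1], w, h))
            if cost < best_cost:  # ties keep the earlier (candidate) MV
                best_mv, best_cost = cand, cost
    return best_mv


def dmvr(ref0: Picture, ref1: Picture, x: int, y: int, w: int, h: int,
         mv0: MV, mv1: MV, radius: int = 1) -> Tuple[MV, MV]:
    """Build the template from both references, then refine each MV."""
    b0 = block_at(ref0, x + mv0[0], y + mv0[1], w, h)
    b1 = block_at(ref1, x + mv1[0], y + mv1[1], w, h)
    # Template: rounded average of the two reference blocks.
    template = [[(p + q + 1) // 2 for p, q in zip(r0, r1)]
                for r0, r1 in zip(b0, b1)]
    return (refine(ref0, template, x, y, mv0, w, h, radius),
            refine(ref1, template, x, y, mv1, w, h, radius))
```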

[0229] Note that the configuration and operation of the processes described here are typically common to the encoding device and the decoding device described later.

[0230] Processing other than the example described here may also be used, as long as it can search the periphery of the candidate MV to derive the final MV.

[0231] [Motion Compensation > BIO / OBMC] In motion compensation, there are modes for generating a predicted image and correcting the predicted image. Those modes are, for example, BIO and OBMC described later.

[0232] FIG. 32 is a flowchart showing an example of generating a predicted image.

[0233] The inter prediction unit 126 generates a predicted image (step Sm_1) and corrects the predicted image by, for example, any of the above-described modes (step Sm_2).

[0234] FIG. 33 is a flowchart showing another example of generating a predicted image.

[0235] The inter prediction unit 126 determines the motion vector of the current block (step Sn_1). Next, the inter prediction unit 126 generates a predicted image (step Sn_2) and determines whether or not to perform correction processing (step Sn_3). If the inter prediction unit 126 determines that correction processing should be performed (Yes in step Sn_3), it generates the final predicted image by correcting the predicted image (step Sn_4). On the other hand, if the inter prediction unit 126 determines that no correction processing should be performed (No in step Sn_3), it outputs the predicted image as the final predicted image without correction (step Sn_5).

[0236] Furthermore, motion compensation includes a mode that corrects brightness when generating the predicted image. One such mode is LIC, which will be described later.

[0237] Figure 34 is a flowchart showing another example of predictive image generation.

[0238] The inter prediction unit 126 derives the motion vector of the current block (step So_1). Next, the inter prediction unit 126 determines whether or not to perform brightness correction processing (step So_2). If the inter prediction unit 126 determines to perform brightness correction processing (Yes in step So_2), it generates a predicted image while performing brightness correction (step So_3). In other words, the predicted image is generated by LIC. On the other hand, if the inter prediction unit 126 determines not to perform brightness correction processing (No in step So_2), it generates a predicted image using normal motion compensation without performing brightness correction (step So_4).

[0239] [Motion Compensation > OBMC] Inter prediction signals may be generated using not only the motion information of the current block obtained through motion search, but also the motion information of adjacent blocks. Specifically, inter prediction signals may be generated at the subblock level within the current block by weighted addition of a prediction signal based on motion information obtained through motion search (within the reference picture) and a prediction signal based on motion information of adjacent blocks (within the current picture). Such inter prediction (motion compensation) is sometimes called OBMC (overlapped block motion compensation).

[0240] In OBMC mode, information indicating the size of subblocks for OBMC (e.g., called OBMC block size) may be signaled at the sequence level. Furthermore, information indicating whether or not to apply OBMC mode (e.g., called OBMC flag) may be signaled at the CU level. Note that the signaling levels of this information are not limited to the sequence level and CU level, but may be other levels (e.g., picture level, slice level, tile level, CTU level, or subblock level).

[0241] An example of the OBMC mode will now be described in more detail. Figures 35 and 36 are a flowchart and a conceptual diagram, respectively, illustrating an overview of the predicted image correction process using OBMC processing.

[0242] First, as shown in Figure 36, a predicted image (Pred) is obtained using normal motion compensation with the motion vector (MV) assigned to the current block to be processed. In Figure 36, the arrow "MV" points to the reference picture, indicating what the current block of the current picture is referencing in order to obtain the predicted image.

[0243] Next, the motion vector (MV_L) already derived for the encoded left-adjacent block is applied (reused) to the block to be encoded to obtain the predicted image (Pred_L). The motion vector (MV_L) is indicated by the arrow "MV_L" pointing from the current block to the reference picture. Then, the first correction of the predicted image is performed by superimposing the two predicted images, Pred and Pred_L. This has the effect of blending the boundaries between adjacent blocks.

[0244] Similarly, the motion vector (MV_U) already derived for the encoded upper adjacent block is applied (reused) to the block to be encoded to obtain the predicted image (Pred_U). The motion vector (MV_U) is indicated by an arrow "MV_U" pointing from the current block to the reference picture. Then, the predicted image Pred_U is superimposed on the predicted image that has undergone the first correction (e.g., Pred and Pred_L) to perform a second correction of the predicted image. This has the effect of blending the boundaries between adjacent blocks. The predicted image obtained by the second correction is the final predicted image of the current block, with the boundaries with adjacent blocks blended (smoothed).

[0245] The above example is a two-pass correction method using left-adjacent and top-adjacent blocks, but the correction method may also be a three-pass or more-pass correction method using right-adjacent and / or bottom-adjacent blocks.

[0246] Furthermore, the area to be superimposed does not have to be the entire pixel area of the block, and may be only a portion of the area near the block boundary.
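As a non-limiting illustration, the two-pass correction using Pred, Pred_L, and Pred_U may be sketched as follows. The blending weights, the number of boundary columns and rows affected, and the function name `obmc_blend` are assumptions introduced only for this example; the description above does not specify them.

```python
from typing import List, Sequence

Block = List[List[float]]


def obmc_blend(pred: Block, pred_l: Block, pred_u: Block,
               weights: Sequence[float] = (0.25, 0.125)) -> Block:
    """Superimpose Pred_L and Pred_U onto Pred near the block boundaries.

    weights[i] is the (hypothetical) share given to the neighbour-based
    prediction in the i-th column (or row) counted from the boundary.
    """
    h, w = len(pred), len(pred[0])
    out = [[float(v) for v in row] for row in pred]
    # First correction: blend Pred_L into the columns nearest the left edge.
    for y in range(h):
        for x, wt in enumerate(weights[:w]):
            out[y][x] = (1 - wt) * out[y][x] + wt * pred_l[y][x]
    # Second correction: blend Pred_U into the rows nearest the top edge.
    for y, wt in enumerate(weights[:h]):
        for x in range(w):
            out[y][x] = (1 - wt) * out[y][x] + wt * pred_u[y][x]
    return out
```

Pixels farther from the left and top boundaries than `len(weights)` are left unchanged, which corresponds to superimposing only a portion of the block area.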

[0247] Here, we have described the OBMC predictive image correction process for obtaining a single predictive image Pred by superimposing additional predictive images Pred_L and Pred_U onto a single reference picture. However, if the predictive image is corrected based on multiple reference images, the same process may be applied to each of the multiple reference pictures. In such cases, by performing OBMC image correction based on multiple reference pictures, a corrected predictive image is obtained from each reference picture, and then the final predictive image is obtained by further superimposing these multiple corrected predictive images.

[0248] In OBMC, the unit of the target block may be the prediction block unit, or it may be a subblock unit obtained by further dividing the prediction block.

[0249] One method for determining whether to apply OBMC processing is to use an obmc_flag signal, which indicates whether or not to apply OBMC processing. As a specific example, the encoding device may determine whether the target block belongs to a region with complex motion. If it belongs to a region with complex motion, the encoding device sets the obmc_flag to a value of 1 and applies OBMC processing to perform encoding. If it does not belong to a region with complex motion, it sets the obmc_flag to a value of 0 and encodes the block without applying OBMC processing. On the other hand, the decoding device decodes the obmc_flag described in the stream (e.g., compressed sequence) and switches whether or not to apply OBMC processing depending on its value to perform decoding.

[0250] In the example described above, the inter prediction unit 126 generates one rectangular prediction image for the current rectangular block. However, the inter prediction unit 126 may generate multiple prediction images of shapes different from a rectangle for the current rectangular block, and then combine these multiple prediction images to generate the final rectangular prediction image. The shapes different from a rectangle may be, for example, triangles.

[0251] FIG. 37 is a conceptual diagram for explaining the generation of two triangular prediction images.

[0252] The inter prediction unit 126 generates a prediction image of a triangle by performing motion compensation on the first partition of the triangle in the current block using the first MV of the first partition. Similarly, the inter prediction unit 126 generates a prediction image of a triangle by performing motion compensation on the second partition of the triangle in the current block using the second MV of the second partition. Then, the inter prediction unit 126 generates a prediction image of a rectangle same as the current block by combining these prediction images.

[0253] In the example shown in FIG. 37, the first partition and the second partition are each a triangle, but they may be trapezoids, or they may have different shapes from each other. Further, in the example shown in FIG. 37, the current block is composed of two partitions, but it may be composed of three or more partitions.

[0254] Also, the first partition and the second partition may overlap. That is, the first partition and the second partition may include the same pixel region. In this case, a prediction image of the current block may be generated using the prediction image in the first partition and the prediction image in the second partition.
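As a non-limiting illustration, combining two triangular prediction images into one rectangular prediction image may be sketched as follows. The sketch assumes a square block split along the diagonal from the upper left to the lower right, assigns the upper right triangle to the first partition, and treats the diagonal pixels as a region shared by both partitions that is filled with a rounded average. These choices and the function name `combine_triangles` are assumptions introduced only for this example.

```python
from typing import List

Block = List[List[int]]


def combine_triangles(pred1: Block, pred2: Block) -> Block:
    """Merge two partition predictions of an n-by-n block along its diagonal."""
    n = len(pred1)
    out = []
    for y in range(n):
        row = []
        for x in range(n):
            if x > y:
                row.append(pred1[y][x])   # first partition (upper right)
            elif x < y:
                row.append(pred2[y][x])   # second partition (lower left)
            else:
                # Overlapping pixels: rounded average of both predictions.
                row.append((pred1[y][x] + pred2[y][x] + 1) // 2)
        out.append(row)
    return out
```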

[0255] Also, in this example, an example in which prediction images are generated by inter prediction for both two partitions is shown, but prediction images may be generated by intra prediction for at least one partition.

[0256] [Motion Compensation > BIO] Next, a method for deriving a motion vector will be described. First, a mode for deriving a motion vector based on a model assuming uniform linear motion will be described. This mode is sometimes called the BIO (bi-directional optical flow) mode.

[0257] Figure 38 is a conceptual diagram illustrating a model that assumes uniform linear motion. In Figure 38, (vx, vy) represents the velocity vector, and τ0 and τ1 represent the temporal distance between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1), respectively. (MVx0, MVy0) represents the motion vector corresponding to reference picture Ref0, and (MVx1, MVy1) represents the motion vector corresponding to reference picture Ref1.

[0258] In this case, under the assumption of uniform linear motion of the velocity vector (vx, vy), (MVx0, MVy0) and (MVx1, MVy1) can be expressed as (vxτ0, vyτ0) and (-vxτ1, -vyτ1), respectively, and the following optical flow equation (2) may be adopted.

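[0259]

$$\frac{\partial I^{(k)}}{\partial t} + v_x \frac{\partial I^{(k)}}{\partial x} + v_y \frac{\partial I^{(k)}}{\partial y} = 0 \tag{2}$$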

[0260] Here, I(k) represents the luminance value of the reference image k (k=0,1) after motion compensation. This optical flow equation shows that the sum of (i) the time derivative of the luminance value, (ii) the product of the horizontal velocity and the horizontal component of the spatial gradient of the reference image, and (iii) the product of the vertical velocity and the vertical component of the spatial gradient of the reference image is equal to zero. Based on this optical flow equation and Hermite interpolation, block-level motion vectors obtained from merge lists, etc., may be corrected on a pixel-by-pixel basis.
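As a non-limiting illustration of how a velocity vector (vx, vy) can be obtained from the optical flow constraint, the following sketch solves the constraint in the least-squares sense over a set of samples. This is a generic least-squares solve and is not the BIO derivation itself, which also relies on Hermite interpolation. The function name `solve_flow` and the sample-list inputs are assumptions introduced only for this example.

```python
from typing import List, Optional, Tuple


def solve_flow(ix: List[float], iy: List[float],
               it: List[float]) -> Optional[Tuple[float, float]]:
    """Least-squares (vx, vy) minimising sum (it + vx*ix + vy*iy)^2.

    ix, iy: horizontal and vertical spatial gradients per sample.
    it:     time derivative of the luminance value per sample.
    Returns None when the samples do not determine a unique solution.
    """
    sxx = sum(a * a for a in ix)
    syy = sum(b * b for b in iy)
    sxy = sum(a * b for a, b in zip(ix, iy))
    sxt = sum(a * c for a, c in zip(ix, it))
    syt = sum(b * c for b, c in zip(iy, it))
    det = sxx * syy - sxy * sxy
    if det == 0:
        return None
    # Normal equations: [sxx sxy; sxy syy] [vx; vy] = [-sxt; -syt]
    vx = (-sxt * syy + syt * sxy) / det
    vy = (-syt * sxx + sxt * sxy) / det
    return vx, vy
```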

[0261] Furthermore, motion vectors may be derived on the decoding side using a method different from that used for deriving motion vectors based on a model that assumes uniform linear motion. For example, motion vectors may be derived on a sub-block basis based on the motion vectors of multiple adjacent blocks.

[0262] [Motion Compensation > LIC] Next, we will describe an example of a mode that generates a predicted image (prediction) using LIC (local illumination compensation) processing.

[0263] Figure 39 is a conceptual diagram illustrating an example of a predictive image generation method using brightness correction processing by LIC processing.

[0264] First, an MV is derived in order to obtain, from the encoded reference picture, the reference image corresponding to the current block.

[0265] Next, information indicating how the luminance values have changed between the reference picture and the current picture is extracted for the current block. This extraction is based on the luminance pixel values of the encoded left adjacent reference region (peripheral reference region) and the encoded upper adjacent reference region (peripheral reference region) in the current picture, and the luminance pixel values at the equivalent positions in the reference picture specified by the derived MV. Then, the luminance correction parameter is calculated using the information indicating how the luminance values have changed.

[0266] A predicted image for the current block is generated by applying the brightness correction parameters to the reference image within the reference picture specified in MV.

[0267] Note that the shape of the peripheral reference region in Figure 39 is just one example, and other shapes may be used.

[0268] Furthermore, although this explanation describes the process of generating a predicted image from a single reference picture, the process is similar when generating a predicted image from multiple reference pictures. In that case, the predicted image may be generated after the brightness correction process is applied, in the same manner as described above, to the reference image obtained from each reference picture.

[0269] One method for determining whether or not to apply LIC processing is to use a signal called lic_flag, which indicates whether or not to apply LIC processing. For example, in an encoding device, it is determined whether the current block belongs to a region where brightness changes are occurring. If it belongs to a region where brightness changes are occurring, the value of lic_flag is set to 1, and LIC processing is applied and encoding is performed. If it does not belong to a region where brightness changes are occurring, the value of lic_flag is set to 0, and encoding is performed without applying LIC processing. On the other hand, in a decoding device, the lic_flag written in the stream can be decoded, and the device may switch whether or not to apply LIC processing depending on its value before performing decoding.

[0270] Another way to determine whether to apply LIC processing is, for example, by checking whether LIC processing was applied to surrounding blocks. A specific example is, if the current block is in merge mode, the system checks whether the surrounding encoded blocks selected during the MV derivation in merge mode were encoded with LIC processing. Based on this result, the system switches whether to apply LIC processing and then performs the encoding. Note that in this example, the same process is applied to the decoding device.

[0271] The LIC processing (luminance correction processing) method was explained using Figure 39, and its details will be explained below.

[0272] First, the inter prediction unit 126 derives a motion vector for obtaining a reference image corresponding to the block to be encoded from a reference picture, which is an encoded picture.

[0273] Next, the inter prediction unit 126 extracts information indicating how the luminance values have changed between the reference picture and the picture to be encoded, using the luminance pixel values of the left-adjacent and upper-adjacent encoded peripheral reference regions and the luminance pixel values at equivalent positions in the reference picture specified by the motion vector, and calculates luminance correction parameters. For example, let p0 be the luminance pixel value of a pixel in the peripheral reference region of the picture to be encoded, and p1 be the luminance pixel value of a pixel in the peripheral reference region of the reference picture at an equivalent position to that pixel. The inter prediction unit 126 calculates coefficients A and B as luminance correction parameters so as to optimize A × p1 + B = p0 for multiple pixels in the peripheral reference region.

[0274] Next, the inter prediction unit 126 generates a predicted image for the block to be encoded by performing brightness correction processing, using the brightness correction parameters, on the reference image in the reference picture specified by the motion vector. For example, let p2 be the brightness pixel value in the reference image, and p3 be the brightness pixel value of the predicted image after brightness correction processing. The inter prediction unit 126 generates the predicted image after brightness correction processing by calculating A × p2 + B = p3 for each pixel in the reference image.
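As a non-limiting illustration, the calculation of the coefficients A and B and their application to the reference image may be sketched as follows. The description above only states that A × p1 + B = p0 is optimized; the sketch assumes an ordinary least-squares fit, rounding to integers, and clipping to an 8-bit range. The function names `fit_lic` and `apply_lic` and the `max_val` parameter are assumptions introduced only for this example.

```python
from typing import List, Sequence, Tuple


def fit_lic(p1: Sequence[float], p0: Sequence[float]) -> Tuple[float, float]:
    """Least-squares A, B such that A*p1 + B approximates p0.

    p1: peripheral reference pixels in the reference picture.
    p0: peripheral reference pixels in the picture to be encoded.
    """
    n = len(p1)
    m1 = sum(p1) / n
    m0 = sum(p0) / n
    var = sum((a - m1) ** 2 for a in p1)
    if var == 0:
        # Flat neighbourhood: fall back to an offset-only correction.
        return 1.0, m0 - m1
    cov = sum((a - m1) * (b - m0) for a, b in zip(p1, p0))
    A = cov / var
    return A, m0 - A * m1


def apply_lic(ref_block: List[List[int]], A: float, B: float,
              max_val: int = 255) -> List[List[int]]:
    """Compute p3 = A*p2 + B per pixel, rounded and clipped to [0, max_val]."""
    return [[min(max_val, max(0, round(A * p2 + B))) for p2 in row]
            for row in ref_block]
```

For example, peripheral pixels that are uniformly twice as bright plus an offset of 5 in the current picture yield A = 2 and B = 5, and the same correction is then applied to every pixel of the reference image.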

[0275] Note that the shape of the surrounding reference region in Figure 39 is just one example, and other shapes may be used. Also, only a part of the surrounding reference region shown in Figure 39 may be used. For example, a region containing a predetermined number of pixels obtained by thinning out the upper adjacent pixels and the left adjacent pixels may be used as the surrounding reference region. Furthermore, the surrounding reference region is not limited to the region adjacent to the block to be encoded, but may also be a region not adjacent to the block to be encoded. The predetermined number of pixels may be set in advance.

[0276] Furthermore, in the example shown in Figure 39, the peripheral reference region within the reference picture is the region specified by the motion vector of the picture to be encoded, but it may also be a region specified by another motion vector. For example, this other motion vector may be the motion vector of the peripheral reference region within the picture to be encoded.

[0277] Although the operation of the encoding device 100 has been described here, the operation of the decoding device 200 is typically similar.

[0278] Furthermore, LIC processing may be applied not only to luminance but also to chrominance. In this case, correction parameters may be derived individually for each of Y, Cb, and Cr, or a common correction parameter may be used for any of them.

[0279] Furthermore, LIC processing may be applied on a subblock basis. For example, correction parameters may be derived using the surrounding reference region of the current subblock and the surrounding reference region of the reference subblock within the reference picture specified by the MV of the current subblock.

[0280] [Prediction Control Unit] The prediction control unit 128 selects either the intra-prediction signal (the signal output from the intra-prediction unit 124) or the inter-prediction signal (the signal output from the inter-prediction unit 126), and outputs the selected signal as the prediction signal to the subtraction unit 104 and the addition unit 116.

[0281] As shown in Figure 1, in various encoding device examples, the prediction control unit 128 may output prediction parameters that are input to the entropy encoding unit 110. The entropy encoding unit 110 may generate an encoded bitstream (or sequence) based on the prediction parameters input from the prediction control unit 128 and the quantization coefficients input from the quantization unit 108. The prediction parameters may be used in the decoding device. The decoding device may receive and decode the encoded bitstream and perform the same processing as the prediction processing performed in the intra-prediction unit 124, inter-prediction unit 126, and prediction control unit 128. The prediction parameters may include a selected prediction signal (e.g., a motion vector, prediction type, or prediction mode used in the intra-prediction unit 124 or inter-prediction unit 126), or any index, flag, or value that is based on or indicates the prediction processing performed in the intra-prediction unit 124, inter-prediction unit 126, and prediction control unit 128.

[0282] [Example of an encoding device implementation] Figure 40 is a block diagram showing an example implementation of the encoding device 100. The encoding device 100 includes a processor a1 and memory a2. For example, the multiple components of the encoding device 100 shown in Figure 1 are implemented by the processor a1 and memory a2 shown in Figure 40.

[0283] Processor a1 is a circuit that performs information processing and is a circuit that can access memory a2. For example, processor a1 is a dedicated or general-purpose electronic circuit for encoding moving images. Processor a1 may be a processor such as a CPU. Alternatively, processor a1 may be a collection of multiple electronic circuits. Furthermore, for example, processor a1 may perform the roles of multiple components of the encoding device 100 shown in Figure 1, etc.

[0284] Memory a2 is a dedicated or general-purpose memory in which information for the processor a1 to encode moving images is stored. Memory a2 may be an electronic circuit and may be connected to the processor a1. Memory a2 may also be included in the processor a1. Memory a2 may also be a collection of multiple electronic circuits. Memory a2 may also be a magnetic disk or an optical disk, or may be described as storage or a recording medium. Memory a2 may also be a non-volatile memory or a volatile memory.

[0285] For example, memory a2 may store the video to be encoded, or it may store the bit sequence corresponding to the encoded video. Alternatively, memory a2 may store a program for processor a1 to encode the video.

[0286] Furthermore, for example, memory a2 may play the role of an information storage component among the multiple components of the encoding device 100 shown in Figure 1, etc. For example, memory a2 may play the role of the block memory 118 and frame memory 122 shown in Figure 1. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory a2.

[0287] Furthermore, it is not necessary for the encoding device 100 to implement all of the components shown in Figure 1, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 1, etc., may be included in other devices, and some of the processes described above may be performed by other devices.

[0288] [Decoding device] Next, we will describe a decoding device capable of decoding an encoded signal (encoded bitstream) output from the above-mentioned encoding device 100. Figure 41 is a block diagram showing the functional configuration of a decoding device 200 according to an embodiment. The decoding device 200 is a video decoding device that decodes video in block units.

[0289] As shown in Figure 41, the decoding device 200 includes an entropy decoding unit 202, an inverse quantization unit 204, an inverse transform unit 206, an addition unit 208, a block memory 210, a loop filter unit 212, a frame memory 214, an intra prediction unit 216, an inter prediction unit 218, and a prediction control unit 220.

[0290] The decoding device 200 can be implemented, for example, by a general-purpose processor and memory. In this case, when the software program stored in memory is executed by the processor, the processor functions as the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the addition unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220. Alternatively, the decoding device 200 may be implemented as one or more dedicated electronic circuits corresponding to the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the addition unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220.

[0291] The following describes the overall processing flow of the decoding device 200, followed by a description of each component included in the decoding device 200.

[0292] [Overall flow of the decoding process] Figure 42 is a flowchart showing an example of the overall decoding process by the decoding device 200.

[0293] First, the entropy decoding unit 202 of the decoding device 200 identifies a division pattern for a fixed-size block (for example, 128 × 128 pixels) (step Sp_1). This division pattern is the one selected by the encoding device 100. Then, the decoding device 200 performs steps Sp_2 to Sp_6 for each of the multiple blocks that make up that division pattern.

[0294] In other words, the entropy decoding unit 202 decodes (specifically, performs entropy decoding) the encoded quantization coefficients and prediction parameters of the block to be decoded (also called the current block) (step Sp_2).

[0295] Next, the inverse quantization unit 204 and the inverse transform unit 206 reconstruct multiple prediction residuals (i.e., a difference block) by performing inverse quantization and inverse transformation on the multiple quantization coefficients (step Sp_3).

[0296] Next, the prediction processing unit, which consists of all or part of the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220, generates a prediction signal (also called a prediction block) for the current block (step Sp_4).

[0297] Next, the addition unit 208 reconstructs the current block into a reconstructed image (also called a decoded image block) by adding the prediction block to the difference block (step Sp_5).
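As a non-limiting illustration, the addition of step Sp_5 may be sketched as follows. Clipping the sum to the valid sample range for a given bit depth is an assumption of this sketch, as are the function name `reconstruct` and the `bit_depth` parameter.

```python
from typing import List

Block = List[List[int]]


def reconstruct(pred_block: Block, diff_block: Block,
                bit_depth: int = 8) -> Block:
    """Add the difference block to the prediction block, sample by sample.

    The result is clipped to [0, 2**bit_depth - 1] (an assumption here).
    """
    hi = (1 << bit_depth) - 1
    return [[min(hi, max(0, p + d)) for p, d in zip(pr, dr)]
            for pr, dr in zip(pred_block, diff_block)]
```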

[0298] Then, once this reconstructed image is generated, the loop filter unit 212 performs filtering on the reconstructed image (step Sp_6).

[0299] The decoding device 200 then determines whether or not decoding of the entire picture has been completed (step Sp_7). If it determines that decoding has not been completed (No in step Sp_7), it repeats the process from step Sp_1.

[0300] As shown in the diagram, steps Sp_1 to Sp_7 are performed sequentially by the decoding device 200. Alternatively, some of these steps may be performed in parallel, or their order may be changed.

[0301] [Entropy Decoding Unit] The entropy decoding unit 202 performs entropy decoding on the encoded bitstream. Specifically, the entropy decoding unit 202 arithmetically decodes the encoded bitstream into a binary signal, for example. Then, the entropy decoding unit 202 debinarizes the binary signal. The entropy decoding unit 202 outputs the quantization coefficients in block units to the inverse quantization unit 204. The entropy decoding unit 202 may also output prediction parameters included in the encoded bitstream (see Figure 1) to the intra-prediction unit 216, inter-prediction unit 218, and prediction control unit 220 in the embodiment. The intra-prediction unit 216, inter-prediction unit 218, and prediction control unit 220 can perform the same prediction processing as the intra-prediction unit 124, inter-prediction unit 126, and prediction control unit 128 on the encoding device side.

[0302] [Inverse Quantization Unit] The inverse quantization unit 204 inversely quantizes the quantization coefficients of the block to be decoded (hereinafter referred to as the current block), which are input from the entropy decoding unit 202. Specifically, for each quantization coefficient of the current block, the inverse quantization unit 204 inversely quantizes the quantization coefficient based on the quantization parameter corresponding to that quantization coefficient. The inverse quantization unit 204 then outputs the inversely quantized quantization coefficients (i.e., transformation coefficients) of the current block to the inverse transform unit 206.
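The per-coefficient scaling described above can be sketched roughly as follows. This is only an illustration: the step-size mapping 2^((QP - 4) / 6) is the commonly quoted approximation for H.264/HEVC-style quantizers and is assumed here, since the text does not specify the actual scaling. The function names `step_size` and `inverse_quantize` are hypothetical.

```python
def step_size(qp: int) -> float:
    """Assumed QP-to-step mapping: doubles every 6 QP values, equals 1.0 at QP 4."""
    return 2.0 ** ((qp - 4) / 6.0)


def inverse_quantize(levels, qps):
    """Scale each quantization coefficient by the step size of its own QP."""
    return [level * step_size(qp) for level, qp in zip(levels, qps)]


# Four quantization coefficients, each paired with its own quantization parameter.
coeffs = inverse_quantize([3, -2, 0, 1], [4, 10, 10, 16])
```

The resulting values stand in for the transformation coefficients that are passed on to the inverse transform unit 206.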

[0303] [Inverse Transform Unit] The inverse transform unit 206 restores the prediction error by inversely transforming the transformation coefficients input from the inverse quantization unit 204.

[0304] For example, if the information decoded from the encoded bitstream indicates that EMT or AMT should be applied (e.g., the AMT flag is true), the inverse transform unit 206 inversely transforms the transformation coefficients of the current block based on the information indicating the decoded transformation type.

[0305] For example, if the information decoded from the encoded bitstream indicates that NSST should be applied, the inverse transform unit 206 applies an inverse re-transform (inverse secondary transform) to the transformation coefficients.

[0306] [Addition Unit] The addition unit 208 reconstructs the current block by adding the prediction error, which is input from the inverse transform unit 206, and the prediction samples, which are input from the prediction control unit 220. The addition unit 208 then outputs the reconstructed block to the block memory 210 and the loop filter unit 212.
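As a minimal sketch of this addition, the following Python function adds the prediction error to the prediction samples sample by sample. Clipping the result to the valid sample range and the default bit depth of 8 are assumptions made for illustration; the text only states that the two inputs are added. The name `reconstruct_block` is hypothetical.

```python
def reconstruct_block(prediction, residual, bit_depth=8):
    """Add the prediction error to the prediction samples, sample by sample."""
    max_value = (1 << bit_depth) - 1  # assumed clipping range: 0 .. 2^bit_depth - 1
    return [
        [min(max(p + r, 0), max_value) for p, r in zip(pred_row, res_row)]
        for pred_row, res_row in zip(prediction, residual)
    ]


prediction = [[100, 120], [250, 5]]
residual = [[-3, 4], [10, -9]]
block = reconstruct_block(prediction, residual)
```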

[0307] [Block memory] The block memory 210 is a storage unit for storing blocks that are referenced in intra prediction and are located within the picture to be decoded (hereinafter referred to as the current picture). Specifically, the block memory 210 stores the reconstructed blocks output from the addition unit 208.

[0308] [Loop Filter Unit] The loop filter unit 212 applies a loop filter to the block reconstructed by the addition unit 208 and outputs the filtered reconstructed block to the frame memory 214, a display device, and the like.

[0309] If the ALF on/off information decoded from the encoded bitstream indicates that ALF is on, one filter is selected from among multiple filters based on the direction and activity of the local gradient, and the selected filter is applied to the reconstructed block.

[0310] [Frame memory] The frame memory 214 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 214 stores the reconstructed blocks filtered by the loop filter unit 212.

[0311] [Prediction Processing Unit (Intra Prediction Unit, Inter Prediction Unit, Prediction Control Unit)] Figure 43 is a flowchart showing an example of processing performed in the prediction processing unit of the decoding device 200. The prediction processing unit consists of all or some of the components of the intra-prediction unit 216, the inter-prediction unit 218, and the prediction control unit 220.

[0312] The prediction processing unit generates a predicted image of the current block (step Sq_1). This predicted image is also called a predicted signal or predicted block. Predicted signals include, for example, intra-predicted signals or inter-predicted signals. Specifically, the prediction processing unit generates a predicted image of the current block using a reconstructed image that has already been obtained by generating a predicted block, generating a difference block, generating a coefficient block, restoring the difference block, and generating a decoded image block.

[0313] The reconstructed image may be, for example, the image of the reference picture, or it may be the image of the decoded block within the current picture, which is the picture containing the current block. The decoded block within the current picture is, for example, the adjacent block to the current block.

[0314] Figure 44 is a flowchart showing another example of the processing performed in the prediction processing unit of the decoding device 200.

[0315] The prediction processing unit determines a method or mode for generating the predicted image (step Sr_1). This method or mode may be determined based on, for example, prediction parameters.

[0316] If the prediction processing unit determines a first method as the mode for generating the prediction image, it generates the prediction image according to that first method (step Sr_2a). If the prediction processing unit determines a second method as the mode for generating the prediction image, it generates the prediction image according to that second method (step Sr_2b). If the prediction processing unit determines a third method as the mode for generating the prediction image, it generates the prediction image according to that third method (step Sr_2c).

[0317] The first, second, and third methods are different methods for generating predicted images, and may be, for example, an inter prediction method, an intra prediction method, and another prediction method, respectively. These prediction methods may use the reconstructed images described above.

[0318] [Intra Prediction Unit] The intra-prediction unit 216 generates a prediction signal (intra-prediction signal) by performing intra-prediction based on the intra-prediction mode decoded from the encoded bitstream, and by referring to the blocks in the current picture stored in the block memory 210. Specifically, the intra-prediction unit 216 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, chrominance values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 220.

[0319] Furthermore, if an intra-prediction mode that references a luminance block is selected in the intra-prediction of a color difference block, the intra-prediction unit 216 may predict the color difference component of the current block based on the luminance component of the current block.

[0320] Furthermore, if the information decoded from the encoded bitstream indicates the application of PDPC, the intra-prediction unit 216 corrects the pixel value after intra-prediction based on the gradient of the reference pixels in the horizontal / vertical directions.

[0321] [Inter Prediction Unit] The inter-prediction unit 218 predicts the current block by referring to a reference picture stored in the frame memory 214. Prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 218 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) decoded from the encoded bitstream (e.g., prediction parameters output from the entropy decoding unit 202), and outputs the inter-prediction signal to the prediction control unit 220.

[0322] If the information decoded from the encoded bitstream indicates that the OBMC mode should be applied, the inter-prediction unit 218 generates an inter-prediction signal using not only the motion information of the current block obtained by motion search, but also the motion information of the adjacent block.

[0323] Furthermore, if the information decoded from the encoded bitstream indicates that FRUC mode should be applied, the inter-prediction unit 218 derives motion information by performing a motion search according to the pattern matching method (bilateral matching or template matching) decoded from the encoded stream. Then, the inter-prediction unit 218 performs motion compensation (prediction) using the derived motion information.

[0324] Furthermore, when the BIO mode is applied, the inter-prediction unit 218 derives motion vectors based on a model that assumes uniform linear motion. Also, if the information decoded from the encoded bitstream indicates that the affine motion compensation prediction mode should be applied, the inter-prediction unit 218 derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks.

[0325] [MV Derivation > Normal Inter Mode] If the information decoded from the encoded bitstream indicates that the normal inter mode should be applied, the inter-prediction unit 218 derives a motion vector (MV) based on the information decoded from the encoded stream and uses that MV to perform motion compensation (prediction).

[0326] Figure 45 is a flowchart showing an example of inter prediction in the normal inter mode in the decoding device 200.

[0327] The inter-prediction unit 218 of the decoding device 200 performs motion compensation for each block. Based on information such as the MVs of multiple decoded blocks surrounding the current block in time or space, the inter-prediction unit 218 obtains multiple candidate MVs for the current block (step Ss_1). In other words, the inter-prediction unit 218 creates a list of candidate MVs.

[0328] Next, the inter-prediction unit 218 extracts N candidate MVs (where N is an integer greater than or equal to 2) from among the multiple candidate MVs obtained in step Ss_1 as predicted motion vector candidates (also called predicted MV candidates) according to a predetermined priority order (step Ss_2). Note that this priority order may be predetermined for each of the N predicted MV candidates.

[0329] Next, the inter-prediction unit 218 decodes the predicted motion vector selection information from the input stream (i.e., the encoded bitstream), and uses the decoded predicted motion vector selection information to select one predicted MV candidate from among the N predicted MV candidates as the predicted motion vector (also called the predicted MV) for the current block (step Ss_3).

[0330] Next, the inter-prediction unit 218 decodes the differential MV from the input stream and derives the MV of the current block by adding the decoded differential MV (the difference value) to the selected predicted motion vector (step Ss_4).

[0331] Finally, the inter-prediction unit 218 generates a predicted image of the current block by performing motion compensation on the current block using the derived MV and the decoded reference picture (step Ss_5).
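Steps Ss_2 to Ss_4 amount to selecting one predictor from a short candidate list and adding the decoded difference to it. The following Python sketch illustrates this under simplifying assumptions: the candidate list from step Ss_1 is taken to be already sorted in priority order, MVs are integer (x, y) pairs, and the function name `derive_mv` and all sample values are hypothetical.

```python
from typing import List, Tuple

MV = Tuple[int, int]


def derive_mv(candidates: List[MV], n: int, selection_index: int, mv_difference: MV) -> MV:
    """Steps Ss_2 to Ss_4: pick a predicted MV and add the decoded differential MV."""
    # Ss_2: keep the first N candidates (list assumed to be in priority order).
    predictor_candidates = candidates[:n]
    # Ss_3: the decoded selection information picks one predicted MV.
    predictor = predictor_candidates[selection_index]
    # Ss_4: MV of the current block = predicted MV + differential MV.
    return (predictor[0] + mv_difference[0], predictor[1] + mv_difference[1])


candidate_list = [(4, -2), (3, 0), (8, 8)]  # result of Ss_1 (hypothetical values)
mv = derive_mv(candidate_list, n=2, selection_index=1, mv_difference=(-1, 2))
```

The derived `mv` would then be used for motion compensation in step Ss_5, which is not modeled here.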

[0332] [Prediction Control Unit] The prediction control unit 220 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the addition unit 208. Overall, the configuration, functions, and processing of the prediction control unit 220, intra-prediction unit 216, and inter-prediction unit 218 on the decoding device side may correspond to the configuration, functions, and processing of the prediction control unit 128, intra-prediction unit 124, and inter-prediction unit 126 on the encoding device side.

[0333] [Example of a decoding device implementation] Figure 46 is a block diagram showing an example implementation of the decoding device 200. The decoding device 200 includes a processor b1 and memory b2. For example, the multiple components of the decoding device 200 shown in Figure 41 are implemented by the processor b1 and memory b2 shown in Figure 46.

[0334] Processor b1 is a circuit that performs information processing and is a circuit that can access memory b2. For example, processor b1 is a dedicated or general-purpose electronic circuit that decodes encoded video (i.e., encoded bitstream). Processor b1 may be a processor such as a CPU. Alternatively, processor b1 may be a collection of multiple electronic circuits. Furthermore, for example, processor b1 may perform the roles of multiple components of the decoding device 200 shown in Figure 41, etc.

[0335] Memory b2 is a dedicated or general-purpose memory in which information for the processor b1 to decode the encoded bitstream is stored. Memory b2 may be an electronic circuit and may be connected to the processor b1. Memory b2 may also be included in the processor b1. Memory b2 may also be a collection of multiple electronic circuits. Memory b2 may also be a magnetic disk or an optical disk, or may be described as storage or a recording medium. Memory b2 may also be non-volatile memory or volatile memory.

[0336] For example, memory b2 may store a video image or an encoded bitstream. Alternatively, memory b2 may store a program for processor b1 to decode the encoded bitstream.

[0337] Furthermore, for example, memory b2 may play the role of an information storage component among the multiple components of the decoding device 200 shown in Figure 41, etc. Specifically, memory b2 may play the role of the block memory 210 and frame memory 214 shown in Figure 41. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory b2.

[0338] Furthermore, the decoding device 200 does not necessarily have to implement all of the components shown in Figure 41, etc., nor does it have to perform all of the processes described above. Some of the components shown in Figure 41, etc., may be included in other devices, and some of the processes described above may be performed by other devices.

[0339] [Definitions of each term] Each term may be defined as follows, for example:

[0340] A picture is an array of multiple luminance samples in a monochrome format, or an array of multiple luminance samples and two corresponding arrays of multiple color difference samples in a 4:2:0, 4:2:2, or 4:4:4 color format. A picture may be a frame or a field.

[0341] A frame is composed of a top field and a bottom field, where sample rows 0, 2, 4, ... originate from the top field and sample rows 1, 3, 5, ... originate from the bottom field.
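The relationship between a frame and its two fields can be illustrated with a short Python sketch. A frame is modeled here simply as a list of sample rows; the function names `split_fields` and `weave_fields` are hypothetical and not taken from the text.

```python
def split_fields(frame_rows):
    """Return (top_field, bottom_field) from a list of sample rows."""
    top_field = frame_rows[0::2]     # sample rows 0, 2, 4, ...
    bottom_field = frame_rows[1::2]  # sample rows 1, 3, 5, ...
    return top_field, bottom_field


def weave_fields(top_field, bottom_field):
    """Interleave two fields back into a frame (inverse of split_fields)."""
    frame_rows = []
    for i in range(len(top_field) + len(bottom_field)):
        source = top_field if i % 2 == 0 else bottom_field
        frame_rows.append(source[i // 2])
    return frame_rows


frame = [[0, 0], [1, 1], [2, 2], [3, 3]]
top, bottom = split_fields(frame)
```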

[0342] A slice is an integer number of coding tree units contained in one independent slice segment and all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any) within the same access unit.

[0343] A tile is a rectangular region of multiple coding tree blocks within a particular tile column and a particular tile row in a picture. A tile may also be a rectangular region of a frame that is intended to be decoded and encoded independently, although loop filters may still be applied across the edges of the tile.

[0344] A block is an MxN (M columns by N rows) array of multiple samples, or an MxN array of multiple transformation coefficients. A block may also be a square or rectangular region of multiple pixels consisting of one luminance matrix and two chrominance matrices.

[0345] A CTU (Coding Tree Unit) may be a coding tree block of multiple luminance samples of a picture having three sample arrays, or two corresponding coding tree blocks of multiple chrominance samples. Alternatively, a CTU may be a coding tree block of multiple samples of either a monochrome picture or a picture coded using three separate color planes and the syntax structures used to code the samples.

[0346] The superblock may consist of one or two mode information blocks, or it may be a 64x64 pixel square block that can be recursively divided into four 32x32 blocks and further divided.

[0347] [First aspect] Random access to a bitstream is achieved by inserting IRAP (Intra Random Access Point) pictures at random access points within the bitstream. In other words, in video coding, IRAP pictures provide random access points within the bitstream, allowing playback (decoding) from the IRAP picture even in the middle of the bitstream. That is, an IRAP picture is a randomly accessible picture that can correctly decode the bitstream even if the previous picture is unavailable.

[0348] Furthermore, for an IRAP picture, the picture that precedes it in the output order is called the leading picture, and the picture that follows it in the output order is called the trailing picture.

[0349] More specifically, a trailing picture is a picture that is output after the IRAP picture in the output order and also follows the associated IRAP picture in the decoding order. Since a trailing picture does not refer to any reference picture located before the associated IRAP picture in the output order, it can be decoded once the associated IRAP picture has been decoded.

[0350] On the other hand, a leading picture is a picture that is output before the IRAP picture in the output order but follows the associated IRAP picture in the decoding order. When decoded, a leading picture usually refers to a reference picture located before the associated IRAP picture in the output order, so it cannot be decoded even if the randomly accessed IRAP picture has been decoded. For this reason, in typical operation, when the decoding device 200 randomly accesses the bitstream and starts decoding from the IRAP picture, the leading pictures of that IRAP picture are skipped, and only the trailing pictures of that IRAP picture are decoded and output.
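The distinction between leading and trailing pictures depends only on where a picture sits relative to the IRAP picture in the decoding order and in the output order. The following Python sketch classifies pictures on that basis. It is a simplified illustration that considers a single IRAP picture; the `Picture` type, the `classify` function, and all order values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Picture:
    name: str
    decode_order: int
    output_order: int


def classify(picture: Picture, irap: Picture) -> str:
    """Classify a picture relative to one IRAP picture."""
    if picture.decode_order == irap.decode_order:
        return "irap"
    if picture.decode_order < irap.decode_order:
        return "not associated"  # decoded before the IRAP picture
    # From here on, the picture follows the IRAP picture in decoding order.
    if picture.output_order < irap.output_order:
        return "leading"   # output before the IRAP picture
    return "trailing"      # output after the IRAP picture


irap = Picture("I", decode_order=1, output_order=4)
pictures = [
    Picture("P", decode_order=0, output_order=0),
    Picture("A", decode_order=2, output_order=2),
    Picture("B", decode_order=3, output_order=5),
]
kinds = [classify(p, irap) for p in pictures]
```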

[0351] In this embodiment of video coding, as with HEVC, there are concepts of IRAP picture, leading picture, and trailing picture.

[0352] However, HEVC has a constraint that the leading picture of an IRAP picture must be decoded before all of its trailing pictures in the decoding order.

[0353] This constraint in HEVC becomes excessive for interlaced content. An example of such an excessive constraint is illustrated in Figure 47.

[0354] Figure 47 shows an example of the encoding structure of interlaced content. Figure 47 shows the encoding structure of interlaced content, where each field is encoded within each access unit.

[0355] In Figure 47, the gray pictures IDR0 and I8 represent IRAP pictures. The numbers in the pictures indicate the decoding order (encoding order), and the arrows indicate the reference destination. IDR stands for Instantaneous Decoding Refresh, meaning that all subsequent pictures in the decoding order are decodable. The trailing pictures of IDR0 are B1 to B7. The trailing pictures of I8 are B9 and B14 to B19, and the leading pictures of I8 are B10 to B13. The letter B indicates that bidirectional referencing is possible.

[0356] As shown in Figure 47, in an encoding structure where each field is encoded within each access unit and the content is interlaced, the bottom field is expected to be decoded, in the decoding order, after the top field with which it constitutes a single frame. For example, in the example shown in Figure 47, B9 is expected to be decoded after I8 in the decoding order.

[0357] However, I8 is the IRAP top field picture, and B9 is the trailing bottom field picture. In other words, in HEVC, due to the constraint mentioned above, decoding B9, which is the trailing bottom field picture, immediately after I8, which is the IRAP top field picture, is not permitted, because B9 would be decoded before the leading pictures of I8.

[0358] Thus, due to the constraints mentioned above, HEVC prohibits the encoding structure shown in Figure 47. Therefore, it becomes necessary to circumvent the above constraints by using a regular I-picture instead of an IRAP picture for I8, which may reduce encoding efficiency. Furthermore, since control using IRAP pictures cannot be performed when performing random access during decoding, the processing may become more complex.

[0359] Therefore, in the first embodiment, a method will be described that modifies (relaxes) the constraints in HEVC described above, enabling encoding and decoding with the encoding structure shown in Figure 47.

[0360] More specifically, in this embodiment, the above-mentioned constraint may be relaxed to a constraint that only one trailing picture may be decoded between the associated IRAP picture and the leading pictures in the decoding order. With this relaxed constraint, at most one trailing picture associated with an IRAP picture can be decoded before all the leading pictures associated with that IRAP picture in the decoding order. In this embodiment, by relaxing the constraint in HEVC in this way, the coding structure shown in Figure 47 can be used.
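The difference between the HEVC constraint and the relaxed constraint can be expressed as a check on the types of the pictures that follow one IRAP picture in the decoding order. The Python sketch below is one possible reading, not a normative definition: it assumes that any early trailing picture must come before the first leading picture and that no trailing picture may be interleaved among the leading pictures. The function name `order_is_allowed` and its parameter names are hypothetical.

```python
def order_is_allowed(types_in_decoding_order, max_early_trailing):
    """Check the leading/trailing order of the pictures following one IRAP picture.

    types_in_decoding_order: list of "leading" / "trailing" strings.
    max_early_trailing: 0 models the HEVC constraint, 1 the relaxed constraint.
    """
    leading_positions = [
        i for i, kind in enumerate(types_in_decoding_order) if kind == "leading"
    ]
    if not leading_positions:
        return True  # no leading pictures, so the constraint cannot be violated
    first, last = leading_positions[0], leading_positions[-1]
    # Trailing pictures decoded before the first leading picture.
    early = types_in_decoding_order[:first].count("trailing")
    # Trailing pictures interleaved among the leading pictures (assumed not allowed).
    interleaved = types_in_decoding_order[first:last + 1].count("trailing")
    return early <= max_early_trailing and interleaved == 0


# Pictures following I8 in Figure 47: B9 trailing, B10 to B13 leading, B14 to B19 trailing.
figure_47 = ["trailing"] + ["leading"] * 4 + ["trailing"] * 6
hevc_ok = order_is_allowed(figure_47, max_early_trailing=0)
relaxed_ok = order_is_allowed(figure_47, max_early_trailing=1)
```

With these inputs, the structure of Figure 47 is rejected under the HEVC constraint and accepted under the relaxed one, which matches the description above.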

[0361] The following describes an example of a decoding method that applies the constraints (relaxed constraints) of this embodiment when decoding is initiated from an IRAP picture.

[0362] Figure 48 is a flowchart showing an example of a decoding method performed by the decoding device 200 according to the first embodiment when decoding is started from an IRAP picture.

[0363] First, the entropy decoding unit 202 of the decoding device 200 decodes the IRAP picture in the middle of the bitstream (S10).

[0364] Next, the entropy decoding unit 202 checks whether the type of the next target picture to be decoded is a trailing picture (S11).

[0365] In step S11, if the next target picture is a trailing picture (yes in S11), the entropy decoding unit 202 decodes only that target picture, i.e., its trailing picture (S12). If the next target picture is not a trailing picture in step S11 (no in S11), the process proceeds to step S13.

[0366] Next, the entropy decoding unit 202 skips decoding all leading pictures associated with the IRAP picture that was decoded in step S10 (S13). This is because none of these leading pictures can be decoded, since the reference pictures located before the IRAP picture decoded in step S10 in the output order have not been decoded.

[0367] Next, the entropy decoding unit 202 decodes all remaining pictures in the bitstream associated with the IRAP picture decoded in step S10 (S14).
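The flow of steps S11 to S14 can be sketched in Python as follows. The sketch models each picture as a (name, type) pair and only records which pictures would be decoded; actual decoding is not performed. The function name `random_access_decode` is hypothetical, and the picture list reproduces the pictures following I8 in Figure 47.

```python
def random_access_decode(pictures_after_irap):
    """Return the names of the pictures decoded after the IRAP picture (S11 to S14).

    pictures_after_irap: list of (name, kind) tuples in decoding order,
    where kind is "leading" or "trailing".
    """
    decoded = []
    remaining = list(pictures_after_irap)
    # S11/S12: if the very next picture is a trailing picture, decode only that one.
    if remaining and remaining[0][1] == "trailing":
        decoded.append(remaining.pop(0)[0])
    # S13: skip every leading picture associated with the IRAP picture.
    # S14: decode all remaining pictures.
    decoded.extend(name for name, kind in remaining if kind != "leading")
    return decoded


# Pictures following I8 in Figure 47.
stream = (
    [("B9", "trailing")]
    + [(f"B{i}", "leading") for i in range(10, 14)]
    + [(f"B{i}", "trailing") for i in range(14, 20)]
)
decoded_names = random_access_decode(stream)
```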

[0368] The above describes the processing of the decoding device 200 as an example, but the processing of the encoding device 100 is similar. The difference is that, because the IRAP picture is encoded so that it can be randomly accessed during decoding, encoding starts from the beginning of the bitstream rather than from the middle. As a result, the entropy encoding unit 110 of the encoding device 100 must perform the encoding process corresponding to step S13 without skipping it. This is because the reference pictures located before the IRAP picture encoded in step S10 in the output order are also encoded. Other than that, the only difference is whether the necessary signals are encoded into a stream or decoded from a stream, and the encoding device and decoding device are basically the same.

[0369] [Effects of the first embodiment] According to the first embodiment, the encoding device 100 and encoding method may be able to encode randomly accessible pictures with a more efficient encoding structure. Furthermore, by encoding with a more efficient encoding structure, the encoding device 100 can reduce the processing load of searching for randomly accessible pictures during decoding, thereby potentially improving processing efficiency.

[0370] Furthermore, according to the first embodiment, the decoding device 200 and decoding method may be able to decode randomly accessible pictures using a more efficient coding structure. In addition, by decoding with a more efficient coding structure, the decoding device 200 can reduce the processing load of searching for randomly accessible pictures during decoding, and thus potentially improve processing efficiency.

[0371] Therefore, in the first embodiment, when each field is encoded within each access unit and the content is interlaced, it may be possible to use an encoding structure that is more efficient in encoding.

[0372] [Second aspect] In the following section, we will describe an example of a second aspect where HEVC constraints or modified HEVC constraints are applied depending on the content of the bitstream.

[0373] When interlaced content is coded by encoding each field within each access unit, information indicating that the content is interlaced needs to be encoded into the bitstream. One way to encode this information is to use a field_seq_flag. This field_seq_flag is typically signaled within a Sequence Parameter Set (SPS). In other words, field_seq_flag can be used as a flag indicating whether the content is interlaced with one field per access unit.

[0374] Here, if field_seq_flag is 1, it indicates that the picture contained in each access unit in the bitstream is a field picture. In this case, in this embodiment, a constraint is applied that only one trailing picture is allowed to be decoded between the associated IRAP picture and the leading pictures in the decoding order.

[0375] On the other hand, if field_seq_flag is 0, the same constraint as in HEVC applies. That is, the constraint applies that the leading pictures of an IRAP picture must be decoded before all of its trailing pictures in the decoding order. In other words, if field_seq_flag is 0, the constraint applies that no trailing picture can be decoded before the leading pictures associated with the IRAP picture in the decoding order.

[0376] In this way, by using field_seq_flag, when field_seq_flag is 1, the encoding structure shown in Figure 47 can be used.

[0377] In summary, the limitations in this embodiment are as follows:

[0378] (1) If field_seq_flag is 0 and the next picture to be decoded is a leading picture, all leading pictures are decoded in the decoding order, preceding all trailing pictures associated with the same IRAP picture.

[0379] (2) If field_seq_flag is 1, then at most one trailing picture can be decoded ahead of all the leading pictures associated with the same IRAP picture in the decoding order.

[0380] The following describes an example of a decoding method that applies the constraints of this embodiment when decoding is initiated from an IRAP picture.

[0381] Figure 49 is a flowchart showing an example of a decoding method performed by the decoding device 200 according to the second aspect of the embodiment when decoding is started from an IRAP picture.

[0382] First, the entropy decoding unit 202 of the decoding device 200 decodes the IRAP picture in the middle of the bitstream (S20).

[0383] Next, the entropy decoding unit 202 checks the value of field_seq_flag contained in the bitstream and verifies whether field_seq_flag is equal to 1 (S21).

[0384] In step S21, if field_seq_flag is equal to 1 (yes in S21), the entropy decoding unit 202 checks whether the type of the next target picture to be decoded is a trailing picture (S22). If field_seq_flag is not equal to 1 in step S21 (no in S21), the entropy decoding unit 202 proceeds to step S25, which will be described later.

[0385] In step S22, if the next target picture is a trailing picture (yes in S22), the entropy decoding unit 202 decodes only that target picture, i.e., its trailing picture (S23). If the next target picture is not a trailing picture in step S22 (no in S22), the entropy decoding unit 202 proceeds to step S25, which will be described later.

[0386] Next, the entropy decoding unit 202 assigns the trailing picture decoded in step S23 to the second field of the IRAP picture (S24).

[0387] To illustrate using Figure 47 as an example, let's say the IRAP picture decoded in step S20 is I8, and the trailing picture decoded in step S23 is B9. In this case, I8 is the IRAP top field picture, and B9 is the trailing bottom field picture. Therefore, in relation to I8, B9 becomes the IRAP bottom field picture, and in step S24, B9 is assigned to the second field of the IRAP picture.

[0388] In other words, if the IRAP picture decoded in step S20 is the top field picture, the entropy decoding unit 202 will assign the trailing picture decoded in step S23 as the bottom field picture in the same frame containing the IRAP picture. On the other hand, if the IRAP picture decoded in step S20 is the bottom field picture, the entropy decoding unit 202 will assign the trailing picture decoded in step S23 as the top field picture in the same frame containing the IRAP picture.

[0389] Next, the entropy decoding unit 202 skips decoding all leading pictures associated with the IRAP picture that was decoded in step S20 (S25). This is because none of these leading pictures can be decoded, since the reference pictures located before the IRAP picture decoded in step S20 in the output order have not been decoded.

[0390] Next, the entropy decoding unit 202 decodes all remaining pictures in the bitstream associated with the IRAP picture decoded in step S20 (S26).
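Steps S21 to S26 extend the random access flow with the field_seq_flag check and the pairing of the early trailing picture with the IRAP picture. The following Python sketch illustrates this under simplifying assumptions: pictures are (name, type) pairs, field parity is the string "top" or "bottom", and only the decoding decisions are recorded. The function name `random_access_decode_fields` and its return format are hypothetical.

```python
def random_access_decode_fields(irap, pictures_after_irap, field_seq_flag):
    """Sketch of steps S21 to S26.

    irap: (name, parity) with parity "top" or "bottom".
    pictures_after_irap: list of (name, kind) tuples in decoding order.
    Returns (decoded_names, second_field); second_field is None or (name, parity).
    """
    decoded = []
    second_field = None
    remaining = list(pictures_after_irap)
    # S21/S22: only when field_seq_flag is 1, check for a trailing picture
    # directly after the IRAP picture.
    if field_seq_flag == 1 and remaining and remaining[0][1] == "trailing":
        name = remaining.pop(0)[0]
        decoded.append(name)  # S23: decode only that trailing picture
        opposite = "bottom" if irap[1] == "top" else "top"
        second_field = (name, opposite)  # S24: second field of the IRAP frame
    # S25: skip leading pictures; S26: decode all remaining pictures.
    decoded.extend(name for name, kind in remaining if kind != "leading")
    return decoded, second_field


# Pictures following I8 (top field) in Figure 47.
stream = (
    [("B9", "trailing")]
    + [(f"B{i}", "leading") for i in range(10, 14)]
    + [(f"B{i}", "trailing") for i in range(14, 20)]
)
decoded_names, second_field = random_access_decode_fields(("I8", "top"), stream, 1)
```

For the Figure 47 example, B9 is paired with I8 as the bottom field of the same frame, consistent with the description of step S24.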

[0391] The above describes the processing of the decoding device 200 as an example, but the processing of the encoding device 100 is similar. The difference is that, as explained in the first embodiment, the IRAP picture is encoded so that it can be randomly accessed during decoding, so encoding starts from the beginning of the bitstream rather than from the middle. As a result, the entropy encoding unit 110 of the encoding device 100 must perform the encoding process corresponding to step S25 without skipping it. This is because the reference pictures located before the IRAP picture encoded in step S20 in the output order are also encoded. Other than that, the only difference is whether the necessary signals are encoded into a stream or decoded from a stream, and the encoding device and decoding device are basically the same.

[0392] [Effects of the second embodiment] According to the second embodiment, when each field is encoded within each access unit and the content is interlaced, relaxed constraints of HEVC are applied, and when interlacing is not performed, the same HEVC constraints as before are applied. This makes it possible to use a more efficient encoding structure when interlacing is performed.

[0393] More specifically, according to the second embodiment, the encoding device 100 and encoding method can apply relaxed HEVC constraints to interlaced content simply by writing a flag value to the SPS indicating whether it is interlaced when encoding a randomly accessible picture. This makes it possible to use IRAP pictures in the middle of the stream in interlaced content and to encode with an encoding structure that can improve encoding efficiency by omitting the encoding of syntax at a lower level than the NAL unit header.

[0394] Furthermore, the encoding device 100 can apply relaxed HEVC constraints to interlaced content simply by writing a flag value to the SPS indicating whether it is interlaced. This reduces the processing load of searching for randomly accessible pictures during decoding, potentially improving processing efficiency.

[0395] Furthermore, according to the second embodiment, the decoding device 200 and decoding method may be able to decode randomly accessible pictures using a more efficient encoding structure. In addition, by decoding with a more efficient encoding structure, the decoding device 200 can reduce the processing load of searching for randomly accessible pictures during decoding, thus potentially improving processing efficiency. On the other hand, with HEVC, it was not possible to search for randomly accessible pictures during decoding without performing processing such as checking the syntax at a lower level than the NAL unit header, which resulted in a higher processing load compared to this embodiment.

[0396] In the first and second embodiments, the encoding structure of interlaced content was explained using the example of the IRAP picture being the top-field picture in Figure 47, but it is not limited to this. The same can be said even if the IRAP picture is the bottom-field picture, as shown in Figure 50. Here, Figure 50 is a diagram showing another example of the encoding structure of interlaced content. The same symbols and numbers are used for elements similar to those in Figure 47, so their explanation is omitted.

[0397] [Modification] The second aspect described above explains the case where the constraints in HEVC are relaxed when field_seq_flag is 1, but it is not limited to this. The constraints in HEVC may also be removed when field_seq_flag is 1.

[0398] In other words, if field_seq_flag indicates 1, there may be no constraint on the order of the leading pictures and the trailing pictures. More specifically, an encoding device for encoding an image comprises a circuit and a memory connected to the circuit, wherein the circuit encodes the image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order. When the circuit encodes the image, if a flag indicating whether the content is interlaced content encoded with one field per access unit indicates 1, the circuit may encode each of the plurality of trailing pictures and each of the plurality of leading pictures without any constraint on the encoding order. Similarly, a decoding device for decoding an image comprises a circuit and a memory connected to the circuit, wherein the circuit decodes the image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order. When decoding the image, if the flag indicating whether the content is interlaced content encoded with one field per access unit indicates 1, the circuit may decode each of the plurality of trailing pictures and each of the plurality of leading pictures without any constraint on the decoding order.

[0399] On the other hand, if field_seq_flag is 0, the same constraints as HEVC should apply. That is, if field_seq_flag is 0, the constraint should apply that every leading picture of an IRAP picture must precede, in decoding order, all of the trailing pictures of that IRAP picture. More specifically, if the flag indicating whether the content is interlaced content with one field per access unit is 0, then all of the trailing pictures should be encoded (or decoded) after the leading pictures in the encoding order (or decoding order).
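The three cases discussed so far (the HEVC constraint when field_seq_flag is 0, the relaxed constraint of the second embodiment, and the removed constraint of this modification) can be illustrated with a minimal sketch. The function name `order_is_allowed`, the `constraint_removed` switch, and the "L"/"T" labels are hypothetical names chosen for illustration and do not appear in this disclosure; the relaxed case is modeled as "at most one trailing picture before all leading pictures".

```python
from typing import List


def order_is_allowed(pictures: List[str], field_seq_flag: int,
                     constraint_removed: bool = False) -> bool:
    """Check the decoding order of the pictures that follow one IRAP picture.

    pictures: picture kinds in decoding order, "L" = leading, "T" = trailing.
    constraint_removed: hypothetical switch selecting the modification in
    which no ordering constraint applies when field_seq_flag is 1.
    """
    if field_seq_flag == 1 and constraint_removed:
        return True  # modification: any order is accepted

    leading = [i for i, kind in enumerate(pictures) if kind == "L"]
    if not leading:
        return True  # nothing to order against

    # Trailing pictures that appear before the last leading picture.
    early_trailing = [i for i, kind in enumerate(pictures[:leading[-1]])
                      if kind == "T"]

    if field_seq_flag == 0:
        # HEVC constraint: every trailing picture follows every leading picture.
        return not early_trailing

    # Relaxed constraint: at most one trailing picture, and it must come
    # before the first leading picture.
    return len(early_trailing) <= 1 and all(i < leading[0] for i in early_trailing)
```

For example, the order T, L, L, T is rejected when field_seq_flag is 0 but accepted under the relaxed constraint when field_seq_flag is 1.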

[0400] Furthermore, the flag is not limited to the field_seq_flag described in the second aspect. That is, it may be replaced with another name or parameter that has the same meaning, as long as it can be used as a flag to indicate whether the content is interlaced with one field per access unit. In addition, the information indicating field_seq_flag and the other name or parameter that has the same meaning may be encoded in a header area separate from the SPS, such as SEI (supplemental enhancement information).

[0401] [Example of an encoding device implementation] Figure 51 is a block diagram showing an implementation example of the encoding device 100 according to an embodiment. The encoding device 100 includes a circuit 160 and a memory 162. For example, the multiple components of the encoding device 100 shown in Figure 1 are implemented by the circuit 160 and memory 162 shown in Figure 51.

[0402] Circuit 160 is an information processing circuit and is a circuit that can access memory 162. For example, circuit 160 is a dedicated or general-purpose electronic circuit for encoding moving images. Circuit 160 may also be a processor such as a CPU. Alternatively, circuit 160 may be a collection of multiple electronic circuits. Furthermore, for example, circuit 160 may play the role of multiple components of the encoding device 100 shown in Figure 1, etc., excluding the component for storing information.

[0403] Memory 162 is a dedicated or general-purpose memory in which information for the circuit 160 to encode moving images is stored. Memory 162 may be an electronic circuit, or it may be connected to circuit 160. Memory 162 may also be included in circuit 160. Memory 162 may also be a collection of multiple electronic circuits. Memory 162 may also be a magnetic disk or an optical disk, or it may be described as storage or a recording medium. Memory 162 may also be a non-volatile memory or a volatile memory.

[0404] For example, memory 162 may store the video to be encoded, or it may store a bit sequence corresponding to the encoded video. Alternatively, memory 162 may store a program for circuit 160 to encode the video.

[0405] Furthermore, for example, memory 162 may play the role of an information storage component among the multiple components of the encoding device 100 shown in Figure 1, etc. Specifically, memory 162 may play the role of block memory 118 and frame memory 122 shown in Figure 1. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory 162.

[0406] Furthermore, it is not necessary for the encoding device 100 to implement all of the multiple components shown in Figure 1, etc., nor is it necessary for all of the above-described processes to be performed. Some of the multiple components shown in Figure 1, etc., may be included in other devices, and some of the above-described processes may be executed by other devices. In this way, the prediction process in inter-prediction mode is efficiently performed when some of the multiple components shown in Figure 1, etc., are implemented in the encoding device 100 and some of the above-described processes are performed.

[0407] The following shows an example of the operation of the encoding device 100, as shown in Figure 51.

[0408] Figure 52 is a flowchart showing an example of the operation of the encoding device 100 shown in Figure 51. For example, when encoding a video, the encoding device 100 shown in Figure 51 performs the operations shown in Figure 52.

[0409] Specifically, the circuit 160 of the encoding device 100 performs the following processing during operation. That is, first, when the circuit 160 encodes an image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, it encodes at most one of the plurality of trailing pictures before the plurality of leading pictures in encoding order (S311). Next, the circuit 160 encodes the plurality of trailing pictures, excluding the at most one trailing picture, after the plurality of leading pictures in encoding order (S312).
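Steps S311 and S312 can be sketched as a simple ordering routine. This is only an illustrative sketch: the function name `build_encoding_order` and the `allow_early_trailing` parameter are hypothetical, and the choice of the first listed trailing picture as the one encoded early is an assumption made for the example.

```python
from typing import List


def build_encoding_order(irap: str, leading: List[str], trailing: List[str],
                         allow_early_trailing: bool) -> List[str]:
    """Return picture names in encoding order for one IRAP picture.

    allow_early_trailing: hypothetical switch; when True, one trailing
    picture is placed before the leading pictures (S311).
    """
    # S311: at most one trailing picture goes before the leading pictures.
    early = trailing[:1] if allow_early_trailing else []
    # S312: the remaining trailing pictures go after the leading pictures.
    remaining = trailing[len(early):]
    return [irap] + early + list(leading) + remaining
```

With an IRAP picture "I", leading pictures L0 and L1, and trailing pictures T0 to T2, the sketch yields I, T0, L0, L1, T1, T2 when the early trailing picture is allowed, and I, L0, L1, T0, T1, T2 otherwise.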

[0410] This may enable the encoding device 100 to encode randomly accessible pictures using a more efficient encoding structure. Furthermore, by encoding with a more efficient encoding structure, the encoding device 100 can reduce the processing load required to search for randomly accessible pictures during decoding, thereby potentially improving processing efficiency.

[0411] [Example of a decoding device implementation] Figure 53 is a block diagram showing an example of an implementation of the decoding device 200 according to the embodiment. The decoding device 200 includes a circuit 260 and a memory 262. For example, the multiple components of the decoding device 200 shown in Figure 41 are implemented by the circuit 260 and memory 262 shown in Figure 53.

[0412] Circuit 260 is an information processing circuit and is a circuit that can access memory 262. For example, circuit 260 is a dedicated or general-purpose electronic circuit for decoding moving images. Circuit 260 may also be a processor such as a CPU. Alternatively, circuit 260 may be a collection of multiple electronic circuits. Furthermore, for example, circuit 260 may play the role of multiple components of the decoding device 200 shown in Figure 41, etc., excluding the component for storing information.

[0413] Memory 262 is a dedicated or general-purpose memory in which information for the circuit 260 to decode moving images is stored. Memory 262 may be an electronic circuit, or it may be connected to the circuit 260. Alternatively, memory 262 may be included in the circuit 260. Alternatively, memory 262 may be a collection of multiple electronic circuits. Alternatively, memory 262 may be a magnetic disk or an optical disk, or it may be described as storage or a recording medium. Alternatively, memory 262 may be a non-volatile memory or a volatile memory.

[0414] For example, memory 262 may store a bit sequence corresponding to an encoded video, or a video corresponding to a decoded bit sequence. Memory 262 may also store a program for circuit 260 to decode the video.

[0415] Furthermore, for example, memory 262 may play the role of an information storage component among the multiple components of the decoding device 200 shown in Figure 41, etc. Specifically, memory 262 may play the role of block memory 210 and frame memory 214 shown in Figure 41. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory 262.

[0416] Furthermore, it is not necessary for the decoding device 200 to implement all of the components shown in Figure 41, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 41, etc., may be included in other devices, and some of the processes described above may be performed by other devices. Then, motion compensation is efficiently performed in the decoding device 200 by implementing some of the components shown in Figure 41, etc., and performing some of the processes described above.

[0417] The following shows an example of the operation of the decoding device 200 shown in Figure 53. Figure 54 is a flowchart showing an example of the operation of the decoding device 200 shown in Figure 53. For example, when decoding a video image, the decoding device 200 shown in Figure 53 performs the operation shown in Figure 54.

[0418] Specifically, the circuit 260 of the decoding device 200 performs the following processing during operation. That is, first, when the circuit 260 decodes an image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, it decodes at most one of the plurality of trailing pictures before the plurality of leading pictures in the decoding order (S411). Next, the circuit 260 decodes the plurality of trailing pictures, excluding the at most one trailing picture, after the plurality of leading pictures in the decoding order (S412).

[0419] This may allow the decoding device 200 to use a more efficient encoding structure when decoding randomly accessible pictures. Furthermore, by decoding with a more efficient encoding structure, the decoding device 200 can reduce the processing load of searching for randomly accessible pictures during decoding, thereby potentially improving processing efficiency.

[0420] [Supplement] Furthermore, the encoding device 100 and decoding device 200 in this embodiment may be used as an image encoding device and an image decoding device, respectively, or as a video encoding device and a video decoding device. Alternatively, the encoding device 100 and decoding device 200 may be used as inter-prediction devices (inter-screen prediction devices), respectively.

[0421] In other words, the encoding device 100 and the decoding device 200 may correspond only to the inter-prediction unit (inter-screen prediction unit) 126 and the inter-prediction unit (inter-screen prediction unit) 218, respectively. Other components such as the transform unit 106 and the inverse transform unit 206 may be included in other devices.

[0422] Furthermore, in this embodiment, each component may be implemented by being composed of dedicated hardware or by executing a software program suitable for each component. Each component may also be implemented by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory.

[0423] Specifically, each of the encoding device 100 and the decoding device 200 may include processing circuitry and a storage device electrically connected to and accessible from the processing circuitry. For example, the processing circuitry corresponds to circuit 160 or 260, and the storage device corresponds to memory 162 or 262.

[0424] The processing circuitry includes at least one of dedicated hardware and a program execution unit, and performs processing using the storage device. Furthermore, if the processing circuitry includes a program execution unit, the storage device stores the software program executed by that program execution unit.

[0425] Here, the software that implements the encoding device 100 or decoding device 200, etc., in this embodiment is the following program.

[0426] In other words, this program may cause a computer to perform an encoding method for encoding an image, which encodes the image according to an encoding structure including an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order, and when encoding the image, encodes at most one of the plurality of trailing pictures before the plurality of leading pictures in encoding order, and encodes the plurality of trailing pictures excluding the at most one trailing picture after the plurality of leading pictures in encoding order.

[0427] Alternatively, this program may cause a computer to perform a decoding method for decoding an image, which decodes the image according to an encoding structure including an IRAP picture, a plurality of leading pictures output before the IRAP picture in output order, and a plurality of trailing pictures output after the IRAP picture in output order, and when decoding the image, decodes at most one of the plurality of trailing pictures before the plurality of leading pictures in decoding order, and decodes the plurality of trailing pictures excluding the at most one trailing picture after the plurality of leading pictures in decoding order.

[0428] Furthermore, each component may be a circuit, as described above. These circuits may form a single circuit as a whole, or they may be separate circuits. Also, each component may be implemented using a general-purpose processor, or it may be implemented using a dedicated processor.

[0429] Furthermore, a process performed by one component may be performed by another component. Also, the order in which processes are executed may be changed, and multiple processes may be executed in parallel. Additionally, the encoding / decoding device may comprise an encoding device 100 and a decoding device 200.

[0430] The first and second ordinal numbers used in the explanation may be changed as appropriate. Furthermore, ordinal numbers may be newly assigned to or removed from the constituent elements.

[0431] Although aspects of the encoding device 100 and the decoding device 200 have been described above based on the embodiments, the aspects of the encoding device 100 and the decoding device 200 are not limited to these embodiments. Without departing from the spirit of this disclosure, configurations obtained by applying to these embodiments various modifications that a person skilled in the art could conceive of, and configurations constructed by combining components from different embodiments, may also be included within the scope of the aspects of the encoding device 100 and the decoding device 200.

[0432] One or more embodiments disclosed herein may be implemented in combination with at least some of the other embodiments disclosed herein. Furthermore, some processes, some configurations of the apparatus, some syntax, etc., described in the flowcharts of one or more embodiments disclosed herein may be implemented in combination with the other embodiments.

[0433] [Implementation and Application] In each of the above embodiments, each functional or operational block can typically be implemented by an MPU (micro processing unit) and memory, etc. Furthermore, the processing performed by each functional block may be implemented as a program execution unit, such as a processor, that reads and executes software (programs) recorded on a recording medium such as ROM. This software may be distributed. This software may be recorded on various recording media such as semiconductor memory. It is also possible to implement each functional block using hardware (dedicated circuits). Various combinations of hardware and software can be employed.

[0434] The processing described in each embodiment may be implemented by centralized processing using a single device (system), or by distributed processing using multiple devices. Furthermore, the processor executing the above program may be one or multiple. In other words, centralized processing may be performed, or distributed processing may be performed.

[0435] The embodiments of this disclosure are not limited to those described above, and various modifications are possible, which are also included within the scope of the embodiments of this disclosure.

[0436] Furthermore, here we will describe application examples of the video encoding method (image encoding method) or video decoding method (image decoding method) shown in each of the above embodiments, and various systems for implementing these application examples. Such systems may be characterized by having an image encoding device using the image encoding method, an image decoding device using the image decoding method, or an image encoding and decoding device that includes both. Other configurations of such systems can be appropriately modified as needed.

[0437] [Usage example] Figure 55 shows the overall configuration of a suitable content supply system ex100 for realizing a content distribution service. The service area for the communication service is divided into cells of a desired size, and within each cell, there are base stations ex106, ex107, ex108, ex109, and ex110, which are fixed radio stations in the illustrated example.

[0438] In this content supply system ex100, various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 are connected to the internet ex101 via an internet service provider ex102 or a communication network ex104, and base stations ex106 to ex110. The content supply system ex100 may also be configured to connect any combination of the above devices. In various implementations, the devices may be directly or indirectly interconnected via a telephone network or short-range wireless, etc., without going through base stations ex106 to ex110. Furthermore, the streaming server ex103 may be connected to various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 via the internet ex101, etc. The streaming server ex103 may also be connected to terminals in a hotspot on an airplane ex117 via satellite ex116.

[0439] Note that instead of base stations ex106 to ex110, wireless access points or hotspots may be used. Also, streaming server ex103 may be connected directly to the communication network ex104 without going through the internet ex101 or internet service provider ex102, or it may be connected directly to the airplane ex117 without going through satellite ex116.

[0440] The camera ex113 is a device capable of taking still images and videos, such as a digital camera. The smartphone ex115 is a smartphone, mobile phone, or PHS (Personal Handy-phone System) that supports mobile communication systems such as 2G, 3G, 3.9G, 4G, and the upcoming 5G.

[0441] Home appliance ex114 refers to appliances such as refrigerators or equipment included in household fuel cell cogeneration systems.

[0442] In the content supply system ex100, live streaming becomes possible when a terminal with a shooting function is connected to the streaming server ex103 via a base station ex106 or the like. In live streaming, a terminal (such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, a smartphone ex115, and a terminal inside an airplane ex117) may perform the encoding process described in each of the above embodiments on still images or video content captured by a user using the terminal, or it may multiplex the video data obtained by encoding with sound data encoded from the sound corresponding to the video, and then transmit the obtained data to the streaming server ex103. In other words, each terminal functions as an image encoding device according to one aspect of this disclosure.

[0443] Meanwhile, the streaming server ex103 streams the transmitted content data to clients that request it. The client is a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, a smartphone ex115, or a terminal on an airplane ex117, etc., that is capable of decoding the encoded data. Each device that receives the distributed data may decode and play back the received data. That is, each device may function as an image decoding device according to one aspect of this disclosure.

[0444] [Distributed Processing] Furthermore, the streaming server ex103 may consist of multiple servers or computers that distribute data processing, recording, and distribution. For example, the streaming server ex103 may be implemented by a CDN (Content Delivery Network), where content delivery is achieved through a network connecting numerous edge servers distributed worldwide. In a CDN, a physically close edge server can be dynamically assigned depending on the client. Latency can be reduced by caching content on that edge server and delivering it from there. In addition, if some type of error occurs or the communication state changes due to increased traffic, processing can be distributed among multiple edge servers, the delivery entity can be switched to another edge server, or delivery can be continued by bypassing the failed part of the network, thus enabling high-speed and stable delivery.
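The edge server assignment and the switch to another edge server on failure can be illustrated as follows. This is a minimal sketch under simplifying assumptions: servers and clients are given as hypothetical 2D coordinates, "closest" means Euclidean distance, and the function name `pick_edge_server` is not taken from this disclosure. An actual CDN would typically use network-level measurements instead.

```python
import math
from typing import Dict, FrozenSet, Tuple

Point = Tuple[float, float]


def pick_edge_server(client: Point, servers: Dict[str, Point],
                     failed: FrozenSet[str] = frozenset()) -> str:
    """Return the name of the closest edge server that has not failed."""
    candidates = {name: pos for name, pos in servers.items()
                  if name not in failed}
    if not candidates:
        raise ValueError("no edge server available")
    # Euclidean distance stands in for whatever proximity metric is used.
    return min(candidates, key=lambda name: math.dist(client, candidates[name]))
```

When the closest server is listed in `failed`, the sketch falls back to the next closest one, which corresponds to switching the delivery entity to another edge server.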

[0445] Furthermore, beyond the distributed processing of the distribution itself, the encoding process of the captured data can be performed on each terminal, on the server side, or shared among them. For example, encoding generally involves two processing loops. In the first loop, the complexity or code amount of the image at the frame or scene level is detected. In the second loop, processing is performed to improve encoding efficiency while maintaining image quality. For example, if the terminal performs the first encoding process and the server that receives the content performs the second encoding process, it is possible to improve the quality and efficiency of the content while reducing the processing load on each terminal. In this case, if there is a request to receive and decode in near real time, the first encoded data from the terminal can be received and played back on other terminals, enabling more flexible real-time distribution.
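The two processing loops can be sketched in a simplified form. The complexity measure (sum of absolute differences between neighboring samples, plus one) and the proportional bit allocation are assumptions chosen only for illustration; the function names `first_loop` and `second_loop` are hypothetical, and this disclosure does not specify these details.

```python
from typing import List


def first_loop(frames: List[List[int]]) -> List[int]:
    """First loop: estimate a per-frame complexity value.

    Each frame is a flat list of sample values. The "+ 1" keeps every
    frame at a non-zero complexity so it still receives some bits.
    """
    return [1 + sum(abs(b - a) for a, b in zip(frame, frame[1:]))
            for frame in frames]


def second_loop(complexities: List[int], total_bits: float) -> List[float]:
    """Second loop: split the bit budget in proportion to complexity."""
    total = sum(complexities)
    return [total_bits * c / total for c in complexities]
```

In a split deployment, `first_loop` could run on the terminal and `second_loop` on the server that receives the content, which mirrors the division of work described above.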

[0446] Another example is the camera ex113, which extracts features (quantities of features or characteristics) from an image, compresses the data related to the features as metadata, and sends it to the server. The server performs compression according to the meaning (or importance of content) of the image, for example, by determining the importance of objects from the features and switching the quantization precision. Feature data is particularly effective in improving the accuracy and efficiency of motion vector prediction during further compression on the server. Alternatively, a simple encoding such as VLC (Variable Length Coding) may be performed on the terminal, and a more computationally intensive encoding such as CABAC (Context-Adaptive Binary Arithmetic Coding) may be performed on the server.

[0447] Another example is a scenario in a stadium, shopping mall, or factory where multiple video data sets of nearly identical scenes may exist, captured by multiple terminals. In such cases, the encoding process is distributed among the multiple terminals that captured the footage, along with other terminals and servers as needed, by assigning encoding tasks to each unit, for example, at the Group of Pictures (GOP) level, picture level, or tile level (a division of a picture). This reduces latency and enables more real-time performance.
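Assigning encoding tasks at the GOP level can be sketched as follows. The round-robin policy, the function name `assign_gops`, and the worker names are assumptions for illustration; any other assignment policy could be substituted.

```python
from typing import Dict, List, Tuple


def assign_gops(num_pictures: int, gop_size: int,
                workers: List[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Assign GOPs to workers in round-robin order.

    Each GOP is a half-open picture range (start, end). The last GOP may
    be shorter than gop_size.
    """
    assignments: Dict[str, List[Tuple[int, int]]] = {w: [] for w in workers}
    for index, start in enumerate(range(0, num_pictures, gop_size)):
        end = min(start + gop_size, num_pictures)
        assignments[workers[index % len(workers)]].append((start, end))
    return assignments
```

For 10 pictures, a GOP size of 4, and two workers, the first worker receives pictures 0 to 3 and 8 to 9, and the second receives pictures 4 to 7.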

[0448] Since multiple video data sets depict essentially the same scene, the server may manage and / or instruct the video data captured by each terminal to reference each other. Alternatively, the server may receive the encoded data from each terminal, change the reference relationships between the multiple data sets, or correct or replace the pictures themselves and re-encode them. This allows for the creation of a stream with improved quality and efficiency for each individual data set.

[0449] Furthermore, the server may transcode the video data to change its encoding method before distributing it. For example, the server may convert an MPEG-based encoding to a VP-based encoding (e.g., VP9), or convert H.264 to H.265, etc.

[0450] Thus, the encoding process can be performed by a terminal or one or more servers. Therefore, in the following, the terms "server" or "terminal" will be used to refer to the entity performing the processing, but some or all of the processing performed by the server may be performed by the terminal, and some or all of the processing performed by the terminal may be performed by the server. The same applies to the decoding process.

[0451] [3D, Multi-angle] It is becoming increasingly common to integrate and utilize images or videos of different scenes, or the same scene, captured from different angles, by multiple cameras ex113 and / or smartphones ex115, which are nearly synchronized with each other. Videos captured by each device can be integrated based on the relative positional relationship between the devices, or on areas where feature points contained in the video coincide.

[0452] The server may not only encode two-dimensional video but also encode still images automatically based on scene analysis of the video, or at a time specified by the user, and transmit them to the receiving terminal. Furthermore, if the server can obtain the relative positional relationship between the shooting terminals, it can generate a three-dimensional shape of the scene based not only on two-dimensional video but also on video of the same scene taken from different angles. The server may separately encode three-dimensional data generated by a point cloud or the like, or it may select or reconstruct video from video taken by multiple terminals to transmit to the receiving terminal based on the results of recognizing or tracking a person or object using the three-dimensional data.

[0453] In this way, users can enjoy scenes by arbitrarily selecting each video corresponding to each shooting terminal, or they can enjoy content in which a video from a selected viewpoint is extracted from 3D data reconstructed using multiple images or videos. Furthermore, along with the video, sound is also collected from multiple different angles, and the server may multiplex the sound from a specific angle or space with the corresponding video and transmit the multiplexed video and sound.

[0454] In recent years, content that links the real world with a virtual world, such as Virtual Reality (VR) and Augmented Reality (AR), has also become popular. In the case of VR images, the server may create separate viewpoint images for the right and left eyes and perform encoding that allows referencing between the viewpoint images using Multi-View Coding (MVC), or it may encode them as separate streams without referencing each other. When decoding the separate streams, it is advisable to synchronize playback so that the virtual 3D space is reproduced according to the user's viewpoint.

[0455] In the case of AR images, the server may superimpose virtual object information from the virtual space onto camera information from the real space, based on its three-dimensional position or the user's viewpoint movement. The decoding device may acquire or retain the virtual object information and three-dimensional data, generate a two-dimensional image according to the user's viewpoint movement, and create superimposed data by smoothly stitching them together. Alternatively, the decoding device may send the user's viewpoint movement to the server in addition to the request for virtual object information. The server may create superimposed data according to the viewpoint movement received from the three-dimensional data held by the server, encode the superimposed data, and distribute it to the decoding device. Typically, superimposed data has an α value indicating transparency in addition to RGB, and the server may set the α value of parts other than the object created from the three-dimensional data to 0, and encode the data in a state where those parts are transparent. Alternatively, the server may set predetermined RGB values as the background, as in chroma keying, and generate data where parts other than the object are the background color. These RGB values may be determined in advance.
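The two ways of marking non-object parts (an α value of 0, or a predetermined background color as in chroma keying) can be illustrated per pixel. This is a sketch under assumed conventions: 8-bit channels, α in the range 0 to 255, and hypothetical function names `composite` and `chroma_key`.

```python
from typing import Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def composite(overlay: RGBA, camera: RGB) -> RGB:
    """Blend one overlay pixel onto one camera pixel using its alpha value.

    An alpha of 0 leaves the camera pixel unchanged (fully transparent
    overlay); an alpha of 255 replaces it with the overlay color.
    """
    alpha = overlay[3] / 255
    return tuple(round(alpha * o + (1 - alpha) * c)
                 for o, c in zip(overlay[:3], camera))


def chroma_key(overlay: RGB, camera: RGB, key: RGB) -> RGB:
    """Show the camera pixel wherever the overlay equals the key color."""
    return camera if overlay == key else overlay
```

The α approach needs a fourth channel in the encoded data, while the chroma key approach only needs the key color to be known to the receiving side.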

[0456] Similarly, the decoding process of the distributed data can be performed by the client (e.g., a terminal), the server, or shared between them. For example, one terminal may send a reception request to the server, another terminal may receive the content corresponding to that request, decode it, and then transmit the decoded signal to a device with a display. By distributing the processing and selecting appropriate content regardless of the performance of the communication-capable terminals themselves, it is possible to play back data with high image quality. Another example is that while receiving large image data on a TV or similar device, a portion of the picture, such as tiles, may be decoded and displayed on the viewer's personal terminal. This allows for sharing the overall picture while simultaneously allowing users to check their own area of responsibility or areas they wish to examine in more detail.

[0457] In situations where multiple short-range, medium-range, or long-range wireless communication networks are available both indoors and outdoors, it may be possible to seamlessly receive content using distribution system standards such as MPEG-DASH. Users may freely select and switch in real time between decoding devices or display devices, such as their own terminals or displays located indoors or outdoors. Furthermore, decoding can be performed while switching between the decoding terminal and the display terminal using the user's location information. This makes it possible to map and display information on a part of the wall or ground of an adjacent building with a displayable device embedded, while the user is moving to their destination. It is also possible to switch the bitrate of the received data based on the ease of access to the encoded data on the network, such as when the encoded data is cached on a server that can be accessed quickly from the receiving terminal, or copied to an edge server in the content delivery service.

[0458] [Scalable encoding] Regarding content switching, we will explain using a scalable stream compressed and encoded using the video encoding method described in each of the embodiments above, as shown in Figure 56. The server may have multiple streams with the same content but different qualities as individual streams, but it may also be configured to switch content by taking advantage of the temporal / spatial scalability of the stream realized by encoding it in layers, as shown in the figure. In other words, the decoding side can freely switch between decoding low-resolution and high-resolution content by deciding which layer to decode according to internal factors such as performance and external factors such as the state of the communication bandwidth. For example, if a user was watching a video on their smartphone ex115 while on the go and wants to continue watching it on a device such as an internet TV after returning home, the device only needs to decode the same stream up to a different layer, thus reducing the burden on the server.
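The decision of which layer to decode can be sketched as follows. The sketch assumes that each layer has a known cumulative bitrate (the bitrate needed to decode up to and including that layer), that these bitrates increase with the layer index, and that the base layer is always decoded. The function name `select_layer` and its parameters are hypothetical.

```python
from typing import List


def select_layer(cumulative_kbps: List[int], bandwidth_kbps: int,
                 max_supported_layer: int) -> int:
    """Return the highest layer index that can be decoded.

    max_supported_layer models the internal factor (device performance);
    bandwidth_kbps models the external factor (communication bandwidth).
    Layer 0 (the base layer) is returned when nothing higher fits.
    """
    chosen = 0
    for layer, rate in enumerate(cumulative_kbps):
        if layer <= max_supported_layer and rate <= bandwidth_kbps:
            chosen = layer
    return chosen
```

A smartphone and an internet TV could call the same function on the same stream with different arguments, so the server does not need to hold a separate stream per device.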

[0459] Furthermore, as described above, in addition to a configuration in which a picture is encoded for each layer and scalability is achieved by an enhancement layer above the base layer, the enhancement layer may include metadata based on, for example, image statistics. The decoding side may generate high-quality content by applying super-resolution to the picture in the base layer based on the metadata. Super-resolution may improve the signal-to-noise ratio while maintaining and / or increasing the resolution. The metadata may include information for identifying linear or nonlinear filter coefficients used in the super-resolution process, or information for identifying parameter values in the filtering process, machine learning, or least-squares operation used in the super-resolution process.

[0460] Alternatively, a configuration may be provided in which the picture is divided into tiles or the like according to the meaning of objects within the image. The decoding side decodes only a portion of the area by selecting the tiles to decode. Furthermore, by storing the attributes of the objects (people, cars, balls, etc.) and their positions in the image (coordinate positions within the same image, etc.) as metadata, the decoding side can identify the position of the desired object based on the metadata and determine the tile containing that object. For example, as shown in Figure 57, the metadata may be stored using a data storage structure different from the pixel data, such as the SEI (supplemental enhancement information) message in HEVC. This metadata indicates, for example, the position, size, or color of the main object.
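As a rough illustration of how the decoding side might map object metadata to tiles, the following sketch assumes a uniform tile grid and a bounding box given as (x, y, width, height) in pixels. The function name `tiles_for_object` and the metadata layout are hypothetical; the actual SEI syntax is not reproduced here.

```python
def tiles_for_object(bbox, picture_size, tile_grid):
    """Return the (row, col) indices of the tiles overlapped by an object.

    bbox         -- (x, y, width, height) of the object in pixels, from metadata
    picture_size -- (picture_width, picture_height) in pixels
    tile_grid    -- (columns, rows) of a uniform tile grid (an assumption)
    """
    x, y, w, h = bbox
    pic_w, pic_h = picture_size
    cols, rows = tile_grid
    tile_w = pic_w / cols
    tile_h = pic_h / rows
    # Clamp to the grid so a box touching the picture edge stays in range.
    first_col = int(x // tile_w)
    last_col = min(cols - 1, int((x + w - 1) // tile_w))
    first_row = int(y // tile_h)
    last_row = min(rows - 1, int((y + h - 1) // tile_h))
    return [(r, c)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# Hypothetical metadata: object attribute -> position in the picture.
metadata = {"ball": (900, 500, 200, 100)}

# Only the tiles returned here would need to be decoded to show the ball.
print(tiles_for_object(metadata["ball"], (1920, 1080), (4, 2)))
```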

[0461] Metadata may be stored in units consisting of multiple pictures, such as streams, sequences, or random access units. The decoding side can obtain information such as the time when a specific person appears in the video, and by combining the picture-level information with the time information, it can identify the picture in which the object exists and determine the object's position within the picture.

[0462] [Web page optimization] Figure 58 shows an example of a web page display screen on a computer ex111, etc. Figure 59 shows an example of a web page display screen on a smartphone ex115, etc. As shown in Figures 58 and 59, a web page may contain multiple linked images, which are links to image content, and their appearance may differ depending on the viewing device. When multiple linked images are visible on the screen, the display device (decoder) may display still images or I-pictures of each content as linked images until the user explicitly selects a linked image, or until a linked image approaches the center of the screen or the entire linked image is within the screen, or it may display a video such as a GIF animation using multiple still images or I-pictures, or it may receive only the base layer and decode and display the video.

[0463] When a linked image is selected by the user, the display device performs decoding, prioritizing the base layer, for example. If the HTML of the web page contains information indicating that the content is scalable, the display device may decode up to the enhancement layer. Furthermore, to ensure real-time performance, before selection or when the communication bandwidth is very limited, the display device can decode and display only forward-referenced pictures (I pictures, P pictures, and B pictures that only use forward references), thereby reducing the delay between the decoding time and the display time of the first picture (the delay from the start of content decoding to the start of display). In addition, the display device may deliberately ignore the reference relationships of the pictures and roughly decode all B pictures and P pictures using forward references, performing normal decoding as time passes and more pictures are received.
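The selection of forward-referenced pictures can be sketched as a filter over the pictures in decoding order. The `Picture` record and the function `low_delay_subset` are hypothetical. The sketch also adds one assumption that the text does not state: a picture is kept only if all of its reference pictures were themselves kept, so that the retained subset is decodable on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Picture:
    poc: int                                   # position in display (output) order
    pic_type: str                              # "I", "P" or "B"
    refs: list = field(default_factory=list)   # POCs of the reference pictures


def low_delay_subset(pictures_in_decoding_order):
    """Keep I pictures and pictures that reference only earlier-displayed pictures."""
    kept_pocs = set()
    result = []
    for pic in pictures_in_decoding_order:
        forward_only = all(ref < pic.poc for ref in pic.refs)
        refs_available = all(ref in kept_pocs for ref in pic.refs)  # added assumption
        if forward_only and refs_available:
            kept_pocs.add(pic.poc)
            result.append(pic)
    return result


stream = [
    Picture(0, "I"),
    Picture(4, "P", [0]),
    Picture(2, "B", [0, 4]),   # uses a backward reference (POC 4), so it is skipped
    Picture(1, "B", [0, 2]),   # forward only, but POC 2 was skipped
    Picture(3, "B", [2, 4]),   # uses a backward reference (POC 4)
    Picture(8, "P", [4]),
    Picture(6, "B", [4]),      # B picture that uses only a forward reference
]

print([pic.poc for pic in low_delay_subset(stream)])
```

Skipping the bidirectionally predicted pictures lowers the displayed frame rate in this sketch, which is the trade-off made to shorten the delay before the first picture is shown.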

[0464] [Autonomous driving] Furthermore, when transmitting and receiving still image or video data, such as 2D or 3D map information, for autonomous driving or driving assistance of a vehicle, the receiving terminal may receive metadata such as weather or construction information in addition to image data belonging to one or more layers, and decode these in association with each other. The metadata may belong to a layer, or it may simply be multiplexed with the image data.

[0465] In this case, since the vehicle, drone, or airplane containing the receiving terminal is moving, the receiving terminal can transmit its own location information, enabling seamless reception and decoding while switching between base stations ex106 to ex110. Furthermore, the receiving terminal can dynamically switch how much metadata is received or how much map information is updated, depending on the user's selection, the user's situation, and / or the state of the communication bandwidth.

[0466] The content delivery system ex100 allows the client to receive, decode, and play back encoded information transmitted by the user in real time.

[0467] [Distribution of personal content] Furthermore, the content delivery system ex100 allows unicast or multicast distribution of not only high-definition, long-duration content from video distribution companies, but also low-definition, short-duration content from individuals. The amount of such personal content is expected to continue to increase. To improve the quality of personal content, the server may perform editing before encoding. This can be achieved, for example, using the following configuration.

[0468] During shooting, or after shooting, the server performs recognition processing such as detecting shooting errors, searching for scenes, analyzing semantics, and detecting objects from the original image data or encoded data in real time. Based on the recognition results, the server manually or automatically edits the images, correcting out-of-focus or shaky images, deleting less important scenes such as those with lower brightness or out of focus compared to other pictures, emphasizing object edges, and altering color tones. The server then encodes the edited data based on the editing results. It is also known that viewership decreases if the shooting time is too long, so the server may automatically clip scenes with little movement, as well as less important scenes, based on the image processing results, to ensure that the content falls within a specific time range according to the shooting time. Alternatively, the server may generate and encode a digest based on the results of the semantic analysis of the scenes.

[0469] Personal content may, if left as is, contain elements that infringe copyright, moral rights, or portrait rights, and the scope of sharing may exceed the intended scope, which can be inconvenient for the individual. Therefore, for example, the server may intentionally blur the faces of people at the edges of the screen, the interior of a house, or the like before encoding. Furthermore, the server may recognize whether the face of a person other than those registered in advance appears in the image to be encoded, and if so, it may apply a mosaic effect or the like to the face. Alternatively, as pre-processing or post-processing for encoding, the user may specify a person or background area to be processed from the standpoint of copyright or the like. The server may replace the specified area with another image or blur it. In the case of a person, the server can track the person through the video and replace the image of the person's face.

[0470] Because there is strong demand for real-time viewing of personal content, which tends to be small in data volume, the decoder may, depending on the bandwidth, first receive the base layer with the highest priority and then decode and play it back. During this time, the decoder may receive the enhancement layer, and if the content is played back two or more times, such as when playback is looped, it may play back the high-quality video including the enhancement layer. With a stream that is scalably encoded in this way, it is possible to provide an experience in which the video is rough when unselected or at the beginning of viewing, but the stream is gradually refined and the image quality improves. In addition to scalable encoding, a similar experience can be provided when the rough stream played back the first time and a second stream encoded with reference to the first video are configured as a single stream.

[0471] [Other examples of practical applications] Furthermore, these encoding or decoding processes are generally performed in the LSI ex500 included in each terminal. The LSI (large scale integration circuitry) ex500 (see Figure 55) may be a single chip or may consist of multiple chips. Alternatively, video encoding or decoding software may be stored on a recording medium (such as a CD-ROM, flexible disk, or hard disk) that can be read by the computer ex111 or the like, and the encoding or decoding process may be performed using that software. In addition, if the smartphone ex115 has a camera, video data acquired by that camera may be transmitted. This video data may be data encoded by the LSI ex500 included in the smartphone ex115.

[0472] The LSI ex500 may also be configured to download and activate application software. In this case, the terminal first determines whether it supports the content encoding method or whether it has the capability to perform the specific service. If the terminal does not support the content encoding method or does not have the capability to perform the specific service, the terminal may download a codec or application software and then acquire and play the content.

[0473] Furthermore, at least one of the video encoding device (image encoding device) or the video decoding device (image decoding device) of each of the above embodiments can be incorporated not only into the content delivery system ex100 via the Internet ex101, but also into a digital broadcasting system. Because a digital broadcasting system transmits and receives multiplexed data, in which video and sound are multiplexed, on broadcast radio waves using a satellite or the like, it is better suited to multicast, whereas the content delivery system ex100 has a configuration that makes unicast easy. The encoding and decoding processes are nevertheless similar and can be applied in the same way.

[0474] [Hardware configuration] Figure 60 shows further details of the smartphone ex115 shown in Figure 55. Figure 61 shows an example configuration of the smartphone ex115. The smartphone ex115 includes an antenna ex450 for transmitting and receiving radio waves with the base station ex110, a camera unit ex465 capable of taking video and still images, and a display unit ex458 that displays video captured by the camera unit ex465 and data decoded from video received by the antenna ex450. The smartphone ex115 further includes an operation unit ex466, such as a touch panel, an audio output unit ex457, such as a speaker for outputting voice or sound, an audio input unit ex456, such as a microphone for inputting voice, a memory unit ex467 capable of storing captured video or still images, recorded audio, received video or still images, encoded data such as emails, or decoded data, and a slot unit ex464, which is an interface unit with SIM ex468 for identifying the user and authenticating access to various data, including the network. External memory may be used instead of the memory unit ex467.

[0475] The main control unit ex460, which comprehensively controls the display unit ex458, the operation unit ex466, and the like, is connected to the power supply circuit unit ex461, the operation input control unit ex462, the video signal processing unit ex455, the camera interface unit ex463, the display control unit ex459, the modulation / demodulation unit ex452, the multiplexing / demultiplexing unit ex453, the audio signal processing unit ex454, the slot unit ex464, and the memory unit ex467 via the synchronization bus ex470.

[0476] The power supply circuit unit ex461, when the power key is turned on by the user, starts up the smartphone ex115 into an operational state and supplies power to each component from the battery pack.

[0477] The smartphone ex115 performs processing such as voice calls and data communication under the control of the main control unit ex460, which has a CPU, ROM, RAM, etc. During a call, the audio signal picked up by the audio input unit ex456 is converted into a digital audio signal by the audio signal processing unit ex454, spread spectrum processing is performed by the modulation / demodulation unit ex452, digital-to-analog conversion and frequency conversion processing are performed by the transmission / reception unit ex451, and the resulting signal is transmitted via the antenna ex450. Received data is amplified, subjected to frequency conversion and analog-to-digital conversion processing, subjected to spectrum despreading by the modulation / demodulation unit ex452, converted into an analog audio signal by the audio signal processing unit ex454, and then output from the audio output unit ex457. In data communication mode, text, still image, or video data is sent to the main control unit ex460 via the operation input control unit ex462 based on an operation of, for example, the operation unit ex466 of the main unit, and similar transmission and reception processing is performed. When transmitting video, still images, or video and audio in data communication mode, the video signal processing unit ex455 compresses and encodes the video signal stored in the memory unit ex467 or the video signal input from the camera unit ex465 using the video encoding method shown in each of the above embodiments, and sends the encoded video data to the multiplexing / demultiplexing unit ex453. The audio signal processing unit ex454 encodes the audio signal picked up by the audio input unit ex456 while the camera unit ex465 is capturing video or still images, and sends the encoded audio data to the multiplexing / demultiplexing unit ex453. The multiplexing / demultiplexing unit ex453 multiplexes the encoded video data and the encoded audio data in a predetermined manner, modulation and conversion processing are performed by the modulation / demodulation unit (modulation / demodulation circuit unit) ex452 and the transmission / reception unit ex451, and the result is transmitted via the antenna ex450. The predetermined manner may be set in advance.

[0478] When receiving video attached to an email or chat, or video linked to a web page, etc., the multiplexing / demultiplexing unit ex453 demultiplexes the multiplexed data received via the antenna ex450, dividing it into a video data bitstream and an audio data bitstream, in order to decode the multiplexed data. It then supplies the encoded video data to the video signal processing unit ex455 and the encoded audio data to the audio signal processing unit ex454 via the synchronization bus ex470. The video signal processing unit ex455 decodes the video signal using a video decoding method corresponding to the video encoding method shown in each of the above embodiments, and the video or still image contained in the linked video file is displayed on the display unit ex458 via the display control unit ex459. The audio signal processing unit ex454 decodes the audio signal, and the audio is output from the audio output unit ex457. As real-time streaming is becoming increasingly widespread, audio playback may be socially inappropriate depending on the user's situation. Therefore, as an initial setting, it is preferable to play only the video data without playing the audio signal, and to play the audio in synchronization only when the user performs an action such as clicking on the video data.
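The text leaves the "predetermined manner" of multiplexing unspecified. Purely as an illustration, the following sketch assumes a simple packet layout (a 1-byte stream identifier, a 4-byte big-endian payload length, then the payload) and shows multiplexing and the corresponding demultiplexing into a video bitstream and an audio bitstream. The identifiers `VIDEO` and `AUDIO`, the header layout, and both function names are hypothetical and do not correspond to any particular container format.

```python
import struct

VIDEO, AUDIO = 0xE0, 0xC0          # hypothetical stream identifiers
_HEADER = struct.Struct(">BI")     # stream id (1 byte) + payload length (4 bytes)


def multiplex(packets):
    """Concatenate (stream_id, payload) pairs into one byte string."""
    return b"".join(_HEADER.pack(sid, len(payload)) + payload
                    for sid, payload in packets)


def demultiplex(data):
    """Split multiplexed data back into per-stream lists of payloads."""
    streams = {VIDEO: [], AUDIO: []}
    offset = 0
    while offset < len(data):
        sid, length = _HEADER.unpack_from(data, offset)
        offset += _HEADER.size
        payload = data[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated packet")
        streams.setdefault(sid, []).append(payload)
        offset += length
    return streams


muxed = multiplex([(VIDEO, b"v0"), (AUDIO, b"a0"), (VIDEO, b"v1")])
streams = demultiplex(muxed)
print(streams[VIDEO], streams[AUDIO])
```

In this sketch the video payloads would go to the video decoder and the audio payloads to the audio decoder, mirroring the split described in the paragraph above.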

[0479] Although the smartphone ex115 is used as an example here, three implementation forms are conceivable for a terminal: a transmitting / receiving terminal that has both an encoder and a decoder, a transmitting terminal that has only an encoder, and a receiving terminal that has only a decoder. In the explanation of the digital broadcasting system, multiplexed data in which audio data is multiplexed with video data is received or transmitted. However, the multiplexed data may also contain text data related to the video in addition to audio data. Furthermore, the video data itself may be received or transmitted instead of multiplexed data.

[0480] Although the main control unit ex460, which includes a CPU, has been described as controlling the encoding or decoding process, terminals often also have a GPU. Therefore, a configuration is also possible in which a wide area is processed at once by leveraging the GPU's performance, using memory shared by the CPU and GPU or memory whose addresses are managed so that it can be used in common. This can shorten the encoding time, ensure real-time performance, and achieve low latency. In particular, it is efficient to perform motion estimation, deblocking filtering, SAO (Sample Adaptive Offset), and transform / quantization processes at once on the GPU, rather than on the CPU, in units such as pictures.

[0481] [Industrial applicability] This disclosure can be used, for example, in television receivers, digital video recorders, car navigation systems, mobile phones, digital cameras, digital video cameras, video conferencing systems, or electronic mirrors.

[0482] [Explanation of Symbols]
100 Encoding device
102 Splitting unit
104 Subtraction unit
106 Transform unit
108 Quantization unit
110 Entropy encoding unit
112, 204 Inverse quantization unit
114, 206 Inverse transform unit
116, 208 Addition unit
118, 210 Block memory
120, 212 Loop filter unit
122, 214 Frame memory
124, 216 Intra prediction unit
126, 218 Inter prediction unit
128, 220 Prediction control unit
200 Decoding device
202 Entropy decoding unit
1201 Boundary determination unit
1202, 1204, 1206 Switch
1203 Filter determination unit
1205 Filter processing unit
1207 Filter characteristic determination unit
1208 Processing determination unit
a1, b1 Processor
a2, b2 Memory

Claims

1. An encoding device for encoding an image, comprising: a circuit; and a memory connected to the circuit, wherein, in operation, the circuit: encodes the image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; and generates a bitstream that contains the image and includes a flag, wherein, when encoding the image, according to the flag, at most one of the plurality of trailing pictures is encoded before the plurality of leading pictures in encoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are encoded after the plurality of leading pictures in encoding order, the flag indicates that the picture of each access unit in the bitstream is a field picture, and the circuit encodes each access unit in the bitstream with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.

2. An encoding method for encoding an image, wherein: the image is encoded according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; a bitstream that contains the image and includes a flag is generated; when encoding the image, according to the flag, at most one of the plurality of trailing pictures is encoded before the plurality of leading pictures in encoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are encoded after the plurality of leading pictures in encoding order; the flag indicates that the picture of each access unit in the bitstream is a field picture; and each access unit in the bitstream is encoded with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.

3. A decoding device for decoding an image, comprising: a circuit; and a memory connected to the circuit, wherein, in operation, the circuit: decodes the image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; and generates a bitstream that contains the image and includes a flag, wherein, when decoding the image, according to the flag, at most one of the plurality of trailing pictures is decoded before the plurality of leading pictures in decoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are decoded after the plurality of leading pictures in decoding order, the flag indicates that the picture of each access unit in the bitstream is a field picture, and the circuit decodes each access unit in the bitstream together with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.

4. A decoding method for decoding an image, wherein: the image is decoded according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; a bitstream that contains the image and includes a flag is generated; when decoding the image, according to the flag, at most one of the plurality of trailing pictures is decoded before the plurality of leading pictures in decoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are decoded after the plurality of leading pictures in decoding order; the flag indicates that the picture of each access unit in the bitstream is a field picture; and each access unit in the bitstream is decoded together with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.

5. A transmission method, wherein: an image is encoded according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; a bitstream that contains the image and includes a flag is generated; when encoding the image, according to the flag, at most one of the plurality of trailing pictures is encoded before the plurality of leading pictures in encoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are encoded after the plurality of leading pictures in encoding order; the flag indicates that the picture of each access unit in the bitstream is a field picture; each access unit in the bitstream is encoded with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture; and the bitstream is transmitted.

6. A transmission method for transmitting a bitstream, wherein the bitstream contains information for causing a decoding device to execute a decoding method, and the decoding method includes: decoding an image according to an encoding structure that includes an IRAP picture, a plurality of leading pictures that are output before the IRAP picture in output order, and a plurality of trailing pictures that are output after the IRAP picture in output order; and generating the bitstream, which contains the image and includes a flag, wherein, when decoding the image, according to the flag, at most one of the plurality of trailing pictures is decoded before the plurality of leading pictures in decoding order, and the plurality of trailing pictures, excluding the at most one trailing picture, are decoded after the plurality of leading pictures in decoding order, the flag indicates that the picture of each access unit in the bitstream is a field picture, and each access unit in the bitstream is decoded together with an SEI (supplemental enhancement information) message that includes information indicating that the picture of the access unit is a field picture.
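For illustration only, and not as part of the claims, the coding order recited above can be sketched as follows. The function names and picture labels are hypothetical. The sketch assumes that, when the flag indicates field pictures, the first trailing picture in the given list (for example, the field paired with the IRAP picture) is the one coded before the leading pictures, and that when the flag is not set every trailing picture follows the leading pictures.

```python
def coding_order(irap, leading, trailing, field_flag):
    """Build a coding order for one IRAP picture and its associated pictures.

    irap       -- label of the IRAP picture
    leading    -- leading pictures (output before the IRAP picture)
    trailing   -- trailing pictures (output after the IRAP picture)
    field_flag -- True when the flag indicates that pictures are field pictures
    """
    if field_flag and trailing:
        # Assumption: the first listed trailing picture is the early one.
        early, rest = list(trailing[:1]), list(trailing[1:])
    else:
        early, rest = [], list(trailing)
    return [irap] + early + list(leading) + rest


def trailing_before_leading(order, leading, trailing):
    """Count the trailing pictures coded before the last leading picture."""
    if not leading:
        return 0
    last_leading = max(order.index(p) for p in leading)
    return sum(1 for p in trailing if order.index(p) < last_leading)


leading = ["L0", "L1"]
trailing = ["T_paired_field", "T1", "T2"]

order = coding_order("IRAP", leading, trailing, field_flag=True)
print(order)
print(trailing_before_leading(order, leading, trailing))
```

The helper `trailing_before_leading` is only a convenience for checking that no more than one trailing picture precedes the leading pictures in a given order.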