Encoding device, decoding device, generating device, transmitting device, and non-transitory storage medium
By limiting the range of motion search or compensation in affine motion compensation, the encoding and decoding processes achieve efficient motion compensation, reducing memory bandwidth and improving processing efficiency.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Applications
- Current Assignee / Owner
- PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA
- Filing Date
- 2026-04-21
- Publication Date
- 2026-07-02
Abstract
Description
[Technical Field]
[0001] This disclosure relates to an encoding device, a decoding device, a generating device, a transmitting device, and a non-transitory storage medium.
[Background Art]
[0002] Conventionally, H.265 has been used as a standard for encoding moving images. H.265 is also known as HEVC (High Efficiency Video Coding).
[Prior Art Documents]
[Non-Patent Literature]
[0003] [Non-Patent Document 1] H.265 (ISO/IEC 23008-2 HEVC (High Efficiency Video Coding))
[Summary of the Invention]
[Problems to Be Solved by the Invention]
[0004] However, in the encoding and decoding of moving images, if affine motion compensation is not performed efficiently in inter prediction, processing efficiency may suffer. For example, affine motion compensation may not be selected when motion compensation is performed on the target block.
[0005] Therefore, this disclosure provides an encoding device, etc., that can efficiently perform motion compensation using affine motion compensation.
[Means for Solving the Problem]
[0006] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit uses the memory to perform motion compensation on a target block in an affine motion compensation prediction process in an inter prediction process of the target block by limiting the range in which motion search or motion compensation is performed, and in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited so that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block in the affine motion compensation prediction process falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.
[0007] These general or specific aspects may be implemented as systems, devices, methods, integrated circuits, computer programs, or non-transitory recording media such as computer-readable CD-ROMs, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.
[Effects of the Invention]
[0008] An encoding device, etc., according to one aspect of this disclosure can efficiently perform motion compensation using affine motion compensation.
[Brief Description of the Drawings]
[0009]
[Figure 1] Figure 1 is a block diagram showing the functional configuration of the encoding device according to Embodiment 1.
[Figure 2] Figure 2 shows an example of block division in Embodiment 1.
[Figure 3] Figure 3 is a table showing the transformation basis functions corresponding to each transformation type.
[Figure 4A] Figure 4A shows an example of the filter shape used in ALF.
[Figure 4B] Figure 4B shows another example of the filter shape used in ALF.
[Figure 4C] Figure 4C shows another example of the filter shape used in ALF.
[Figure 5A] Figure 5A shows the 67 intra prediction modes in intra prediction.
[Figure 5B] Figure 5B is a flowchart outlining the prediction image correction process performed by the OBMC process.
[Figure 5C] Figure 5C is a conceptual diagram outlining the prediction image correction process performed by the OBMC process.
[Figure 5D] Figure 5D shows an example of FRUC.
[Figure 6] Figure 6 illustrates pattern matching (bilateral matching) between two blocks along a motion trajectory.
[Figure 7] Figure 7 illustrates pattern matching (template matching) between a template in the current picture and a block in the reference picture.
[Figure 8] Figure 8 illustrates a model assuming constant-velocity linear motion.
[Figure 9A] Figure 9A illustrates the derivation of motion vectors in sub-block units based on the motion vectors of a plurality of adjacent blocks.
[Figure 9B] Figure 9B outlines the motion vector derivation process in merge mode.
[Figure 9C] Figure 9C is a conceptual diagram outlining the DMVR process.
[Figure 9D] Figure 9D outlines a prediction image generation method that uses the luminance correction process of the LIC process.
[Figure 10] Figure 10 is a block diagram showing the functional configuration of the decoding device according to Embodiment 1.
[Figure 11] Figure 11 is a conceptual diagram illustrating the affine inter mode of affine motion compensation prediction.
[Figure 12A] Figure 12A is a conceptual diagram illustrating the affine merge mode of affine motion compensation prediction.
[Figure 12B] Figure 12B is a conceptual diagram illustrating the affine merge mode of affine motion compensation prediction.
[Figure 13] Figure 13 is a block diagram showing the internal configuration for performing affine motion compensation prediction processing in the inter prediction unit included in the encoding device in Embodiment 1.
[Figure 14] Figure 14 is a flowchart showing the first processing procedure for the affine inter mode of affine motion compensation by the inter prediction unit of the encoding device in Embodiment 1.
[Figure 15] Figure 15 is a flowchart showing the second processing procedure for the affine inter mode of affine motion compensation by the inter prediction unit of the encoding device in Embodiment 1.
[Figure 16] Figure 16 is a flowchart showing the first processing procedure for the affine merge mode of affine motion compensation by the inter prediction unit of the encoding device in Embodiment 1.
[Figure 17] Figure 17 is a flowchart showing the second processing procedure for the affine merge mode of affine motion compensation by the inter prediction unit of the encoding device in Embodiment 1.
[Figure 18] Figure 18 is a block diagram showing an example of an implementation of the encoding device according to Embodiment 1.
[Figure 19] Figure 19 is a flowchart showing an example of the operation of the encoding device according to Embodiment 1.
[Figure 20] Figure 20 is a block diagram showing an example of an implementation of the decoding device according to Embodiment 1.
[Figure 21] Figure 21 is a flowchart showing an example of the operation of the decoding device according to Embodiment 1.
[Figure 22] Figure 22 is an overall diagram of the content supply system that realizes the content distribution service.
[Figure 23] Figure 23 shows an example of an encoding structure during scalable encoding.
[Figure 24] Figure 24 shows an example of an encoding structure during scalable encoding.
[Figure 25] Figure 25 shows an example of how a web page is displayed.
[Figure 26] Figure 26 shows an example of how a web page is displayed.
[Figure 27] Figure 27 shows an example of a smartphone.
[Figure 28] Figure 28 is a block diagram showing an example of a smartphone configuration.
[Modes for Carrying Out the Invention]
[0010] For example, an encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit uses the memory to perform motion compensation for a target block by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process in the inter prediction process of the target block.
[0011] This allows the encoding device to efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in control point motion vectors in affine motion compensation prediction. As a result, the likelihood of affine motion compensation prediction being selected in inter prediction increases, allowing for efficient motion compensation using affine motion compensation. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0012] Here, for example, in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited so that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block in the affine motion compensation prediction process falls within a predetermined range.
[0013] This allows the encoding device to suppress variations in control point motion vectors in affine motion compensation prediction, increasing the likelihood that affine motion compensation prediction will be selected in inter prediction. As a result, motion compensation using affine motion compensation can be performed efficiently.
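The constraint in [0012] can be sketched as follows. This is only an illustration: the disclosure states that the variation is a value based on the difference between the two control point motion vectors, but does not fix the measure. The sketch assumes the larger absolute component of that difference; the function names, the motion vector units, and the threshold are hypothetical.

```python
from typing import Tuple

# (horizontal, vertical) components; the unit (e.g. quarter-pel) is an assumption.
MotionVector = Tuple[int, int]


def control_point_variation(mv_top_left: MotionVector,
                            mv_top_right: MotionVector) -> int:
    """Assumed variation measure: largest absolute component of the
    difference between the two control point motion vectors."""
    dx = mv_top_right[0] - mv_top_left[0]
    dy = mv_top_right[1] - mv_top_left[1]
    return max(abs(dx), abs(dy))


def within_variation_limit(mv_top_left: MotionVector,
                           mv_top_right: MotionVector,
                           max_variation: int) -> bool:
    """True if the variation falls within the predetermined range."""
    return control_point_variation(mv_top_left, mv_top_right) <= max_variation
```

An encoder following this sketch would keep a pair of control point motion vectors only when `within_variation_limit` returns true for the predetermined threshold.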
[0014] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are newly determined for each picture to be processed, or a set of multiple motion search ranges or motion compensation ranges is predetermined, and an appropriate set is selected for each picture to be processed.
[0015] This allows the range of motion search or motion compensation to be limited at predetermined timings or using a predetermined set, thereby reducing the processing load and thus the memory bandwidth required for external memory.
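One way to picture the second option in [0014], selecting from a predetermined set for each picture, is the sketch below. The disclosure does not say how the appropriate set is chosen; the sketch assumes a hypothetical per-picture budget on the reference area fetched per block. The candidate ranges, the block size, and all names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SearchRange:
    horizontal: int  # maximum horizontal displacement in pixels (assumed unit)
    vertical: int    # maximum vertical displacement in pixels (assumed unit)


# Hypothetical predetermined set of motion search ranges.
PREDETERMINED_RANGES = (
    SearchRange(16, 16),
    SearchRange(32, 32),
    SearchRange(64, 32),
)


def reference_area(block_w: int, block_h: int, r: SearchRange) -> int:
    """Pixels of the reference picture that may be touched for one block."""
    return (block_w + 2 * r.horizontal) * (block_h + 2 * r.vertical)


def select_range_for_picture(budget_pixels: int,
                             block_w: int = 16,
                             block_h: int = 16,
                             candidates: Sequence[SearchRange] = PREDETERMINED_RANGES
                             ) -> SearchRange:
    """Assumed policy: pick the largest range that fits the budget,
    falling back to the smallest range when none fits."""
    def area(r: SearchRange) -> int:
        return reference_area(block_w, block_h, r)

    fitting = [r for r in candidates if area(r) <= budget_pixels]
    if not fitting:
        return min(candidates, key=area)
    return max(fitting, key=area)
```

Calling `select_range_for_picture` once per picture with that picture's budget mirrors the idea of choosing one set per picture to be processed.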
[0016] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are changed according to the type of picture being referenced.
[0017] This allows the range of motion search or motion compensation to be limited at predetermined timings or using a predetermined set, thereby reducing the memory bandwidth required for external memory.
[0018] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are determined for each predetermined profile and level.
[0019] This allows the limits on the range of motion search or motion compensation to be determined for each predetermined profile and level, thereby reducing the memory bandwidth required for external memory.
[0020] Furthermore, for example, in the affine motion compensation prediction process, the limitation on the range in which motion search or motion compensation is performed is determined according to the motion search processing capability, which depends on the computational processing capability of the encoding side or on the memory bandwidth.
[0021] This allows the limits on the range of motion search or motion compensation to be determined according to the search processing capacity, thereby reducing the processing load according to the search processing capacity, and enabling efficient motion compensation using affine motion compensation.
[0022] Furthermore, for example, in the affine motion compensation prediction process, in addition to limiting the range of pixels that can be referenced as a limitation on the range in which motion search or motion compensation is performed, the referenced picture is also limited.
[0023] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0024] Furthermore, for example, in the affine motion compensation prediction process, information regarding the limitation of the range in which motion search or motion compensation is performed is included in the header information of the VPS (Video Parameter Set), SPS (Sequence Parameter Set), or PPS (Picture Parameter Set) of the encoded bitstream.
[0025] This allows for efficient motion compensation using affine motion compensation.
[0026] Furthermore, for example, in the affine motion compensation prediction process, information regarding the limitation of the range in which motion search or motion compensation is performed includes, in addition to information limiting the range of pixels that can be referenced, information limiting the picture to be referenced.
[0027] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0028] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point motion vector to be evaluated is outside the range in which motion search is performed during the affine inter mode motion search process, the control point motion vector is excluded from the candidates for motion search.
[0029] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
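The candidate exclusion in [0028] can be sketched as follows. Several points are assumptions rather than statements of this disclosure: the sketch uses the commonly used four-parameter affine model to derive a motion vector per sub-block from the upper-left and upper-right control point motion vectors, treats the search range as a window extending `search_range` pixels beyond the target block on every side, uses 4x4 sub-blocks, and ignores the extra pixels an interpolation filter would need. All names are hypothetical.

```python
from typing import List, Tuple

MV = Tuple[float, float]  # (horizontal, vertical) displacement in pixels (assumed unit)


def affine_mv(v0: MV, v1: MV, width: int, x: float, y: float) -> MV:
    """Four-parameter affine model (assumed): motion vector at position (x, y)
    inside the block, from the upper-left (v0) and upper-right (v1)
    control point motion vectors."""
    a = (v1[0] - v0[0]) / width
    b = (v1[1] - v0[1]) / width
    return (a * x - b * y + v0[0], b * x + a * y + v0[1])


def references_inside_range(v0: MV, v1: MV, width: int, height: int,
                            search_range: int, sub: int = 4) -> bool:
    """True if every sub-block's referenced region lies inside the window
    that extends search_range pixels beyond the block on each side."""
    for y in range(0, height, sub):
        for x in range(0, width, sub):
            # Evaluate the model at the centre of the sub-block.
            mvx, mvy = affine_mv(v0, v1, width, x + sub / 2, y + sub / 2)
            left, top = x + mvx, y + mvy
            right, bottom = left + sub, top + sub
            if left < -search_range or top < -search_range:
                return False
            if right > width + search_range or bottom > height + search_range:
                return False
    return True


def filter_candidates(candidates: List[Tuple[MV, MV]], width: int, height: int,
                      search_range: int) -> List[Tuple[MV, MV]]:
    """Drop control point MV pairs whose referenced region leaves the window."""
    return [(v0, v1) for v0, v1 in candidates
            if references_inside_range(v0, v1, width, height, search_range)]
```

The same check could in principle back the mode prohibitions described for the affine inter mode and affine merge mode, with the window taken from the motion search range or the motion compensation range respectively.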
[0030] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point prediction motion vector obtained from a processed block adjacent to the block to be processed is outside the range in which motion search is performed in affine inter mode, encoding in affine inter mode is prohibited.
[0031] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0032] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point motion vector obtained from the processed block adjacent to the block to be processed is outside the range in which motion compensation is performed in affine merge mode, encoding in affine merge mode is prohibited.
[0033] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0034] Furthermore, for example, a decoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit decodes an encoded stream by using the memory to perform motion compensation on the target block in the affine motion compensation prediction process in the inter prediction process of the target block, by limiting the range in which motion search or motion compensation is performed.
[0035] This allows the decoding device to efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in control point motion vectors in affine motion compensation prediction. As a result, the likelihood of affine motion compensation prediction being selected in inter prediction increases, allowing motion compensation to be performed efficiently using affine motion compensation.
[0036] Here, for example, in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited so that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block in the affine motion compensation prediction process falls within a predetermined range.
[0037] This allows the decoding device to suppress variations in control point motion vectors in affine motion compensation prediction, increasing the likelihood that affine motion compensation prediction will be selected in inter prediction. As a result, motion compensation using affine motion compensation can be performed efficiently.
[0038] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are newly determined for each picture to be processed, or a set of multiple motion search ranges or motion compensation ranges is predetermined, and an appropriate set is selected for each picture to be processed.
[0039] This allows the range of motion search or motion compensation to be limited at predetermined timings or using a predetermined set, thereby reducing the processing load and thus the memory bandwidth required for external memory.
[0040] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are changed according to the type of picture being referenced.
[0041] This allows the range of motion search or motion compensation to be limited at predetermined timings or using a predetermined set, thereby reducing the memory bandwidth required for external memory.
[0042] Furthermore, for example, in the affine motion compensation prediction process, the limitations on the range in which motion search or motion compensation is performed are determined for each predetermined profile and level.
[0043] This allows the limits on the range of motion search or motion compensation to be determined for each predetermined profile and level, thereby reducing the memory bandwidth required for external memory.
[0044] Furthermore, for example, in the affine motion compensation prediction process, the limitation on the range in which motion search or motion compensation is performed is determined according to the motion search processing capability, which depends on the computational processing capability of the encoding side or on the memory bandwidth.
[0045] This allows the limits on the range of motion search or motion compensation to be determined according to the search processing capacity, thereby reducing the memory bandwidth required for external memory and enabling efficient motion compensation using affine motion compensation.
[0046] Furthermore, for example, in the affine motion compensation prediction process, in addition to limiting the range of pixels that can be referenced as a limitation on the range in which motion search or motion compensation is performed, the referenced picture is also limited.
[0047] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0048] Furthermore, for example, in the affine motion compensation prediction process, information regarding the limitation of the range in which motion search or motion compensation is performed is included in the header information of the VPS (Video Parameter Set), SPS (Sequence Parameter Set), or PPS (Picture Parameter Set) of the encoded bitstream.
[0049] This allows for efficient motion compensation using affine motion compensation.
[0050] Furthermore, for example, in the affine motion compensation prediction process, information regarding the limitation of the range in which motion search or motion compensation is performed includes, in addition to information limiting the range of pixels that can be referenced, information limiting the picture to be referenced.
[0051] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0052] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point motion vector to be evaluated is outside the range in which motion search is performed during the affine inter mode motion search process, the control point motion vector is excluded from the candidates for motion search.
[0053] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0054] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point prediction motion vector obtained from a processed block adjacent to the block to be processed is outside the range in which motion search is performed in affine inter mode, encoding in affine inter mode is prohibited.
[0055] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0056] Furthermore, for example, in the affine motion compensation prediction process, if the region referenced by the motion vector derived from the control point motion vector obtained from the processed block adjacent to the block to be processed is outside the range in which motion compensation is performed in affine merge mode, encoding in affine merge mode is prohibited.
[0057] This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0058] Furthermore, for example, an encoding method according to one aspect of this disclosure performs motion compensation for the target block by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process in the inter prediction process of the target block.
[0059] As a result, devices using this encoding method can efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in control point motion vectors in affine motion compensation prediction. This increases the likelihood that affine motion compensation prediction will be selected in inter prediction, thus enabling efficient motion compensation using affine motion compensation. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0060] Furthermore, for example, a decoding method according to one aspect of this disclosure decodes an encoded stream by performing motion compensation on the target block by limiting the range in which motion search or motion compensation is performed during the affine motion compensation prediction process in the inter prediction process of the target block.
[0061] As a result, devices using this decoding method can efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in control point motion vectors in affine motion compensation prediction. This increases the likelihood that affine motion compensation prediction will be selected in inter prediction, thus enabling efficient motion compensation using affine motion compensation. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0062] Furthermore, these general or specific aspects may be implemented as systems, devices, methods, integrated circuits, computer programs, or non-transitory recording media such as computer-readable CD-ROMs, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.
[0063] The embodiments will be described in detail below with reference to the drawings.
[0064] The embodiments described below are all general or specific examples. The numerical values, shapes, materials, components, arrangement and connection of components, steps, and order of steps shown in the following embodiments are examples only and are not intended to limit the scope of the claims. Furthermore, among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components.
(Embodiment 1)
[0065] First, an overview of Embodiment 1 will be given as an example of an encoding device and a decoding device to which the processes and/or configurations described in each aspect of this disclosure, described later, can be applied. However, Embodiment 1 is merely an example of an encoding device and a decoding device to which the processes and/or configurations described in each aspect of this disclosure can be applied, and those processes and/or configurations can also be implemented in encoding and decoding devices different from those of Embodiment 1.
[0066] When applying the processes and/or configurations described in each aspect of this disclosure to Embodiment 1, for example, one of the following may be performed:
[0067]
(1) For the encoding or decoding device of Embodiment 1, among the plurality of components constituting the device, replace the component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(2) For the encoding or decoding device of Embodiment 1, make arbitrary modifications such as adding, replacing, or deleting functions or processes performed by some of the plurality of components constituting the device, and then replace the component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(3) For the method performed by the encoding or decoding device of Embodiment 1, make arbitrary modifications such as adding processes and/or replacing or deleting some of the processes included in the method, and then replace the process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(4) Implement some of the plurality of components constituting the encoding or decoding device of Embodiment 1 in combination with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(5) Implement a component that has some of the functions of some of the plurality of components constituting the encoding or decoding device of Embodiment 1, or a component that performs some of the processing performed by some of those components, in combination with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(6) For the method performed by the encoding or decoding device of Embodiment 1, among the plurality of processes included in the method, replace the process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(7) Perform some of the plurality of processes included in the method performed by the encoding or decoding device of Embodiment 1 in combination with the processes described in each aspect of this disclosure.
[0068] The methods of implementing the processes and/or configurations described in each aspect of this disclosure are not limited to the examples above. For example, they may be implemented in a device used for a purpose other than the video/image encoding device or video/image decoding device disclosed in Embodiment 1, or the processes and/or configurations described in each aspect may be implemented individually. Furthermore, the processes and/or configurations described in different aspects may be implemented in combination.
[Overview of the Encoding Device]
[0069] First, an overview of the encoding device according to Embodiment 1 will be described. Figure 1 is a block diagram showing the functional configuration of the encoding device 100 according to Embodiment 1. The encoding device 100 is a video/image encoding device that encodes video/images in block units.
[0070] As shown in Figure 1, the encoding device 100 is a device that encodes an image in block units and comprises a splitting unit 102, a subtraction unit 104, a transformation unit 106, a quantization unit 108, an entropy encoding unit 110, an inverse quantization unit 112, an inverse transformation unit 114, an addition unit 116, a block memory 118, a loop filter unit 120, a frame memory 122, an intra prediction unit 124, an inter prediction unit 126, and a prediction control unit 128.
[0071] The encoding device 100 can be implemented, for example, by a general-purpose processor and memory. In this case, when a software program stored in the memory is executed by the processor, the processor functions as the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128. Alternatively, the encoding device 100 may be implemented as one or more dedicated electronic circuits corresponding to the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128.
[0072] The following describes each component included in the encoding device 100.
[Splitting Unit]
[0073] The splitting unit 102 divides each picture contained in the input video into multiple blocks and outputs each block to the subtraction unit 104. For example, the splitting unit 102 first divides the picture into blocks of a fixed size (e.g., 128x128). These fixed-size blocks are sometimes called coding tree units (CTUs). Then, based on recursive quadtree and/or binary tree block partitioning, the splitting unit 102 divides each of the fixed-size blocks into blocks of a variable size (e.g., 64x64 or less). These variable-size blocks are sometimes called coding units (CUs), prediction units (PUs), or transformation units (TUs). In this embodiment, CUs, PUs, and TUs do not need to be distinguished, and some or all of the blocks in the picture may serve as the processing unit for CUs, PUs, and TUs.
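The first step described above, dividing a picture into fixed-size blocks, can be sketched as follows. The handling of pictures whose dimensions are not multiples of the CTU size is an assumption (the sketch simply clips the blocks at the picture boundary), and the function name and the example picture size are hypothetical.

```python
import math
from typing import List, Tuple

Block = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def ctu_grid(pic_w: int, pic_h: int, ctu: int = 128) -> List[Block]:
    """List the fixed-size blocks (CTUs) covering a picture in raster order.
    Blocks on the right and bottom edges are clipped to the picture (assumption)."""
    cols = math.ceil(pic_w / ctu)
    rows = math.ceil(pic_h / ctu)
    return [
        (cx * ctu, cy * ctu,
         min(ctu, pic_w - cx * ctu), min(ctu, pic_h - cy * ctu))
        for cy in range(rows)
        for cx in range(cols)
    ]
```

For a 1920x1080 picture with 128x128 CTUs, this yields 15 columns and 9 rows, with the bottom row clipped to a height of 56 pixels.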
[0074] Figure 2 shows an example of block partitioning in Embodiment 1. In Figure 2, solid lines represent block boundaries due to quadtree block partitioning, and dashed lines represent block boundaries due to binary tree block partitioning.
[0075] Here, block 10 is a 128x128 pixel square block (128x128 block). This 128x128 block 10 is first divided into four 64x64 square blocks (quadtree block partitioning).
[0076] The top-left 64x64 block is further divided vertically into two rectangular 32x64 blocks, and the left 32x64 block is further divided vertically into two rectangular 16x64 blocks (binary tree block partitioning). As a result, the top-left 64x64 block is divided into two 16x64 blocks 11 and 12 and a 32x64 block 13.
[0077] The 64x64 block in the upper right is horizontally divided into two rectangular 64x32 blocks, 14 and 15 (binary tree block division).
[0078] The bottom-left 64x64 block is divided into four square 32x32 blocks (quadtree block division). Of the four 32x32 blocks, the top-left and bottom-right blocks are further divided. The top-left 32x32 block is vertically divided into two rectangular 16x32 blocks, and the right 16x32 block is further horizontally divided into two 16x16 blocks (binary tree block division). The bottom-right 32x32 block is horizontally divided into two 32x16 blocks (binary tree block division). As a result, the bottom-left 64x64 block is divided into 16x32 block 16, two 16x16 blocks 17 and 18, two 32x32 blocks 19 and 20, and two 32x16 blocks 21 and 22.
[0079] The bottom-right 64x64 block 23 is not divided.
[0080] As described above, in Figure 2, block 10 is divided into 13 variable-sized blocks 11-23 based on a recursive quad-tree and binary tree block partition. Such a partition is sometimes called a QTBT (quad-tree plus binary tree) partition.
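The partition of Figure 2 can be reproduced with a short sketch. The following Python code is illustrative only: the tuple layout (x, y, width, height), the split names "QT", "BT_V", and "BT_H", and the nested-tree encoding are assumptions introduced here and are not part of the disclosure.

```python
# Illustrative sketch of a QTBT partition; all names are hypothetical.
def split(block, mode):
    """Split block (x, y, w, h) by a quadtree or binary tree split."""
    x, y, w, h = block
    if mode == "QT":  # four equal squares, in raster order
        hw, hh = w // 2, h // 2
        return [(x, y, hw, hh), (x + hw, y, hw, hh),
                (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]
    if mode == "BT_V":  # vertical boundary: left and right halves
        return [(x, y, w // 2, h), (x + w // 2, y, w // 2, h)]
    if mode == "BT_H":  # horizontal boundary: top and bottom halves
        return [(x, y, w, h // 2), (x, y + h // 2, w, h // 2)]
    raise ValueError(mode)

def leaves(block, tree):
    """Return the leaf blocks; tree is None (leaf) or (mode, [subtrees])."""
    if tree is None:
        return [block]
    mode, children = tree
    result = []
    for sub_block, sub_tree in zip(split(block, mode), children):
        result.extend(leaves(sub_block, sub_tree))
    return result

# Partition described for Figure 2 (128x128 block 10).
FIG2_TREE = ("QT", [
    ("BT_V", [("BT_V", [None, None]), None]),          # top-left 64x64
    ("BT_H", [None, None]),                            # top-right 64x64
    ("QT", [("BT_V", [None, ("BT_H", [None, None])]),  # bottom-left 64x64
            None, None,
            ("BT_H", [None, None])]),
    None,                                              # bottom-right 64x64
])

blocks = leaves((0, 0, 128, 128), FIG2_TREE)
print(len(blocks), "leaf blocks")
```

The leaf blocks tile the 128x128 block without overlap, which can be checked by summing their areas.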
[0081] In Figure 2, one block is divided into four or two blocks (quadtree or binary tree block partitioning), but the partitioning is not limited to these. For example, one block may be divided into three blocks (ternary tree block partitioning). Partitioning that includes such ternary tree block partitioning is sometimes called MBT (multi-type tree) partitioning.
[0082] [Subtraction Unit] The subtraction unit 104 subtracts the prediction signal (prediction samples) from the original signal (original samples) in units of the blocks divided by the splitting unit 102. In other words, the subtraction unit 104 calculates the prediction error (also called the residual) of the block to be encoded (hereinafter referred to as the current block). The subtraction unit 104 then outputs the calculated prediction error to the transformation unit 106.
[0083] The original signal is the input signal to the encoding device 100 and represents the image of each picture that makes up the moving image (for example, a luminance (luma) signal and two color difference (chroma) signals). In the following, a signal representing an image may also be called a sample.
[0084] [Transformation Unit] The transformation unit 106 transforms the prediction error in the spatial domain into transformation coefficients in the frequency domain and outputs the transformation coefficients to the quantization unit 108. Specifically, the transformation unit 106 performs, for example, a predetermined discrete cosine transform (DCT) or discrete sine transform (DST) on the prediction error in the spatial domain.
[0085] The transformation unit 106 may also adaptively select a transformation type from among several transformation types and use a transformation basis function corresponding to the selected transformation type to convert the prediction error into transformation coefficients. Such a transformation is sometimes called an EMT (explicit multiple core transform) or an AMT (adaptive multiple transform).
[0086] The multiple transformation types include, for example, DCT-II, DCT-V, DCT-VIII, DST-I, and DST-VII. Figure 3 is a table showing the transformation basis functions corresponding to each transformation type. In Figure 3, N represents the number of input pixels. The selection of a transformation type from among these multiple transformation types may depend, for example, on the type of prediction (intra prediction or inter prediction) or on the intra prediction mode.
[0087] Information indicating whether or not to apply such EMT or AMT (e.g., called an AMT flag) and information indicating the selected transformation type are signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).
[0088] Furthermore, the transformation unit 106 may retransform the transformation coefficients (transformation results). Such retransformation is sometimes called AST (adaptive secondary transform) or NSST (non-separable secondary transform). For example, the transformation unit 106 performs retransformation for each subblock (e.g., 4x4 subblock) contained in the block of transformation coefficients corresponding to the intra-prediction error. Information indicating whether or not to apply NSST and information regarding the transformation matrix used for NSST are signaled at the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).
[0089] Here, a separable transformation is a method in which the input is separated into directions equal to the number of dimensions and transformed multiple times, while a non-separable transformation is a method in which, when the input is multidimensional, two or more dimensions are treated as one dimension and transformed together.
[0090] For example, one example of a non-separable transformation is to treat a 4x4 block as a single array with 16 elements and then perform a transformation on that array using a 16x16 transformation matrix.
[0091] Similarly, the Hypercube Givens Transform, which treats a 4x4 input block as a single array with 16 elements and then performs multiple Givens rotations on that array, is another example of a non-separable transformation.
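The difference between the two approaches can be illustrated with a small sketch. In the following Python code, the 4x4 basis H (a Hadamard matrix) and all function names are assumptions chosen for illustration. The separable path transforms one dimension at a time, while the non-separable path treats the 4x4 block as a 16-element array and applies a 16x16 matrix. When that 16x16 matrix happens to be the Kronecker product of H with itself, the two paths give the same result; a general 16x16 matrix has no such factorization, which is what makes a transformation non-separable.

```python
# Illustrative sketch; H and all names are hypothetical.
H = [[1, 1, 1, 1],
     [1, -1, 1, -1],
     [1, 1, -1, -1],
     [1, -1, -1, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def separable(block):
    # One dimension at a time: H * X * H^T.
    return matmul(matmul(H, block), transpose(H))

def non_separable(block, matrix16):
    # Treat the 4x4 block as a single 16-element array (row-major order).
    x = [v for row in block for v in row]
    y = [sum(matrix16[r][c] * x[c] for c in range(16)) for r in range(16)]
    return [y[4 * i:4 * i + 4] for i in range(4)]

# Kronecker product of H with itself: row index 4i+j, column index 4k+l.
KRON = [[H[i][k] * H[j][l] for k in range(4) for l in range(4)]
        for i in range(4) for j in range(4)]

block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(separable(block) == non_separable(block, KRON))
```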
[0092] [Quantization Unit] The quantization unit 108 quantizes the transformation coefficients output from the transformation unit 106. Specifically, the quantization unit 108 scans the transformation coefficients of the current block in a predetermined scanning order and quantizes the transformation coefficients based on the quantization parameter (QP) corresponding to the scanned transformation coefficients. The quantization unit 108 then outputs the quantized transformation coefficients of the current block (hereinafter referred to as quantization coefficients) to the entropy encoding unit 110 and the inverse quantization unit 112.
[0093] The predetermined scanning order is the order for quantization / inverse quantization of the transformation coefficients. For example, the predetermined scanning order is defined as ascending frequency (from low frequency to high frequency) or descending frequency (from high frequency to low frequency).
[0094] Quantization parameters are parameters that define the quantization step (quantization width). For example, if the value of the quantization parameter increases, the quantization step also increases. In other words, if the value of the quantization parameter increases, the quantization error increases.
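The relationship between the quantization parameter, the quantization step, and the quantization error can be sketched as follows. The mapping from QP to step size used here (the step doubles every 6 QP) is borrowed from H.264 / HEVC as an assumption; the text above states only that the step grows with QP. The function names and sample coefficients are hypothetical.

```python
# Illustrative sketch; the QP-to-step mapping is an assumption.
def qstep(qp):
    # Assumed mapping: step size doubles for every increase of 6 in QP.
    return 2 ** ((qp - 4) / 6)

def quantize(coeffs, qp):
    step = qstep(qp)
    return [round(c / step) for c in coeffs]

def dequantize(levels, qp):
    step = qstep(qp)
    return [level * step for level in levels]

def max_error(coeffs, qp):
    # Largest difference between a coefficient and its reconstruction.
    restored = dequantize(quantize(coeffs, qp), qp)
    return max(abs(c - r) for c, r in zip(coeffs, restored))

coeffs = [100.0, -37.0, 12.0, 5.0, -3.0, 1.0]
for qp in (4, 22, 40):
    print(qp, qstep(qp), max_error(coeffs, qp))
```

With rounding to the nearest level, the error of each coefficient is bounded by half the step, so a larger QP permits a larger error.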
[0095] [Entropy Encoding Unit] The entropy encoding unit 110 generates an encoded signal (encoded bitstream) by variable-length encoding the quantization coefficients input from the quantization unit 108. Specifically, the entropy encoding unit 110, for example, binarizes the quantization coefficients and arithmetically encodes the binary signal.
[0096] [Inverse Quantization Unit] The inverse quantization unit 112 inversely quantizes the quantization coefficients input from the quantization unit 108. Specifically, the inverse quantization unit 112 inversely quantizes the quantization coefficients of the current block in a predetermined scanning order. Then, the inverse quantization unit 112 outputs the inversely quantized transformation coefficients of the current block to the inverse transformation unit 114.
[0097] [Inverse Transformation Unit] The inverse transformation unit 114 restores the prediction error by inversely transforming the transformation coefficients input from the inverse quantization unit 112. Specifically, the inverse transformation unit 114 restores the prediction error of the current block by applying to the transformation coefficients an inverse transformation that corresponds to the transformation performed by the transformation unit 106. The inverse transformation unit 114 then outputs the restored prediction error to the addition unit 116.
[0098] Note that the restored prediction error does not match the prediction error calculated by the subtraction unit 104 because information is lost through quantization. In other words, the restored prediction error includes quantization error.
[0099] [Addition Unit] The addition unit 116 reconstructs the current block by adding the prediction error input from the inverse transformation unit 114 and the prediction samples input from the prediction control unit 128. The addition unit 116 then outputs the reconstructed block to the block memory 118 and the loop filter unit 120. The reconstructed block is sometimes called the local decoded block.
[0100] [Block Memory] The block memory 118 is a storage unit for storing blocks within the picture to be encoded (hereinafter referred to as the current picture) that are referenced in intra prediction. Specifically, the block memory 118 stores the reconstructed blocks output from the addition unit 116.
[0101] [Loop Filter Unit] The loop filter unit 120 applies a loop filter to the block reconstructed by the addition unit 116 and outputs the filtered reconstructed block to the frame memory 122. A loop filter is a filter used within the encoding loop (in-loop filter), and includes, for example, a deblocking filter (DF), sample adaptive offset (SAO), and adaptive loop filter (ALF).
[0102] In ALF, a least-squares error filter is applied to remove coding distortion. For example, for each 2x2 subblock within the current block, one filter selected from several filters is applied based on the direction and activity of the local gradient.
[0103] Specifically, first, subblocks (e.g., 2x2 subblocks) are classified into multiple classes (e.g., 15 or 25 classes). The classification of subblocks is based on the direction and activity of the gradient. For example, a classification value C (e.g., C = 5D + A) is calculated using the gradient direction value D (e.g., 0-2 or 0-4) and the gradient activity value A (e.g., 0-4). Then, based on the classification value C, the subblocks are classified into multiple classes (e.g., 15 or 25 classes).
[0104] The gradient direction value D is derived, for example, by comparing gradients in multiple directions (e.g., horizontal, vertical, and two diagonal directions). The gradient activity value A is derived, for example, by adding the gradients in multiple directions and quantizing the sum.
[0105] Based on the results of this classification, a filter for the subblock is determined from among multiple filters.
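The classification step can be written as a short sketch. The following Python code computes the classification value C = 5D + A given above; the function name and the range check are assumptions for illustration, and the derivation of D and A from the gradients is not shown because the text does not specify its thresholds.

```python
# Illustrative sketch of the ALF class index; names are hypothetical.
def alf_class(direction, activity):
    """Classification value C = 5D + A for D in 0..4 and A in 0..4."""
    if not 0 <= direction <= 4 or not 0 <= activity <= 4:
        raise ValueError("D and A must each be in the range 0..4")
    return 5 * direction + activity

# D in 0..4 yields 25 distinct classes; D in 0..2 yields 15.
classes_25 = {alf_class(d, a) for d in range(5) for a in range(5)}
classes_15 = {alf_class(d, a) for d in range(3) for a in range(5)}
print(len(classes_25), len(classes_15))
```

The class index can then be used to look up one filter from the set of selectable filters.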
[0106] For example, a circularly symmetric shape is used as the filter shape in ALF. Figures 4A to 4C show several examples of filter shapes used in ALF. Figure 4A shows a 5x5 diamond-shaped filter, Figure 4B shows a 7x7 diamond-shaped filter, and Figure 4C shows a 9x9 diamond-shaped filter. Information indicating the filter shape is signaled at the picture level. However, the signaling of information indicating the filter shape is not limited to the picture level and may be at other levels (e.g., sequence level, slice level, tile level, CTU level, or CU level).
[0107] The on / off status of ALF is determined, for example, at the picture level or CU level. For instance, the decision to apply ALF to luminance is made at the CU level, and the decision to apply ALF to color difference is made at the picture level. Information indicating whether ALF is on or off is signaled at the picture level or CU level. However, the signaling of information indicating whether ALF is on or off is not limited to the picture level or CU level, but may be at other levels (e.g., sequence level, slice level, tile level, or CTU level).
[0108] The coefficient sets of multiple selectable filters (e.g., up to 15 or 25 filters) are signaled at the picture level. However, the signaling of the coefficient sets is not limited to the picture level; it may be at other levels (e.g., sequence level, slice level, tile level, CTU level, CU level, or subblock level).
[0109] [Frame Memory] The frame memory 122 is a storage unit for storing reference pictures used in inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 122 stores the reconstructed blocks filtered by the loop filter unit 120.
[0110] [Intra Prediction Unit] The intra-prediction unit 124 generates a prediction signal (intra-prediction signal) by performing intra-prediction (also called in-screen prediction) of the current block by referring to the block in the current picture stored in the block memory 118. Specifically, the intra-prediction unit 124 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, color difference values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 128.
[0111] For example, the intra-prediction unit 124 performs intra-prediction using one of a predetermined set of intra-prediction modes. The set of intra-prediction modes includes one or more non-directional prediction modes and multiple directional prediction modes.
[0112] One or more non-directional prediction modes include, for example, the Planar prediction mode and DC prediction mode as defined in the H.265 / HEVC (High-Efficiency Video Coding) standard (Non-Patent Document 1).
[0113] Multiple directional prediction modes include, for example, the 33 directional prediction modes defined in the H.265 / HEVC standard. Note that multiple directional prediction modes may also include 32 additional directional prediction modes (a total of 65 directional prediction modes). Figure 5A shows 67 intra-prediction modes (2 non-directional prediction modes and 65 directional prediction modes) in intra-prediction. Solid arrows represent the 33 directions defined in the H.265 / HEVC standard, and dashed arrows represent the additional 32 directions.
[0114] Furthermore, in the intra-prediction of a color difference block, a luminance block may be referenced. That is, the color difference component of the current block may be predicted based on the luminance component of the current block. Such intra-prediction is sometimes called CCLM (cross-component linear model) prediction. Such an intra-prediction mode for a color difference block that references a luminance block (e.g., called the CCLM mode) may be added as one of the intra-prediction modes for a color difference block.
[0115] The intra-prediction unit 124 may correct the pixel values after intra-prediction based on the gradient of the horizontal / vertical reference pixels. Intra-prediction with such correction is sometimes called PDPC (position dependent intra-prediction combination). Information indicating whether or not PDPC is applied (for example, called a PDPC flag) is signaled at, for example, the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (for example, sequence level, picture level, slice level, tile level, or CTU level).
[0116] [Inter Prediction Unit] The inter-prediction unit 126 generates a prediction signal (inter-prediction signal) by performing inter-prediction (also called inter-screen prediction) of the current block by referring to a reference picture stored in the frame memory 122 that is different from the current picture. Inter-prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 126 performs motion estimation within the reference picture for the current block or sub-block. Then, the inter-prediction unit 126 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) obtained from the motion estimation. Finally, the inter-prediction unit 126 outputs the generated inter-prediction signal to the prediction control unit 128.
[0117] The motion information used for motion compensation is signaled. A motion vector predictor may be used to signal the motion vector. In other words, the difference between the motion vector and the motion vector predictor may be signaled.
[0118] Furthermore, an inter-prediction signal may be generated using not only the motion information of the current block obtained through motion search, but also the motion information of adjacent blocks. Specifically, an inter-prediction signal may be generated for each sub-block within the current block by weighted addition of a prediction signal based on motion information obtained through motion search and a prediction signal based on the motion information of adjacent blocks. Such inter-prediction (motion compensation) is sometimes called OBMC (overlapped block motion compensation).
[0119] In this OBMC mode, information indicating the size of the subblock for OBMC (e.g., called the OBMC block size) is signaled at the sequence level. Information indicating whether or not to apply OBMC mode (e.g., called the OBMC flag) is signaled at the CU level. Note that the signaling levels for this information are not limited to the sequence and CU levels; other levels (e.g., picture level, slice level, tile level, CTU level, or subblock level) may also be used.
[0120] The OBMC mode is described in more detail below. Figures 5B and 5C are a flowchart and a conceptual diagram, respectively, illustrating the overview of the predictive image correction process using OBMC processing.
[0121] First, a predicted image (Pred) is obtained using normal motion compensation with the motion vector (MV) assigned to the block to be encoded.
[0122] Next, the motion vector (MV_L) of the encoded left adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_L), and the first correction of the predicted image is performed by superimposing the predicted image and Pred_L with weights.
[0123] Similarly, the motion vector (MV_U) of the encoded upper adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_U). The predicted image is then corrected a second time by weighting the first corrected predicted image and Pred_U, and this is used as the final predicted image.
[0124] While this explanation describes a two-stage correction method using the left adjacent block and the upper adjacent block, it is also possible to use the right adjacent block and the lower adjacent block to perform corrections more than two times.
[0125] Furthermore, the area to be superimposed does not have to be the entire pixel area of the block; it may be only a portion of the area near the block boundary.
[0126] Although this explanation describes the predictive image correction process using a single reference picture, the process is similar when correcting predictive images from multiple reference pictures. After obtaining corrected predictive images from each reference picture, the resulting predictive images are superimposed to create the final predictive image.
[0127] The processing target block may be a prediction block unit, or it may be a sub-block unit obtained by further dividing the prediction block.
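The two-stage correction described above can be sketched as follows. In this Python sketch, the weights W, the number of blended rows and columns, and the function name are assumptions for illustration; the text specifies only that the predicted images are superimposed with weights near the block boundary.

```python
# Illustrative sketch of OBMC blending; weights and names are hypothetical.
# Weight given to the neighbor's prediction, indexed by distance from the boundary.
W = [0.25, 0.125]

def obmc(pred, pred_l, pred_u):
    """Blend Pred with Pred_L (left) and then with Pred_U (upper)."""
    height, width = len(pred), len(pred[0])
    out = [row[:] for row in pred]
    # First correction: columns nearest the left boundary.
    for y in range(height):
        for d, wt in enumerate(W):
            out[y][d] = (1 - wt) * out[y][d] + wt * pred_l[y][d]
    # Second correction: rows nearest the upper boundary.
    for d, wt in enumerate(W):
        for x in range(width):
            out[d][x] = (1 - wt) * out[d][x] + wt * pred_u[d][x]
    return out

pred = [[100.0] * 4 for _ in range(4)]    # from the block's own MV
pred_l = [[120.0] * 4 for _ in range(4)]  # from MV_L of the left neighbor
pred_u = [[80.0] * 4 for _ in range(4)]   # from MV_U of the upper neighbor
final = obmc(pred, pred_l, pred_u)
for row in final:
    print(row)
```

Pixels far from both boundaries keep the original prediction, while pixels in the top-left corner receive both corrections.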
[0128] One method for determining whether or not to apply OBMC processing is to use an obmc_flag signal, which indicates whether or not to apply OBMC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region with complex motion. If it belongs to a region with complex motion, the obmc_flag is set to a value of 1 and OBMC processing is applied to perform encoding. If it does not belong to a region with complex motion, the obmc_flag is set to a value of 0 and encoding is performed without applying OBMC processing. On the other hand, in a decoding device, the obmc_flag written in the stream is decoded, and the device switches whether or not to apply OBMC processing depending on its value and performs decoding.
[0129] Note that motion information may be derived on the decoding device side without being signaled. For example, the merge mode specified in the H.265 / HEVC standard may be used. Alternatively, motion information may be derived by performing a motion search on the decoding device side. In this case, the motion search is performed without using the pixel values of the current block.
[0130] Here, a mode in which motion search is performed on the decoding device side is described. This mode is sometimes called PMMVD (pattern matched motion vector derivation) mode or FRUC (frame rate up-conversion) mode.
[0131] An example of FRUC processing is shown in Figure 5D. First, a list of multiple candidates, each having a predicted motion vector, is generated by referencing the motion vectors of encoded blocks that are spatially or temporally adjacent to the current block (the list may be the same as the merge list). Next, the best candidate MV is selected from among the multiple candidate MVs registered in the candidate list. For example, an evaluation value is calculated for each candidate included in the candidate list, and one candidate is selected based on the evaluation value.
[0132] Then, based on the motion vector of the selected candidate, a motion vector for the current block is derived. Specifically, for example, the motion vector of the selected candidate (best candidate MV) is used as is as the motion vector for the current block. Alternatively, for example, the motion vector for the current block may be derived by performing pattern matching in the area surrounding the position in the reference picture corresponding to the motion vector of the selected candidate. That is, a similar search is performed in the area surrounding the best candidate MV, and if an MV with a better evaluation value is found, the best candidate MV may be updated to this MV and used as the final MV for the current block. It is also possible to configure the system so that this process is not performed.
[0133] The same processing method can be used when processing at the sub-block level.
[0134] The evaluation value is calculated by determining the difference value of the reconstructed image through pattern matching between a region in the reference picture corresponding to the motion vector and a predetermined region. Alternatively, the evaluation value may be calculated using other information in addition to the difference value.
[0135] For pattern matching, either first pattern matching or second pattern matching is used. First pattern matching and second pattern matching are sometimes called bilateral matching and template matching, respectively.
[0136] In the first pattern matching, pattern matching is performed between two blocks in two different reference pictures that are aligned with the motion trajectory of the current block. Therefore, in the first pattern matching, a region in another reference picture aligned with the motion trajectory of the current block is used as a predetermined region for calculating the evaluation value of the candidate described above.
[0137] Figure 6 illustrates an example of pattern matching (bilateral matching) between two blocks along a motion trajectory. As shown in Figure 6, in the first pattern matching, two motion vectors (MV0, MV1) are derived by searching for the best-matching pair of two blocks within two different reference pictures (Ref0, Ref1) that are along the motion trajectory of the current block. Specifically, for the current block, the difference between the reconstructed image at a specified position in the first encoded reference picture (Ref0) specified by the candidate MV and the reconstructed image at a specified position in the second encoded reference picture (Ref1) specified by the symmetric MV obtained by scaling the candidate MV by the display time interval is derived, and an evaluation value is calculated using the obtained difference value. It is preferable to select the candidate MV with the best evaluation value among multiple candidate MVs as the final MV.
[0138] Under the assumption of a continuous motion trajectory, the motion vectors (MV0, MV1) pointing to the two reference blocks are proportional to the temporal distances (TD0, TD1) between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1). For example, if the current picture is temporally located between the two reference pictures and the temporal distances from the current picture to the two reference pictures are equal, then the first pattern matching derives mirror-symmetric bidirectional motion vectors.
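The proportionality between the motion vectors and the temporal distances can be sketched as follows. The Python code below assumes that the current picture lies between Ref0 and Ref1, so that the two motion vectors point in opposite directions; the function name and the use of exact fractions are illustrative choices, not part of the disclosure.

```python
# Illustrative sketch; assumes Cur Pic lies temporally between Ref0 and Ref1.
from fractions import Fraction

def mirrored_mv(mv0, td0, td1):
    """MV toward Ref1 implied by mv0 under a continuous motion trajectory.

    td0 and td1 are the (positive) temporal distances from the current
    picture to Ref0 and Ref1, respectively.
    """
    scale = Fraction(-td1, td0)
    return (mv0[0] * scale, mv0[1] * scale)

print(mirrored_mv((8, -4), 2, 2))  # equal distances: mirror-symmetric MV
print(mirrored_mv((8, -4), 2, 1))  # Ref1 is closer: shorter MV
```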
[0139] In the second pattern matching, pattern matching is performed between the template in the current picture (blocks adjacent to the current block in the current picture (e.g., blocks above and / or to the left)) and the blocks in the reference picture. Therefore, in the second pattern matching, the blocks adjacent to the current block in the current picture are used as a predetermined area for calculating the evaluation value of the candidates mentioned above.
[0140] Figure 7 illustrates an example of pattern matching (template matching) between a template in the current picture and a block in the reference picture. As shown in Figure 7, in the second pattern matching, the motion vector of the current block is derived by searching in the reference picture (Ref0) for the block that best matches the block adjacent to the current block (Cur block) in the current picture (Cur Pic). Specifically, for the current block, the difference is derived between the reconstructed image of the encoded region of both or either of the left adjacent and upper adjacent regions and the reconstructed image at the equivalent position in the encoded reference picture (Ref0) specified by the candidate MV. An evaluation value is calculated using the obtained difference value, and the candidate MV with the best evaluation value among multiple candidate MVs is selected as the best candidate MV.
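Template matching can be sketched as follows. In this Python sketch, the template is assumed to be one pixel thick (the row directly above and the column directly to the left of the current block), and the evaluation value is assumed to be a sum of absolute differences; both choices, along with all names and the synthetic pictures, are illustrative assumptions.

```python
# Illustrative sketch of template matching; names and data are hypothetical.
def template_positions(bx, by, size):
    """Pixels directly above and directly to the left of the block."""
    above = [(bx + i, by - 1) for i in range(size)]
    left = [(bx - 1, by + j) for j in range(size)]
    return above + left

def template_cost(cur, ref, bx, by, size, mv):
    """Sum of absolute differences between the template and the reference."""
    dx, dy = mv
    return sum(abs(cur[y][x] - ref[y + dy][x + dx])
               for x, y in template_positions(bx, by, size))

def best_candidate(cur, ref, bx, by, size, candidates):
    return min(candidates,
               key=lambda mv: template_cost(cur, ref, bx, by, size, mv))

# Synthetic pictures: the current picture equals the reference picture
# sampled at an offset of (2, 1), so the true motion vector is (2, 1).
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[10 * (y + 1) + (x + 2) for x in range(8)] for y in range(8)]

candidates = [(0, 0), (1, 1), (2, 1), (-1, 0)]
best = best_candidate(cur, ref, 2, 2, 2, candidates)
print(best)
```

Only encoded (already reconstructed) neighboring pixels enter the cost, so a decoding device can run the same search without the pixel values of the current block.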
[0141] Information indicating whether or not to apply such a FRUC mode (e.g., called the FRUC flag) is signaled at the CU level. Furthermore, if the FRUC mode is applied (e.g., the FRUC flag is true), information indicating the pattern matching method (first pattern matching or second pattern matching) (e.g., called the FRUC mode flag) is signaled at the CU level. Note that the signaling of this information is not limited to the CU level; it may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).
[0142] Here, a mode for deriving motion vectors based on a model that assumes uniform linear motion is described. This mode is sometimes called BIO (bi-directional optical flow) mode.
[0143] Figure 8 is a diagram illustrating a model that assumes uniform linear motion. In Figure 8, (v_x, v_y) indicates the velocity vector, and τ0 and τ1 respectively indicate the temporal distances between the current picture (Cur Pic) and two reference pictures (Ref0, Ref1). (MVx0, MVy0) indicates the motion vector corresponding to the reference picture Ref0, and (MVx1, MVy1) indicates the motion vector corresponding to the reference picture Ref1.
[0144] At this time, under the assumption of uniform linear motion of the velocity vector (v_x, v_y), (MVx0, MVy0) and (MVx1, MVy1) are respectively (v_x τ0, v_y τ0) and (-v_x τ1, -v_y τ1), and the following optical flow equation (1) holds.
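[0145]
$$\frac{\partial I^{(k)}}{\partial t} + v_x \frac{\partial I^{(k)}}{\partial x} + v_y \frac{\partial I^{(k)}}{\partial y} = 0 \qquad (1)$$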
[0146] Here, I^(k) indicates the luminance value of the reference image k (k = 0, 1) after motion compensation. This optical flow equation indicates that the sum of (i) the temporal derivative of the luminance value, (ii) the product of the horizontal velocity and the horizontal component of the spatial gradient of the reference image, and (iii) the product of the vertical velocity and the vertical component of the spatial gradient of the reference image is equal to zero. Based on the combination of this optical flow equation and Hermite interpolation, the motion vectors in block units obtained from the merge list, etc. are corrected in pixel units.
[0147] Note that the motion vector may be derived on the decoder side by a method different from the derivation of the motion vector based on the model assuming uniform linear motion. For example, the motion vector may be derived in sub-block units based on the motion vectors of a plurality of adjacent blocks.
[0148] Here, we will describe a mode in which motion vectors are derived at the sub-block level based on the motion vectors of multiple adjacent blocks. This mode is sometimes called the affine motion compensation prediction mode.
[0149] Figure 9A is a diagram illustrating the derivation of subblock-level motion vectors based on the motion vectors of multiple adjacent blocks. In Figure 9A, the current block contains 16 4x4 subblocks. Here, the motion vector v0 of the upper left corner control point of the current block is derived based on the motion vectors of the adjacent blocks, and the motion vector v1 of the upper right corner control point of the current block is derived based on the motion vectors of the adjacent subblocks. Then, using the two motion vectors v0 and v1, the motion vector (v_x, v_y) of each subblock within the current block is derived by the following equation (2).
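[0150]
$$\begin{cases} v_x = \dfrac{v_{1x} - v_{0x}}{w}\,x - \dfrac{v_{1y} - v_{0y}}{w}\,y + v_{0x} \\[2ex] v_y = \dfrac{v_{1y} - v_{0y}}{w}\,x + \dfrac{v_{1x} - v_{0x}}{w}\,y + v_{0y} \end{cases} \qquad (2)$$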
[0151] Here, x and y represent the horizontal and vertical positions of the subblock, respectively, and w represents a predetermined weighting coefficient.
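The derivation of subblock motion vectors from the two control point motion vectors can be sketched as follows. This Python sketch assumes the commonly used four-parameter affine model, in which the horizontal and vertical components are v_x = a·x − b·y + v0x and v_y = b·x + a·y + v0y with a = (v1x − v0x) / w and b = (v1y − v0y) / w. It further assumes that (x, y) is the center of each subblock measured from the top-left corner of the block and that w equals the block width. These are assumptions for illustration, as are the function and parameter names.

```python
# Illustrative sketch of affine subblock MV derivation; names are hypothetical.
def affine_subblock_mvs(v0, v1, w, block_size=16, sub=4):
    """Return {(x, y): (v_x, v_y)} for each subblock center (x, y).

    v0, v1: motion vectors of the upper-left and upper-right control points.
    w:      weighting coefficient (assumed here to be the block width).
    """
    a = (v1[0] - v0[0]) / w
    b = (v1[1] - v0[1]) / w
    mvs = {}
    for y in range(sub // 2, block_size, sub):      # subblock center rows
        for x in range(sub // 2, block_size, sub):  # subblock center columns
            mvs[(x, y)] = (a * x - b * y + v0[0], b * x + a * y + v0[1])
    return mvs

translation = affine_subblock_mvs((3, -2), (3, -2), 16)  # v0 == v1
zoom = affine_subblock_mvs((0, 0), (16, 0), 16)          # v1 - v0 horizontal
rotation = affine_subblock_mvs((0, 0), (0, 16), 16)      # v1 - v0 vertical
print(translation[(2, 2)], zoom[(14, 6)], rotation[(2, 6)])
```

When v0 and v1 are equal, every subblock receives the same motion vector; the larger the difference between v0 and v1, the more the subblock motion vectors diverge across the block.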
[0152] Such affine motion compensation prediction modes may include several modes in which the motion vectors of the upper-left and upper-right corner control points are derived. Information indicating such affine motion compensation prediction modes (e.g., called affine flags) is signaled at the CU level. Note that the signaling of this information indicating affine motion compensation prediction modes is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).
[0153] [Prediction Control Unit] The prediction control unit 128 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the subtraction unit 104 and the addition unit 116.
[0154] Here, we will explain an example of deriving the motion vector of a picture to be encoded using merge mode. Figure 9B is a diagram illustrating the overview of the motion vector derivation process using merge mode.
[0155] First, a list of predicted MVs is generated, containing registered candidates for predicted MVs. Candidates for predicted MVs include spatially adjacent predicted MVs, which are the MVs of multiple encoded blocks located spatially around the block to be encoded; temporally adjacent predicted MVs, which are the MVs of nearby blocks projected onto the location of the block to be encoded in the encoded reference picture; combined predicted MVs, which are generated by combining the MV values of spatially adjacent predicted MVs and temporally adjacent predicted MVs; and zero predicted MVs, which are MVs with a value of zero.
[0156] Next, one predicted MV is selected from the multiple predicted MVs registered in the predicted MV list to determine it as the MV for the block to be encoded.
[0157] Furthermore, the variable-length encoding unit writes merge_idx, a signal indicating which predicted MV was selected, into the stream and encodes it.
[0158] Note that the predicted MVs registered in the predicted MV list explained in Figure 9B are just an example, and the number of predicted MVs may differ from the number shown in the figure, the configuration may not include some of the types of predicted MVs shown in the figure, or it may include predicted MVs other than those shown in the figure.
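The construction of the predicted MV list and the role of merge_idx can be sketched as follows. In this Python sketch, the duplicate pruning, the list length limit, the rule for the combined predicted MV (a component-wise average), and the cost function are placeholders introduced for illustration; the text does not specify them. Only the identifier merge_idx comes from the description above.

```python
# Illustrative sketch of merge mode; rules and names are hypothetical.
def build_merge_list(spatial, temporal, max_candidates=5):
    merge_list = []

    def register(mv):
        # Assumption: duplicates are pruned and the list length is capped.
        if mv not in merge_list and len(merge_list) < max_candidates:
            merge_list.append(mv)

    for mv in spatial:    # spatially adjacent predicted MVs
        register(mv)
    for mv in temporal:   # temporally adjacent predicted MVs
        register(mv)
    if spatial and temporal:
        # Placeholder combining rule: component-wise average.
        s, t = spatial[0], temporal[0]
        register(((s[0] + t[0]) // 2, (s[1] + t[1]) // 2))
    register((0, 0))      # zero predicted MV
    return merge_list

def choose_merge_idx(merge_list, cost):
    """Encoder side: index of the candidate with the lowest cost."""
    return min(range(len(merge_list)), key=lambda i: cost(merge_list[i]))

merge_list = build_merge_list([(4, 0), (4, 0), (2, 2)], [(0, 6)])
# Hypothetical cost: distance to an MV found by the encoder's own search.
merge_idx = choose_merge_idx(
    merge_list, lambda mv: abs(mv[0] - 2) + abs(mv[1] - 3))
# Decoder side: rebuild the same list and look up the signaled index.
decoded_mv = merge_list[merge_idx]
print(merge_list, merge_idx, decoded_mv)
```

Because the decoding device can rebuild the same list from already decoded blocks, only the index needs to be written to the stream.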
[0159] Alternatively, the final MV may be determined by performing the DMVR process described later using the MV of the target block to be encoded derived by merge mode.
[0160] Here, we will explain an example of determining the MV using DMVR processing.
[0161] Figure 9C is a conceptual diagram illustrating the overview of DMVR processing.
[0162] First, the optimal MVP set for the block to be processed is used as a candidate MV. According to the candidate MV, reference pixels are obtained from the first reference picture, which is a processed picture in the L0 direction, and the second reference picture, which is a processed picture in the L1 direction, and a template is generated by taking the average of each reference pixel.
[0163] Next, using the template, the surrounding regions of candidate MVs for the first and second reference pictures are searched, and the MV with the lowest cost is determined as the final MV. The cost value is calculated using the difference between each pixel value of the template and each pixel value of the search region, as well as the MV value, etc.
[0164] Note that the general outline of the processing described here is basically the same for both the encoding and decoding devices.
[0165] Note that any process that can explore the vicinity of a candidate MV and derive the final MV may be used instead of the exact process described here.
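The template generation and neighbourhood search of [0162] and [0163] can be sketched as below. The sketch assumes integer-pel MVs, a square search window of hypothetical radius `radius`, and a cost equal to the sum of absolute differences only (the MV-value term mentioned in [0163] is left out). The function names are introduced here for illustration, and the caller is assumed to keep every access inside the picture.

```python
import numpy as np


def fetch(ref, top, left, mv, h, w):
    """Return the h x w block of `ref` displaced by the integer MV (mvx, mvy)."""
    y, x = top + mv[1], left + mv[0]
    return ref[y:y + h, x:x + w].astype(np.int64)


def dmvr_refine(ref0, ref1, top, left, h, w, mv0, mv1, radius=1):
    """Refine the candidate MVs for the L0 and L1 reference pictures."""
    # Template: rounded average of the two reference blocks given by the candidate MVs.
    template = (fetch(ref0, top, left, mv0, h, w)
                + fetch(ref1, top, left, mv1, h, w) + 1) >> 1

    def search(ref, mv):
        best_mv, best_cost = mv, None
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cand = (mv[0] + dx, mv[1] + dy)
                # Cost: sum of absolute differences against the template (assumption).
                cost = int(np.abs(fetch(ref, top, left, cand, h, w) - template).sum())
                if best_cost is None or cost < best_cost:
                    best_mv, best_cost = cand, cost
        return best_mv

    # The lowest-cost MV around each candidate becomes the final MV.
    return search(ref0, mv0), search(ref1, mv1)
```

Since the search uses only already-processed reference pictures, the same routine can run in both the encoding device and the decoding device, in line with [0164].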
[0166] Here, we will explain the mode for generating predictive images using LIC processing.
[0167] Figure 9D is a diagram illustrating the outline of a predictive image generation method using luminance correction processing by LIC processing.
[0168] First, a motion vector (MV) is derived for obtaining, from the reference picture, which is an encoded picture, the reference image corresponding to the block to be encoded.
[0169] Next, for the block to be encoded, information indicating how the luminance values have changed between the reference picture and the picture to be encoded is extracted using the luminance pixel values of the left-adjacent and top-adjacent encoded surrounding reference regions, and the luminance pixel values at the equivalent positions in the reference picture specified by MV, and a luminance correction parameter is calculated.
[0170] By performing luminance correction processing on the reference image within the reference picture specified by the MV using the luminance correction parameter, a predicted image for the block to be encoded is generated.
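As an illustration of [0169] and [0170], the sketch below fits a luminance correction parameter from the surrounding reference regions and applies it to the reference image. The linear form `a * ref + b`, the least-squares fit, the 8-bit clipping, and the function names are assumptions made for this example; the text above does not specify the form of the parameter.

```python
import numpy as np


def lic_parameters(cur_neighbors, ref_neighbors):
    """Fit cur ~ a * ref + b over the left/top surrounding reference samples."""
    x = np.asarray(ref_neighbors, dtype=np.float64).ravel()
    y = np.asarray(cur_neighbors, dtype=np.float64).ravel()
    var = np.var(x)
    if var == 0.0:
        # Flat neighbourhood: fall back to a pure offset.
        return 1.0, float(y.mean() - x.mean())
    a = float(np.mean((x - x.mean()) * (y - y.mean())) / var)
    b = float(y.mean() - a * x.mean())
    return a, b


def apply_lic(ref_block, a, b, bit_depth=8):
    """Apply the correction to the reference image and clip to the sample range."""
    out = np.rint(a * np.asarray(ref_block, dtype=np.float64) + b)
    return np.clip(out, 0, (1 << bit_depth) - 1).astype(np.int64)
```

The parameters are computed only from samples that are already encoded (or decoded), so they do not need to be transmitted.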
[0171] Note that the shape of the surrounding reference region in Figure 9D is just one example, and other shapes may be used.
[0172] Furthermore, while this explanation describes the process of generating a predicted image from a single reference picture, the process is similar when generating a predicted image from multiple reference pictures: luminance correction processing is performed in the same manner on the reference image obtained from each reference picture before the predicted image is generated.
[0173] One method for determining whether or not to apply LIC processing is to use a signal called lic_flag, which indicates whether or not to apply LIC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region where brightness changes occur. If it belongs to a region where brightness changes occur, the value of lic_flag is set to 1 and LIC processing is applied and encoding is performed. If it does not belong to a region where brightness changes occur, the value of lic_flag is set to 0 and encoding is performed without applying LIC processing. On the other hand, in a decoding device, the lic_flag written in the stream is decoded, and the device switches whether or not to apply LIC processing according to its value and performs decoding.
[0174] Another way to determine whether to apply LIC processing is, for example, by checking whether LIC processing has been applied to surrounding blocks. A specific example is that if the block to be encoded is in merge mode, during the MV derivation in merge mode processing, it is determined whether the surrounding encoded blocks selected were encoded with LIC processing. Based on this result, the application of LIC processing is switched, and encoding is performed accordingly. In this example, the decoding process is exactly the same.
[0175] [Overview of the decoding device] Next, an overview of a decoding device capable of decoding the encoded signal (encoded bitstream) output from the above-mentioned encoding device 100 will be described. Figure 10 is a block diagram showing the functional configuration of the decoding device 200 according to Embodiment 1. The decoding device 200 is a video / image decoding device that decodes video / images in block units.
[0176] As shown in Figure 10, the decoding device 200 includes an entropy decoding unit 202, an inverse quantization unit 204, an inverse transform unit 206, an addition unit 208, a block memory 210, a loop filter unit 212, a frame memory 214, an intra-prediction unit 216, an inter-prediction unit 218, and a prediction control unit 220.
[0177] The decoding device 200 can be implemented, for example, by a general-purpose processor and memory. In this case, when the software program stored in the memory is executed by the processor, the processor functions as the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the addition unit 208, the loop filter unit 212, the intra-prediction unit 216, the inter-prediction unit 218, and the prediction control unit 220. Alternatively, the decoding device 200 may be implemented as one or more dedicated electronic circuits corresponding to the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the addition unit 208, the loop filter unit 212, the intra-prediction unit 216, the inter-prediction unit 218, and the prediction control unit 220.
[0178] The following describes each component included in the decoding device 200.
[0179] [Entropy Decoding Unit] The entropy decoding unit 202 entropy-decodes the encoded bitstream. Specifically, for example, the entropy decoding unit 202 arithmetically decodes the encoded bitstream into a binary signal and then debinarizes the binary signal. As a result, the entropy decoding unit 202 outputs the quantization coefficients in block units to the inverse quantization unit 204.
[0180] [Inverse Quantization Unit] The inverse quantization unit 204 inversely quantizes the quantization coefficients of the block to be decoded (hereinafter referred to as the current block), which are input from the entropy decoding unit 202. Specifically, for each quantization coefficient of the current block, the inverse quantization unit 204 inversely quantizes the quantization coefficient based on the quantization parameter corresponding to that quantization coefficient. The inverse quantization unit 204 then outputs the inversely quantized quantization coefficients (i.e., transform coefficients) of the current block to the inverse transform unit 206.
[0181] [Inverse Transform Unit] The inverse transform unit 206 restores the prediction error by inversely transforming the transform coefficients input from the inverse quantization unit 204.
[0182] For example, if the information decoded from the encoded bitstream indicates that EMT or AMT should be applied (e.g., the AMT flag is true), the inverse transform unit 206 inversely transforms the transform coefficients of the current block based on the decoded information indicating the transform type.
[0183] Also, for example, if the information decoded from the encoded bitstream indicates that NSST should be applied, the inverse transform unit 206 applies an inverse secondary transform (inverse re-transform) to the transform coefficients.
[0184] [Addition Unit] The addition unit 208 reconstructs the current block by adding the prediction error input from the inverse transform unit 206 and the prediction samples input from the prediction control unit 220. The addition unit 208 then outputs the reconstructed block to the block memory 210 and the loop filter unit 212.
[0185] [Block Memory] The block memory 210 is a storage unit for storing blocks that are referenced in intra prediction and are located within the picture to be decoded (hereinafter referred to as the current picture). Specifically, the block memory 210 stores the reconstructed blocks output from the addition unit 208.
[0186] [Loop Filter Unit] The loop filter unit 212 applies a loop filter to the block reconstructed by the addition unit 208 and outputs the filtered reconstructed block to the frame memory 214, a display device, and the like.
[0187] If the information indicating ALF on / off decoded from the encoded bitstream indicates that ALF is on, one filter is selected from among several filters based on the direction and activity of the local gradient, and the selected filter is applied to the reconstructed block.
[0188] [Frame Memory] The frame memory 214 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 214 stores the reconstructed blocks filtered by the loop filter unit 212.
[0189] [Intra Prediction Unit] The intra-prediction unit 216 generates a prediction signal (intra-prediction signal) by performing intra-prediction based on the intra-prediction mode decoded from the encoded bitstream, and by referring to the blocks in the current picture stored in the block memory 210. Specifically, the intra-prediction unit 216 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, chrominance values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 220.
[0190] Furthermore, if an intra-prediction mode that references a luminance block is selected in the intra-prediction of a chrominance block, the intra-prediction unit 216 may predict the chrominance component of the current block based on the luminance component of the current block.
[0191] Furthermore, if the information decoded from the encoded bitstream indicates the application of PDPC, the intra-prediction unit 216 corrects the pixel value after intra-prediction based on the gradient of the horizontal / vertical reference pixels.
[0192] [Inter Prediction Unit] The inter-prediction unit 218 predicts the current block by referring to a reference picture stored in the frame memory 214. Prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 218 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) decoded from the encoded bitstream, and outputs the inter-prediction signal to the prediction control unit 220.
[0193] Furthermore, if the information decoded from the encoded bitstream indicates that OBMC mode should be applied, the inter-prediction unit 218 generates an inter-prediction signal using not only the motion information of the current block obtained by motion search, but also the motion information of adjacent blocks.
[0194] Furthermore, if the information decoded from the encoded bitstream indicates that FRUC mode should be applied, the inter-prediction unit 218 derives motion information by performing a motion search according to the pattern matching method (bilateral matching or template matching) decoded from the encoded bitstream. Then, the inter-prediction unit 218 performs motion compensation using the derived motion information.
[0195] Furthermore, when the BIO mode is applied, the inter-prediction unit 218 derives motion vectors based on a model that assumes uniform linear motion. Also, if the information decoded from the encoded bitstream indicates that the affine motion compensation prediction mode should be applied, the inter-prediction unit 218 derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks.
[0196] [Prediction Control Unit] The prediction control unit 220 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the addition unit 208.
[0197] [Explanation of the affine intermode of affine motion compensation prediction] As mentioned above, the mode in which motion vectors are derived at the sub-block level based on the motion vectors of multiple adjacent blocks is called the affine motion compensation prediction mode. There are two modes of affine motion compensation prediction: the affine intermode and the affine merge mode. The following describes these two modes.
[0198] Figure 11 is a conceptual diagram illustrating the affine intermode of affine motion compensation prediction. Figure 11 shows the current processing target block, the predicted motion vector v0 of the upper left corner control point of the current processing target block, and the predicted motion vector v1 of the upper right corner control point.
[0199] In affine intermode, as shown in Figure 11, the predicted motion vector v0 of the upper left corner control point is selected from the motion vectors of blocks A, B, and C, which are processed blocks adjacent to the currently processed block. Similarly, the predicted motion vector v1 of the upper right corner control point is selected from the motion vectors of blocks D and E, which are processed blocks adjacent to the currently processed block.
[0200] In the encoding process, cost evaluation and other methods are used to determine which encoded block's motion vector to select, from among the adjacent processed blocks, as the predicted motion vector for the control point in the affine motion compensation prediction of the currently processed block. Then, a flag indicating which encoded block's motion vector was selected as the predicted motion vector for the control point is written to the bitstream.
[0201] Furthermore, in the encoding process, after the predicted motion vector of the control point of the currently processed block is determined, a motion search is performed to detect the motion vector of the control point. Using the detected motion vector of the control point, the affine motion vector of each subblock within the currently processed block is calculated from equation (2) above, and motion compensation is performed. Then, along with the motion compensation process, the difference between the detected motion vector of the control point and the predicted motion vector of the control point is written to the bitstream.
[0202] The above describes the operation of the encoding process, but the operation of the decoding process is similar.
[0203] [Explanation of the affine merge mode for affine motion compensation prediction] Figures 12A and 12B are conceptual diagrams illustrating the affine merge mode of affine motion compensation prediction. Figure 12A shows the currently processed block and the processed blocks A to D adjacent to the currently processed block. Figure 12B shows an example of processing in the affine merge mode of affine motion compensation prediction. Figure 12B also shows the currently processed block, the predicted motion vector v0 of the upper left corner control point, the predicted motion vector v1 of the upper right corner control point, and the processed block A.
[0204] In affine merge mode, as shown in Figure 12A, the system checks the processed blocks adjacent to the current target block in the order of processed blocks A (left), B (top), C (upper right), D (lower left), and E (upper left). Then, from among the adjacent processed blocks A to D, it identifies the first valid processed block encoded with affine motion compensation prediction.
[0205] For example, using Figure 12B, if the processed block A adjacent to the left of the currently processed block is encoded with affine motion compensation prediction, then motion vectors v2, v3, and v4 for the upper left, upper right, and lower left corners of the processed block containing block A are derived. Next, from the derived motion vectors v2, v3, and v4, the motion vector v0 for the control point of the upper left corner of the currently processed block is derived. Similarly, the motion vector v1 for the control point of the upper right corner of the currently processed block is calculated.
[0206] Next, using the calculated motion vector v0 of the upper left corner control point and the motion vector v1 of the upper right corner control point of the currently processed block, the affine motion vectors of each subblock within the currently processed block are calculated from equation (2) and motion compensation is performed.
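The derivation of v0 and v1 from v2, v3, and v4 in [0205] can be illustrated as below. The sketch assumes that the motion field of the neighbouring block is affine and is extrapolated linearly from its three corner MVs; that extrapolation rule, the coordinate convention (top-left corners in pixels), and the function names are assumptions made for this example and are not stated in the text above.

```python
def extrapolate(v2, v3, v4, nb_pos, nb_size, point):
    """Evaluate the neighbour's affine motion field at `point`.

    v2, v3, v4: MVs at the neighbour's upper-left, upper-right and lower-left corners.
    nb_pos: (x, y) of the neighbour's upper-left corner; nb_size: (width, height).
    """
    (xn, yn), (wn, hn) = nb_pos, nb_size
    dx, dy = point[0] - xn, point[1] - yn
    return (v2[0] + (v3[0] - v2[0]) * dx / wn + (v4[0] - v2[0]) * dy / hn,
            v2[1] + (v3[1] - v2[1]) * dx / wn + (v4[1] - v2[1]) * dy / hn)


def derive_control_point_mvs(v2, v3, v4, nb_pos, nb_size, cur_pos, cur_width):
    """Return (v0, v1) for the upper-left and upper-right corners of the current block."""
    v0 = extrapolate(v2, v3, v4, nb_pos, nb_size, cur_pos)
    v1 = extrapolate(v2, v3, v4, nb_pos, nb_size,
                     (cur_pos[0] + cur_width, cur_pos[1]))
    return v0, v1
```

For example, a neighbour whose corner MVs describe a uniform zoom yields control point MVs for the current block that continue the same zoom.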
[0207] The above describes the operation of the encoding process, but the operation of the decoding process is similar.
[0208] [Internal configuration for affine motion compensation prediction in the inter-prediction unit of the encoding device] Figure 13 is a block diagram showing the internal configuration for performing affine motion compensation prediction processing in the inter-prediction unit 126 included in the encoding device 100 in Embodiment 1. While the operation of the inter-prediction unit 126 included in the encoding device 100 will be mainly described below, the operation of the inter-prediction unit 218 included in the decoding device 200 is similar.
[0209] As shown in Figure 13, the inter-prediction unit 126 in Embodiment 1 includes a range determination unit 1261, a control point MV derivation unit 1262, an affine MV calculation unit 1263, a motion compensation unit 1264, and a motion search unit 1265.
[0210] The range determination unit 1261 determines the range of motion search or motion compensation in the reference picture from range limitation information indicating a sub-region within the reference picture where motion search or motion compensation is permitted. The determined range of motion search or motion compensation restricts the range within the reference picture where motion search or motion compensation is permitted in the affine motion compensation prediction. By restricting the range within the reference picture where motion search or motion compensation is permitted in this way, the range that the motion vector (MV) selected in the affine motion compensation prediction can take is restricted.
[0211] The control point MV derivation unit 1262 derives the motion vector of a control point (hereinafter referred to as control point MV) using the MVs of adjacent processed blocks in the affine intermode or affine merge mode of affine motion compensation prediction.
[0212] The affine MV calculation unit 1263 uses the control point MV derived by the control point MV derivation unit 1262 to calculate the affine motion vector (hereinafter referred to as affine MV) for each subblock according to equation (2).
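The per-subblock calculation performed by the affine MV calculation unit 1263 can be sketched as follows. Equation (2) is not reproduced in this part of the description, so the sketch assumes it has the common four-parameter form driven by the two control point MVs v0 and v1. The 4x4 subblock size, the evaluation at the subblock centre, and the function name are likewise assumptions for illustration.

```python
def affine_subblock_mvs(v0, v1, width, height, sub=4):
    """Return {(x, y): (mvx, mvy)} for each sub x sub subblock of a width x height block.

    v0, v1: control point MVs at the upper-left and upper-right corners.
    """
    a = (v1[0] - v0[0]) / width  # horizontal gradient of the x component
    b = (v1[1] - v0[1]) / width  # horizontal gradient of the y component
    mvs = {}
    for y in range(0, height, sub):
        for x in range(0, width, sub):
            cx, cy = x + sub / 2, y + sub / 2  # subblock centre (assumption)
            mvs[(x, y)] = (a * cx - b * cy + v0[0],
                           b * cx + a * cy + v0[1])
    return mvs
```

When v0 and v1 are equal, every subblock receives the same MV, that is, the model reduces to ordinary translational motion compensation.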
[0213] The motion compensation unit 1264 generates a predicted image by performing motion compensation using the affine MV calculated for each subblock by the affine MV calculation unit 1263.
[0214] The motion search unit 1265 performs motion search in affine intermode by evaluating the cost using the predicted image input from the motion compensation unit 1264 and the actual image. At this time, if the affine MV is outside the range of motion search and motion compensation for the reference picture, the affine intermode and affine merge mode using the control point MV are prohibited.
[0215] [First processing step of affine intermode] Figure 14 is a flowchart showing the first processing procedure for the affine intermode of affine motion compensation by the inter-prediction unit 126 of the encoding device 100 in Embodiment 1.
[0216] As shown in Figure 14, the inter-prediction unit 126 first determines the range in which motion search can be performed from the range limitation information of motion search, which indicates a partial region within the reference picture in which motion search can be performed (S101).
[0217] Next, the inter-prediction unit 126 acquires motion vectors (MVs) of multiple processed blocks adjacent to the currently processed block, and uses the acquired MVs to derive a predicted motion vector for the control point (referred to as the control point prediction MV) (S102). This is the derivation explained for the affine intermode with reference to Figure 11. Here, a flag indicating which of the multiple MVs was used is written to the bitstream.
[0218] Next, while updating the control point motion vector (control point MV) (S103), the inter-prediction unit 126 calculates the affine MV for each subblock according to equation (2) (S104).
[0219] At this point, the inter-prediction unit 126 determines whether the affine MV calculated in step S104 is within the range of motion search (S105). If it is determined to be outside the range (out of range in S105), the inter-prediction unit 126 excludes the control point MV being evaluated from the search candidates.
[0220] On the other hand, if it is determined to be within the range (within range in S105), the inter-prediction unit 126 performs affine motion compensation (S106) and searches for the control point MV that minimizes the cost value (for example, the difference between the current image and the predicted image). Here, the difference between the obtained control point MV and the control point prediction MV updated in step S103 is written to the bitstream. Note that, for example, by using the control point prediction MV as the control point MV at the start of the search, the search can start from an initial value closer to the optimal solution, which can improve the accuracy of the search.
[0221] Furthermore, if processing applied after affine motion compensation prediction (such as OBMC) uses surrounding pixels of the target block, and the region containing those surrounding pixels is outside the motion search range, the control point MV being evaluated may be excluded from the search candidates.
[0222] Furthermore, the inter-prediction unit 218 in the decoding device 200 performs decoding according to the bitstream information derived by the encoding device 100 using the method described above, thereby enabling affine motion compensation processing using affine intermode within a limited range of motion search.
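The range determination of step S105 can be sketched as below. The sketch assumes the permitted range is an axis-aligned rectangle in reference picture coordinates with inclusive bounds, that each subblock references a `sub` x `sub` area displaced by its affine MV, and that the extra samples needed by an interpolation filter are ignored. The function name and data layout are hypothetical.

```python
import math


def affine_mvs_within_range(block_pos, subblock_mvs, allowed, sub=4):
    """Return True if every subblock's reference area lies inside `allowed`.

    block_pos: (x, y) of the current block's upper-left corner.
    subblock_mvs: {(sx, sy): (mvx, mvy)} with subblock offsets inside the block.
    allowed: (x_min, y_min, x_max, y_max), inclusive (assumption).
    """
    bx, by = block_pos
    x_min, y_min, x_max, y_max = allowed
    for (sx, sy), (mvx, mvy) in subblock_mvs.items():
        # Round outwards so that fractional MVs are treated conservatively.
        left = bx + sx + math.floor(mvx)
        top = by + sy + math.floor(mvy)
        right = bx + sx + math.ceil(mvx) + sub - 1
        bottom = by + sy + math.ceil(mvy) + sub - 1
        if left < x_min or top < y_min or right > x_max or bottom > y_max:
            return False  # out of range in S105: drop this control point MV
    return True
```

A control point MV candidate for which this check fails would be excluded from the search candidates before the more expensive motion compensation of step S106.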
[0223] [Second processing step for affine intermode] Figure 15 is a flowchart showing the second processing procedure for the affine intermode of affine motion compensation by the inter-prediction unit 126 of the encoding device 100 in Embodiment 1. The same reference numerals are used for elements similar to those in Figure 14, and detailed explanations are omitted. The second processing procedure shown in Figure 15 differs from the first processing procedure shown in Figure 14 in that the determination of whether or not to exclude the control point MV being evaluated from the search candidates is made using the control point MV instead of the affine MV.
[0224] Specifically, in step S107, the inter-prediction unit 126 determines whether the variation between the values of the two control point MVs at the upper left and upper right corners being evaluated is within a limit range. If it is outside the range (out of range in S107), the inter-prediction unit 126 excludes the control point MVs being evaluated from the search candidates. Here, for example, the variation between the values of the two control point MVs at the upper left and upper right corners refers to the difference in magnitude and direction between the two control point MVs, the temporal distance to the reference picture, or the magnitude of the difference value. It is also desirable that the positions within the reference picture pointed to by the two control point MVs at the upper left and upper right corners are within a limit range.
[0225] The affine MV for each subblock is calculated in step S104 from the control point MV according to equation (2). Therefore, if the variation between the two control point MVs is large, the variation in the calculated affine MV will also be large and will not fall within the range of motion search. Conversely, if the variation between the two control point MVs is within a specific range, the calculated affine MV will fall within the range of motion search. Therefore, by pre-determining the above specific range as a limit range in step S101, it becomes possible to exclude the control point MV from the search candidates when it is updated, thereby reducing the processing load.
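The determination of step S107 can be sketched as follows. Among the measures of variation listed in [0224], this example uses only the magnitude of the difference between the two control point MVs; that choice, the scalar `limit`, and the function names are assumptions made for illustration.

```python
import math


def control_point_variation(v0, v1):
    """Magnitude of the difference between the upper-left and upper-right control point MVs."""
    return math.hypot(v1[0] - v0[0], v1[1] - v0[1])


def prune_by_variation(candidates, limit):
    """Keep only the (v0, v1) pairs whose variation is within the limit range."""
    return [(v0, v1) for v0, v1 in candidates
            if control_point_variation(v0, v1) <= limit]
```

Because the check needs only the two control point MVs, it can run before any per-subblock affine MV is calculated, which is where the reduction in processing load comes from.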
[0226] Furthermore, if the inter-prediction unit 126 determines, at the time it has derived the control point prediction MV in step S102, that the variation in the control point prediction MV is already outside the limit range, it may prohibit encoding as affine intermode.
[0227] Furthermore, if processing after affine motion compensation prediction (such as OBMC) uses surrounding pixels of the target block, the region containing the surrounding pixels used in the post-application processing may be excluded from the search candidates if it is outside the motion search range.
[0228] Furthermore, the inter-prediction unit 218 in the decoding device 200 performs decoding according to the bitstream information derived by the encoding device 100 using the method described above, thereby enabling affine motion compensation processing using affine intermode within a limited range of motion search.
[0229] [First processing step in affine merge mode] Figure 16 is a flowchart showing the first processing procedure of the affine merge mode of affine motion compensation by the inter-prediction unit 126 of the encoding device 100 in Embodiment 1.
[0230] As shown in Figure 16, the inter-prediction unit 126 first determines the range in which motion compensation can be performed from the range limitation information of motion compensation, which indicates a partial region within the reference picture where motion compensation can be performed (S201).
[0231] Next, the inter-prediction unit 126 inspects the processed blocks adjacent to the current target block in a predetermined order. Then, the inter-prediction unit 126 selects the MV of the first valid processed block encoded with affine motion compensation prediction, and derives the control point MV using the selected MV (S202). As explained with reference to Figure 12A, in affine merge mode the adjacent processed blocks are inspected in the order of block A (left), block B (top), block C (upper right), block D (lower left), and block E (upper left).
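The inspection order of step S202 can be sketched as below. Representing each neighbouring block as a dictionary with an `"affine"` entry, and representing an unavailable neighbour as `None`, are hypothetical choices for this example.

```python
from typing import Dict, Optional

# Inspection order from the description: A (left), B (top), C (upper right),
# D (lower left), E (upper left).
SCAN_ORDER = ("A", "B", "C", "D", "E")


def first_affine_neighbor(neighbors: Dict[str, Optional[dict]]) -> Optional[str]:
    """Return the name of the first valid neighbour encoded with affine prediction."""
    for name in SCAN_ORDER:
        block = neighbors.get(name)
        # A neighbour is skipped when it is unavailable or not affine-coded.
        if block is not None and block.get("affine"):
            return name
    return None  # no neighbour qualifies, so affine merge mode cannot be used
```

Because the order is fixed and depends only on already-processed blocks, a decoder following the same order arrives at the same neighbour.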
[0232] Next, the inter-prediction unit 126 calculates the affine MV for each subblock from the control point MV derived in step S202 according to equation (2) (S203).
[0233] At this time, the inter-prediction unit 126 determines whether the affine MV calculated in step S203 is within the range of motion search (S204). If it is determined to be outside the range (out of range in S204), the inter-prediction unit 126 prohibits encoding as affine merge mode (S205).
[0234] On the other hand, if it is determined to be within the range (within range in S204), the inter-prediction unit 126 acquires a predicted image by performing affine motion compensation (S206).
[0235] Furthermore, if processing after affine motion compensation prediction (such as OBMC) uses peripheral pixels of the target block, and the region containing the peripheral pixels used in the post-processing after affine motion compensation prediction is outside the motion compensation range, encoding as affine merge mode may be prohibited.
[0236] Furthermore, the inter-prediction unit 218 in the decoding device 200 performs decoding according to the bitstream information derived by the encoding device 100 using the method described above, thereby enabling affine motion compensation processing using affine merge mode within a limited range of motion compensation.
[0237] [Second processing step in affine merge mode] Figure 17 is a flowchart showing the second processing procedure of the affine merge mode of affine motion compensation by the inter-prediction unit 126 of the encoding device 100 in Embodiment 1. The same reference numerals are used for elements similar to those in Figure 16, and detailed explanations are omitted. The second processing procedure shown in Figure 17 differs from the first processing procedure shown in Figure 16 in that the determination of whether or not to prohibit encoding as affine merge mode is made using the control point MV instead of the affine MV.
[0238] Specifically, in step S207, the inter-prediction unit 126 determines whether the variation between the values of the two control point MVs at the upper left and upper right corners being evaluated is within the limit range. If it is outside the range (out of range in S207), the process proceeds to step S205 and encoding as affine merge mode is prohibited. Here, for example, the variation between the values of the two control point MVs at the upper left and upper right corners refers to the difference in magnitude and direction between the two control point MVs, the temporal distance to the reference picture, or the magnitude of the difference value. It is also desirable that the positions within the reference picture pointed to by the two control point MVs at the upper left and upper right corners are within the limit range.
[0239] Since the affine MV for each subblock unit is calculated in step S203 from the control point MV according to equation (2), if the variation between the two control point MVs is large, the variation in the calculated affine MV will also be large and will not fall within the range of motion search. Conversely, if the variation between the two control point MVs is within a specific range, the calculated affine MV will fall within the range of motion search. Therefore, by determining the above specific range as a limit range in advance in step S201, it becomes possible to prohibit encoding as affine merge mode at the time the control point MV is derived, thereby reducing the amount of processing.
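The relationship between the variation of the control point MVs and the spread of the affine MVs can be made explicit with a short derivation. It assumes that equation (2), which is not reproduced in this part of the description, has the common four-parameter form shown in the first line; the symbols $w$, $h$ (block width and height), $d$, and $R$ are introduced here for illustration only.

```latex
% Assumed form of equation (2), with control point MVs v_0 and v_1 and block width w:
\begin{aligned}
v_x &= \frac{v_{1x}-v_{0x}}{w}\,x - \frac{v_{1y}-v_{0y}}{w}\,y + v_{0x},\\
v_y &= \frac{v_{1y}-v_{0y}}{w}\,x + \frac{v_{1x}-v_{0x}}{w}\,y + v_{0y}.
\end{aligned}

% With d = v_1 - v_0 = (d_x, d_y):
v(x,y) - v_0 = \frac{1}{w}
\begin{pmatrix} d_x & -d_y \\ d_y & d_x \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix},
\qquad
\lVert v(x,y) - v_0 \rVert = \frac{\lVert d \rVert}{w}\sqrt{x^2 + y^2}.

% Over a block of width w and height h the deviation from v_0 is therefore bounded by
\max_{0 \le x \le w,\; 0 \le y \le h} \lVert v(x,y) - v_0 \rVert
  = \lVert d \rVert \,\frac{\sqrt{w^2 + h^2}}{w}.

% If the permitted range tolerates a deviation of at most R from v_0, it suffices that
\lVert v_1 - v_0 \rVert \le \frac{R\,w}{\sqrt{w^2 + h^2}}.
```

Under this assumed model, a limit on the difference between the two control point MVs translates directly into a limit on how far any subblock's affine MV can stray from v0, which is why the limit range can be fixed in advance in step S201.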
[0240] Furthermore, if processing after affine motion compensation prediction (such as OBMC) uses peripheral pixels of the target block, and the region containing the peripheral pixels used in the post-processing after affine motion compensation prediction is outside the motion compensation range, encoding as affine merge mode may be prohibited.
[0241] Furthermore, the inter-prediction unit 218 in the decoding device 200 performs decoding according to the bitstream information derived by the encoding device 100 using the method described above, thereby enabling affine motion compensation processing using affine merge mode within a limited motion compensation range.
[0242] [Effects of Embodiment 1] According to Embodiment 1, by limiting the range of motion vectors, it is possible to suppress variations in control point motion vectors in affine motion compensation prediction, increasing the likelihood that affine motion compensation prediction will be selected in inter prediction. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0243] For example, if the motion search range is allowed to be wide without restriction, the magnitude and direction of the motion vectors of each block within the picture may vary widely. Also, affine motion compensation is intended for linear transformations such as scaling, shearing, and rotation of objects within a picture, as well as translation. When affine motion compensation prediction is selected, the motion vectors of the two control points at the upper left and upper right corners are likely to point in the same direction. However, if the magnitude and direction of the motion vectors of each block within the picture vary widely, the motion vectors selected from the surrounding processed blocks may also vary. As a result, the magnitude and direction of the affine motion vectors may also vary, making it less likely that affine motion compensation prediction will be selected. Furthermore, if the magnitude and direction of the motion vectors of each block within the picture vary widely, the area within the picture acquired as a reference image will also become wider, which is likely to increase the memory bandwidth required for the external memory, which is the frame memory.
[0244] Furthermore, limiting the range that the motion vector (MV) can take in the affine motion compensation prediction by the inter-prediction unit 126 of the encoding device 100 has the following consequence. In the inter-prediction unit 218 of the decoding device 200, which decodes the encoded bitstream generated by the encoding device 100, the range that the motion vector can take in the affine motion compensation prediction is limited in the same way as in the encoding device 100.
[0245] Furthermore, not all components described in Embodiment 1 are always necessary, and only some of the components from Embodiment 1 may be included. Also, the processing content of all components described in Embodiment 1 is not limited to this, and processing may be carried out using components other than those in Embodiment 1.
[0246] (Modification 1) In the affine motion compensation prediction process, the motion search range or the motion compensation range may be newly determined for each processing target picture, or a plurality of sets of motion search ranges and motion compensation ranges may be determined in advance, and an appropriate set may be selected for each processing target picture. Thereby, the variation in the control point motion vectors in the affine motion compensation prediction can be suppressed, and the possibility that the affine motion compensation prediction is selected in the inter prediction is increased. As a result, the motion compensation using the affine motion compensation can be efficiently performed.
[0247] Also, in the affine motion compensation prediction process, the motion search range or the motion compensation range may be changed according to the type of the reference picture. For example, when referring to a P picture, the motion search range or the motion compensation range may be increased as compared with the case of referring to a B picture. Also, in the affine motion compensation prediction process, the motion search range or the motion compensation range may be determined for each predetermined profile and level. According to these, since the limitation of the range in which the motion search or the motion compensation is performed can be appropriately determined, the necessary memory bandwidth for the external memory can be reduced.
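The selection of a range from predetermined sets according to the type of the reference picture can be sketched as follows. This is only an illustration: the set names and all numeric values are hypothetical placeholders, and a range is represented here as a (horizontal, vertical) extent in pixels around the target block.

```python
from typing import Dict, Tuple

Range = Tuple[int, int]  # (horizontal, vertical) extent in pixels; assumed representation

# Hypothetical predetermined sets. In each set the range used when referring
# to a P picture is larger than the range used when referring to a B picture.
RANGE_SETS: Dict[str, Dict[str, Range]] = {
    "small": {"P": (64, 32), "B": (32, 16)},
    "large": {"P": (128, 64), "B": (64, 32)},
}


def select_range(set_name: str, reference_picture_type: str) -> Range:
    """Pick the range for the set chosen for the current picture and the reference picture type."""
    return RANGE_SETS[set_name][reference_picture_type]
```

A table keyed by profile and level instead of by set name could be handled in the same way.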
[0248] Also, in the affine motion compensation prediction process, the motion search range or the motion compensation range may be determined according to the motion search processing capability, which depends on the arithmetic processing capability on the encoding side, the memory bandwidth, etc. Thereby, since the limitation of the range in which the motion search or the motion compensation is performed can be determined according to the motion search processing capability, not only can the processing amount be reduced according to that capability, but also the motion compensation using the affine motion compensation can be efficiently performed.
[0249] Also, in the affine motion compensation prediction process, in addition to limiting the range of pixels that can be referenced, the reference picture may be limited. Thereby, not only can the necessary memory bandwidth for the external memory be reduced, but also the motion compensation using the affine motion compensation can be efficiently performed.
[0250] Furthermore, in the affine motion compensation prediction process, information regarding the limitation of the motion search range and the motion compensation range may be included in the header information of the encoded bitstream, such as the VPS (Video Parameter Set), SPS (Sequence Parameter Set), and PPS (Picture Parameter Set). This allows for efficient motion compensation using affine motion compensation.
[0251] Furthermore, in the affine motion compensation prediction process, information regarding the limitation of the motion search range or motion compensation range may include not only information limiting the range of pixels that can be referenced, but also information limiting the picture to be referenced. This not only reduces the memory bandwidth required for external memory, but also enables efficient motion compensation using affine motion compensation.
[0252] Furthermore, in the affine motion compensation prediction process, the information regarding the limitation of the motion search range or motion compensation range may include only information on whether or not to limit the motion search range and motion compensation range.
[0253] Furthermore, if information regarding the limitation of the motion search range or motion compensation range is exchanged or predetermined between the sender and receiver in the higher-level system, it does not need to be included in the header information of the encoded bitstream, such as VPS, SPS, or PPS.
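The alternatives described above for conveying the limitation can be sketched as follows. This is a hypothetical data structure and not a syntax defined by this disclosure or by any standard: the field names are illustrative, and the rule that information carried in the header takes precedence over predetermined information is an assumption made only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AffineRangeLimit:
    """Hypothetical container for the limitation information."""
    enabled: bool                                   # whether the range is limited at all
    pixel_range: Optional[Tuple[int, int]] = None   # limit on pixels that can be referenced
    max_reference_pictures: Optional[int] = None    # limit on pictures to be referenced


def resolve_limit(
    header: Optional[AffineRangeLimit],
    predetermined: Optional[AffineRangeLimit],
) -> AffineRangeLimit:
    """Return the limit in effect.

    `header` stands for information parsed from VPS, SPS, or PPS, and
    `predetermined` for information agreed in the higher-level system.
    """
    if header is not None:
        return header
    if predetermined is not None:
        return predetermined
    return AffineRangeLimit(enabled=False)
```

`AffineRangeLimit(enabled=True)` corresponds to the case in which only the information on whether or not to limit the ranges is conveyed.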
[0254] Furthermore, when limiting the reference pictures, the number of reference pictures may be specified, and the reference pictures may be limited to only the specified number, in ascending order of reference index.
[0255] (Modification 2) Furthermore, when limiting the reference pictures, if temporally scalable encoding / decoding is performed, the number of reference pictures may be specified for reference pictures in layers below the layer of the current picture to be encoded / decoded, as indicated by the temporal identifier, and the reference pictures may be limited to only the specified number, in ascending order of reference index.
[0256] Furthermore, when limiting the reference pictures, the number of reference pictures may be specified, and the reference pictures may be limited to only the specified number, starting from those closest to the current picture to be encoded in terms of the Picture Order Count (POC), which indicates the output order of the pictures.
[0257] Furthermore, when limiting the reference pictures, if temporally scalable encoding / decoding is performed, the number of reference pictures may be specified for reference pictures in layers below the layer of the current picture to be encoded / decoded, as indicated by the temporal identifier, and the reference pictures may be limited to only the specified number, starting from those closest to the current picture to be encoded / decoded in terms of the Picture Order Count (POC), which indicates the output order of the pictures.
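The ways of limiting the reference pictures described above can be sketched as follows. This is a minimal illustration with hypothetical names. It assumes that "below" means a strictly smaller temporal identifier than that of the current picture, and that ties in POC distance are broken by the smaller reference index; neither point is stated in this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RefPic:
    ref_idx: int      # reference index in the reference picture list
    poc: int          # Picture Order Count (output order)
    temporal_id: int  # temporal layer identifier


def limit_by_ref_idx(refs: List[RefPic], count: int) -> List[RefPic]:
    """Keep `count` reference pictures in ascending order of reference index."""
    return sorted(refs, key=lambda r: r.ref_idx)[:count]


def limit_by_poc(refs: List[RefPic], current_poc: int, count: int) -> List[RefPic]:
    """Keep the `count` reference pictures closest to the current picture in POC."""
    return sorted(refs, key=lambda r: (abs(r.poc - current_poc), r.ref_idx))[:count]


def lower_layers(refs: List[RefPic], current_temporal_id: int) -> List[RefPic]:
    """Keep reference pictures in layers below the current picture (strictly lower is assumed)."""
    return [r for r in refs if r.temporal_id < current_temporal_id]
```

For temporally scalable coding, `lower_layers` would be applied first and one of the two limiting functions afterwards.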
[0258] [Example of an encoding device implementation] Figure 18 is a block diagram showing an implementation example of the encoding device 100 according to Embodiment 1. The encoding device 100 includes a circuit 160 and a memory 162. For example, the multiple components of the encoding device 100 shown in Figures 1 and 13 are implemented by the circuit 160 and memory 162 shown in Figure 18.
[0259] Circuit 160 is an information processing circuit and is a circuit that can access memory 162. For example, circuit 160 is a dedicated or general-purpose electronic circuit for encoding moving images. Circuit 160 may also be a processor such as a CPU. Alternatively, circuit 160 may be a collection of multiple electronic circuits. Furthermore, for example, circuit 160 may play the role of multiple components of the encoding device 100 shown in Figure 1, etc., excluding the component for storing information.
[0260] Memory 162 is a dedicated or general-purpose memory in which information for the circuit 160 to encode moving images is stored. Memory 162 may be an electronic circuit, or it may be connected to the circuit 160. Memory 162 may also be included in the circuit 160. Memory 162 may also be a collection of multiple electronic circuits. Memory 162 may also be a magnetic disk or an optical disk, or it may be described as storage or a recording medium. Memory 162 may also be a non-volatile memory or a volatile memory.
[0261] For example, memory 162 may store the video to be encoded, or it may store a bit sequence corresponding to the encoded video. Alternatively, memory 162 may store a program for circuit 160 to encode the video.
[0262] Furthermore, for example, memory 162 may play the role of an information storage component among the multiple components of the encoding device 100 shown in Figure 1, etc. Specifically, memory 162 may play the role of block memory 118 and frame memory 122 shown in Figure 1. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory 162.
[0263] Furthermore, it is not necessary for the encoding device 100 to implement all of the components shown in Figure 1, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 1, etc., may be included in other devices, and some of the processes described above may be executed by other devices. Then, in the encoding device 100, motion compensation is efficiently performed by implementing some of the components shown in Figure 1, etc., and by performing some of the processes described above.
[0264] The following shows an example of the operation of the encoding device 100 shown in Figure 18. In the following example of operation, the affine motion compensation prediction process is a process that derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks in the inter-prediction process of the target block.
[0265] Figure 19 is a flowchart showing an example of the operation of the encoding device 100 shown in Figure 18. For example, when encoding a moving image with motion compensation, the encoding device 100 shown in Figure 18 performs the operation shown in Figure 19.
[0266] Specifically, the circuit 160 of the encoding device 100 uses the memory 162 to perform motion compensation for the target block in the affine motion compensation prediction process in the inter prediction process of the target block by limiting the range in which motion search or motion compensation is performed (S311).
[0267] This enables the encoding device 100 to efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in control point motion vectors in affine motion compensation prediction. As a result, the likelihood of affine motion compensation prediction being selected in inter prediction increases, allowing for efficient motion compensation using affine motion compensation. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0268] [Example of a decoding device implementation] Figure 20 is a block diagram showing an example of the implementation of the decoding device 200 according to Embodiment 1. The decoding device 200 includes a circuit 260 and a memory 262. For example, the multiple components of the decoding device 200 shown in Figure 10 are implemented by the circuit 260 and memory 262 shown in Figure 20.
[0269] The circuit 260 is a circuit that performs information processing and is a circuit that can access the memory 262. For example, the circuit 260 is a dedicated or general-purpose electronic circuit that decodes moving images. The circuit 260 may be a processor such as a CPU. Also, the circuit 260 may be an aggregate of a plurality of electronic circuits. Also, for example, the circuit 260 may play the roles of a plurality of components of the decoding device 200 shown in Figure 10, etc., excluding the components for storing information.
[0270] The memory 262 is a dedicated or general-purpose memory in which information for the circuit 260 to decode moving images is stored. The memory 262 may be an electronic circuit and may be connected to the circuit 260. Also, the memory 262 may be included in the circuit 260. Also, the memory 262 may be an aggregate of a plurality of electronic circuits. Also, the memory 262 may be a magnetic disk or an optical disk etc., or may be expressed as a storage or a recording medium etc. Also, the memory 262 may be a non-volatile memory or a volatile memory.
[0271] For example, the memory 262 may store a bit string corresponding to the encoded moving image, or may store a moving image corresponding to the decoded bit string. Also, the memory 262 may store a program for the circuit 260 to decode moving images.
[0272] Also, for example, the memory 262 may play the role of a component for storing information among a plurality of components of the decoding device 200 shown in Figure 10, etc. Specifically, the memory 262 may play the roles of the block memory 210 and the frame memory 214 shown in Figure 10. More specifically, the memory 262 may store the reconstructed blocks, the reconstructed pictures, etc.
[0273] Furthermore, it is not necessary for the decoding device 200 to implement all of the components shown in Figure 10, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 10, etc., may be included in other devices, and some of the processes described above may be performed by other devices. Then, motion compensation is efficiently performed in the decoding device 200 by implementing some of the components shown in Figure 10, etc., and performing some of the processes described above.
[0274] The following shows an example of the operation of the decoding device 200 shown in Figure 20. In the following example of operation, the affine motion compensation prediction process is a process that derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks in the inter-prediction process of the target block.
[0275] Figure 21 is a flowchart showing an example of the operation of the decoding device 200 shown in Figure 20. For example, when decoding a moving image with motion compensation, the decoding device 200 shown in Figure 20 performs the operation shown in Figure 21.
[0276] Specifically, the circuit 260 of the decoding device 200 uses the memory 262 to limit the range in which motion search or motion compensation is performed in the affine motion compensation prediction process in the inter prediction process of the target block, and performs motion compensation for the target block (S411). Then, it decodes the encoded stream (S412).
[0277] This enables the decoding device 200 to efficiently perform motion compensation using affine motion compensation. More specifically, by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process, it becomes possible to suppress the variation in the control point motion vectors in affine motion compensation prediction. As a result, the likelihood of affine motion compensation prediction being selected in inter prediction increases, allowing for efficient motion compensation using affine motion compensation. Furthermore, it becomes possible to limit the region of the reference image to be acquired, potentially reducing the memory bandwidth required for the external memory, which is the frame memory.
[0278] [Supplement] Furthermore, the encoding device 100 and decoding device 200 in this embodiment may be used as an image encoding device and an image decoding device, respectively, or as a video encoding device and a video decoding device. Alternatively, the encoding device 100 and decoding device 200 may each be used as an inter prediction device (inter-screen prediction device).
[0279] In other words, the encoding device 100 and the decoding device 200 may correspond only to the inter prediction unit (inter-screen prediction unit) 126 and the inter prediction unit (inter-screen prediction unit) 218, respectively. Other components such as the transform unit 106 and the inverse transform unit 206 may be included in other devices.
[0280] Furthermore, in this embodiment, each component may be implemented by being composed of dedicated hardware or by executing a software program suitable for each component. Each component may also be implemented by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory.
[0281] Specifically, each of the encoding device 100 and the decoding device 200 may include processing circuitry and a storage device electrically connected to and accessible from the processing circuitry. For example, the processing circuitry corresponds to circuit 160 or 260, and the storage device corresponds to memory 162 or 262.
[0282] The processing circuitry includes at least one of dedicated hardware and a program execution unit, and performs processing using the storage device. Furthermore, if the processing circuitry includes a program execution unit, the storage device stores the software program executed by that program execution unit.
[0283] Here, the software that implements the encoding device 100 or decoding device 200, etc. in this embodiment is the following program.
[0284] In other words, this program may cause the computer to execute an encoding method that performs motion compensation on the target block by limiting the range in which motion search or motion compensation is performed during the affine motion compensation prediction process in the inter prediction process of the target block.
[0285] Alternatively, the program may cause the computer to execute a decoding method that decodes the encoded stream by limiting the range in which motion search or motion compensation is performed in the affine motion compensation prediction process in the inter prediction process of the target block, and performing motion compensation for the target block.
[0286] Furthermore, each component may be a circuit, as described above. These circuits may form a single circuit as a whole, or they may be separate circuits. Also, each component may be implemented using a general-purpose processor, or it may be implemented using a dedicated processor.
[0287] Furthermore, a process performed by one component may be performed by another component. Also, the order in which processes are executed may be changed, and multiple processes may be executed in parallel. Additionally, the encoding / decoding device may comprise an encoding device 100 and a decoding device 200.
[0288] The first and second ordinal numbers used in the explanation may be changed as appropriate. Furthermore, ordinal numbers may be newly assigned to or removed from the constituent elements.
[0289] Although aspects of the encoding device 100 and the decoding device 200 have been described above based on the embodiments, the aspects of the encoding device 100 and the decoding device 200 are not limited to these embodiments. Configurations obtained by applying to these embodiments various modifications that a person skilled in the art could conceive of, and configurations constructed by combining components from different embodiments, may also be included within the scope of the aspects of the encoding device 100 and the decoding device 200, as long as they do not depart from the spirit of this disclosure.
[0290] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.
[0291] (Embodiment 2) In each of the above embodiments, each functional block can typically be implemented by an MPU and memory, etc. Furthermore, the processing performed by each functional block is typically implemented by a program execution unit such as a processor reading and executing software (program) recorded on a recording medium such as ROM. This software may be distributed by download, etc., or it may be recorded on a recording medium such as semiconductor memory and distributed. Of course, it is also possible to implement each functional block by hardware (dedicated circuitry).
[0292] Furthermore, the processing described in each embodiment may be implemented by centralized processing using a single device (system), or by distributed processing using multiple devices. Also, the processor executing the above program may be one or multiple. In other words, centralized processing may be performed, or distributed processing may be performed.
[0293] The embodiments of this disclosure are not limited to those described above, and various modifications are possible, which are also included within the scope of the embodiments of this disclosure.
[0294] Furthermore, here we will describe application examples of the video encoding method (image encoding method) or video decoding method (image decoding method) shown in each of the above embodiments, and a system using the same. The system is characterized by having an image encoding device using the image encoding method, an image decoding device using the image decoding method, and an image encoding and decoding device that includes both. Other configurations in the system can be appropriately modified as needed.
[0295] [Usage example] Figure 22 shows the overall configuration of the content supply system ex100 that realizes the content distribution service. The service area for the communication service is divided into cells of a desired size, and fixed radio stations, base stations ex106, ex107, ex108, ex109, and ex110, are installed in each cell.
[0296] In this content supply system ex100, various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 are connected to the internet ex101 via an internet service provider ex102 or a communication network ex104, and base stations ex106 to ex110. The content supply system ex100 may also connect any combination of the above elements. The devices may be directly or indirectly connected to each other via a telephone network or short-range radio, etc., without going through the base stations ex106 to ex110, which are fixed radio stations. In addition, the streaming server ex103 is connected to various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 via the internet ex101, etc. Furthermore, the streaming server ex103 is connected to terminals in a hotspot on an airplane ex117 via satellite ex116.
[0297] Note that instead of base stations ex106 to ex110, wireless access points or hotspots may be used. Also, streaming server ex103 may be connected directly to the communication network ex104 without going through the internet ex101 or internet service provider ex102, or it may be connected directly to the airplane ex117 without going through satellite ex116.
[0298] Camera ex113 is a device capable of taking still images and videos, such as a digital camera. Smartphone ex115 is a smartphone, mobile phone, or PHS (Personal Handyphone System) that supports mobile communication systems generally known as 2G, 3G, 3.9G, 4G, and the upcoming 5G.
[0299] Home appliance ex118 refers to appliances such as refrigerators or equipment included in household fuel cell cogeneration systems.
[0300] In the content supply system ex100, live streaming becomes possible when a terminal with a shooting function is connected to the streaming server ex103 via a base station ex106 or the like. In live streaming, the terminal (computer ex111, game console ex112, camera ex113, home appliance ex114, smartphone ex115, and terminal inside an airplane ex117, etc.) performs the encoding process described in each of the above embodiments on still images or video content captured by the user using the terminal, multiplexes the video data obtained by encoding with sound data encoded from the sound corresponding to the video, and transmits the obtained data to the streaming server ex103. In other words, each terminal functions as an image encoding device according to one aspect of this disclosure.
[0301] Meanwhile, the streaming server ex103 streams the content data sent to the requesting client. The client is a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, a smartphone ex115, or a terminal on an airplane ex117, etc., that is capable of decoding the encoded data. Each device that receives the distributed data decodes and plays back the received data. That is, each device functions as an image decoding device according to one aspect of this disclosure.
[0302] [Distributed Processing] Furthermore, the streaming server ex103 may consist of multiple servers or computers that distribute data processing, recording, and distribution. For example, the streaming server ex103 may be implemented using a CDN (Content Delivery Network), where content delivery is achieved through a network connecting numerous edge servers distributed worldwide. In a CDN, the physically closest edge server is dynamically assigned depending on the client. Latency can be reduced by caching and delivering content to the edge server. In addition, if an error occurs or the communication state changes due to an increase in traffic, processing can be distributed among multiple edge servers, the delivery entity can be switched to another edge server, or delivery can be continued by bypassing the failed part of the network, thus enabling high-speed and stable delivery.
[0303] Furthermore, beyond the distributed processing of the distribution itself, the encoding process of the captured data can be performed on each terminal, on the server side, or shared among them. For example, encoding generally involves two processing loops. In the first loop, the complexity or code amount of the image at the frame or scene level is detected. In the second loop, processing is performed to improve encoding efficiency while maintaining image quality. For example, if the terminal performs the first encoding process and the server that receives the content performs the second encoding process, it is possible to improve the quality and efficiency of the content while reducing the processing load on each terminal. In this case, if there is a request to receive and decode the content in near real time, the first encoded data from the terminal can be received and played back on other terminals, enabling more flexible real-time distribution.
[0304] Another example is the camera ex113, which extracts features from an image, compresses the feature data as metadata, and sends it to the server. The server performs compression according to the meaning of the image, for example, by determining the importance of an object from the features and switching the quantization precision. Feature data is particularly effective in improving the accuracy and efficiency of motion vector prediction during further compression on the server. Alternatively, a simple encoding such as VLC (Variable Length Coding) may be performed on the terminal, and a more computationally intensive encoding such as CABAC (Context-Adaptive Binary Arithmetic Coding) may be performed on the server.
[0305] Another example is a scenario in a stadium, shopping mall, or factory where multiple video data sets of nearly identical scenes may exist, captured by multiple terminals. In such cases, the encoding process is distributed among the multiple terminals that captured the footage, along with other terminals and servers as needed, by assigning encoding tasks to each unit, for example, at the Group of Pictures (GOP) level, picture level, or tile level (a division of a picture). This reduces latency and enables more real-time performance.
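The assignment of encoding tasks in units such as GOPs can be sketched as follows. This is an illustration only: the round-robin policy, the function name, and the worker names are assumptions, and this disclosure does not specify how units are assigned. Worker names are assumed to be unique.

```python
from typing import Dict, List


def assign_units(num_units: int, workers: List[str]) -> Dict[str, List[int]]:
    """Assign unit indices (for example GOP numbers) to workers in round-robin order."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[int]] = {w: [] for w in workers}
    for unit in range(num_units):
        plan[workers[unit % len(workers)]].append(unit)
    return plan
```

The same function could be applied at the picture level or the tile level by changing what a unit index denotes.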
[0306] Furthermore, since multiple video data sets depict essentially the same scene, the server may manage and / or instruct the video data captured by each terminal to reference each other. Alternatively, the server may receive the encoded data from each terminal, change the reference relationships between the multiple data sets, or correct or replace the pictures themselves and re-encode them. This allows for the creation of a stream with improved quality and efficiency for each individual data set.
[0307] Furthermore, the server may transcode the video data to change its encoding method before distributing it. For example, the server may convert an MPEG-based encoding to a VP-based encoding, or convert H.264 to H.265.
[0308] Thus, the encoding process can be performed by a terminal or one or more servers. Therefore, in the following, the terms "server" or "terminal" will be used to refer to the entity performing the processing, but some or all of the processing performed by the server may be performed by the terminal, and some or all of the processing performed by the terminal may be performed by the server. The same applies to the decoding process.
[0309] [3D, Multi-angle] In recent years, it has become increasingly common to integrate and utilize images or videos of different scenes, or the same scene, captured from different angles, using multiple cameras ex113 and / or smartphones ex115, which are nearly synchronized with each other. The videos captured by each device are integrated based on the relative positional relationship between the devices, or on areas where feature points contained in the videos coincide, which are acquired separately.
[0310] The server may not only encode 2D video but also encode still images automatically based on scene analysis of the video, or at a time specified by the user, and send them to the receiving terminal. Furthermore, if the server can obtain the relative positional relationship between the shooting terminals, it can generate a 3D shape of the scene based not only on 2D video but also on video of the same scene taken from different angles. The server may also separately encode 3D data generated by a point cloud, or it may select or reconstruct video to send to the receiving terminal from video taken by multiple terminals based on the results of recognizing or tracking a person or object using the 3D data.
[0311] In this way, users can enjoy scenes by arbitrarily selecting each video corresponding to each shooting terminal, or they can enjoy content in which video from an arbitrary viewpoint is extracted from 3D data reconstructed using multiple images or videos. Furthermore, just like the video, sound can also be collected from multiple different angles, and the server may multiplex and transmit sound from a specific angle or space in conjunction with the video.
[0312] In recent years, content that links the real world with a virtual world, such as Virtual Reality (VR) and Augmented Reality (AR), has also become popular. In the case of VR images, the server may create separate viewpoint images for the right and left eyes and perform encoding that allows referencing between the viewpoint images using Multi-View Coding (MVC), or it may encode them as separate streams without referencing each other. When decoding the separate streams, it is advisable to synchronize playback so that the virtual 3D space is reproduced according to the user's viewpoint.
[0313] In the case of AR images, the server superimposes virtual object information from the virtual space onto camera information from the real space, based on its three-dimensional position or the user's viewpoint movement. The decoding device may acquire or store the virtual object information and three-dimensional data, generate a two-dimensional image according to the user's viewpoint movement, and create superimposed data by smoothly stitching them together. Alternatively, the decoding device may send the user's viewpoint movement to the server in addition to requesting virtual object information, and the server may create superimposed data from the three-dimensional data held by the server according to the received viewpoint movement, encode the superimposed data, and distribute it to the decoding device. The superimposed data may have an α value indicating transparency in addition to RGB, and the server may set the α value of parts other than the object created from the three-dimensional data to 0, etc., so that those parts are transparent, and encode the data. Alternatively, the server may set a predetermined RGB value to the background, like chroma keying, and generate data in which parts other than the object are the background color.
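As a non-limiting illustration of the α-based superimposition described above, the following Python sketch blends superimposed RGBA pixels onto camera pixels. The function name `composite`, the 8-bit value range, and the flat pixel lists are assumptions made for this example and are not specified in this disclosure.

```python
def composite(overlay_rgba, camera_rgb):
    """Blend overlay pixels onto camera pixels using the overlay's alpha (0-255)."""
    out = []
    for (r, g, b, a), (cr, cg, cb) in zip(overlay_rgba, camera_rgb):
        # alpha 0 leaves the camera pixel untouched (transparent background);
        # alpha 255 shows only the virtual object.
        out.append((
            (r * a + cr * (255 - a)) // 255,
            (g * a + cg * (255 - a)) // 255,
            (b * a + cb * (255 - a)) // 255,
        ))
    return out


# Hypothetical data: one opaque object pixel and one transparent background pixel.
overlay = [(200, 0, 0, 255), (0, 0, 0, 0)]
camera = [(10, 20, 30), (40, 50, 60)]
print(composite(overlay, camera))
```

In this sketch, setting α to 0 outside the object is what lets the real-space camera pixel show through unchanged.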
[0314] Similarly, the decoding of the distributed data may be performed on each client terminal, on the server side, or shared between them. For example, one terminal may send a reception request to the server, and another terminal may receive the content corresponding to that request, decode it, and transmit the decoded signal to a device with a display. By distributing the processing and selecting appropriate content regardless of the performance of the communication-capable terminals themselves, it is possible to play back data with good image quality. As another example, while large image data is received on a TV or similar device, a partial region of the picture, such as a tile, may be decoded and displayed on a viewer's personal terminal. This allows viewers to share the overall picture while checking their own area of responsibility, or an area they want to examine in more detail, on their own device.
[0315] In the future, it is expected that content will be seamlessly received by switching appropriate data for the connected communication, using distribution system standards such as MPEG-DASH, in situations where multiple short-range, medium-range, or long-range wireless communications are available both indoors and outdoors. This will allow users to freely select and switch in real time between decoding devices or display devices, such as displays installed indoors or outdoors, as well as their own terminals. Furthermore, decoding can be performed while switching between the decoding terminal and the display terminal based on the user's location information. This will make it possible to display map information on the wall or part of the ground of an adjacent building with a displayable device embedded, while traveling to a destination. It will also be possible to switch the bitrate of the received data based on the ease of access to the encoded data on the network, such as when the encoded data is cached on a server that can be accessed quickly from the receiving terminal, or copied to an edge server in the content delivery service.
[0316] [Scalable encoding] Regarding content switching, we will explain using a scalable stream compressed and encoded using the video encoding method described in each of the embodiments above, as shown in Figure 23. The server may have multiple streams with the same content but different qualities as individual streams, but it may also be configured to switch content by taking advantage of the temporal / spatial scalability of the stream realized by encoding it in layers, as shown in the figure. In other words, the decoding side can freely switch between decoding low-resolution and high-resolution content by deciding which layer to decode according to internal factors such as performance and external factors such as the state of the communication bandwidth. For example, if you want to watch the rest of a video that you were watching on your smartphone ex115 while traveling, on a device such as an internet TV when you get home, that device only needs to decode the same stream to a different layer, thus reducing the burden on the server.
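As a non-limiting illustration of how the decoding side might decide which layer to decode, the following Python sketch picks the highest layer permitted by an internal factor (display height) and an external factor (available bandwidth). The function name `select_layer`, the layer tuples, and the numeric values are hypothetical and do not appear in this disclosure.

```python
def select_layer(layers, max_height, bandwidth_kbps):
    """Return the index of the highest layer the terminal can use.

    layers: list of (height_px, cumulative_kbps), base layer first.
    """
    chosen = 0  # assumption: the base layer is always decoded in this sketch
    for i, (height, kbps) in enumerate(layers):
        if height <= max_height and kbps <= bandwidth_kbps:
            chosen = i
        else:
            break  # higher layers depend on lower ones, so stop at the first miss
    return chosen


# Hypothetical three-layer stream shared by every terminal.
layers = [(360, 500), (720, 1500), (1080, 4000)]
print(select_layer(layers, max_height=720, bandwidth_kbps=2000))    # smartphone
print(select_layer(layers, max_height=2160, bandwidth_kbps=10000))  # internet TV
```

Because both terminals read the same stream and differ only in the returned layer index, the server does not need to prepare a separate stream per terminal.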
[0317] Furthermore, in addition to the configuration described above, in which pictures are encoded for each layer and an enhancement layer exists above the base layer to achieve scalability, the enhancement layer may include metadata based on statistical information of the image, and the decoding side may generate high-quality content by applying super-resolution to the base layer picture based on the metadata. Super-resolution may refer to either an improvement in the signal-to-noise ratio at the same resolution or an increase in resolution. The metadata may include information for identifying linear or nonlinear filter coefficients used in the super-resolution process, or information for identifying parameter values in the filtering process, machine learning, or least-squares operation used in the super-resolution process.
[0318] Alternatively, the picture may be divided into tiles or similar structures according to the meaning of objects within the image, and the decoding side may select tiles to decode, thereby decoding only a portion of the area. Furthermore, by storing object attributes (people, cars, balls, etc.) and their positions within the image (coordinate positions within the same image, etc.) as metadata, the decoding side can identify the location of a desired object based on the metadata and determine the tile containing that object. For example, as shown in Figure 24, the metadata is stored using a data storage structure different from pixel data, such as the SEI message in HEVC. This metadata indicates, for example, the position, size, or color of the main object.
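As a non-limiting illustration of determining the tiles that contain a desired object from position metadata, the following Python sketch maps an object's bounding box onto a tile grid. The uniform grid, the function name `tiles_for_object`, and the `(x, y, width, height)` metadata layout are assumptions for this example; actual tile partitioning need not be uniform.

```python
def tiles_for_object(bbox, pic_w, pic_h, cols, rows):
    """Return (col, row) indices of the tiles overlapped by bbox = (x, y, w, h).

    Assumes a uniform grid whose tile size divides the picture size exactly.
    """
    x, y, w, h = bbox
    tile_w = pic_w // cols
    tile_h = pic_h // rows
    first_col, last_col = x // tile_w, min(cols - 1, (x + w - 1) // tile_w)
    first_row, last_row = y // tile_h, min(rows - 1, (y + h - 1) // tile_h)
    return [(c, r)
            for r in range(first_row, last_row + 1)
            for c in range(first_col, last_col + 1)]


# Hypothetical metadata: a 50x50 object at (1000, 100) in a 1920x1080 picture
# divided into 4 columns and 2 rows of tiles.
print(tiles_for_object((1000, 100, 50, 50), 1920, 1080, cols=4, rows=2))
```

The decoding side would then decode only the returned tiles rather than the whole picture.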
[0319] Furthermore, metadata may be stored in units consisting of multiple pictures, such as streams, sequences, or random access units. This allows the decoding side to obtain information such as the time at which a specific person appears in the video, and by combining this with the picture-level information, to identify the picture in which the object exists and the object's position within that picture.
[0320] [Web page optimization] Figure 25 shows an example of a web page display screen on a computer ex111, etc. Figure 26 shows an example of a web page display screen on a smartphone ex115, etc. As shown in Figures 25 and 26, a web page may contain multiple linked images, which are links to image content, and their appearance will differ depending on the viewing device. When multiple linked images are visible on the screen, the display device (decoder) will display still images or I-pictures from each content as linked images, display video such as a GIF animation using multiple still images or I-pictures, or receive only the base layer and decode and display the video, until the user explicitly selects a linked image, or until the linked image approaches the center of the screen or the entire linked image is within the screen.
[0321] When a linked image is selected by the user, the display device prioritizes decoding the base layer. If the HTML of the web page contains information indicating that the content is scalable, the display device may decode up to the enhancement layer. Furthermore, to ensure real-time performance, before selection or when bandwidth is very limited, the display device can decode and display only forward-referenced pictures (I-pictures, P-pictures, and B-pictures that only use forward references), thereby reducing the delay between the decoding time and display time of the first picture (the delay from the start of content decoding to the start of display). Alternatively, the display device may deliberately ignore the reference relationships between pictures and roughly decode all B-pictures and P-pictures using forward references, then perform normal decoding as time passes and more pictures are received.
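As a non-limiting illustration of decoding only forward-referenced pictures to reduce the delay before display, the following Python sketch filters a picture list given in decoding order. The `Picture` class, its field names, and the sample stream are hypothetical; "forward reference" is taken here to mean that every referenced picture precedes the current picture in display order.

```python
from dataclasses import dataclass


@dataclass
class Picture:
    poc: int          # display order of this picture
    kind: str         # "I", "P" or "B"
    refs: tuple = ()  # display-order numbers of the pictures it references


def low_delay_subset(pictures):
    """Keep pictures (given in decoding order) that use forward references only."""
    kept = set()
    out = []
    for pic in pictures:
        forward_only = all(r < pic.poc for r in pic.refs)
        decodable = all(r in kept for r in pic.refs)  # references must be kept too
        if forward_only and decodable:
            kept.add(pic.poc)
            out.append(pic)
    return out


# Hypothetical stream: the B-picture at display position 2 also references
# picture 4, which is later in display order, so it is skipped.
stream = [
    Picture(0, "I"),
    Picture(4, "P", (0,)),
    Picture(2, "B", (0, 4)),
    Picture(1, "B", (0,)),
]
print([p.poc for p in low_delay_subset(stream)])
```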
[0322] [Autonomous driving] Furthermore, when transmitting and receiving still images or video data such as 2D or 3D map information for autonomous driving or driving assistance of a vehicle, the receiving terminal may receive metadata such as weather or construction information in addition to image data belonging to one or more layers, and decode these in association with each other. The metadata may belong to a layer, or it may simply be multiplexed with the image data.
[0323] In this case, since the vehicle, drone, or airplane containing the receiving terminal is in motion, the receiving terminal can transmit its location information when a reception request is made, enabling seamless reception and decoding while switching between base stations ex106 to ex110. Furthermore, the receiving terminal can dynamically switch how much metadata is received or how much map information is updated, depending on the user's selection, the user's situation, or the state of the communication bandwidth.
[0324] As described above, the content supply system ex100 allows the client to receive, decode, and play back encoded information transmitted by the user in real time.
[0325] [Distribution of personal content] Furthermore, the content supply system ex100 allows unicast or multicast distribution of not only high-definition, long-duration content from video distribution companies but also low-definition, short-duration content from individuals. Such personal content is expected to continue to increase. To improve the quality of personal content, the server may perform editing before encoding. This can be achieved, for example, with the following configuration.
[0326] During shooting, or after shooting, the server performs recognition processing such as detecting shooting errors, searching for scenes, analyzing semantics, and detecting objects from the original images or encoded data in real time. Based on the recognition results, the server manually or automatically edits the images, correcting out-of-focus or shaky images, deleting less important scenes such as those with lower brightness or out of focus compared to other pictures, emphasizing object edges, and changing color tones. The server then encodes the edited data based on the editing results. It is also known that viewership decreases if the shooting time is too long, so the server may automatically clip scenes with little movement, as well as less important scenes, based on the image processing results, to ensure that the content falls within a specific time range according to the shooting time. Alternatively, the server may generate and encode a digest based on the results of the semantic analysis of the scenes.
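As a non-limiting illustration of automatically clipping scenes so that the content falls within a specific time range, the following Python sketch removes the scenes with the least movement first. The function name `clip_to_duration`, the per-scene motion score, and the sample values are assumptions for this example; how such a score is derived from the image processing results is not specified here.

```python
def clip_to_duration(scenes, max_seconds):
    """Drop the lowest-motion scenes until the total length fits max_seconds.

    scenes: list of (start_s, length_s, motion_score) in playback order.
    """
    kept = list(scenes)
    while kept and sum(s[1] for s in kept) > max_seconds:
        # Remove the scene with the least movement; playback order is preserved.
        kept.remove(min(kept, key=lambda s: s[2]))
    return kept


# Hypothetical analysis result: four scenes totalling 150 seconds.
scenes = [(0, 30, 0.9), (30, 60, 0.1), (90, 20, 0.5), (110, 40, 0.2)]
print(clip_to_duration(scenes, max_seconds=60))
```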
[0327] Furthermore, personal content may contain elements that infringe on copyright, moral rights, or portrait rights, and the scope of sharing may exceed the intended scope, which can be inconvenient for the individual. Therefore, for example, the server may intentionally change the image to one that is out of focus, such as the faces of people at the edges of the screen or the interior of a house, before encoding. The server may also recognize whether the face of a person other than those previously registered is visible in the image to be encoded, and if so, it may apply a mosaic effect to the face. Alternatively, as a pre- or post-processing step before encoding, the user can specify a person or background area that they want to process from a copyright perspective, and the server can replace the specified area with a different image or blur the focus. In the case of a person, the server can track the person in a video and replace the image of their face.
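As a non-limiting illustration of applying a mosaic effect to a specified region before encoding, the following Python sketch pixelates a rectangle in a grayscale image by averaging each block. The function name `mosaic`, the list-of-rows image layout, and the block size are assumptions for this example; detecting or tracking the face region is assumed to be done elsewhere.

```python
def mosaic(image, x0, y0, w, h, block=2):
    """Return a copy of a grayscale image (list of rows) with a region pixelated."""
    out = [row[:] for row in image]
    for by in range(y0, y0 + h, block):
        for bx in range(x0, x0 + w, block):
            ys = range(by, min(by + block, y0 + h))
            xs = range(bx, min(bx + block, x0 + w))
            # Replace every pixel in the block with the block's average value.
            avg = sum(image[y][x] for y in ys for x in xs) // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = avg
    return out


# Hypothetical 4x2 image whose left 2x2 region is treated as a face.
image = [[0, 10, 100, 100],
         [20, 30, 100, 100]]
print(mosaic(image, x0=0, y0=0, w=2, h=2))
```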
[0328] Furthermore, because personal content tends to be small in data size and viewing it demands real-time performance, the decoder, depending on the bandwidth, first receives the base layer with priority and decodes and plays it. During this time, the decoder may receive the enhancement layer, and if the content is played two or more times, such as when playback is looped, it may play the high-quality video including the enhancement layer. With a stream that is scalably encoded in this way, it is possible to provide an experience in which the video is rough when unselected or at the start of viewing but the image quality gradually improves. In addition to scalable encoding, a similar experience can be provided when a rough stream played the first time and a second stream encoded with reference to the first video are configured as a single stream.
[0329] [Other usage examples] Furthermore, these encoding or decoding processes are generally performed by the LSI ex500 present in each terminal. The LSI ex500 may be a single chip or a multi-chip configuration. Alternatively, video encoding or decoding software may be stored on a recording medium (such as a CD-ROM, flexible disk, or hard disk) that can be read by the computer ex111 or the like, and the encoding or decoding process may be performed using that software. In addition, if the smartphone ex115 has a camera, video data acquired by that camera may be transmitted. In this case, the video data is data encoded by the LSI ex500 present in the smartphone ex115.
[0330] The LSI ex500 may also be configured to download and activate application software. In this case, the terminal first determines whether it supports the content encoding method or whether it has the capability to perform the specific service. If the terminal does not support the content encoding method or does not have the capability to perform the specific service, the terminal downloads the codec or application software, and then acquires and plays the content.
[0331] Furthermore, at least one of the video encoding device (image encoding device) or the video decoding device (image decoding device) of each of the above embodiments can be incorporated not only into the content supply system ex100 via the Internet ex101 but also into a digital broadcasting system. Because a digital broadcasting system transmits and receives multiplexed data, in which video and sound are multiplexed, on broadcast radio waves using a satellite or the like, it differs in being suited to multicast, whereas the content supply system ex100 has a configuration that makes unicast easy. The encoding and decoding processes, however, can be applied in the same way.
[0332] [Hardware configuration] Figure 27 shows the smartphone ex115. Figure 28 shows an example of the configuration of the smartphone ex115. The smartphone ex115 includes an antenna ex450 for transmitting and receiving radio waves with the base station ex110, a camera unit ex465 capable of taking video and still images, and a display unit ex458 that displays video captured by the camera unit ex465 and data decoded from video received by the antenna ex450. The smartphone ex115 further includes an operation unit ex466, such as a touch panel, an audio output unit ex457, such as a speaker for outputting voice or sound, an audio input unit ex456, such as a microphone for inputting voice, a memory unit ex467 capable of storing captured video or still images, recorded audio, received video or still images, encoded data such as emails, or decoded data, and a slot unit ex464, which is an interface unit with SIM ex468 for identifying the user and authenticating access to various data, including the network. External memory may be used instead of the memory unit ex467.
[0333] Furthermore, the main control unit ex460, which comprehensively controls the display unit ex458, the operation unit ex466, and the like, is connected via the bus ex470 to the power supply circuit unit ex461, the operation input control unit ex462, the video signal processing unit ex455, the camera interface unit ex463, the display control unit ex459, the modulation / demodulation unit ex452, the multiplexing / demultiplexing unit ex453, the audio signal processing unit ex454, the slot unit ex464, and the memory unit ex467.
[0334] The power supply circuit unit ex461, when the power key is turned on by the user, supplies power from the battery pack to each component, thereby starting up the smartphone ex115 and making it operational.
[0335] The smartphone ex115 performs processing such as calls and data communication under the control of the main control unit ex460, which includes a CPU, ROM, RAM, etc. During a call, the audio signal picked up by the audio input unit ex456 is converted into a digital audio signal by the audio signal processing unit ex454, subjected to spread spectrum processing by the modulation / demodulation unit ex452, subjected to digital-to-analog conversion and frequency conversion by the transmission / reception unit ex451, and then transmitted via the antenna ex450. Conversely, received data is amplified, subjected to frequency conversion and analog-to-digital conversion, subjected to inverse spread spectrum processing by the modulation / demodulation unit ex452, converted into an analog audio signal by the audio signal processing unit ex454, and then output from the audio output unit ex457. In data communication mode, text, still image, or video data is sent to the main control unit ex460 via the operation input control unit ex462 in response to operation of the operation unit ex466 of the main unit, and transmission and reception processing is performed in the same manner. When transmitting video, still images, or video and audio in data communication mode, the video signal processing unit ex455 compresses and encodes the video signal stored in the memory unit ex467 or the video signal input from the camera unit ex465 using the video encoding method shown in each of the above embodiments, and sends the encoded video data to the multiplexing / demultiplexing unit ex453. The audio signal processing unit ex454 encodes the audio signal picked up by the audio input unit ex456 while the camera unit ex465 is capturing video or still images, and sends the encoded audio data to the multiplexing / demultiplexing unit ex453. The multiplexing / demultiplexing unit ex453 multiplexes the encoded video data and encoded audio data in a predetermined manner, the modulation / demodulation unit (modulation / demodulation circuit unit) ex452 and the transmission / reception unit ex451 perform modulation and conversion processing, and the data is transmitted via the antenna ex450.
[0336] When receiving video attached to an email or chat, or video linked to a web page, etc., the multiplexing / demultiplexing unit ex453 demultiplexes the multiplexed data received via the antenna ex450 in order to decode it, dividing it into a video data bitstream and an audio data bitstream. It then supplies the encoded video data to the video signal processing unit ex455 and the encoded audio data to the audio signal processing unit ex454 via the synchronization bus ex470. The video signal processing unit ex455 decodes the video signal using a video decoding method corresponding to the video encoding method shown in each of the above embodiments, and the video or still image contained in the linked video file is displayed on the display unit ex458 via the display control unit ex459. The audio signal processing unit ex454 decodes the audio signal, and audio is output from the audio output unit ex457. However, since real-time streaming is widespread, there may be situations where audio playback is socially inappropriate depending on the user's circumstances. Therefore, as an initial setting, it is preferable to have a configuration that plays only video data and not audio signals. Audio may be synchronized and played only when the user performs an action, such as clicking on video data.
[0337] Furthermore, although the smartphone ex115 was used as an example here, there are three possible implementation formats for terminals: a transceiver-type terminal that has both an encoder and a decoder, a transmitting terminal that has only an encoder, and a receiving terminal that has only a decoder. In addition, although it was explained that multiplexed data, in which audio data etc. is multiplexed with video data, is received or transmitted in a digital broadcasting system, the multiplexed data may also include text data related to the video in addition to audio data, or the video data itself may be received or transmitted instead of multiplexed data.
[0338] Although it was explained that the main control unit ex460, including the CPU, controls the encoding or decoding process, terminals often also have a GPU. Therefore, a configuration that leverages the GPU's performance to process a wide area at once using memory shared by the CPU and GPU, or memory whose addresses are managed so that it can be used in common, is also possible. This can shorten the encoding time, ensure real-time performance, and achieve low latency. In particular, it is efficient to perform motion detection, deblocking filters, SAO (Sample Adaptive Offset), and transformation / quantization processes at once on the GPU, rather than on the CPU, in units such as pictures. [Industrial applicability]
[0339] This disclosure can be used, for example, in television receivers, digital video recorders, car navigation systems, mobile phones, digital cameras, digital video cameras, video conferencing systems, or electronic mirrors. [Explanation of Symbols]
[0340]
100 Encoding device
102 Division unit
104 Subtraction unit
106 Transform unit
108 Quantization unit
110 Entropy encoding unit
112, 204 Inverse quantization unit
114, 206 Inverse transform unit
116, 208 Addition unit
118, 210 Block memory
120, 212 Loop filter unit
122, 214 Frame memory
124, 216 Intra prediction unit (intra-screen prediction unit)
126, 218 Inter prediction unit (inter-screen prediction unit)
128, 220 Prediction control unit
160, 260 Circuit
162, 262 Memory
200 Decoding device
202 Entropy decoding unit
1261 Range determination unit
1262 Control point MV derivation unit
1263 Affine MV calculation unit
1264 Motion compensation unit
1265 Motion detection unit
Claims
1. An encoding device comprising: a circuit; and a memory, wherein the circuit, using the memory, performs motion compensation on a target block in an affine motion compensation prediction process in an inter prediction process of the target block by limiting the range in which motion search or motion compensation is performed, in the affine motion compensation prediction process, the range in which the motion search or motion compensation is performed is limited such that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.
2. A decoding device comprising: a circuit; and a memory, wherein the circuit, using the memory, decodes an encoded stream by performing motion compensation on a target block in an affine motion compensation prediction process in an inter prediction process of the target block by limiting the range in which motion search or motion compensation is performed, in the affine motion compensation prediction process, the range in which the motion search or motion compensation is performed is limited such that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.
3. A generating device that generates a bitstream, wherein the bitstream includes information for causing a decoding device to perform an affine motion compensation prediction process in an inter prediction process of a target block, in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited such that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.
4. A transmitting device that transmits a bitstream, wherein the bitstream includes information for causing a decoding device to perform an affine motion compensation prediction process in an inter prediction process of a target block, in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited such that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.
5. A computer-readable non-transitory storage medium that stores a bitstream, wherein the bitstream includes information for causing a decoding device to perform an affine motion compensation prediction process in an inter prediction process of a target block, in the affine motion compensation prediction process, the range in which motion search or motion compensation is performed is limited such that the variation between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block falls within a predetermined range, and the variation is a value based on the difference between the motion vector of the control point at the upper left corner and the motion vector of the control point at the upper right corner of the target block.