Encoding device, decoding device, encoding method, decoding method, and transmission method

A dual-mode motion compensation process in encoding and decoding devices addresses inefficiencies in H.265 by using spatial gradient and adjacent block vectors, reducing processing load and improving coding efficiency.

JP7876038B2 · Active · Publication Date: 2026-06-18 · PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA

Cites: 3 · Cited by: 0

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Patents
Current Assignee / Owner: PANASONIC INTELLECTUAL PROPERTY CORP OF AMERICA
Filing Date: 2025-06-17
Publication Date: 2026-06-18

Application Information

IPC: H04N19/54; H04N19/52; H04N19/577

CPC: H04N19/109; H04N19/157; H04N19/176; H04N19/52; H04N19/105; H04N19/119; H04N19/139; H04N19/14

AI Tagging

Application Domain

Digital video signal modification

AI Technical Summary

Technical Problem

Existing encoding and decoding methods, such as H.265, require significant processing resources and do not efficiently manage motion compensation, leading to inefficiencies.

Method used

Implementing a dual-mode motion compensation process in encoding and decoding devices that includes a first mode using spatial gradient-based motion vectors and a second mode using adjacent block vectors, allowing for reduced processing load and improved efficiency.

Benefits of technology

The dual-mode approach reduces processing requirements while maintaining encoding efficiency by performing motion vector calculations on a prediction block basis, thereby minimizing processing load and enhancing coding efficiency.

✦ Generated by Eureka AI based on patent content.

Abstract

To provide an encoder that can reduce the amount of processing. SOLUTION: In a first operating mode, an encoder 100 derives a first motion vector in units of prediction blocks in merge mode (S112) and performs a first motion compensation process that generates a prediction image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector (S113); in a second operating mode, the encoder derives motion vectors at a plurality of control points, derives a second motion vector in units of sub-blocks for each sub-block obtained by splitting the prediction block (S114), and performs a second motion compensation process that generates a prediction image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector (S115). SELECTED DRAWING: Figure 15

Description

[Technical Field]

[0001] This disclosure relates to an encoding device, a decoding device, an encoding method, and a decoding method.

[Background technology]

[0002] Traditionally, H.265 has been used as a standard for encoding moving images. H.265 is also known as HEVC (High Efficiency Video Coding).

[Prior art documents]

[Non-patent literature]

[0003] [Non-Patent Document 1] H.265 (ISO/IEC 23008-2 HEVC (High Efficiency Video Coding))

[Summary of the Invention]

[Problems that the invention aims to solve]

[0004] In such encoding and decoding methods, it is desirable to reduce the amount of processing required.

[0005] The purpose of this disclosure is to provide a decoding device, encoding device, decoding method, or encoding method that can reduce the amount of processing required.

[Means for solving the problem]

[0006] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit performs motion compensation processing using the memory, and the motion compensation processing includes a first operating mode and a second operating mode. In the first operating mode, a first motion vector is derived in merge mode for each prediction block obtained by dividing an image contained in a moving image, and a first motion compensation process is performed that generates a prediction image by referring to the spatial gradient of brightness in the image generated by motion compensation using the first motion vector. In the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on the motion vectors of blocks spatially adjacent to the prediction block, a second motion vector is derived, using the motion vectors of the plurality of control points, for each subblock obtained by dividing the prediction block into a plurality of subblocks, and a second motion compensation process is performed that generates a prediction image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0007] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit performs motion compensation processing using the memory, and the motion compensation processing includes a first operating mode and a second operating mode. In the first operating mode, a first motion vector is derived for each prediction block obtained by dividing an image contained in a moving image, and a first motion compensation process is performed that generates a prediction image by referring to the spatial gradient of brightness in the image generated by motion compensation using the first motion vector. In the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on the motion vectors of blocks spatially adjacent to the prediction block, and a second motion compensation process is performed that generates a prediction image without referring to the spatial gradient of brightness in the image generated by motion compensation using a second motion vector obtained from the motion vectors of the plurality of control points.

[0008] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit performs motion compensation processing using the memory, and the motion compensation processing includes a first operating mode and a second operating mode. In the first operating mode, a first motion vector is derived for each prediction block obtained by dividing an image contained in a moving image, and a first motion compensation process is performed that generates a prediction image by referring to the spatial gradient of brightness in the image generated by motion compensation using the first motion vector. In the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on the motion vectors of blocks spatially adjacent to the prediction block, a second motion vector is derived, using the motion vectors of the plurality of control points, for each subblock obtained by dividing the prediction block into a plurality of subblocks, and a second motion compensation process is performed that generates a prediction image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector. The second operating mode further includes a plurality of motion compensation modes that differ in the motion vectors of the plurality of control points used to derive the second motion vector.

[0009] A decoding device according to an aspect of the present disclosure includes a circuit and a memory. The circuit performs motion compensation processing using the memory. The motion compensation processing includes a first operation mode and a second operation mode. In the first operation mode, a first motion vector is derived for each prediction block obtained by dividing an image included in a moving image, and a first motion compensation process is performed to generate a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector. In the second operation mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, and for each sub-block obtained by dividing the prediction block into a plurality of sub-blocks using the motion vectors of the plurality of control points, a second motion vector for bidirectional prediction is derived for each sub-block, and a second motion compensation process is performed to generate a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector. The second operation mode further includes a plurality of motion compensation modes in which the motion vectors of the plurality of control points used for deriving the second motion vector are different.
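The derivation of one second motion vector per sub-block from control-point motion vectors can be illustrated with a short sketch. This section does not give the interpolation formula, so the code below assumes a commonly used two-control-point (four-parameter) affine model, with control points at the top-left and top-right corners of the prediction block and a 4x4 sub-block size. The function name `affine_subblock_mvs` and all parameter names are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

Vec = Tuple[float, float]

def affine_subblock_mvs(v0: Vec, v1: Vec, width: int, height: int,
                        sub: int = 4) -> List[List[Vec]]:
    """Return one motion vector per sub-block, sampled at each sub-block centre.

    v0 is the control-point vector at the top-left corner of the prediction
    block and v1 the one at the top-right corner (assumed model, see lead-in).
    """
    a = (v1[0] - v0[0]) / width  # change of vx per sample in x
    b = (v1[1] - v0[1]) / width  # change of vy per sample in x
    rows = []
    for y0 in range(0, height, sub):
        row = []
        for x0 in range(0, width, sub):
            cx, cy = x0 + sub / 2, y0 + sub / 2  # sub-block centre
            vx = a * cx - b * cy + v0[0]
            vy = b * cx + a * cy + v0[1]
            row.append((vx, vy))
        rows.append(row)
    return rows

# Identical control-point vectors reduce to a pure translation.
translation = affine_subblock_mvs((1.0, -2.0), (1.0, -2.0), 16, 16)
# Different control-point vectors give each sub-block its own vector (a zoom here).
zoom = affine_subblock_mvs((0.0, 0.0), (16.0, 0.0), 16, 16)
```

Under this assumed model, choosing a different set of control points changes only the inputs `v0` and `v1`, which is one way to read the "plurality of motion compensation modes" mentioned above.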

[0010] These general or specific aspects may be implemented in a system, an apparatus, a method, an integrated circuit, a computer program, or a non-transitory recording medium such as a computer-readable CD-ROM, or may be implemented in any combination of a system, an apparatus, a method, an integrated circuit, a computer program, and a recording medium.

[Advantages of the Invention]

[0011] The present disclosure can provide a decoding device, an encoding device, a decoding method, or an encoding method capable of reducing the processing amount.

[Brief Description of the Drawings]

[0012]
[Figure 1] Figure 1 is a block diagram showing the functional configuration of the encoding device according to Embodiment 1.
[Figure 2] Figure 2 is a diagram showing an example of block division in Embodiment 1.
[Figure 3] Figure 3 is a table showing the transform basis functions corresponding to each transform type.
[Figure 4A] Figure 4A shows an example of the filter shape used in ALF.
[Figure 4B] Figure 4B shows another example of the filter shape used in ALF.
[Figure 4C] Figure 4C shows another example of the filter shape used in ALF.
[Figure 5A] Figure 5A shows the 67 intra-prediction modes in intra-prediction.
[Figure 5B] Figure 5B is a flowchart illustrating an overview of the predicted image correction process using OBMC processing.
[Figure 5C] Figure 5C is a conceptual diagram illustrating an overview of the predicted image correction process using OBMC processing.
[Figure 5D] Figure 5D shows an example of FRUC.
[Figure 6] Figure 6 illustrates pattern matching (bilateral matching) between two blocks along a motion trajectory.
[Figure 7] Figure 7 illustrates pattern matching (template matching) between a template in the current picture and a block in the reference picture.
[Figure 8] Figure 8 is a diagram illustrating a model that assumes uniform linear motion.
[Figure 9A] Figure 9A is a diagram illustrating the derivation of subblock-level motion vectors based on the motion vectors of multiple adjacent blocks.
[Figure 9B] Figure 9B is a diagram illustrating an overview of the motion vector derivation process using merge mode.
[Figure 9C] Figure 9C is a conceptual diagram illustrating an overview of DMVR processing.
[Figure 9D] Figure 9D is a diagram illustrating an outline of a predicted image generation method using luminance correction by LIC processing.
[Figure 10] Figure 10 is a block diagram showing the functional configuration of the decoding device according to Embodiment 1.
[Figure 11] Figure 11 is a flowchart of the inter-screen prediction process according to Comparative Example 1.
[Figure 12] Figure 12 is a flowchart of the inter-screen prediction process according to Comparative Example 2.
[Figure 13] Figure 13 is a flowchart of the inter-screen prediction process according to Embodiment 1.
[Figure 14] Figure 14 is a flowchart of the inter-screen prediction process according to a modified example of Embodiment 1.
[Figure 15] Figure 15 is a flowchart of the encoding and decoding processes according to a modified example of Embodiment 1.
[Figure 16] Figure 16 is a conceptual diagram showing the template FRUC method according to Embodiment 1.
[Figure 17] Figure 17 is a conceptual diagram showing the bilateral FRUC method according to Embodiment 1.
[Figure 18] Figure 18 is a flowchart showing the process of deriving motion vectors using the FRUC method according to Embodiment 1.
[Figure 19] Figure 19 is a conceptual diagram showing the BIO processing according to Embodiment 1.
[Figure 20] Figure 20 is a flowchart showing the BIO processing according to Embodiment 1.
[Figure 21] Figure 21 is a block diagram showing an example implementation of the encoding device according to Embodiment 1.
[Figure 22] Figure 22 is a block diagram showing an example implementation of the decoding device according to Embodiment 1.
[Figure 23] Figure 23 is an overall diagram of the content supply system that implements the content distribution service.
[Figure 24] Figure 24 shows an example of an encoding structure during scalable encoding.
[Figure 25] Figure 25 shows an example of an encoding structure during scalable encoding.
[Figure 26] Figure 26 shows an example of how a web page is displayed.
[Figure 27] Figure 27 shows an example of how a web page is displayed.
[Figure 28] Figure 28 shows an example of a smartphone.
[Figure 29] Figure 29 is a block diagram showing an example of a smartphone configuration.
[Modes for carrying out the invention]

[0013] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, the circuit using the memory to derive a first motion vector by a first inter-screen prediction method that uses the degree of fit of two reconstructed images of two regions in two different pictures, in prediction block units obtained by dividing the image contained in the moving image, and a first motion compensation process that generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector, in the prediction block units.

[0014] According to this, the encoding device can reduce the processing load by performing the derivation of motion vectors using the first inter-screen prediction method and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the encoding device can reduce the processing load while suppressing the decrease in encoding efficiency.
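Why a gradient-based process can correct at a granularity finer than the prediction block can be seen from a first-order approximation: a motion-compensated sample I shifted by a small displacement (dx, dy) is roughly I + dx * dI/dx + dy * dI/dy, and the gradient terms are evaluated per sample. The sketch below is a simplified illustration of that idea only. It is not the BIO procedure of the embodiment, the displacement is supplied by the caller rather than estimated, and the names `central_gradient` and `gradient_corrected` are hypothetical.

```python
from typing import List, Tuple

Image = List[List[float]]

def central_gradient(img: Image, x: int, y: int) -> Tuple[float, float]:
    """Horizontal and vertical brightness gradients at (x, y), clamped at borders."""
    h, w = len(img), len(img[0])
    x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
    y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
    gx = (img[y][x1] - img[y][x0]) / (x1 - x0)
    gy = (img[y1][x] - img[y0][x]) / (y1 - y0)
    return gx, gy

def gradient_corrected(img: Image, dx: float, dy: float) -> Image:
    """Apply the first-order correction I + dx * dI/dx + dy * dI/dy to every sample."""
    out = []
    for y in range(len(img)):
        row = []
        for x in range(len(img[0])):
            gx, gy = central_gradient(img, x, y)
            row.append(img[y][x] + dx * gx + dy * gy)
        out.append(row)
    return out

# Brightness ramp I(x, y) = 2x + 3y standing in for a block-level prediction.
ramp = [[2.0 * x + 3.0 * y for x in range(8)] for y in range(8)]
corrected = gradient_corrected(ramp, 0.5, -0.25)
```

For this linear ramp the correction is exact: every sample changes by 2 * 0.5 + 3 * (-0.25) = 0.25, which matches the ramp evaluated at the displaced position. On real image content the gradient differs from sample to sample, so the correction does too, even though only one motion vector was derived for the block.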

[0015] For example, the circuit may further use the memory to derive a second motion vector for each prediction block using a second inter-screen prediction method that uses the degree of fit between the target prediction block and the reconstructed image of the region included in the reference picture, and for each prediction block, perform a second motion compensation process to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector, and generate an encoded bitstream containing information for identifying the second motion vector.

[0016] According to this, the processing unit for motion compensation can be the same whether the first inter-screen prediction method or the second inter-screen prediction method is used. This simplifies the implementation of motion compensation.

[0017] For example, the two regions in the first inter-screen prediction method may be (1) a region in a target picture adjacent to the target prediction block and a region in a reference picture, or (2) two regions in two different reference pictures.
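A minimal sketch of deriving a motion vector from the degree of fit between two reconstructed regions follows. It corresponds to case (1), matching a region in the target picture against a region in a reference picture. The source does not specify the fit measure in this section, so the sum of absolute differences (SAD) and the exhaustive search window are assumptions, and `sad` and `best_motion_vector` are hypothetical names.

```python
from typing import List, Tuple

Image = List[List[int]]

def sad(a: Image, ax: int, ay: int, b: Image, bx: int, by: int,
        w: int, h: int) -> int:
    """Sum of absolute differences between two w x h regions (lower means better fit)."""
    return sum(abs(a[ay + j][ax + i] - b[by + j][bx + i])
               for j in range(h) for i in range(w))

def best_motion_vector(cur: Image, ref: Image, x: int, y: int,
                       w: int, h: int, search: int) -> Tuple[int, int]:
    """Search +/- `search` samples around (x, y) for the best-fitting region in ref."""
    best_cost, best_mv = None, (0, 0)
    for my in range(-search, search + 1):
        for mx in range(-search, search + 1):
            rx, ry = x + mx, y + my
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + w > len(ref[0]) or ry + h > len(ref):
                continue
            cost = sad(cur, x, y, ref, rx, ry, w, h)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (mx, my)
    return best_mv

# Synthetic pictures: the current picture equals the reference shifted by (2, 1).
ref = [[x * x + 31 * y for x in range(12)] for y in range(12)]
cur = [[ref[min(y + 1, 11)][min(x + 2, 11)] for x in range(12)] for y in range(12)]
mv = best_motion_vector(cur, ref, 4, 4, 3, 3, 3)
```

Case (2) would use the same cost function with both regions taken from two different reference pictures. Because only reconstructed samples are compared, a decoder can repeat the same search without the motion vector being signalled.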

[0018] A decoding device according to one aspect of the present disclosure comprises a circuit and a memory, the circuit using the memory to derive a first motion vector by a first inter-frame prediction method that uses the degree of fit of two reconstructed images of two regions in two different pictures, in prediction block units obtained by dividing the image contained in the moving image, and a first motion compensation process that generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector, in the prediction block units.

[0019] According to this, the decoding device can reduce the processing load by performing the derivation of motion vectors using the first inter-frame prediction method and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the decoding device can reduce the processing load while suppressing the decrease in coding efficiency.

[0020] For example, the circuit may further use the memory to obtain information for identifying a second motion vector in units of prediction blocks from the encoded bitstream, derive the second motion vector in units of prediction blocks using a second inter-screen prediction method using the information, and perform a second motion compensation process in units of prediction blocks to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived second motion vector.

[0021] According to this, the processing unit for motion compensation can be the same whether the first inter-screen prediction method or the second inter-screen prediction method is used. This simplifies the implementation of motion compensation.

[0022] For example, the two regions in the first inter-screen prediction method may be (1) a region in a target picture adjacent to the target prediction block and a region in a reference picture, or (2) two regions in two different reference pictures.

[0023] An encoding method according to one aspect of the present disclosure derives a first motion vector by a first inter-frame prediction method that uses the degree of fit of two reconstructed images of two different regions in two different pictures, in prediction block units obtained by dividing an image contained in a moving image, and performs motion compensation processing in the prediction block units to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector.

[0024] According to this, the encoding method can reduce the amount of processing by performing the derivation of motion vectors using the first inter-screen prediction method and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the encoding method can reduce the amount of processing while suppressing the decrease in encoding efficiency.

[0025] A decoding method according to one aspect of the present disclosure derives a first motion vector by a first inter-frame prediction method that uses the degree of fit of two reconstructed images of two different regions in two different pictures, in prediction block units obtained by dividing the image contained in the moving image, and performs motion compensation processing in the prediction block units to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector.

[0026] According to this, the decoding method can reduce the processing load by performing the derivation of motion vectors using the first inter-frame prediction method and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the decoding method can reduce the processing load while suppressing the decrease in coding efficiency.

[0027] An encoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit uses the memory to derive a first motion vector in a first operating mode, in a prediction block unit obtained by dividing an image contained in a moving image, and performs a first motion compensation process in which it generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector in the prediction block unit, and performs a second motion compensation process in which it derives a second motion vector in a subblock unit obtained by dividing the prediction block, and generates a predicted image in the subblock unit without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0028] According to this, in the first operating mode, the encoding device performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referencing the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the encoding device performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the encoding device can improve encoding efficiency. In this way, the encoding device can reduce the processing load while suppressing a decrease in encoding efficiency.
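The trade-off between the two operating modes can be summarised in a small planning sketch. It only counts work units and records whether the brightness gradient is referenced, and it performs no actual prediction. The 4x4 sub-block size, the `McPlan` record, and the function `plan_motion_compensation` are assumptions made for illustration and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class McPlan:
    mv_derivations: int        # motion vectors derived for the prediction block
    mc_unit: Tuple[int, int]   # (width, height) of each motion compensation unit
    uses_gradient: bool        # whether the spatial brightness gradient is referenced

def plan_motion_compensation(mode: int, width: int, height: int,
                             sub: int = 4) -> McPlan:
    """Describe the work implied by each operating mode for one prediction block."""
    if mode == 1:
        # First operating mode: one vector per prediction block,
        # followed by gradient-based motion compensation.
        return McPlan(1, (width, height), True)
    if mode == 2:
        # Second operating mode: one vector per sub-block, no gradient reference.
        count = (width // sub) * (height // sub)
        return McPlan(count, (sub, sub), False)
    raise ValueError("mode must be 1 or 2")

first = plan_motion_compensation(1, 16, 16)
second = plan_motion_compensation(2, 16, 16)
```

For a 16x16 prediction block with the assumed 4x4 sub-blocks, the first mode derives 1 motion vector and the second derives 16, while only the first mode pays the cost of the gradient computation.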

[0029] For example, in the first operating mode, the circuit may derive the first motion vector in units of the prediction block using a first inter-screen prediction method, and in the second operating mode, it may derive the second motion vector in units of the sub-block using a second inter-screen prediction method different from the first inter-screen prediction method.

[0030] For example, the second inter-screen prediction method may be an inter-screen prediction method that uses the degree of fit of two reconstructed images of two different regions within two separate pictures.

[0031] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis yields a large gain in coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0032] For example, the first inter-screen prediction method is one of the following: (1) a third inter-screen prediction method that uses the degree of fit between the reconstructed image of a region in a target picture adjacent to the target prediction block and the reconstructed image of a region in a reference picture; and (2) a fourth inter-screen prediction method that uses the degree of fit between two reconstructed images of two regions in two different reference pictures. The second inter-screen prediction method may be the other of the third inter-screen prediction method and the fourth inter-screen prediction method.

[0033] For example, the first inter-screen prediction method may be the third inter-screen prediction method, and the second inter-screen prediction method may be the fourth inter-screen prediction method.

[0034] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis yields a large gain in coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0035] For example, the first inter-screen prediction method is an inter-screen prediction method that uses the degree of fit between the target prediction block and the reconstructed image of the region included in the reference picture, and an encoded bitstream containing information for identifying the derived first motion vector may be generated.

[0036] A decoding device according to one aspect of the present disclosure comprises a circuit and a memory, wherein the circuit uses the memory to derive a first motion vector in a first operating mode, in a prediction block unit obtained by dividing the image contained in the moving image, and performs a first motion compensation process in which a prediction image is generated by referencing the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector in the prediction block unit, and performs a second motion compensation process in which a second motion vector is derived in a subblock unit obtained by dividing the prediction block, and generates a prediction image in the subblock unit without referencing the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0037] According to this, in the first operating mode, the decoder can reduce the processing load by performing the motion vector derivation process and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the decoder can perform the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the decoder can improve coding efficiency. In this way, the decoder can reduce the processing load while suppressing a decrease in coding efficiency.

[0038] For example, in the first operating mode, the circuit may derive the first motion vector in units of the prediction block using a first inter-screen prediction method, and in the second operating mode, it may derive the second motion vector in units of the sub-block using a second inter-screen prediction method different from the first inter-screen prediction method.

[0039] For example, the second inter-screen prediction method may be an inter-screen prediction method that uses the degree of fit of two reconstructed images of two different regions within two separate pictures.

[0040] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis yields a large gain in coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0041] For example, the first inter-screen prediction method is one of the following: (1) a third inter-screen prediction method that uses the degree of fit between the reconstructed image of a region in a target picture adjacent to the target prediction block and the reconstructed image of a region in a reference picture; and (2) a fourth inter-screen prediction method that uses the degree of fit between two reconstructed images of two regions in two different reference pictures. The second inter-screen prediction method may be the other of the third inter-screen prediction method and the fourth inter-screen prediction method.

[0042] For example, the first inter-screen prediction method may be the third inter-screen prediction method, and the second inter-screen prediction method may be the fourth inter-screen prediction method.

[0043] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis yields a large gain in coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0044] For example, in the first inter-screen prediction method, information for identifying the first motion vector in units of the prediction block may be obtained from the encoded bitstream, and the first motion vector may be derived using the information.

[0045] An encoding method according to one aspect of the present disclosure includes, in a first operating mode, deriving a first motion vector in each prediction block unit obtained by dividing an image contained in a moving image, and performing a first motion compensation process in each prediction block unit to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector; and in a second operating mode, deriving a second motion vector in each subblock unit obtained by dividing the prediction block, and performing a second motion compensation process in each subblock unit to generate a predicted image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0046] According to this, in the first operating mode, the encoding method performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referencing the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the encoding method performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, the encoding method can improve encoding efficiency by having these two operating modes. In this way, the encoding method can reduce the processing load while suppressing a decrease in encoding efficiency.

[0047] A decoding method according to one aspect of the present disclosure includes, in a first operating mode, deriving a first motion vector in each prediction block unit obtained by dividing the image contained in the moving image, and performing a first motion compensation process in each prediction block unit to generate a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector; and in a second operating mode, deriving a second motion vector in each subblock unit obtained by dividing the prediction block, and performing a second motion compensation process in each subblock unit to generate a predicted image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0048] According to this, in the first operating mode, the decoding method performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction at a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the decoding method performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the decoding method can improve coding efficiency. In this way, the decoding method can reduce the processing load while suppressing a decrease in coding efficiency.

[0049] Furthermore, these comprehensive or specific embodiments may be implemented as systems, devices, methods, integrated circuits, computer programs, or non-transitory recording media such as computer-readable CD-ROMs, or as any combination of systems, devices, methods, integrated circuits, computer programs, and recording media.

[0050] The embodiments will be described in detail below with reference to the drawings.

[0051] The embodiments described below are all comprehensive or specific examples. The numerical values, shapes, materials, components, arrangement and connection configurations of components, steps, and the order of steps shown in the following embodiments are examples only and are not intended to limit the scope of the claims. Furthermore, among the components in the following embodiments, those not described in the independent claim representing the highest-level concept will be described as optional components.

[0052] (Embodiment 1) First, an overview of Embodiment 1 will be given as an example of an encoding and decoding device to which the processes and / or configurations described in each aspect of this disclosure, described later, can be applied. However, Embodiment 1 is merely an example of an encoding and decoding device to which the processes and / or configurations described in each aspect of this disclosure can be applied, and the processes and / or configurations described in each aspect of this disclosure can also be implemented in encoding and decoding devices different from Embodiment 1.

[0053] When applying the processes and / or configurations described in each aspect of this disclosure to Embodiment 1, for example, one of the following may be performed:

[0054] (1) In the encoding or decoding device of Embodiment 1, replacing, among the plurality of components constituting the device, a component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(2) In the encoding or decoding device of Embodiment 1, making arbitrary modifications such as adding, replacing, or deleting functions or processes performed by some of the plurality of components constituting the device, and then replacing a component corresponding to a component described in each aspect of this disclosure with the component described in each aspect of this disclosure.
(3) In the method performed by the encoding or decoding device of Embodiment 1, adding processing and / or replacing, deleting, or otherwise modifying some of the processes included in the method, and then replacing a process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(4) Combining some of the plurality of components constituting the encoding or decoding device of Embodiment 1 with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(5) Combining a component that has some of the functions of some of the components constituting the encoding or decoding device of Embodiment 1, or a component that performs some of the processing performed by some of those components, with a component described in each aspect of this disclosure, a component that has some of the functions of a component described in each aspect of this disclosure, or a component that performs some of the processing performed by a component described in each aspect of this disclosure.
(6) In the method performed by the encoding or decoding device of Embodiment 1, replacing, among the plurality of processes included in the method, a process corresponding to a process described in each aspect of this disclosure with the process described in each aspect of this disclosure.
(7) Performing some of the processes included in the method performed by the encoding or decoding device of Embodiment 1 in combination with the processes described in each aspect of this disclosure.

[0055] The methods of implementing the processes and / or configurations described in each aspect of this disclosure are not limited to the examples above. For example, they may be implemented in a device used for a purpose other than the video / image encoding device or video / image decoding device disclosed in Embodiment 1, or the processes and / or configurations described in each embodiment may be implemented individually. Furthermore, the processes and / or configurations described in different embodiments may be implemented in combination.

[0056] [Overview of the encoding device] First, an overview of the encoding device according to Embodiment 1 will be described. Figure 1 is a block diagram showing the functional configuration of the encoding device 100 according to Embodiment 1. The encoding device 100 is a video / image encoding device that encodes video / images in block units.

[0057] As shown in Figure 1, the encoding device 100 is a device that encodes an image in block units and comprises a splitting unit 102, a subtraction unit 104, a transformation unit 106, a quantization unit 108, an entropy encoding unit 110, an inverse quantization unit 112, an inverse transformation unit 114, an addition unit 116, a block memory 118, a loop filter unit 120, a frame memory 122, an intra prediction unit 124, an inter prediction unit 126, and a prediction control unit 128.

[0058] The encoding device 100 can be implemented, for example, by a general-purpose processor and memory. In this case, when a software program stored in the memory is executed by the processor, the processor functions as the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128. Alternatively, the encoding device 100 may be implemented as one or more dedicated electronic circuits corresponding to the splitting unit 102, the subtraction unit 104, the transformation unit 106, the quantization unit 108, the entropy encoding unit 110, the inverse quantization unit 112, the inverse transformation unit 114, the addition unit 116, the loop filter unit 120, the intra prediction unit 124, the inter prediction unit 126, and the prediction control unit 128.

[0059] The following describes each component included in the encoding device 100.

[0060] [Splitting unit] The splitting unit 102 divides each picture contained in the input video into multiple blocks and outputs each block to the subtraction unit 104. For example, the splitting unit 102 first divides the picture into blocks of a fixed size (e.g., 128x128). These fixed-size blocks are sometimes called coding tree units (CTUs). Then, based on recursive quadtree and / or binary tree block partitioning, the splitting unit 102 divides each of the fixed-size blocks into blocks of a variable size (e.g., 64x64 or less). These variable-size blocks are sometimes called coding units (CUs), prediction units (PUs), or transform units (TUs). In this embodiment, CUs, PUs, and TUs do not need to be distinguished, and some or all of the blocks in the picture may serve as the processing unit for CUs, PUs, and TUs.

[0061] Figure 2 shows an example of block partitioning in Embodiment 1. In Figure 2, solid lines represent block boundaries due to quadtree block partitioning, and dashed lines represent block boundaries due to binary tree block partitioning.

[0062] Here, block 10 is a 128x128 pixel square block (128x128 block). This 128x128 block 10 is first divided into four 64x64 square blocks (quadtree block partitioning).

[0063] The top-left 64x64 block is further divided vertically into two rectangular 32x64 blocks, and the left 32x64 block is further divided vertically into two rectangular 16x64 blocks (binary tree block partitioning). As a result, the top-left 64x64 block is divided into two 16x64 blocks 11 and 12 and a 32x64 block 13.

[0064] The top-right 64x64 block is divided horizontally into two rectangular 64x32 blocks 14 and 15 (binary tree block partitioning).

[0065] The bottom-left 64x64 block is divided into four square 32x32 blocks (quadtree block partitioning). Of the four 32x32 blocks, the top-left and bottom-right blocks are further divided. The top-left 32x32 block is divided vertically into two rectangular 16x32 blocks, and the right 16x32 block is further divided horizontally into two 16x16 blocks (binary tree block partitioning). The bottom-right 32x32 block is divided horizontally into two 32x16 blocks (binary tree block partitioning). As a result, the bottom-left 64x64 block is divided into a 16x32 block 16, two 16x16 blocks 17 and 18, two 32x32 blocks 19 and 20, and two 32x16 blocks 21 and 22.

[0066] The bottom-right 64x64 block 23 is not divided.

[0067] As described above, in Figure 2, block 10 is divided into 13 variable-sized blocks 11-23 based on recursive quad-tree and binary tree block partitioning. Such partitioning is sometimes called QTBT (quad-tree plus binary tree) partitioning.
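The partitioning of Figure 2 described in paragraphs [0062] to [0067] can be written down as a small split tree. The following is a minimal sketch only: the tuple-based tree representation and the names `LEAF`, `leaves`, and `figure2_tree` are assumptions made for illustration and do not appear in this disclosure. Block sizes are given as width x height, and the assignment of reference numerals 19 and 20 to the two undivided 32x32 blocks is not material to the sketch.

```python
# Hypothetical split-tree representation of the Figure 2 partitioning.
# Node forms:
#   ("leaf",)                       an undivided block
#   ("QT", tl, tr, bl, br)          quadtree split into four quadrants
#   ("BV", left, right)             binary split by a vertical line
#   ("BH", top, bottom)             binary split by a horizontal line

LEAF = ("leaf",)

def leaves(node, x, y, w, h):
    """Return the (x, y, width, height) of every undivided block under node."""
    kind = node[0]
    if kind == "leaf":
        return [(x, y, w, h)]
    if kind == "QT":
        hw, hh = w // 2, h // 2
        origins = [(x, y), (x + hw, y), (x, y + hh), (x + hw, y + hh)]
        out = []
        for child, (cx, cy) in zip(node[1:], origins):
            out += leaves(child, cx, cy, hw, hh)
        return out
    if kind == "BV":
        hw = w // 2
        return leaves(node[1], x, y, hw, h) + leaves(node[2], x + hw, y, hw, h)
    if kind == "BH":
        hh = h // 2
        return leaves(node[1], x, y, w, hh) + leaves(node[2], x, y + hh, w, hh)
    raise ValueError("unknown split type: %r" % (kind,))

figure2_tree = (
    "QT",
    ("BV", ("BV", LEAF, LEAF), LEAF),        # top-left: blocks 11, 12 (16x64), 13 (32x64)
    ("BH", LEAF, LEAF),                      # top-right: blocks 14, 15 (64x32)
    ("QT",                                   # bottom-left 64x64
        ("BV", LEAF, ("BH", LEAF, LEAF)),    # block 16 (16x32), blocks 17, 18 (16x16)
        LEAF,                                # undivided 32x32
        LEAF,                                # undivided 32x32
        ("BH", LEAF, LEAF)),                 # blocks 21, 22 (32x16)
    LEAF,                                    # bottom-right: block 23 (64x64)
)

blocks = leaves(figure2_tree, 0, 0, 128, 128)
for bx, by, bw, bh in blocks:
    print("%dx%d block at (%d, %d)" % (bw, bh, bx, by))
```

Walking the tree yields 13 undivided blocks whose areas add up to the 128x128 area of block 10, which matches the count stated in paragraph [0067].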

[0068] In Figure 2, one block is divided into four or two blocks (quadtree or binary tree block partitioning), but the partitioning is not limited to these. For example, one block may be divided into three blocks (ternary tree block partitioning). Partitioning that includes such ternary tree block partitioning is sometimes called MBT (multi-type tree) partitioning.

[0069] [Subtraction unit] The subtraction unit 104 subtracts the predicted signal (predicted sample) from the original signal (original sample) in units of the blocks divided by the splitting unit 102. In other words, the subtraction unit 104 calculates the prediction error (also called the residual) of the block to be encoded (hereinafter referred to as the current block). The subtraction unit 104 then outputs the calculated prediction error to the transformation unit 106.

[0070] The original signal is the input signal to the encoding device 100, and is a signal representing the image of each picture that makes up the moving image (for example, a luminance (luma) signal and two color difference (chroma) signals). In the following, the signal representing the image may also be called a sample.

[0071] [Transformation unit] The transformation unit 106 transforms the prediction error in the spatial domain into transformation coefficients in the frequency domain and outputs the transformation coefficients to the quantization unit 108. Specifically, the transformation unit 106 performs, for example, a predetermined discrete cosine transform (DCT) or discrete sine transform (DST) on the prediction error in the spatial domain.
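As an illustration of transforming a spatial-domain prediction error into frequency-domain coefficients, the following sketch applies a floating-point orthonormal DCT-II to a 4x4 block, first along columns and then along rows. It is a sketch under stated assumptions: the function names are hypothetical, the block is assumed to be square, and an actual codec would normally use integer approximations of the basis functions rather than this floating-point form.

```python
import math

def dct2_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th basis function over n samples."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def forward_dct_2d(residual):
    """Separable 2-D transform of a square block: T * X * T^T."""
    t = dct2_matrix(len(residual))
    return matmul(matmul(t, residual), transpose(t))

def inverse_dct_2d(coeffs):
    """Inverse of forward_dct_2d for an orthonormal basis: T^T * Y * T."""
    t = dct2_matrix(len(coeffs))
    return matmul(matmul(transpose(t), coeffs), t)

# A flat prediction error (every sample equal to 3) has energy only in the
# lowest-frequency (DC) coefficient.
flat_residual = [[3] * 4 for _ in range(4)]
coeffs = forward_dct_2d(flat_residual)
restored = inverse_dct_2d(coeffs)
print(round(coeffs[0][0], 6))
```

For the flat block the DC coefficient is 12 and all other coefficients are zero up to rounding error, which is the energy compaction that makes the subsequent quantization effective.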

[0072] The transformation unit 106 may also adaptively select a transformation type from among several transformation types and use a transformation basis function corresponding to the selected transformation type to convert the prediction error into transformation coefficients. Such a transformation is sometimes called an EMT (explicit multiple core transform) or an AMT (adaptive multiple transform).

[0073] The multiple transformation types include, for example, DCT-II, DCT-V, DCT-VIII, DST-I, and DST-VII. Figure 3 is a table showing the transformation basis functions corresponding to each transformation type. In Figure 3, N represents the number of input pixels. The selection of a transformation type from among these multiple transformation types may depend, for example, on the type of prediction (intra-prediction or inter-prediction) or on the intra-prediction mode.

[0074] Information indicating whether or not to apply such EMT or AMT (e.g., called an AMT flag) and information indicating the selected transformation type are signaled at the CU level. However, the signaling of this information is not limited to the CU level and may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).

[0075] Furthermore, the transformation unit 106 may retransform the transformation coefficients (transformation results). Such retransformation is sometimes called AST (adaptive secondary transform) or NSST (non-separable secondary transform). For example, the transformation unit 106 performs retransformation for each subblock (e.g., 4x4 subblock) contained in the block of transformation coefficients corresponding to the intra-prediction error. Information indicating whether or not to apply NSST and information regarding the transformation matrix used for NSST are signaled at the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, or CTU level).

[0076] Here, a separable transformation is a method in which the transformation is performed separately for each direction, as many times as the input has dimensions, while a non-separable transformation is a method in which, when the input is multidimensional, two or more dimensions are treated together as one dimension and transformed at once.

[0077] For example, one example of a non-separable transformation is to treat a 4x4 block as a single array with 16 elements and then perform a transformation on that array using a 16x16 transformation matrix.
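The 16-element treatment described above can be sketched as follows. This is an illustration only: the function names are hypothetical, and the 16x16 matrix used here is built as the Kronecker product of two 4x4 DCT-II matrices purely so that the result can be checked against the separable form. A transformation matrix actually used for a non-separable secondary transform would be a 16x16 matrix that does not factor in this way.

```python
import math

def dct2_matrix(n):
    """Orthonormal DCT-II basis, used here only to build a checkable example."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def kron(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]

def apply_nonseparable(block, matrix):
    """Flatten the block row by row into one array and apply a single matrix to it."""
    vec = [value for row in block for value in row]
    out = [sum(m * v for m, v in zip(matrix_row, vec)) for matrix_row in matrix]
    size = len(block)
    return [out[r * size:(r + 1) * size] for r in range(size)]

def apply_separable(block, t):
    """Reference result: transform each direction separately, Y = T * X * T^T."""
    n = len(block)
    tx = [[sum(t[u][i] * block[i][j] for i in range(n)) for j in range(n)]
          for u in range(n)]
    return [[sum(tx[u][j] * t[v][j] for j in range(n)) for v in range(n)]
            for u in range(n)]

block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
t4 = dct2_matrix(4)
matrix16 = kron(t4, t4)                      # 16x16 matrix acting on 16 elements
result = apply_nonseparable(block, matrix16)
reference = apply_separable(block, t4)
```

The design point is that `apply_nonseparable` accepts any 16x16 matrix, so it is strictly more general than the separable form, at the cost of 256 multiplications per 4x4 block instead of two passes of 4x4 products.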

[0078] Similarly, the Hypercube Givens Transform, which treats a 4x4 input block as a single array with 16 elements and then performs multiple Givens rotations on that array, is another example of a non-separable transformation.

[0079] [Quantization unit] The quantization unit 108 quantizes the transformation coefficients output from the transformation unit 106. Specifically, the quantization unit 108 scans the transformation coefficients of the current block in a predetermined scanning order and quantizes the transformation coefficients based on the quantization parameter (QP) corresponding to the scanned transformation coefficients. The quantization unit 108 then outputs the quantized transformation coefficients of the current block (hereinafter referred to as quantization coefficients) to the entropy encoding unit 110 and the inverse quantization unit 112.

[0080] The predetermined scanning order is the order for quantization / inverse quantization of the transformation coefficients. For example, the predetermined scanning order is defined as ascending frequency (from low frequency to high frequency) or descending frequency (from high frequency to low frequency).
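One way to realize an ascending-frequency or descending-frequency scanning order is to visit coefficient positions along anti-diagonals. The sketch below is an assumption for illustration: this disclosure does not fix a particular scan pattern, and the names `ascending_frequency_scan` and `scan` are hypothetical.

```python
def ascending_frequency_scan(size):
    """Positions (row, col) ordered from low to high frequency along anti-diagonals."""
    positions = [(r, c) for r in range(size) for c in range(size)]
    # row + col grows with frequency; the row index breaks ties within a diagonal.
    return sorted(positions, key=lambda p: (p[0] + p[1], p[0]))

def scan(coeffs, order):
    """Read a 2-D coefficient block into a 1-D list following the given order."""
    return [coeffs[r][c] for r, c in order]

ascending = ascending_frequency_scan(4)
descending = list(reversed(ascending))       # high frequency to low frequency

coeffs = [[90, 12, 3, 0],
          [10,  4, 0, 0],
          [ 2,  0, 0, 0],
          [ 0,  0, 0, 0]]
print(scan(coeffs, ascending))
```

With the low-frequency position (0, 0) first, the nonzero coefficients of a typical block cluster at the start of the scanned list.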

[0081] Quantization parameters are parameters that define the quantization step (quantization width). For example, if the value of the quantization parameter increases, the quantization step also increases. In other words, if the value of the quantization parameter increases, the quantization error increases.
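The relationship between the quantization parameter, the quantization step, and the quantization error can be illustrated numerically. The sketch below assumes the H.265 / HEVC-style mapping in which the step size doubles for every increase of 6 in QP and equals 1 at QP 4; this mapping is not stated in this disclosure and is used here only as a concrete example. The function names are hypothetical, and rounding to the nearest level is a simplification of what an encoder actually does.

```python
def quantization_step(qp):
    """Assumed HEVC-style mapping: step doubles every 6 QP values, 1.0 at QP 4."""
    return 2.0 ** ((qp - 4) / 6.0)

def quantize(coeff, qp):
    """Map a transformation coefficient to an integer level."""
    return int(round(coeff / quantization_step(qp)))

def dequantize(level, qp):
    """Restore an approximate coefficient from its level."""
    return level * quantization_step(qp)

def max_abs_error(coeffs, qp):
    """Largest reconstruction error over a list of coefficients at a given QP."""
    return max(abs(c - dequantize(quantize(c, qp), qp)) for c in coeffs)

coeffs = [100.0, -37.0, 12.0, 5.0, -3.0, 1.0]
for qp in (10, 22, 40):
    print(qp, quantization_step(qp), max_abs_error(coeffs, qp))
```

Raising QP from 10 to 40 raises the step from 2 to 64 in this mapping, and the worst-case reconstruction error grows accordingly, since it is bounded by half the step.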

[0082] [Entropy encoding unit] The entropy encoding unit 110 generates an encoded signal (encoded bitstream) by variable-length encoding the quantization coefficients input from the quantization unit 108. Specifically, the entropy encoding unit 110, for example, binarizes the quantization coefficients and arithmetically encodes the binary signal.

[0083] [Inverse quantization unit] The inverse quantization unit 112 inversely quantizes the quantization coefficients input from the quantization unit 108. Specifically, the inverse quantization unit 112 inversely quantizes the quantization coefficients of the current block in a predetermined scanning order. Then, the inverse quantization unit 112 outputs the inversely quantized transformation coefficients of the current block to the inverse transformation unit 114.

[0084] [Inverse transformation unit] The inverse transformation unit 114 restores the prediction error by inversely transforming the transformation coefficients input from the inverse quantization unit 112. Specifically, the inverse transformation unit 114 restores the prediction error of the current block by performing, on the transformation coefficients, an inverse transformation that corresponds to the transformation by the transformation unit 106. The inverse transformation unit 114 then outputs the restored prediction error to the addition unit 116.

[0085] Note that the restored prediction error does not match the prediction error calculated by the subtraction unit 104, because information is lost through quantization. In other words, the restored prediction error includes quantization errors.

[0086] [Addition unit] The addition unit 116 reconstructs the current block by adding the prediction error input from the inverse transformation unit 114 and the prediction sample input from the prediction control unit 128. The addition unit 116 then outputs the reconstructed block to the block memory 118 and the loop filter unit 120. The reconstructed block is sometimes called the local decoded block.

[0087] [Block memory] The block memory 118 is a storage unit for storing blocks within the picture to be encoded (hereinafter referred to as the current picture) that are referenced in intra prediction. Specifically, the block memory 118 stores the reconstructed blocks output from the addition unit 116.

[0088] [Loop filter unit] The loop filter unit 120 applies a loop filter to the block reconstructed by the addition unit 116 and outputs the filtered reconstructed block to the frame memory 122. A loop filter is a filter used within the encoding loop (in-loop filter), and includes, for example, a deblocking filter (DF), sample adaptive offset (SAO), and adaptive loop filter (ALF).

[0089] In ALF, a least-squares error filter is applied to remove coding distortion. For example, for each 2x2 subblock within the current block, one filter selected from several filters is applied based on the direction and activity of the local gradient.

[0090] Specifically, first, subblocks (e.g., 2x2 subblocks) are classified into multiple classes (e.g., 15 or 25 classes). The classification of subblocks is based on the direction and activity of the gradient. For example, a classification value C (e.g., C = 5D + A) is calculated using the gradient direction value D (e.g., 0-2 or 0-4) and the gradient activity value A (e.g., 0-4). Then, based on the classification value C, the subblocks are classified into multiple classes (e.g., 15 or 25 classes).

[0091] The gradient direction value D is derived, for example, by comparing gradients in multiple directions (e.g., horizontal, vertical, and two diagonal directions). The gradient activity value A is derived, for example, by adding the gradients in multiple directions and quantizing the sum.

[0092] Based on the results of this classification, a filter for the subblock is determined from among multiple filters.
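The classification step of paragraphs [0090] to [0092] reduces to a small index computation once D and A are known. The sketch below covers only that step; how D and A are computed from the local gradients is not reproduced here, and the names `classification_value` and `select_filter` are hypothetical.

```python
def classification_value(direction, activity):
    """C = 5D + A for gradient direction value D and gradient activity value A."""
    if not 0 <= direction <= 4:
        raise ValueError("direction value D must be in 0..4")
    if not 0 <= activity <= 4:
        raise ValueError("activity value A must be in 0..4")
    return 5 * direction + activity

def select_filter(filters, direction, activity):
    """Pick the filter assigned to the class of a subblock."""
    return filters[classification_value(direction, activity)]

# D in 0..4 with A in 0..4 gives 25 classes; D in 0..2 gives 15 classes.
classes_25 = sorted({classification_value(d, a) for d in range(5) for a in range(5)})
classes_15 = sorted({classification_value(d, a) for d in range(3) for a in range(5)})

filters = ["filter_%d" % index for index in range(25)]   # placeholder filter set
print(select_filter(filters, 2, 3))
```

Because A takes five values, multiplying D by 5 guarantees that every (D, A) pair maps to a distinct class, which is why the class counts are 15 and 25 for the two ranges of D.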

[0093] For example, a circularly symmetric shape is used as the filter shape in ALF. Figures 4A to 4C show several examples of filter shapes used in ALF. Figure 4A shows a 5x5 diamond-shaped filter, Figure 4B shows a 7x7 diamond-shaped filter, and Figure 4C shows a 9x9 diamond-shaped filter. Information indicating the filter shape is signaled at the picture level. However, the signaling of information indicating the filter shape is not limited to the picture level and may be at other levels (e.g., sequence level, slice level, tile level, CTU level, or CU level).

[0094] The on / off status of ALF is determined, for example, at the picture level or CU level. For instance, the decision to apply ALF to luminance is made at the CU level, and the decision to apply ALF to color difference is made at the picture level. Information indicating whether ALF is on or off is signaled at the picture level or CU level. However, the signaling of information indicating whether ALF is on or off is not limited to the picture level or CU level, but may be at other levels (e.g., sequence level, slice level, tile level, or CTU level).

[0095] The coefficient sets of multiple selectable filters (e.g., up to 15 or 25 filters) are signaled at the picture level. However, the signaling of the coefficient sets is not limited to the picture level; it may be at other levels (e.g., sequence level, slice level, tile level, CTU level, CU level, or subblock level).

[0096] [Frame memory] The frame memory 122 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 122 stores the reconstructed blocks filtered by the loop filter unit 120.

[0097] [Intra Prediction Unit] The intra-prediction unit 124 generates a prediction signal (intra-prediction signal) by performing intra-prediction (also called in-screen prediction) of the current block by referring to the block in the current picture stored in the block memory 118. Specifically, the intra-prediction unit 124 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, color difference values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 128.

[0098] For example, the intra-prediction unit 124 performs intra-prediction using one of a predetermined set of intra-prediction modes. The set of intra-prediction modes includes one or more non-directional prediction modes and multiple directional prediction modes.

[0099] One or more non-directional prediction modes include, for example, the Planar prediction mode and DC prediction mode as defined in the H.265 / HEVC (High-Efficiency Video Coding) standard (Non-Patent Document 1).

[0100] Multiple directional prediction modes include, for example, the 33 directional prediction modes defined in the H.265 / HEVC standard. Note that multiple directional prediction modes may also include 32 additional directional prediction modes (a total of 65 directional prediction modes). Figure 5A shows 67 intra-prediction modes (2 non-directional prediction modes and 65 directional prediction modes) in intra-prediction. Solid arrows represent the 33 directions defined in the H.265 / HEVC standard, and dashed arrows represent the additional 32 directions.

[0101] Furthermore, in the intra-prediction of a color difference block, a luminance block may be referenced. That is, the color difference component of the current block may be predicted based on the luminance component of the current block. Such intra-prediction is sometimes called CCLM (cross-component linear model) prediction. Such an intra-prediction mode for a color difference block that references a luminance block (e.g., called the CCLM mode) may be added as one of the intra-prediction modes for a color difference block.
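The idea of predicting a color difference component from the luminance component with a linear model can be sketched as below. Everything specific here is an assumption for illustration: the model form pred = alpha * luma + beta, the derivation of alpha and beta by a least-squares fit over already reconstructed neighboring sample pairs, and all function names. Downsampling of the luminance samples for subsampled chroma formats is omitted.

```python
def fit_linear_model(luma_neighbors, chroma_neighbors):
    """Least-squares fit of chroma = alpha * luma + beta over neighboring sample pairs."""
    n = len(luma_neighbors)
    mean_l = sum(luma_neighbors) / n
    mean_c = sum(chroma_neighbors) / n
    var_l = sum((l - mean_l) ** 2 for l in luma_neighbors)
    if var_l == 0:
        # Flat luminance carries no slope information; fall back to the mean chroma.
        return 0.0, mean_c
    cov = sum((l - mean_l) * (c - mean_c)
              for l, c in zip(luma_neighbors, chroma_neighbors))
    alpha = cov / var_l
    return alpha, mean_c - alpha * mean_l

def predict_chroma(luma_block, alpha, beta):
    """Apply the linear model to every reconstructed luminance sample of the block."""
    return [[alpha * l + beta for l in row] for row in luma_block]

# Hypothetical neighboring samples that follow chroma = 0.5 * luma + 10 exactly.
alpha, beta = fit_linear_model([40, 80, 120, 160], [30, 50, 70, 90])
predicted = predict_chroma([[100, 120], [140, 160]], alpha, beta)
print(alpha, beta, predicted)
```

Since both the encoding side and the decoding side can run the same fit on reconstructed neighbors, a model of this kind needs no parameters in the bitstream beyond the mode indication.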

[0102] The intra-prediction unit 124 may correct the pixel values after intra-prediction based on the gradient of the horizontal / vertical reference pixels. Intra-prediction with such correction is sometimes called PDPC (position dependent intra-prediction combination). Information indicating whether or not PDPC is applied (for example, called a PDPC flag) is signaled at, for example, the CU level. Note that the signaling of this information is not limited to the CU level, but may be at other levels (for example, sequence level, picture level, slice level, tile level, or CTU level).

[0103] [Inter prediction unit] The inter-prediction unit 126 generates a prediction signal (inter-prediction signal) by performing inter-prediction (also called inter-screen prediction) of the current block by referring to a reference picture that is stored in the frame memory 122 and is different from the current picture. Inter-prediction is performed in units of the current block or of sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 126 performs motion estimation within the reference picture for the current block or sub-block. Then, the inter-prediction unit 126 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) obtained from the motion estimation. Finally, the inter-prediction unit 126 outputs the generated inter-prediction signal to the prediction control unit 128.

[0104] The motion information used for motion compensation is signaled. A motion vector predictor may be used to signal the motion vector. In other words, the difference between the motion vector and the motion vector predictor may be signaled.
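Signaling the difference from a motion vector predictor can be sketched in a few lines. The function names and the tuple representation of a motion vector are assumptions for illustration; how the predictor itself is chosen is outside this sketch.

```python
def motion_vector_difference(mv, predictor):
    """Encoding side: the value to signal is the MV minus its predictor."""
    return (mv[0] - predictor[0], mv[1] - predictor[1])

def reconstruct_motion_vector(mvd, predictor):
    """Decoding side: add the signaled difference back to the same predictor."""
    return (predictor[0] + mvd[0], predictor[1] + mvd[1])

mv = (13, -7)            # hypothetical motion vector found by motion estimation
predictor = (12, -8)     # hypothetical predictor taken from a neighboring block
mvd = motion_vector_difference(mv, predictor)
print(mvd, reconstruct_motion_vector(mvd, predictor))
```

When neighboring blocks move similarly, the difference is small in magnitude and therefore cheaper to encode than the motion vector itself.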

[0105] Furthermore, an inter-prediction signal may be generated using not only the motion information of the current block obtained through motion search, but also the motion information of adjacent blocks. Specifically, an inter-prediction signal may be generated for each sub-block within the current block by weighted addition of a prediction signal based on motion information obtained through motion search and a prediction signal based on the motion information of adjacent blocks. Such inter-prediction (motion compensation) is sometimes called OBMC (overlapped block motion compensation).

[0106] In this OBMC mode, information indicating the size of the subblock for OBMC (e.g., called the OBMC block size) is signaled at the sequence level. Information indicating whether or not to apply OBMC mode (e.g., called the OBMC flag) is signaled at the CU level. Note that the signaling levels for this information are not limited to the sequence and CU levels; other levels (e.g., picture level, slice level, tile level, CTU level, or subblock level) may also be used.

[0107] The OBMC mode will now be described in more detail. Figures 5B and 5C are a flowchart and a conceptual diagram, respectively, illustrating an overview of the predicted image correction process using OBMC processing.

[0108] First, a predicted image (Pred) is obtained using normal motion compensation with the motion vector (MV) assigned to the block to be encoded.

[0109] Next, the motion vector (MV_L) of the already encoded left adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_L), and the first correction of the predicted image is performed by weighting and superimposing Pred and Pred_L.

[0110] Similarly, the motion vector (MV_U) of the already encoded upper adjacent block is applied to the block to be encoded to obtain a predicted image (Pred_U). The second correction of the predicted image is then performed by weighting and superimposing the predicted image after the first correction and Pred_U, and the result is used as the final predicted image.

[0111] While this explanation describes a two-stage correction method using the left adjacent block and the upper adjacent block, it is also possible to use the right adjacent block and the lower adjacent block to perform corrections more than two times.

[0112] Furthermore, the area to be superimposed does not have to be the entire pixel area of the block; it may be only a portion of the area near the block boundary.

[0113] Although this explanation describes the predicted image correction process using a single reference picture, the process is similar when correcting predicted images from multiple reference pictures. After corrected predicted images are obtained from each reference picture, the resulting predicted images are further superimposed to create the final predicted image.
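The two-stage correction of paragraphs [0108] to [0112] can be sketched as two weighted blends restricted to the lines nearest the left and upper block boundaries. The weights in `NEIGHBOR_WEIGHTS` and all function names are hypothetical; this disclosure does not specify weight values or how many lines near the boundary are corrected.

```python
# Hypothetical weights given to the neighbor-based prediction on the 1st and
# 2nd line counted from the block boundary; lines farther away are unchanged.
NEIGHBOR_WEIGHTS = [0.25, 0.125]

def blend_left(pred, pred_l, weights=NEIGHBOR_WEIGHTS):
    """First correction: blend Pred with Pred_L in the columns nearest the left edge."""
    out = [row[:] for row in pred]
    for y, row in enumerate(out):
        for x, w in enumerate(weights[:len(row)]):
            row[x] = (1 - w) * pred[y][x] + w * pred_l[y][x]
    return out

def blend_up(pred, pred_u, weights=NEIGHBOR_WEIGHTS):
    """Second correction: blend with Pred_U in the rows nearest the upper edge."""
    out = [row[:] for row in pred]
    for y, w in enumerate(weights[:len(out)]):
        for x in range(len(out[y])):
            out[y][x] = (1 - w) * pred[y][x] + w * pred_u[y][x]
    return out

def obmc(pred, pred_l, pred_u):
    """Apply the left correction first, then the upper correction."""
    return blend_up(blend_left(pred, pred_l), pred_u)

pred   = [[100.0] * 4 for _ in range(4)]   # from the block's own MV
pred_l = [[60.0] * 4 for _ in range(4)]    # from MV_L of the left adjacent block
pred_u = [[140.0] * 4 for _ in range(4)]   # from MV_U of the upper adjacent block
final = obmc(pred, pred_l, pred_u)
for row in final:
    print(row)
```

Samples away from both boundaries keep the value obtained from the block's own motion vector, while samples near a boundary are pulled toward the prediction implied by the neighbor's motion.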

[0114] The processing target block may be a prediction block unit, or it may be a sub-block unit obtained by further dividing the prediction block.

[0115] One method for determining whether or not to apply OBMC processing is to use an obmc_flag signal, which indicates whether or not to apply OBMC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region with complex motion. If it belongs to a region with complex motion, the obmc_flag is set to a value of 1 and OBMC processing is applied to perform encoding. If it does not belong to a region with complex motion, the obmc_flag is set to a value of 0 and encoding is performed without applying OBMC processing. On the other hand, in a decoding device, the obmc_flag written in the stream is decoded, and the device switches whether or not to apply OBMC processing depending on its value and performs decoding.

[0116] Furthermore, motion information may be derived on the decoding device side without being signaled. For example, the merge mode specified in the H.265 / HEVC standard may be used. Alternatively, motion information may be derived by performing a motion search on the decoding device side. In this case, the motion search is performed without using the pixel values of the current block.

[0117] The mode in which motion search is performed on the decoding device side will now be described. This mode is sometimes called PMMVD (pattern matched motion vector derivation) mode or FRUC (frame rate up-conversion) mode.

[0118] An example of FRUC processing is shown in Figure 5D. First, a list of multiple candidates, each having a motion vector predictor, is generated by referring to the motion vectors of encoded blocks that are spatially or temporally adjacent to the current block (the list may be the same as the merge list). Next, the best candidate MV is selected from among the multiple candidate MVs registered in the candidate list. For example, an evaluation value is calculated for each candidate included in the candidate list, and one candidate is selected based on the evaluation values.

[0119] Then, based on the motion vector of the selected candidate, a motion vector for the current block is derived. Specifically, for example, the motion vector of the selected candidate (best candidate MV) is used as-is as the motion vector for the current block. Alternatively, for example, the motion vector for the current block may be derived by performing pattern matching in the area surrounding the position in the reference picture corresponding to the motion vector of the selected candidate. That is, a search using the same method is performed in the area surrounding the best candidate MV, and if an MV with a better evaluation value is found, the best candidate MV may be updated to this MV and used as the final MV for the current block. A configuration in which this refinement is not performed is also possible.
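The candidate selection and the optional search around the best candidate described in paragraphs [0118] and [0119] can be sketched as follows. The sketch assumes that a lower evaluation value is better, uses an exhaustive search over a small square window, and replaces the pattern-matching evaluation with a toy cost function; the function names, the window size, and the toy cost are all hypothetical.

```python
def select_best_candidate(candidates, cost):
    """Pick the candidate MV with the best (here: lowest) evaluation value."""
    return min(candidates, key=cost)

def refine(best_mv, cost, search_range=2):
    """Search the area surrounding the best candidate MV and keep the best MV found.

    The window includes best_mv itself, so the result is never worse than it.
    """
    bx, by = best_mv
    window = [(bx + dx, by + dy)
              for dy in range(-search_range, search_range + 1)
              for dx in range(-search_range, search_range + 1)]
    return min(window, key=cost)

def derive_motion_vector(candidates, cost, use_refinement=True):
    best = select_best_candidate(candidates, cost)
    return refine(best, cost) if use_refinement else best

# Toy evaluation value standing in for a pattern-matching difference:
# squared distance to a hypothetical true motion of (3, -2).
def toy_cost(mv):
    return (mv[0] - 3) ** 2 + (mv[1] + 2) ** 2

candidate_list = [(0, 0), (4, -4), (-8, 1)]
print(derive_motion_vector(candidate_list, toy_cost, use_refinement=False))
print(derive_motion_vector(candidate_list, toy_cost))
```

Without refinement the best list entry (4, -4) is used directly; with refinement the search window around it contains (3, -2), which has a better evaluation value and becomes the final MV.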

[0120] The same processing method can be used when processing at the sub-block level.

[0121] The evaluation value is calculated by determining the difference value of the reconstructed image through pattern matching between a region in the reference picture corresponding to the motion vector and a predetermined region. Alternatively, the evaluation value may be calculated using other information in addition to the difference value.

[0122] For pattern matching, either the first pattern matching or the second pattern matching is used. The first pattern matching and the second pattern matching are sometimes called bilateral matching and template matching, respectively.

[0123] In the first pattern matching, pattern matching is performed between two blocks in two different reference pictures that are aligned with the motion trajectory of the current block. Therefore, in the first pattern matching, a region in another reference picture aligned with the motion trajectory of the current block is used as a predetermined region for calculating the evaluation value of the candidate described above.

[0124] Figure 6 illustrates an example of pattern matching (bilateral matching) between two blocks along a motion trajectory. As shown in Figure 6, in the first pattern matching, two motion vectors (MV0, MV1) are derived by searching for the best-matching pair of two blocks within two different reference pictures (Ref0, Ref1) that are along the motion trajectory of the current block. Specifically, for the current block, the difference between the reconstructed image at a specified position in the first encoded reference picture (Ref0) specified by the candidate MV and the reconstructed image at a specified position in the second encoded reference picture (Ref1) specified by the symmetric MV obtained by scaling the candidate MV by the display time interval is derived, and an evaluation value is calculated using the obtained difference value. It is preferable to select the candidate MV with the best evaluation value among multiple candidate MVs as the final MV.

[0125] Under the assumption of a continuous motion trajectory, the motion vectors (MV0, MV1) pointing to the two reference blocks are proportional to the temporal distances (TD0, TD1) between the current picture (Cur Pic) and the two reference pictures (Ref0, Ref1). For example, if the current picture is temporally located between the two reference pictures and the temporal distances from the current picture to the two reference pictures are equal, then the first pattern matching derives mirror-symmetric bidirectional motion vectors.
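
The bilateral evaluation described in paragraphs [0124] and [0125] can be illustrated with a small sketch. The assumptions here are not taken from the source: pictures are lists of rows of integers, MVs are in integer samples, the current picture lies between Ref0 and Ref1, all accessed blocks lie inside the pictures, and the evaluation value is a plain sum of absolute differences (SAD). All function names are hypothetical.

```python
from fractions import Fraction

def mirror_mv(mv0, td0, td1):
    """Scale MV0 by TD1/TD0 and flip its sign to get the MV toward Ref1."""
    scale = Fraction(td1, td0)
    return (-mv0[0] * scale, -mv0[1] * scale)

def fetch_block(picture, x, y, w, h):
    """Return the w x h block whose top-left sample is at (x, y)."""
    return [row[x:x + w] for row in picture[y:y + h]]

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def bilateral_cost(ref0, ref1, pos, size, mv0, td0, td1):
    """Evaluation value of candidate MV0: mismatch between the two referenced blocks."""
    mv1 = mirror_mv(mv0, td0, td1)
    x, y = pos
    w, h = size
    b0 = fetch_block(ref0, x + mv0[0], y + mv0[1], w, h)
    # int() truncates fractional MVs; sub-sample interpolation is omitted here.
    b1 = fetch_block(ref1, x + int(mv1[0]), y + int(mv1[1]), w, h)
    return sad(b0, b1)

# Toy pictures: Ref1 shows the content of Ref0 shifted by two samples.
ref0 = [[x + 10 * y for x in range(10)] for y in range(8)]
ref1 = [[(x + 2) + 10 * y for x in range(10)] for y in range(8)]
cost_good = bilateral_cost(ref0, ref1, (3, 3), (2, 2), (1, 0), 1, 1)
cost_zero = bilateral_cost(ref0, ref1, (3, 3), (2, 2), (0, 0), 1, 1)
```

With equal temporal distances, `mirror_mv` returns the mirror-symmetric vector, so the candidate (1, 0) is paired with (-1, 0).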

[0126] In the second pattern matching, pattern matching is performed between the template in the current picture (blocks adjacent to the current block in the current picture (e.g., blocks above and / or to the left)) and the blocks in the reference picture. Therefore, in the second pattern matching, the blocks adjacent to the current block in the current picture are used as a predetermined area for calculating the evaluation value of the candidates mentioned above.

[0127] Figure 7 illustrates an example of pattern matching (template matching) between a template in the current picture and a block in the reference picture. As shown in Figure 7, in the second pattern matching, the motion vector of the current block is derived by searching in the reference picture (Ref0) for the block that best matches the block adjacent to the current block (Cur block) in the current picture (Cur Pic). Specifically, for the current block, the difference is derived between the reconstructed image of the encoded region of both or either of the left adjacent and upper adjacent regions and the reconstructed image at the equivalent position in the encoded reference picture (Ref0) specified by the candidate MV. An evaluation value is calculated using the obtained difference value, and the candidate MV with the best evaluation value among multiple candidate MVs is selected as the best candidate MV.
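
A minimal sketch of the template evaluation in paragraph [0127] follows. It assumes, for illustration only, that the template is the single row above and the single column to the left of the current block, that MVs are integer pairs, and that the evaluation value is a SAD; the function names are hypothetical.

```python
def template_samples(picture, x, y, w, h):
    """Collect the row above and the column left of the w x h block at (x, y)."""
    top = [picture[y - 1][x + i] for i in range(w)]
    left = [picture[y + j][x - 1] for j in range(h)]
    return top + left

def template_cost(cur_pic, ref_pic, pos, size, mv):
    """SAD between the current template and the template displaced by mv in Ref0."""
    x, y = pos
    w, h = size
    cur_t = template_samples(cur_pic, x, y, w, h)
    ref_t = template_samples(ref_pic, x + mv[0], y + mv[1], w, h)
    return sum(abs(a - b) for a, b in zip(cur_t, ref_t))

def best_template_mv(cur_pic, ref_pic, pos, size, candidates):
    """Select the candidate MV with the lowest template evaluation value."""
    return min(candidates, key=lambda mv: template_cost(cur_pic, ref_pic, pos, size, mv))

# Toy data: the current picture equals the reference displaced by (2, 1).
N = 12
ref_pic = [[(7 * x + 13 * y) % 31 for x in range(N)] for y in range(N)]
cur_pic = [[ref_pic[(y + 1) % N][(x + 2) % N] for x in range(N)] for y in range(N)]
chosen = best_template_mv(cur_pic, ref_pic, (4, 4), (2, 2), [(0, 0), (-1, 0), (2, 1)])
```

Only already reconstructed neighbouring samples are read from the current picture, which is why the same search can run in the decoding device.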

[0128] Information indicating whether or not to apply such a FRUC mode (e.g., called the FRUC flag) is signaled at the CU level. Furthermore, if the FRUC mode is applied (e.g., the FRUC flag is true), information indicating the pattern matching method (first pattern matching or second pattern matching) (e.g., called the FRUC mode flag) is signaled at the CU level. Note that the signaling of this information is not limited to the CU level; it may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).

[0129] Here, we will explain a mode for deriving motion vectors based on a model that assumes uniform linear motion. This mode is sometimes called the BIO (bi-directional optical flow) mode.

[0130] Figure 8 is a diagram illustrating a model that assumes uniform linear motion. In Figure 8, (v_x, v_y) indicates the velocity vector, and τ0 and τ1 respectively indicate the temporal distances between the current picture (Cur Pic) and two reference pictures (Ref0, Ref1). (MVx0, MVy0) indicates the motion vector corresponding to the reference picture Ref0, and (MVx1, MVy1) indicates the motion vector corresponding to the reference picture Ref1.

[0131] At this time, under the assumption of uniform linear motion with velocity vector (v_x, v_y), (MVx0, MVy0) and (MVx1, MVy1) are respectively represented as (v_x τ0, v_y τ0) and (-v_x τ1, -v_y τ1), and the following optical flow equation (1) holds.

[0132]

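$$\frac{\partial I^{(k)}}{\partial t} + v_x \frac{\partial I^{(k)}}{\partial x} + v_y \frac{\partial I^{(k)}}{\partial y} = 0 \qquad (1)$$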

[0133] Here, I^(k) indicates the luminance value of the reference image k (k = 0, 1) after motion compensation. This optical flow equation indicates that the sum of (i) the temporal derivative of the luminance value, (ii) the product of the horizontal velocity and the horizontal component of the spatial gradient of the reference image, and (iii) the product of the vertical velocity and the vertical component of the spatial gradient of the reference image is equal to zero. Based on the combination of this optical flow equation and Hermite interpolation, the motion vector in block units obtained from the merge list or the like is corrected in pixel units.
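
The three-term sum described in paragraph [0133] can be checked numerically on a toy signal. The sketch below is an illustration under assumptions, not the BIO correction itself: `intensity` is a hypothetical continuous luminance ramp that translates with constant velocity, and the derivatives are approximated by central differences.

```python
def intensity(x, y, t, vx=0.5, vy=-0.25):
    """Synthetic luminance ramp translating with constant velocity (vx, vy)."""
    return 3.0 * (x - vx * t) + 2.0 * (y - vy * t)

def optical_flow_residual(I, x, y, t, vx, vy, eps=1e-3):
    """Temporal derivative plus velocity-weighted spatial gradients at (x, y, t)."""
    dI_dt = (I(x, y, t + eps) - I(x, y, t - eps)) / (2 * eps)
    dI_dx = (I(x + eps, y, t) - I(x - eps, y, t)) / (2 * eps)
    dI_dy = (I(x, y + eps, t) - I(x, y - eps, t)) / (2 * eps)
    return dI_dt + vx * dI_dx + vy * dI_dy

# With the true velocity the sum vanishes; with a wrong velocity it does not.
residual_true = optical_flow_residual(intensity, 4.0, 7.0, 0.0, 0.5, -0.25)
residual_wrong = optical_flow_residual(intensity, 4.0, 7.0, 0.0, 0.0, 0.0)
```

A nonzero residual is the kind of mismatch that a gradient-based, per-pixel correction aims to reduce.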

[0134] Note that the motion vector may be derived on the decoder side by a method different from the derivation of the motion vector based on the model assuming uniform linear motion. For example, the motion vector may be derived in sub-block units based on the motion vectors of a plurality of adjacent blocks.

[0135] Here, we will describe a mode in which motion vectors are derived at the sub-block level based on the motion vectors of multiple adjacent blocks. This mode is sometimes called the affine motion compensation prediction mode.

[0136] Figure 9A is a diagram illustrating the derivation of subblock-level motion vectors based on the motion vectors of multiple adjacent blocks. In Figure 9A, the current block contains 16 4x4 subblocks. Here, the motion vector v0 of the upper left corner control point of the current block is derived based on the motion vectors of the adjacent blocks, and the motion vector v1 of the upper right corner control point of the current block is derived based on the motion vectors of the adjacent subblocks. Then, using the two motion vectors v0 and v1, the motion vector (v_x, v_y) of each subblock within the current block is derived by the following equation (2).

[0137]

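$$\begin{cases} v_x = \dfrac{(v_{1x} - v_{0x})}{w}\,x - \dfrac{(v_{1y} - v_{0y})}{w}\,y + v_{0x} \\[6pt] v_y = \dfrac{(v_{1y} - v_{0y})}{w}\,x + \dfrac{(v_{1x} - v_{0x})}{w}\,y + v_{0y} \end{cases} \qquad (2)$$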

[0138] Here, x and y represent the horizontal and vertical positions of the subblock, respectively, and w represents a predetermined weighting coefficient.
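
As a rough sketch of the sub-block derivation in paragraph [0136], the code below assumes the commonly used four-parameter affine form v_x = (v1x - v0x)/w * x - (v1y - v0y)/w * y + v0x and v_y = (v1y - v0y)/w * x + (v1x - v0x)/w * y + v0y. Treating w as the block width and evaluating each sub-block at its centre are assumptions made for this illustration, and the function name is hypothetical.

```python
def affine_subblock_mvs(v0, v1, w, block_size=16, sub=4):
    """Derive one MV per sub-block from the two corner control-point MVs v0 and v1."""
    a = (v1[0] - v0[0]) / w  # scaling/rotation term shared by both components
    b = (v1[1] - v0[1]) / w
    mvs = []
    for sy in range(0, block_size, sub):
        row = []
        for sx in range(0, block_size, sub):
            x, y = sx + sub / 2, sy + sub / 2  # sub-block centre (assumption)
            row.append((a * x - b * y + v0[0], b * x + a * y + v0[1]))
        mvs.append(row)
    return mvs

# Equal control-point MVs give pure translation: every sub-block gets the same MV.
translation = affine_subblock_mvs((2.0, -1.0), (2.0, -1.0), w=16)
# Different control-point MVs give MVs that vary with sub-block position.
zoom = affine_subblock_mvs((0.0, 0.0), (16.0, 0.0), w=16)
```

For a 16x16 block with 4x4 sub-blocks, the result is a 4x4 grid, that is, 16 sub-block motion vectors.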

[0139] Such affine motion compensation prediction modes may include several modes in which the motion vectors of the upper-left and upper-right corner control points are derived. Information indicating such affine motion compensation prediction modes (e.g., called affine flags) is signaled at the CU level. Note that the signaling of this information indicating affine motion compensation prediction modes is not limited to the CU level, but may be at other levels (e.g., sequence level, picture level, slice level, tile level, CTU level, or subblock level).

[0140] [Prediction Control Unit] The prediction control unit 128 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the subtraction unit 104 and the addition unit 116.

[0141] Here, we will explain an example of deriving the motion vector of a picture to be encoded using merge mode. Figure 9B is a diagram illustrating the overview of the motion vector derivation process using merge mode.

[0142] First, a list of predicted MVs is generated, containing registered candidates for predicted MVs. Candidates for predicted MVs include spatially adjacent predicted MVs, which are the MVs of multiple encoded blocks located spatially around the block to be encoded; temporally adjacent predicted MVs, which are the MVs of nearby blocks projected onto the location of the block to be encoded in the encoded reference picture; combined predicted MVs, which are generated by combining the MV values of spatially adjacent predicted MVs and temporally adjacent predicted MVs; and zero predicted MVs, which are MVs with a value of zero.

[0143] Next, one predicted MV is selected from the multiple predicted MVs registered in the predicted MV list to determine it as the MV for the block to be encoded.

[0144] Furthermore, the variable-length coding unit encodes the merge_idx signal, which indicates which predicted MV was selected, by writing it to a stream.
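
The merge-mode steps in paragraphs [0142] to [0144] can be sketched as follows. Several details are assumptions made for illustration: duplicate candidates are pruned, the list is padded with zero MVs up to a hypothetical size of five, combined candidates are passed in ready-made, and `cost` is a hypothetical encoder-side selection criterion.

```python
def build_merge_list(spatial, temporal, combined, max_candidates=5):
    """Register spatial, temporal and combined candidates, then pad with zero MVs."""
    merged = []
    for mv in [*spatial, *temporal, *combined]:
        if mv not in merged:  # pruning of duplicates is an assumption
            merged.append(mv)
    merged = merged[:max_candidates]
    while len(merged) < max_candidates:
        merged.append((0, 0))
    return merged

def encode_merge_idx(merge_list, cost):
    """Encoder side: choose the list index whose MV has the lowest cost."""
    return min(range(len(merge_list)), key=lambda i: cost(merge_list[i]))

def decode_mv(merge_list, merge_idx):
    """Decoder side: rebuild the same list and look up the signaled index."""
    return merge_list[merge_idx]

merge_list = build_merge_list(
    spatial=[(1, 0), (1, 0), (2, 1)], temporal=[(3, 1)], combined=[(2, 0)])
merge_idx = encode_merge_idx(merge_list, lambda mv: abs(mv[0] - 3) + abs(mv[1] - 1))
decoded_mv = decode_mv(merge_list, merge_idx)
```

Because both sides build the list by the same rules, only `merge_idx` needs to be written to the stream.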

[0145] Note that the predicted MVs registered in the predicted MV list explained in Figure 9B are just an example, and the number of predicted MVs may differ from the number shown in the figure, the configuration may not include some of the types of predicted MVs shown in the figure, or it may include predicted MVs other than those shown in the figure.

[0146] Alternatively, the final MV may be determined by performing the DMVR process described later using the MV of the target block to be encoded derived by merge mode.

[0147] Here, we will explain an example of determining the MV using DMVR processing.

[0148] Figure 9C is a conceptual diagram illustrating the overview of DMVR processing.

[0149] First, the optimal MVP set for the block to be processed is used as a candidate MV. According to the candidate MV, reference pixels are obtained from the first reference picture, which is a processed picture in the L0 direction, and the second reference picture, which is a processed picture in the L1 direction, and a template is generated by taking the average of each reference pixel.

[0150] Next, using the template, the surrounding regions of candidate MVs for the first and second reference pictures are searched, and the MV with the lowest cost is determined as the final MV. The cost value is calculated using the difference between each pixel value of the template and each pixel value of the search region, as well as the MV value, etc.
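
The DMVR steps in paragraphs [0149] and [0150] can be sketched as below. This is a simplified illustration under assumptions: MVs are integer pairs, the template average uses rounding to nearest, the cost is a SAD only (the MV-dependent cost term mentioned above is omitted), and the search window is plus or minus one sample. All names are hypothetical.

```python
def fetch(picture, x, y, w, h):
    """Return the w x h block whose top-left sample is at (x, y)."""
    return [row[x:x + w] for row in picture[y:y + h]]

def average_template(block0, block1):
    """Template: sample-wise average of the L0 and L1 reference blocks."""
    return [[(a + b + 1) >> 1 for a, b in zip(r0, r1)]
            for r0, r1 in zip(block0, block1)]

def sad(a, b):
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def refine(picture, template, pos, size, mv, search_range=1):
    """Search around mv in one reference picture for the lowest template cost."""
    x, y = pos
    w, h = size
    best_mv = mv
    best_cost = sad(template, fetch(picture, x + mv[0], y + mv[1], w, h))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            cand = (mv[0] + dx, mv[1] + dy)
            c = sad(template, fetch(picture, x + cand[0], y + cand[1], w, h))
            if c < best_cost:
                best_mv, best_cost = cand, c
    return best_mv

def dmvr(ref0, ref1, pos, size, mv0, mv1):
    """Build the template from both references, then refine each MV against it."""
    x, y = pos
    w, h = size
    template = average_template(fetch(ref0, x + mv0[0], y + mv0[1], w, h),
                                fetch(ref1, x + mv1[0], y + mv1[1], w, h))
    return (refine(ref0, template, pos, size, mv0),
            refine(ref1, template, pos, size, mv1))

picture = [[(7 * x + 13 * y) % 31 for x in range(12)] for y in range(12)]
refined = dmvr(picture, picture, (4, 4), (2, 2), (1, 0), (1, 0))
```

In this toy call both references are identical, so the template already matches the candidate positions and the MVs are left unchanged.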

[0151] Note that the general outline of the processing described here is basically the same for both the encoding and decoding devices.

[0152] Note that any process that can explore the vicinity of a candidate MV and derive the final MV may be used instead of the exact process described here.

[0153] Here, we will explain the mode for generating predictive images using LIC processing.

[0154] Figure 9D is a diagram illustrating the outline of a predictive image generation method using luminance correction processing by LIC processing.

[0155] First, a motion vector (MV) is derived in order to obtain, from the reference picture, which is an encoded picture, the reference image corresponding to the block to be encoded.

[0156] Next, for the block to be encoded, information indicating how the luminance values have changed between the reference picture and the picture to be encoded is extracted using the luminance pixel values of the left-adjacent and top-adjacent encoded surrounding reference regions, and the luminance pixel values at the equivalent positions in the reference picture specified by MV, and a luminance correction parameter is calculated.

[0157] By performing luminance correction processing on the reference image within the reference picture specified by the MV using the luminance correction parameter, a predicted image for the block to be encoded is generated.
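
One possible way to realize the parameter calculation in paragraphs [0156] and [0157] is a linear model fitted by least squares. The source does not state the form of the luminance correction parameter, so the model cur = a * ref + b, the least-squares fit, and the 8-bit clipping are all assumptions of this sketch, as are the function names.

```python
def lic_parameters(cur_neighbors, ref_neighbors):
    """Fit cur = a * ref + b over the neighbouring reference-region samples."""
    n = len(cur_neighbors)
    sx, sy = sum(ref_neighbors), sum(cur_neighbors)
    sxx = sum(r * r for r in ref_neighbors)
    sxy = sum(r * c for r, c in zip(ref_neighbors, cur_neighbors))
    denom = n * sxx - sx * sx
    if denom == 0:  # flat neighbourhood: fall back to an offset-only model
        return 1.0, (sy - sx) / n
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b

def apply_lic(ref_block, a, b, max_val=255):
    """Apply the correction to the reference image and clip to the sample range."""
    return [[min(max_val, max(0, round(a * p + b))) for p in row] for row in ref_block]

# Neighbouring samples where the current picture is twice as bright plus 5.
a, b = lic_parameters([25, 45, 65, 85], [10, 20, 30, 40])
predicted = apply_lic([[10, 200]], a, b)
```

The neighbouring samples are already reconstructed on both sides, so the decoding device can derive the same parameters without extra signaling.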

[0158] Note that the shape of the surrounding reference region in Figure 9D is just one example, and other shapes may be used.

[0159] Furthermore, while this explanation describes the process of generating a predicted image from a single reference picture, the process is similar when generating predicted images from multiple reference pictures. In this case, the same brightness correction process is applied to each reference image obtained from the respective reference picture before generating the predicted image.

[0160] One method for determining whether or not to apply LIC processing is to use a signal called lic_flag, which indicates whether or not to apply LIC processing. Specifically, in an encoding device, it is determined whether or not the block to be encoded belongs to a region where brightness changes are occurring. If it belongs to a region where brightness changes are occurring, the value of lic_flag is set to 1 and LIC processing is applied and encoding is performed. If it does not belong to a region where brightness changes are occurring, the value of lic_flag is set to 0 and encoding is performed without applying LIC processing. On the other hand, in a decoding device, the lic_flag written in the stream is decoded, and the device switches whether or not to apply LIC processing according to its value and performs decoding.

[0161] Another way to determine whether to apply LIC processing is, for example, by checking whether LIC processing has been applied to surrounding blocks. A specific example is that if the block to be encoded is in merge mode, during the MV derivation in merge mode processing, it is determined whether the surrounding encoded blocks selected were encoded with LIC processing. Based on this result, the application of LIC processing is switched, and encoding is performed accordingly. In this example, the decoding process is exactly the same.

[0162] [Overview of the Decoding Device] Next, an overview of a decoding device capable of decoding the encoded signal (encoded bitstream) output from the above-mentioned encoding device 100 will be described. Figure 10 is a block diagram showing the functional configuration of the decoding device 200 according to Embodiment 1. The decoding device 200 is a video / image decoding device that decodes video / images in block units.

[0163] As shown in Figure 10, the decoding device 200 includes an entropy decoding unit 202, an inverse quantization unit 204, an inverse transform unit 206, an adder unit 208, a block memory 210, a loop filter unit 212, a frame memory 214, an intra prediction unit 216, an inter prediction unit 218, and a prediction control unit 220.

[0164] The decoding device 200 can be implemented, for example, by a general-purpose processor and memory. In this case, when the software program stored in memory is executed by the processor, the processor functions as the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the adder unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220. Alternatively, the decoding device 200 may be implemented as one or more dedicated electronic circuits corresponding to the entropy decoding unit 202, the inverse quantization unit 204, the inverse transform unit 206, the adder unit 208, the loop filter unit 212, the intra prediction unit 216, the inter prediction unit 218, and the prediction control unit 220.

[0165] The following describes each component included in the decoding device 200.

[0166] [Entropy Decoding Unit] The entropy decoding unit 202 entropy-decodes the encoded bitstream. Specifically, the entropy decoding unit 202 arithmetically decodes the encoded bitstream into a binary signal, for example. Then, the entropy decoding unit 202 debinarizes the binary signal. As a result, the entropy decoding unit 202 outputs the quantization coefficients in block units to the inverse quantization unit 204.

[0167] [Inverse Quantization Unit] The inverse quantization unit 204 inversely quantizes the quantization coefficients of the block to be decoded (hereinafter referred to as the current block), which are input from the entropy decoding unit 202. Specifically, for each quantization coefficient of the current block, the inverse quantization unit 204 inversely quantizes the quantization coefficient based on the quantization parameter corresponding to that quantization coefficient. The inverse quantization unit 204 then outputs the inversely quantized quantization coefficients (i.e., transform coefficients) of the current block to the inverse transform unit 206.

[0168] [Inverse Transform Unit] The inverse transform unit 206 restores the prediction error by inversely transforming the transform coefficients, which are input from the inverse quantization unit 204.

[0169] For example, if the information decoded from the encoded bitstream indicates that EMT or AMT should be applied (e.g., the AMT flag is true), the inverse transform unit 206 inversely transforms the transform coefficients of the current block based on the decoded information indicating the transform type.

[0170] For example, if the information decoded from the encoded bitstream indicates that NSST should be applied, the inverse transform unit 206 applies an inverse re-transform to the transform coefficients.

[0171] [Adder Unit] The adder unit 208 reconstructs the current block by adding the prediction error, which is input from the inverse transform unit 206, and the prediction samples, which are input from the prediction control unit 220. The adder unit 208 then outputs the reconstructed block to the block memory 210 and the loop filter unit 212.

[0172] [Block memory] The block memory 210 is a storage unit for storing blocks that are referenced in intra prediction and are located within the picture to be decoded (hereinafter referred to as the current picture). Specifically, the block memory 210 stores the reconstructed blocks output from the adder unit 208.

[0173] [Loop Filter Unit] The loop filter unit 212 applies a loop filter to the block reconstructed by the adder unit 208 and outputs the filtered reconstructed block to the frame memory 214, a display device, and the like.

[0174] If the information indicating ALF on / off decoded from the encoded bitstream indicates that ALF is on, one filter is selected from among several filters based on the direction and activity of the local gradient, and the selected filter is applied to the reconstructed block.

[0175] [Frame memory] The frame memory 214 is a storage unit for storing reference pictures used for inter prediction, and is sometimes called a frame buffer. Specifically, the frame memory 214 stores the reconstructed blocks filtered by the loop filter unit 212.

[0176] [Intra Prediction Unit] The intra-prediction unit 216 generates a prediction signal (intra-prediction signal) by performing intra-prediction based on the intra-prediction mode decoded from the encoded bitstream, and by referring to the blocks in the current picture stored in the block memory 210. Specifically, the intra-prediction unit 216 generates an intra-prediction signal by performing intra-prediction by referring to samples (e.g., luminance values, chrominance values) of blocks adjacent to the current block, and outputs the intra-prediction signal to the prediction control unit 220.

[0177] Furthermore, if an intra-prediction mode that references a luminance block is selected in the intra-prediction of a color difference block, the intra-prediction unit 216 may predict the color difference component of the current block based on the luminance component of the current block.

[0178] Furthermore, if the information decoded from the encoded bitstream indicates the application of PDPC, the intra-prediction unit 216 corrects the pixel value after intra-prediction based on the gradient of the reference pixels in the horizontal / vertical directions.

[0179] [Inter Prediction Unit] The inter-prediction unit 218 predicts the current block by referring to a reference picture stored in the frame memory 214. Prediction is performed in units of the current block or sub-blocks within the current block (e.g., 4x4 blocks). For example, the inter-prediction unit 218 generates an inter-prediction signal for the current block or sub-block by performing motion compensation using motion information (e.g., motion vectors) decoded from the encoded bitstream, and outputs the inter-prediction signal to the prediction control unit 220.

[0180] Furthermore, if the information decoded from the encoded bitstream indicates that OBMC mode should be applied, the inter-prediction unit 218 generates an inter-prediction signal using not only the motion information of the current block obtained by motion search, but also the motion information of adjacent blocks.

[0181] Furthermore, if the information decoded from the encoded bitstream indicates that FRUC mode should be applied, the inter-prediction unit 218 derives motion information by performing a motion search according to the pattern matching method (bilateral matching or template matching) decoded from the encoded bitstream. Then, the inter-prediction unit 218 performs motion compensation using the derived motion information.

[0182] Furthermore, when the BIO mode is applied, the inter-prediction unit 218 derives motion vectors based on a model that assumes uniform linear motion. Also, if the information decoded from the encoded bitstream indicates that the affine motion compensation prediction mode should be applied, the inter-prediction unit 218 derives motion vectors on a sub-block basis based on the motion vectors of multiple adjacent blocks.

[0183] [Prediction Control Unit] The prediction control unit 220 selects either the intra-prediction signal or the inter-prediction signal and outputs the selected signal as the prediction signal to the adder 208.

[0184] [Comparative Example] Before describing the inter-screen prediction processing according to this embodiment, we will first describe an example of inter-screen prediction processing that does not use the method of this embodiment.

[0185] First, Comparative Example 1 will be described. Figure 11 is a flowchart of the inter-frame prediction processing in prediction block units in the video encoding method and video decoding method according to Comparative Example 1. The processing shown in Figure 11 is repeated in prediction block units, which are the processing units of inter-frame prediction processing. In the following, the operation of the inter-prediction unit 126 included in the encoding device 100 will be mainly described, but the operation of the inter-prediction unit 218 included in the decoding device 200 is similar.

[0186] If the FRUC control information indicates 0 (0 in S101), the inter-prediction unit 126 derives motion vectors (MV) in units of prediction blocks according to the normal inter-screen prediction method (S102). Here, the normal inter-screen prediction method is a conventional method that does not use the FRUC method, for example, a method in which motion vectors are derived on the encoding side and information indicating the derived motion vectors is transmitted from the encoding side to the decoding side.

[0187] Next, the inter-prediction unit 126 acquires an inter-screen predicted image by performing motion compensation on a prediction block basis using the motion vector of the prediction block basis (S103).

[0188] On the other hand, if the FRUC control information indicates 1 (1 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the template FRUC method (S104). Subsequently, the inter-prediction unit 126 derives motion vectors in units of sub-blocks obtained by dividing the prediction block, according to the template FRUC method (S105).

[0189] On the other hand, if the FRUC control information indicates 2 (2 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the bilateral FRUC method (S106). Subsequently, the inter-prediction unit 126 derives motion vectors in units of sub-blocks obtained by dividing the prediction block according to the bilateral FRUC method (S107).

[0190] Then, the inter-prediction unit 126 derives motion vectors for each sub-block according to the template FRUC method or the bilateral FRUC method, and then performs motion compensation for each sub-block using the derived motion vectors for each sub-block to acquire an inter-screen prediction image (S108).
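
The branching of Figure 11 described in paragraphs [0186] to [0190] can be summarized as a small dispatch sketch. The function name and the step labels are illustrative only; the control values 0, 1 and 2 follow the text above.

```python
from typing import List

def comparative_example_1_steps(fruc_control: int) -> List[str]:
    """Return the ordered steps of Figure 11 for one prediction block."""
    if fruc_control == 0:
        return ["S102: derive block MV (normal inter prediction)",
                "S103: motion compensation in prediction block units"]
    if fruc_control == 1:
        derive = ["S104: derive block MV (template FRUC)",
                  "S105: derive sub-block MVs (template FRUC)"]
    elif fruc_control == 2:
        derive = ["S106: derive block MV (bilateral FRUC)",
                  "S107: derive sub-block MVs (bilateral FRUC)"]
    else:
        raise ValueError("unknown FRUC control information")
    return derive + ["S108: motion compensation in sub-block units"]

steps_normal = comparative_example_1_steps(0)
steps_template = comparative_example_1_steps(1)
steps_bilateral = comparative_example_1_steps(2)
```

The FRUC branches end in sub-block motion compensation, while the normal branch ends in block-level motion compensation, so the two paths use different processing units.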

[0191] Thus, in FRUC processing, by deriving motion vectors at the block level and then correcting the motion vectors at the sub-block level, it becomes possible to track fine movements. This improves encoding efficiency. On the other hand, it may not be able to adequately handle blocks where deformation occurs at the pixel level.

[0192] Next, Comparative Example 2 will be described. Figure 12 is a flowchart of the inter-screen prediction processing at the prediction block level in the video encoding method and video decoding method according to Comparative Example 2. In Comparative Example 2, BIO processing is used as the motion compensation processing. In other words, in the processing shown in Figure 12, steps S103 and S108 are changed to steps S103A and S108A compared to the processing shown in Figure 11.

[0193] If the FRUC control information indicates 0 (0 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the normal inter-screen prediction method (S102). Next, the inter-prediction unit 126 acquires an inter-screen prediction image by performing motion compensation by BIO processing in units of prediction blocks using the motion vectors in units of prediction blocks (S103A).

[0194] On the other hand, if the FRUC control information indicates 1 (1 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the template FRUC method (S104). Subsequently, the inter-prediction unit 126 derives motion vectors in units of sub-blocks obtained by dividing the prediction block, according to the template FRUC method (S105).

[0195] On the other hand, if the FRUC control information indicates 2 (2 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the bilateral FRUC method (S106). Subsequently, the inter-prediction unit 126 derives motion vectors in units of sub-blocks obtained by dividing the prediction block according to the bilateral FRUC method (S107).

[0196] Then, the inter-prediction unit 126 derives motion vectors for each subblock according to the template FRUC method or the bilateral FRUC method, and then uses the derived motion vectors for each subblock to perform motion compensation by BIO processing on a subblock basis, thereby acquiring an inter-screen predicted image (S108A).

[0197] Thus, in Comparative Example 2, the inter-prediction unit 126 can correct the predicted image on a pixel-by-pixel basis by performing BIO processing after FRUC processing. This may improve encoding efficiency even for blocks where deformation occurs.

[0198] On the other hand, there is a problem in that the processing volume increases because both FRUC processing and BIO processing are performed.

[0199] Furthermore, in the standard inter-screen prediction method, BIO processing is performed at the prediction block level, while in the FRUC method, BIO processing is performed at the sub-block level. Thus, because the units of motion vectors used as input for BIO processing differ between the standard inter-screen prediction method and the FRUC method, there is a problem in that it becomes necessary to implement two types of BIO processing functions.

[0200] [Inter Prediction Processing] The inter-frame prediction processing performed by the inter-prediction unit 126 according to this embodiment will now be described. The inter-prediction unit 126 can execute at least two motion vector derivation methods in inter-frame prediction processing: a normal inter-frame prediction method and the FRUC method. In the normal inter-frame prediction method, information regarding the motion vector of the block to be processed is encoded into a stream. In the FRUC method, information regarding the motion vector of the block to be processed is not encoded into a stream, and the motion vector is derived using a reconstructed image of the processed region and a processed reference picture by a method common to the encoding side and the decoding side.

[0201] The inter-prediction unit 126 further performs BIO processing, in which a predicted image is obtained by performing motion compensation for each prediction block using a motion vector and a processed reference picture, local motion estimates are derived by obtaining luminance gradient values, and a corrected predicted image is generated using the derived local motion estimates. In the FRUC method, the inter-prediction unit 126 performs processing in prediction block units and derives a motion vector for each prediction block. In addition, in BIO processing, regardless of which motion vector derivation method is used, the inter-prediction unit 126 always takes the motion vector of each prediction block as input and generates a predicted image using processing common to all prediction blocks.

[0202] The following describes the inter-frame prediction processing by the inter-prediction unit 126 according to this embodiment. Figure 13 is a flowchart of the inter-frame prediction processing in prediction block units in the video encoding method and video decoding method according to this embodiment. The processing shown in Figure 13 is repeated in prediction block units, which are the processing units of inter-frame prediction processing. In the following, the operation of the inter-prediction unit 126 included in the encoding device 100 will be mainly described, but the operation of the inter-prediction unit 218 included in the decoding device 200 is similar.

[0203] If the FRUC control information indicates 0 (0 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the normal inter-screen prediction method, similar to the process shown in Figure 12 (S102). Next, the inter-prediction unit 126 acquires an inter-screen prediction image by performing motion compensation by BIO processing in units of prediction blocks using the motion vectors in units of prediction blocks (S103A).

[0204] On the other hand, if the FRUC control information indicates 1 (1 in S101), the inter-prediction unit 126 derives motion vectors for each prediction block according to the template FRUC method (S104). Also, if the FRUC control information indicates 2 (2 in S101), the inter-prediction unit 126 derives motion vectors for each prediction block according to the bilateral FRUC method (S106). Note that in the process shown in Figure 13, unlike the process shown in Figure 12, the inter-prediction unit 126 does not derive motion vectors for each sub-block when the FRUC method is used.

[0205] Then, the inter-prediction unit 126 derives motion vectors for each prediction block according to the template FRUC method or the bilateral FRUC method, and then uses the derived motion vectors for each prediction block to perform motion compensation by BIO processing for each prediction block, thereby acquiring an inter-screen prediction image (S103A).

[0206] Thus, in this embodiment, the inter-prediction unit 126 derives motion vectors for each prediction block, regardless of whether the FRUC control information indicates the normal inter-screen prediction method, the template FRUC method, or the bilateral FRUC method. The inter-prediction unit 126 then performs BIO processing for each prediction block. In other words, the processing unit is the prediction block unit in all cases, and the processing unit remains the same.
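As a non-limiting illustration, the flow of Figure 13 can be sketched in Python as follows. The names `FrucMode`, `predict_block_fig13`, `derive_mv`, and `bio_motion_compensation` are hypothetical and do not appear in this disclosure; the callables passed in merely stand in for the actual motion vector derivation and BIO processing.

```python
from enum import IntEnum


class FrucMode(IntEnum):
    """FRUC control information values (example numbering from the text)."""
    NORMAL = 0
    TEMPLATE = 1
    BILATERAL = 2


def predict_block_fig13(fruc_mode, derive_mv, bio_motion_compensation):
    """Return the predicted image for one prediction block (Figure 13 flow).

    derive_mv: dict mapping FrucMode to a callable that returns a
        prediction-block-level motion vector (hypothetical hook).
    bio_motion_compensation: callable that takes that motion vector and
        returns the predicted image (hypothetical hook).
    """
    mv = derive_mv[FrucMode(fruc_mode)]()   # S102, S104 or S106
    return bio_motion_compensation(mv)      # S103A, always per prediction block


# Usage with stand-in callables that only record which step ran.
calls = []
derive = {
    FrucMode.NORMAL: lambda: calls.append("S102") or (1, 0),
    FrucMode.TEMPLATE: lambda: calls.append("S104") or (2, 0),
    FrucMode.BILATERAL: lambda: calls.append("S106") or (3, 0),
}
result = predict_block_fig13(2, derive, lambda mv: ("BIO", mv))
```

Whatever the value of the FRUC control information, the only step that changes in this sketch is the motion vector derivation; the compensation step is shared.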

[0207] The FRUC control information numbers shown above are examples, and other numbers may be used. Furthermore, only one of the template FRUC method or the bilateral FRUC method may be used. Also, common processing can be used during encoding and decoding.

[0208] In Comparative Example 2, shown in Figure 12, both sub-block FRUC processing, which corrects motion vectors at the sub-block level, and BIO processing, which corrects the predicted image at the pixel level, are performed. Both sub-block FRUC processing and BIO processing are processes that perform corrections at a finer level than the predicted block, and they have similar properties and similar effects. In the process shown in Figure 13, these processes are consolidated into BIO processing. Furthermore, in the process shown in Figure 13, the processing load can be reduced by not performing the sub-block FRUC processing, which is computationally intensive. In addition, the BIO processing when the FRUC method is used has also been changed from sub-block level to predicted block level, which can reduce the processing load.

[0209] Thus, the processing according to this embodiment shown in Figure 13 has the potential to improve encoding efficiency even for blocks where deformation occurs, while suppressing an increase in processing load, compared to Comparative Example 2 shown in Figure 12.

[0210] Furthermore, in the process according to this embodiment shown in Figure 13, regardless of the value of the FRUC control information, BIO processing is performed using the motion vector of the predicted block as input. Therefore, compared to Comparative Example 2 shown in Figure 12, BIO processing using the motion vector of the subblock as input is unnecessary. This simplifies the implementation.

[0211] Furthermore, the BIO processing in step S103A does not need to be completely identical across multiple motion vector derivation methods. In other words, the BIO processing used may differ among some or all of the following cases: when the normal inter-screen prediction method is used, when the template FRUC method is used, and when the bilateral FRUC method is used.

[0212] Furthermore, for at least one of the multiple motion vector derivation methods, the BIO processing in step S103A may be replaced with another process (including a modified version of the BIO processing) that generates a predicted image while correcting the predicted image at a pixel-level or unit smaller than the predicted block. In this case as well, since the two processes with similar properties described above can be combined into one process, it may be possible to improve the coding efficiency even for blocks where deformation occurs, while suppressing the increase in processing load compared to Comparative Example 2.

[0213] Furthermore, for at least one of the multiple motion vector derivation methods, the BIO processing in step S103A may be replaced with another process that generates a predicted image while correcting the predicted image using motion vectors at the prediction block level as input. In this case as well, the process described above that uses motion vectors at the sub-block level as input becomes unnecessary, and the implementation can be simplified.

[0214] Furthermore, it may be possible to switch whether or not to perform BIO processing. For example, the inter-prediction unit 126 may perform the processing shown in Figure 13 when performing BIO processing, and the processing shown in Figure 11 when not performing BIO processing. Alternatively, even when not performing BIO processing, the inter-prediction unit 126 may skip the motion vector derivation processing for each sub-block, similar to Figure 13. In other words, the inter-prediction unit 126 may replace the processing in step S103A shown in Figure 13 with motion compensation processing for each prediction block that does not include BIO processing.

[0215] [Modified example of inter-screen prediction processing] The following describes a modified version of the inter-screen prediction process according to this embodiment. Figure 14 is a flowchart of the inter-screen prediction process in the video encoding method and video decoding method according to a modified version of this embodiment. The process shown in Figure 14 is repeated in units of prediction blocks, which are the processing units of the inter-screen prediction process.

[0216] If the FRUC control information indicates 0 (0 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the normal inter-screen prediction method, similar to the process shown in Figure 13 (S102). Next, the inter-prediction unit 126 acquires an inter-screen prediction image by performing motion compensation using BIO processing on a prediction block basis with respect to the motion vectors of the prediction blocks (S103A).

[0217] Furthermore, if the FRUC control information indicates 1 (1 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the template FRUC method, similar to the process shown in Figure 13 (S104). Next, the inter-prediction unit 126 acquires an inter-screen prediction image by performing motion compensation through BIO processing in units of prediction blocks using the motion vectors in units of prediction blocks (S103A).

[0218] On the other hand, if the FRUC control information indicates 2 (2 in S101), the inter-prediction unit 126 derives motion vectors in units of prediction blocks according to the bilateral FRUC method (S106). Subsequently, the inter-prediction unit 126 derives motion vectors in units of sub-blocks obtained by dividing the prediction block according to the bilateral FRUC method (S107).

[0219] Next, the inter-prediction unit 126 acquires an inter-screen predicted image by performing motion compensation on a sub-block basis using the derived sub-block motion vectors (S108). Note that the motion compensation here is normal motion compensation, not BIO processing.

[0220] Thus, when the normal inter-screen prediction method is used (0 in S101) and when the template FRUC method is used (1 in S101), the inter-prediction unit 126 acquires an inter-screen prediction image by performing motion compensation using BIO processing on a prediction block basis, using the derived motion vectors on a prediction block basis. On the other hand, when the bilateral FRUC method is used (2 in S101), the inter-prediction unit 126 acquires an inter-screen prediction image by performing normal motion compensation without applying BIO processing, using motion vectors on a sub-block basis.

[0221] Note that the FRUC control information number is just an example, and other numbers may be used. Also, the inter-prediction unit 218 included in the decoding device 200 performs the same processing as the inter-prediction unit 126 included in the encoding device 100.

[0222] Furthermore, in this case, the inter-prediction unit 126 derives motion vectors on a prediction block basis and performs BIO processing on a prediction block basis when the template FRUC method is used, and derives motion vectors on a sub-block basis and performs normal motion compensation processing on a sub-block basis when the bilateral FRUC method is used. However, it is also possible to derive motion vectors on a prediction block basis and perform BIO processing on a prediction block basis when the bilateral FRUC method is used, and to derive motion vectors on a sub-block basis and perform normal motion compensation processing on a sub-block basis when the template FRUC method is used.
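The branching of Figure 14, including the option of swapping the roles of the two FRUC methods, can be summarized by the following Python sketch. The function `plan_fig14` and its parameter `subblock_modes` are hypothetical names introduced only for illustration, and reusing the step labels S107 and S108 for the swapped case is an assumption.

```python
def plan_fig14(fruc_mode, subblock_modes=frozenset({2})):
    """Return the processing plan for one prediction block (Figure 14 flow).

    fruc_mode: 0 = normal, 1 = template FRUC, 2 = bilateral FRUC
        (example numbering from the text).
    subblock_modes: FRUC modes that use sub-block motion vectors and
        normal motion compensation instead of block-level BIO processing.
    """
    derive_step = {0: "S102", 1: "S104", 2: "S106"}[fruc_mode]
    if fruc_mode != 0 and fruc_mode in subblock_modes:
        # Block-level MV, then sub-block MVs, then normal motion compensation.
        return {"mv_unit": "sub-block",
                "steps": [derive_step, "S107", "S108"],
                "bio": False}
    # Block-level MV followed by block-level BIO processing.
    return {"mv_unit": "prediction block",
            "steps": [derive_step, "S103A"],
            "bio": True}


default_bilateral = plan_fig14(2)
default_template = plan_fig14(1)
swapped_template = plan_fig14(1, subblock_modes={1})
```

With the default argument, only the bilateral FRUC method takes the sub-block path; passing `subblock_modes={1}` models the swapped arrangement described above.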

[0223] Furthermore, the effect of improving encoding efficiency by deriving motion vectors at the sub-block level is greater in the bilateral FRUC method than in the template FRUC method. Therefore, as shown in Figure 14, it is preferable to derive motion vectors at the sub-block level in the bilateral FRUC method.

[0224] Furthermore, while both the template FRUC method and the bilateral FRUC method are used in Figure 14, only one of them may be used. In this case, when the FRUC method is used, the inter-prediction unit 126 derives motion vectors in sub-block units and performs normal motion compensation processing.

[0225] Furthermore, the BIO processing in step S103A does not need to be exactly the same across multiple motion vector derivation methods. In other words, the BIO processing used may differ between the case where the standard inter-screen prediction method is used and the case where the template FRUC method is used.

[0226] Furthermore, for at least one of the multiple motion vector derivation methods, the BIO processing in step S103A may be replaced with another process (including a variation of the BIO processing) that generates a predicted image while correcting the predicted image at a pixel-level or unit smaller than the predicted block.

[0227] Furthermore, for at least one of the multiple motion vector derivation methods, the BIO processing in step S103A may be replaced with another process that generates a predicted image while correcting the predicted image using the motion vectors of the predicted block units as input.

[0228] The process shown in Figure 14 applies only one of two methods in the FRUC scheme: sub-block FRUC processing, which corrects motion vectors at the sub-block level, or BIO processing, which corrects the predicted image at the pixel level. As a result, the process shown in Figure 14 uses roughly the same amount of processing power as the process shown in Figure 13, while allowing the use of a method that has a significant synergistic effect on each FRUC scheme. This improves encoding efficiency.

[0229] As described above, the encoding device 100 performs the processing shown in Figure 15. The processing in the decoding device 200 is the same. In the first operating mode (first operating mode in S111), the encoding device 100 derives a first motion vector in prediction block units obtained by dividing the image contained in the moving image using a first inter-frame prediction method (S112), and performs a first motion compensation processing using the derived first motion vector in each prediction block unit (S113). Here, the first motion compensation processing is, for example, a BIO process, which is a motion compensation process that generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector.

[0230] Furthermore, in the second operating mode (second operating mode in S111), the encoding device 100 derives a second motion vector in sub-block units obtained by dividing the prediction block using the second inter-screen prediction method (S114), and performs a second motion compensation process using the second motion vector in sub-block units (S115). Here, the second motion compensation process is, for example, a motion compensation process that does not apply BIO processing, and is a motion compensation process that generates a predicted image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0231] According to this, in the first operating mode, the encoding device 100 performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the encoding device 100 performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the encoding device 100 can improve encoding efficiency. In this way, the encoding device 100 can reduce the processing load while suppressing a decrease in encoding efficiency.
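The difference in processing units between the two operating modes can be illustrated with a short Python sketch. The 4x4 sub-block size and the names `split_into_subblocks` and `processing_units` are assumptions made for this example; the text above does not fix a sub-block size.

```python
def split_into_subblocks(block, sub=4):
    """Tile a prediction block (x, y, width, height) into sub-blocks."""
    x, y, w, h = block
    return [(sx, sy, min(sub, x + w - sx), min(sub, y + h - sy))
            for sy in range(y, y + h, sub)
            for sx in range(x, x + w, sub)]


def processing_units(mode, block, sub=4):
    """Return the regions for which a motion vector is derived.

    "first":  one motion vector per prediction block (S112, S113).
    "second": one motion vector per sub-block (S114, S115).
    """
    if mode == "first":
        return [block]
    if mode == "second":
        return split_into_subblocks(block, sub)
    raise ValueError(f"unknown operating mode: {mode!r}")


block = (32, 16, 16, 16)  # x, y, width, height of a 16x16 prediction block
first_units = processing_units("first", block)
second_units = processing_units("second", block)
```

Under these assumptions, a 16x16 prediction block needs one motion vector derivation in the first operating mode and sixteen in the second, which is why the first mode pairs the coarser derivation with the gradient-based correction.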

[0232] For example, the first inter-frame prediction method differs from the second inter-frame prediction method. Specifically, the second inter-frame prediction method is an inter-frame prediction method that uses the degree of fit of two reconstructed images from two different regions within two separate pictures, such as the FRUC method.

[0233] According to this, the inter-screen prediction method, which significantly improves coding efficiency by calculating motion vectors at the sub-block level, can be implemented at the sub-block level. Therefore, coding efficiency can be improved.

[0234] For example, the first inter-frame prediction method is one of the following: (1) a third inter-frame prediction method (e.g., template FRUC method) that uses the degree of fit between the reconstructed image of a region in a target picture adjacent to the target prediction block and the reconstructed image of a region in a reference picture; or (2) a fourth inter-frame prediction method (e.g., bilateral FRUC method) that uses the degree of fit between two reconstructed images of two regions in two different reference pictures. The second inter-frame prediction method is the other of the third and fourth inter-frame prediction methods.

[0235] For example, the first inter-frame prediction method is the third inter-frame prediction method (e.g., the template FRUC method), and the second inter-frame prediction method is the fourth inter-frame prediction method (e.g., the bilateral FRUC method).

[0236] According to this, the inter-screen prediction method, which significantly improves coding efficiency by calculating motion vectors at the sub-block level, can be implemented at the sub-block level. Therefore, coding efficiency can be improved.

[0237] For example, the first inter-picture prediction method is an inter-picture prediction method (normal inter-picture prediction method) that uses the degree of fit between a target prediction block and a reconstructed image of a region included in a reference picture. The encoding device 100 generates an encoded bitstream including information for specifying the derived first motion vector. Also, in the first inter-picture prediction method, the decoding device 200 acquires information for specifying the first motion vector in units of prediction blocks from the encoded bitstream, and derives the first motion vector using the information.

[0238] [Template FRUC Method and Bilateral FRUC Method] Hereinafter, a method for deriving a motion vector according to the template FRUC method or the bilateral FRUC method will be described. The method for deriving a motion vector in units of blocks and the method for deriving a motion vector in units of sub-blocks are basically the same. In the following description, the method for deriving a motion vector of a block and the method for deriving a motion vector of a sub-block will be described as a method for deriving a motion vector of a processing target region.

[0239] FIG. 16 is a conceptual diagram showing the template FRUC method used for deriving a motion vector of a processing target region in the encoding device 100 and the decoding device 200. In the template FRUC method, a motion vector is derived using a common method between the encoding device 100 and the decoding device 200 without encoding and decoding information on the motion vector of the processing target region.

[0240] Also, in the template FRUC method, a motion vector is derived using a reconstructed image of an adjacent region, which is a region adjacent to the processing target region, and a reconstructed image of a corresponding adjacent region, which is a region in the reference picture.

[0241] Here, the adjacent region is one or both of a region adjacent to the left and a region adjacent to the top of the processing target region.

[0242] Furthermore, the corresponding adjacent region is a region specified using a candidate motion vector, which is a candidate for the motion vector of the region to be processed. Specifically, the corresponding adjacent region is the region pointed to by the candidate motion vector from the adjacent region. Also, the relative position of the corresponding adjacent region with respect to the corresponding region pointed to by the candidate motion vector from the region to be processed is equal to the relative position of the adjacent region with respect to the region to be processed.
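A minimal Python sketch of how one candidate motion vector could be evaluated in the template FRUC method is given below. It assumes integer-pel motion vectors, a template one sample thick, the sum of absolute differences (SAD) as the measure of fit, and regions that lie entirely inside both pictures. The names `crop`, `sad`, and `template_cost` are hypothetical.

```python
def crop(picture, x, y, w, h):
    """Return the w-by-h region whose top-left sample is (x, y)."""
    return [row[x:x + w] for row in picture[y:y + h]]


def sad(a, b):
    """Sum of absolute differences between two equally sized regions."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def template_cost(cur_pic, ref_pic, block, cand_mv, t=1):
    """Compare the adjacent regions with the corresponding adjacent regions."""
    x, y, w, h = block
    dx, dy = cand_mv
    # Upper adjacent region and the region the candidate MV points to from it.
    upper = sad(crop(cur_pic, x, y - t, w, t),
                crop(ref_pic, x + dx, y - t + dy, w, t))
    # Left adjacent region and the region the candidate MV points to from it.
    left = sad(crop(cur_pic, x - t, y, t, h),
               crop(ref_pic, x - t + dx, y + dy, t, h))
    return upper + left


def f(x, y):
    """Synthetic sample values used only for this example."""
    return x * x + 3 * y


# The current picture equals the reference picture displaced by (2, 1).
ref_pic = [[f(x, y) for x in range(12)] for y in range(12)]
cur_pic = [[f(x + 2, y + 1) for x in range(12)] for y in range(12)]
block = (4, 4, 4, 4)
cost_true = template_cost(cur_pic, ref_pic, block, (2, 1))
cost_zero = template_cost(cur_pic, ref_pic, block, (0, 0))
```

Because only already reconstructed samples are read, the same evaluation can be run by both the encoding device and the decoding device without signalling the motion vector.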

[0243] Figure 17 is a conceptual diagram showing the bilateral FRUC method used in the encoding device 100 and decoding device 200 to derive the motion vector of the processing area. In the bilateral FRUC method, similar to the template FRUC method, the motion vector is derived using a common method between the encoding device 100 and the decoding device 200 without encoding and decoding the motion vector information of the processing area.

[0244] Furthermore, in the bilateral FRUC method, motion vectors are derived using two reconstructed images of two regions in two reference pictures. For example, as shown in Figure 17, motion vectors are derived using the reconstructed image of the corresponding region in the first reference picture and the reconstructed image of the symmetric region in the second reference picture.

[0245] Here, the corresponding region and the symmetric region are regions specified using candidate motion vectors, which are candidates for the motion vectors of the region to be processed. Specifically, the corresponding region is the region pointed to from the region to be processed by the candidate motion vectors. The symmetric region is the region pointed to from the region to be processed by the symmetric motion vectors. The symmetric motion vectors are motion vectors that constitute a set of candidate motion vectors for bidirectional prediction. The symmetric motion vectors may also be motion vectors derived by scaling the candidate motion vectors.
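The evaluation of one candidate pair in the bilateral FRUC method can be sketched in the same way. The sketch assumes integer-pel motion vectors, SAD as the measure of fit, and regions inside both reference pictures; `crop`, `sad`, and `bilateral_cost` are hypothetical names.

```python
def crop(picture, x, y, w, h):
    """Return the w-by-h region whose top-left sample is (x, y)."""
    return [row[x:x + w] for row in picture[y:y + h]]


def sad(a, b):
    """Sum of absolute differences between two equally sized regions."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def bilateral_cost(ref0, ref1, block, mv_pair):
    """Compare the corresponding region (ref0) with the symmetric region (ref1)."""
    x, y, w, h = block
    (dx0, dy0), (dx1, dy1) = mv_pair
    corresponding = crop(ref0, x + dx0, y + dy0, w, h)
    symmetric = crop(ref1, x + dx1, y + dy1, w, h)
    return sad(corresponding, symmetric)


def f(x, y):
    """Synthetic sample values used only for this example."""
    return x * x + 3 * y


# Content moves 2 samples per picture: ref0 is one picture before the
# current picture and ref1 is one picture after it.
ref0 = [[f(x + 2, y) for x in range(16)] for y in range(8)]
ref1 = [[f(x - 2, y) for x in range(16)] for y in range(8)]
block = (6, 2, 4, 4)
cost_match = bilateral_cost(ref0, ref1, block, ((-2, 0), (2, 0)))
cost_zero = bilateral_cost(ref0, ref1, block, ((0, 0), (0, 0)))
```

Here the pair `((-2, 0), (2, 0))` follows the synthetic motion trajectory through the processing target region, so the two regions match exactly.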

[0246] Figure 18 is a flowchart showing the operation in which the inter-prediction unit 126 of the encoding device 100 derives motion vectors according to the template FRUC method or the bilateral FRUC method. The inter-prediction unit 218 of the decoding device 200 operates similarly to the inter-prediction unit 126 of the encoding device 100.

[0247] First, the inter-prediction unit 126 derives candidate motion vectors by referencing the motion vectors of one or more processed regions that are temporally or spatially surrounding the region to be processed.

[0248] In the bilateral FRUC method, the inter-prediction unit 126 derives candidate motion vectors for bidirectional prediction. That is, the inter-prediction unit 126 derives candidate motion vectors as a set of two motion vectors.

[0249] Specifically, in the bilateral FRUC method, if the motion vector of the processed region is a bidirectional prediction motion vector, the inter-prediction unit 126 derives the bidirectional prediction motion vector as is and uses it as a candidate motion vector for bidirectional prediction. If the motion vector of the processed region is a unidirectional prediction motion vector, the inter-prediction unit 126 may derive a candidate motion vector for bidirectional prediction by deriving a bidirectional prediction motion vector from the unidirectional prediction motion vector through scaling or the like.

[0250] More specifically, in the bilateral FRUC method, the inter-prediction unit 126 derives a motion vector referencing a second reference picture by scaling the motion vector referencing a first reference picture according to the display time interval. As a result, the inter-prediction unit 126 derives candidate motion vectors that constitute a pair of the unidirectional prediction motion vector and the scaled motion vector as candidate motion vectors for bidirectional prediction.
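The scaling according to the display time interval can be sketched as follows. The sketch assumes that each picture carries an integer display-order count and that the scaled components are rounded to the nearest integer; the function name `scale_mv` and its parameters are hypothetical, and a practical codec would typically use fixed-point arithmetic instead of `Fraction`.

```python
from fractions import Fraction


def scale_mv(mv, order_cur, order_ref_src, order_ref_dst):
    """Scale a motion vector from one reference picture to another.

    order_cur:     display order of the current picture
    order_ref_src: display order of the picture that `mv` refers to
    order_ref_dst: display order of the picture the result should refer to
    """
    td_src = order_ref_src - order_cur
    td_dst = order_ref_dst - order_cur
    if td_src == 0:
        raise ValueError("source reference has the same display order")
    factor = Fraction(td_dst, td_src)
    return tuple(round(component * factor) for component in mv)


# Current picture at display order 4, first reference at 2 (past),
# second reference at 8 (future).
mv_first = (8, -4)
mv_second = scale_mv(mv_first, order_cur=4, order_ref_src=2, order_ref_dst=8)
candidate_pair = (mv_first, mv_second)
```

In this example the second reference picture is twice as far away and on the opposite side in display order, so the scaled vector is the original multiplied by -2.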

[0251] Alternatively, in the bilateral FRUC method, the inter-prediction unit 126 may derive the motion vector of the processed region as a candidate motion vector if the motion vector of the processed region is a bidirectional prediction motion vector. The inter-prediction unit 126 does not need to derive the motion vector of the processed region as a candidate motion vector if the motion vector of the processed region is a unidirectional prediction motion vector.

[0252] In the template FRUC method, regardless of whether the motion vector of the processed region is a bidirectional or unidirectional motion vector, the inter-prediction unit 126 derives the motion vector of the processed region as a candidate motion vector.

[0253] The inter-prediction unit 126 then generates a candidate motion vector list composed of candidate motion vectors (S201). Here, if the processing area is a sub-block, that is, if motion vectors are derived for each sub-block, the inter-prediction unit 126 may include block-level motion vectors as candidate motion vectors in the candidate motion vector list. In this case, the inter-prediction unit 126 may include block-level motion vectors as the highest-priority candidate motion vectors in the candidate motion vector list.

[0254] Furthermore, in the bilateral FRUC method, if the motion vectors for each block are unidirectional prediction motion vectors, the inter-prediction unit 126 may derive candidate motion vectors for bidirectional prediction from the unidirectional prediction motion vectors by scaling or the like. For example, the inter-prediction unit 126 may derive candidate motion vectors for bidirectional prediction from the unidirectional prediction motion vectors by scaling or the like, similar to the case where the surrounding motion vectors are unidirectional prediction motion vectors.

[0255] Furthermore, the inter-prediction unit 126 may include candidate motion vectors derived from the unidirectional prediction motion vectors as candidate motion vectors for bidirectional prediction in the candidate motion vector list.

[0256] Alternatively, in the bilateral FRUC method, the inter-prediction unit 126 may include the block-unit motion vector as a candidate motion vector in the candidate motion vector list if the block-unit motion vector is a bidirectional prediction motion vector. The inter-prediction unit 126 may not include the block-unit motion vector as a candidate motion vector in the candidate motion vector list if the block-unit motion vector is a unidirectional prediction motion vector.
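One possible way to assemble the candidate motion vector list of step S201 for the bilateral FRUC method is sketched below. Several points are assumptions made only for this example: each candidate is a pair `(l0, l1)` in which a missing direction is `None` and at least one direction is present, a unidirectional vector is converted by simple mirroring (which presumes equal display time intervals on both sides), and duplicate candidates are removed. All function names are hypothetical.

```python
def mirror(mv):
    """Mirror a motion vector (stand-in for scaling with equal intervals)."""
    return (-mv[0], -mv[1])


def to_bidirectional(cand, uni_policy):
    """Turn (l0, l1) into a bidirectional pair, or None if it is skipped."""
    l0, l1 = cand
    if l0 is not None and l1 is not None:
        return cand                      # already bidirectional: use as is
    if uni_policy == "skip":
        return None                      # variant that drops unidirectional MVs
    return (l0, mirror(l0)) if l0 is not None else (mirror(l1), l1)


def build_candidate_list(neighbor_mvs, block_mv=None, uni_policy="scale"):
    """S201: ordered candidate list with the block-level MV first."""
    ordered = ([block_mv] if block_mv is not None else []) + list(neighbor_mvs)
    result = []
    for cand in ordered:
        bi = to_bidirectional(cand, uni_policy)
        if bi is not None and bi not in result:
            result.append(bi)
    return result


neighbors = [((1, 0), (-1, 0)), ((2, 2), None), ((1, 0), (-1, 0))]
block_level = ((2, 2), None)   # block-level MV used when processing a sub-block
scaled_list = build_candidate_list(neighbors, block_level, uni_policy="scale")
skipped_list = build_candidate_list(neighbors, block_level, uni_policy="skip")
```

The `uni_policy` argument switches between the two alternatives described above: deriving a bidirectional candidate from a unidirectional motion vector, or leaving such a motion vector out of the list.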

[0257] Then, the inter-prediction unit 126 selects the best candidate motion vector from one or more candidate motion vectors included in the candidate motion vector list (S202). At that time, the inter-prediction unit 126 calculates an evaluation value for each of the one or more candidate motion vectors according to the degree of fit of the two reconstructed images of the two evaluation target regions.

[0258] Specifically, in the template FRUC method, the two evaluation regions are the adjacent region and the corresponding adjacent region as shown in Figure 16, and in the bilateral FRUC method, the two evaluation regions are the corresponding region and the symmetric region as shown in Figure 17. As described above, the corresponding adjacent region used in the template FRUC method, and the corresponding region and symmetric region used in the bilateral FRUC method, are determined according to the candidate motion vector.

[0259] For example, the inter-prediction unit 126 calculates a better evaluation value the higher the degree of fit between the two reconstructed images of the two evaluation target regions. Specifically, the inter-prediction unit 126 derives the difference value between the two reconstructed images of the two evaluation target regions. Then, the inter-prediction unit 126 calculates an evaluation value using the difference value. For example, the inter-prediction unit 126 calculates a better evaluation value the smaller the difference value.

[0260] In addition, for calculating the evaluation value, not only the difference value but also other information may be used. That is, the inter prediction unit 126 may calculate the evaluation value using the difference value and other information. For example, the priority order of one or more candidate motion vectors and the amount of code based on the priority order may affect the evaluation value.

[0261] Then, the inter prediction unit 126 selects the candidate motion vector with the best evaluation value from among one or more candidate motion vectors as the best candidate motion vector.

[0262] Then, the inter prediction unit 126 derives the motion vector of the processing target area by searching the vicinity of the best candidate motion vector (S203).

[0263] That is, the inter prediction unit 126 calculates the evaluation value in the same way for the motion vector indicating the area around the area indicated by the best candidate motion vector. And when there is a motion vector with a better evaluation value than the best candidate motion vector, the inter prediction unit 126 updates the best candidate motion vector with the motion vector having a better evaluation value than the best candidate motion vector. Then, the inter prediction unit 126 derives the updated best candidate motion vector as the final motion vector of the processing target area.

[0264] Note that the inter prediction unit 126 may derive the best candidate motion vector as the final motion vector of the processing target area without performing the process of searching the vicinity of the best candidate motion vector (S203). Also, the best candidate motion vector is not limited to the candidate motion vector with the best evaluation value. One of one or more candidate motion vectors whose evaluation value is above a certain standard may be selected as the best candidate motion vector according to a predetermined priority order.
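Steps S202 and S203 can be sketched generically in Python as follows. The sketch assumes that a smaller evaluation value is better, that motion vectors are integer pairs, and that the vicinity search is repeated around each improved vector up to a fixed number of times; the repetition, the search range, and the names `select_best` and `refine` are assumptions. The `cost` function used in the example is a toy stand-in for the evaluation value.

```python
def select_best(candidates, cost):
    """S202: pick the candidate with the best (smallest) evaluation value.

    min() keeps the earliest entry among ties, so list order acts as
    the priority order.
    """
    return min(candidates, key=cost)


def refine(best, cost, search_range=1, max_steps=16):
    """S203: search the vicinity of the best candidate motion vector."""
    best_cost = cost(best)
    for _ in range(max_steps):
        cx, cy = best
        improved = False
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                cand = (cx + dx, cy + dy)
                c = cost(cand)
                if c < best_cost:        # update only on a strictly better value
                    best, best_cost, improved = cand, c, True
        if not improved:
            break
    return best


def cost(mv):
    """Toy evaluation value with its minimum at (5, -2)."""
    return abs(mv[0] - 5) + abs(mv[1] + 2)


candidates = [(0, 0), (4, -1), (9, 9)]
best_candidate = select_best(candidates, cost)
final_mv = refine(best_candidate, cost)
```

Skipping the call to `refine` corresponds to the variation in which the best candidate motion vector is used directly as the final motion vector.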

[0265] Furthermore, the processing related to the processing target region and the processed region here is, for example, encoding or decoding. More specifically, the processing related to the processing target region and the processed region may be a process for deriving motion vectors. Alternatively, the processing related to the processing target region and the processed region may be a reconstruction process.

[0266] [BIO processing] Figure 19 is a conceptual diagram showing the BIO processing in the encoding device 100 and the decoding device 200. In BIO processing, a predicted image of the target block is generated by referring to the spatial gradient of brightness in the image obtained by performing motion compensation of the target block using the motion vector of the target block.

[0267] Before BIO processing, two motion vectors for the block to be processed, the L0 motion vector (MV_L0) and the L1 motion vector (MV_L1), are derived. The L0 motion vector (MV_L0) is the motion vector for referencing the L0 reference picture, which is a processed picture, and the L1 motion vector (MV_L1) is the motion vector for referencing the L1 reference picture, which is a processed picture. The L0 reference picture and the L1 reference picture are two reference pictures that are referenced simultaneously in the bi-prediction processing of the block to be processed.

[0268] As a method for deriving the L0 motion vector (MV_L0) and L1 motion vector (MV_L1), the usual inter-frame prediction mode, merge mode, or FRUC mode may be used. For example, in the usual inter-frame prediction mode, the encoding device 100 derives the motion vector by performing motion detection using the image of the block to be processed, and the motion vector information is encoded. Also in the usual inter-frame prediction mode, the decoding device 200 derives the motion vector by decoding the motion vector information.

[0269] Then, in the BIO processing, an L0 predicted image is obtained by referencing an L0 reference picture and performing motion compensation for the block to be processed using an L0 motion vector (MV_L0). For example, an L0 predicted image may be obtained by applying a motion compensation filter to the image of the L0 reference pixel range, which includes the block and its surroundings, indicated in the L0 reference picture by the L0 motion vector (MV_L0) from the block to be processed.

[0270] Furthermore, an L0 gradient image is obtained that shows the spatial gradient of brightness at each pixel in the L0 prediction image. For example, the L0 gradient image is obtained by referencing the brightness of each pixel in the L0 reference pixel range, which includes the block and its surroundings, pointed to in the L0 reference picture from the processing target block by the L0 motion vector (MV_L0).

[0271] Furthermore, an L1 predicted image can be obtained by referencing an L1 reference picture and performing motion compensation for the block to be processed using an L1 motion vector (MV_L1). For example, an L1 predicted image may be obtained by applying a motion compensation filter to an image of the L1 reference pixel range that includes the block and its surroundings, as indicated in the L1 reference picture by the L1 motion vector (MV_L1) from the block to be processed.

[0272] Furthermore, an L1 gradient image is obtained that shows the spatial gradient of brightness at each pixel in the L1 prediction image. For example, the L1 gradient image is obtained by referencing the brightness of each pixel in the L1 reference pixel range, which includes the block and its surroundings, pointed to in the L1 reference picture from the block to be processed by the L1 motion vector (MV_L1).

[0273] Then, a local motion estimate is derived for each pixel in the block to be processed. Specifically, the pixel value of the corresponding pixel position in the L0 prediction image, the gradient value of the corresponding pixel position in the L0 gradient image, the pixel value of the corresponding pixel position in the L1 prediction image, and the gradient value of the corresponding pixel position in the L1 gradient image are used. The local motion estimate can also be called a corrected motion vector (corrected MV).

[0274] Then, for each pixel in the block to be processed, a pixel correction value is derived using the gradient value of the corresponding pixel position in the L0 gradient image, the gradient value of the corresponding pixel position in the L1 gradient image, and the local motion estimate. Then, for each pixel in the block to be processed, a predicted pixel value is derived using the pixel value of the corresponding pixel position in the L0 prediction image, the pixel value of the corresponding pixel position in the L1 prediction image, and the pixel correction value. This results in a predicted image to which BIO processing has been applied.

[0275] In other words, the predicted pixel values obtained from the pixel values at the corresponding pixel positions in the L0 predicted image and the L1 predicted image are corrected by the pixel correction value. To put it another way, the predicted image obtained from the L0 predicted image and the L1 predicted image is corrected using the spatial gradient of luminance in the L0 predicted image and the L1 predicted image.
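The relationship between the local motion estimate, the pixel correction value, and the predicted pixel value can be illustrated with a simplified model. The model is an assumption made for illustration and is not taken from this disclosure: the L0 and L1 predicted images are the true block S displaced by a small residual motion +v and -v respectively (equal display time intervals), I^(0) and I^(1) denote the L0 and L1 predicted images, b denotes the pixel correction value, and second-order Taylor expansion is used. Actual BIO processing may differ in sign conventions, weighting by display time interval, and fixed-point details.

```latex
\begin{aligned}
I^{(0)}(\mathbf{p}) &= S(\mathbf{p}+\mathbf{v}), \qquad
I^{(1)}(\mathbf{p}) = S(\mathbf{p}-\mathbf{v}) \\
I^{(0)} - I^{(1)} &\approx \mathbf{v}\cdot\bigl(\nabla I^{(0)} + \nabla I^{(1)}\bigr)
  && \text{(used to estimate the local motion } \mathbf{v}\text{)} \\
b &= \tfrac{1}{4}\,\mathbf{v}\cdot\bigl(\nabla I^{(1)} - \nabla I^{(0)}\bigr)
  && \text{(pixel correction value)} \\
\hat{S} &= \tfrac{1}{2}\bigl(I^{(0)} + I^{(1)}\bigr) + b
  && \text{(predicted pixel value)}
\end{aligned}
```

Under this model the plain average of the two predicted images carries an error of second order in v, and the gradient term b cancels it.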

[0276] Figure 20 is a flowchart showing the operation performed by the inter-screen prediction unit 126 of the encoding device 100 as BIO processing. The inter-screen prediction unit 218 of the decoding device 200 operates in the same manner as the inter-screen prediction unit 126 of the encoding device 100.

[0277] First, the inter-screen prediction unit 126 obtains an L0 prediction image by referencing the L0 reference picture using the L0 motion vector (MV_L0) (S401). Then, the inter-screen prediction unit 126 obtains an L0 gradient image by referencing the L0 reference picture using the L0 motion vector (S402).

[0278] Similarly, the inter-screen prediction unit 126 references the L1 reference picture using the L1 motion vector (MV_L1) to obtain the L1 predicted image (S401). Then, the inter-screen prediction unit 126 references the L1 reference picture using the L1 motion vector to obtain the L1 gradient image (S402).
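As a rough illustration of how a prediction image and its gradient image might be read from a reference picture, the following Python sketch uses central differences over a block and a one-sample margin around it. The central-difference filter, the integer-pel motion vector, and the name `gradient_images` are assumptions made for illustration; the text does not specify the gradient filter or the interpolation.

```python
def gradient_images(ref, top, left, height, width):
    """Return (pred, gx, gy) for the block at (top, left) in picture `ref`.

    `ref` is a 2D list of luma samples. (top, left) is the block position
    already displaced by an integer motion vector (an assumption; real
    codecs interpolate fractional positions). The block needs a one-sample
    margin inside `ref`, reflecting that the reference range covers the
    block and its surroundings.
    """
    pred, gx, gy = [], [], []
    for y in range(top, top + height):
        pred_row, gx_row, gy_row = [], [], []
        for x in range(left, left + width):
            pred_row.append(ref[y][x])
            # Assumed filter: central differences, horizontal then vertical.
            gx_row.append((ref[y][x + 1] - ref[y][x - 1]) / 2)
            gy_row.append((ref[y + 1][x] - ref[y - 1][x]) / 2)
        pred.append(pred_row)
        gx.append(gx_row)
        gy.append(gy_row)
    return pred, gx, gy
```

Calling this once with the L0 reference picture and once with the L1 reference picture would give the four inputs (two prediction images, two gradient images) that the later steps consume.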

[0279] Next, the inter-screen prediction unit 126 derives local motion estimates for each pixel in the block to be processed (S411). In this process, the pixel value of the corresponding pixel position in the L0 prediction image, the gradient value of the corresponding pixel position in the L0 gradient image, the pixel value of the corresponding pixel position in the L1 prediction image, and the gradient value of the corresponding pixel position in the L1 gradient image are used.

[0280] The inter-screen prediction unit 126 then derives a pixel correction value for each pixel of the block to be processed, using the gradient value of the corresponding pixel position in the L0 gradient image, the gradient value of the corresponding pixel position in the L1 gradient image, and the local motion estimate. The inter-screen prediction unit 126 then derives a predicted pixel value for each pixel of the block to be processed, using the pixel value of the corresponding pixel position in the L0 prediction image, the pixel value of the corresponding pixel position in the L1 prediction image, and the pixel correction value (S412).

[0281] Through the above operation, the inter-screen prediction unit 126 generates a predicted image to which BIO processing has been applied.

[0282] In addition, the following equation (3) may be used specifically in deriving the local motion estimate and the pixel correction value.

[0283] [Equation (3): not reproduced in this text]

[0284] In equation (3), I_x^0[x,y] is the horizontal gradient value at pixel position [x,y] in the L0 gradient image. I_x^1[x,y] is the horizontal gradient value at pixel position [x,y] in the L1 gradient image. I_y^0[x,y] is the vertical gradient value at pixel position [x,y] in the L0 gradient image. I_y^1[x,y] is the vertical gradient value at pixel position [x,y] in the L1 gradient image.

[0285] Also, in equation (3), I^0[x,y] is the pixel value at pixel position [x,y] in the L0 prediction image. I^1[x,y] is the pixel value at pixel position [x,y] in the L1 prediction image. ΔI[x,y] is the difference between the pixel value at pixel position [x,y] in the L0 prediction image and the pixel value at pixel position [x,y] in the L1 prediction image.

[0286] Furthermore, in equation (3), Ω is, for example, the set of pixel positions included in a region centered on the pixel position [x,y]. w[i,j] is the weighting coefficient for the pixel position [i,j]. The same value may be used for every w[i,j]. G_x[x,y], G_y[x,y], G_xG_y[x,y], sG_xG_y[x,y], sG_x^2[x,y], sG_y^2[x,y], sG_xdI[x,y], sG_ydI[x,y], etc. are auxiliary calculated values.

[0287] Furthermore, in equation (3), u[x,y] is the horizontal value that constitutes the local motion estimate at pixel position [x,y]. v[x,y] is the vertical value that constitutes the local motion estimate at pixel position [x,y]. b[x,y] is the pixel correction value at pixel position [x,y]. p[x,y] is the predicted pixel value at pixel position [x,y].
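The roles of these quantities can be sketched in Python. Equation (3) itself is not reproduced in this text, so the formulas below are an assumed least-squares form, not the patent's equation: they follow from the assumed model I0 - u*Ix0 - v*Iy0 = I1 + u*Ix1 + v*Iy1, with G_x and G_y taken as sums of the L0 and L1 gradients, ΔI taken as I0 - I1, and all weights w[i,j] set to 1. The function name `bio_predict`, the `radius` parameter, the sign conventions, and the floating-point arithmetic are all illustrative assumptions.

```python
def bio_predict(i0, i1, ix0, ix1, iy0, iy1, radius=1):
    """Return the corrected bi-prediction p[x,y] as a 2D list of floats.

    ASSUMED model (not taken from equation (3)):
        I0 - u*Ix0 - v*Iy0 = I1 + u*Ix1 + v*Iy1
    solved by least squares over a window Omega with w[i,j] = 1.
    """
    h, w = len(i0), len(i0[0])
    # Auxiliary per-pixel values: gradient sums and pixel difference.
    gx = [[ix0[y][x] + ix1[y][x] for x in range(w)] for y in range(h)]
    gy = [[iy0[y][x] + iy1[y][x] for x in range(w)] for y in range(h)]
    di = [[i0[y][x] - i1[y][x] for x in range(w)] for y in range(h)]
    pred = []
    for y in range(h):
        row = []
        for x in range(w):
            s_gx2 = s_gy2 = s_gxgy = s_gxdi = s_gydi = 0.0
            # Omega: window centered on [x,y], clipped to the block.
            for j in range(max(0, y - radius), min(h, y + radius + 1)):
                for i in range(max(0, x - radius), min(w, x + radius + 1)):
                    s_gx2 += gx[j][i] * gx[j][i]
                    s_gy2 += gy[j][i] * gy[j][i]
                    s_gxgy += gx[j][i] * gy[j][i]
                    s_gxdi += gx[j][i] * di[j][i]
                    s_gydi += gy[j][i] * di[j][i]
            # Local motion estimate (u, v); 0 where the gradient is flat.
            u = s_gxdi / s_gx2 if s_gx2 else 0.0
            v = (s_gydi - u * s_gxgy) / s_gy2 if s_gy2 else 0.0
            # Pixel correction value b, then corrected prediction p.
            b = u * (ix1[y][x] - ix0[y][x]) + v * (iy1[y][x] - iy0[y][x])
            row.append((i0[y][x] + i1[y][x] + b) / 2)
        pred.append(row)
    return pred
```

Where both gradient images are zero, the correction vanishes and the result reduces to the plain average of the L0 and L1 prediction images, which matches the description of BIO as a correction applied on top of ordinary bi-prediction.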

[0288] Furthermore, although the inter-screen prediction unit 126 derives local motion estimates for each pixel in the above description, it is also possible to derive local motion estimates for each subblock, which is an image data unit that is coarser than a pixel but finer than the block being processed.

[0289] For example, in equation (3) above, Ω may be the set of pixel positions included in the subblock. Then sG_xG_y[x,y], sG_x^2[x,y], sG_y^2[x,y], sG_xdI[x,y], sG_ydI[x,y], u[x,y], and v[x,y] may be calculated for each subblock rather than for each pixel.
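A minimal sketch of the subblock variant follows, assuming the same least-squares form as a per-pixel estimate would use (the exact form of equation (3) is not reproduced in this text). Here Ω is the set of positions in each subblock, all weights are 1, and one (u, v) pair is produced per subblock. The name `subblock_flow` and the default subblock size of 4 are hypothetical.

```python
def subblock_flow(gx, gy, di, size=4):
    """Return a 2D list of (u, v) pairs, one per size x size subblock.

    gx, gy: per-pixel auxiliary gradient values (2D lists).
    di:     per-pixel difference between the L0 and L1 prediction images.
    """
    h, w = len(gx), len(gx[0])
    flow = []
    for top in range(0, h, size):
        row = []
        for left in range(0, w, size):
            s_gx2 = s_gy2 = s_gxgy = s_gxdi = s_gydi = 0.0
            # Omega: every pixel position inside this subblock.
            for y in range(top, min(top + size, h)):
                for x in range(left, min(left + size, w)):
                    s_gx2 += gx[y][x] * gx[y][x]
                    s_gy2 += gy[y][x] * gy[y][x]
                    s_gxgy += gx[y][x] * gy[y][x]
                    s_gxdi += gx[y][x] * di[y][x]
                    s_gydi += gy[y][x] * di[y][x]
            # One local motion estimate shared by the whole subblock.
            u = s_gxdi / s_gx2 if s_gx2 else 0.0
            v = (s_gydi - u * s_gxgy) / s_gy2 if s_gy2 else 0.0
            row.append((u, v))
        flow.append(row)
    return flow
```

Compared with a per-pixel window, the sums are accumulated once per subblock instead of once per pixel, which is where the reduction in computation comes from.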

[0290] Furthermore, the encoding device 100 and the decoding device 200 may apply common BIO processing. In other words, the encoding device 100 and the decoding device 200 may apply BIO processing in the same manner.

[0291] [Example of an encoding device implementation] Figure 21 is a block diagram showing an implementation example of the encoding device 100 according to Embodiment 1. The encoding device 100 includes a circuit 160 and a memory 162. For example, the multiple components of the encoding device 100 shown in Figures 1 and 11 are implemented by the circuit 160 and memory 162 shown in Figure 21.

[0292] Circuit 160 is an information processing circuit and is a circuit that can access memory 162. For example, circuit 160 is a dedicated or general-purpose electronic circuit for encoding moving images. Circuit 160 may also be a processor such as a CPU. Alternatively, circuit 160 may be a collection of multiple electronic circuits. Furthermore, for example, circuit 160 may play the role of multiple components of the encoding device 100 shown in Figure 1, etc., excluding the component for storing information.

[0293] Memory 162 is a dedicated or general-purpose memory in which information for the circuit 160 to encode moving images is stored. Memory 162 may be an electronic circuit, or it may be connected to circuit 160. Memory 162 may also be included in circuit 160. Memory 162 may also be a collection of multiple electronic circuits. Memory 162 may also be a magnetic disk or an optical disk, or it may be described as storage or a recording medium. Memory 162 may also be a non-volatile memory or a volatile memory.

[0294] For example, memory 162 may store the video to be encoded, or it may store a bit sequence corresponding to the encoded video. Alternatively, memory 162 may store a program for circuit 160 to encode the video.

[0295] Furthermore, for example, memory 162 may play the role of an information storage component among the multiple components of the encoding device 100 shown in Figure 1, etc. Specifically, memory 162 may play the role of block memory 118 and frame memory 122 shown in Figure 1. More specifically, reconstructed blocks and reconstructed pictures may be stored in memory 162.

[0296] Furthermore, it is not necessary for the encoding device 100 to implement all of the components shown in Figure 1, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 1, etc., may be included in other devices, and some of the processes described above may be executed by other devices. Then, in the encoding device 100, motion compensation is efficiently performed by implementing some of the components shown in Figure 1, etc., and by performing some of the processes described above.

[0297] Specifically, the encoding device 100 derives a first motion vector using a first inter-frame prediction method that uses the degree of fit of two reconstructed images from two different regions within two different pictures, in prediction block units obtained by dividing the image contained in the moving image (S104 or S106 in Figure 13). Here, the first inter-frame prediction method is, for example, the FRUC method described above. Specifically, the first inter-frame prediction method includes at least one of the template FRUC method and the bilateral FRUC method. That is, the two regions in the first inter-frame prediction method are (1) a region within the target picture adjacent to the target prediction block and a region within the reference picture, or (2) two regions within two different reference pictures.

[0298] In other words, the first inter-frame prediction method is a method in which the encoding side and the decoding side derive motion vectors using the same method. Furthermore, in the first inter-frame prediction method, information indicating the motion vector is not signaled in the encoded stream and is therefore not transmitted from the encoding side to the decoding side. In addition, in the first inter-frame prediction method, the encoding device 100 derives motion vectors using the pixel values of already-encoded prediction blocks, without using the pixel values of the target prediction block.
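To illustrate how a motion vector can be derived from reconstructed pictures alone, the following Python sketch performs a small bilateral search: each candidate vector points into one reference picture and its mirror image points into the other, and the candidate with the best fit is kept. The sum of absolute differences as the fit measure, the mirrored vector (which assumes equal temporal distance to both references), the integer-pel full search, and the names `sad` and `bilateral_search` are assumptions for illustration only.

```python
def sad(pic_a, ay, ax, pic_b, by, bx, h, w):
    """Sum of absolute differences between two h x w regions."""
    return sum(abs(pic_a[ay + j][ax + i] - pic_b[by + j][bx + i])
               for j in range(h) for i in range(w))


def bilateral_search(ref0, ref1, top, left, h, w, search=1):
    """Return ((dx, dy), cost) minimizing the mismatch between ref0 and ref1.

    Only reconstructed reference pictures are read; the target block's own
    pixel values are never used, so an encoder and a decoder running this
    same search arrive at the same vector without any signaling.
    The caller must keep the block +/- `search` inside both pictures.
    """
    best_mv, best_cost = None, None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # ref0 is displaced by +mv, ref1 by the mirrored -mv.
            cost = sad(ref0, top + dy, left + dx,
                       ref1, top - dy, left - dx, h, w)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

For example, if the content of `ref0` is the scene shifted one sample to the right and `ref1` is the same scene shifted one sample to the left, the search settles on the vector (1, 0) with zero cost.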

[0299] Next, the encoding device 100 performs a first motion compensation process in which it generates a predicted image by referring to the spatial gradient of luminance in the image generated by motion compensation using the derived first motion vector, on a prediction block basis (S103A in Figure 13). Here, the first motion compensation process is, for example, the BIO process described above, and includes correction using the luminance gradient. In addition, the first motion compensation process corrects the predicted image in units finer than the prediction block (for example, on a pixel basis or block basis). Furthermore, in the first motion compensation process, the predicted image is generated using the region in the reference picture indicated by the motion vector and the pixels surrounding that region.

[0300] According to this, the encoding device 100 can reduce the processing load by performing the derivation process of motion vectors using the first inter-screen prediction method and the first motion compensation process on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which includes correction using the brightness gradient, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the encoding device 100 can reduce the processing load while suppressing the decrease in encoding efficiency.

[0301] Furthermore, the encoding device 100 derives a second motion vector using a second inter-screen prediction method that uses the degree of fit between the target prediction block and the reconstructed image of the region included in the reference picture, on a prediction block-by-block basis (S102 in Figure 13). The encoding device 100 then generates an encoded bitstream containing information for identifying the second motion vector.

[0302] Here, the second inter-screen prediction method is, for example, the normal inter-screen prediction method described above. In other words, the second inter-screen prediction method is a method in which the encoding side and the decoding side derive motion vectors using different methods. Specifically, the encoding device 100 derives motion vectors using the pixel values of the encoded prediction block and the pixel values of the target prediction block. The encoding device 100 then signals information indicating the derived motion vector to the encoded stream. As a result, the information indicating the motion vector derived by the encoding device 100 is transmitted from the encoding device 100 to the decoding device 200. The decoding device 200 derives motion vectors using this information contained in the encoded stream.

[0303] Next, the encoding device 100 performs a second motion compensation process in which it generates a predicted image by referring to the spatial gradient of luminance in the image generated by motion compensation using the derived second motion vector, on a prediction block basis (S103A in Figure 13). Here, the second motion compensation process is, for example, the BIO process described above, and includes correction using the luminance gradient. In addition, the second motion compensation process corrects the predicted image in units finer than the prediction block (for example, on a pixel basis or block basis). Furthermore, in the second motion compensation process, the predicted image is generated using the region in the reference picture indicated by the motion vector and the pixels surrounding that region.

[0304] The second motion compensation process may be the same as the first motion compensation process, or it may be partially different.

[0305] According to this, the processing unit for motion compensation can be the same whether the first inter-screen prediction method or the second inter-screen prediction method is used. This simplifies the implementation of motion compensation.

[0306] Furthermore, in the first operating mode (first operating mode in S111), the encoding device 100 derives a first motion vector in prediction block units obtained by dividing the image contained in the moving image using a first inter-frame prediction method (S112), and performs a first motion compensation process using the derived first motion vector in each prediction block unit (S113). Here, the first motion compensation process is, for example, a BIO process, and is a motion compensation process that generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector.

[0307] Furthermore, in the second operating mode (second operating mode in S111), the encoding device 100 derives a second motion vector in sub-block units obtained by dividing the prediction block using the second inter-screen prediction method (S114), and performs a second motion compensation process using the second motion vector in sub-block units (S115). Here, the second motion compensation process is, for example, a motion compensation process that does not apply BIO processing, and is a motion compensation process that generates a predicted image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0308] According to this, in the first operating mode, the encoding device 100 performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referring to the spatial gradient of brightness, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in encoding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the encoding device 100 performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the encoding device 100 can improve encoding efficiency. In this way, the encoding device 100 can reduce the processing load while suppressing a decrease in encoding efficiency.
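The difference in processing granularity between the two operating modes can be sketched as a list of work items for one prediction block. This is only an illustration: the mode labels `"first"` and `"second"`, the function name `plan_prediction`, the tuple layout, and the default sub-block size of 4 are hypothetical and do not come from the text.

```python
def plan_prediction(mode, block_w, block_h, sub=4):
    """List work items (x, y, w, h, use_bio) for one prediction block.

    Each item stands for one motion vector derivation followed by one
    motion compensation over the given rectangle.
    """
    if mode == "first":
        # First operating mode: one MV for the whole prediction block,
        # followed by gradient-based (BIO-style) compensation.
        return [(0, 0, block_w, block_h, True)]
    if mode == "second":
        # Second operating mode: one MV per sub-block, compensation
        # without reference to the spatial gradient.
        return [(x, y, min(sub, block_w - x), min(sub, block_h - y), False)
                for y in range(0, block_h, sub)
                for x in range(0, block_w, sub)]
    raise ValueError("unknown operating mode: %r" % (mode,))
```

For an 8x8 prediction block, the first mode yields a single item with gradient-based correction enabled, while the second mode yields four 4x4 items with it disabled.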

[0309] For example, the first inter-frame prediction method differs from the second inter-frame prediction method. Specifically, the second inter-frame prediction method is an inter-frame prediction method that uses the degree of fit of two reconstructed images from two different regions within two separate pictures, such as the FRUC method.

[0310] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis significantly improves coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0311] For example, the first inter-frame prediction method is one of the following: (1) a third inter-frame prediction method (e.g., template FRUC method) that uses the degree of fit between the reconstructed image of a region in a target picture adjacent to the target prediction block and the reconstructed image of a region in a reference picture; or (2) a fourth inter-frame prediction method (e.g., bilateral FRUC method) that uses the degree of fit between two reconstructed images of two regions in two different reference pictures. The second inter-frame prediction method is the other of the third and fourth inter-frame prediction methods.
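The template-based variant in item (1) can be sketched as follows: the already reconstructed row above and column to the left of the target block form a template, and the reference picture is searched for the displacement at which the same-shaped template fits best. The L-shaped one-sample-thick template, the sum of absolute differences, the integer-pel full search, and the names `template_cost` and `template_search` are assumptions made for illustration.

```python
def template_cost(cur, ref, top, left, h, w, dx, dy):
    """Mismatch between the template in `cur` and its displaced copy in `ref`."""
    cost = 0
    # Row of reconstructed samples directly above the target block.
    for i in range(w):
        cost += abs(cur[top - 1][left + i] - ref[top - 1 + dy][left + i + dx])
    # Column of reconstructed samples directly left of the target block.
    for j in range(h):
        cost += abs(cur[top + j][left - 1] - ref[top + j + dy][left - 1 + dx])
    return cost


def template_search(cur, ref, top, left, h, w, search=1):
    """Return the (dx, dy) with the lowest template cost.

    Only samples outside the target block are read from `cur`, so the
    same search can run on the decoding side before the block is decoded.
    The caller must keep the template +/- `search` inside both pictures.
    """
    candidates = [(dx, dy)
                  for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)]
    return min(candidates,
               key=lambda mv: template_cost(cur, ref, top, left, h, w,
                                            mv[0], mv[1]))
```

Unlike a search between two reference pictures, this variant needs only one reference picture plus the reconstructed neighborhood of the target block in the current picture.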

[0312] For example, the first inter-frame prediction method is the third inter-frame prediction method (e.g., the template FRUC method), and the second inter-frame prediction method is the fourth inter-frame prediction method (e.g., the bilateral FRUC method).

[0313] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis significantly improves coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0314] For example, the first inter-frame prediction method is an inter-frame prediction method (the normal inter-frame prediction method) that uses the degree of fit between the target prediction block and the reconstructed image of the region included in the reference picture, and the encoding device 100 generates an encoded bitstream containing information for identifying the derived first motion vector.

[0315] [Example of a decoding device implementation] Figure 22 is a block diagram showing an implementation example of the decoding device 200 according to Embodiment 1. The decoding device 200 includes a circuit 260 and a memory 262. For example, the multiple components of the decoding device 200 shown in Figures 10 and 12 are implemented by the circuit 260 and memory 262 shown in Figure 22.

[0316] Circuit 260 is an information processing circuit and is a circuit that can access memory 262. For example, circuit 260 is a dedicated or general-purpose electronic circuit for decoding moving images. Circuit 260 may also be a processor such as a CPU. Alternatively, circuit 260 may be a collection of multiple electronic circuits. Furthermore, for example, circuit 260 may play the role of multiple components of the decoding device 200 shown in Figure 10, etc., excluding the component for storing information.

[0317] Memory 262 is a dedicated or general-purpose memory in which information for the circuit 260 to decode moving images is stored. Memory 262 may be an electronic circuit, or it may be connected to the circuit 260. Alternatively, memory 262 may be included in the circuit 260. Alternatively, memory 262 may be a collection of multiple electronic circuits. Alternatively, memory 262 may be a magnetic disk or an optical disk, or it may be described as storage or a recording medium. Alternatively, memory 262 may be a non-volatile memory or a volatile memory.

[0318] For example, memory 262 may store a bit sequence corresponding to an encoded video, or a video corresponding to a decoded bit sequence. Memory 262 may also store a program for circuit 260 to decode the video.

[0319] Furthermore, for example, memory 262 may play the role of an information storage component among the multiple components of the decoding device 200 shown in Figure 10, etc. Specifically, memory 262 may play the role of block memory 210 and frame memory 214 shown in Figure 10. More specifically, reconstructed blocks and reconstructed pictures, etc., may be stored in memory 262.

[0320] Furthermore, it is not necessary for the decoding device 200 to implement all of the components shown in Figure 10, etc., nor is it necessary for all of the processes described above to be performed. Some of the components shown in Figure 10, etc., may be included in other devices, and some of the processes described above may be performed by other devices. Then, motion compensation is efficiently performed in the decoding device 200 by implementing some of the components shown in Figure 10, etc., and performing some of the processes described above.

[0321] Specifically, the decoding device 200 derives a first motion vector using a first inter-frame prediction method that uses the degree of fit of two reconstructed images from two different regions within two different pictures, in prediction block units obtained by dividing the image contained in the moving image (S104 or S106 in Figure 13). Here, the first inter-frame prediction method is, for example, the FRUC method described above. Specifically, the first inter-frame prediction method includes at least one of the template FRUC method and the bilateral FRUC method. In other words, the two regions in the first inter-frame prediction method are (1) a region within the target picture adjacent to the target prediction block and a region within the reference picture, or (2) two regions within two different reference pictures.

[0322] In other words, the first inter-frame prediction method is a method in which the encoding side and the decoding side derive motion vectors using the same method. Furthermore, in the first inter-frame prediction method, information indicating the motion vector is not signaled in the encoded stream and is therefore not transmitted from the encoding side to the decoding side. In addition, in the first inter-frame prediction method, the decoding device 200 derives motion vectors using the pixel values of already-decoded prediction blocks, without using the pixel values of the target prediction block.

[0323] Next, the decoding device 200 performs a first motion compensation process in which it generates a predicted image by referring to the spatial gradient of luminance in the image generated by motion compensation using the derived first motion vector, on a prediction block basis (S103A in Figure 13). Here, the first motion compensation process is, for example, the BIO process described above, and includes correction using the luminance gradient. In addition, the first motion compensation process corrects the predicted image in units finer than the prediction block (for example, on a pixel basis or block basis). Furthermore, in the first motion compensation process, the predicted image is generated using the region in the reference picture indicated by the motion vector and the pixels surrounding that region.

[0324] According to this, the decoding device 200 can reduce the processing load by performing the derivation of motion vectors using the first inter-screen prediction method and the first motion compensation processing on a prediction block basis, compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation processing, which includes correction using the brightness gradient, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. Therefore, the decoding device 200 can reduce the processing load while suppressing the decrease in coding efficiency.

[0325] Furthermore, the decoding device 200 obtains information from the encoded bitstream to identify the second motion vector in units of prediction blocks. The decoding device 200 derives the second motion vector in units of prediction blocks using a second inter-screen prediction method based on the above information (S102 in Figure 13).

[0326] Here, the second inter-screen prediction method is, for example, the normal inter-screen prediction method described above. In other words, the second inter-screen prediction method is a method in which the encoding side and the decoding side derive motion vectors using different methods. Specifically, the encoding device 100 derives motion vectors using the pixel values of the encoded prediction block and the pixel values of the target prediction block. The encoding device 100 then signals information indicating the derived motion vector to the encoded stream. As a result, the information indicating the motion vector derived by the encoding device 100 is transmitted from the encoding device 100 to the decoding device 200. The decoding device 200 derives motion vectors using this information contained in the encoded stream.

[0327] Next, the decoding device 200 performs a second motion compensation process in which it generates a predicted image by referring to the spatial gradient of luminance in the image generated by motion compensation using the derived second motion vector, on a prediction block basis (S103A in Figure 13). Here, the second motion compensation process is, for example, the BIO process described above, and includes correction using the luminance gradient. In addition, the second motion compensation process corrects the predicted image in units finer than the prediction block (for example, on a pixel basis or block basis). Furthermore, in the second motion compensation process, the predicted image is generated using the region in the reference picture indicated by the motion vector and the pixels surrounding that region.

[0328] The second motion compensation process may be the same as the first motion compensation process, or it may be partially different.

[0329] According to this, the processing unit for motion compensation can be the same whether the first inter-screen prediction method or the second inter-screen prediction method is used. This simplifies the implementation of motion compensation.

[0330] Furthermore, in the first operating mode (first operating mode in S111), the decoding device 200 derives a first motion vector in prediction block units obtained by dividing the image contained in the moving image using a first inter-frame prediction method (S112), and performs a first motion compensation process using the derived first motion vector in each prediction block unit (S113). Here, the first motion compensation process is, for example, a BIO process, which is a motion compensation process that generates a predicted image by referring to the spatial gradient of brightness in the image generated by motion compensation using the derived first motion vector.

[0331] Furthermore, in the second operating mode (second operating mode in S111), the decoding device 200 derives a second motion vector in sub-block units obtained by dividing the prediction block using the second inter-screen prediction method (S114), and performs a second motion compensation process using the second motion vector in sub-block units (S115). Here, the second motion compensation process is, for example, a motion compensation process that does not apply BIO processing, and is a motion compensation process that generates a predicted image without referring to the spatial gradient of brightness in the image generated by motion compensation using the second motion vector.

[0332] According to this, in the first operating mode, the decoding device 200 performs the motion vector derivation process and the first motion compensation process on a prediction block basis, thereby reducing the processing load compared to, for example, performing these processes on a sub-block basis. Furthermore, the first motion compensation process, which generates a predicted image by referencing the spatial gradient of brightness, can achieve correction on a unit smaller than the prediction block basis, thus suppressing the decrease in coding efficiency that occurs when processing is not performed on a sub-block basis. In the second operating mode, the decoding device 200 performs the motion vector derivation process and the second motion compensation process on a sub-block basis. Here, the second motion compensation process does not refer to the spatial gradient of brightness, so the processing load is less than that of the first motion compensation process. Moreover, by having these two operating modes, the decoding device 200 can improve coding efficiency. In this way, the decoding device 200 can reduce the processing load while suppressing a decrease in coding efficiency.

[0333] For example, the first inter-frame prediction method differs from the second inter-frame prediction method. Specifically, the second inter-frame prediction method is an inter-frame prediction method that uses the degree of fit of two reconstructed images from two different regions within two separate pictures, such as the FRUC method.

[0334] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis significantly improves coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0335] For example, the first inter-frame prediction method is one of the following: (1) a third inter-frame prediction method (e.g., template FRUC method) that uses the degree of fit between the reconstructed image of a region in a target picture adjacent to the target prediction block and the reconstructed image of a region in a reference picture; or (2) a fourth inter-frame prediction method (e.g., bilateral FRUC method) that uses the degree of fit between two reconstructed images of two regions in two different reference pictures. The second inter-frame prediction method is the other of the third and fourth inter-frame prediction methods.

[0336] For example, the first inter-frame prediction method is the third inter-frame prediction method (e.g., the template FRUC method), and the second inter-frame prediction method is the fourth inter-frame prediction method (e.g., the bilateral FRUC method).

[0337] According to this, an inter-screen prediction method for which deriving motion vectors on a sub-block basis significantly improves coding efficiency can be performed on a sub-block basis. Therefore, coding efficiency can be improved.

[0338] For example, in the first inter-screen prediction method, the decoding device 200 obtains information from the encoded bitstream to identify the second motion vector in sub-block units, and uses this information to derive the second motion vector.

[0339] [Supplement] Furthermore, the encoding device 100 and decoding device 200 in this embodiment may be used as an image encoding device and an image decoding device, respectively, or as a video encoding device and a video decoding device, respectively. Alternatively, the encoding device 100 and decoding device 200 may each be used as an inter-prediction device (inter-screen prediction device).

[0340] In other words, the encoding device 100 and the decoding device 200 may correspond only to the inter-prediction unit (inter-screen prediction unit) 126 and the inter-prediction unit (inter-screen prediction unit) 218, respectively. Other components such as the conversion unit 106 and the inverse conversion unit 206 may be included in other devices.

[0341] Furthermore, in this embodiment, each component may be implemented by being composed of dedicated hardware or by executing a software program suitable for each component. Each component may also be implemented by a program execution unit such as a CPU or processor reading and executing a software program recorded on a recording medium such as a hard disk or semiconductor memory.

[0342] Specifically, each of the encoding device 100 and the decoding device 200 may include processing circuitry and a storage device electrically connected to and accessible from the processing circuitry. For example, the processing circuitry corresponds to circuit 160 or 260, and the storage device corresponds to memory 162 or 262.

[0343] The processing circuitry includes at least one of dedicated hardware and a program execution unit, and performs processing using the storage device. Furthermore, if the processing circuitry includes a program execution unit, the storage device stores the software program executed by that program execution unit.

[0344] Here, the software that implements the encoding device 100 or decoding device 200, etc., in this embodiment is the following program.

[0345] Furthermore, each component may be a circuit, as described above. These circuits may form a single circuit as a whole, or they may be separate circuits. Also, each component may be implemented using a general-purpose processor, or it may be implemented using a dedicated processor.

[0346] Furthermore, a process performed by one component may be performed by another component. Also, the order in which processes are executed may be changed, and multiple processes may be executed in parallel. Additionally, the encoding / decoding device may comprise an encoding device 100 and a decoding device 200.

[0347] Although the encoding device 100 and the decoding device 200 have been described above based on the embodiments, they are not limited to these embodiments. Without departing from the spirit of this disclosure, configurations obtained by applying to these embodiments various modifications that a person skilled in the art could conceive of, as well as configurations constructed by combining components from different embodiments, may also be included within the scope of the embodiments of the encoding device 100 and the decoding device 200.

[0348] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.

[0349] (Embodiment 2) In each of the above embodiments, each functional block can typically be implemented by an MPU and memory, etc. Furthermore, the processing performed by each functional block is typically implemented by a program execution unit such as a processor reading and executing software (program) recorded on a recording medium such as ROM. This software may be distributed by download, etc., or it may be recorded on a recording medium such as semiconductor memory and distributed. Of course, it is also possible to implement each functional block by hardware (dedicated circuitry).

[0350] Furthermore, the processing described in each embodiment may be implemented by centralized processing using a single device (system), or by distributed processing using multiple devices. Also, the above program may be executed by a single processor or by multiple processors. In other words, centralized processing may be performed, or distributed processing may be performed.

[0351] The embodiments of this disclosure are not limited to those described above, and various modifications are possible, which are also included within the scope of the embodiments of this disclosure.

[0352] Furthermore, here we will describe application examples of the video encoding method (image encoding method) or video decoding method (image decoding method) shown in each of the above embodiments, and a system using the same. The system is characterized by having an image encoding device using the image encoding method, an image decoding device using the image decoding method, and an image encoding and decoding device that includes both. Other configurations in the system can be appropriately modified as needed.

[0353] [Usage example] Figure 23 shows the overall configuration of the content supply system ex100 that realizes the content distribution service. The area where the communication service is provided is divided into cells of a desired size, and base stations ex106, ex107, ex108, ex109, and ex110, which are fixed radio stations, are installed in the respective cells.

[0354] In this content supply system ex100, various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 are connected to the internet ex101 via an internet service provider ex102 or a communication network ex104, and base stations ex106 to ex110. The content supply system ex100 may also connect any combination of the above elements. The devices may be connected to each other directly or indirectly via a telephone network or short-range radio, etc., without going through the base stations ex106 to ex110, which are fixed radio stations. In addition, the streaming server ex103 is connected to various devices such as a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, and a smartphone ex115 via the internet ex101, etc. Furthermore, the streaming server ex103 is connected to terminals in a hotspot on an airplane ex117 via satellite ex116.

[0355] Note that instead of base stations ex106 to ex110, wireless access points or hotspots may be used. Also, streaming server ex103 may be connected directly to the communication network ex104 without going through the internet ex101 or internet service provider ex102, or it may be connected directly to the airplane ex117 without going through satellite ex116.

[0356] Camera ex113 is a device capable of taking still images and videos, such as a digital camera. Smartphone ex115 is a smartphone, mobile phone, or PHS (Personal Handyphone System) that supports mobile communication systems generally known as 2G, 3G, 3.9G, 4G, and the upcoming 5G.

[0357] Home appliance ex118 refers to appliances such as refrigerators or equipment included in household fuel cell cogeneration systems.

[0358] In the content supply system ex100, live streaming becomes possible when a terminal with a shooting function is connected to the streaming server ex103 via a base station ex106 or the like. In live streaming, the terminal (computer ex111, game console ex112, camera ex113, home appliance ex114, smartphone ex115, and terminal inside an airplane ex117, etc.) performs the encoding process described in each of the above embodiments on still images or video content captured by the user using the terminal, multiplexes the video data obtained by encoding with sound data encoded from the sound corresponding to the video, and transmits the obtained data to the streaming server ex103. In other words, each terminal functions as an image encoding device according to one aspect of this disclosure.

[0359] Meanwhile, the streaming server ex103 streams the transmitted content data to clients that request it. The client is a computer ex111, a game console ex112, a camera ex113, a home appliance ex114, a smartphone ex115, or a terminal on an airplane ex117, etc., that is capable of decoding the encoded data. Each device that receives the distributed data decodes and plays back the received data. That is, each device functions as an image decoding device according to one aspect of this disclosure.

[0360] [Distributed Processing] Furthermore, the streaming server ex103 may consist of multiple servers or computers that distribute data processing, recording, and distribution. For example, the streaming server ex103 may be implemented using a CDN (Content Delivery Network), where content delivery is achieved through a network connecting numerous edge servers distributed worldwide. In a CDN, the physically closest edge server is dynamically assigned depending on the client. Latency can be reduced by caching and delivering content to the edge server. In addition, if an error occurs or the communication state changes due to an increase in traffic, processing can be distributed among multiple edge servers, the delivery entity can be switched to another edge server, or delivery can be continued by bypassing the failed part of the network, thus enabling high-speed and stable delivery.
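As a non-limiting illustration, the assignment of the physically closest edge server, with a switch to another edge server when one fails, could be sketched as follows. The class EdgeServer, the functions haversine_km and pick_edge, the use of great-circle distance as the measure of closeness, and the example coordinates are all assumptions introduced here; the embodiments do not specify how closeness is determined.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EdgeServer:
    """Hypothetical description of one CDN edge server."""
    name: str
    lat: float
    lon: float
    healthy: bool = True


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def pick_edge(client_lat: float, client_lon: float, edges: Sequence[EdgeServer]) -> EdgeServer:
    """Return the closest healthy edge server (assumed selection policy)."""
    candidates = [edge for edge in edges if edge.healthy]
    if not candidates:
        raise ValueError("no healthy edge server available")
    return min(candidates, key=lambda e: haversine_km(client_lat, client_lon, e.lat, e.lon))


# Example: a client near Tokyo with two edge servers.
edges = [EdgeServer("osaka", 34.69, 135.50), EdgeServer("frankfurt", 50.11, 8.68)]
nearest = pick_edge(35.68, 139.69, edges)

# If the closest server fails, delivery switches to another edge server.
degraded = [EdgeServer("osaka", 34.69, 135.50, healthy=False), edges[1]]
fallback = pick_edge(35.68, 139.69, degraded)
```

A real CDN would typically also weigh load and network conditions; distance alone is used here only to keep the sketch short.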

[0361] Furthermore, beyond the distributed processing of the distribution itself, the encoding process of the captured data can be performed on each terminal, on the server side, or shared among them. For example, encoding generally involves two processing loops. In the first loop, the complexity or code amount of the image at the frame or scene level is detected. In the second loop, processing is performed to improve encoding efficiency while maintaining image quality. For example, if the terminal performs the first encoding process and the server that receives the content performs the second encoding process, it is possible to improve the quality and efficiency of the content while reducing the processing load on each terminal. In this case, if there is a request to receive and decode in near real time, the first encoded data from the terminal can be received and played back on other terminals, enabling more flexible real-time distribution.
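As a non-limiting illustration, the two processing loops could be sketched as follows. The functions first_pass_complexity and second_pass_bit_budget, the use of mean absolute frame difference as the complexity measure, and the proportional bit allocation are assumptions introduced here; the embodiments do not specify how complexity is detected or how efficiency is improved.

```python
from typing import List, Sequence


def first_pass_complexity(frames: Sequence[Sequence[int]]) -> List[float]:
    """First loop (e.g., on the terminal): estimate per-frame complexity.

    Assumption: complexity is the mean absolute difference from the previous
    frame; the first frame is compared with an all-zero frame.
    """
    complexities = []
    previous = [0] * len(frames[0])
    for frame in frames:
        diff = sum(abs(a - b) for a, b in zip(frame, previous))
        complexities.append(diff / len(frame))
        previous = list(frame)
    return complexities


def second_pass_bit_budget(complexities: Sequence[float], total_bits: int) -> List[int]:
    """Second loop (e.g., on the server): share a bit budget in proportion to complexity."""
    total = sum(complexities)
    if total == 0:
        # Every frame is equally simple, so share the budget evenly.
        return [total_bits // len(complexities)] * len(complexities)
    return [int(total_bits * c / total) for c in complexities]


# Three tiny 4-sample "frames"; the second is identical to the first.
frames = [[10, 10, 10, 10], [10, 10, 10, 10], [50, 50, 10, 10]]
complexities = first_pass_complexity(frames)
budget = second_pass_bit_budget(complexities, total_bits=3000)
```

In this sketch the unchanged second frame receives no bits of its own, while the frame with the largest change receives the largest share.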

[0362] Another example is the camera ex113, which extracts features from an image, compresses the feature data as metadata, and sends it to the server. The server performs compression according to the meaning of the image, for example, by determining the importance of an object from the features and switching the quantization precision. Feature data is particularly effective in improving the accuracy and efficiency of motion vector prediction during further compression on the server. Alternatively, a simple encoding such as VLC (Variable Length Coding) may be performed on the terminal, and a more computationally intensive encoding such as CABAC (Context-Adaptive Binary Arithmetic Coding) may be performed on the server.

[0363] Another example is a scenario in a stadium, shopping mall, or factory where multiple video data sets of nearly identical scenes may exist, captured by multiple terminals. In such cases, the encoding process is distributed among the multiple terminals that captured the footage, along with other terminals and servers as needed, by assigning encoding tasks to each unit, for example, at the Group of Pictures (GOP) level, picture level, or tile level (a division of a picture). This reduces latency and enables more real-time performance.
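As a non-limiting illustration, assigning encoding tasks at the GOP level could be sketched as follows. The functions split_into_gops and assign_gops, the fixed GOP size, the round-robin policy, and the worker names are assumptions introduced here; the embodiments do not specify an assignment policy.

```python
from typing import Dict, List, Sequence


def split_into_gops(num_pictures: int, gop_size: int) -> List[range]:
    """Split picture indices 0..num_pictures-1 into consecutive GOPs."""
    return [
        range(start, min(start + gop_size, num_pictures))
        for start in range(0, num_pictures, gop_size)
    ]


def assign_gops(gops: Sequence[range], workers: Sequence[str]) -> Dict[str, List[range]]:
    """Assign GOPs to terminals or servers in round-robin order (assumed policy)."""
    assignment: Dict[str, List[range]] = {worker: [] for worker in workers}
    for index, gop in enumerate(gops):
        assignment[workers[index % len(workers)]].append(gop)
    return assignment


# 20 pictures, 8 pictures per GOP, shared among two terminals and one server.
gops = split_into_gops(num_pictures=20, gop_size=8)
assignment = assign_gops(gops, ["terminal_a", "terminal_b", "server"])
```

The same structure would apply at the picture or tile level by changing what each unit of work contains.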

[0364] Furthermore, since multiple video data sets depict essentially the same scene, the server may manage and / or instruct the video data captured by each terminal to reference each other. Alternatively, the server may receive the encoded data from each terminal, change the reference relationships between the multiple data sets, or correct or replace the pictures themselves and re-encode them. This allows for the creation of a stream with improved quality and efficiency for each individual data set.

[0365] Furthermore, the server may transcode the video data to change its encoding method before distributing it. For example, the server may convert an MPEG-based encoding to a VP-based encoding, or convert H.264 to H.265.

[0366] Thus, the encoding process can be performed by a terminal or one or more servers. Therefore, in the following, the terms "server" or "terminal" will be used to refer to the entity performing the processing, but some or all of the processing performed by the server may be performed by the terminal, and some or all of the processing performed by the terminal may be performed by the server. The same applies to the decoding process.

[0367] [3D, Multi-angle] In recent years, it has become increasingly common to integrate and utilize images or videos of different scenes, or the same scene, captured from different angles, using multiple cameras ex113 and / or smartphones ex115, which are nearly synchronized with each other. The videos captured by each device are integrated based on the relative positional relationship between the devices, or on areas where feature points contained in the videos coincide, which are acquired separately.

[0368] The server may not only encode 2D video but also encode still images automatically based on scene analysis of the video, or at a time specified by the user, and send them to the receiving terminal. Furthermore, if the server can obtain the relative positional relationship between the shooting terminals, it can generate a 3D shape of the scene based not only on 2D video but also on video of the same scene taken from different angles. The server may also separately encode 3D data generated by a point cloud, or it may select or reconstruct video to send to the receiving terminal from video taken by multiple terminals based on the results of recognizing or tracking a person or object using the 3D data.

[0369] In this way, users can enjoy scenes by arbitrarily selecting each video corresponding to each shooting terminal, or they can enjoy content in which video from an arbitrary viewpoint is extracted from 3D data reconstructed using multiple images or videos. Furthermore, just like the video, sound can also be collected from multiple different angles, and the server may multiplex and transmit sound from a specific angle or space in conjunction with the video.

[0370] In recent years, content that links the real world with a virtual world, such as Virtual Reality (VR) and Augmented Reality (AR), has also become popular. In the case of VR images, the server may create separate viewpoint images for the right and left eyes and perform encoding that allows referencing between the viewpoint images using Multi-View Coding (MVC), or it may encode them as separate streams without referencing each other. When decoding the separate streams, it is advisable to synchronize playback so that the virtual 3D space is reproduced according to the user's viewpoint.

[0371] In the case of AR images, the server superimposes virtual object information from the virtual space onto camera information from the real space, based on its three-dimensional position or the user's viewpoint movement. The decoding device may acquire or store the virtual object information and three-dimensional data, generate a two-dimensional image according to the user's viewpoint movement, and create superimposed data by smoothly stitching them together. Alternatively, the decoding device may send the user's viewpoint movement to the server in addition to requesting virtual object information, and the server may create superimposed data from the three-dimensional data held by the server according to the received viewpoint movement, encode the superimposed data, and distribute it to the decoding device. The superimposed data may have an α value indicating transparency in addition to RGB, and the server may set the α value of parts other than the object created from the three-dimensional data to 0, etc., so that those parts are transparent, and encode the data. Alternatively, the server may set a predetermined RGB value to the background, like chroma keying, and generate data in which parts other than the object are the background color.
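As a non-limiting illustration, superimposing object data with an α value onto a camera image could be sketched per pixel as follows. The function composite, the 8-bit α range (0 for fully transparent, 255 for fully opaque), and the rounding are assumptions introduced here; the embodiments state only that an α value of 0, etc., makes the corresponding part transparent.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]


def composite(overlay_rgb: Pixel, alpha: int, camera_rgb: Pixel) -> Pixel:
    """Blend one overlay pixel onto one camera pixel.

    Assumption: alpha is an 8-bit value, where 0 leaves the camera pixel
    unchanged and 255 replaces it with the overlay pixel.
    """
    if not 0 <= alpha <= 255:
        raise ValueError("alpha must be in the range 0..255")
    return tuple(
        (alpha * o + (255 - alpha) * c + 127) // 255  # integer blend with rounding
        for o, c in zip(overlay_rgb, camera_rgb)
    )


# Parts other than the object have alpha 0, so the camera image shows through.
background = composite((255, 0, 0), 0, (10, 20, 30))
# The object itself is opaque.
virtual_object = composite((255, 0, 0), 255, (10, 20, 30))
```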

[0372] Similarly, the decoding process of the distributed data can be performed on each client terminal, on the server side, or shared between them. For example, one terminal may send a reception request to the server, and other terminals may receive the content corresponding to that request, perform the decoding process, and then transmit the decoded signal to a device with a display. By distributing the processing and selecting appropriate content regardless of the performance of the communication-capable terminals themselves, it is possible to play back data with good image quality. Another example is that while receiving large image data on a TV or similar device, a portion of the picture, such as tiles, may be decoded and displayed on the viewer's personal terminal. This allows for sharing the overall picture while allowing users to check their own area of responsibility or areas they want to examine in more detail on their own device.

[0373] In the future, it is expected that content will be seamlessly received by switching appropriate data for the connected communication, using distribution system standards such as MPEG-DASH, in situations where multiple short-range, medium-range, or long-range wireless communications are available both indoors and outdoors. This will allow users to freely select and switch in real time between decoding devices or display devices, such as displays installed indoors or outdoors, as well as their own terminals. Furthermore, decoding can be performed while switching between the decoding terminal and the display terminal based on the user's location information. This will make it possible to display map information on the wall or part of the ground of an adjacent building with a displayable device embedded, while traveling to a destination. It will also be possible to switch the bitrate of the received data based on the ease of access to the encoded data on the network, such as when the encoded data is cached on a server that can be accessed quickly from the receiving terminal, or copied to an edge server in the content delivery service.

[0374] [Scalable encoding] Regarding content switching, we will explain using a scalable stream compressed and encoded using the video encoding method described in each of the embodiments above, as shown in Figure 24. The server may have multiple streams with the same content but different qualities as individual streams, but it may also be configured to switch content by taking advantage of the characteristics of a temporally / spatially scalable stream realized by encoding it in layers, as shown in the figure. In other words, the decoding side can freely switch between decoding low-resolution and high-resolution content by deciding which layer to decode according to internal factors such as performance and external factors such as the state of the communication bandwidth. For example, if you want to watch the rest of a video that you were watching on your smartphone ex115 while traveling, on a device such as an internet TV when you get home, that device only needs to decode the same stream to a different layer, thus reducing the burden on the server.
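As a non-limiting illustration, deciding which layer to decode according to an internal factor (performance) and an external factor (communication bandwidth) could be sketched as follows. The class Layer, the function select_layer, and the example resolutions and bitrates are assumptions introduced here and do not appear in the embodiments.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Layer:
    """Hypothetical description of one layer of a scalable stream."""
    name: str
    height: int            # picture height obtained when decoding up to this layer
    cumulative_kbps: int   # bitrate needed to receive this layer and all lower layers


def select_layer(layers: Sequence[Layer], max_decodable_height: int, available_kbps: int) -> Layer:
    """Pick the highest layer that both the device and the bandwidth can handle.

    Layers are ordered from the base layer upward. The base layer is always
    returned as a minimum, and the search stops at the first layer that does
    not fit, because higher layers depend on the layers below them.
    """
    chosen = layers[0]
    for layer in layers[1:]:
        if layer.height <= max_decodable_height and layer.cumulative_kbps <= available_kbps:
            chosen = layer
        else:
            break
    return chosen


layers = [
    Layer("base", 360, 800),
    Layer("enhancement_1", 720, 2500),
    Layer("enhancement_2", 1080, 6000),
]

on_the_move = select_layer(layers, max_decodable_height=1080, available_kbps=3000)
at_home = select_layer(layers, max_decodable_height=2160, available_kbps=10000)
```

Because both devices draw on the same stream, the server does not need to hold a separate stream per quality level for this switch.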

[0375] Furthermore, in addition to the configuration described above, in which pictures are encoded for each layer and an enhancement layer exists above the base layer to achieve scalability, the enhancement layer may include metadata based on statistical information of the image, and the decoding side may generate high-quality content by applying super-resolution to the base-layer picture based on the metadata. Super-resolution may refer to either an improvement in the signal-to-noise ratio at the same resolution or an increase in resolution. The metadata may include information for identifying linear or nonlinear filter coefficients used in the super-resolution process, or information for identifying parameter values in the filtering process, machine learning, or least-squares operation used in the super-resolution process.

[0376] Alternatively, the picture may be divided into tiles or similar structures according to the meaning of objects within the image, and the decoding side may select tiles to decode, thereby decoding only a portion of the area. Furthermore, by storing the attributes of objects (people, cars, balls, etc.) and their positions within the image (coordinate positions within the same image, etc.) as metadata, the decoding side can identify the location of a desired object based on the metadata and determine the tile containing that object. For example, as shown in Figure 25, the metadata is stored using a data storage structure different from pixel data, such as the SEI message in HEVC. This metadata indicates, for example, the position, size, or color of the main object.
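As a non-limiting illustration, determining the tiles that contain a desired object from its position and size in the metadata could be sketched as follows. The function tiles_for_object, the uniform tile grid, and the raster-order tile numbering are assumptions introduced here; the embodiments do not specify a tile layout or a metadata syntax.

```python
from typing import List


def tiles_for_object(
    x: int, y: int, width: int, height: int,
    tile_width: int, tile_height: int, tiles_per_row: int,
) -> List[int]:
    """Return raster-order indices of the tiles overlapped by an object's bounding box.

    Assumption: the picture is divided into a uniform grid of tiles, and
    (x, y) is the top-left corner of the object in pixel coordinates.
    """
    first_col = x // tile_width
    last_col = (x + width - 1) // tile_width
    first_row = y // tile_height
    last_row = (y + height - 1) // tile_height
    return [
        row * tiles_per_row + col
        for row in range(first_row, last_row + 1)
        for col in range(first_col, last_col + 1)
    ]


# A 1920x1080 picture divided into 4x2 tiles of 480x540 pixels.
straddling = tiles_for_object(400, 500, 200, 100, tile_width=480, tile_height=540, tiles_per_row=4)
single = tiles_for_object(1000, 600, 50, 50, tile_width=480, tile_height=540, tiles_per_row=4)
```

The decoding side would then decode only the returned tiles instead of the whole picture.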

[0377] Furthermore, metadata may be stored in units consisting of multiple pictures, such as streams, sequences, or random access units. This allows the decoding side to obtain information such as the time when a specific person appears in the video, and by combining this with the picture-level information, it can identify the picture in which the object exists and the object's position within that picture.

[0378] [Web page optimization] Figure 26 shows an example of a web page display screen on a computer ex111, etc. Figure 27 shows an example of a web page display screen on a smartphone ex115, etc. As shown in Figures 26 and 27, a web page may contain multiple linked images, which are links to image content, and their appearance will differ depending on the viewing device. When multiple linked images are visible on the screen, the display device (decoder) will display still images or I-pictures from each content as linked images, display video such as a GIF animation using multiple still images or I-pictures, or receive only the base layer and decode and display the video, until the user explicitly selects a linked image, or until the linked image approaches the center of the screen or the entire linked image is within the screen.

[0379] When a linked image is selected by the user, the display device prioritizes decoding the base layer. If the HTML of the web page contains information indicating that the content is scalable, the display device may decode up to the enhancement layer. Furthermore, to ensure real-time performance, before selection or when bandwidth is very limited, the display device can decode and display only forward-referenced pictures (I-pictures, P-pictures, and B-pictures that only use forward references), thereby reducing the delay between the decoding time and display time of the first picture (the delay from the start of content decoding to the start of display). Alternatively, the display device may deliberately ignore the reference relationships between pictures and roughly decode all B-pictures and P-pictures using forward references, then perform normal decoding as time passes and more pictures are received.
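As a non-limiting illustration, selecting only forward-referenced pictures for low-delay display could be sketched as follows. The class Picture, the function low_delay_subset, and the interpretation of a forward reference as a reference to a picture that is earlier in display order are assumptions introduced here. The sketch also drops any picture whose reference was itself dropped, which is an added safeguard rather than something stated in the embodiments.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Picture:
    """Hypothetical picture record; `order` is the display order."""
    order: int
    kind: str               # "I", "P", or "B"
    refs: Tuple[int, ...]   # display orders of the pictures this one references


def low_delay_subset(pictures: Sequence[Picture]) -> List[Picture]:
    """Keep pictures that reference only earlier, already-kept pictures.

    `pictures` is expected in decoding order.
    """
    kept: List[Picture] = []
    kept_orders = set()
    for pic in pictures:
        forward_only = all(ref < pic.order for ref in pic.refs)
        refs_available = all(ref in kept_orders for ref in pic.refs)
        if forward_only and refs_available:
            kept.append(pic)
            kept_orders.add(pic.order)
    return kept


# Decoding order: I0, P4, B2 (bidirectional), B1 (forward only), B3 (bidirectional).
stream = [
    Picture(0, "I", ()),
    Picture(4, "P", (0,)),
    Picture(2, "B", (0, 4)),
    Picture(1, "B", (0,)),
    Picture(3, "B", (2, 4)),
]
kept_orders = [pic.order for pic in low_delay_subset(stream)]
```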

[0380] [Autonomous driving] Furthermore, when transmitting and receiving still images or video data such as 2D or 3D map information for autonomous driving or driving assistance of a vehicle, the receiving terminal may receive metadata such as weather or construction information in addition to image data belonging to one or more layers, and decode these in association with each other. The metadata may belong to a layer, or it may simply be multiplexed with the image data.

[0381] In this case, since the vehicle, drone, or airplane containing the receiving terminal is in motion, the receiving terminal can transmit its location information when a reception request is made, enabling seamless reception and decoding while switching between base stations ex106 to ex110. Furthermore, the receiving terminal can dynamically switch how much metadata is received or how much map information is updated, depending on the user's selection, the user's situation, or the state of the communication bandwidth.

[0382] As described above, the content supply system ex100 allows the client to receive, decode, and play back encoded information transmitted by the user in real time.

[0383] [Distribution of personal content] Furthermore, the ex100 content delivery system allows for unicast or multicast distribution of not only high-definition, long-duration content from video distribution companies, but also low-definition, short-duration content from individuals. It is also expected that the amount of such individual content will continue to increase. To improve the quality of individual content, the server may perform editing before encoding. This can be achieved, for example, with the following configuration.

[0384] During shooting, or after shooting, the server performs recognition processing such as detecting shooting errors, searching for scenes, analyzing semantics, and detecting objects from the original images or encoded data in real time. Based on the recognition results, the server manually or automatically edits the images, correcting out-of-focus or shaky images, deleting less important scenes such as those with lower brightness or out of focus compared to other pictures, emphasizing object edges, and changing color tones. The server then encodes the edited data based on the editing results. It is also known that viewership decreases if the shooting time is too long, so the server may automatically clip scenes with little movement, as well as less important scenes, based on the image processing results, to ensure that the content falls within a specific time range according to the shooting time. Alternatively, the server may generate and encode a digest based on the results of the semantic analysis of the scenes.

[0385] Furthermore, personal content may contain elements that infringe on copyright, moral rights, or portrait rights, and the scope of sharing may exceed the intended scope, which can be inconvenient for the individual. Therefore, for example, the server may intentionally change the image to one that is out of focus, such as the faces of people at the edges of the screen or the interior of a house, before encoding. The server may also recognize whether the face of a person other than those previously registered is visible in the image to be encoded, and if so, it may apply a mosaic effect to the face. Alternatively, as a pre- or post-processing step before encoding, the user can specify a person or background area that they want to process from a copyright perspective, and the server can replace the specified area with a different image or blur the focus. In the case of a person, the server can track the person in a video and replace the image of their face.

[0386] Furthermore, because viewing personal content with small data volumes requires real-time processing, depending on the bandwidth, the decoder prioritizes receiving, decoding, and playing the base layer first. During this time, the decoder can receive the enhancement layer, and if playback is looped or if the content is played more than once, it may play the high-quality video including the enhancement layer. With a stream that uses this scalable encoding, it is possible to provide an experience where the video is rough when unselected or at the beginning of viewing, but gradually the stream becomes smarter and the image quality improves. In addition to scalable encoding, a similar experience can be provided even if the rough stream played the first time and the second stream encoded by referencing the first video are configured as a single stream.

[0387] [Other usage examples] Furthermore, these encoding or decoding processes are generally performed by the LSI ex500 present in each terminal. The LSI ex500 may be a single chip or a multi-chip configuration. Alternatively, video encoding or decoding software may be embedded in some recording medium (such as a CD-ROM, flexible disk, or hard disk) that can be read by a computer ex111, and the encoding or decoding process may be performed using that software. In addition, if the smartphone ex115 has a camera, video data acquired by that camera may be transmitted. In this case, the video data is data encoded by the LSI ex500 present in the smartphone ex115.

[0388] The LSI ex500 may also be configured to be activated by downloading application software. In this case, the terminal first determines whether it supports the content encoding method or whether it has the capability to perform the specific service. If the terminal does not support the content encoding method or does not have the capability to perform the specific service, the terminal downloads the codec or application software, and then acquires and plays the content.

[0389] Furthermore, at least one of the video encoding device (image encoding device) or video decoding device (image decoding device) of each of the above embodiments can be incorporated not only into the content supply system ex100 via the internet ex101 but also into a digital broadcasting system. Because a digital broadcasting system transmits and receives multiplexed data, in which video and sound are multiplexed, on broadcast radio waves using satellites, etc., it is better suited to multicast, whereas the configuration of the content supply system ex100 lends itself to unicast. The encoding and decoding processes, however, are similar and can be applied in the same way.

[0390] [Hardware configuration] Figure 28 shows the smartphone ex115. Figure 29 shows an example of the configuration of the smartphone ex115. The smartphone ex115 includes an antenna ex450 for transmitting and receiving radio waves with the base station ex110, a camera unit ex465 capable of taking video and still images, and a display unit ex458 that displays video captured by the camera unit ex465 and data decoded from video received by the antenna ex450. The smartphone ex115 further includes an operation unit ex466, such as a touch panel, an audio output unit ex457, such as a speaker for outputting voice or sound, an audio input unit ex456, such as a microphone for inputting voice, a memory unit ex467 capable of storing captured video or still images, recorded audio, received video or still images, encoded data such as emails, or decoded data, and a slot unit ex464, which is an interface unit with SIM ex468 for identifying the user and authenticating access to various data, including the network. External memory may be used instead of the memory unit ex467.

[0391] Furthermore, the main control unit ex460, which comprehensively controls the display unit ex458 and the operation unit ex466, is connected via the bus ex470 to the power supply circuit unit ex461, the operation input control unit ex462, the video signal processing unit ex455, the camera interface unit ex463, the display control unit ex459, the modulation / demodulation unit ex452, the multiplexing / demultiplexing unit ex453, the audio signal processing unit ex454, the slot unit ex464, and the memory unit ex467.

[0392] The power supply circuit unit ex461, when the power key is turned on by the user, supplies power from the battery pack to each component, thereby starting up the smartphone ex115 and making it operational.

[0393] The smartphone ex115 performs tasks such as phone calls and data communication based on the control of the main control unit ex460, which has a CPU, ROM, RAM, etc. During a call, the audio signal picked up by the audio input unit ex456 is converted into a digital audio signal by the audio signal processing unit ex454, which is then subjected to spread spectrum processing by the modulation / demodulation unit ex452, and after digital-to-analog conversion and frequency conversion processing by the transmission / reception unit ex451, it is transmitted via the antenna ex450. Similarly, received data is amplified, subjected to frequency conversion and analog-to-digital conversion processing, subjected to spectrum despreading by the modulation / demodulation unit ex452, converted into an analog audio signal by the audio signal processing unit ex454, and then output from the audio output unit ex457. In data communication mode, text, still images, or video data are sent to the main control unit ex460 via the operation input control unit ex462 in response to operation of the operation unit ex466 of the main unit, and transmission and reception processing is performed in the same manner. When transmitting video, still images, or video and audio in data communication mode, the video signal processing unit ex455 compresses and encodes the video signal stored in the memory unit ex467 or the video signal input from the camera unit ex465 using the video encoding method shown in each of the above embodiments, and sends the encoded video data to the multiplexing / demultiplexing unit ex453. The audio signal processing unit ex454 encodes the audio signal picked up by the audio input unit ex456 while the camera unit ex465 is capturing video or still images, and sends the encoded audio data to the multiplexing / demultiplexing unit ex453. The multiplexing / demultiplexing unit ex453 multiplexes the encoded video data and encoded audio data in a predetermined manner, the modulation / demodulation unit (modulation / demodulation circuit unit) ex452 and the transmission / reception unit ex451 perform modulation and conversion processing, and the data is transmitted via the antenna ex450.

[0394] When receiving video attached to an email or chat, or video linked to a webpage, etc., the multiplexing / demultiplexing unit ex453 demultiplexes the multiplexed data received via antenna ex450 so that it can be decoded, dividing it into a video data bitstream and an audio data bitstream. It then supplies the encoded video data to the video signal processing unit ex455 and the encoded audio data to the audio signal processing unit ex454 via the synchronization bus ex470. The video signal processing unit ex455 decodes the video signal using a video decoding method corresponding to the video encoding method shown in each embodiment above, and the video or still image contained in the linked video file is displayed on the display unit ex458 via the display control unit ex459. The audio signal processing unit ex454 decodes the audio signal, and audio is output from the audio output unit ex457. However, since real-time streaming is widespread, there may be situations where audio playback is socially inappropriate depending on the user's circumstances. Therefore, as an initial setting, it is preferable to have a configuration that plays only video data and not audio signals. Audio may be synchronized and played only when the user performs an action, such as clicking on video data.

[0395] Furthermore, although the smartphone ex115 was used as an example here, there are three possible implementation formats for terminals: a transceiver-type terminal that has both an encoder and a decoder, a transmitting terminal that has only an encoder, and a receiving terminal that has only a decoder. In addition, although it was explained that multiplexed data, in which audio data etc. is multiplexed with video data, is received or transmitted in a digital broadcasting system, the multiplexed data may also include text data related to the video in addition to audio data, or the video data itself may be received or transmitted instead of multiplexed data.

[0396] Although the main control unit ex460, which includes the CPU, was described as controlling the encoding or decoding process, terminals often also have a GPU. Therefore, a configuration is also possible in which a wide area is processed at once by exploiting the performance of the GPU, using memory shared by the CPU and the GPU or memory whose addresses are managed so that both can use it. This can shorten the encoding time, ensure real-time performance, and achieve low latency. In particular, it is efficient to perform motion estimation, deblocking filtering, SAO (Sample Adaptive Offset), and transform / quantization collectively on the GPU rather than on the CPU, in units such as pictures.

[0397] This embodiment may be implemented in combination with at least some of the other embodiments of this disclosure. Furthermore, some of the processes, some of the configurations of the apparatus, some of the syntax, etc., described in the flowchart of this embodiment may be implemented in combination with the other embodiments.

[Industrial applicability]

[0398] This disclosure can be used, for example, in television receivers, digital video recorders, car navigation systems, mobile phones, digital cameras, digital video cameras, video conferencing systems, or electronic mirrors.

[Explanation of symbols]

[0399]
100 Encoding device
102 Splitting unit
104 Subtraction unit
106 Transform unit
108 Quantization unit
110 Entropy encoding unit
112, 204 Inverse quantization unit
114, 206 Inverse transform unit
116, 208 Addition unit
118, 210 Block memory
120, 212 Loop filter unit
122, 214 Frame memory
124, 216 Intra prediction unit
126, 218 Inter prediction unit
128, 220 Prediction control unit
160, 260 Circuit
162, 262 Memory
200 Decoding device
202 Entropy decoding unit

Claims

1. An encoding device comprising: a circuit; and memory, wherein the circuit performs motion compensation processing using the memory, the motion compensation processing including a first operating mode and a second operating mode; in the first operating mode, a first motion vector is derived, in units of prediction blocks obtained by dividing an image included in a video, by searching a region surrounding a motion vector derived in merge mode, and first motion compensation processing is performed that generates a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector; and in the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, the prediction block is divided into a plurality of sub-blocks and a second motion vector for bidirectional prediction is derived for each sub-block using the motion vectors of the plurality of control points, and second motion compensation processing is performed that generates a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector.

2. A decoding device comprising: a circuit; and memory, wherein the circuit performs motion compensation processing using the memory, the motion compensation processing including a first operating mode and a second operating mode; in the first operating mode, a first motion vector is derived, in units of prediction blocks obtained by dividing an image included in a video, by searching a region surrounding a motion vector derived in merge mode, and first motion compensation processing is performed that generates a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector; and in the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, the prediction block is divided into a plurality of sub-blocks and a second motion vector for bidirectional prediction is derived for each sub-block using the motion vectors of the plurality of control points, and second motion compensation processing is performed that generates a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector.

3. An encoding method comprising performing motion compensation processing, wherein the motion compensation processing includes a first operating mode and a second operating mode; in the first operating mode, a first motion vector is derived, in units of prediction blocks obtained by dividing an image included in a video, by searching a region surrounding a motion vector derived in merge mode, and first motion compensation processing is performed that generates a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector; and in the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, the prediction block is divided into a plurality of sub-blocks and a second motion vector for bidirectional prediction is derived for each sub-block using the motion vectors of the plurality of control points, and second motion compensation processing is performed that generates a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector.

4. A decoding method comprising performing motion compensation processing, wherein the motion compensation processing includes a first operating mode and a second operating mode; in the first operating mode, a first motion vector is derived, in units of prediction blocks obtained by dividing an image included in a video, by searching a region surrounding a motion vector derived in merge mode, and first motion compensation processing is performed that generates a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector; and in the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, the prediction block is divided into a plurality of sub-blocks and a second motion vector for bidirectional prediction is derived for each sub-block using the motion vectors of the plurality of control points, and second motion compensation processing is performed that generates a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector.

5. A transmission method for transmitting a bitstream, comprising: performing motion compensation processing that includes a first operating mode and a second operating mode, wherein, in the first operating mode, a first motion vector is derived, in units of prediction blocks obtained by dividing an image included in a video, by searching a region surrounding a motion vector derived in merge mode, and first motion compensation processing is performed that generates a predicted image by referring to a spatial gradient of luminance in an image generated by motion compensation using the first motion vector, and, in the second operating mode, motion vectors of a plurality of control points of the prediction block are derived based on motion vectors of blocks spatially adjacent to the prediction block, the prediction block is divided into a plurality of sub-blocks and a second motion vector for bidirectional prediction is derived for each sub-block using the motion vectors of the plurality of control points, and second motion compensation processing is performed that generates a predicted image without referring to a spatial gradient of luminance in an image generated by motion compensation using the second motion vector; generating a bitstream containing information for deriving the first motion vector or the second motion vector; and transmitting the generated bitstream.
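As an informal illustration of the second operating mode recited in the claims, the sketch below derives one motion vector per sub-block from control-point motion vectors. It is not the claimed implementation. It assumes the commonly used 4-parameter affine model with two control points (top-left `v0`, top-right `v1`), a 4x4 sub-block size, and evaluation at each sub-block centre; none of these choices is fixed by the claims, and the function name `affine_subblock_mvs` is hypothetical. For bidirectional prediction, the same derivation would be run once per reference picture.

```python
from typing import List, Tuple

MV = Tuple[float, float]  # (horizontal, vertical) motion vector components


def affine_subblock_mvs(v0: MV, v1: MV, width: int, height: int,
                        sub: int = 4) -> List[List[MV]]:
    """Derive one motion vector per sub-block from two control-point vectors.

    Assumed model (4-parameter affine), evaluated at sub-block centre (x, y):
        mv_x = a * x - b * y + v0_x
        mv_y = b * x + a * y + v0_y
    with a = (v1_x - v0_x) / width and b = (v1_y - v0_y) / width.
    """
    a = (v1[0] - v0[0]) / width
    b = (v1[1] - v0[1]) / width
    rows: List[List[MV]] = []
    for y0 in range(0, height, sub):
        row: List[MV] = []
        for x0 in range(0, width, sub):
            cx, cy = x0 + sub / 2, y0 + sub / 2  # sub-block centre
            row.append((a * cx - b * cy + v0[0], b * cx + a * cy + v0[1]))
        rows.append(row)
    return rows
```

When both control-point vectors are equal, every sub-block receives that same vector, so the sketch reduces to ordinary translational motion compensation. In the second operating mode, the resulting per-sub-block vectors would be used directly for motion compensation, with no luminance-gradient refinement step applied afterwards.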