Image processing device, imaging device, control method and program for image processing device

The image processing device uses neural network-based inference to determine encoding modes for multiple schemes, reducing circuit size and power consumption by sharing common functional blocks.

JP2026096870A · Pending · Publication Date: 2026-06-15 · CANON KK

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: CANON KK
Filing Date: 2024-12-03
Publication Date: 2026-06-15

Application Information

IPC: H04N19/189; H04N19/102

CPC: H04N19/11; H04N19/189; H04N19/14; H04N19/119; H04N19/176

AI Tagging

Application Domain

Digital video signal modification

Abstract

This technology provides a way to reduce the circuit size when an image processing device is designed to support multiple encoding schemes. [Solution] An image processing apparatus comprising: an encoding mode determination means that determines a second encoding mode in a second encoding scheme different from the first encoding scheme for an image to be encoded, based on information of a first encoding mode in a first encoding scheme for the image to be encoded, by inference using a neural network; and an encoding means that encodes the image to be encoded using the second encoding mode.

Description

Technical Field

[0001] The present invention relates to an image processing apparatus, an imaging apparatus, a control method of an image processing apparatus, and a program.

Background Art

[0002] In recent years, multiple schemes such as AOMedia Video 1 (AV1) and Versatile Video Coding (VVC) have been proposed as next-generation video coding schemes (Non-Patent Document 1 and Non-Patent Document 2). AV1 is expected to be used in video distribution services, and VVC in next-generation digital terrestrial broadcasting. Because the two target different users, an image processing apparatus needs to implement both codecs. In that case, the ASIC that implements the codecs of the image processing apparatus must support the new schemes (AV1 / VVC) in addition to the conventional coding schemes (H.264 / HEVC), so the circuit scale may become enormous.

Prior Art Documents

Non-Patent Documents

[0003]

Non-Patent Document 1

Non-Patent Document 2

Problem to Be Solved by the Invention

[0004] The present invention aims to provide a technology that enables a reduction in circuit size when an image processing device is made compatible with multiple encoding schemes.

Means for Solving the Problem

[0005] The invention for solving the above problem is an image processing apparatus comprising: an encoding mode determination means that determines, by inference using a neural network, a second encoding mode in a second encoding scheme different from a first encoding scheme for an image to be encoded, based on information of a first encoding mode in the first encoding scheme for the image to be encoded; and an encoding means that encodes the image to be encoded using the second encoding mode.

Effects of the Invention

[0006] According to the present invention, it is possible to provide a technology that enables a reduction in circuit size when an image processing device is made compatible with multiple encoding schemes.

Brief Description of the Drawings

[0007]
[Figure 1A] A diagram showing an example of the configuration of an image processing system corresponding to the embodiment.
[Figure 1B] A diagram showing an example of the configuration of an image processing device corresponding to the embodiment.
[Figure 2A] A diagram showing an example of the configuration of an image processing device corresponding to the embodiment.
[Figure 2B] A diagram showing an example of the configuration of an image processing device corresponding to Embodiment 1.
[Figure 3] A flowchart showing an example of the process corresponding to the embodiment.
[Figure 4] A diagram illustrating the prediction directions in each coding scheme described in the embodiments.
[Figure 5] An explanatory diagram showing an example of an intra-predicted image generation method in HEVC.
[Figure 6A] A diagram showing an example of a functional configuration for updating inference parameters corresponding to an embodiment.
[Figure 6B] A diagram showing another example of a functional configuration for updating inference parameters corresponding to an embodiment.
[Figure 7A] An explanatory diagram of an example configuration of a neural network corresponding to an embodiment.
[Figure 7B] A diagram showing an example configuration of neurons in a neural network corresponding to an embodiment.
[Figure 8A] A diagram showing an example of a block splitting pattern for each coding scheme described in the embodiment.
[Figure 8B] A diagram showing an example of integrating the HEVC splitting patterns in units of 128×128 pixels.
[Figure 8C] A diagram showing an example of a schematic configuration of an image processing device corresponding to Embodiment 2.
[Figure 8D] A flowchart showing an example of a process corresponding to Embodiment 2.
[Figure 9A] An explanatory diagram comparing the block splitting method of HEVC and the block splitting method of VVC.
[Figure 9B] A diagram showing an example of a schematic configuration of an image processing device corresponding to Embodiment 3.
[Figure 9C] A diagram showing an example of a functional configuration for updating inference parameters corresponding to Embodiment 3.
[Figure 9D] An explanatory diagram of the relationship between an object contour and a block splitting boundary corresponding to Embodiment 3.
[Figure 10A] A diagram showing an example of an image to be coded corresponding to Embodiments 4 and 5.
[Figure 10B] A diagram showing an example of a schematic configuration of an image processing device corresponding to Embodiments 4 and 5.

Embodiments for Carrying Out the Invention

[0008] Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Note that the following embodiments do not limit the invention according to the claims. Although a plurality of features are described in the embodiments, not all of these plurality of features are essential for the invention, and the plurality of features may be arbitrarily combined. Further, in the accompanying drawings, the same or similar configurations are given the same reference numerals, and duplicate explanations are omitted.

[0009] (Embodiment 1) FIG. 1A shows a configuration example of an image processing system 1 according to this embodiment. In this embodiment, an imaging device 20 is connected to an image processing device 10, and an image captured by the imaging device 20 is processed in the image processing device 10 to generate a stream. In the image processing device 10, for example, encoding processing according to a plurality of encoding methods described in this embodiment is performed. The stream generated in the image processing device 10 is transmitted to an external management device 30 via a network 40. The management device 30 can execute processes such as displaying the received stream on a screen, changing the settings of the image processing device 10 according to the display content, and changing the settings such as the imaging direction and imaging conditions of the imaging device 20. The image processing system 1 can be configured, for example, as a surveillance camera system, and a plurality of imaging devices 20 can be arranged in a surveillance target area, and surveillance images of the surveillance area for each imaging device 20 can be provided to the management device 30.

[0010] In FIG. 1A, the image processing device 10 and the imaging device 20 are described as independent devices, but they may be configured as an integrated imaging device.

[0011] Next, the configuration of the image processing device 10 of this embodiment will be described with reference to Figure 1B. The image processing device 10 is configured, for example, as a device that encodes an image captured by the imaging device 20 and outputs an encoded stream. The image processing device 10 may be implemented as a circuit using an ASIC or FPGA, or it may be configured with a CPU and a memory that stores an executable program. For H.264 or HEVC (H.265) (hereinafter collectively referred to as "HEVC") as the first encoding scheme, the image processing device 10 includes a first encoding mode determination unit 101, a first encoding unit 102, and a first encoded stream storage unit 103. For AV1 or VVC as the second encoding scheme, the image processing device 10 also includes a second encoding mode determination unit 104, a second encoding unit 105, and a second encoded stream storage unit 106, as well as an encoding scheme setting unit 107.

[0012] For example, an image captured by the imaging device 20 is input to the first encoding mode determination unit 101 and the second encoding mode determination unit 104 as an image to be encoded (input image). The first encoding mode determination unit 101 determines the HEVC encoding mode, outputs the difference image generated according to the determined encoding mode to the first encoding unit 102, and outputs information on the determined encoding mode to the second encoding mode determination unit 104. The first encoding unit 102 performs encoding processing according to the first encoding scheme, such as integer transform, quantization, and entropy encoding, on the input difference image, and stores the resulting first encoded stream in the first encoded stream storage unit 103.

[0013] The second encoding mode determination unit 104 determines the AV1 / VVC encoding mode by referring to the encoding mode information of the first encoding scheme input from the first encoding mode determination unit 101, and outputs the difference image generated according to the determined encoding mode to the second encoding unit 105. The second encoding unit 105 performs encoding processing according to the second encoding scheme, such as integer transform, quantization, and entropy encoding, on the input difference image, and stores the resulting second encoded stream in the second encoded stream storage unit 106.

[0014] When the user of the image processing device 10 selects one of the encoding schemes HEVC, AV1, or VVC, the encoding scheme setting unit 107 notifies at least the first encoding unit 102 and the second encoding mode determination unit 104 of the specified encoding scheme. The first encoding unit 102 performs encoding processing when HEVC is specified, and the second encoding mode determination unit 104 determines the encoding mode for the specified encoding scheme. In this embodiment, when HEVC is specified, neither the second encoding mode determination process in the second encoding mode determination unit 104 nor the second encoding process in the second encoding unit 105 is performed. On the other hand, when either AV1 or VVC is specified, the first encoding mode determination process in the first encoding mode determination unit 101 is still performed, while the first encoding process in the first encoding unit 102 is not.

[0015] Next, referring to Figure 2A, an example of the functional configuration common to the encoding mode determination units and encoding units for HEVC, AV1, and VVC in the image processing device 10 will be described. As will be described later, one of the features of this embodiment is that part of the AV1 and VVC configurations is shared by replacing it with a neural network. Figure 2A, however, shows the basic configuration before the neural network is introduced, for the purpose of explaining each encoding mode determination unit and each encoding unit. Each component is described below.

[0016] The current frame storage unit 201 receives the image to be encoded, captured by the imaging device 20, and temporarily stores it. The block size determination unit 202 determines the encoding block size for the image to be encoded. The determined block size is output to the intra prediction unit 203 and the inter prediction unit 204. The block size determination unit 202 also outputs the image of the current frame to the intra prediction unit 203, the inter prediction unit 204, and the subtractor 207.

[0017] The intra-prediction unit 203 divides the frame image to be processed into predetermined block units, predicts the image of each block from the pixels surrounding the block, and generates a predicted image. As will be described later, the prediction modes in intra-prediction differ for HEVC, AV1, and VVC. For example, HEVC has 35 modes (directional prediction in 33 directions, Planar prediction, DC prediction), AV1 has 60 modes (directional prediction in 56 directions, 3 Smooth predictions, Paeth filter), and VVC has 81 modes (directional prediction in 65 directions, Planar prediction, DC prediction, wide-angle prediction).

[0018] The inter prediction unit 204 divides the input current frame image into predetermined block units and, for each block, performs motion search processing to detect the position that has a high correlation with the reference frame image stored in the reference frame storage unit 211, and detects the displacement between those positions as motion information between frames. The accuracy of inter prediction also differs depending on the scheme: for the luminance signal, HEVC has 1 / 4 pixel accuracy, AV1 has 1 / 8 pixel accuracy, and VVC has 1 / 16 pixel accuracy.

[0019] The motion compensation unit 205 performs motion compensation, generating a predicted image of the current frame image from the reference frame image and the motion information. The motion compensation algorithm also differs for each scheme: HEVC uses a set of motion compensation interpolation filters for motion vector search, while VVC uses, in addition to the same set of interpolation filters, a set of motion compensation interpolation filters based on affine transformation. AV1 uses five types of motion compensation interpolation filters. The switch unit 206 is a switching mechanism that selects either the predicted image output from the intra prediction unit 203 or the predicted image output from the motion compensation unit 205. The output of the switch unit 206 is supplied to the subtractor 207 and the adder 208.

[0020] The subtractor 207 subtracts the predicted image from the current frame image and outputs the resulting difference image to the frequency conversion unit 212. The adder 208 adds the predicted image and the decoded difference image output from the inverse frequency conversion unit 216 to generate a decoded image. The decoded image is stored in the decoded image storage unit 209 and output to the intra prediction unit 203 and the deblocking filter 210. The deblocking filter 210 performs a deblocking filter process to correct discontinuities in the data at the boundaries of the predetermined block units. The result of the deblocking filter process is output to the reference frame storage unit 211 and stored as a reference frame for use by the inter prediction unit 204.

[0021] The frequency conversion unit 212 performs an integer transform on the difference image provided by the subtractor 207 and outputs the result to the quantization unit 213. The integer transform also differs depending on the scheme. HEVC employs DCT2 (Discrete Cosine Transform type 2) and supports only square blocks. AV1, on the other hand, employs DCT, ADST (Asymmetric Discrete Sine Transform), Inverse ADST (Inverse Asymmetric Discrete Sine Transform), and the identity transform, and supports not only square blocks but also rectangular blocks (4x8 pixels, 8x16 pixels, etc.). VVC switches among multiple orthogonal transforms such as DCT2, DCT8, and DST7, and supports both square and rectangular blocks.

[0022] The quantization unit 213 quantizes the transform coefficients obtained by the integer transform at a predetermined quantization scale. The entropy coding unit 214 compresses the quantized transform coefficients by entropy coding. The processing here also differs depending on the scheme. HEVC employs context-adaptive binary arithmetic coding (CABAC), AV1 employs adaptive multi-symbol arithmetic coding (AMSAC), and VVC employs adaptive quantization of transform coefficients in addition to CABAC. The quantization results are also output to the inverse quantization unit 215, which performs a predetermined inverse quantization process on the quantized transform coefficients and outputs them to the inverse frequency conversion unit 216. The inverse frequency conversion unit 216 performs an inverse integer transform to return the inversely quantized transform coefficients to the original image data space, and outputs the resulting difference image to the adder 208.
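The local decoding loop formed by the subtractor 207, the quantization unit 213, the inverse quantization unit 215, and the adder 208 can be sketched as follows. This is a minimal illustration only: the function name `encode_block`, the sample pixel values, and the use of a plain scalar quantizer in place of the transform and entropy coding stages are assumptions, not part of the described device.

```python
# Illustrative sketch: a scalar quantizer stands in for the transform,
# quantization, and entropy-coding stages of Figure 2A.

def encode_block(current, predicted, qstep):
    """Return (quantized levels, reconstructed block) for one block."""
    # Subtractor 207: difference image = current - predicted.
    residual = [c - p for c, p in zip(current, predicted)]
    # Quantization unit 213 (the transform is omitted in this sketch).
    levels = [round(r / qstep) for r in residual]
    # Inverse quantization unit 215.
    dequantized = [lv * qstep for lv in levels]
    # Adder 208: decoded image = predicted + decoded difference.
    reconstructed = [p + d for p, d in zip(predicted, dequantized)]
    return levels, reconstructed


# Hypothetical 4-pixel block and a flat prediction.
levels, recon = encode_block([100, 104, 99, 97], [96, 96, 96, 96], qstep=4)
```

The reconstructed block differs from the input wherever quantization discarded information, which is why the encoder predicts from decoded images rather than from the original frames.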

[0023] The above is an example of a configuration common to HEVC, AV1, and VVC in the image processing device 10. However, as mentioned above, the processing implemented in each functional block differs between schemes, so a corresponding configuration must be prepared for each encoding scheme. The configuration shown in Figure 2A is generally implemented in an ASIC, and providing a separate configuration for each encoding scheme inevitably increases the circuit size. Therefore, this embodiment adopts a circuit configuration in which the second encoding mode is determined by inference from the encoding mode of the first encoding scheme determined in the first encoding mode determination unit 101, thereby reducing the circuit size.

[0024] Specifically, an example of the configuration of the image processing device 10 for reducing the circuit size in this embodiment will be described with reference to Figure 2B. Figure 2B shows a configuration in which the AV1 encoding mode determination unit 104a and the VVC encoding mode determination unit 104v, which constitute the second encoding mode determination unit 104, share the intra prediction unit 203. In Figure 2B, the intra prediction unit 203 of the VVC encoding mode determination unit 104v is replaced with a neural network (NN) intra prediction unit 203n, and this unit is shared with the AV1 encoding mode determination unit 104a as its intra prediction unit. Hereinafter, among the functional blocks in Figure 2A, those whose reference number has the suffix "a" are for AV1, those with the suffix "v" are for VVC, and those with the suffix "h" are for HEVC. In addition, the reference number of a functional block that has been replaced with a neural network and shared has the suffix "n".

[0025] In the configuration shown in Figure 2B, the NN intra prediction unit 203n receives a signal from the encoding scheme setting unit 107 indicating which encoding scheme among HEVC, AV1, and VVC has been selected. If the VVC encoding scheme is selected, the NN intra prediction unit 203n operates as the NN intra prediction unit in the VVC encoding mode determination unit 104v. The NN intra prediction unit 203n obtains the result of the HEVC intra prediction and the image to be encoded from the first encoding mode determination unit 101, and performs VVC intra prediction by inference using the HEVC intra prediction result and the image to be encoded.

[0026] On the other hand, if the AV1 encoding scheme is selected, the NN intra prediction unit 203n operates as the NN intra prediction unit in the AV1 encoding mode determination unit 104a. The NN intra prediction unit 203n obtains the result of the HEVC intra prediction and the image to be encoded from the first encoding mode determination unit 101, and performs AV1 intra prediction by inference using the HEVC intra prediction result and the image to be encoded.

[0027] If HEVC is selected in the encoding scheme setting unit 107, only encoding using the first encoding scheme is performed, and encoding using AV1 and VVC is not performed. Therefore, if the NN intra prediction unit 203n detects that HEVC has been selected, it stops the operation of the AV1 and VVC encoding mode determination units. This eliminates unnecessary encoding processing and reduces power consumption.

[0028] With the configuration shown in Figure 2B, the dedicated intra prediction unit 203 is removed from both the AV1 encoding mode determination unit 104a and the VVC encoding mode determination unit 104v, and a single NN intra prediction unit 203n is shared between them. In this way, the more modes that can be determined by neural network inference, the smaller the circuit size can be made. In Figure 2B, only the intra prediction unit 203 is shared, but other units such as the block size determination unit 202 and the inter prediction unit 204 may also be replaced with neural networks and shared.
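The idea of one shared block that changes behavior by swapping inference parameters can be sketched in software as follows. All names here (`InferenceParams`, `SharedNNIntraPredictor`, `configure`, `infer_mode`) are hypothetical, and the "inference" is only a placeholder that rescales the HEVC direction index; an actual NN intra prediction unit 203n would run a trained network on the HEVC result and the image to be encoded.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class InferenceParams:
    """Stand-in for one trained set of weights and biases."""
    scheme: str
    num_directional_modes: int


class SharedNNIntraPredictor:
    """One block serving both AV1 and VVC, in the spirit of unit 203n."""

    def __init__(self, params_by_scheme: Dict[str, InferenceParams]):
        self._params_by_scheme = params_by_scheme
        self._active: Optional[InferenceParams] = None

    def configure(self, scheme: str) -> None:
        # HEVC has no entry, so the shared block becomes idle.
        self._active = self._params_by_scheme.get(scheme)

    def infer_mode(self, hevc_mode: int) -> Optional[int]:
        if self._active is None:
            return None
        # Placeholder for NN inference: proportional rescaling of the
        # HEVC direction index (1..33) onto the target mode range.
        span = self._active.num_directional_modes - 1
        return 1 + round((hevc_mode - 1) * span / 32)


predictor = SharedNNIntraPredictor({
    "AV1": InferenceParams("AV1", 56),
    "VVC": InferenceParams("VVC", 65),
})
```

Calling `predictor.configure("VVC")` before `predictor.infer_mode(...)` corresponds to the setting signal from the encoding scheme setting unit 107; configuring with `"HEVC"` leaves the block idle, mirroring the power-saving behavior described in [0027].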

[0029] Next, an example of the encoding process corresponding to this embodiment, which is performed based on the configurations shown in Figures 2A and 2B, will be described. Figure 3 is a flowchart showing an example of the process corresponding to this embodiment.

[0030] First, in S301, the encoding scheme setting unit 107 accepts the selection of an encoding scheme. In this embodiment, the encoding scheme is selected from HEVC, AV1, and VVC. Next, in S302, the first encoding mode determination unit 101 performs the encoding mode determination process for HEVC. The encoding mode information determined here includes, for example, the block size, the intra prediction mode, and the inter prediction block division. Next, in S303, the encoding scheme accepted in S301 is checked: if it is HEVC, the process proceeds to S304; if it is AV1, to S305; and if it is VVC, to S308.

[0031] In S304, the first encoding unit 102 performs encoding processing in HEVC. Meanwhile, in S305, the encoding mode determination unit 104a for AV1 of the second encoding mode determination unit 104 selects NN parameters (inference parameters) for AV1, and in S306, the AV1 encoding mode determination process is executed. For example, in the configuration of Figure 2B, the encoding mode determined in HEVC encoding is referenced, and the NN intra prediction unit 203n performs intra prediction of AV1 by inference using the inference parameters for AV1, and also performs inter prediction. In the subsequent S307, the AV1 encoding unit 105a of the second encoding unit 105 performs encoding processing on the difference image obtained by subtracting the predicted image generated in the determined encoding mode from the frame image.

[0032] In S308, the VVC encoding mode determination unit 104v of the second encoding mode determination unit 104 selects the NN parameters (inference parameters) for VVC, and in S309, the VVC encoding mode determination process is executed. For example, in the configuration of Figure 2B, the encoding mode determined in HEVC encoding is referenced, and the NN intra prediction unit 203n performs intra prediction of VVC by inference using the VVC inference parameters, and also performs inter prediction. In the subsequent S310, the VVC encoding unit 105v of the second encoding unit 105 performs encoding processing on the difference image obtained by subtracting the predicted image generated in the determined encoding mode from the frame image.
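The branching of the flowchart in Figure 3 can be summarized as a step trace. The function name `encoding_steps` is an assumption introduced for illustration; the step labels are those used in the description above.

```python
from typing import List


def encoding_steps(scheme: str) -> List[str]:
    """Return the flowchart steps executed for the selected scheme."""
    if scheme not in ("HEVC", "AV1", "VVC"):
        raise ValueError(f"unsupported encoding scheme: {scheme}")
    # S301: accept the selection. S302: HEVC mode determination, which
    # runs for every scheme. S303: branch on the selection.
    steps = ["S301", "S302", "S303"]
    if scheme == "HEVC":
        steps.append("S304")                   # HEVC encoding
    elif scheme == "AV1":
        steps += ["S305", "S306", "S307"]      # AV1 params, mode, encoding
    else:
        steps += ["S308", "S309", "S310"]      # VVC params, mode, encoding
    return steps
```

Note that S302 appears in every trace: the HEVC mode determination is the common input on which the AV1 and VVC inference depends.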

[0033] As described above, in this embodiment, the encoding mode in the second encoding scheme is determined by performing inference using a neural network based on the encoding mode determined for the first encoding scheme. This utilizes the correlation between the first encoding mode and the second encoding mode, and the relationship between the two will be explained below.

[0034] In the following, the intra prediction mode is explained as an example of the encoding modes. As shown in Figure 4(A), HEVC intra prediction has 33 angular prediction modes, plus DC prediction and Planar prediction. Angular prediction predicts a block by interpolating adjacent pixels along a given direction, and that direction is selected from the 33 modes. The 33 modes can be divided into four groups, labeled A through D: modes 1 through 9 are group D, modes 9 through 17 are group C, modes 17 through 26 are group A, and modes 26 through 33 are group B.

[0035] In contrast, as shown in Figure 4(B), VVC intra prediction offers 65 directional prediction modes, as well as DC prediction and Planar prediction. The 65 directional modes are almost double HEVC's 33. In VVC, too, the modes can be divided into four groups corresponding to groups A through D in HEVC. Therefore, if a group A mode is selected in HEVC, there is a high probability that the mode selected in VVC will also lie in the direction of the corresponding group A. In addition, VVC has prediction modes called wide-angle prediction, numbered -1 to -14 and 67 to 80, which can be selected only for non-square prediction blocks and can predict from directions beyond the maximum angle of ordinary directional prediction.

[0036] Furthermore, as shown in Figure 4(C), AV1 intra prediction includes 56 directional prediction modes, as well as 3 Smooth predictions and Paeth prediction. While 56 directional modes are fewer than in VVC, they are still nearly double HEVC's 33. In AV1, too, the modes can be divided into four groups corresponding to HEVC's groups A through D. Therefore, if a group A mode is selected in HEVC, there is a high probability that AV1 will also predict in the direction of the corresponding group A.

[0037] Thus, by applying a neural network (NN) to perform inference based on the directional prediction mode (first prediction direction) in the intra-prediction mode of the HEVC scheme, it becomes possible to efficiently determine the directional prediction mode (second prediction direction) in VVC and AV1 as well. Furthermore, since the directional prediction mode determined in the HEVC scheme for the same encoded image has a high correlation with the directional prediction modes of AV1 and VVC, it is expected that the structure of the NN can be simplified by using this as input. On the other hand, in inference using an NN, the directional prediction mode determined in the HEVC scheme is not directly adopted, so it is possible to avoid the directional prediction mode in AV1 and VVC being limited to group A. For example, even if a group A mode is selected in HEVC, if that selected mode is located at the boundary between group A and group B, or between group A and group C, group A is not necessarily selected in VVC or AV1, and group B or group C may be selected. Inference using an NN can handle such cases.
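The grouping of HEVC directional modes, including the boundary modes that belong to two neighboring groups, can be expressed as a small lookup. The group ranges are taken from the description above; the names `HEVC_GROUPS` and `hevc_groups` are assumptions introduced for this sketch.

```python
# Group ranges as stated in the description (inclusive on both ends),
# so modes 9, 17, and 26 sit on a boundary between two groups.
HEVC_GROUPS = {
    "D": (1, 9),
    "C": (9, 17),
    "A": (17, 26),
    "B": (26, 33),
}


def hevc_groups(mode: int) -> set:
    """Return every direction group that contains the given HEVC mode."""
    return {g for g, (lo, hi) in HEVC_GROUPS.items() if lo <= mode <= hi}
```

A mode well inside a group, such as 20, yields only `{"A"}`, whereas a boundary mode such as 26 yields both `"A"` and `"B"`. These boundary cases are where a fixed group-to-group mapping would be too restrictive and where inference can select a mode from the neighboring group.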

[0038] Next, a specific example of an intra-predictive image generation method in the HEVC scheme will be explained with reference to Figure 5. Figure 5 shows an example of intra-prediction where a predicted image is generated based on a surrounding reference image 502 for a 4x4 pixel block image 501. Here, we will explain the case where the predicted image for pixel 503 at coordinates (3, 2) of the block image 501 is obtained.

[0039] For example, suppose that mode 22 is the direction indicated by the arrow from pixel 503, so that the slope of the reference direction is 13 / 32. Moving 3 units in the negative y-direction (vertical direction) to reach the row of the reference image 502 then shifts the position by 13 / 32 × 3 = 39 / 32 units in the x-direction (horizontal direction). Since 3 - 39 / 32 = 57 / 32, the reference position is (57 / 32, -1). However, there is no pixel at x = 57 / 32, which lies 25 / 32 from (1, -1) and 7 / 32 from (2, -1), so the value must be interpolated from these two pixels in inverse proportion to their distances. The predicted value is thus (7 × pixel value at (1, -1) + 25 × pixel value at (2, -1)) / 32.
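The arithmetic of this example can be checked with exact fractions. The sketch below assumes a reference row at y = -1 and uses made-up reference pixel values (100 and 132); the function name `predict_pixel` is hypothetical, and the rounding offset that a real codec applies to the integer result is omitted.

```python
import math
from fractions import Fraction


def predict_pixel(x, y, slope, top_row):
    """Interpolate the prediction for pixel (x, y) from the row y = -1.

    top_row maps an integer x to the reference pixel value at (x, -1).
    """
    rows_up = y + 1                    # distance from row y to row -1
    ref_x = x - slope * rows_up        # horizontal reference position
    left = math.floor(ref_x)           # nearest reference pixel on the left
    frac = ref_x - left                # fractional offset from that pixel
    # Linear interpolation: the nearer pixel gets the larger weight.
    value = (1 - frac) * top_row[left] + frac * top_row[left + 1]
    return ref_x, value


# Pixel 503 at (3, 2) with slope 13/32; reference values are made up.
ref_x, value = predict_pixel(3, 2, Fraction(13, 32), {1: 100, 2: 132})
```

With these inputs, `ref_x` is 57/32 and the weights are 7/32 for the pixel at (1, -1) and 25/32 for the pixel at (2, -1), matching the formula in the text.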

[0040] Thus, depending on the selected intra prediction mode, the predicted image may have to be obtained by proportionally blending adjacent reference pixels. Because VVC and AV1 have approximately twice as many modes as HEVC, they can generate predicted images with higher accuracy. Also, depending on the selected mode, proportional blending of reference pixels may not be necessary. For example, since VVC has about twice as many prediction modes as HEVC, even when no HEVC mode points directly at the desired pixel position, a VVC mode that does so may be available.

[0041] Next, we will describe the inference parameters of the neural network (NN) in the AV1 and VVC encoding mode determination unit in this embodiment. First, with reference to Figures 6A and 6B, we will describe an example of a method for learning the inference parameters corresponding to this embodiment. Figure 6A is a diagram showing an example of a functional configuration for updating the inference parameters. Figure 6B is a diagram showing another example of a functional configuration for updating the inference parameters. These configurations may be constructed using some of the configurations of the image processing device 10 in Figure 2B, or they may be constructed as a dedicated system for updating the inference parameters.

[0042] Regarding Figure 6A, in updating the inference parameters, first, the HEVC encoding mode determined for the input image, which is the image to be encoded, by the first encoding mode determination unit 101, along with the input image itself, are input to the second encoding mode determination unit 104. The second encoding mode determination unit 104 uses the inference parameters set at that time and determines the encoding modes for both AV1 and VVC by inference, taking into account the HEVC encoding mode supplied for the input image. The determined AV1 and VVC encoding modes are output to the parameter update unit 601, respectively.

[0043] The input image is also supplied to the reference unit 602, which determines the AV1 and VVC encoding modes for the input image by executing the encoder software provided by the respective standards organizations for AV1 and VVC. The output of the reference unit 602 can be used as training data for the prediction mode determined by the NN intra prediction unit 203n in the second encoding mode determination unit 104.

[0044] The parameter update unit 601 updates the inference parameters based on the encoding mode input from the second encoding mode determination unit 104 and the reference encoding mode input from the reference unit 602. The inference parameters can be weight coefficients and bias values as shown in Figure 7B. Details will be described later with reference to Figures 7A and 7B. The inference parameters updated in the parameter update unit 601 are fed back to the second encoding mode determination unit 104 and used when determining the encoding mode for the next input image.
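One way to picture the update performed by the parameter update unit 601 is a supervised gradient step in which the reference encoder's mode serves as the label. The sketch below is an assumption throughout: the description does not specify the loss function or the optimizer, so a single linear layer with softmax, cross-entropy loss, and plain gradient descent are used purely for illustration, and `update_step`, the feature vector, and the learning rate are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def update_step(weights, biases, features, target, lr=0.1):
    """One gradient step on cross-entropy loss; returns the loss before it.

    weights: one row of weight coefficients per candidate mode.
    biases:  one bias value per candidate mode.
    target:  index of the mode chosen by the reference encoder.
    """
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    for k, row in enumerate(weights):
        # Gradient of cross-entropy with respect to logit k.
        grad = probs[k] - (1.0 if k == target else 0.0)
        for j, x in enumerate(features):
            row[j] -= lr * grad * x
        biases[k] -= lr * grad
    return loss


# Three candidate modes, two made-up input features, reference mode 2.
weights = [[0.0, 0.0] for _ in range(3)]
biases = [0.0, 0.0, 0.0]
losses = [update_step(weights, biases, [1.0, 0.5], target=2) for _ in range(5)]
```

Repeating the step on the same sample lowers the loss each time, which is the software analogue of feeding the updated parameters back for the next input image.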

[0045] Next, with respect to Figure 6B, similar to Figure 6A, first, the HEVC encoding mode determined for the input image to be encoded by the first encoding mode determination unit 101, along with the input image itself, are input to the second encoding mode determination unit 104. The second encoding mode determination unit 104 uses the inference parameters set at that time and determines the encoding modes for both AV1 and VVC by inference, taking into account the HEVC encoding mode supplied for the input image. The determined AV1 and VVC encoding modes are output to the second encoding unit 105, respectively. In the configuration of Figure 6B, the second encoding unit 105 performs the AV1 and VVC encoding processing according to the determined AV1 and VVC encoding modes and outputs them to the second encoding stream storage 106. The encoded data held in the second encoding stream storage 106 is decoded by the second decoding unit 612, and the decoded image obtained by decoding and information on the generated code amount of the encoded data are output to the parameter update unit 611.

[0046] The input image is also supplied to the parameter update unit 611, which updates the inference parameters so that the difference between each of the decoded AV1 and VVC images and the input image is minimized. PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) can be used as evaluation metrics for this difference. In addition to the difference information, the generated code amount of the encoded data can also be considered when updating the inference parameters. For example, the inference parameters can be updated to maximize the value of Ei given by the following formula, where α and β are arbitrarily chosen coefficients: Ei = α × PSNR + β × (1 / generated code amount). The updated inference parameters are output to the second encoding mode determination unit 104 and used when determining the encoding mode for the next input image.
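
As an illustrative sketch only, the evaluation value Ei described above could be computed as follows. The function names, the 8-bit peak value of 255, and the default values of α and β are assumptions made for this example and are not specified in the embodiment.

```python
import math


def psnr(original, decoded, peak=255.0):
    """Return the PSNR in dB between two equally sized sequences of pixel values."""
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak * peak / mse)


def evaluation_value(original, decoded, code_amount, alpha=1.0, beta=1.0):
    """Ei = alpha * PSNR + beta * (1 / generated code amount)."""
    if code_amount <= 0:
        raise ValueError("generated code amount must be positive")
    return alpha * psnr(original, decoded) + beta * (1.0 / code_amount)
```

Under this formula, a candidate set of inference parameters that yields a higher PSNR or a smaller generated code amount receives a larger Ei, so the parameter update unit would prefer it.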

[0047] By repeating the above process, it becomes possible to set the inference parameters to more optimal values, and by performing inference with high accuracy based on the HEVC encoding mode notified by the first encoding mode determination unit 101, it becomes possible to determine the encoding mode of AV1 or VVC.

[0048] Next, with reference to Figures 7A and 7B, an example of the configuration of the NN intra prediction unit 203n will be described. First, Figure 7A is a diagram illustrating the configuration of a neural network (NN). As shown in Figure 7A, the NN of this embodiment can have a four-layer structure having an input layer 701, a first hidden layer 702, a second hidden layer 703, and an output layer 704. Two consecutive layers are connected by one or more neurons 710. Each neuron 710 receives the output values of the preceding layer, performs the calculation described later with reference to Figure 7B, and outputs the result to the subsequent layer. The NN intra prediction unit 203n can be configured in the same way as shown in Figure 7A.

[0049] The number of data points in0 to inN input to the input layer 701 matches the number of data points out0 to outN output from the output layer 704. On the other hand, the number of data points mid00 to mid0p in the first hidden layer 702 and the number of data points mid11 to mid1q in the second hidden layer 703 do not have to match the number of data points in the input layer 701 and the output layer 704. Therefore, the number of neurons 710 connecting the two layers can be any number of 1 or more. The data points in0 to inN input to the input layer 701 are, for example, the coding mode, input image, and reference image in HEVC, and the data points out0 to outN output from the output layer 704 are the prediction mode and prediction image in AV1 or VVC.

[0050] Next, Figure 7B illustrates the configuration of a neuron 710, which is the computational unit of the neural network shown in Figure 7A. As shown in Figure 7A, the NN of this embodiment can be composed of multiple neurons 710. The neuron 710 performs calculations on multiple input values x1 to xN using weights w1 to wN, bias b, and an activation function to output an output value y. The neuron 710 calculates a value x' using weight coefficients w1 to wN and bias value b, for example, as shown in equation (1) below. The weight coefficients w1 to wN and bias value b correspond to the inference parameters described above, and are values that are variably determined by a predetermined learning process, and can take different values depending on the encoding scheme of AV1 and VVC. [Math 1] $x' = \sum_{i=1}^{N} w_i x_i + b$ (1)

[0051] Next, neuron 710 inputs the calculated value x' into activation function 711 to calculate output y. Activation function 711 is a nonlinear function such as a sigmoid function or a ReLU function (Rectified Linear Unit). The output value y when the value x' is given to the sigmoid function can be obtained by the following equation (2). [Math 2] $y = \frac{1}{1 + e^{-x'}}$ (2)

[0052] The output value y when the value x' is given to the ReLU function can be obtained by the following equation (3). [Math 3] $y = \max(0, x')$ (3)
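
The calculation performed by a neuron 710 (a weighted sum of the inputs plus a bias, followed by a sigmoid or ReLU activation) can be sketched as follows. This is a minimal illustration using the standard definitions of these functions; the function names are chosen for this example, and the sigmoid is written in a numerically stable form as an implementation choice.

```python
import math


def weighted_sum(inputs, weights, bias):
    """Compute x' = w1*x1 + ... + wN*xN + b."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sigmoid(x):
    """Sigmoid activation, 1 / (1 + exp(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def relu(x):
    """ReLU activation, max(0, x)."""
    return max(0.0, x)


def neuron(inputs, weights, bias, activation=relu):
    """Output y of one neuron: activation applied to the weighted sum."""
    return activation(weighted_sum(inputs, weights, bias))
```

Switching between AV1 and VVC would then amount to calling the same function with a different set of weights and biases, which is consistent with the inference parameters being selected per encoding scheme.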

[0053] In the above description, an example configuration of the NN intra prediction unit 203n was explained, but the NN block size determination unit 202n and the NN inter prediction unit 204n can be configured similarly when block size and inter prediction block division information are provided as the encoding mode. By sharing the block size determination unit 202 and the inter prediction unit 204 in the AV1 and VVC image processing devices, the circuit size can be further reduced. Details will be described in subsequent embodiments.

[0054] According to the embodiments described above, in an image processing device that supports multiple encoding schemes, it becomes possible to determine the encoding mode in other encoding schemes by performing inference using a neural network based on the encoding mode determined in one encoding scheme, thereby reducing the circuit size of the other encoding schemes.

[0055] In the embodiments described above, AV1 and VVC were explained as examples of the second encoding scheme, but the embodiments of the second encoding scheme are not limited to these. For example, other encoding schemes that can utilize the encoding mode determined in the first encoding scheme can also be included.

[0056] (Embodiment 2) Next, Embodiment 2 will be described. The configuration of the image processing system in this embodiment is the same as that shown in Figure 1A. The basic configuration of the image processing device 10 is also the same as that shown in Figures 1B and 2A. Other changes corresponding to this embodiment will be described below as appropriate. In this embodiment, the processing when the information of the first encoding mode determined by the first encoding mode determination unit 101 and provided to the second encoding mode determination unit 104 includes information of the block size will be described.

[0057] Figure 8A schematically shows the block partitioning patterns used in the HEVC, AV1, and VVC encoding schemes. Partitioning pattern 801 shows the block partitioning pattern (first block partitioning pattern) for the first encoding scheme, HEVC. Here, a 64x64 pixel block can be partitioned into 32x32, 16x16, and 8x8 blocks using a quadtree. In contrast, the block partitioning pattern for the second encoding scheme (second block partitioning pattern) is different from the first block partitioning pattern. Specifically, partitioning pattern 802 shows the block partitioning pattern for AV1. Here, 10 different partitioning patterns are possible for 128x128 or 64x64 pixel blocks. Furthermore, the square four-way partitioning pattern shown in partitioning pattern 802a can be partitioned further. In AV1, block sizes from a minimum of 4x4 to a maximum of 128x128 pixels can be selected. Partitioning pattern 803 shows the block partitioning pattern for VVC. Here, six different partitioning patterns are possible for pixel blocks with a maximum CTU size of 128x128 pixels. In VVC as well, block sizes from a minimum of 4x4 to a maximum of 128x128 pixels can be selected.

[0058] Thus, the block partitioning patterns differ in the HEVC, AV1, and VVC encoding schemes, and the partitioning pattern used in HEVC cannot be directly adopted. However, by applying neural network inference while referencing the partitioning pattern in HEVC, the process of determining the partitioning patterns in AV1 and VVC can be made more efficient.

[0059] In this embodiment, the HEVC division patterns are grouped into units of 128x128 pixels; in other words, four 64x64 pixel blocks are integrated and input to the neural network-based block size determination unit 202n, which infers the division pattern in AV1 or VVC. Figure 8B shows an example of HEVC division patterns grouped into 128x128 units. In Figure 8B, the gray hatched areas represent intra blocks, and the white blocks represent inter blocks. In addition, division pattern 812 shows the motion vectors set for each inter block with arrows.

[0060] As shown here, the prediction modes of the individual 64x64 blocks do not necessarily coincide. Furthermore, even if the prediction modes coincide, the directions of the motion vectors do not necessarily coincide. For example, division pattern 811 shows an example where intra blocks and inter blocks are mixed, and in such a mixed state, the likelihood of the blocks being merged in AV1 or VVC is low. Similarly, as in division pattern 812, even if all blocks are inter blocks, the likelihood of the blocks being merged in AV1 or VVC is low if the directions of the motion vectors do not coincide.
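
The two cues described above (mixed prediction modes, and motion vectors pointing in different directions) can be illustrated with the following hand-written check. This is only a sketch of the kind of relationship the neural network is expected to capture; in the embodiment the decision is made by inference, not by a fixed rule. The `BlockInfo` type, the angular tolerance of 15 degrees, and the treatment of all-intra groups as mergeable are assumptions for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class BlockInfo:
    """Hypothetical summary of one 64x64 HEVC block."""
    mode: str                                   # "intra" or "inter"
    mv: Optional[Tuple[float, float]] = None    # motion vector, required for inter blocks


def directions_agree(mvs: Sequence[Tuple[float, float]], tol_deg: float = 15.0) -> bool:
    """Return True if all motion vectors point in roughly the same direction."""
    nonzero = [(dx, dy) for dx, dy in mvs if (dx, dy) != (0, 0)]
    if not nonzero:
        return True        # every block is static
    if len(nonzero) != len(mvs):
        return False       # mixture of static and moving blocks
    ref = math.atan2(nonzero[0][1], nonzero[0][0])
    for dx, dy in nonzero[1:]:
        diff = math.atan2(dy, dx) - ref
        # Wrap the angular difference into [-pi, pi] before comparing.
        diff = abs(math.atan2(math.sin(diff), math.cos(diff)))
        if math.degrees(diff) > tol_deg:
            return False
    return True


def merge_is_plausible(blocks: Sequence[BlockInfo]) -> bool:
    """Return False when the cues in the text suggest a low likelihood of merging."""
    modes = {b.mode for b in blocks}
    if len(modes) != 1:
        return False       # intra and inter blocks are mixed (pattern 811)
    if modes == {"intra"}:
        return True        # assumption: the text does not address this case
    return directions_agree([b.mv for b in blocks])   # pattern 812 case
```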

[0061] By using a neural network to perform inference while referencing the partitioning pattern in HEVC, it is possible to skip unnecessary processing when determining the block partitioning pattern in AV1 and VVC, thereby improving processing efficiency and reducing power consumption.

[0062] Next, the flow of the block size determination process in this embodiment will be explained with reference to Figures 8C and 8D. Figure 8C is a diagram showing an example of the schematic configuration of the image processing apparatus 10 corresponding to this embodiment. Figure 8D is a flowchart showing an example of the process corresponding to this embodiment. The flowchart in Figure 8D is based on the flowchart in Figure 3, with the processes corresponding to this embodiment added. For steps that use the same reference numbers as in Figure 3, the process is basically the same as that described in Figure 3, and unless otherwise specified below, the explanation in Figure 3 will be applied mutatis mutandis.

[0063] First, in S301, the encoding scheme is selected, and in S302, the first encoding mode determination unit 101 determines the HEVC encoding mode. At this time, the block size determination unit 202h of the first encoding mode determination unit 101 determines the block size. The determined HEVC block size information is notified to the integration unit 821, and block integration is performed in S801. Here, block integration means the process of combining four 64x64 pixel blocks into a 128x128 pixel block, as described above. The integration unit 821 can hold information for at least two lines of pixel blocks in the HEVC block size. The block size, or division pattern information, for the 128x128 pixel block obtained by block integration is provided to the NN block size determination unit 202n.
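
The block integration of S801 can be sketched as follows, assuming the division information of each 64x64 block is available as one record per block, arranged in rows. The record layout, the dictionary keys, and the handling of an odd number of rows or columns (left ungrouped here) are assumptions for this example. Processing two rows of 64x64 blocks at a time mirrors the description that the integration unit 821 holds at least two lines of pixel blocks.

```python
def integrate_blocks(block_rows, block_size=64):
    """Group a grid of 64x64 block records into 128x128 units (2x2 groups).

    block_rows is a list of rows, each row a list of per-block records.
    Returns one dictionary per 128x128 unit, listing its four sub-blocks in
    the order top-left, top-right, bottom-left, bottom-right.
    """
    units = []
    for r in range(0, len(block_rows) - 1, 2):
        top, bottom = block_rows[r], block_rows[r + 1]
        for c in range(0, min(len(top), len(bottom)) - 1, 2):
            units.append({
                "origin": (c * block_size, r * block_size),  # (x, y) in pixels
                "size": 2 * block_size,
                "sub_blocks": [top[c], top[c + 1], bottom[c], bottom[c + 1]],
            })
    return units
```

For example, a picture region of two rows and four columns of 64x64 blocks yields two 128x128 units, with origins at (0, 0) and (128, 0).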

[0064] In the flowchart of Figure 8D, after block integration in S801, if the encoding scheme selected in S301 is HEVC, the process proceeds to S304 to perform HEVC encoding. If the selected encoding scheme is AV1, the process proceeds to S305 to select inference parameters for AV1. The inference parameters selected here also include inference parameters for the NN block size determination unit 202n. In the subsequent S802, the NN block size determination unit 202n performs inference using the selected AV1 inference parameters to determine the AV1 block size based on the HEVC division pattern of the pixel block to be processed provided by the integration unit 821. At this time, inference may also be performed based on the prediction mode (intra-prediction mode and inter-prediction mode) for each of the integrated 64×64 pixel blocks, and further, on motion vector information in inter-mode. Once the block size is determined, the process proceeds to S306 to determine the AV1 encoding mode as described in Figure 3, and AV1 encoding is performed in S307.

[0065] Furthermore, if the selected encoding scheme is VVC, the process proceeds to S308, where inference parameters for VVC are selected. The inference parameters selected here also include inference parameters for the NN block size determination unit 202n. In the subsequent S803, the NN block size determination unit 202n performs inference using the selected VVC inference parameters to determine the VVC block size based on the HEVC division pattern of the pixel block to be processed provided by the integration unit 821. At this time, inference may also be performed based on the prediction mode (intra-prediction mode and inter-prediction mode) for each of the integrated 64×64 pixel blocks, and even on motion vector information in inter-mode. Once the block size is determined, the process proceeds to S309, where the VVC encoding mode is determined as described in Figure 3, and VVC encoding is performed in S310.
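
The branching of Figure 8D described above can be summarized in the following sketch, which simply returns the sequence of step labels executed for a given encoding scheme. The function name and the string identifiers for the schemes are assumptions for this example; the step labels are those used in the description.

```python
# Steps common to every scheme: scheme selection, HEVC mode determination,
# and block integration.
COMMON_STEPS = ["S301", "S302", "S801"]

# Scheme-specific continuation after block integration.
BRANCH_STEPS = {
    "HEVC": ["S304"],                          # HEVC encoding
    "AV1": ["S305", "S802", "S306", "S307"],   # select parameters, infer block size, decide mode, encode
    "VVC": ["S308", "S803", "S309", "S310"],   # same sequence with VVC inference parameters
}


def steps_for_scheme(scheme):
    """Return the ordered list of flowchart steps for the selected scheme."""
    if scheme not in BRANCH_STEPS:
        raise ValueError(f"unsupported encoding scheme: {scheme}")
    return COMMON_STEPS + BRANCH_STEPS[scheme]
```

The AV1 and VVC branches have the same shape and differ only in which inference parameters are loaded, which is what allows the NN block size determination unit 202n to be shared between them.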

[0066] The method for learning the inference parameters for the NN block size determination unit 202n can be implemented in the same manner as described in Embodiment 1 with reference to Figures 6A and 6B. In this embodiment, the HEVC division pattern determined in the first coding mode determination unit 101 for the input image, the prediction mode (intra-prediction mode and inter-prediction mode) for each of the integrated 64×64 pixel blocks, the motion vector information during inter-mode, and the input image, which is the image to be encoded, are input to the second coding mode determination unit 104. The configuration example of the NN block size determination unit 202n can also be configured in the same way as described in Embodiment 1, for example, as a four-layer structure having an input layer 701, a first hidden layer 702, a second hidden layer 703, and an output layer 704 as shown in Figure 7A. Furthermore, the neurons 710 between layers can be configured to output an output value y by performing calculations on multiple input values x1~xN using weights w1~wN, bias b, and an activation function, as shown in Figure 7B.

[0067] According to the embodiment described above, in an image processing device that supports multiple encoding schemes, by performing inference using a neural network based on the encoding block size determined in one encoding scheme, it becomes possible to simplify the block size determination process in other encoding schemes and reduce the circuit size of the other encoding schemes.

[0068] (Embodiment 3) Next, Embodiment 3 will be described. The configuration of the image processing system in this embodiment is the same as that shown in Figure 1A. The basic configuration of the image processing device 10 is also the same as that shown in Figures 1B and 2A. Other changes corresponding to this embodiment will be described below as appropriate. In this embodiment, the processing when the information of the first encoding mode determined by the first encoding mode determination unit 101 and provided to the second encoding mode determination unit 104 includes motion vectors and block division information in inter prediction will be described.

[0069] VVC added a mode called GPM (Geometric Partitioning Mode) for inter prediction. GPM is one of the merge modes and enables motion compensation for diagonal divisions that cannot be handled by normal block partitioning. A merge mode is a method that references the motion vectors (MVs) of spatially and temporally adjacent encoded blocks and uses them as the MV of the current block, and is an encoding tool that has been defined since HEVC. In GPM, an index representing the combination of angle and distance that defines the shape of the partition is specified, together with two merge indices from which the motion vectors of the two regions A and B are derived.

[0070] In GPM, there are 64 possible block division patterns (third block division patterns), and in the case of an 8x8 block, a table of 64 weight coefficients is used. If the weight coefficient is 8, region A is selected, and if the weight coefficient is 0, region B is selected. In all other cases, the final predicted image is generated by weighted averaging of the motion-compensated predicted images of the two regions A and region B according to the weight coefficients. An example of the weight coefficient arrangement is shown in Figure 9A.
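
The weighted averaging described above can be sketched as follows for integer sample values. A weight of 8 selects the prediction of region A, a weight of 0 selects that of region B, and intermediate weights blend the two. The rounding offset of 4 before the division by 8 is an assumption for this example, as is the representation of blocks as lists of rows.

```python
def gpm_blend(pred_a, pred_b, weights):
    """Blend two motion-compensated predictions using per-sample weights in 0..8.

    pred_a, pred_b, and weights are equally sized 2-D lists of integers.
    """
    blended = []
    for row_a, row_b, row_w in zip(pred_a, pred_b, weights):
        out_row = []
        for a, b, w in zip(row_a, row_b, row_w):
            if not 0 <= w <= 8:
                raise ValueError("weights must lie in the range 0 to 8")
            # Weighted average with rounding; w == 8 yields a, w == 0 yields b.
            out_row.append((w * a + (8 - w) * b + 4) >> 3)
        blended.append(out_row)
    return blended
```

For an 8x8 block, the weight table holds 64 entries, one per sample, and samples near the partition boundary receive intermediate weights so that the transition between the two regions is smooth.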

[0071] By using GPM, more flexible PU partitioning is possible than with conventional rectangular partitioning, and coding efficiency can be improved by performing inter prediction that takes object shape into consideration. However, since there are 64 GPM partitioning patterns, estimating the coding distortion for each one is computationally intensive and increases the size of the circuit. Therefore, in this embodiment, the motion vector and block partitioning information determined in the first coding mode determination unit 101 are referenced, and inference is performed using a neural network to improve processing efficiency and reduce the size of the circuit.

[0072] Figure 9A is a diagram illustrating a comparison between the block partitioning method in a conventional encoding scheme (e.g., HEVC) and the block partitioning method in VVC. In Image 901, when encoding block 901A using HEVC, the partitioning pattern follows the first block partitioning pattern as shown in Image 902 or Image 903. In Image 902, it is divided into two vertically, and in Image 903, it is divided into four squares. However, in both partitioning patterns in Image 902 and Image 903, the objects contained in block 901A are spread across multiple partitioned blocks, resulting in poor encoding efficiency. In contrast, Image 904 shows an example where one of the third block partitioning patterns in GPM mode in VVC is applied. In this case, block 901A is divided diagonally, so the objects can be contained within a single partitioned block without being spread across multiple partitioned blocks. Weight coefficients 905 shows a table of weight coefficients corresponding to the partitioning patterns applied to Image 904. The weight coefficients take values between 0 and 8. The final predicted image is generated by performing a weighted average of the motion-compensated predicted images of two regions, A and B, according to the weight coefficients.

[0073] Thus, VVC allows for the setting of a third block division pattern in GPM mode that matches the shape of objects, which was not possible with conventional encoding methods. However, since objects are located across multiple blocks even in the first block division pattern of conventional methods, it is possible to narrow down the candidates from among the 64 possible third block division patterns of VVC by using the information of this division pattern. Furthermore, by performing inference using a neural network based on the first block division pattern, it is possible to identify the third block division pattern more efficiently. In addition, regarding motion vectors, by taking into account the motion vectors extracted using conventional methods and performing inference using a neural network, it is possible to calculate the motion vectors in the third block division pattern more efficiently.

[0074] Next, with reference to Figure 9B, a configuration using a neural network corresponding to this embodiment will be described. Figure 9B is a diagram showing an example of a schematic configuration of the image processing device 10 corresponding to this embodiment. In Figure 9B, the motion detection function is separated from the inter prediction unit 204, which otherwise remains in place on the VVC encoding mode determination unit 104v side; the separated function is implemented as a neural network and placed outside the encoding mode determination unit 104v as the NN inter prediction unit 204n. The NN inter prediction unit 204n is configured in the same way for the AV1 encoding mode determination unit 104a, and the configuration on the encoding mode determination unit 104a side is also the same.

[0075] Furthermore, the object extraction unit 911 extracts objects from the current frame stored in the current frame storage unit 201 of the VVC encoding mode determination unit 104v and outputs them to the NN inter prediction unit 204n. The NN inter prediction unit 204n takes the object information into consideration and, based on the encoding mode information provided by the HEVC encoding mode determination unit 101, in particular the first block division pattern information, determines parameters for inter prediction, such as motion vectors and the third block division pattern of the GPM mode. The determined motion vectors are provided to the motion compensation unit 205, and the third block division pattern is provided to the block size determination unit 202. If inter prediction is to be performed, the block size determination unit 202 determines the block size based on the third block division pattern information from the NN inter prediction unit 204n.

[0076] The NN inter prediction unit 204n can also be configured in the same way as described in Embodiment 1, for example, as a four-layer structure having an input layer 701, a first hidden layer 702, a second hidden layer 703, and an output layer 704 as shown in Figure 7A. Furthermore, the neurons 710 between layers can be configured to output an output value y by performing calculations on multiple input values x1 to xN using weights w1 to wN, a bias b, and an activation function, as shown in Figure 7B. The method for learning the inference parameters for the NN inter prediction unit 204n can be carried out in the same way as described in Embodiment 1 with reference to Figures 6A and 6B. In addition, learning can also be performed using the method shown in Figure 9C, for example.

[0077] The method for learning inference parameters corresponding to this embodiment will be described below with reference to Figure 9C. In Figure 9C, an input image is input to the first coding mode determination unit 101, the first coding mode is determined, and coding is performed by the first coding unit using the first coding scheme in the determined coding mode. The generated first coding stream is stored in the first coding stream storage 103. The stream is decoded by the first decoding unit 921, and the decoded image obtained by decoding and the information on the code amount of the coded data are provided to the parameter update unit 922.

[0078] The input image is also input to the second coding mode determination unit 104v (in this case, the coding mode determination unit for VVC). Using the inference parameters set at that time, the second coding mode determination unit 104v determines the VVC coding mode, including the third block division pattern, for the input image by taking into account the information of the first block division pattern supplied from the first coding mode determination unit 101 and the object information provided from the object extraction unit 923. Here, the object extraction unit 923 extracts object information from the input image and provides it to the second coding mode determination unit 104v. The object information can be information about the region where the object is located, information about its shape, or an object image.

[0079] The encoding mode determined by the second encoding mode determination unit 104v is notified to the second encoding unit 105v, and encoding is performed using the second encoding scheme (VVC in this case). The resulting second encoded stream is stored in the second encoded stream storage 106v. The encoded data stored in the second encoded stream storage 106v is decoded by the second decoding unit 612v, and the decoded image obtained by decoding and the information of the encoding amount of the encoded data are output to the parameter update unit 922.

[0080] The input image and object information extracted by the object extraction unit 923 are also supplied to the parameter update unit 922, which updates the inference parameters so that the difference between the VVC decoded image and the input image is minimized. In this process, PSNR (Peak Signal to Noise Ratio) and SSIM (Structural SIMilarity) can be used as evaluation metrics. The updated inference parameters are output to the second coding mode determination unit 104v and used when determining the coding mode for the next input image.

[0081] As a loss function during neural network training, for example, we can set equation 4 below and train the network to minimize f(x).

[0082] $f(x) = D_v / D_h + P_n$ (Equation 4), where Dv is the VVC codec distortion, Dh is the conventional codec distortion, and Pn is the penalty. In Equation 4, Dv and Dh can be calculated based on the difference between the input image and the decoded image. Furthermore, the penalty Pn can be defined, for example, as follows.

[0083] The value of the penalty Pn can be set based on whether or not an object is split by the GPM partitioning output by the NN. Specifically, let W0 be the value of the penalty Pn when an object is split by the GPM partitioning, and let W1 = α × D be the value of the penalty Pn when no object is split. Here, D is the distance between the object and the partition boundary, normalized to the range of 0 to 1, and W0 is set to a value much larger than the maximum of W1, which is reached when D = 1 (W0 >> W1 = α × 1).
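
The loss of Equation 4 together with this penalty can be sketched as follows, taking D to be the normalized distance between the object and the partition boundary, as the subsequent paragraphs suggest. The function names and the default values of α and W0 are assumptions for this example; the embodiment only requires that W0 be much larger than α.

```python
def penalty(object_is_split, distance, alpha=1.0, w0=100.0):
    """Return Pn: W0 if the partition splits an object, otherwise W1 = alpha * D."""
    if object_is_split:
        return w0
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance D must be normalized to the range 0 to 1")
    return alpha * distance


def loss(dv, dh, pn):
    """Equation 4: f(x) = Dv / Dh + Pn, to be minimized during training."""
    if dh <= 0:
        raise ValueError("conventional codec distortion Dh must be positive")
    return dv / dh + pn
```

Because the first term is a ratio, a value below 1 indicates that the VVC result has less distortion than the conventional codec, and the penalty then steers training away from partitions that cut through an object or lie far from its contour.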

[0084] An example of the concept of distance will be explained with reference to Figure 9D. Figure 9D is a diagram to explain the relationship between the contour of an object and the block division boundary. In Figure 9D, the block to be processed is Image 931. An object, a car, is shown in this image. In contrast, Image 932 shows the state in which the object is divided in GPM division. The dotted lines crossing Image 932 indicate the division boundary in the GPM division pattern. In this state, the penalty Pn value is set to W0, increasing the penalty, and learning is performed so that this division pattern is not adopted.

[0085] Furthermore, as shown in images 933 and 934, even when the shortest distances d1 and d2 between the object's contour and the division boundary are the same (d1 = d2), the value of Dv will differ. That is, when the division boundary follows the object's contour as in image 933, the distortion Dv is smaller, and therefore the value of f(x) is smaller for image 933 than for image 934. The learning process therefore favors division boundaries that follow the object's contour more closely.

[0086] Furthermore, while the distance between an object and a GPM partition boundary can be determined by finding the shortest distance, it is also possible to calculate multiple distances between the object and the GPM partition boundary and use the average or median of these distances as the distance used in the loss function f(x). For example, the distances between the object's contour line and the GPM partition boundary are different in image 935 and image 936. In the case of image 935, for example, the distances d31 to d34 are measured, and their average value is smaller than the average of the distances d41 to d44 in image 936. Therefore, the GPM partition in image 935 results in a smaller value for the loss function f(x).
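
The alternatives described above (shortest, average, or median distance) can be sketched as follows. Representing the partition boundary as a straight line in normal form (an angle and an offset from the origin) is an assumption for this example, as are the function names; the distances would still need to be normalized to the range 0 to 1 before being used in the penalty.

```python
import math
import statistics


def point_to_boundary_distance(point, angle_rad, rho):
    """Distance from a point to the line x*cos(angle) + y*sin(angle) = rho."""
    x, y = point
    return abs(x * math.cos(angle_rad) + y * math.sin(angle_rad) - rho)


def aggregate_distance(distances, method="min"):
    """Reduce several contour-to-boundary distances to a single value."""
    if not distances:
        raise ValueError("at least one distance is required")
    if method == "min":
        return min(distances)
    if method == "mean":
        return statistics.mean(distances)
    if method == "median":
        return statistics.median(distances)
    raise ValueError(f"unknown aggregation method: {method}")
```

Using the mean or median rather than the minimum distinguishes a boundary that runs alongside the contour from one that merely touches it at a single point.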

[0087] Learning may also be performed so that the error between the motion vector MVc of the block to be processed and the motion vector MVa of an adjacent block falls within an acceptable range. Specifically, when the signs of the vector components of MVc and MVa match, let R1 be the prediction error of HEVC, R2 the prediction error of the VVC reference, and T the prediction error when using the GPM output by the NN. If R1 < T, the penalty W0 is applied during learning. A further penalty W2 = α × (T - R2) may be applied so that the degree of divergence between T and R2 is taken into account during learning. In this way, learning avoids cases where the prediction error with GPM becomes worse than the prediction error of HEVC, and brings the result closer to the VVC reference. In addition, when adjacent blocks (for example, the left and upper blocks) are also input and the block to be processed and an adjacent block contain the same object, a penalty can be applied based on the distance, at the boundary between the two blocks, between the partition boundary position of the adjacent block and the GPM partition boundary position of the block to be processed.
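
The motion-vector-related penalties W0 and W2 described above can be sketched as follows. The function name, the default values of α and W0, and the choice to apply no penalty when the signs of the vector components differ are assumptions for this example, since the description does not specify that case. W2 is left unclamped, exactly as written in the formula.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def mv_penalty(mvc, mva, r1, r2, t, alpha=1.0, w0=100.0):
    """Penalty based on prediction errors R1 (HEVC), R2 (VVC reference), T (NN GPM).

    mvc and mva are (x, y) motion vectors of the current and adjacent block.
    """
    signs_match = all(sign(c) == sign(a) for c, a in zip(mvc, mva))
    if not signs_match:
        return 0.0                 # assumption: case not covered by the description
    total = 0.0
    if r1 < t:
        total += w0                # GPM result is worse than HEVC
    total += alpha * (t - r2)      # W2: divergence from the VVC reference
    return total
```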

[0088] In this way, by increasing the penalty when the block partition boundary in GPM partitioning intersects an object and divides the object, or when the distance between the object and the partition boundary is far, learning can be performed so that the object and the partition boundary of GPM partitioning are close to each other.

[0089] According to the present embodiment described above, when using the newly adopted GPM in the inter prediction of the VVC method, by performing an inference applying a neural network based on the block size and motion vector information determined in the conventional encoding method, the processing load of GPM partitioning can be reduced. Thereby, the process of determining the block size and motion vector of the VVC encoding method can be simplified, and the circuit scale can be reduced.

[0090] (Embodiment 4) Next, Embodiment 4 will be described. The configuration of the image processing system in this embodiment is the same as that shown in Figure 1A. The basic configuration of the image processing device 10 is also the same as that shown in Figures 1B and 2A. Other changes corresponding to this embodiment will be described below as appropriate. In this embodiment, the processing when the intra-prediction mode is included as information of the first encoding mode determined by the first encoding mode determination unit 101 and provided to the second encoding mode determination unit 104 will be described.

[0091] AV1 and VVC, compared to conventional encoding methods such as HEVC, have added an intra-prediction mode that generates predicted pixels for the chrominance signal from the decoded luminance signal. AV1 has added CfL (Chroma from Luma), and VVC has added CCLM (Cross-component Linear Model Prediction). By adopting CfL and CCLM technologies, improved encoding efficiency can be expected in complex texture regions containing vivid colors that are difficult to predict with directional prediction, DC prediction, and planar prediction.

[0092] CfL and CCLM employ methods to estimate color difference from luminance, thereby reducing the amount of data that needs to be retained. Specifically, in some images, color difference information (Cb(U), Cr(V)) may have a somewhat linear relationship with luminance information (Luma(Y)). Also, depending on the subject, edges contained in luminance information and color difference information may be in the same location, and a correlation between them may be observed.

[0093] However, when using this tool, the prediction results of the luminance signal are used to predict the color difference signal. Specifically, in order to generate the local decoded pixel value of the luminance, the luminance signals surrounding the target color difference signal are added, then downsampled, and further weighted to generate the predicted pixels. At this time, the weighting coefficients can be determined by the least squares method. This method is difficult to parallelize and also increases the circuit size. Therefore, in this embodiment, we propose a configuration that can reduce the circuit size while reducing the burden of parallelization.
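
The procedure described above (downsampling the luminance, fitting weighting coefficients by least squares, and generating the predicted chrominance pixels) can be sketched as follows. This follows the description in this paragraph and is not the normative derivation of either AV1 or VVC. The 2x2 averaging filter, the linear model form (predicted chroma = a × luma + b), and all function names are assumptions for this example.

```python
def downsample_2x2(luma):
    """Average each 2x2 group of luma samples (assumes 4:2:0 subsampling)."""
    height, width = len(luma), len(luma[0])
    return [
        [(luma[y][x] + luma[y][x + 1] + luma[y + 1][x] + luma[y + 1][x + 1]) / 4.0
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


def fit_linear_model(luma_neighbors, chroma_neighbors):
    """Least-squares fit of chroma = a * luma + b from neighboring decoded samples."""
    n = len(luma_neighbors)
    if n == 0 or n != len(chroma_neighbors):
        raise ValueError("neighbor lists must be non-empty and of equal length")
    mean_l = sum(luma_neighbors) / n
    mean_c = sum(chroma_neighbors) / n
    var_l = sum((l - mean_l) ** 2 for l in luma_neighbors)
    if var_l == 0:
        return 0.0, mean_c         # flat luma: fall back to the mean chroma
    cov = sum((l - mean_l) * (c - mean_c)
              for l, c in zip(luma_neighbors, chroma_neighbors))
    a = cov / var_l
    return a, mean_c - a * mean_l


def predict_chroma(luma_downsampled, a, b):
    """Apply the fitted model to every downsampled luma sample."""
    return [[a * value + b for value in row] for row in luma_downsampled]
```

The fit requires sums over all neighboring samples before any chroma sample can be predicted, which illustrates why the description regards this step as difficult to parallelize.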

[0094] In intra prediction, efficiency decreases when directional prediction is not employed, because prediction from surrounding pixels is then unlikely to be possible. However, DC prediction and planar prediction may be more efficient than CfL or CCLM in some cases, so the intra-prediction mode can be used as one of the factors in making the decision. For example, the intra-prediction mode is advantageous when brightness varies within a block but the color difference is uniform.

[0095] Furthermore, CCLM has three modes corresponding to the prediction direction: INTRA_L_CCLM mode, which refers to the left direction; INTRA_LT_CCLM mode, which refers to the left and up directions; and INTRA_T_CCLM mode, which refers to the up direction. The method of calculating the weights is switched according to the selected mode. For example, if the HEVC intra-prediction mode refers to the left adjacent block, INTRA_L_CCLM is highly likely to be the CCLM mode selected as well. Therefore, inputting the intra-prediction mode makes it possible to select the CCLM mode more efficiently. At the same time, performing the inference with a neural network avoids narrowing the candidate CCLM modes down to INTRA_L_CCLM alone.

[0096] Furthermore, object information includes edge detection results, flatness detection results, and object detection results (detection of the object itself in the image, and information about the type of object (person, animal, branch, sky, wall), etc.). This information can be used to determine areas where CfL and CCLM are highly effective, such as areas with high saturation (including vivid colors) and many high-frequency components (e.g., complex textures). For example, the fireworks display in image 1001, the birds in image 1002, and the tropical fish in image 1003, as shown in Figure 10A, have many edge components, thus containing many high-frequency components, and their high saturation results in rich and vivid colors, making them suitable for CfL and CCLM.
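As a rough illustration of how saturation and high-frequency content could be measured for one block, the sketch below computes two simple features. It is an assumption-laden example, not the object extraction described in this disclosure: the function names are hypothetical, 8-bit YCbCr with a neutral chroma value of 128 is assumed, and a sum of absolute neighbour differences stands in for a real edge detector.

```python
import math


def mean_saturation(cb, cr, neutral=128):
    """Mean distance of (Cb, Cr) from the neutral point; larger means more vivid."""
    total = 0.0
    count = 0
    for row_cb, row_cr in zip(cb, cr):
        for u, v in zip(row_cb, row_cr):
            total += math.hypot(u - neutral, v - neutral)
            count += 1
    return total / count


def edge_energy(luma):
    """Mean absolute difference between horizontally and vertically adjacent luma samples."""
    height, width = len(luma), len(luma[0])
    total = 0
    count = 0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:
                total += abs(luma[y][x + 1] - luma[y][x])
                count += 1
            if y + 1 < height:
                total += abs(luma[y + 1][x] - luma[y][x])
                count += 1
    return total / count if count else 0.0


def block_features(luma, cb, cr):
    """Bundle the two measures as candidate inputs for the mode decision."""
    return {"saturation": mean_saturation(cb, cr), "edge_energy": edge_energy(luma)}
```

A block that scores high on both measures corresponds to the vivid, complex texture regions described above; in this embodiment such values would be supplied to the neural network rather than compared against fixed thresholds.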

[0097] Therefore, in this embodiment, the coding mode (intra-prediction mode) determined in HEVC, etc., and object information are provided to the neural network, and conversion to an appropriate coding mode (chrominance intra-prediction mode) in AV1 or VVC is performed. By inputting the intra-prediction mode determined in conventional methods such as HEVC, and object information into the neural network and performing inference, the possibility of determining an appropriate prediction mode can be increased.

[0098] Next, with reference to Figure 10B, the configuration of the image processing device 10 corresponding to this embodiment will be described. Figure 10B is similar to the diagram shown in Figure 2B, but differs in that it includes an object extraction unit 1011 that extracts object information from the input image. The object extraction unit 1011 extracts edge detection results, flatness detection results, and object detection results (detection of the object itself in the image, and information on the type of object (person, branch, sky, wall), etc.) as object information from the input image. The extracted object information is supplied to the NN intra prediction unit 203n.

[0099] As a result, the NN intra prediction unit 203n performs intra prediction based on the intra-prediction mode information supplied from the first coding mode determination unit 101, the image to be encoded, and the object information from the object extraction unit 1011. If AV1 is specified in the coding method setting unit, CfL can be set as the coding mode for intra prediction, and if VVC is specified, CCLM can be set. For example, if the edge detection result and flatness detection result in the object information indicate that the block to be processed is a complex texture region with high saturation, CfL or CCLM can be set. The other configurations in Figure 10B are the same as those already explained for Figure 2B, so their explanation is omitted here.
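The inputs and the scheme-dependent output described above can be sketched as follows. This is a hypothetical illustration: the feature layout, the candidate lists in `CHROMA_CANDIDATES` (including the catch-all label `OTHER`), and the function names are assumptions, and the per-mode scores are taken as given rather than produced by an actual network. The only codec fact relied on is that HEVC defines 35 intra-prediction modes (planar, DC, and 33 angular modes).

```python
HEVC_NUM_INTRA_MODES = 35  # planar (0), DC (1), angular (2..34)

# Illustrative candidate sets per target scheme; "OTHER" stands for any
# non cross-component mode and is a placeholder, not a codec term.
CHROMA_CANDIDATES = {
    "AV1": ("CFL", "OTHER"),
    "VVC": ("INTRA_L_CCLM", "INTRA_LT_CCLM", "INTRA_T_CCLM", "OTHER"),
}


def build_input(hevc_intra_mode, edge_energy, flatness, saturation):
    """One-hot encode the HEVC intra mode and append the object-information features."""
    if not 0 <= hevc_intra_mode < HEVC_NUM_INTRA_MODES:
        raise ValueError("HEVC intra mode must be in 0..34")
    one_hot = [0.0] * HEVC_NUM_INTRA_MODES
    one_hot[hevc_intra_mode] = 1.0
    return one_hot + [float(edge_energy), float(flatness), float(saturation)]


def select_chroma_mode(scheme, scores):
    """Pick the best-scoring mode among those permitted for the specified scheme."""
    allowed = CHROMA_CANDIDATES[scheme]
    return max(allowed, key=lambda mode: scores.get(mode, float("-inf")))


features = build_input(10, edge_energy=0.5, flatness=0.1, saturation=0.9)
mode = select_chroma_mode("AV1", {"CFL": 0.7, "OTHER": 0.3})
```

Restricting the final choice to the modes of the specified scheme is one way a single inference block could serve both AV1 and VVC.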

[0100] By adopting the configuration shown in Figure 10B, the intra prediction unit 203 no longer needs to be provided separately in the encoding mode determination unit 104a for AV1 and the encoding mode determination unit 104v for VVC and is instead shared between them, thereby reducing the circuit size.

[0101] Furthermore, the learning method for the NN intra prediction unit 203n can adopt the configurations shown in Figures 6A and 6B. However, a configuration equivalent to the object extraction unit 1011 in Figure 10B, which extracts and provides object information from the input image, is added to the second coding mode determination unit 104. The second coding mode determination unit 104 uses the inference parameters set at that time to determine the coding mode for both AV1 and VVC, taking into account the HEVC coding mode and object information supplied for the input image. The determined AV1 and VVC coding modes are each output to the parameter update unit 601. Other operations are the same as those described in relation to Figures 6A and 6B. The neural network configuration of the NN intra prediction unit 203n is also the same as that described in relation to Figures 7A and 7B.

[0102] According to the embodiment described above, when using CfL added to AV1 and CCLM added to VVC in intra-prediction mode, inference using a neural network is performed based on the intra-prediction mode information in the conventional encoding scheme and object information extracted from the input image, making it possible to efficiently determine whether to select CfL or CCLM. This simplifies the process of determining the intra-prediction mode in the AV1 and VVC encoding schemes and reduces the circuit size.

[0103] (Embodiment 5) Next, Embodiment 5 will be described. The configuration of the image processing system in this embodiment is the same as that shown in Figure 1A. The basic configuration of the image processing device 10 is also the same as that shown in Figures 1B and 2A. Other changes corresponding to this embodiment will be described below as appropriate. In this embodiment, the processing when the intra-prediction mode is included as information of the first encoding mode determined by the first encoding mode determination unit 101 and provided to the second encoding mode determination unit 104 will be described.

[0104] AV1 and VVC offer enhanced encoding tools for screen content (computer graphics) compared to older encoding methods like HEVC. Specifically, they enable features such as intra block copy (IBC), which uses already-encoded portions within the same frame as a predicted image for screen content, and palette mode, which uses a limited number of colors. Screen content includes CG images superimposed on live-action images, and CG background images superimposed on live-action images. Both IBC and palette mode are available in AV1, and IBC is also available in VVC. For example, as shown in image 1004 in Figure 10A, this applies to images where CG image 1004A is superimposed on image 1001, which is a landscape image of fireworks.

[0105] IBC is a special mode of intra prediction that generates a predicted image for each luminance and chrominance block by copying a reference image, block by block, from the already-processed surrounding region of the same picture as the block being processed. For example, it can achieve high encoding efficiency in graphic images where similar texture patterns are repeated. IBC is available for CU sizes from 4x4 to 64x64, and the reference image is determined by specifying a block vector (BV) for each CU.
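The block-vector copy described above can be sketched as follows. This is a simplified illustration under stated assumptions: `ibc_predict` is a hypothetical name, the picture is a plain 2-D list of reconstructed samples, and the validity check (reference entirely above the current block, or entirely to its left without extending below it) is a simplification of the constraints real codecs impose.

```python
def ibc_predict(recon, x, y, size, bv):
    """Return the size x size prediction copied from (x + bvx, y + bvy) in the same picture."""
    bvx, bvy = bv
    ref_x, ref_y = x + bvx, y + bvy
    height, width = len(recon), len(recon[0])
    if ref_x < 0 or ref_y < 0 or ref_x + size > width or ref_y + size > height:
        raise ValueError("reference block lies outside the picture")
    above = ref_y + size <= y
    left = ref_x + size <= x and ref_y + size <= y + size
    if not (above or left):
        # Simplified rule: the reference must already have been reconstructed.
        raise ValueError("reference block overlaps an unprocessed region")
    return [row[ref_x:ref_x + size] for row in recon[ref_y:ref_y + size]]


# A 4x4 picture whose top-left 2x2 block has already been reconstructed.
picture = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
prediction = ibc_predict(picture, x=2, y=0, size=2, bv=(-2, 0))
```

Here the block at (2, 0) is predicted from the block two samples to its left, so a repeated pattern costs only the block vector plus any residual.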

[0106] Palette mode is an intra-prediction tool in which 2 to 8 colors are specified and regions and indices within a picture are defined. For example, when 3 colors are used, a table is created by assigning indices 0, 1, and 2 to the colors in descending order of probability of occurrence, and the color value for each index is specified in the palette. In this way, palette mode encodes pixel values by replacing them with indices, thus reducing the amount of information.
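The index substitution described above can be sketched as follows. This is a minimal illustration, not the AV1 palette syntax: `build_palette` and `to_indices` are hypothetical names, colors are represented as plain integers, and blocks containing more distinct colors than the palette holds are simply rejected rather than handled with any escape mechanism.

```python
from collections import Counter


def build_palette(pixels, max_colors=8):
    """Return up to max_colors colors, most frequent first, so index 0 is the commonest."""
    counts = Counter(value for row in pixels for value in row)
    return [color for color, _ in counts.most_common(max_colors)]


def to_indices(pixels, palette):
    """Replace every pixel value with its index in the palette."""
    lookup = {color: index for index, color in enumerate(palette)}
    try:
        return [[lookup[value] for value in row] for row in pixels]
    except KeyError as exc:
        raise ValueError("block has a color that is not in the palette") from exc


def from_indices(indices, palette):
    """Reconstruct pixel values from the index map and the palette."""
    return [[palette[i] for i in row] for row in indices]


block = [
    [7, 7, 7],
    [7, 9, 9],
    [3, 9, 7],
]
palette = build_palette(block)        # three colors, ordered by frequency
index_map = to_indices(block, palette)
```

With three colors, each pixel needs only an index in the range 0 to 2 instead of a full sample value, which is where the reduction in information comes from.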

[0107] This embodiment describes an example of using a neural network to perform conversion to the AV1 or VVC intra block copy mode using HEVC encoding mode information and object information.

[0108] For example, when the block size and block prediction mode in the encoding mode determined in HEVC match or are similar between two blocks, the IBC mode is more likely to be selected. Conversely, when the block size and block prediction mode do not match between the two blocks, a mode other than the IBC mode is more likely to be selected. Specifically, as shown in Figure 10A, image 1005 is an image of the character "A", and it is divided into four blocks, with the upper part predicted vertically and the lower part predicted horizontally. Image 1006 is likewise an image of the character "A", and both its block division pattern and the prediction direction of each block match those of image 1005. In contrast, image 1007 is an image of the character "C"; its left half is divided into 4x4 pixel blocks and its right half into 8x4 pixel blocks. In addition, on the left side the upper block is predicted horizontally and the lower block vertically, while the right side is predicted vertically. Thus, image 1007 matches image 1005 in neither block division pattern nor prediction direction. Figure 10A illustrates an example using a single character, but the same process can be applied to cases involving multiple characters.
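The comparison of block division patterns and prediction directions can be sketched as a simple similarity measure. This is an illustrative assumption, not the learned behavior of the neural network: `pattern_similarity` is a hypothetical name, each sub-block is described by a (size label, direction) pair, and the 4x4 sub-block size used for images 1005 and 1006 is assumed because the text does not state it.

```python
def pattern_similarity(blocks_a, blocks_b):
    """Fraction of sub-blocks whose size and prediction direction both agree.

    Returns 0.0 when the two regions are divided into a different number of
    sub-blocks, since their division patterns then cannot match.
    """
    if not blocks_a or len(blocks_a) != len(blocks_b):
        return 0.0
    hits = sum(1 for a, b in zip(blocks_a, blocks_b) if a == b)
    return hits / len(blocks_a)


# Sub-blocks listed top to bottom; sizes for 1005 and 1006 are assumed.
image_1005 = [("4x4", "vertical"), ("4x4", "vertical"),
              ("4x4", "horizontal"), ("4x4", "horizontal")]
image_1006 = list(image_1005)
image_1007 = [("4x4", "horizontal"), ("4x4", "vertical"), ("8x4", "vertical")]

same_glyph = pattern_similarity(image_1005, image_1006)
other_glyph = pattern_similarity(image_1005, image_1007)
```

A high similarity between two regions suggests repeated content, which is the situation in which the IBC mode is more likely to be chosen.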

[0109] Furthermore, the object information, which includes the CG / natural image detection result, the texture (repeating pattern, text) detection result, and the solid area (flat area or low-frequency region) detection result, makes it possible to determine whether an area is a CG area, for which the IBC mode is well suited.

[0110] For example, in the encoding mode determined in HEVC, if the block size is small, the overhead of the header information is large, and the likelihood of selecting palette mode is low. On the other hand, if the block size is large, the likelihood of it being a flat area is high, and the likelihood of selecting palette mode is high. Also, regarding object information, similar to the IBC mode described above, it is possible to determine whether it is a single-color block that is well handled by palette mode based on the CG / natural image judgment result and the detection result of solid areas (flat areas or low-frequency regions). In this embodiment, learning is performed using these methods.

[0111] The configuration of the image processing apparatus 10 corresponding to this embodiment is the same as that shown in Figure 10B. In the configuration of Figure 10B, the NN intra prediction unit 203n performs intra prediction based on the intra-prediction mode information supplied from the first coding mode determination unit 101, the image to be encoded, and the object information from the object extraction unit 1011. If AV1 is specified in the coding method setting unit, IBC or palette mode can be set as the coding mode for intra prediction, and if VVC is specified, IBC can be set. For example, if a repeating pattern texture is detected as object information, IBC is set. However, for the AV1 palette mode, intra prediction is performed based only on block size information and object information, without using prediction mode information. The other configurations in Figure 10B are the same as those already described with reference to Figures 10B and 2B, so their explanation is omitted here.

[0112] Furthermore, the learning method for the NN intra prediction unit 203n corresponding to this embodiment is the same as that described in Embodiment 4. Here, too, a configuration corresponding to the object extraction unit 1011 can be adopted in addition to the configurations in Figures 6A and 6B. However, in the case of the palette mode of AV1, learning is performed for the case where the palette mode is selected using only the block size information and object information, without using the prediction mode information. Other operations are the same as those described in Embodiment 4. Also, the neural network configuration of the NN intra prediction unit 203n is the same as that described in relation to Figures 7A and 7B.

[0113] According to the embodiment described above, when the IBC and palette mode added to AV1 and the IBC added to VVC are used in intra prediction, they can be utilized efficiently based on the intra-prediction mode information obtained for a conventional encoding scheme and the object information extracted from the input image. This simplifies the process of determining the intra-prediction mode in the AV1 and VVC encoding schemes and reduces the circuit size.

[0114] The disclosures herein include the following image processing apparatus, imaging apparatus, control method for the image processing apparatus, and computer program.
[Item 1] An image processing apparatus comprising: an encoding mode determination means that determines a second encoding mode in a second encoding scheme different from a first encoding scheme for an image to be encoded, based on information of a first encoding mode in the first encoding scheme for the image to be encoded, by inference using a neural network; and an encoding means that encodes the image to be encoded using the second encoding mode.
[Item 2] The image processing apparatus according to item 1, wherein the second encoding scheme includes at least two encoding schemes, and the encoding mode determination means determines the second encoding mode by inference using inference parameters learned for a predetermined input image in a designated encoding scheme among the at least two encoding schemes.
[Item 3] The image processing apparatus according to item 1 or 2, wherein, when the information of the first encoding mode includes information of a first intra-prediction mode in the first encoding scheme, determining the second encoding mode by the inference includes determining a second intra-prediction mode in the second encoding mode.
[Item 4] The image processing apparatus according to item 3, wherein determining the second intra-prediction mode includes determining a second prediction direction in the second intra-prediction mode based on a first prediction direction included in the information of the first intra-prediction mode.
[Item 5] The image processing apparatus according to any one of items 1 to 4, further comprising an integration means for integrating information of the first encoding mode of a plurality of blocks of a first block size in the first encoding scheme to generate information of the first encoding mode of a second block size, different from the first block size, for the second encoding scheme, wherein, when the information of the first encoding mode includes information of a first block division pattern in the first encoding scheme, determining the second encoding mode by the inference includes determining a second block division pattern in the second encoding scheme based on the information of the first block division pattern of the second block size generated by the integration means.
[Item 6] The image processing apparatus according to item 5, wherein the first block size is 64 x 64 and the second block size is 128 x 128, and the integration means integrates information of the first block division pattern of four blocks in the first encoding scheme to generate the information of the first block division pattern of the second block size.
[Item 7] The image processing apparatus according to any one of items 1 to 6, wherein, when a first block division pattern is permitted in inter prediction of the first encoding scheme, a second block division pattern different from the first block division pattern is permitted in inter prediction of the second encoding scheme, and the second block division pattern includes a third block division pattern that divides a block diagonally, the information of the first encoding mode includes information of the first block division pattern in the first encoding scheme, and determining the second encoding mode by the inference includes selecting one of the third block division patterns based on the first block division pattern.
[Item 8] The image processing apparatus according to item 7, further comprising object extraction means for extracting information about objects contained in the image to be encoded, wherein determining the second encoding mode by the inference further includes selecting one of the third block division patterns based on the information of the object.
[Item 9] The image processing apparatus according to item 8, wherein selecting one of the third block division patterns includes selecting a pattern that does not divide the images of the objects included in the blocks to be divided.
[Item 10] The image processing apparatus according to item 9, wherein selecting a pattern that does not divide the image of an object included in the block includes selecting a pattern in which the distance between the contour of the object and the division boundary in the third block division pattern is closer than that of other patterns.
[Item 11] The image processing apparatus according to item 9 or 10, wherein selecting a pattern that does not divide the image of the objects included in the block includes selecting a pattern in which the division boundary in the third block division pattern follows the contour of the object.
[Item 12] The image processing apparatus according to any one of items 7 to 11, wherein the information of the first encoding mode further includes information of the motion vectors of the blocks divided in the first block division pattern, and determining the second encoding mode by the inference includes determining the motion vector information in the selected third block division pattern based on the motion vector information.
[Item 13] The image processing apparatus according to any one of items 1 to 12, further comprising object extraction means for extracting object information from the image to be encoded, wherein determining the second encoding mode by the inference includes determining a third intra-prediction mode of the second encoding mode based on information of the intra-prediction mode in the first encoding scheme and information of the extracted object.
[Item 14] The image processing apparatus according to item 13, wherein the third intra-prediction mode includes a first prediction mode that generates predicted pixels of a color difference signal from the decoded luminance signal of the same block, and the selection of the first prediction mode is based on the amount of edge components and the saturation level in the object information.
[Item 15] The image processing apparatus according to item 13 or 14, wherein the third intra-prediction mode includes a second prediction mode in which an image encoded by the second encoding scheme is used as the predicted image for the same image to be encoded, and the selection of the second prediction mode is based on the block size information in the information of the first encoding mode and the commonality of prediction modes among the blocks.
[Item 16] The image processing apparatus according to item 15, wherein the selection of the second prediction mode is further based on whether the object is a computer graphics image.
[Item 17] The image processing apparatus according to any one of items 13 to 16, wherein the third intra-prediction mode includes a third prediction mode that includes specifying the index of the colors that constitute the image to be encoded, and the selection of the third prediction mode is based on the size of the block in the information of the first encoding mode and whether the object is monochrome.
[Item 18] The image processing apparatus according to any one of items 1 to 17, wherein the first encoding scheme is H.264 or HEVC, and the second encoding scheme is AV1 or VVC.
[Item 19] The image processing apparatus according to item 18, further comprising: a first encoding mode determination means for determining the first encoding mode in the first encoding scheme; and a first encoding means that encodes the image to be encoded using the first encoding mode, wherein, when encoding using the second encoding scheme is specified, the first encoding mode is determined by the first encoding mode determination means and encoding by the first encoding means is not performed.
[Item 20] The image processing apparatus according to any one of items 1 to 19, wherein the inference using the neural network is performed using at least the image to be encoded.
[Item 21] An imaging apparatus comprising: imaging means; and the image processing apparatus according to any one of items 1 to 20.
[Item 22] A control method for an image processing apparatus, comprising: an encoding mode determination step of determining, based on information of a first encoding mode in a first encoding scheme determined for an image to be encoded, a second encoding mode for the image to be encoded by inference using a neural network, for at least one second encoding scheme different from the first encoding scheme; and an encoding step of encoding the image to be encoded using the second encoding mode.
[Item 23] A program for causing a computer to function as one of the means of the image processing apparatus according to any one of items 1 to 20.

[0115] The invention is not limited to the embodiments described above, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, claims are attached to disclose the scope of the invention.

[0116] (Other examples) The present invention can also be realized by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or storage medium, and by having one or more processors in the computer of that system or device read and execute the program. It can also be realized by a circuit (e.g., an ASIC) that implements one or more functions.

[Explanation of symbols]

[0117] 10 Image processing device, 20 Imaging device, 30 Management device, 40 Network

Claims

1. An image processing apparatus comprising: an encoding mode determination means that determines a second encoding mode in a second encoding scheme different from a first encoding scheme for an image to be encoded, based on information of a first encoding mode in the first encoding scheme for the image to be encoded, by inference using a neural network; and an encoding means for encoding the image to be encoded using the second encoding mode.

2. The image processing apparatus according to claim 1, wherein the second encoding scheme includes at least two encoding schemes, and the encoding mode determination means determines the second encoding mode by inference using inference parameters learned for a predetermined input image in a designated encoding scheme among the at least two encoding schemes.

3. The image processing apparatus according to claim 2, wherein, when the information of the first encoding mode includes information of the first intra-prediction mode in the first encoding scheme, determining the second encoding mode by inference includes determining the second intra-prediction mode in the second encoding mode.

4. The image processing apparatus according to claim 3, wherein determining the second intra-prediction mode includes determining a second prediction direction in the second intra-prediction mode based on a first prediction direction included in the information of the first intra-prediction mode.

5. The image processing apparatus according to claim 2, further comprising an integration means for integrating information of the first encoding mode for a plurality of blocks of a first block size in the first encoding scheme to generate information of the first encoding mode for a second block size, different from the first block size, for the second encoding scheme, wherein, when the information of the first encoding mode includes information of a first block division pattern in the first encoding scheme, determining the second encoding mode by the inference includes determining a second block division pattern in the second encoding scheme based on the information of the first block division pattern of the second block size generated by the integration means.

6. The image processing apparatus according to claim 5, wherein the first block size is 64 x 64 and the second block size is 128 x 128, and the integration means integrates information of the first block division pattern of four blocks in the first encoding scheme to generate the information of the first block division pattern of the second block size.

7. The image processing apparatus according to claim 2, wherein, when a first block division pattern is permitted in inter prediction of the first encoding scheme, a second block division pattern different from the first block division pattern is permitted in inter prediction of the second encoding scheme, and the second block division pattern includes a third block division pattern that divides a block diagonally, the information of the first encoding mode includes information of the first block division pattern in the first encoding scheme, and determining the second encoding mode by the inference includes selecting one of the third block division patterns based on the first block division pattern.

8. The image processing apparatus according to claim 7, further comprising object extraction means for extracting information about objects contained in the image to be encoded, wherein determining the second encoding mode by the inference further includes selecting one of the third block division patterns based on the information of the object.

9. The image processing apparatus according to claim 8, wherein selecting one of the third block division patterns includes selecting a pattern in which the images of the objects included in the blocks to be divided are not divided.

10. The image processing apparatus according to claim 9, wherein selecting a pattern that does not divide the image of an object included in the block includes selecting a pattern in which the distance between the contour of the object and the division boundary in the third block division pattern is closer than that of other patterns.

11. The image processing apparatus according to claim 9, wherein selecting a pattern that does not divide the image of an object included in the block includes selecting a pattern in the third block division pattern where the division boundary follows the contour of the object.

12. The image processing apparatus according to claim 7, wherein the information of the first encoding mode further includes information of the motion vectors of the blocks divided in the first block division pattern, and determining the second encoding mode by the inference includes determining the motion vector information in the selected third block division pattern based on the motion vector information.

13. The image processing apparatus according to claim 2, further comprising object extraction means for extracting object information from the image to be encoded, wherein determining the second encoding mode by the inference includes determining a third intra-prediction mode of the second encoding mode based on information of the intra-prediction mode in the first encoding scheme and information of the extracted object.

14. The image processing apparatus according to claim 13, wherein the third intra-prediction mode includes a first prediction mode that generates predicted pixels of a color difference signal from the decoded luminance signal of the same block, and the selection of the first prediction mode is based on the amount of edge components and the saturation level in the information of the object.

15. The image processing apparatus according to claim 13, wherein the third intra-prediction mode includes a second prediction mode in which an image encoded by the second encoding scheme is used as the predicted image for the same image to be encoded, and the selection of the second prediction mode is based on block size information in the information of the first encoding mode and the commonality of prediction modes among blocks.

16. The image processing apparatus according to claim 15, wherein the selection of the second prediction mode is further based on whether the object is a computer graphics image.

17. The image processing apparatus according to claim 13, wherein the third intra-prediction mode includes a third prediction mode that includes specifying the index of the colors that constitute the image to be encoded, and the selection of the third prediction mode is based on the size of the block in the information of the first encoding mode and whether the object is monochrome.

18. The image processing apparatus according to claim 1, wherein the first encoding scheme is H.264 or HEVC, and the second encoding scheme is AV1 or VVC.

19. The image processing apparatus according to claim 18, further comprising: a first encoding mode determination means for determining the first encoding mode in the first encoding scheme; and a first encoding means for encoding the image to be encoded using the first encoding mode, wherein, when encoding using the second encoding scheme is specified, the first encoding mode is determined by the first encoding mode determination means and encoding by the first encoding means is not performed.

20. The image processing apparatus according to claim 1, wherein the inference using the neural network is performed using at least the image to be encoded.

21. An imaging apparatus comprising: imaging means; and the image processing apparatus according to any one of claims 1 to 20.

22. A control method for an image processing apparatus, comprising: an encoding mode determination step of determining, based on information of a first encoding mode in a first encoding scheme for an image to be encoded, a second encoding mode in a second encoding scheme different from the first encoding scheme for the image to be encoded, by inference using a neural network; and an encoding step of encoding the image to be encoded using the second encoding mode.

23. A program for causing a computer to function as one of the means of an image processing apparatus according to any one of claims 1 to 20.