Moving image encoding device, decoding device
The motion image encoding and decoding devices with SEI messages containing random seed information support neural network image processing, addressing the inefficiencies in existing methods and ensuring consistent image generation results.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SHARP KK
- Filing Date
- 2025-10-15
- Publication Date
- 2026-06-25
AI Technical Summary
Existing image encoding and decoding methods do not support neural network image processing using a generative AI, and there is a need to improve the efficiency of video encoding and decoding using image generation techniques.
A motion image encoding device and decoding device that includes an image decoding device, a generated information decoding device, and an image generation device, with SEI messages containing random seed information to support neural network image processing using a diffusion model.
Enables efficient video encoding and decoding by ensuring consistent image generation results through the use of random seed information, allowing for improved image processing efficiency.
Figure JP2025036249_25062026_PF_FP_ABST
Abstract
Description
Moving image encoding device, decoding device
[0001] Embodiments of the present invention relate to a moving image encoding device and a decoding device.
[0002] In order to efficiently transmit or record a moving image, a moving image encoding device that generates encoded data by encoding an image and a moving image decoding device that generates a decoded image by decoding the encoded data are used.
[0003] Specific moving image encoding methods include, for example, the H.266 / VVC (Versatile Video Coding) method.
[0004] In such traditional image encoding methods, the image is divided and encoded / decoded. First, a predicted image is generated based on a local decoded image obtained by encoding the input image / decoding the encoded data. Next, a prediction error (sometimes referred to as a "difference image" or "residual image") obtained by subtracting the predicted image from the input image (original image) is encoded / decoded.
[0005] On the other hand, in recent years, Stable Diffusion, a generative AI method based on a diffusion model, has been disclosed as an image generation method using a neural network. In this method, an image can be generated from text entered by the user, called a prompt.
[0006] Non-Patent Document 1 defines Supplemental Enhancement Information (SEI) messages, a moving image encoding and decoding technique for transmitting image properties, display methods, timing, and the like together with the encoded data. Among them is the neural-network post-filter activation SEI message, which indicates that post-filter processing based on a neural network is to be applied.
[0007] Non-Patent Document 2 discloses an SEI message that encodes and decodes a prompt, extending the method of Non-Patent Document 1.
[0008] Non-Patent Document 1: ITU-T Rec. H.274 V3, "Versatile supplemental enhancement information messages for coded video bitstreams". Non-Patent Document 2: J. Boyce, J. Chen, S. Deshpande, M. M. Hannuksela, Hendry, S. McCarthy, G. J. Sullivan and Y.-K. Wang, "Additional SEI messages for VSEI version 4 (Draft 4)," JVET Document JVET-AH2006-v2, Dec. 7, 2024.
[0009] The method disclosed in Non-Patent Document 1 does not support neural network image processing using a generative AI that inputs prompt information text, thus posing a problem in that generative AI based on a diffusion model cannot be applied. The method disclosed in Non-Patent Document 2 allows prompt input as auxiliary input data, but it has the problem that the information of the random number seed necessary to always obtain the same result in the diffusion model is not defined.
[0010] A moving image decoding device according to one aspect of the present invention comprises an image decoding device for decoding encoded data of an image signal, a generated information decoding device for decoding SEI messages, and an image generation device for generating an image from the image information decoded by the image decoding device and the generated information decoded by the generated information decoding device, and has means for using random seed information contained in the SEI message, random seed information provided from an external source, or preset random seed information.
[0011] Furthermore, a moving image decoding device according to one aspect of the present invention comprises an image decoding device for decoding encoded data of an image signal, a generated information decoding device for decoding SEI messages, and an image generation device for generating an image from the image information decoded by the image decoding device and the generated information decoded by the generated information decoding device, wherein the SEI message includes means for updating the random seed information for each image.
[0012] Furthermore, a moving image encoding device according to one aspect of the present invention comprises an image encoding device for encoding an image signal into encoded data, a generation information encoding device for encoding an SEI message, and means for including random seed information in the SEI message, or for using random seed information provided from an external source or preset random seed information.
[0013] This configuration makes it possible to perform video encoding and decoding efficiently using image generation techniques.
[0014] The drawings are as follows.
- A schematic diagram showing the configuration of the image transmission system according to this embodiment.
- A diagram showing an example of a block diagram of the image generation processing device according to this embodiment.
- A diagram showing an example of a block diagram of the generation information creation device according to this embodiment.
- A diagram showing a part of the syntax of the NNPFC SEI message described in Non-Patent Document 2.
- A diagram showing an example of the extension of the NNPFC SEI message according to this embodiment.
- A diagram showing a part of the processing of auxiliary input data according to this embodiment.
- A diagram showing a part of the processing of auxiliary input data according to this embodiment.
- A diagram showing a part of the processing of auxiliary input data according to this embodiment.
- A diagram showing an example of the extension of the NNPFC SEI message according to this embodiment.
- A diagram showing the syntax of the NNPFA SEI message described in Non-Patent Document 2.
- A diagram showing an example of the extension of the NNPFA SEI message according to this embodiment.
[0015] (First Embodiment) Figure 1 is a conceptual diagram showing the configuration of the image transmission system according to this embodiment.
[0016] The image transmission system 1 consists of a video encoding device 10, a transmission network 20, a video decoding device 30, and an image display device 40.
[0017] The video encoding device 10 takes an input image signal T as input and outputs encoded data Te.
[0018] The transmission network 20 transmits encoded data Te from the video encoding device 10 to the video decoding device 30. The transmission network 20 is the Internet, a wide area network (WAN), a local area network (LAN), or a combination thereof. The network 20 is not necessarily limited to a bidirectional communication network; it may also be a unidirectional communication network that transmits broadcast waves such as terrestrial digital broadcasting or satellite broadcasting. Furthermore, the transmission network 20 may be replaced by a storage medium that records encoded data Te, such as a DVD (Digital Versatile Disc: trademark) or a BD (Blu-ray Disc: registered trademark).
[0019] The video decoding device 30 takes encoded data Te as input, outputs a generated image Td, and sends it to the image display device 40.
[0020] The image display device 40 displays all or part of the generated image Td output from the video decoding device 30. The image display device 40 includes a display device such as a liquid crystal display or an organic EL (electro-luminescence) display. Examples of display forms include stationary, mobile, and HMD (head-mounted display). Furthermore, when the video decoding device 30 has high processing capability, a high-quality image is displayed; when it has only lower processing capability, an image that does not require high processing or display capability is displayed.
[0021] The video encoding device 10 consists of an image encoding device 101, a generation information creation device 102, and a generation information encoding device 103.
[0022] The image encoding device 101 encodes the input image signal T to create encoded data Te, and sends the locally decoded image information to the generation information creation device 102.
[0023] The generation information creation device 102 takes as input the input image signal T, external data such as model information, and the image information decoded by the image encoding device 101, creates generation information, and sends it to the generation information encoding device 103.
[0024] The generation information encoding device 103 encodes the generation information to generate auxiliary extended information encoded data.
[0025] The video decoding device 30 consists of an image decoding device 301, a generated information decoding device 302, and an image generation processing device 303.
[0026] The image decoding device 301 receives the encoded data Te transmitted via the transmission network 20 as input, decodes the image information, and sends it to the generated information decoding device 302 and the image generation processing device 303.
[0027] The generated information decoding device 302 decodes the auxiliary extension information from the encoded data Te based on its syntax, decodes the generated information from the image information created by the image decoding device 301 and from external data such as encoded data and model information, and sends the generated information to the image generation processing device 303.
[0028] The image generation processing device 303 processes the image information decoded by the image decoding device 301 and the generated information decoded by the generated information decoding device 302 to generate a generated image Td, which is then output to the image display device 40.
[0029] In this embodiment, the image encoding device 101 and the image decoding device 301 are realized by applying general-purpose video encoding and decoding methods such as AVC, HEVC, and VVC.
[0030] Figure 2 is a conceptual diagram showing the configuration of the image generation processing device of this embodiment. The image generation processing device in this embodiment uses a generation image processing method based on so-called image generation AI, which is configured using a neural network such as a diffusion model. It takes image information, generation information, and model data as input and outputs a generated image.
[0031] The image generation processing device 303 consists of an image generation unit 3031, a control unit 3032, and a control image generation unit 3033. The image generation unit 3031 uses a generation image processing method configured with a Stable Diffusion neural network. The control unit 3032 uses a control method configured with a neural network called ControlNet. The control image generation unit 3033 generates a control image signal from image information.
[0032] The control unit 3032 takes image information, control parameter information from the generated information, and model data specified by the control parameters as input, and outputs control image information for input to the image generation unit 3031. Here, the image information is the locally decoded image signal output by the image encoding device 101, or the decoded image signal output by the image decoding device 301.
[0033] In both cases, the image information is obtained by encoding and decoding the input image signal.
[0034] The control image signal is created from image information by the control image generation unit 3033. Specifically, the control image signal uses the following information, and the identification of these is included in the control parameters.
・Canny image
・Soft edge image
・Sketch image
・Line art image
・Normal map image
・Depth map image
・Segmentation image
・OpenPose image
・Wireframe (MLSD) image
・Inpaint image
・Reference image
All of these are monochrome or color image information created from images. Note that the control image signal is not limited to a single image; multiple different control image signals may exist for the same image information.
[0035] The generation information consists of random number seed information (nnpfSeedVal) and prompt information (nnpfPromptVal).
[0036] Other control parameter information is acquired as external data. This control parameter information consists of parameters for controlling the control unit 3032 described above, and includes the identification of the basic image in the image information, the identification of the control image, and the model information of the control unit.
[0037] Other model information includes the neural network model name and neural network model data used for image generation by the image generation unit 3031. The model data is input as external data to the image generation processing device 303 and shared between the video encoding device 10 and the video decoding device 30. Alternatively, the video encoding device 10 and the video decoding device 30 may already have the same model data. Model parameters are parameters for controlling the neural network and consist of various numerical and string information such as intensity values, number of steps, and sampler type.
[0038] Prompt information is string information that indicates the content of the generated image. Prompt information includes positive prompt information, which represents the content to be generated, and negative prompt information, which represents content that should not be generated.
[0039] Positive prompt information can be automatically generated by analyzing the input image signal. Alternatively, it can be automatically generated by analyzing image information decoded by the image encoding device 101 or the image decoding device 301. The positive prompt may be encoded and decoded using the information created from the input image signal as part of the generated information. Alternatively, if information created from image information is used, it can be created by the generated information decoding device 302, and mode information indicating this can be sent. Alternatively, the difference between the information created from the input image signal and the information created from image information may be encoded as part of the generated information, and the information created from the input image signal may be decoded using the information created by the video decoding device 30.
[0040] Negative prompt information is shared between the video encoding device 10 and the video decoding device 30, and may be encoded and decoded as part of additional generated information when additional information needs to be sent.
[0041] Figure 3 is a block diagram showing the configuration of the generation information creation device 102 of the present embodiment. In the generation information creation device 102 in the present embodiment, an input image signal, image information created by the moving image encoding device 10, and external data are input, generation information is output, and the generation information is sent to the generation information encoding device 103.
[0042] The generation information creation device 102 includes a generation information creation unit 1021, an encoding control unit 1022, and an image generation processing device 1023. The image generation processing device 1023 is the same as the above-described image generation processing device 303, and outputs a generated image from the generation information, the image information, and the model data. The encoding control unit 1022 selects among the created generation information based on two indexes and outputs the optimal one. The first index is an evaluation criterion D of image similarity, obtained by comparing the generated image output by the image generation processing device 1023 with the input image signal, such as the mean square error, the sum of absolute errors, SSIM (Structural Similarity), MS-SSIM (Multi-Scale Structural Similarity), or LPIPS (Learned Perceptual Image Patch Similarity). The second index is the code amount R of the generation information created by the generation information creation unit 1021.
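The selection described in [0042] can be sketched as follows. This is only an illustration under stated assumptions: the description does not say how D and R are combined, so the Lagrangian cost J = D + lam * R, the weight lam, the Candidate container, and the choice of mean square error for D are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    """Hypothetical container for one set of generation information."""
    seed: int
    prompt: str
    generated: Sequence[float]  # generated image as flattened pixel values
    bits: int                   # code amount R of the generation information


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean square error between two equally sized pixel sequences (criterion D)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def select_generation_info(candidates, original, lam=0.01):
    """Return the candidate minimising J = D + lam * R (assumed cost form)."""
    return min(candidates, key=lambda c: mse(c.generated, original) + lam * c.bits)


original = [10.0, 20.0, 30.0, 40.0]
candidates = [
    Candidate(seed=1, prompt="a cat", generated=[11.0, 19.0, 31.0, 41.0], bits=80),
    Candidate(seed=2, prompt="a cat on a sofa", generated=[10.0, 20.0, 30.0, 41.0], bits=160),
]
best = select_generation_info(candidates, original)
```

With a larger lam the cheaper candidate wins even though its distortion is higher; with a smaller lam the closer match wins. Any of the other similarity measures named above could replace mse without changing the selection loop.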
[0043] The generation information creation unit 1021 generates generation information through information exchange with the encoding control unit 1022 and sends it to the image generation processing device 1023.
[0044] The generation information encoding device 103 encodes the generation information created by the generation information creation device 102 as auxiliary extended information encoded data, combines it with the encoded data output by the image encoding device 101 to create the encoded data Te, and sends the encoded data Te to the transmission network 20.
[0045] The generation information decoding device 302 decodes the auxiliary extended information encoded data within the encoded data Te sent from the transmission network 20, and sends the decoding result to the image generation processing device 303 as generation information.
[0046] In this embodiment, it is assumed that encoding and decoding are performed as SEI (Supplemental Enhancement Information) messages based on the syntax described later. Note that the encoding and decoding methods are not limited to SEI messages, and they may be encoded and decoded as syntax in video encoding and decoding methods, such as APS (Adaptation Parameter Set).
[0047] FIG. 4, FIG. 5, FIG. 10, FIG. 11, and FIG. 12 show the syntax of the SEI messages in Non-Patent Document 2 and of the SEI messages carrying the generated information encoded data that is encoded and decoded by the generated information encoding device 103 and the generated information decoding device 302 in this embodiment.
[0048] The Descriptor notation in the following syntax tables is interpreted as follows.
・b(8): Represents the value of a byte having an arbitrary pattern of a bit string (8 bits).
・f(n): Represents a bit string of a fixed pattern using n bits written in order from the left bit (from left to right).
・se(v): Represents a syntax element obtained by encoding a signed integer with a 0th-order Exp-Golomb code.
・st(v): Represents a string encoded in UTF-8 and terminated with null.
・u(n): Represents an unsigned integer using n bits. In the syntax table, when n is "v", the number of bits varies according to the values of other syntax elements.
・ue(v): Represents a syntax element obtained by encoding an unsigned integer with a 0th-order Exp-Golomb code (the left bit is the first).
[0049] The method disclosed in Non-Patent Document 1 does not support neural network image processing by a generative AI that takes prompt information text as input, so generative AI based on a diffusion model cannot be applied. Therefore, in the method disclosed in Non-Patent Document 2, prompt information can be input as auxiliary input data.
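As an illustration of how the u(n), ue(v), se(v), and st(v) descriptors are read, a minimal bit reader might look like the sketch below. The class name BitReader and its method names are hypothetical; the code follows the usual 0th-order Exp-Golomb definition and is not taken from this description.

```python
class BitReader:
    """Reads bits MSB-first (left bit first) from a bytes object."""

    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0  # current position in bits

    def u(self, n: int) -> int:
        """u(n): unsigned integer using n bits."""
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (7 - (self.pos & 7))) & 1
            value = (value << 1) | bit
            self.pos += 1
        return value

    def ue(self) -> int:
        """ue(v): unsigned integer, 0th-order Exp-Golomb code."""
        leading_zeros = 0
        while self.u(1) == 0:
            leading_zeros += 1
        return (1 << leading_zeros) - 1 + self.u(leading_zeros)

    def se(self) -> int:
        """se(v): signed integer, 0th-order Exp-Golomb code."""
        k = self.ue()
        return (k + 1) // 2 if k % 2 else -(k // 2)

    def byte_aligned(self) -> bool:
        """True when the current position is on a byte boundary."""
        return self.pos % 8 == 0

    def st(self) -> str:
        """st(v): null-terminated UTF-8 string, read from a byte boundary."""
        if not self.byte_aligned():
            raise ValueError("st(v) must start on a byte boundary")
        start = self.pos >> 3
        end = self.data.index(0, start)  # position of the terminating null
        self.pos = (end + 1) * 8
        return self.data[start:end].decode("utf-8")


# Bits 00100 (ue = 3) followed by 011 (se = -1) fill one byte, then "hi\0".
reader = BitReader(bytes([0b00100011]) + b"hi\x00")
first = reader.ue()
second = reader.se()
text = reader.st()
```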
[0050] Figure 4 shows a portion of the syntax of the NNPFC SEI message (Neural Network Post-Filter Characteristic SEI message) from Non-Patent Document 2, particularly the parts added between Non-Patent Document 1 and Non-Patent Document 2.
[0051] The syntax element nnpfc_auxiliary_inp_idc indicates whether auxiliary input data exists in the NNPFC SEI input tensor. If nnpfc_auxiliary_inp_idc is greater than 0, it indicates that auxiliary input data exists in the NNPF input tensor. If nnpfc_auxiliary_inp_idc is 0, there is no auxiliary input data in the input tensor. If nnpfc_auxiliary_inp_idc is 1, 2, or 3, auxiliary input data exists.
[0052] The value of nnpfc_auxiliary_inp_idc must be in the range of 0 to 255. Values of nnpfc_auxiliary_inp_idc from 4 to 255 are reserved for future use.
[0053] If nnpfc_auxiliary_inp_idc is 1, the auxiliary input data consists of strengthControlScaledVal[i]. If nnpfc_auxiliary_inp_idc is 2, the auxiliary input data consists of the string nnpfc_prompt. If nnpfc_auxiliary_inp_idc is 3, the auxiliary input data consists of strengthControlScaledVal[i] and the string nnpfc_prompt.
[0054] In other words, if the least significant bit of nnpfc_auxiliary_inp_idc in binary is 1, that is, if (nnpfc_auxiliary_inp_idc & 1) is greater than 0, then the auxiliary input data strengthControlScaledVal[i] exists. If the second least significant bit of nnpfc_auxiliary_inp_idc in binary is 1, that is, if (nnpfc_auxiliary_inp_idc & 2) is greater than 0, then the auxiliary input data includes the string nnpfc_prompt.
[0055] Note that if nnpfc_auxiliary_inp_idc is 2 or 3, then nnpfc_spatial_extrapolation_prompt_present_flag will be equal to 1.
[0056] The syntax element nnpfc_inp_order_idc indicates how the pixel array of the input image is ordered to form the input tensor to NNPF. The value of nnpfc_inp_order_idc ranges from 0 to 255. Values of nnpfc_inp_order_idc from 4 to 255 are reserved for future use.
[0057] The syntax element nnpfc_inband_prompt_flag indicates whether the prompt string contained in the input tensor is included in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_prompt_flag is 1, it indicates that the prompt string contained in the input tensor is included in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_prompt_flag is 0, it indicates that the prompt string contained in the input tensor is provided to the decoding system by an external means.
[0058] The function byte_aligned() returns whether the current position in the encoded data is on a byte boundary. If it is not, the syntax element nnpfc_alignment_zero_bit_c is inserted to adjust the bit position so that the next element starts on a byte boundary. nnpfc_alignment_zero_bit_c shall be equal to 0.
[0059] The syntax element nnpfc_prompt specifies a text string prompt to be used as input for NNPF. If nnpfc_prompt exists, it must not be a null string.
[0060] The variable nnpfcPrompt, which specifies the prompt string provided to the input tensor of a particular image picA when NNPF is activated, is derived as follows:
[0061] - If nnpfc_inband_prompt_flag is 1 and nnpfa_prompt_update_flag is 1, then nnpfcPrompt will be set to nnpfa_prompt.
[0062] - Otherwise, if nnpfc_inband_prompt_flag is 1 and nnpfa_prompt_update_flag is 0, nnpfcPrompt will be set to nnpfc_prompt.
[0063] - Otherwise, if nnpfc_inband_prompt_flag is 0 and the prompt string is provided by an external means, nnpfcPrompt will be set to that prompt string.
[0064] - Otherwise (nnpfc_inband_prompt_flag is 0 and no prompt string is provided by an external means), nnpfcPrompt is set to a null string.
[0065] If the value of nnpfc_inp_order_idc is 0, there is one luminance matrix in the input tensor for each input image. If the value of nnpfc_inp_order_idc is 1, there are two chrominance matrices in the input tensor. If the value of nnpfc_inp_order_idc is 2, there is one luminance matrix and two chrominance matrices in the input tensor. If the value of nnpfc_inp_order_idc is 3, there are four luminance matrices and two chrominance matrices.
[0066] Note that if the color difference format is not 4:2:0, nnpfc_inp_order_idc cannot be 3. For monochrome formats, nnpfc_inp_order_idc must be 0. In the case of resolution scaling of a color difference image, nnpfc_inp_order_idc must not be 0.
[0067] The auxiliary input data is added as channels after the pixel data of the input image, depending on the values of nnpfc_inp_order_idc and nnpfc_auxiliary_inp_idc. If present, strengthControlScaledVal[i] and the string nnpfc_prompt are input into the input tensor in that order.
[0068] The method disclosed in Non-Patent Document 2 allows for prompt input as auxiliary input data, but it has the problem that it does not define the information of the random number seed necessary to always obtain the same result in the diffusion model, i.e., the initial value for generating pseudorandom numbers. If the random number seed is different, not only will the results not always be the same, but there is also a possibility that an image different from the intended one will be generated.
[0069] Therefore, this embodiment describes a method for inputting random number seed information as auxiliary input data of the Neural Network Post-Filter Characteristic (NNPFC) SEI message.
[0070] Figure 5 shows a portion of the syntax of the NNPFC SEI message in this embodiment. The syntax element nnpfc_auxiliary_inp_idc, which indicates whether or not auxiliary input data exists in the NNPFC SEI input tensor, is extended to perform encoding and decoding of random seed information.
[0071] The syntax element nnpfc_auxiliary_inp_idc indicates whether auxiliary input data exists in the NNPFC SEI input tensor. If nnpfc_auxiliary_inp_idc is 0, there is no auxiliary input data in the input tensor. If nnpfc_auxiliary_inp_idc is greater than 0, that is, any value from 1 to 7, auxiliary input data exists in the NNPF input tensor.
[0072] The value of nnpfc_auxiliary_inp_idc must be in the range of 0 to 255. Values of nnpfc_auxiliary_inp_idc from 8 to 255 are reserved for future use.
[0073] If nnpfc_auxiliary_inp_idc is 1, the auxiliary input data consists of strengthControlScaledVal[i]. If nnpfc_auxiliary_inp_idc is 2, the auxiliary input data consists of the string nnpfc_prompt. If nnpfc_auxiliary_inp_idc is 3, the auxiliary input data consists of strengthControlScaledVal[i] and the string nnpfc_prompt. If nnpfc_auxiliary_inp_idc is 4, the auxiliary input data consists of random number seed information nnpfc_seed. If nnpfc_auxiliary_inp_idc is 5, the auxiliary input data consists of strengthControlScaledVal[i] and random number seed information nnpfc_seed. If nnpfc_auxiliary_inp_idc is 6, the auxiliary input data consists of the string nnpfc_prompt and random number seed information nnpfc_seed. If nnpfc_auxiliary_inp_idc is 7, the auxiliary input data consists of strengthControlScaledVal[i], the string nnpfc_prompt, and the random number seed information nnpfc_seed.
[0074] In other words, if the least significant bit of nnpfc_auxiliary_inp_idc in binary is 1, that is, if (nnpfc_auxiliary_inp_idc & 1) is greater than 0 (equal to 1), then the auxiliary input data is strengthControlScaledVal[i]. If the second least significant bit of nnpfc_auxiliary_inp_idc in binary is 1, that is, if (nnpfc_auxiliary_inp_idc & 2) is greater than 0 (equal to 2), then the auxiliary input data is the string nnpfc_prompt. If the third least significant bit of nnpfc_auxiliary_inp_idc in binary is 1, that is, if (nnpfc_auxiliary_inp_idc & 4) is greater than 0 (equal to 4), then the auxiliary input data is the random number seed information nnpfc_seed.
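The bit-flag interpretation in [0074] can be sketched as follows. The constant names and the helper auxiliary_inputs are hypothetical, and treating values above 7 as reserved is an assumption drawn from the value list in [0073].

```python
# Bit masks for nnpfc_auxiliary_inp_idc, following [0074].
AUX_STRENGTH = 1  # strengthControlScaledVal[i] is present
AUX_PROMPT = 2    # the string nnpfc_prompt is present
AUX_SEED = 4      # the random number seed information nnpfc_seed is present


def auxiliary_inputs(nnpfc_auxiliary_inp_idc: int) -> list:
    """List the auxiliary input data signalled by nnpfc_auxiliary_inp_idc."""
    if not 0 <= nnpfc_auxiliary_inp_idc <= 255:
        raise ValueError("nnpfc_auxiliary_inp_idc must be in the range 0 to 255")
    if nnpfc_auxiliary_inp_idc > 7:
        # Assumption: values not listed in [0073] are reserved.
        raise ValueError("reserved value of nnpfc_auxiliary_inp_idc")
    names = []
    if nnpfc_auxiliary_inp_idc & AUX_STRENGTH:
        names.append("strengthControlScaledVal")
    if nnpfc_auxiliary_inp_idc & AUX_PROMPT:
        names.append("nnpfc_prompt")
    if nnpfc_auxiliary_inp_idc & AUX_SEED:
        names.append("nnpfc_seed")
    return names


prompt_and_seed = auxiliary_inputs(6)
```

Because each kind of auxiliary data has its own bit, the seed can be added without changing the meaning of the existing values 1 to 3.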
[0075] The syntax element nnpfc_inband_seed_flag indicates whether the random seed information contained in the input tensor is included in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_seed_flag is 1, it indicates that the random seed information contained in the input tensor is included in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_seed_flag is 0, it indicates that the random seed information contained in the input tensor is provided to the decoding system by an external means.
[0076] The syntax element nnpfc_seed specifies the random seed information used as input to NNPF.
[0077] The variable nnpfSeedVal, which indicates the random seed information provided to the input tensor of a specific image picA on which NNPF is activated, is derived as follows:
- If nnpfc_inband_seed_flag is 1 and nnpfa_seed_update_flag is 1, nnpfSeedVal is set to nnpfa_seed.
- Otherwise, if nnpfc_inband_seed_flag is 1 and nnpfa_seed_update_flag is 0, nnpfSeedVal is set to nnpfc_seed.
- Otherwise, if nnpfc_inband_seed_flag is 0 and the random seed information is provided by an external means, nnpfSeedVal is set to that external random seed information.
- Otherwise (if nnpfc_inband_seed_flag is 0 and the random seed information is not provided by an external means), nnpfSeedVal is set to 0.
Note that this default value may be a constant other than 0, provided that it does not change.
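The derivation of nnpfSeedVal in [0077] can be sketched as a single function. The function name derive_seed and the parameters external_seed and default_seed are hypothetical; default_seed stands in for the preset value, which the text sets to 0.

```python
from typing import Optional


def derive_seed(nnpfc_inband_seed_flag: int,
                nnpfa_seed_update_flag: int,
                nnpfc_seed: int,
                nnpfa_seed: int,
                external_seed: Optional[int] = None,
                default_seed: int = 0) -> int:
    """Return nnpfSeedVal following the four cases of [0077]."""
    if nnpfc_inband_seed_flag == 1 and nnpfa_seed_update_flag == 1:
        return nnpfa_seed      # per-picture update carried in the NNPFA SEI message
    if nnpfc_inband_seed_flag == 1:
        return nnpfc_seed      # seed carried in the NNPFC SEI message
    if external_seed is not None:
        return external_seed   # seed provided by an external means
    return default_seed        # preset seed


updated = derive_seed(1, 1, nnpfc_seed=11, nnpfa_seed=22)
```

The order of the checks matters: an update in the activation message overrides the seed in the characteristics message, and the preset value is used only when nothing else is available.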
[0078] The syntax element nnpfc_seed should be a positive integer value represented as a 32-bit binary number, indicating the random number seed used as input for the target post-processing. Note that it does not have to be 32 bits; 16 bits, 64 bits, or any other number of bits sufficient for random number generation is acceptable.
[0079] Figures 6, 7, 8, and 9 show the processing of auxiliary input data within the DeriveInputTensors() process that derives the input tensor inputTensor in this embodiment. Here, a patch refers to a rectangular array composed of pixels that make up the components of a picture, i.e., luminance and chrominance signals. The input tensor inputs pixels with an overlap of nnpfc_overlap pixels relative to the picture, so the vertical address variable yPovlp and the horizontal address variable xPovlp are defined as follows, with the vertical range being from yP = -nnpfc_overlap to yP = inpPatchHeight + nnpfc_overlap - 1 and the horizontal range being from xP = -nnpfc_overlap to xP = inpPatchWidth + nnpfc_overlap - 1.
[0080]
yPovlp = yP + nnpfc_overlap
xPovlp = xP + nnpfc_overlap
The input format of pixel data differs depending on the value of nnpfc_inp_order_idc. Therefore, when supplementary input data is input to the input tensor inputTensor in addition to the pixel data, the input channel needs to be changed depending on the value of nnpfc_inp_order_idc.
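As a small worked example of the address shift, the sketch below enumerates the patch coordinate and the shifted tensor address along one axis. The helper name overlapped_addresses and the sample sizes are hypothetical.

```python
def overlapped_addresses(inp_patch_size: int, nnpfc_overlap: int):
    """Return (p, pOvlp) pairs along one axis, where pOvlp = p + nnpfc_overlap."""
    return [(p, p + nnpfc_overlap)
            for p in range(-nnpfc_overlap, inp_patch_size + nnpfc_overlap)]


# A patch of height 4 with an overlap of 2 covers yP = -2 .. 5,
# which maps to the tensor addresses yPovlp = 0 .. 7.
pairs = overlapped_addresses(4, 2)
```

The shift makes every tensor address non-negative, so the overlap rows and columns can be stored in an ordinary zero-based array.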
[0081] In this embodiment, a variable numChannel is defined to indicate the position of the input channel. When the value of nnpfc_inp_order_idc is 0, numChannel is set to 1; when the value of nnpfc_inp_order_idc is 1, numChannel is set to 2; when the value of nnpfc_inp_order_idc is 2, numChannel is set to 3; and when the value of nnpfc_inp_order_idc is 3, numChannel is set to 6.
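The channel bookkeeping can be sketched as follows. The mapping table is taken from [0081]; the helper auxiliary_channel_positions and the assumption that numChannel advances by one after each auxiliary input that is present are illustrative.

```python
# Starting channel for auxiliary data, indexed by nnpfc_inp_order_idc ([0081]).
NUM_CHANNEL_BY_INP_ORDER_IDC = {0: 1, 1: 2, 2: 3, 3: 6}


def auxiliary_channel_positions(nnpfc_inp_order_idc: int,
                                nnpfc_auxiliary_inp_idc: int) -> dict:
    """Map each auxiliary input that is present to its channel index."""
    num_channel = NUM_CHANNEL_BY_INP_ORDER_IDC[nnpfc_inp_order_idc]
    positions = {}
    for mask, name in ((1, "strengthControlScaledVal"),
                       (2, "nnpfcPrompt"),
                       (4, "nnpfSeedVal")):
        if nnpfc_auxiliary_inp_idc & mask:
            positions[name] = num_channel
            num_channel += 1  # assumed: next auxiliary input uses the next channel
    return positions


all_three = auxiliary_channel_positions(3, 7)
```

The starting values equal the number of pixel channels for each ordering (for example, four luminance plus two chrominance matrices give 6), so the auxiliary data always lands after the pixel data.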
[0082] Then, the following process inputs strengthControlScaledVal[i], the prompt string nnpfcPrompt, and the random number seed information nnpfSeedVal into the input tensor.
[0083]
if( ( nnpfc_auxiliary_inp_idc & 1 ) > 0 ) {
  if( !nnpfc_component_last_flag )
    inputTensor[ 0 ][ i ][ numChannel ][ yPovlp ][ xPovlp ] = strengthControlScaledVal[ i ]
  else
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel ] = strengthControlScaledVal[ i ]
  numChannel++
}
if( ( nnpfc_auxiliary_inp_idc & 2 ) > 0 ) {
  if( !nnpfc_component_last_flag )
    inputTensor[ 0 ][ i ][ numChannel ][ yPovlp ][ xPovlp ] = promptCharVal
  else
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel ] = promptCharVal
  numChannel++
}
if( ( nnpfc_auxiliary_inp_idc & 4 ) > 0 ) {
  if( !nnpfc_component_last_flag )
    inputTensor[ 0 ][ i ][ numChannel ][ yPovlp ][ xPovlp ] = nnpfSeedVal
  else
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel ] = nnpfSeedVal
}
Here, utf8ToUInt(x) is a function that converts a UTF-8 character of a string to an integer and is defined as follows.
[0084] utf8ToUInt( x ) {
  result = 0
  len = 0
  /* Check end of text prompt string */
  if( x == null )
    return 0
  /* Determine the number of bytes in the UTF-8 character */
  if( ( x[ 0 ] & 0x80 ) == 0 )
    len = 1 /* 1-byte character */
  else if( ( x[ 0 ] & 0xE0 ) == 0xC0 )
    len = 2 /* 2-byte character */
  else if( ( x[ 0 ] & 0xF0 ) == 0xE0 )
    len = 3 /* 3-byte character */
  else if( ( x[ 0 ] & 0xF8 ) == 0xF0 )
    len = 4 /* 4-byte character */
  else /* Invalid UTF-8 character; this case shall not occur in bitstreams. */
    len = 0
  for( i = 0; i < len; i++ ) /* Construct an integer from the bytes */
    result = ( result << 8 ) | x[ i ]
  x = x + len /* Modifies the input variable, which is a syntax element */
  return result
}
Note that the syntax element nnpfc_component_last_flag is a flag that indicates which dimension of the input tensor to NNPF and of the output tensor of NNPF is used for the channel. If nnpfc_component_last_flag is 1, the last dimension of the input tensor to NNPF and of the output tensor of NNPF is used for the channel. If nnpfc_component_last_flag is 0, the third dimension of the input tensor to NNPF and of the output tensor of NNPF is used for the channel.
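For readers who want to try the conversion, a Python port of utf8ToUInt might look like the sketch below. Because Python cannot advance a pointer in place, this version returns the next offset alongside the value; that return shape and the name utf8_to_uint are adaptations, not part of the syntax.

```python
def utf8_to_uint(data: bytes, offset: int = 0):
    """Return (value, next_offset) for the UTF-8 character starting at offset.

    The value packs the raw UTF-8 bytes into one integer, most significant
    byte first; it is not the Unicode code point.
    """
    if offset >= len(data) or data[offset] == 0:
        return 0, offset           # end of the prompt string
    first = data[offset]
    if (first & 0x80) == 0:
        length = 1                 # 1-byte character
    elif (first & 0xE0) == 0xC0:
        length = 2                 # 2-byte character
    elif (first & 0xF0) == 0xE0:
        length = 3                 # 3-byte character
    elif (first & 0xF8) == 0xF0:
        length = 4                 # 4-byte character
    else:
        length = 0                 # invalid lead byte; shall not occur in bitstreams
    value = 0
    for byte in data[offset:offset + length]:
        value = (value << 8) | byte
    return value, offset + length


# Walk through a short prompt one character at a time.
prompt = "aé".encode("utf-8") + b"\x00"
first_value, pos = utf8_to_uint(prompt)
second_value, pos = utf8_to_uint(prompt, pos)
```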
[0085] In the method described above, the random seed information is written into the input tensor. However, because explicitly signaling the random seed information already allows the noise used in the diffusion model to be reproduced with the same results, writing the random seed information into the input tensor is not strictly required.
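The role of the seed can be illustrated with a minimal sketch. Python's standard random module is used here only as a stand-in for whatever noise generator the diffusion model actually uses, and make_noise is a hypothetical helper name. The point is that two sides that share the seed and the same generator obtain identical noise.

```python
import random


def make_noise(seed: int, count: int) -> list:
    """Return `count` Gaussian noise samples derived only from `seed`."""
    rng = random.Random(seed)  # private generator; global state is untouched
    return [rng.gauss(0.0, 1.0) for _ in range(count)]


# Both sides derive their noise from the signaled seed.
encoder_noise = make_noise(42, 8)
decoder_noise = make_noise(42, 8)
```

In this sketch encoder_noise and decoder_noise are equal element by element, while a different seed gives a different sequence. Reproducibility in a real system would additionally require that both sides use the same random number generation algorithm.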
[0086] Figure 12 shows part of the syntax of another NNPFC SEI message in this embodiment. When a prompt string is present in the auxiliary input data, the random seed information is encoded and decoded as input tensor data of the NNPFC SEI message.
[0087] The difference from the above example is that when nnpfc_auxiliary_inp_idc is 2, the auxiliary input data consists of the string nnpfc_prompt and the random number seed information nnpfc_seed. When nnpfc_auxiliary_inp_idc is 3, the auxiliary input data consists of strengthControlScaledVal[i], the string nnpfc_prompt, and the random number seed information nnpfc_seed. In other words, when nnpfc_auxiliary_inp_idc is represented in binary and its second lowest bit is 1, that is, when ( nnpfc_auxiliary_inp_idc & 2 ) is greater than 0 (equal to 2), the auxiliary input data includes the string nnpfc_prompt and the random number seed information nnpfc_seed.
[0088] The syntax element nnpfc_inband_seed_flag indicates whether the random seed information contained in the input tensor is carried in-band, that is, in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_seed_flag is 1, the random seed information is included in this NNPFC SEI message or in the NNPFA SEI message that activates the NNPF defined in this NNPFC SEI message. If nnpfc_inband_seed_flag is 0, the random seed information is provided to the decoding system by an external means.
[0089] The syntax element nnpfc_seed specifies the random seed information used as input to the image generation unit 3031, which uses a diffusion model.
[0090] The variable nnpfSeedVal, which indicates the random seed information provided to the input tensor of a specific image picA when the image generation unit 3031 is activated, is derived as follows:
- If nnpfc_inband_seed_flag is 1 and nnpfa_seed_update_flag is 1, nnpfSeedVal is set to nnpfa_seed.
[0091] - Otherwise, if nnpfc_inband_seed_flag is 1 and nnpfa_seed_update_flag is 0, nnpfSeedVal is set to nnpfc_seed.
[0092] - Otherwise, if nnpfc_inband_seed_flag is 0 and random seed information is provided by an external means, nnpfSeedVal is set to that external random seed information.
[0093] - In all other cases (when nnpfc_inband_seed_flag is 0 and random number seed information is not provided by an external means), nnpfSeedVal is set to 0. Note that a value other than 0 may be used instead, as long as it is a fixed, predetermined integer value (for example, 42).
The syntax element nnpfc_seed is a positive integer value represented as a 32-bit binary number that indicates the random number seed used as input to the target post-processing. Note that it does not have to be 32 bits; 16 bits or 64 bits are also acceptable as long as the precision required for random number generation is provided.
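The derivation of nnpfSeedVal described above can be sketched in Python as follows. The function name, the keyword arguments, and the use of None to represent "not provided" are assumptions of this sketch; it also assumes that the in-band seed values are actually present whenever nnpfc_inband_seed_flag is 1.

```python
DEFAULT_SEED = 0  # fallback named in the text; any fixed integer would do


def derive_nnpf_seed_val(nnpfc_inband_seed_flag, nnpfa_seed_update_flag,
                         nnpfc_seed=None, nnpfa_seed=None,
                         external_seed=None):
    """Select the seed for one picture, following the four cases in order."""
    if nnpfc_inband_seed_flag == 1:
        if nnpfa_seed_update_flag == 1:
            return nnpfa_seed  # updated by the activating NNPFA SEI message
        return nnpfc_seed      # seed carried in the NNPFC SEI message
    if external_seed is not None:
        return external_seed   # supplied by an external means
    return DEFAULT_SEED        # nothing signaled and nothing supplied
```

For example, derive_nnpf_seed_val(1, 0, nnpfc_seed=7) selects the NNPFC seed, while derive_nnpf_seed_val(0, 0) falls back to the default.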
[0094] To set auxiliary input data as an input tensor, do the following:
[0095] The variable numChannel, which indicates the position of the input channel, is set to 1 if the value of nnpfc_inp_order_idc is 0, to 2 if the value of nnpfc_inp_order_idc is 1, to 3 if the value of nnpfc_inp_order_idc is 2, and to 6 if the value of nnpfc_inp_order_idc is 3.
[0096] Then, the following process inputs strengthControlScaledVal[i], the prompt string nnpfcPrompt, and the random number seed information nnpfSeedVal into the input tensor. The definitions of yPovlp and xPovlp are the same as in the embodiment described above.
[0097]
if( ( nnpfc_auxiliary_inp_idc & 1 ) > 0 ) {
  if( !nnpfc_component_last_flag )
    inputTensor[ 0 ][ i ][ numChannel ][ yPovlp ][ xPovlp ] = strengthControlScaledVal[ i ]
  else
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel ] = strengthControlScaledVal[ i ]
  numChannel++
}
if( ( nnpfc_auxiliary_inp_idc & 2 ) > 0 ) {
  promptCharVal = utf8ToUInt( nnpfcPrompt )
  if( !nnpfc_component_last_flag ) {
    inputTensor[ 0 ][ i ][ numChannel ][ yPovlp ][ xPovlp ] = promptCharVal
    inputTensor[ 0 ][ i ][ numChannel + 1 ][ yPovlp ][ xPovlp ] = nnpfSeedVal
  } else {
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel ] = promptCharVal
    inputTensor[ 0 ][ i ][ yPovlp ][ xPovlp ][ numChannel + 1 ] = nnpfSeedVal
  }
}
By using this configuration, it becomes possible to define, encode, and decode the random seed information necessary to uniquely generate an image in a diffusion model.
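The channel placement in this second configuration can be sketched as follows. This is a sketch under stated assumptions rather than the normative process: it assumes that the strength value, when present, occupies one channel before the prompt, and that the seed is written to the channel immediately after the prompt whenever bit 2 of nnpfc_auxiliary_inp_idc is set. The names aux_index and place_aux_inputs are hypothetical, and the tensor is represented as a dictionary from index tuples to values.

```python
# Starting channel for auxiliary inputs, indexed by nnpfc_inp_order_idc.
NUM_CHANNEL_BY_INP_ORDER = {0: 1, 1: 2, 2: 3, 3: 6}


def aux_index(i, channel, y, x, component_last):
    """Index tuple for one tensor element, channel-last or channel-third."""
    if component_last:
        return (0, i, y, x, channel)
    return (0, i, channel, y, x)


def place_aux_inputs(aux_idc, inp_order_idc, component_last, i, y, x,
                     strength_val, prompt_char_val, seed_val):
    """Return {index: value} for the auxiliary inputs at position (y, x)."""
    writes = {}
    ch = NUM_CHANNEL_BY_INP_ORDER[inp_order_idc]
    if aux_idc & 1:  # strength control value present
        writes[aux_index(i, ch, y, x, component_last)] = strength_val
        ch += 1
    if aux_idc & 2:  # prompt present, seed follows in the next channel
        writes[aux_index(i, ch, y, x, component_last)] = prompt_char_val
        writes[aux_index(i, ch + 1, y, x, component_last)] = seed_val
    return writes
```

With nnpfc_auxiliary_inp_idc equal to 3 and nnpfc_inp_order_idc equal to 0, the strength, prompt, and seed values land in channels 1, 2, and 3 respectively.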
[0098] This embodiment has shown that, by defining a new NNPFC SEI message that extends the existing NNPFC SEI message and by encoding and decoding the seed information needed to uniquely generate the random numbers required by the diffusion model, an image transmission system that combines a moving image encoding and decoding method with an image generation method can be realized.
[0099] Furthermore, in the method disclosed in Non-Patent Document 2, prompts can be entered in Neural Network Post-Filter Activation Extension (NNPFAE) SEI messages, and auxiliary input data can be updated.
[0100] The part of the NNPFA SEI syntax in Figure 11 that starts from the `if (more_data_in_payload())` statement is the part that Non-Patent Document 2 adds relative to Non-Patent Document 1.
[0101] The syntax and semantics of this section will be explained below.
[0102] more_data_in_payload() is specified as follows:
[0103] - If byte_aligned() is equal to TRUE and the current position in the SEI message syntax structure or vui_parameters() syntax structure is 8 * payloadSize bits from the beginning of the syntax structure, then more_data_in_payload() returns FALSE.
- Otherwise, more_data_in_payload() returns TRUE.
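The two cases above can be sketched in Python as follows. The parameter bit_pos, meaning the number of bits consumed from the beginning of the syntax structure, is an assumption of this sketch; a real parser would read it from its bitstream reader state.

```python
def byte_aligned(bit_pos: int) -> bool:
    """True when the current position is on a byte boundary."""
    return bit_pos % 8 == 0


def more_data_in_payload(bit_pos: int, payload_size: int) -> bool:
    """False only when exactly 8 * payload_size bits have been consumed."""
    if byte_aligned(bit_pos) and bit_pos == 8 * payload_size:
        return False
    return True
```

For a payload of 2 bytes, more_data_in_payload(16, 2) is False, while any earlier position such as 8 or 13 gives True.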
[0104] If more_data_in_payload() is TRUE, the following syntax elements exist.
[0105] The syntax element nnpfa_prompt_update_flag is a flag that indicates whether the syntax elements nnpfa_prompt and nnpfa_alignment_zero_bit exist. If nnpfa_prompt_update_flag is 1, it indicates that the syntax elements nnpfa_prompt and nnpfa_alignment_zero_bit exist. If nnpfa_prompt_update_flag is 0, it indicates that the syntax elements nnpfa_prompt and nnpfa_alignment_zero_bit do not exist. If nnpfa_prompt_update_flag itself does not exist, its value is inferred to be 0.
[0106] If the nnpfc_prompt_present_flag of the NNPFC SEI message is 0, then the value of nnpfa_prompt_update_flag must be 0.
[0107] The function `byte_aligned()` returns whether the current position in the encoded data is on a byte boundary. If it is not, the syntax element `nnpfa_alignment_zero_bit` is inserted to adjust the bit position so that the next element starts on a byte boundary. The value of the syntax element `nnpfa_alignment_zero_bit` is equal to 0.
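As a small worked example, the number of alignment bits needed at a given position can be computed as below. The helper name alignment_zero_bits_needed and the bit_pos parameter (bits consumed so far) are assumptions of this sketch.

```python
def alignment_zero_bits_needed(bit_pos: int) -> int:
    """Number of zero bits to insert so the next element is byte-aligned."""
    return (-bit_pos) % 8
```

At bit position 13, three zero bits bring the position to 16; at an already aligned position such as 16, none are needed.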
[0108] The syntax element nnpfa_prompt indicates the prompt used as input for the target NNPF. If nnpfa_prompt_update_flag is 1, nnpfa_prompt must not be a null string. If nnpfa_prompt exists, the text of nnpfc_prompt for processing in the NNPFC SEI message is replaced with the text of nnpfa_prompt.
[0109] nnpfa_num_input_pic_shift specifies the number of input images to shift within the list of candidate input images to obtain the final input image for the target NNPF. If it does not exist, the value of nnpfa_num_input_pic_shift is assumed to be 0. The value of nnpfa_num_input_pic_shift must be in the range of 0 to 63.
[0110] The method disclosed in Non-Patent Document 2 allows for prompt input as auxiliary input data, but it has the problem that the random seed information necessary to uniquely generate an image in the diffusion model is not defined.
[0111] Therefore, this embodiment demonstrates a method for uniquely updating random seed information for the diffusion model in Neural Network Post-Filter Activation Extension (NNPFAE) SEI messages.
[0112] Figure 12 shows the syntax for sending random seed information for the diffusion model in the NNPFA SEI message.
[0113] If more_data_in_payload() is TRUE, the following syntax is added:
[0114] The syntax element nnpfa_seed_update_flag is a flag that indicates whether the syntax element nnpfa_seed may exist. If nnpfa_seed_update_flag is 1, it indicates that the syntax element nnpfa_seed may exist. If nnpfa_seed_update_flag is 0, it indicates that the syntax element nnpfa_seed does not exist. If nnpfa_seed_update_flag does not exist, its value is inferred to be 0.
[0115] Furthermore, if the nnpfc_seed_present_flag of the NNPFC SEI message is 0, the value of nnpfa_seed_update_flag must also be 0.
[0116] The syntax element nnpfa_seed is a positive integer value represented in 32 bits of binary, which represents the random number seed used as input to the image generation unit 3031 using the target diffusion model. Note that it does not have to be 32 bits; 16 bits or 64 bits are acceptable as long as they provide the necessary precision for random number generation.
[0117] If nnpfa_seed_update_flag is 1 and nnpfa_seed exists, the value of nnpfSeedVal in the diffusion model image generation unit 3031 shall be replaced with the value of nnpfa_seed. If nnpfa_seed_update_flag is 0, the value of nnpfSeedVal defined in the NNPFC SEI message shall be used.
[0118] In the above embodiment, the random seed information for NNPFC SEI and NNPFA SEI was defined in conjunction with each other. However, it is also possible to define only NNPFA SEI without defining NNPFC SEI. In this case, the semantics are defined as follows.
[0119] The syntax element nnpfa_seed_update_flag is a flag that indicates whether the syntax element nnpfa_seed exists. If nnpfa_seed_update_flag is 1, it indicates that the syntax element nnpfa_seed exists. In this case, nnpfSeedVal is set to nnpfa_seed. If nnpfa_seed_update_flag is 0, it indicates that the syntax element nnpfa_seed does not exist. In this case, nnpfSeedVal is set to a value provided externally. If no random seed information is provided externally, nnpfSeedVal is set to 0. Note that a value other than 0 may be used instead, as long as it is a fixed, predetermined integer value (for example, 42). This configuration allows the random seed information necessary to uniquely generate images in the diffusion model to be updated in a timely manner.
[0120] In this embodiment, we demonstrated that by defining a new NNPFA SEI message that extends the NNPFA SEI message, and by encoding and decoding seed information for uniquely generating the random numbers required in the diffusion model, an image transmission system using an image generation method for video encoding and decoding can be realized.
[0121] Furthermore, some or all of the video encoding device 10 and video decoding device 30 in the above-described embodiment may be implemented using a computer. In that case, the program for implementing this control function may be recorded on a computer-readable recording medium, and the program recorded on this recording medium may be loaded into a computer system and executed. Here, "computer system" refers to a computer system built into either the video encoding device 10 or the video decoding device 30, and includes an OS and hardware such as peripheral devices. Furthermore, "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. Moreover, "computer-readable recording medium" may also include media that hold a program dynamically for a short period of time, such as communication lines used when a program is transmitted via a network such as the Internet or via a communication line such as a telephone line, and media that hold a program for a certain period of time, such as volatile memory inside a computer system that acts as a server or client in such a case. Furthermore, the above-mentioned program may implement only some of the functions described above, or may implement those functions in combination with programs already recorded in the computer system.
[0122] Furthermore, some or all of the video encoding device 10 and video decoding device 30 in the above-described embodiment may be implemented as an integrated circuit such as an LSI (Large Scale Integration). Each functional block of the video encoding device 10 and video decoding device 30 may be individually implemented as a processor, or some or all of them may be integrated into a single processor. In addition, the method of implementing the integrated circuit is not limited to LSIs; it may also be implemented using dedicated circuits or general-purpose processors. Furthermore, if an integrated circuit technology that can replace LSIs emerges due to advances in semiconductor technology, an integrated circuit using that technology may be used.
[0123] Although one embodiment of this invention has been described in detail above with reference to the drawings, the specific configuration is not limited to that described above, and various design changes can be made without departing from the spirit of this invention.
[0124] The embodiments of the present invention are not limited to those described above, and various modifications are possible within the scope of the claims. That is, embodiments obtained by combining technical means that have been appropriately modified within the scope of the claims are also included in the technical scope of the present invention.
[0125] Embodiments of the present invention can be suitably applied to a video decoding device that decodes encoded data in which an image signal has been encoded, and a video encoding device that generates encoded data in which image data has been encoded. Furthermore, they can be suitably applied to the data structure of encoded data generated by the video encoding device and referenced by the video decoding device.
[0126]
1 Image transmission system
10 Video encoding device
101 Image encoding device
102 Generated information creation device
1021 Generated information creation unit
1023, 303 Image generation processing device
1022 Encoding control unit
103 Generated information encoding device
1031 Auxiliary extended information encoding unit
20 Transmission network
30 Video decoding device
301 Image decoding device
302 Generated information decoding device
3021 Auxiliary extended information decoding unit
303 Image generation processing device
3031 Image generation unit
3032 Control unit
3033 Control image generation unit
40 Image display device
Claims
1. A motion image decoding device comprising: an image decoding device for decoding encoded data of an image signal; a generation information decoding device for decoding SEI messages; and an image generation device for generating an image from the image information decoded by the image decoding device and the generation information decoded by the generation information decoding device, wherein the SEI message has means for having random seed information, random seed information provided from an external source, or using pre-set random seed information.
2. A motion image decoding apparatus comprising: an image decoding device for decoding encoded data of an image signal; a generated information decoding device for decoding SEI messages; and an image generation device for generating an image from the image information decoded by the image decoding device and the generated information decoded by the generated information decoding device, wherein the SEI message has means for updating random number seed information for each image.
3. A video encoding device comprising: an image encoding device for encoding encoded data of an image signal; a generation information encoding device for encoding an SEI message; and a means for the SEI message to have random seed information, random seed information provided from an external source, or pre-set random seed information.