Image processing apparatus, image processing method, and storage medium
By selecting teacher data with high similarity to the current image, segmenting the image into local regions, generating a learning model, and inferring high-frequency components, the problem of poor high-definition image generation in existing technologies is solved, and high-accuracy high-definition image generation is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CANON KK
- Filing Date
- 2023-01-31
- Publication Date
- 2026-06-23
AI Technical Summary
Existing machine-learning super-resolution imaging technology struggles to accurately recover high-frequency components, especially when shooting conditions change, resulting in reduced inference accuracy and poor high-definition image generation.
By selecting teacher data with high similarity to the current image, the image is segmented into local regions, a learning model is generated, and high-frequency components are inferred to produce a high-resolution image.
It improves the definition and accuracy of the generated images, in particular maintaining high inference accuracy when shooting conditions change.
Figure CN116546152B_ABST
Description
Technical Field
[0001] This invention relates to an image processing apparatus and an image processing method that use machine learning to increase the resolution of a group of images, and to a storage medium.
Background Technology
[0002] In super-resolution imaging using machine learning, when an image is enlarged and its resolution is converted, a high-resolution image can be generated by using machine learning to infer high-frequency components that cannot be estimated through linear interpolation of pixel values. In super-resolution imaging, a learning model is first generated using, as teacher data, an image group G and degraded images obtained by degrading the images in image group G by an arbitrary method. The learning model is generated by learning the differences in pixel values between each original image and its degraded image and updating its own super-resolution processing parameters. When an image H with insufficient high-frequency components is input into the learning model generated in this way, the high-frequency components are obtained by inference using the learning model. By superimposing the high-frequency components obtained through inference onto image H, a high-resolution image can be generated. When super-resolution processing is performed on a moving image, a high-resolution moving image can be generated by inputting all frames into the learning model one at a time.
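The relationship between an original image, its degraded counterpart, and the pixel-value difference that the model learns can be illustrated with a minimal sketch. The degradation used here (block averaging) is only one arbitrary choice, and the function names `degrade`, `make_teacher_pair`, and `superimpose` are hypothetical and do not appear in the patent.

```python
def degrade(image, factor=2):
    """Remove high frequencies by replacing each factor x factor block
    with its mean (an assumed, arbitrary degradation method)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for by in range(0, h, factor):
        for bx in range(0, w, factor):
            ys = range(by, min(by + factor, h))
            xs = range(bx, min(bx + factor, w))
            mean = sum(image[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def make_teacher_pair(original, factor=2):
    """Return (degraded image, pixel-value difference the model should learn)."""
    degraded = degrade(original, factor)
    difference = [[o - d for o, d in zip(row_o, row_d)]
                  for row_o, row_d in zip(original, degraded)]
    return degraded, difference


def superimpose(image, high_freq):
    """Add inferred high-frequency components onto an image."""
    return [[p + h for p, h in zip(row_p, row_h)]
            for row_p, row_h in zip(image, high_freq)]


original = [[10, 20, 30, 40], [50, 60, 70, 80]]
degraded, difference = make_teacher_pair(original)
# With a perfect inference of the difference, the original is recovered.
restored = superimpose(degraded, difference)
```

In an actual system the difference would be predicted by the trained model rather than computed from the original, so the restoration is only as good as the inference.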
[0003] Typically, when products or services using learning models are provided, the developer collects the teacher data and generates the learning model, which is then provided to the user. Therefore, at the time of learning, the content of the moving image that the user will input is unknown. Consequently, the developer prepares as teacher data a large number of images of many types and categories, without bias in image patterns, and uses them in learning so that inference of uniform accuracy is possible for all types of target moving images.
[0004] For example, Japanese Patent Application Publication No. 2019-204167 (Patent Document 1) describes a technique that uses a learning model trained on a wide variety of images to perform super-resolution processing on moving images. However, due to the large variety of teacher data, the amount of teacher data that has a high similarity to the user-specified inference target moving image Q may be very small. When using such a learning model, the learning results using images with low similarity to the inference target moving image Q are reflected in the inference processing. As a result, improvements are limited to enhancing sharpness by emphasizing the edges of the subject, and it is difficult to accurately infer high-frequency components such as detailed patterns on the subject, meaning that the inference accuracy cannot be considered high.
[0005] Japanese Patent Application Publication No. 2019-129328 (Patent Document 2) describes an example of a system for solving this problem. The method described there performs learning on the user side using, as teacher data, only images that are similar to the inference target moving image in terms of shooting location and shooting conditions, so as to obtain a moving image of higher definition than is obtained by learning with a wide variety of images.
[0006] In Patent Document 2, teacher data with a common shooting location but different shooting times is used for learning. More specifically, videos previously captured in segment S of a bus route are collected and used for learning. The resulting learning model is then used to perform inference on real-time videos of segment S. In this case, the teacher data is limited to data captured in segment S. Therefore, a set of images with relatively high similarity to the inference target is obtained, which means that improved inference accuracy can be expected. However, within the videos captured in segment S, the shooting locations at the beginning and at the end of segment S are different. Therefore, the subjects captured are also very different, making it difficult to say that the similarity is high. This leads to a decrease in inference accuracy over segment S as a whole. In addition, the previous videos used as teacher data and the real-time videos of the inference target may show the same location while showing different subjects. Since accurate inference cannot be made for subjects that have not been learned, this also leads to a decrease in inference accuracy.
[0007] Furthermore, Patent Document 2 describes classifying previous videos into multiple groups based on shooting conditions such as weather and generating multiple learning models by learning independently from the data of each group. This allows the learning model in use to be switched according to the shooting conditions of the real-time video. According to this technique, the decrease in inference accuracy caused by differences in shooting conditions can be suppressed. However, even when conditions such as weather are the same, the frequency components differ between the teacher data and the inference target if values such as illumination are even slightly different. Therefore, it cannot be said that the decrease in inference accuracy is sufficiently suppressed. For these reasons, the technique of Patent Document 2 cannot provide sufficient inference accuracy for high-frequency components.
Summary of the Invention
[0008] According to an aspect of the present invention, an image processing apparatus is provided that can use machine learning to give images high resolution with high accuracy.
[0009] According to one aspect of the present invention, an image processing apparatus uses a first image group to increase the resolution of images of a second image group, the images of the second image group having fewer high-frequency components than the images of the first image group, the image processing apparatus comprising: a selection unit for selecting, based on a current image selected from the second image group as a high-resolution target, teacher data to be used in learning from a plurality of teacher data each using an image included in the first image group as one image of an image pair; a calculation unit for calculating, for each of a plurality of partial regions obtained by segmenting the current image, a similarity to the corresponding partial region of a previous image that was a high-resolution target preceding the current image; a determination unit for determining a plurality of local regions from the current image by combining a set of one or more partial regions with similarity equal to or greater than a threshold into a single local region and treating partial regions with similarity less than the threshold as separate local regions; a model generation unit for generating, for each of the plurality of local regions, a learning model for inferring high-frequency components using the teacher data selected by the selection unit; an inference unit for inferring high-frequency components using the learning model for each of the plurality of local regions; and an image generation unit for generating a high-resolution image based on the current image and the high-frequency components inferred by the inference unit.
[0010] According to another aspect of the present invention, an image processing method uses a first image group to increase the resolution of images of a second image group, the images of the second image group having fewer high-frequency components than the images of the first image group, the image processing method comprising: selecting, based on a current image selected from the second image group as a high-resolution target, teacher data to be used in learning from a plurality of teacher data each using an image included in the first image group as one image of an image pair; calculating, for each of a plurality of partial regions obtained by segmenting the current image, a similarity to the corresponding partial region of a previous image that was a high-resolution target preceding the current image; determining a plurality of local regions from the current image by combining a set of one or more partial regions with similarity equal to or greater than a threshold into a single local region and treating partial regions with similarity less than the threshold as separate local regions; generating, for each of the plurality of local regions, a learning model for inferring high-frequency components using the teacher data selected in the selecting; inferring high-frequency components using the learning model for each of the plurality of local regions; and generating a high-resolution image based on the current image and the high-frequency components inferred in the inferring.
[0011] According to another aspect of the invention, a storage medium is provided that stores a program for enabling a computer to function as a component of the aforementioned image processing device.
[0012] Further features of the invention will become apparent from the following description of typical embodiments with reference to the accompanying drawings.
Brief Description of the Drawings
[0013] Figure 1 is a block diagram illustrating the structure of an image processing apparatus according to a first embodiment.
[0014] Figure 2 is a diagram illustrating the functional structure of the image processing apparatus according to the first embodiment.
[0015] Figure 3 is a diagram illustrating an example of the frame structure of an input moving image according to the first embodiment.
[0016] Figure 4 is a diagram illustrating the functional structure of the image processing apparatus according to the first embodiment.
[0017] Figure 5 is a diagram illustrating an example of the data structure of the candidate database according to the first embodiment.
[0018] Figure 6 is a flowchart of the teacher data candidate acquisition process according to the first embodiment.
[0019] Figure 7 is a flowchart of the high-definition moving image generation process according to the first embodiment.
[0020] Figure 8 is a schematic diagram illustrating the learning / inference process according to the first embodiment.
[0021] Figure 9 is a diagram illustrating an example of the frame structure of an input moving image according to the second embodiment.
[0022] Figure 10 is a flowchart of the teacher data candidate acquisition process according to the second embodiment.
[0023] Figure 11 is a diagram illustrating an example of the frame structure of an input moving image according to the third embodiment.
[0024] Figure 12 is a flowchart of the teacher data candidate acquisition process according to the third embodiment.
[0025] Figure 13 is a diagram illustrating an example of the frame structure of a moving image according to the fifth embodiment.
[0026] Figure 14 is a diagram for explaining the functional structure of the image processing device according to the fifth embodiment.
[0027] Figure 15 is a flowchart of the high-definition moving image generation process according to the fifth embodiment.
[0028] Figure 16 is a flowchart of the high-definition moving image generation process according to the sixth, seventh, eighth, and ninth embodiments.
[0029] Figure 17 is a diagram showing an example of the learning / inference process according to the sixth embodiment.
[0030] Figure 18 is a flowchart of the high-definition moving image generation process according to the eighth embodiment.
[0031] Figure 19 is a diagram showing an example of the teacher data area selection according to the ninth embodiment.
[0032] Figure 20 is a flowchart of the local area extraction according to the tenth embodiment.
[0033] Figure 21 is a diagram for explaining the concept of the local area extraction according to the tenth embodiment.
Detailed Description of Embodiments
[0034] Hereinafter, embodiments will be described in detail with reference to the drawings. Note that the following embodiments are not intended to limit the scope of the claimed invention. Multiple features are described in the embodiments, but not all of these features are necessarily essential to the invention, and the features may be combined as appropriate. In addition, in the drawings, the same reference numerals are assigned to the same or similar configurations, and redundant descriptions thereof are omitted.
[0035] First Embodiment
[0036] Overview of the Image Processing Device
[0037] The image processing device according to the first embodiment accepts, as inputs, two moving images, moving image A and moving image B, which are simultaneously captured by the same imaging device. The relationship between the resolution XA and frame rate FA of moving image A and the resolution XB and frame rate FB of moving image B is such that XA > XB and FA < FB. The image processing device has the following function (high-definition moving image generation function): generating a learning model using the frames of moving image A and moving image B, and generating, via inference using the generated learning model, a moving image C having a resolution of XA and a frame rate of FB from moving image B.
[0038] Description of the structure of an image processing device
[0039] Figure 1 is a block diagram illustrating an example of the hardware structure of an image processing device 100 according to the first embodiment. The control unit 101 is a computing device such as a central processing unit (hereinafter referred to as a CPU). The control unit 101 implements various functions by loading programs stored in a read-only memory (hereinafter referred to as a ROM) 102 into the working area of a random access memory (hereinafter referred to as a RAM) 103 and executing these programs. For example, the control unit 101 can function as various functional blocks, including the analysis unit 211 and the decoded moving image generation unit 212 described below with reference to Figure 2, and the candidate acquisition unit 413 and the teacher data extraction unit 414 described below with reference to Figure 4. The ROM 102 stores the control program executed by the control unit 101. The RAM 103 serves as the working memory for the programs executed by the control unit 101 and as a temporary storage area for various types of data.
[0040] The decoding unit 104 decodes moving image or image data compressed in an encoding format defined by the Moving Picture Experts Group (hereinafter referred to as MPEG) into uncompressed data. The learning / inference unit 105 includes a functional block (the learning unit 451 described below with reference to Figure 4) that accepts teacher data as input and generates and updates the learning model. Furthermore, the learning / inference unit 105 includes a functional block (the inference unit 452 described below with reference to Figure 4) that generates a high-definition image of an input image by analyzing the input image and inferring high-frequency components using the learning model generated through learning. In this embodiment, a model for super-resolution processing based on a convolutional neural network (hereinafter abbreviated as CNN) is used as the learning model. This model enlarges the input image via linear interpolation, generates high-frequency components to be added to the enlarged image, and adds the two together to synthesize the output.
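The enlarge-then-add structure of such a model can be sketched as follows. This is a minimal illustration only: `upscale_linear` and `super_resolve` are hypothetical names, corner-aligned bilinear interpolation is an assumed choice of linear interpolation, and `predict_high_freq` is a stand-in callable for the trained CNN, whose architecture is not described in this passage.

```python
def upscale_linear(image, scale):
    """Enlarge a 2-D grayscale image (list of rows) by bilinear interpolation."""
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = src_h * scale, src_w * scale
    out = []
    for dy in range(dst_h):
        # Map the output row back to source coordinates (corner-aligned).
        sy = dy * (src_h - 1) / (dst_h - 1) if dst_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, src_h - 1)
        fy = sy - y0
        row = []
        for dx in range(dst_w):
            sx = dx * (src_w - 1) / (dst_w - 1) if dst_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, src_w - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


def super_resolve(image, scale, predict_high_freq):
    """Enlarge the image, then add the high-frequency components that
    predict_high_freq (a stand-in for the trained model) returns."""
    enlarged = upscale_linear(image, scale)
    high_freq = predict_high_freq(enlarged)
    return [[e + h for e, h in zip(row_e, row_h)]
            for row_e, row_h in zip(enlarged, high_freq)]


# Usage with a dummy predictor that infers no high-frequency detail at all.
no_detail = lambda img: [[0.0] * len(row) for row in img]
result = super_resolve([[0, 30]], 2, no_detail)
```

With the dummy predictor the output is just the interpolated image; the benefit of the approach comes entirely from how well the learned predictor fills in the missing detail.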
[0041] The storage unit 106 comprises a storage medium (such as a hard disk drive (HDD) or memory card) detachably connected to the image processing device 100 and a storage medium control device that controls the storage medium. The storage medium control device controls storage medium initialization and data transfer between the storage medium and RAM 103 for reading and writing data, according to commands from the control unit 101. The bus 107 is the information communication path connecting the various functions. The control unit 101, ROM 102, RAM 103, decoding unit 104, learning / inference unit 105, and storage unit 106 are communicatively connected to each other.
[0042] Note that the hardware blocks described in this embodiment, and the functional blocks implemented by them, do not need to have the above configuration. For example, two or more of the control unit 101, the decoding unit 104, and the learning / inference unit 105 can be implemented by a single piece of hardware. Furthermore, the function of one functional block, or the functions of several functional blocks, can be executed through cooperation between two or more pieces of hardware. Each functional block can be implemented by a CPU executing a computer program loaded in memory, or it can be implemented by dedicated hardware. Additionally, one or more functional blocks can reside on a cloud server and be configured to transmit processing result data via communication. For example, the decoding unit 104 can be implemented by the same CPU as the control unit 101, or it can be implemented by a different CPU. Alternatively, the decoding unit 104 can be implemented by a graphics processing unit (GPU) that operates by receiving instructions from the control unit 101. In another case, the decoding unit 104 can be implemented by hardware processing utilizing electronic circuitry configured for decoding processing. Likewise, the learning / inference unit 105 can be implemented by the same CPU as the control unit 101, or it can be implemented by a different CPU. Alternatively, the learning / inference unit 105 can be implemented by a GPU that operates by receiving instructions from the control unit 101. In another case, the learning / inference unit 105 can be implemented by hardware processing utilizing electronic circuitry configured for learning and inference.
[0043] Data stored in the storage medium and its decoding and loading methods
[0044] Figure 2 is a diagram illustrating the functional blocks of the control unit 101 (the analysis unit 211 and the decoded moving image generation unit 212) that perform the process of loading compressed moving image data. The storage unit 106 stores moving image a and moving image b, which are the input data used for the high-definition moving image generation processing. The term "moving image" as used herein refers to one or more items of image data that are consecutive in time. In this embodiment, moving image a and moving image b are captured simultaneously by an imaging device with an image sensor and compressed using the MPEG method. Moving image a and moving image b can be generated by additionally performing thinning (frame decimation) or downsampling processing on images captured by a single image sensor, or by capturing the same subject using image sensors with different resolutions and frame rates. Here, moving image a and moving image b are two sets of images obtained by performing different image processing on images captured by a single image sensor of a single imaging device. The moving image data of moving image a and moving image b is compressed using the MPEG method, multiplexed together with the recording time information, and stored in MP4 format. Note that any format other than the one described above can be used, as long as the image data and the corresponding shooting time information can be obtained in pairs from the storage unit 106.
[0045] The analysis unit 211 has the function of parsing the moving image data (an MP4 file in this example) stored in the storage unit 106 and determining the storage locations of the encapsulated compressed image data and of the time information registered as metadata in the file. In the MP4 format, the location information indicating where the frame data and the recording time information are stored in the file is held in the Moov portion. The analysis unit 211 loads the Moov portion of moving image a from the storage unit 106 into RAM 103, parses the Moov portion, and generates table Pa, which includes the frame number of moving image a, the location information indicating the storage location of the frame data, and the location information indicating the storage location of the recording time. The analysis unit 211 parses the Moov portion of moving image b in the same manner and generates table Pb, which includes the frame number of moving image b, the location information indicating the storage location of the frame data, and the location information indicating the storage location of the recording time. Tables Pa and Pb are stored in RAM 103.
[0046] Moving images a and b must be converted into an uncompressed format so that they can be used in high-definition moving image generation. As shown in Figure 2, the decoded moving image generation unit 212 of the control unit 101 decodes moving images a and b to generate moving images A and B, and stores them in the storage unit 106. More specifically, the decoded moving image generation unit 212 refers to tables Pa and Pb stored in RAM 103 and sequentially inputs the frame data of moving images a and b stored in the storage unit 106 to the decoding unit 104. The decoded moving image generation unit 212 multiplexes the uncompressed frame data output by the decoding unit 104 with the shooting time information obtained by referring to tables Pa and Pb, and stores the result in the storage unit 106. Here, moving image A is obtained by decoding moving image a, and moving image B is obtained by decoding moving image b. In addition, the decoded moving image generation unit 212 generates a table PA including the frame number of moving image A, location information indicating the storage location of the frame data, and location information indicating the storage location of the shooting time, and stores it in RAM 103. Similarly, the decoded moving image generation unit 212 generates a table PB including the frame number of moving image B, location information indicating the storage location of the frame data, and location information indicating the storage location of the shooting time, and stores it in RAM 103. Figure 3 shows an example of the frame structure of moving image A and moving image B. In Figure 3, n is the total number of frames in moving image A, and m is the total number of frames in moving image B. Frame pairs connected by dashed lines (A1 and B2, A2 and B5, A3 and B8, and so on) are frame pairs that include the same shooting time information, indicating that these frames were captured at the same timing. Furthermore, as mentioned above, the relationship between the resolution XA of moving image A and the resolution XB of moving image B is XA > XB, and the relationship between the frame rate FA of moving image A and the frame rate FB of moving image B is FA < FB.
[0047] Next, the processing for generating high-definition images according to this embodiment will be described. This processing is broadly divided into two parts: teacher data candidate acquisition processing and high-definition moving image generation processing.
[0048] Figure 4 is a diagram illustrating the structure and operation of the functional blocks related to image processing performed by the image processing apparatus 100 of the first embodiment. As described with reference to Figure 2, moving images A and B are stored in the storage unit 106, and tables PA and PB are stored in RAM 103. The teacher data candidate acquisition processing is performed by the candidate acquisition unit 413. The high-definition moving image generation processing is performed by the teacher data extraction unit 414, the learning unit 451, and the inference unit 452. The candidate acquisition unit 413 extracts frame pairs to serve as teacher data candidates for learning from the frame group of moving image A and the frame group of moving image B, and generates a teacher data candidate database (hereinafter referred to as candidate database D1). Frame By is a frame obtained from the frame group of moving image B as the target of resolution enhancement (the high-definition target). To generate a learning model suitable for inferring the high-frequency components of frame By, the teacher data extraction unit 414 extracts teacher data suitable for learning from the teacher data candidates registered in candidate database D1. The teacher data extraction unit 414 uses the extracted teacher data to generate a teacher data database (hereinafter referred to as teacher database D2). The learning unit 451 of the learning / inference unit 105 uses teacher database D2 to generate a learning model M for frame By. The inference unit 452 inputs frame By, the high-definition target, into the learning model M generated by the learning unit 451 and performs high-definition processing on frame By. The teacher data candidate acquisition process and the high-definition moving image generation process are explained in more detail below.
[0049] Teacher data candidate acquisition processing
[0050] In the teacher data candidate acquisition process, the candidate database D1 is generated by the control unit 101 (candidate acquisition unit 413). In the first embodiment, the candidate acquisition unit 413 acquires, as teacher data candidates, pairs of frames from moving image A and moving image B that share the same shooting time. Specifically, all pairs that share a common shooting time between moving image A and moving image B (the frame pairs connected by dashed lines in Figure 3) are acquired as teacher data candidates. Before the learning process described below is performed, the candidate acquisition unit 413 checks which frames can be used as teacher data, constructs the candidate database D1, and registers the check results in it.
[0051] Figure 5 is a diagram illustrating an example of the data structure of candidate database D1. In candidate database D1, the frames of moving image A that can be used as teacher data (frame group TA) and the frames of moving image B that can be used as teacher data (frame group TB) are registered by their frame numbers in the respective moving image files. Here, pairs of frames with matching shooting times (pairs of frame numbers) are associated with each other and registered under an index I that is unique within candidate database D1. For example, in moving images A and B shown in Figure 3, frames A1 and B2, A2 and B5, and A3 and B8 (the rest omitted) are pairs of frames captured at the same time. In the candidate database D1 shown in Figure 5, these pairs are stored by frame number, each with a unique index I. In this way, the candidate database D1 is used to manage the acquired teacher data candidates.
[0052] Next, the teacher data candidate acquisition process described above will be explained in more detail with reference to the flowchart in Figure 6. In step S601, the candidate acquisition unit 413 selects a frame from the frames of moving image A and obtains the time information corresponding to the selected frame by referring to table PA. In this embodiment, frames are selected one at a time, sequentially from the beginning of moving image A stored in the storage unit 106. Hereinafter, the selected frame is referred to as frame Ax. The candidate acquisition unit 413 refers to table PA stored in RAM 103, reads the time information corresponding to frame Ax from the storage unit 106, and transfers the time information to RAM 103.
[0053] In step S602, the candidate acquisition unit 413 compares the time information of frame Ax read in step S601 with the time information of each frame of moving image B. Specifically, the candidate acquisition unit 413 refers to the position information of the shooting time stored in table PB, sequentially obtains the shooting time information of each frame of moving image B from the storage unit 106, and compares this shooting time information with the time information of frame Ax. In step S603, the candidate acquisition unit 413 obtains the frame of moving image B whose shooting time matches the time information of frame Ax, and sets this frame as frame Bx.
[0054] In step S604, the candidate acquisition unit 413 assigns to the combination of frames Ax and Bx an index Ix that is unique within the candidate database D1, and registers the combination in the candidate database D1. Specifically, the candidate acquisition unit 413 issues a unique index Ix for the combination of frames Ax and Bx, and registers the index Ix, the frame number of frame Ax in moving image A, and the frame number of frame Bx in moving image B in the candidate database D1.
[0055] In step S605, the control unit 101 determines whether the processing described in steps S601 to S604 has been completed for all frames of moving image A. If the control unit 101 determines that the processing has been completed ("Yes" in step S605), the processing ends. If the control unit 101 determines that the processing has not been completed ("No" in step S605), the processing returns to step S601, and the above processing is performed on the next frame of moving image A. This processing generates the candidate database D1.
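Steps S601 to S605 can be sketched as a loop over the frames of moving image A. This is a minimal illustration under stated assumptions: the dictionaries `times_a` and `times_b` are hypothetical stand-ins for the shooting times reachable through tables PA and PB, timestamps are integers in milliseconds so that equality comparison is exact, and a frame of A with no matching frame in B is simply skipped (the source does not say what happens in that case).

```python
def build_candidate_database(times_a, times_b):
    """Pair frames of A and B that share a shooting time.

    times_a, times_b: {frame number: shooting time in ms} (assumed layout).
    Returns a list of records, one per pair, each with a unique index.
    """
    # Index B by shooting time once instead of rescanning B for every
    # frame of A; the result is the same as the comparison in S602.
    frame_b_at = {t: n for n, t in times_b.items()}
    candidates = []
    for frame_a in sorted(times_a):                    # S601: next frame Ax
        frame_b = frame_b_at.get(times_a[frame_a])     # S602-S603: find Bx
        if frame_b is None:
            continue                                   # assumed: skip if no match
        index = len(candidates) + 1                    # S604: unique index Ix
        candidates.append(
            {"index": index, "frame_a": frame_a, "frame_b": frame_b}
        )
    return candidates                                  # S605: all frames done


# Timing loosely modelled on Figure 3: B runs at three times the rate of A.
times_b = {1: 0, 2: 33, 3: 66, 4: 100, 5: 133, 6: 166, 7: 200, 8: 233, 9: 266}
times_a = {1: 33, 2: 133, 3: 233}
d1 = build_candidate_database(times_a, times_b)
# d1 pairs A1 with B2, A2 with B5, and A3 with B8.
```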
[0056] Note that in this embodiment, in step S602, the pair of frames to be registered in the candidate database D1 is determined by comparing the shooting time. However, such a limitation is not intended. For example, frame Ax is reduced to resolution XB, and a similarity judgment is performed using an index indicating the similarity between frame Ax and the images of each frame of the moving image B. The judgment result can then be used to select a pair of frames to be registered in the candidate database D1. In this case, the candidate acquisition unit 413 has a similarity judgment function for determining similarity by comparing two or more image data. Note that structural similarity (SSIM) can be used as an index indicating the similarity between images, for example. Furthermore, when obtaining the index indicating similarity, the image of frame Ax is reduced to resolution XB. However, such a limitation is not intended. The image of frame Ax may not be reduced, or the reduced resolution may be a resolution other than XB.
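As a rough illustration of the similarity-based alternative, the sketch below scores frames with a simplified SSIM computed over the whole image as a single window. Practical SSIM implementations average the index over local windows, so this is an approximation for illustration only. The function names are hypothetical, and the inputs are assumed to be grayscale images that have already been brought to the same size (for example, frame Ax reduced to resolution XB).

```python
def global_ssim(img_x, img_y, dynamic_range=255.0):
    """Single-window SSIM between two equally sized grayscale images."""
    xs = [p for row in img_x for p in row]
    ys = [p for row in img_y for p in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and the same size")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((p - mean_x) ** 2 for p in xs) / n
    var_y = sum((q - mean_y) ** 2 for q in ys) / n
    cov = sum((p - mean_x) * (q - mean_y) for p, q in zip(xs, ys)) / n
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def most_similar_frame(reduced_ax, frames_b):
    """Return the frame number in frames_b with the highest similarity to Ax.

    frames_b: {frame number: image} (assumed layout).
    """
    return max(frames_b, key=lambda n: global_ssim(reduced_ax, frames_b[n]))


frames_b = {
    1: [[0, 0], [0, 0]],
    2: [[10, 200], [30, 120]],
    3: [[200, 10], [120, 30]],
}
best = most_similar_frame([[12, 198], [28, 122]], frames_b)
```

The pair (Ax, B-frame `best`) would then be registered in D1 in place of the time-matched pair.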
[0057] High-resolution motion picture generation and processing
[0058] Next, the high-definition moving image generation process performed by the control unit 101 (teacher data extraction unit 414) and the learning / inference unit 105 (learning unit 451 and inference unit 452) will be described. First, an overview of the high-definition moving image generation process is given with reference to Figure 4. The teacher data extraction unit 414 selects teacher data suitable for the learning model used to infer the target frame By from the candidate database D1, and generates the teacher database D2 (described in detail below with reference to steps S702 to S703 of Figure 7). The learning unit 451 generates a learning model using the extracted teacher data (step S704). Furthermore, the inference unit 452 uses the learning model to infer the high-frequency components of the inference target frame By and performs high-definition processing (step S705), obtaining frame (image) Cy by converting the inference target frame By into high definition. Note that before starting the high-definition moving image generation process, the control unit 101 generates a moving image C on the storage unit 106. At the start of high-definition moving image generation, moving image C is in an empty state without any frame data. The inference unit 452 sequentially stores the generated frames Cy in moving image C.
[0059] Next, the processing for generating the high-definition moving image described above is explained in detail with reference to the flowchart of Figure 7. In step S701, the teacher data extraction unit 414 reads a frame from the moving image B as the high-definition target frame. In this embodiment, the teacher data extraction unit 414 reads frames sequentially, one frame at a time, starting from the beginning of the moving image B stored in the storage unit 106. Hereinafter, the frame read in step S701 is defined as frame By. More specifically, the teacher data extraction unit 414 refers to table PB, reads the frame data and shooting time information of frame By from the storage unit 106, and transfers them to RAM 103.
[0060] In step S702, the teacher data extraction unit 414 extracts, from the frame group TB of teacher data candidates registered in the candidate database D1, frames whose shooting time difference with frame By is less than a preset threshold, and registers these frames in the teacher database D2. For example, the display time period of one frame of motion image A (the display time period of one frame at frame rate XA) can be used as the threshold. The structure of the teacher database D2 is similar to the structure of the candidate database D1 (Figure 5). Specifically, first, the teacher data extraction unit 414 refers to the position information of table PB and obtains the time information of each frame of frame group TB registered in the candidate database D1. Then, the teacher data extraction unit 414 compares the obtained time information with the time information of frame By, extracts frames from frame group TB for which the difference between the two is less than the threshold, and registers these frames in the teacher database D2 on RAM 103. In the following, the frame group of motion image B registered in the teacher database D2 through this process is represented by UB. Note that in this embodiment, when constructing the teacher database D2, frames with a shooting time difference of less than a threshold with frame By are extracted from the candidate database D1. However, such a limitation is not intended. Frame group UB can be extracted using an index indicating the similarity with frame By. For example, the teacher data extraction unit 414 can use SSIM to extract frames from frame group TB whose similarity index with frame By is higher than a threshold preset in the system, and register these frames as frame group UB.
[0061] In step S703, the teacher data extraction unit 414 registers the frames of frame group TA corresponding to each pair of frames of frame group UB in the candidate database D1 into the teacher database D2. Specifically, the teacher data extraction unit 414 refers to the candidate database D1 on RAM 103 and registers the frames of frame group TA associated with each frame of frame group UB via index I into the teacher database D2. At this time, the combination of the two associated frames remains unchanged, and the index J specific to the teacher database D2 is assigned to each combination. In the following text, the frame group of motion image A registered in the teacher database D2 is represented by UA.
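Steps S702 and S703 can be illustrated with a minimal Python sketch. It assumes that each record of the candidate database D1 holds the index I, identifiers of the paired frames, and the shooting time of the TB frame in seconds. The names `Candidate` and `extract_teacher_data` and the record layout are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    index_i: int    # index I in candidate database D1
    frame_a: str    # identifier of the frame in group TA (hypothetical)
    frame_b: str    # identifier of the frame in group TB (hypothetical)
    time_b: float   # shooting time of the TB frame, in seconds


def extract_teacher_data(d1, time_by, threshold):
    """Return teacher database D2 as {index J: (frame of UA, frame of UB)}."""
    d2 = {}
    for cand in d1:
        # S702: keep TB frames shot within `threshold` of frame By.
        if abs(cand.time_b - time_by) < threshold:
            # S703: carry over the paired TA frame and assign a new index J.
            d2[len(d2)] = (cand.frame_a, cand.frame_b)
    return d2


d1 = [
    Candidate(0, "A1", "B1", 0.00),
    Candidate(1, "A2", "B5", 0.50),
    Candidate(2, "A3", "B9", 1.00),
]
d2 = extract_teacher_data(d1, time_by=0.55, threshold=0.1)
print(d2)
```

In this sketch the pairing established in D1 is preserved, and only the index changes from I to J, which matches the description of step S703.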
[0062] In step S704, the learning unit 451 uses the teacher data (frame group UA and frame group UB) registered in the teacher database D2 to learn and generate a learning model M.
[0063] Figure 8 schematically illustrates the learning model generation function of the learning unit 451. The learning model generation function includes a learning process and an inference process, and the inference process is divided into a feature extraction process using filters including a CNN and a reconfiguration process. First, in the feature extraction process, the learning unit 451 inputs a single image (defined as image E) from frame group UB into the CNN, extracts convolutional features via the CNN, and generates multiple feature maps. Next, in the reconfiguration process, the learning unit 451 upsamples the image via transposed convolutions of all feature maps and generates predicted high-frequency components. Furthermore, in the reconfiguration process, the learning unit 451 reconfigures the image by adding the predicted high-frequency components to image E' obtained by upscaling image E via a bicubic method, and generates an estimated high-resolution image G. During the learning process, the learning unit 451 compares the estimated high-resolution image G generated in the above inference process with the image H corresponding to image E from frame group UA, and uses the difference between the two to fine-tune the learning model M using backpropagation. The learning unit 451 improves inference accuracy by repeating this process on the same image E a predetermined number of times. By performing the above series of processes on each image of frame group UB, a learning model M suitable for inference processing of frame group UB is constructed.
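Because the estimate G is composed as E' plus the predicted high-frequency components, reducing the difference between G and H amounts to making the prediction approach H minus E'. The following Python sketch illustrates only this composition, not the CNN itself. It assumes pixel-repetition enlargement as a stand-in for the bicubic method, and the helper names (`upscale_nearest`, `subtract`, `add`) and pixel values are hypothetical.

```python
def upscale_nearest(img, factor):
    """Enlarge a 2-D list of pixel values by pixel repetition.

    Stand-in for the bicubic enlargement named in the text.
    """
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


image_e = [[10, 20],
           [30, 40]]                  # low-resolution image E from UB
image_h = [[11, 9, 21, 19],
           [10, 10, 20, 20],
           [31, 29, 41, 39],
           [30, 30, 40, 40]]          # paired high-resolution image H from UA

e_up = upscale_nearest(image_e, 2)    # image E'
target_hf = subtract(image_h, e_up)   # components an ideal network would predict
image_g = add(e_up, target_hf)        # estimate G under a perfect prediction
# Squared difference between G and H; this is what backpropagation reduces.
loss = sum(d * d for row in subtract(image_g, image_h) for d in row)
print(target_hf[0], loss)
```

With a perfect prediction the squared difference is zero; in practice the network output only approximates `target_hf`, and the remaining difference drives the parameter update.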
[0064] As described above, the learning unit 451 references the teacher database D2, table PA, and table PB, reads frame data of the frame pairs registered as teacher data from the storage unit 106, and executes the learning model generation function described above. The learning unit 451 stores the learning model M generated by the learning model generation function in RAM 103.
[0065] In step S705, the inference unit 452 uses the learning model M generated in step S704 to generate a high-resolution frame Cy from frame By via inference. Specifically, first, the inference unit 452 reads the learning model M stored in RAM 103. Next, the inference unit 452 inputs the frame data (image) of frame By held in RAM 103 in step S701 into the CNN of the learning model M, and generates the high-frequency components expected when the image of frame By is enlarged to resolution XA. The inference unit 452 adds the generated high-frequency components to the image obtained by linearly enlarging the image of frame By to resolution XA to generate an image of high-resolution frame Cy at resolution XA, and stores this image in RAM 103. Note that the processing performed on frame By, from the inference of the high-frequency components to the generation of the high-resolution image, is similar to the inference process described above with reference to Figure 8. The inference unit 452 adds the frame data of the high-definition frame Cy stored in RAM 103 to the end of the high-definition motion image C on the storage unit 106. In addition, the capture time information of frame By is copied and reused as the capture time of the high-definition frame Cy, and stored in the motion image C.
[0066] In step S706, the control unit 101 determines whether the above processing has been completed for frames within the inferred target range of motion image B (this could be all frames or a portion of the frames of motion image B). If the control unit 101 determines that the processing is not complete ("No" in step S706), the processing proceeds to step S701, where the teacher data extraction unit 414 selects the next frame of motion image B as frame By, and the above processing is repeated. If the control unit 101 determines that the processing is complete ("Yes" in step S706), the current processing ends. As described above, at the end of the high-definition motion image generation process, the high-definition motion image C with resolution XA and frame rate FB is stored in the storage unit 106 in an uncompressed format.
[0067] Note that in the above embodiments, each of the functional blocks is implemented solely by the control unit 101 or solely by the learning / inference unit 105. However, such limitation is not intended. For example, the functional blocks may be implemented via cooperation between the control unit 101 and the learning / inference unit 105. For example, the function of the inference unit 452 may be implemented by both the control unit 101 and the learning / inference unit 105, and the processing for storing the high-definition frame Cy and the recording time in the moving image C on the storage unit 106 may be performed by the control unit 101.
[0068] Furthermore, in this embodiment, the teacher data candidate acquisition process is performed before the learning process and high-resolution motion image generation process for all motion images, but the teacher data candidate acquisition process can be performed in parallel with the high-resolution motion image generation process. Additionally, in this embodiment, in step S704, a new learning model M is generated for each inferred target frame, and the previously generated learning model M is discarded. However, such a limitation is not intended. For example, an externally trained learning model M' can be pre-loaded, and additional learning using frame group UA and frame group UB can be performed on the loaded learning model M' in step S704.
[0069] As described above, according to the first embodiment, the learning model M is trained using images that are similar to the high-resolution target image and that are taken from an image group captured within the same shooting time period. This enables the image to be given high resolution with high accuracy.
[0070] Furthermore, image pairs from the same time period from both image groups are used as teacher data. This enables learning with even higher accuracy.
[0071] Second Embodiment
[0072] In the process for obtaining teacher data candidates in the first embodiment, a combination of a frame of motion image A and a frame of motion image B with the same shooting time is registered in the candidate database D1. In the case of obtaining motion image A and motion image B from motion images simultaneously captured by the same image sensor using a single camera device, frames with the same capture time can be obtained from motion image A and motion image B, as shown in Figure 3. However, with this method, when motion image A and motion image B are motion images captured by multiple image sensors within the same capture time period, it may not be possible to properly extract teacher data candidates. This is because, as shown in Figure 9, for a frame in motion image A, there is not always a frame in motion image B with the same capture time. Note that examples of structures for capturing motion images A and B via multiple image sensors include structures using a camera device that includes multiple image sensors, and structures using multiple camera devices, each having one or more image sensors. In the process for obtaining teacher data candidates in the second embodiment, the above problem is solved by registering combinations of frames with a time difference less than a predetermined threshold in the candidate database D1, even if the capture times of frames in motion image A and frames in motion image B do not coincide.
[0073] In the second embodiment, the structure of the image processing device 100 and the high-definition image generation process are similar to those in the first embodiment, but a portion of the processing for obtaining teacher data candidates is different. Figure 10 is a flowchart illustrating the process for obtaining teacher data candidates according to the second embodiment. The following description focuses mainly on the parts that differ from the process of the first embodiment (Figure 6).
[0074] The processing of steps S1001 to S1002 is similar to that of steps S601 to S602 of the first embodiment (Figure 6). In step S1003, the candidate acquisition unit 413 obtains, from the frames of motion image B, a frame whose shooting time difference with frame Ax of motion image A is less than a predetermined threshold as frame Bx, and registers this frame in the candidate database D1 on RAM 103. Note that, as the threshold, for example, the display time period of one frame of motion image B at frame rate XB can be used. The subsequent processing of steps S1004 to S1005 is similar to that of steps S604 to S605 of the first embodiment (Figure 6).
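The pairing of step S1003 can be sketched in Python as follows. The sketch assumes that shooting times are given in milliseconds and that, when several frames of motion image B fall within the threshold, the closest one is chosen; the text only requires the difference to be below the threshold, so the closest-frame rule and the name `pair_by_time` are assumptions of this illustration.

```python
def pair_by_time(times_a, times_b, threshold):
    """Pair each frame Ax with the closest-in-time frame Bx.

    Returns a list of (index into A, index into B) for pairs whose
    shooting time difference is below `threshold`.
    """
    pairs = []
    for ia, ta in enumerate(times_a):
        # Closest frame of B in time (assumption of this sketch).
        ib = min(range(len(times_b)), key=lambda k: abs(times_b[k] - ta))
        if abs(times_b[ib] - ta) < threshold:
            pairs.append((ia, ib))
    return pairs


times_a = [0, 1000, 2000]                 # shooting times of A, in ms
times_b = [10, 510, 1010, 1510, 2600]     # shooting times of B, in ms
pairs = pair_by_time(times_a, times_b, threshold=33)
print(pairs)
```

Here the third frame of A has no frame of B within the threshold, so no pair is registered for it.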
[0075] In this way, according to the second embodiment, even when motion image A and motion image B are obtained by multiple image sensors, teacher data candidates can be appropriately extracted.
[0076] Third Embodiment
[0077] In the first and second embodiments, motion image A and motion image B are captured within at least the same shooting time period. Therefore, with the teacher data candidate acquisition process of the first and second embodiments, teacher data candidates cannot be obtained when, as shown in Figure 11, motion image A and motion image B are captured by the same or multiple camera devices at different times (non-overlapping shooting time periods). The third embodiment describes a method for appropriately obtaining teacher data candidates for motion images A and B such as those shown in Figure 11.
[0078] In the process for obtaining teacher data candidates according to the third embodiment, an index indicating the similarity between frames of motion image A and frames of motion image B is calculated, and frame pairs with an index equal to or greater than a threshold preset in the system are registered in the candidate database D1. Note that, as the index indicating frame similarity, SSIM can be used, for example, as described above. Furthermore, when determining similarity, the image of the frame of motion image A can be downsized to resolution XB, and the index indicating similarity can be calculated using this image and the image of each frame of motion image B. However, the image of the frame of motion image A need not be downsized, or it may be downsized to a resolution other than XB.
[0079] Figure 12 is a flowchart illustrating the process for obtaining teacher data candidates according to the third embodiment. The following description, given with reference to Figure 12, focuses mainly on the parts that differ from the process for obtaining teacher data candidates in the first embodiment (Figure 6).
[0080] In step S1201, the candidate acquisition unit 413 selects a frame from the frames of the motion image A and loads the frame data of the selected frame. The candidate acquisition unit 413 sequentially selects a frame starting from the top of the motion image A stored in the storage unit 106 (hereinafter, the selected frame is referred to as frame Ax). The candidate acquisition unit 413 refers to the table PA stored in the RAM 103 and transfers the frame data of the selected frame Ax from the storage unit 106 to the RAM 103.
[0081] In step S1202, the candidate acquisition unit 413 calculates the similarity between frame Ax read in step S1201 and each frame of the motion image B. More specifically, the candidate acquisition unit 413 refers to the position information (related to frame data) of table PB and sequentially reads the frame data of each frame of the motion image B from the storage unit 106 into RAM 103. Then, the candidate acquisition unit 413 uses the similarity index calculation function (SSIM in this embodiment) to calculate the similarity index between frame Ax and each frame, and stores it in RAM 103. In step S1203, the candidate acquisition unit 413 obtains, as frame Bx, the frame of motion image B with the highest similarity index calculated in step S1202. The subsequent processing of steps S1204 to S1205 is similar to that of steps S604 to S605 of the first embodiment (Figure 6).
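Steps S1202 and S1203 can be illustrated with the following Python sketch. It assumes that frame Ax has already been brought to the same size as the frames of motion image B and that each frame is a flat list of 8-bit pixel values. The SSIM here is a simplified single-window variant computed over the whole image, whereas SSIM is commonly averaged over local windows; the names `global_ssim` and `most_similar_frame` are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of pixel values."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((p - mx) ** 2 for p in x) / n
    vy = sum((q - my) ** 2 for q in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    # Standard stabilising constants of the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def most_similar_frame(frame_ax, frames_b):
    """Return (index of frame Bx, its similarity index)."""
    scores = [global_ssim(frame_ax, fb) for fb in frames_b]   # S1202
    best = max(range(len(scores)), key=scores.__getitem__)    # S1203
    return best, scores[best]


frame_ax = [10, 50, 90, 130]
frames_b = [
    [200, 10, 120, 40],
    [12, 48, 91, 129],
    [130, 90, 50, 10],
]
best_index, best_score = most_similar_frame(frame_ax, frames_b)
print(best_index, round(best_score, 4))
```

A threshold test on `best_score`, as described in paragraph [0078], could be added before registering the pair in the candidate database D1.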
[0082] As described above, according to the third embodiment, teacher data candidates can be appropriately obtained even when the shooting time periods of the two image groups (moving image A and moving image B) do not overlap.
[0083] Fourth Embodiment
[0084] In the fourth embodiment, a performance improvement of the learning model M that takes image similarity into account will be explained for the learning processing of the first to third embodiments. As described in the first embodiment, appropriate teacher data is extracted for the frame selected in step S701 of Figure 7, and in step S704, the teacher data is used to generate or update the learning model M. When generating or updating the learning model M, backpropagation is used to adjust network parameters, as shown in Figure 8. In the fourth embodiment, the intensity of the adjustment via backpropagation is controlled based on attributes (e.g., capture time) of the frame (image E) used in learning and of frame By, which is the high-resolution target, or based on the images of these frames. More specifically, the learning unit 451 sets coefficients such that during the learning process, the influence on network parameter updates is strong when the similarity between frame By and each frame of the sequentially input frame group UB is high, and weak when the similarity is low. Here, the similarity between frames can be determined simply based on the time difference between frame By and the input image E, or it can be determined by comparing the images of the two frames using SSIM or the like. In an example configuration using the former (the method using the time difference), as described below, the intensity of the adjustment is multiplied by a coefficient of 1 when the time difference is less than a threshold, and multiplied by a coefficient of 0.5 when the time difference is equal to or greater than the threshold.
[0085] if (ABS(time difference between By and E) < threshold) { coefficient = 1} else { coefficient = 0.5}
[0086] In the example configuration where the latter (using the similarity method) is used, as described below, SSIM is used as the coefficient for adjusting the intensity.
[0087] Coefficient = SSIM(By and E) [0 ≤ SSIM(x) ≤ 1]
[0088] Note that examples of how to apply strong or weak effects include multiplying the update rate of network parameters using backpropagation during the learning process by the coefficients mentioned above, and multiplying the number of learning loops performed on the input image E by a coefficient instead of multiplying the parameter update rate by a coefficient.
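The two ways of applying the coefficient can be sketched in Python as follows. The function names, the base update rate, and the rounding of the loop count (with a minimum of one loop) are assumptions of this illustration; the text specifies only the coefficient values and that either the update rate or the number of learning loops is multiplied by the coefficient.

```python
def time_coefficient(time_by, time_e, threshold):
    """Coefficient of paragraph [0085]: 1 if close in time, else 0.5."""
    return 1.0 if abs(time_by - time_e) < threshold else 0.5


def scaled_update_rate(base_rate, coefficient):
    """Option 1: weaken or keep the parameter update rate."""
    return base_rate * coefficient


def scaled_loop_count(base_loops, coefficient):
    """Option 2: change how many times image E is learned instead.

    Rounding and the minimum of one loop are assumptions of this sketch.
    """
    return max(1, round(base_loops * coefficient))


near = time_coefficient(time_by=10.0, time_e=10.2, threshold=0.5)
far = time_coefficient(time_by=10.0, time_e=12.0, threshold=0.5)
print(near, far)
print(scaled_update_rate(0.001, far), scaled_loop_count(10, far))
```

When the similarity method is used, the SSIM value between frame By and image E would be passed as `coefficient` in place of the value returned by `time_coefficient`.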
[0089] Fifth Embodiment
[0090] The first to third embodiments described above have the following configuration: pairs including frames from motion image A and frames from motion image B are extracted as teacher data candidates and registered in the candidate database D1. In the fifth embodiment, motion image A is converted to the resolution XB of motion image B to generate motion image A', and the candidate acquisition unit 413 uses motion image A and motion image A' to obtain teacher data candidates. In other words, the candidate acquisition unit 413 of the fifth embodiment extracts frame Ax' with the same frame number as frame Ax of motion image A from motion image A', and registers pairs including frames Ax and Ax' as teacher data candidates in the candidate database D1. The fifth embodiment will be described in detail below.
[0091] Description of the structure of the image processing device 100
[0092] The hardware and functional structures of the image processing device 100 are similar to those of the first embodiment (Figure 1). However, the control unit 101 of the fifth embodiment also has a resolution conversion function for reducing and converting the resolution of an image via a bicubic method. This resolution conversion function calculates the pixel value of the pixel to be interpolated by referring to surrounding pixels when performing resolution reduction processing on the image data stored in RAM 103.
[0093] The data stored in storage unit 106 and its decoding and loading methods
[0094] In the first embodiment, motion images a and b in storage unit 106 are converted into an uncompressed format, and motion image A obtained by decoding motion image a and motion image B obtained by decoding motion image b are stored in storage unit 106. Furthermore, in the fifth embodiment, motion image A' is generated by converting motion image A to the resolution XB of motion image B. More specifically, the control unit 101 refers to table PA stored in RAM 103 and sequentially inputs the frame data of the frames of motion image A (hereinafter referred to as frame K) stored in storage unit 106 into the resolution conversion function of the control unit 101. The resolution conversion function then outputs frame data at resolution XB (hereinafter referred to as frame K'). The control unit 101 refers to table PA, multiplexes frame K' with the shooting time information of frame K read from storage unit 106, and stores the result as a frame of motion image A' in storage unit 106. In addition, table PA', which holds the frame number of each frame of motion image A', location information indicating the storage location of frame data, and location information indicating the storage location of shooting time data, is stored in RAM 103.
[0095] Figure 13 shows examples of moving images A, B, and A'. Images (A1' to An') generated by downscaling each frame of moving image A (A1 to An) to resolution XB are stored as moving image A' in storage unit 106. Note that in the above example, the resolution of moving image A is reduced to XB, but such a limitation is not intended. It is sufficient for moving image A' to consist of images converted to a resolution lower than that of moving image A. However, by using images converted to the same resolution as the high-resolution target image, a learning model more suitable for the high-resolution target image can be constructed.
[0096] Teacher data candidate acquisition and processing
[0097] Figure 14 illustrates the structure and operation of functional blocks related to image processing performed by the image processing apparatus 100 of the fifth embodiment. The candidate acquisition unit 413 acquires combinations of frames with the same frame number for each frame of moving image A and moving image A', and registers them in the candidate database D1. More specifically, for each frame of moving image A listed in table PA, the candidate acquisition unit 413 searches for the frame with the same frame number in moving image A' by referring to table PA'. The candidate acquisition unit 413 assigns a unique index I to the combination of frames of moving image A and moving image A' with the same frame number and registers it in the candidate database D1. The frame group of moving image A registered in the candidate database D1 is represented by TA, and the frame group of moving image A' is represented by TA'.
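The generation of moving image A' and the registration of same-numbered frame pairs can be sketched in Python as follows. The sketch assumes frames with even width and height, a 2:1 reduction, and 2x2 block averaging as a stand-in for the bicubic method of the embodiment; the names `downscale_2x` and `build_candidates` and the dictionary layout are hypothetical.

```python
def downscale_2x(img):
    """Halve width and height by averaging 2x2 blocks.

    Stand-in for the bicubic reduction named in the text.
    """
    return [
        [(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4
         for c in range(0, len(img[0]), 2)]
        for r in range(0, len(img), 2)
    ]


def build_candidates(movie_a):
    """movie_a maps frame number -> frame; returns D1 as {index I: (Ax, Ax')}."""
    # Moving image A': one reduced frame per frame of A, same frame number.
    movie_a_prime = {no: downscale_2x(frame) for no, frame in movie_a.items()}
    d1 = {}
    for frame_no in sorted(movie_a):
        d1[len(d1)] = (movie_a[frame_no], movie_a_prime[frame_no])
    return d1


movie_a = {
    1: [[0, 4], [8, 12]],
    2: [[10, 10], [20, 20]],
}
d1 = build_candidates(movie_a)
print(d1[0][1], d1[1][1])
```

Because A' is derived from A, every frame of A has a partner, so no time or similarity matching is needed when building the candidate database in this embodiment.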
[0098] High-resolution motion picture generation and processing
[0099] The following description, given with reference to the flowchart of Figure 15, focuses mainly on the parts that differ from the processing of the first embodiment (Figure 7).
[0100] The processing of step S1501 is similar to that of step S701 of the first embodiment (Figure 7). In step S1502, the teacher data extraction unit 414 extracts, from the frame group TA' of the teacher data candidates registered in the candidate database D1, frames whose shooting time difference with frame By is less than a threshold preset in the system. As the threshold, for example, the display time period of one frame of motion image A (the display time period of one frame at frame rate XA) can be used. The teacher data extraction unit 414 registers the extracted frames in the teacher database D2.
[0101] Specifically, first, the teacher data extraction unit 414 refers to table PA' and obtains the time information of the frames registered in frame group TA'. Then, based on the obtained time information, the teacher data extraction unit 414 registers the frames of frame group TA' whose time difference with frame By is less than the threshold in the teacher database D2 on RAM 103. Hereinafter, the frame group of motion image A' registered in the teacher database D2 is referred to as frame group UA'. Note that in this embodiment, frames whose capture time difference with frame By is less than a predetermined threshold are extracted from the candidate database D1. However, such a limitation is not intended. For example, frames for which an index (e.g., SSIM) indicating the similarity between the image of frame By and the image of each frame in frame group TA' is higher than a threshold preset in the system can be extracted from frame group TA' and registered in the teacher database D2.
[0102] In step S1503, the teacher data extraction unit 414 registers the frames of frame group TA associated with each frame of frame group UA' via index I in the teacher database D2. Specifically, the teacher data extraction unit 414 refers to the candidate database D1 on RAM 103 and registers the frames of frame group TA associated with each frame of frame group UA' via index I in the teacher database D2. At this time, the associated combinations (frame pairs) remain unchanged, and the index J specific to the teacher database D2 is assigned to each combination. Hereinafter, the frame group of motion image A registered in the teacher database D2 is referred to as frame group UA.
[0103] In step S1504, the learning unit 451 references the teacher database D2 and learns using frame group UA and frame group UA', generating a learning model M. Specifically, first, the learning unit 451 references the teacher database D2 and tables PA and PA', reads frame data from the storage unit 106, and inputs the frame data into the learning model generation function. The learning unit 451 learns from the read frame data via the learning model generation function, and stores the learning model M generated as the learning result in RAM 103. Details of the learning model are as described above with reference to Figure 8. The subsequent processing of steps S1505 to S1506 is similar to that of steps S705 to S706 of the first embodiment (Figure 7).
[0104] As described above, according to the embodiments described, teacher data used in the learning model is selected based on high-resolution target images. Therefore, the learning model trained using the selected teacher data can infer the high-frequency components of the high-resolution target image with higher accuracy, thereby enabling the acquisition of highly accurate high-resolution images. In other words, the accuracy of motion picture super-resolution imaging for achieving high-resolution motion images can be improved.
[0105] Note that in the above embodiments, when obtaining teacher data candidates, the image that forms a pair with the image selected from motion image A is an image selected from motion image B based on the shooting time or similarity to the image, or an image obtained by reducing the resolution of the selected image. However, this embodiment is not limited to this. It is sufficient for the image associated with the image selected from motion image A to be used as a teacher data candidate to have a resolution lower than that of the selected image. For example, whether an image is associated with the image selected from motion image A can be determined based on common characteristics such as the temperature at the time of shooting, the shooting location, or the shooting direction.
[0106] Furthermore, in the above embodiments, the process has two stages: generating a candidate database D1 and then generating a teacher database D2. However, such a limitation is not intended. For example, the teacher data extraction unit 414 can extract frames that may be pairs of teacher data from the motion image A based on frame By, and can use the extracted frames and frames associated with the extracted frames as pairs to obtain teacher data. However, when multiple images of the motion image B are sequentially made to have high resolution, as in the above embodiments, it is more efficient to generate a candidate database D1 and then extract and use appropriate teacher data from the candidate database D1 based on the high-resolution target image.
[0107] Furthermore, in the above embodiments, the processing targets are motion image a and motion image b with a lower resolution than motion image a. However, such limitation is not intended. For example, uncompressed motion image a and motion image b obtained by compression followed by restoration can be processing targets. In this case, motion image a can be intermittently culled and stored in terms of frames. In this way, the relationship between motion image a and motion image b, which are the processing targets in the above embodiments, is not limited to a resolution size relationship, and it is sufficient that motion image a has better sharpness than motion image b. In other words, it is sufficient that the image group forming motion image a (motion image A) includes higher frequency components than the image group forming motion image b (motion image B). For example, the processing of the above embodiments can be applied as long as each image in the image group of motion image a corresponds to one or more images in the image group of motion image b, and the images in the image group of motion image a have higher frequency components than the images corresponding to the image group of motion image b.
[0108] Furthermore, the above has simply described motion picture data. However, for example, in the case of a device that can generate still images at predetermined intervals during the recording of motion pictures, the above embodiments can be applied in the following situations. In other words, still images can be used as data corresponding to motion picture a, and motion pictures can be used as data corresponding to motion picture b. For example, it is assumed that one of the above embodiments is applied to a camera device for capturing 6K raw data size images at 60fps using an image sensor. Furthermore, it is assumed that the still image is, for example, data stored in a format such as JPEG or HEIF after image processing and still image compression without being changed to 6K size. Furthermore, it is assumed that the motion picture is data stored in a format such as MP4 after image processing and motion image compression of the raw data obtained by converting 6K data obtained by the image sensor to 2K data size (motion picture data of 2K size at 60fps). Under these assumptions, by having the user press a release switch and continuously capture still images during the recording of 2K motion picture data at 60fps using the camera device, for example, 6K still images at intervals of 10fps can be generated for the frame rate of the motion picture (60fps). By applying one of the above embodiments to still images and moving images generated in this manner, for example, data with still image quality corresponding to moving images captured over a time period of multiple still images can be generated. In other words, a system can be implemented that obtains a 6K-sized moving image that appears to be captured at a frame rate of 60fps, but has the size of a still image. 
Furthermore, in this case, still images and moving images are prepared using a camera device, and learning and inference processing is performed within the camera device to generate data of still image quality corresponding to the moving images.
[0109] Sixth Embodiment
[0110] In the sixth embodiment, improvements in learning and inference performance will be described that take into account image similarity related to the learning and inference processes of the first embodiment.
[0111] In the first embodiment, for the purpose of in Figure 7In step S701, the selected frame By extracts appropriate teacher data, and in step S704, this teacher data is used to generate or update the learning model M. Furthermore, in step S705, the learning model M is used to infer high-frequency components, and a high-resolution frame Cy is generated. However, using this method, when frame By includes various textures (such as textures of people, buildings, vegetation, or oceans), the amount of information learned in a single step is large, meaning that the desired learning performance may not be achieved. This is because a single frame includes high-frequency components of various patterns. Therefore, the learning process of the sixth embodiment solves this problem by extracting a region from a frame, generating a learning model for each local region, performing inference using the learning model for each local region, and generating and combining high-resolution images for each local region.
[0112] In the sixth embodiment, the hardware structure and functional structure of the image processing device 100 are the same as those in the first embodiment (Figure 1). The teacher data can be extracted in the same way as in any of the first to fifth embodiments. The learning processing performed after that extraction is different, and it will be described in detail using the flowchart of Figure 16 and the example of learning and inference processing in Figure 17.
[0113] The processing of steps S1601 to S1603 is the same as that of steps S701 to S703 in the first embodiment (Figure 7).
[0114] In step S1604, the inference unit 452 extracts a local region (local region determination) from the inference target frame By and stores the local region in RAM 103. Hereinafter, the extracted local region (local image) is referred to as local region Byn 1701.
[0115] Next, in step S1605, the learning unit 451 selects local regions UAn 1702 and UBn 1703 (local region selection) from the teacher data (frame groups UA and UB) registered in the teacher database D2. These local regions correspond to the same coordinate position as the local region Byn 1701 of the inferred target frame By. The learning unit 451 stores the selected local regions UAn 1702 and UBn 1703 in RAM 103. In this embodiment, the teacher data is one pair of local regions, but the teacher data can be multiple pairs of local regions. Note that each local region is a rectangular region of uniform size, tens of pixels × tens of pixels. However, the local regions are not limited to this.
[0116] Note that, for frame group UB, a "local region corresponding to the same coordinate position" as the local region Byn 1701 under inference is the region indicated by coordinates that are exactly the same as those of the local region of the inferred target frame By. In other words, if the local region coordinates of the inferred target frame By are (sx, sy), then the local region coordinates of local region UBn 1703 are also (sx, sy). For frame group UA, the ratio between the resolution XA of moving image A and the resolution XB of moving image B is taken into consideration. For example, if XA:XB is 2:1 in both width and height and the local region coordinates of the inferred target frame By are (sx, sy), then the local region coordinates of local region UAn 1702 are (sx*2, sy*2). Hereinafter, such a region is referred to as a "local region corresponding to the same coordinate position".
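The coordinate correspondence described above can be sketched as follows. The function name `map_region_to_ua` is hypothetical, and scaling the region size together with its coordinates is an assumption; the text only states how the coordinates are mapped.

```python
def map_region_to_ua(sx, sy, width, height, scale_x=2, scale_y=2):
    """Map a local region of frame By (resolution XB) onto frame group UA (resolution XA).

    scale_x and scale_y are the XA:XB ratios in width and height (2:1 in the example).
    Scaling width and height as well is an assumption of this sketch.
    """
    return (sx * scale_x, sy * scale_y, width * scale_x, height * scale_y)


# In frame group UB the region keeps exactly the coordinates it has in By.
ub_region = (32, 48, 16, 16)
ua_region = map_region_to_ua(*ub_region)
print(ua_region)  # (64, 96, 32, 32)
```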
[0117] In step S1606, the learning unit 451 uses local region UAn 1702 and local region UBn 1703 with the learning model generation function shown in Figure 8 to generate a learning model Mn 1704 (local region learning model). The learning unit 451 reads the frame data of the frame pairs registered as teacher data from the storage unit 106, inputs the frame data for each local region into the learning model generation function, and stores the generated learning model Mn 1704 in RAM 103.
[0118] In step S1607, the inference unit 452 uses the learning model Mn 1704 generated in step S1606 to perform inference on the local region Byn 1701 and generate the local region Cyn 1705 (local high-frequency component) of the high-resolution frame. First, the inference unit 452 reads the learning model Mn 1704 stored in RAM 103 in step S1606. Next, the inference unit 452 inputs the local region Byn 1701 held in RAM 103 in step S1604 into the CNN of the learning model Mn 1704 and generates the high-frequency component expected when the local region Byn 1701 is enlarged to the size of the local region UAn 1702. The inference unit 452 generates the local region Cyn 1705 by adding the generated high-frequency component to the image obtained by linearly enlarging the image of the local region Byn 1701 to the size of the local region UAn 1702, and stores it in RAM 103. Note that the processing performed on the local region Byn 1701, from inferring the high-frequency component to generating the high-resolution image, is similar to the inference process shown in Figure 8.
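The last stage of step S1607, linear enlargement followed by addition of the inferred high-frequency component, can be sketched in plain Python as follows. This is an illustration under stated assumptions: bilinear interpolation is used as one possible linear enlargement, the function names are hypothetical, and the high-frequency component is a hand-written placeholder rather than the output of a CNN.

```python
def enlarge_bilinear(img, scale):
    """Enlarge a 2-D list of pixel values by an integer factor (bilinear interpolation)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * scale):
        fy = min(y / scale, h - 1)          # source row, clamped to the border
        y0 = int(fy)
        y1 = min(y0 + 1, h - 1)
        wy = fy - y0
        row = []
        for x in range(w * scale):
            fx = min(x / scale, w - 1)      # source column, clamped to the border
            x0 = int(fx)
            x1 = min(x0 + 1, w - 1)
            wx = fx - x0
            top = img[y0][x0] * (1 - wx) + img[y0][x1] * wx
            bottom = img[y1][x0] * (1 - wx) + img[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


def add_high_frequency(enlarged, high_freq):
    """Superimpose a high-frequency component onto the enlarged region, pixel by pixel."""
    return [[e + h for e, h in zip(e_row, h_row)]
            for e_row, h_row in zip(enlarged, high_freq)]


byn = [[0, 2], [4, 6]]                      # toy local region Byn
enlarged = enlarge_bilinear(byn, 2)         # 4 x 4 after 2x enlargement
placeholder_hf = [[0.5] * 4 for _ in range(4)]  # stands in for the CNN output
cyn = add_high_frequency(enlarged, placeholder_hf)
print(enlarged[1])  # [2.0, 3.0, 4.0, 4.0]
```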
[0119] Next, in step S1608, the inference unit 452 combines the local regions Cyn 1705 stored in RAM 103 based on the frame coordinate position information to generate a high-definition frame Cy 1706, and stores it in RAM 103. Note that in Figure 17, 1705, indicated by a dashed line, represents the local region Cyn, while 1706, indicated by a solid line, represents the high-definition frame Cy.
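The extraction of local regions in step S1604 and their recombination by coordinate position in step S1608 can be sketched as follows. The helper names are hypothetical, and the per-region learning and inference are omitted, so the regions are recombined unchanged and at the original resolution.

```python
def split_into_regions(frame, size):
    """Yield (sx, sy, region) for non-overlapping size x size tiles of a 2-D list."""
    height, width = len(frame), len(frame[0])
    for sy in range(0, height, size):
        for sx in range(0, width, size):
            region = [row[sx:sx + size] for row in frame[sy:sy + size]]
            yield sx, sy, region


def combine_regions(regions, width, height):
    """Paste each region back at its (sx, sy) position to rebuild the frame."""
    frame = [[None] * width for _ in range(height)]
    for sx, sy, region in regions:
        for dy, row in enumerate(region):
            for dx, value in enumerate(row):
                frame[sy + dy][sx + dx] = value
    return frame


frame_by = [[r * 4 + c for c in range(4)] for r in range(4)]
tiles = list(split_into_regions(frame_by, 2))
rebuilt = combine_regions(tiles, 4, 4)
print(len(tiles), rebuilt == frame_by)  # 4 True
```

In the embodiment each tile would be enlarged and refined before being pasted, so the paste coordinates would be scaled by the XA:XB ratio.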
[0120] In step S1609, the control unit 101 determines whether the above processing has been completed for all local regions of frame By. If the control unit 101 determines that the processing is not complete ("No" in step S1609), the processing proceeds to step S1605, and the above processing is repeated for the next local region of frame By. If the control unit 101 determines that the processing is complete ("Yes" in step S1609), the processing proceeds to step S1610.
[0121] In step S1610, the inference unit 452 adds the frame data of the high-definition frame Cy 1706 stored in RAM 103 to the end of the high-definition moving image C on the storage unit 106. Furthermore, the capture time information of frame By is copied and reused as the capture time of the high-definition frame Cy 1706, and stored in the moving image C.
[0122] In step S1611, the control unit 101 determines whether the above processing has been completed for all frames of the motion image B. If the control unit 101 determines that the processing is not complete ("No" in step S1611), the processing proceeds to step S1601, and the above processing is repeated with the next frame of the motion image B as frame By. If the control unit 101 determines that the processing is complete ("Yes" in step S1611), the current processing ends. As described above, at the end of the high-definition motion image generation processing, the high-definition motion image C with resolution XA and frame rate FB is stored in the storage unit 106 in an uncompressed format.
[0123] As described above, according to the sixth embodiment, for high-resolution target images with various textures and a large amount of information, by learning for each local region, the amount of information used in a single learning pass can be reduced, thereby enabling learning with higher accuracy. Therefore, higher-resolution images can be generated.
[0124] Seventh Embodiment
[0125] The seventh embodiment described below is an example of improving super-resolution by changing the learning processing for each local region according to the sixth embodiment.
[0126] Using the method of the sixth embodiment, a learning model is generated from a frame different from the inferred target by learning regions at the same locations as the inferred target region. However, using this method, for example, when the subject moves significantly, the inferred region and the content shown in the teacher data may differ. This can make it difficult to achieve the desired super-resolution performance.
[0127] To address this issue, a similarity evaluation function is provided in the learning process of the seventh embodiment. This similarity evaluation function searches for and infers regions with high similarity among candidate teacher data, and then uses the obtained regions with high similarity in the learning process.
[0128] High-resolution motion picture generation and processing
[0129] The seventh embodiment differs from the sixth embodiment only in the processing of step S1605 in the flowchart of the high-definition moving image generation process shown in Figure 16. Therefore, only the processing of step S1605 according to the seventh embodiment will be described.
[0130] In step S1605, the inference unit 452 extracts a region of the inferred target frame By and stores it as a local region in RAM 103. Note that this local region is a rectangular region of uniform size, tens of pixels × tens of pixels. However, the local region is not limited to this. The control unit 101 uses SSIM, which is provided to implement the similarity evaluation function, to search the frame group UB of the teacher data registered in the teacher database D2 for the region UBn with the highest similarity to the local region of the inferred target frame By, and stores it in RAM 103. The learning unit 451 selects, from the frame group UA, the frame that forms a pair with the frame to which the local region UBn stored in RAM 103 belongs, and stores the local region UAn at the position corresponding to the local region UBn in RAM 103. Note that peak signal-to-noise ratio (PSNR), signal-to-noise ratio (SNR), or mean square error (MSE) can also be used for similarity evaluation. Furthermore, in the above description, the region UBn with the highest similarity is searched for across all frames included in the frame group UB. However, the search is not limited to this. For example, the region UBn with the highest similarity can be searched for in each of the frames included in frame group UB. In this case, the number of obtained pairs of local region UBn and local region UAn is equal to the number of frames included in frame group UB.
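The search for the most similar region can be sketched as follows. This sketch uses MSE, which the text names as an alternative measure, instead of SSIM in order to stay self-contained. The exhaustive one-pixel-step search and all function names are assumptions made for illustration.

```python
def mse(a, b):
    """Mean square error between two equally sized 2-D lists (lower is more similar)."""
    n = len(a) * len(a[0])
    return sum((pa - pb) ** 2
               for ra, rb in zip(a, b)
               for pa, pb in zip(ra, rb)) / n


def crop(frame, sx, sy, size):
    """Cut a size x size region whose top-left corner is (sx, sy)."""
    return [row[sx:sx + size] for row in frame[sy:sy + size]]


def find_most_similar_region(target, ub_frames, size):
    """Return (frame_index, sx, sy) of the region in ub_frames with the lowest MSE."""
    best = None
    for idx, frame in enumerate(ub_frames):
        for sy in range(len(frame) - size + 1):
            for sx in range(len(frame[0]) - size + 1):
                score = mse(target, crop(frame, sx, sy, size))
                if best is None or score < best[0]:
                    best = (score, idx, sx, sy)
    return best[1:]


ub0 = [[0] * 4 for _ in range(4)]
ub1 = [[0] * 4 for _ in range(4)]
ub1[2][1:3] = [9, 8]        # the subject has moved to (sx=1, sy=2) in this frame
ub1[3][1:3] = [7, 6]
target = [[9, 8], [7, 6]]   # local region of the inferred target frame By
print(find_most_similar_region(target, [ub0, ub1], 2))  # (1, 1, 2)
```

The paired region UAn would then be taken from the frame of group UA that is paired with frame index 1, at the coordinates scaled by the XA:XB ratio.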
[0131] As described above, according to the seventh embodiment, learning is performed using regions with high similarity to the inferred region. Therefore, even for moving images where the subject has moved significantly, higher-resolution images can be generated.
[0132] Eighth embodiment
[0133] The eighth embodiment describes a solution to the problem of the sixth embodiment explained in the seventh embodiment; this solution differs from that of the seventh embodiment.
[0134] In the eighth embodiment, regions with high similarity are identified using motion vectors associated with the inferred region. Note that the eighth embodiment assumes that the motion image b has been compressed into MPEG-4 AVC format using inter-frame prediction. MPEG-4 AVC is the abbreviation for ISO/IEC 14496-10 "MPEG-4 Part 10: Advanced Video Coding".
[0135] Next, the differences between the eighth embodiment and the sixth embodiment will be explained.
[0136] Data stored in the storage medium and its decoding and loading methods
[0137] In the processing of the analysis unit 211 according to the eighth embodiment, in addition to the processing for parsing the motion image data stored in the storage unit 106 (as described in the first embodiment), the following processing is also performed. The analysis unit 211 parses the MP4 file storing the motion image b and obtains the avcC box. Then, the analysis unit 211 obtains the sequence parameter set (hereinafter referred to as SPS) and picture parameter set (hereinafter referred to as PPS) included in the avcC box and stores both in RAM 103.
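Obtaining the SPS and PPS from the avcC box can be sketched as follows. This is a minimal illustration, not the analysis unit 211 itself: it assumes `payload` is the content of the avcC box with the box header (size and type) already removed, it follows the AVCDecoderConfigurationRecord layout of ISO/IEC 14496-15, and the function name `parse_avcc` is hypothetical.

```python
import struct


def parse_avcc(payload: bytes):
    """Extract the SPS and PPS NAL units from the payload of an avcC box."""
    # Bytes 0-3: configurationVersion, profile, profile compatibility, level.
    # Byte 4: lengthSizeMinusOne in the low 2 bits (not needed here).
    # Byte 5: number of SPS entries in the low 5 bits.
    pos = 5
    num_sps = payload[pos] & 0x1F
    pos += 1
    sps_list = []
    for _ in range(num_sps):
        (length,) = struct.unpack_from(">H", payload, pos)  # 16-bit big-endian length
        pos += 2
        sps_list.append(payload[pos:pos + length])
        pos += length
    num_pps = payload[pos]
    pos += 1
    pps_list = []
    for _ in range(num_pps):
        (length,) = struct.unpack_from(">H", payload, pos)
        pos += 2
        pps_list.append(payload[pos:pos + length])
        pos += length
    return sps_list, pps_list


# Synthetic record with one 3-byte SPS and one 2-byte PPS (contents are dummy bytes).
record = (bytes([1, 0x64, 0x00, 0x1F, 0xFF, 0xE1])
          + b"\x00\x03" + b"\x67\xAA\xBB"
          + bytes([1]) + b"\x00\x02" + b"\x68\xCC")
sps, pps = parse_avcc(record)
print(len(sps), len(pps))  # 1 1
```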
[0138] High-resolution motion picture generation and processing
[0139] The high-definition moving image generation process in the eighth embodiment differs from that of the sixth embodiment in the processing of steps S1605 to S1607 in the flowchart of Figure 16. Therefore, the processing of steps S1605 to S1607 according to the eighth embodiment will be described using the flowchart of Figure 18.
[0140] Note that in the eighth embodiment, in step S1604 described above for the sixth embodiment, the inference unit 452 extracts the local region Byn of the inferred target frame By as a rectangular region of uniform size, 16×16 pixels.
[0141] In step S1801, if the inferred target frame By is an I-picture, the control unit 101 causes the process to proceed to step S1803. If the inferred target frame By is a P-picture or a B-picture, the control unit 101 causes the process to proceed to step S1802. For example, whether the inferred target frame is an I-picture, a P-picture, or a B-picture can be determined by referring to the SPS and PPS.
[0142] In step S1802, the control unit 101 obtains the macroblock layer from the local region Byn of the inferred target frame By. Furthermore, if sub-macroblocks are used, sub-macroblock predictions are obtained. Otherwise, macroblock predictions are obtained.
[0143] The control unit 101 derives, via sub-macroblock prediction or macroblock prediction, the prediction unit block region Bynb of the macroblock to which the local region Byn of the inferred target frame By belongs. The prediction unit block region Bynb can be a macroblock, a block of a macroblock divided into partitions, a sub-macroblock, or a block of a sub-macroblock divided into partitions. These blocks are the units of motion compensation.
[0144] The control unit 101 derives the motion vector, reference frame, mbPartIdx, and subMbPartIdx of the block region Bynb via the SPS, the PPS, and macroblock prediction or sub-macroblock prediction.
[0145] Here, the control unit 101 generates six pieces of information ("mbPartIdx", "subMbPartIdx", "Presence or Absence of Motion Vector", "Motion Vector", "Reference / Referenced Frame", and "Reference Direction") for each block region Bynb and stores them in RAM 103. "mbPartIdx" and "subMbPartIdx" are used to identify which block region in the macroblock is the block region Bynb. "Motion Vector" refers to the temporal and spatial movement of the block region Bynb, specifically to the reference destination block in the referenced frame. "Presence or Absence of Motion Vector" indicates whether the block region Bynb includes such a motion vector. "Reference / Referenced Frame" refers to the referenced frame that is referenced when decoding the inferred target frame By from which the block region Bynb is extracted, or the reference frame that references the block region Bynb. When the "Reference / Referenced Frame" is generated in step S1802, the referenced frame is stored. Furthermore, regarding "Reference Direction", the direction indicated by a motion vector from a macroblock of the local region Byn of the inferred target frame By is the reference direction, and the direction in which the local region Byn of the inferred target frame By is indicated from a macroblock of another frame is the referenced direction. In the following text, the above six pieces of information are collectively referred to as motion vector information.
[0146] Control unit 101 checks whether a frame identifiable by the "reference / referenced frame" of the generated motion vector information exists in the teacher data candidate. If a frame identifiable by the "reference / referenced frame" exists in the teacher data candidate, control unit 101 sets the "presence or absence of motion vector" from the motion vector information to "yes", and if a frame identifiable by the "reference / referenced frame" does not exist in the teacher data candidate, control unit 101 sets the "presence or absence of motion vector" to "no".
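The motion vector information and the check against the teacher data candidates can be sketched as a small record type. The field names, the use of integer frame indices to identify frames, and the function `update_presence` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class MotionVectorInfo:
    """The six pieces of information kept for each block region Bynb."""
    mb_part_idx: int
    sub_mb_part_idx: int
    has_motion_vector: bool                  # "Presence or Absence of Motion Vector"
    motion_vector: Optional[Tuple[int, int]]
    ref_frame: Optional[int]                 # "Reference / Referenced Frame" (index)
    direction: Optional[str]                 # "reference" or "referenced"


def update_presence(info, teacher_frame_ids):
    """Set the presence flag to yes only if the frame is among the teacher candidates."""
    info.has_motion_vector = (info.ref_frame is not None
                              and info.ref_frame in teacher_frame_ids)
    return info


info = MotionVectorInfo(0, 0, False, (3, -1), 7, "reference")
print(update_presence(info, {5, 6, 7}).has_motion_vector)  # True
print(update_presence(info, {1, 2}).has_motion_vector)     # False
```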
[0147] Furthermore, for example, if the inferred target frame By is a B-picture and the block has two motion vectors, the reference frame that is closer to the inferred target frame By in temporal distance is used. If the two reference frames are at the same temporal distance from the inferred target frame By, the motion vector that indicates the shorter spatial distance is used together with the information of its reference frame. If both the temporal and spatial distances are equal, either reference frame can be used.
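The selection rule for a block with two motion vectors can be sketched as follows. The function name is hypothetical, frame indices are assumed to represent temporal position, and squared vector length is used as the measure of spatial distance; none of these details is stated in the text.

```python
def choose_motion_vector(target_index, candidates):
    """Pick one (ref_frame_index, (mvx, mvy)) candidate for a block of frame By.

    Preference order: smaller temporal distance to the target frame, then
    smaller spatial distance indicated by the motion vector. If both are
    equal, the first candidate is kept, since either may be used.
    """
    def key(candidate):
        ref_index, (mvx, mvy) = candidate
        temporal = abs(ref_index - target_index)
        spatial = mvx * mvx + mvy * mvy      # squared length is enough for ordering
        return (temporal, spatial)

    return min(candidates, key=key)


# Temporally closer reference frame wins even though its vector is longer.
print(choose_motion_vector(10, [(8, (1, 1)), (11, (5, 5))]))   # (11, (5, 5))
# Same temporal distance: the shorter motion vector wins.
print(choose_motion_vector(10, [(9, (4, 0)), (11, (1, 2))]))   # (11, (1, 2))
```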
[0148] In step S1803, for each block region Bynb whose "Presence or Absence of Motion Vector" in the motion vector information is "No", the control unit 101 searches the teacher data candidates for a block that references the block region Bynb. Hereinafter, a block that references the block region Bynb is also referred to as a reference source block. Note that the method for obtaining the motion vector and reference frame information required to determine whether a block is a reference source block of the block region Bynb has been described for step S1802, and therefore its description is omitted.
[0149] If a block that references the block region Bynb (a reference source block of the block region Bynb) is found, the "Presence or Absence of Motion Vector" in the motion vector information of the block region Bynb is updated to "Yes". Furthermore, the frame containing the reference source block is stored as the reference frame in "Reference / Referenced Frame". Note that the range of frames searched is within three frames before or after the frame containing the block region Bynb. Additionally, the range of macroblocks searched is within the MaxVmvR defined for each level of MPEG-4 AVC. MaxVmvR is derived from the SPS of the motion picture b. Note that the range of frames searched and the range of macroblocks searched are not limited to these examples.
[0150] In step S1804, for each block region Bynb whose "presence or absence of motion vector" in the motion vector information is "yes", the inference unit 452 obtains the reference destination or reference source block region UBXnb from frame group UB and stores it in RAM 103. Furthermore, the inference unit 452 obtains the block region UAXnb from frame group UA that corresponds to the same coordinate position as the block region UBXnb obtained via the motion vector information of each block region Bynb stored in RAM 103, and stores it in RAM 103. In other words, the inference unit 452 obtains the block region UAXnb corresponding to the same coordinate position as the same block region UBXnb from frames in frame group UA that form a pair with the frame to which the block region UBXnb belongs. Furthermore, the inference unit 452 associates the block region UAXnb with the block region UBXnb and stores it in RAM 103.
[0151] In step S1805, the control unit 101 determines whether the "presence or absence of motion vectors" of all block regions Bynb included in the local region Byn of the target frame By is "yes" or "no". If the control unit 101 determines that the "presence or absence of motion vectors" of all block regions Bynb is "yes" ("yes" in step S1805), the process proceeds to step S1806.
[0152] In step S1806, the inference unit 452 combines the block regions UBXnb stored in RAM 103 based on the coordinate position information of the block region Bynb, and generates a local region UBXn. The inference unit 452 stores the generated local region UBXn in RAM 103.
[0153] Furthermore, based on the coordinate position information of the block regions Bynb, the inference unit 452 combines the block regions UAXnb stored in RAM 103, each of which corresponds to the same coordinate position as its associated block region UBXnb, and generates a local region UAXn. The inference unit 452 stores the generated local region UAXn in RAM 103.
[0154] Furthermore, the learning unit 451 uses the local region UAXn and the local region UBXn stored in RAM 103 with the learning model generation function shown in Figure 8 to generate the learning model Mn. Note that the local region UBXn is teacher data corresponding to the same coordinate position as the local region UAXn in the same paired frame. The learning unit 451 reads the teacher data from RAM 103, executes the learning model generation function, and stores the generated learning model Mn in RAM 103.
[0155] In step S1807, the inference unit 452 uses the learning model Mn generated in step S1806 to infer the local region Byn of frame By and generate the local region Cyn of the high-definition frame.
[0156] First, the inference unit 452 reads the learning model Mn stored in RAM 103 in step S1806. Next, the inference unit 452 inputs the local region Byn of frame By held in RAM 103 into the CNN of the learning model Mn, and generates the high-frequency components expected in the local region Byn when the inferred target frame By is enlarged to resolution XA. The inference unit 452 generates the local region Cyn by adding the generated high-frequency components to the local region Byn linearly enlarged based on the ratio between resolution XB and resolution XA, and stores it in RAM 103. Note that the processing performed on the local region Byn, from high-frequency component inference to high-resolution image generation, is similar to the inference process shown in Figure 8.
[0157] If, in step S1805, the control unit 101 determines that the local region Byn includes a block region Bynb whose "presence or absence of motion vector" is "no" ("No" in step S1805), the process proceeds to step S1808. In step S1808, the control unit 101 determines, for each block region Bynb included in the local region Byn, whether the "presence or absence of motion vector" in its motion vector information is "yes" or "no". If the control unit 101 determines that the "presence or absence of motion vector" is "yes" ("Yes" in step S1808), the process proceeds to step S1809. On the other hand, if the control unit 101 determines that the "presence or absence of motion vector" is "no" ("No" in step S1808), the process proceeds to step S1811.
[0158] In step S1809, the learning unit 451 generates the learning model Mnb for the block region Bynb from the block region UBXnb, using the learning model generation function shown in Figure 8, and stores it in RAM 103.
[0159] More specifically, in step S1809, the learning unit 451 uses the block region UBXnb and the block region UAXnb stored in RAM 103 with the learning model generation function shown in Figure 8 to generate the learning model Mnb used for inference of the block region Bynb. Note that the block region UBXnb is teacher data corresponding to the same coordinate position as the block region UAXnb in the same pair of frames. The learning unit 451 reads the teacher data from RAM 103, inputs it into the learning model generation function, and stores the generated learning model Mnb in RAM 103.
[0160] In step S1810, the inference unit 452 uses the learning model Mnb to perform inference on the block region Bynb of frame By and generate the block region Cynb of the high-resolution frame. First, the inference unit 452 reads the learning model Mnb stored in RAM 103 in step S1809. Next, the inference unit 452 inputs the block region Bynb held in RAM 103 into the CNN of the learning model Mnb and generates the high-frequency components expected in the block region Bynb when the inferred target frame By is enlarged to resolution XA. The inference unit 452 generates the block region Cynb of the high-resolution frame by adding the generated high-frequency components to the block region Bynb linearly enlarged based on the ratio between resolution XB and resolution XA, and stores it in RAM 103. Note that the processing performed on the block region Bynb, from high-frequency component inference to high-resolution image generation, is similar to the inference process shown in Figure 8.
[0161] In step S1811, for each block region Bynb whose "presence or absence of motion vector" in the motion vector information is "no", the control unit 101 linearly enlarges the block region Bynb based on the ratio between resolution XA and resolution XB, and stores the result in RAM 103 as the block region Cynb of the high-definition frame Cy. Note that the linear enlargement method is not limited, as long as the enlargement is based on the ratio between resolution XA and resolution XB.
[0162] In step S1812, the control unit 101 determines whether the above processing has been completed for all block regions Bynb. If the control unit 101 determines that the processing is incomplete ("No" in step S1812), the process proceeds to step S1807, and the incomplete block regions Bynb are processed. If the control unit 101 determines that the processing is complete ("Yes" in step S1812), the process proceeds to step S1813. In step S1813, the control unit 101 reads the block regions Cynb held in RAM 103 in steps S1810 and S1811, combines these block regions based on the coordinate position information of the corresponding block regions Bynb, and generates a local region Cyn of the high-definition frame. The generated local region Cyn is held in RAM 103. In step S1608 of Figure 16, the local region Cyn generated as described above is used as the local region Cyn 1705.
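The branching of steps S1805 to S1811 can be summarized in the following sketch. It only decides which kind of processing each block receives; the learning, inference, and enlargement themselves are omitted. The function name and the dictionary layout of the returned plan are hypothetical.

```python
def plan_region_processing(block_has_mv):
    """Decide how a local region Byn is processed from its blocks' presence flags.

    block_has_mv maps a block identifier to True ("yes") or False ("no").
    """
    if all(block_has_mv.values()):
        # "Yes" in S1805: one model Mn for the whole local region (S1806, S1807).
        return {"mode": "region_model", "blocks": sorted(block_has_mv)}
    # "No" in S1805: each block is handled on its own (S1808).
    plan = {"mode": "per_block", "learn_and_infer": [], "enlarge_only": []}
    for block_id, has_mv in sorted(block_has_mv.items()):
        if has_mv:
            plan["learn_and_infer"].append(block_id)   # S1809, S1810
        else:
            plan["enlarge_only"].append(block_id)      # S1811
    return plan


print(plan_region_processing({0: True, 1: True})["mode"])   # region_model
print(plan_region_processing({0: True, 1: False, 2: True}))
# {'mode': 'per_block', 'learn_and_infer': [0, 2], 'enlarge_only': [1]}
```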
[0163] As described above, according to the eighth embodiment, learning is performed using regions that are identified through motion vectors as reference destinations or reference sources of the inferred region and that therefore have high similarity to it. Therefore, even for motion images with significant subject movement, higher-resolution images can be generated.
[0164] Ninth Embodiment
[0165] The ninth embodiment describes a solution to the problem of the sixth embodiment explained in the seventh embodiment; this solution differs from those of the seventh and eighth embodiments.
[0166] Next, the differences between the ninth embodiment and the sixth embodiment will be explained.
[0167] High-resolution motion picture generation and processing
[0168] The ninth embodiment differs from the sixth embodiment only in the processing of steps S1605 and S1606 in the flowchart of the high-resolution moving image generation process shown in Figure 16. Therefore, the processing of steps S1605 and S1606 according to the ninth embodiment will be described below.
[0169] In step S1605, the control unit 101 selects, from the paired frames of frame groups UA and UB, the local regions (UAn5 and UBn5) that correspond to the same coordinate position as the local region Byn of the inferred target frame By, and stores them in RAM 103. Additionally, the control unit 101 stores in RAM 103 the eight regions that are adjacent to UBn5 and have the same size as UBn5. Similarly, the control unit 101 stores in RAM 103 the eight regions that are adjacent to UAn5 and have the same size as UAn5. Figure 19 illustrates an example of region selection for a frame included in frame group UB. Note that in this embodiment, the region with the same coordinates as the local region Byn of the inference target and its eight adjacent regions are selected. However, the method of selection and the number of regions selected are not limited to this.
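The selection of the centre region and its eight neighbours can be sketched as follows. The raster-order numbering (so that list index 4 corresponds to UBn5) and the handling of regions that would fall outside the frame are assumptions; the text does not say how frame borders are treated.

```python
def neighborhood_regions(sx, sy, size, frame_w, frame_h):
    """Return top-left coordinates of the 3 x 3 neighbourhood around (sx, sy).

    The list is in raster order, so index 4 is the centre region (UBn5).
    Regions that would fall outside the frame are returned as None.
    """
    regions = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = sx + dx * size, sy + dy * size
            inside = 0 <= x <= frame_w - size and 0 <= y <= frame_h - size
            regions.append((x, y) if inside else None)
    return regions


regions = neighborhood_regions(16, 16, 16, 64, 64)
print(regions[4])  # (16, 16)
print(regions[0])  # (0, 0)
```

The corresponding regions UAn1 to UAn9 would use the same offsets with coordinates and size scaled by the XA:XB ratio.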
[0170] Next, the control unit 101 evaluates the similarity between the local region Byn of the inferred target frame By and each of UBn1 to UBn9, and obtains similarity evaluation values. Then, the control unit 101 determines the number of learning iterations for each of UBn1 to UBn9 based on the similarity evaluation values and stores it in RAM 103 as learning information. Note that the learning information includes, for example, "information for identifying UBn1 to UBn9", "similarity evaluation value with local region Byn", and "number of learning iterations". If the similarity evaluation value with local region Byn in the learning information is less than a threshold set in the system in advance, the control unit 101 updates the number of learning iterations in the learning information to 0. For regions whose similarity evaluation values are equal to or greater than the threshold, the number of learning iterations is determined according to the ratio of the similarity evaluation values among those regions, and the learning information is updated. In this example, the similarity evaluation values of UBn4, UBn5, and UBn6 are equal to or greater than the threshold, their ratio is 2:5:3, and the total number of learning iterations is set to 1000. The numbers of learning iterations for UBn4, UBn5, and UBn6 are therefore 200, 500, and 300, respectively. Note that in the method of this embodiment, the number of learning iterations is assigned linearly to regions whose similarity evaluation values are equal to or greater than the threshold. However, the method is not limited to this.
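The linear assignment of learning iterations can be sketched as follows. The function name, the integer similarity scores, and the threshold value are illustrative assumptions; only the 2:5:3 ratio and the total of 1000 iterations come from the example above.

```python
def allocate_iterations(scores, threshold, total):
    """Split `total` learning iterations among regions in proportion to similarity.

    Regions whose similarity evaluation value is below the threshold get 0.
    Rounding means the parts may not always sum exactly to `total`.
    """
    eligible = {name: s for name, s in scores.items() if s >= threshold}
    weight_sum = sum(eligible.values())
    result = {}
    for name in scores:
        if name in eligible and weight_sum > 0:
            result[name] = round(total * eligible[name] / weight_sum)
        else:
            result[name] = 0
    return result


# Illustrative scores: UBn4, UBn5, and UBn6 pass the threshold with ratio 2:5:3.
scores = {"UBn1": 1, "UBn4": 20, "UBn5": 50, "UBn6": 30}
print(allocate_iterations(scores, threshold=10, total=1000))
# {'UBn1': 0, 'UBn4': 200, 'UBn5': 500, 'UBn6': 300}
```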
[0171] In step S1606, the learning unit 451 uses the image of a local region (one of UBn1 to UBn9) indicated by the learning information and the image of the corresponding local region (one of UAn1 to UAn9) in frame group UA as teacher data to generate the learning model Mn. Using the learning model generation function shown in Figure 8, the learning unit 451 performs the learning process on each item of teacher data for the number of learning iterations indicated by the learning information, and generates the learning model Mn. The generated learning model Mn is stored in RAM 103.
[0172] The processing from step S1607 onwards is the same as that in the sixth embodiment, therefore its description is omitted.
[0173] As described above, according to the ninth embodiment, multiple regions with high similarity to the inferred region are used in the learning process based on their similarity to the inferred region. Therefore, even for moving images where the subject has moved significantly, higher-resolution images can be generated.
[0174] As described above, according to the sixth to ninth embodiments, local regions can be determined from high-resolution target images, and the amount of information used in the learning model can be reduced. Furthermore, according to the sixth to ninth embodiments, local regions of teacher data that are highly correlated with the local regions determined from the high-resolution target images can be selected and used for the learning model. Therefore, high-frequency components of the high-resolution target image can be inferred with higher accuracy, thereby enabling the acquisition of highly accurate high-resolution images. In other words, the accuracy of motion picture super-resolution imaging for achieving high-resolution motion pictures can be improved.
[0175] Tenth Embodiment
[0176] The tenth embodiment described below is an example that modifies the learning process for each local region of the sixth embodiment to reduce the learning processing load. In the method of the sixth embodiment, a frame is segmented into multiple local regions, a learning model is generated for each local region, and super-resolution performance is improved through inference processing. However, with this method, as many learning models must be generated as there are local regions, which tends to increase the learning processing load. Therefore, in the learning processing of the tenth embodiment, a similarity evaluation function is provided to detect movement in each local region, and local regions judged to have "no movement" are combined to form new combined local regions, thereby reducing the number of local regions. In this way, the number of generated learning models is reduced, and the learning processing load is alleviated.
[0177] The tenth embodiment differs from the sixth embodiment in the processing of step S1604 (processing for extracting local regions from frame By) in the flowchart of the high-definition moving image generation process shown in Figure 16. Therefore, the following will mainly describe the processing of step S1604 according to the tenth embodiment.
[0178] The local region extraction process for frame By in step S1604 of the tenth embodiment will be explained using Figure 20 and Figure 21. Figure 20 is a flowchart illustrating local region extraction according to the tenth embodiment. Figure 21 is a diagram used to illustrate the concept of local region extraction according to the tenth embodiment.
[0179] In Figure 21, 2100 represents the inferred target frame By. 2110 represents the image obtained by semantic region segmentation of the inferred target frame By, with each semantic region enclosed in a box. Boxes 2101 and 2102 are "tree" regions, boxes 2103 and 2104 are "ground" regions, and box 2105 is a "person" region. Even if some boxes have the same meaning, they are treated as separate semantic regions. For example, boxes 2101 and 2102 are regions with the same meaning (tree) but are considered different semantic regions.
[0180] 2120 represents an image obtained by determining whether there is movement in the partial regions formed by dividing the inferred target frame By into rectangular partial regions Byn' of uniform size, as in the sixth embodiment. This example shows a case where the "person" has large movement and the other parts of the image have small movement. In 2120, the partial regions By1' to By9', By13' to By16', By20' to By23', By27' to By30', and By34' to By35' indicated by diagonal hatching are identified as partial regions with small movement.
[0181] 2130 represents an image of the local regions Byn to be extracted according to this embodiment. A local region Byn is essentially the same as a partial region Byn'. However, in this embodiment, using the result of segmentation into semantic regions (2110) and the result of judging the amount of movement in each partial region (2120), the partial regions Byn' that are judged to have "no movement" within the same semantic region are combined to form a local region (combined local region). The local regions indicated by diagonal hatching in 2130 correspond to the combined local regions described above. In other words, partial regions By1', By2', By8', and By9' are combined to form a local region By1, and partial regions By6', By7', By13', and By14' are combined to form a local region By5. Furthermore, partial regions By22' and By23' are combined to form a local region By16, and partial regions By27' and By28' are combined to form a local region By20. In addition, each partial region Byn' that does not meet the above conditions is extracted as a local region as it is.
[0182] Next, the processing of the tenth embodiment is described with reference to the flowchart of Figure 20. In step S2001, the learning / inference unit 105 performs processing to segment the image of the inferred target frame By into semantic regions and stores the processing result in RAM 103. Here, semantic region segmentation can be achieved via inference using a CNN model such as Mask R-CNN. Therefore, for semantic region segmentation, the learning / inference unit 105 switches the CNN model to be used from the CNN model used for super-resolution to the CNN model used for semantic region segmentation (e.g., Mask R-CNN). Alternatively, a dedicated learning / inference unit for semantic region segmentation can be provided separately from the learning / inference unit 105.
[0183] In step S2002, the control unit 101 extracts partial regions Byn' from the inferred target frame By and stores them in RAM 103. Note that in this embodiment, the partial region Byn' is, for example, a rectangular region (square region) of uniform size with tens of pixels × tens of pixels. However, such a limitation is not intended. For example, the partial region Byn' can be an elongated rectangular region.
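As an illustration of step S2002, the following is a minimal sketch of cutting a frame into uniform partial regions in raster order. The function name `extract_partial_regions`, the 32 × 32 tile size, and the 7-column by 6-row layout are assumptions made for this sketch; the grid layout is only inferred from the By1' to By42' numbering used with Figure 21, and edge handling for frames that are not an exact multiple of the tile size is left out.

```python
import numpy as np

def extract_partial_regions(frame, tile_h, tile_w):
    """Split a frame (H x W or H x W x C array) into uniform tiles in raster order.

    Returns a list of (index, (top, left), tile) tuples, with index starting
    at 1 to mirror the By1', By2', ... numbering. This sketch assumes that the
    frame height and width are exact multiples of the tile size.
    """
    h, w = frame.shape[:2]
    if h % tile_h or w % tile_w:
        raise ValueError("frame size must be a multiple of the tile size")
    regions = []
    index = 1
    for top in range(0, h, tile_h):
        for left in range(0, w, tile_w):
            tile = frame[top:top + tile_h, left:left + tile_w]
            regions.append((index, (top, left), tile))
            index += 1
    return regions

# Hypothetical frame: 6 rows x 7 columns of 32 x 32 tiles (42 partial regions).
frame_by = np.zeros((6 * 32, 7 * 32), dtype=np.uint8)
partial_regions = extract_partial_regions(frame_by, 32, 32)
```

With this layout, the eighth entry (By8') is the first tile of the second row, directly below By1'.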
[0184] In step S2003, the control unit 101 determines, for each partial region Byn' extracted in step S2002, whether there is movement relative to the immediately preceding inferred target frame. The control unit 101 stores information indicating partial regions determined to have "no movement" relative to the immediately preceding inferred target frame in RAM 103. Here, for example, determining whether there is movement in the images of each partial region can be achieved using the similarity evaluation function of SSIM. The control unit 101 uses SSIM to obtain the similarity of partial regions with the same coordinates between the inferred target frame By and the immediately preceding inferred target frame, and if the obtained similarity is greater than a certain threshold, it determines that "no movement exists" in the partial region. If the obtained similarity is equal to or less than the certain threshold, it determines that "movement exists" for that partial region. Note that SSIM is used in the similarity evaluation. However, such a limitation is not intended. For example, peak signal-to-noise ratio (PSNR), signal-to-noise ratio (SNR), or mean squared error (MSE) can be used.
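The movement judgment of step S2003 can be sketched as follows. This is a simplified illustration rather than the embodiment's implementation: `global_ssim` computes SSIM over a whole tile as a single window (practical SSIM implementations usually average over sliding windows), and the threshold value 0.9 is a hypothetical choice, since the description only refers to "a certain threshold".

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM between two equally sized grayscale tiles."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator

def has_movement(cur_tile, prev_tile, threshold=0.9):
    """Return True when similarity is equal to or less than the threshold."""
    return bool(global_ssim(cur_tile, prev_tile) <= threshold)

rng = np.random.default_rng(0)
tile = rng.integers(0, 256, size=(32, 32))
same_tile = tile.copy()       # unchanged content: expected "no movement"
inverted_tile = 255 - tile    # strongly different content: expected "movement"
```

The comparison follows the rule in the text: a similarity greater than the threshold means "no movement", and a similarity equal to or less than the threshold means "movement".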
[0185] In step S2004, the control unit 101 selects the partial regions that lie within the same semantic region calculated in step S2001 and that were determined to be "non-moving" in step S2003, and stores them in RAM 103. Note that in this embodiment, a partial region is considered to exist within a semantic region if it is entirely included in that semantic region. However, this is not intended to be a limitation. For example, a partial region can be considered to exist within a semantic region if a predetermined proportion or more of the partial region is included in that semantic region. In step S2005, the control unit 101 combines the partial regions selected in step S2004 and stores the combined local region in RAM 103. Note that in this embodiment, as long as the partial regions are included in the same semantic region, they are treated as one local region even if they are not contiguous. However, this is not intended to be a limitation. For example, partial regions that exist within the same semantic region in frame By, are contiguous in the up, down, left, and right directions, and are determined to be "non-moving" can be combined to form a local region.
[0186] In step S2006, the inference unit 452 extracts the combined local region Byn held in RAM 103 in step S2005 as a local region and holds it in RAM 103. Furthermore, the inference unit 452 extracts, as a local region Byn, each partial region Byn' held in RAM 103 in step S2002 that was not selected as a combination target in step S2005, and holds these local regions in RAM 103. In the example of Figure 21, the partial regions obtained by segmenting the image into 42 regions are combined in step S2005, and 34 local regions are extracted. These 34 local regions are used in the processing from step S1605 onward in Figure 16.
[0187] As described above, according to the tenth embodiment, since multiple "non-moving" partial regions are combined to form a local region, the number of times the subsequent processing for generating a learning model is performed can be reduced. This makes it possible to reduce the learning processing load while maintaining super-resolution performance.
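Steps S2004 to S2006 can be sketched as follows, using the partial region numbers given for Figure 21. The function name `combine_partial_regions` and the `semantic_of` mapping are assumptions for illustration: the description does not state which partial regions lie entirely inside each box, so the mapping below is reconstructed from the combined results it reports (By1, By5, By16, and By20).

```python
from collections import defaultdict

def combine_partial_regions(num_partial, non_moving, semantic_of):
    """Combine non-moving partial regions that share a semantic region.

    num_partial: total number of partial regions (indices 1..num_partial).
    non_moving:  set of partial region indices judged "no movement".
    semantic_of: dict mapping a partial region index to a semantic region id,
                 only for partial regions entirely inside that semantic region.
    Returns the local regions in raster order, each as a list of partial
    region indices.
    """
    groups = defaultdict(list)
    for idx in sorted(non_moving):
        if idx in semantic_of:
            groups[semantic_of[idx]].append(idx)
    combined = {}   # first member index -> all members of the combined region
    merged = set()  # every partial region absorbed into a combined region
    for members in groups.values():
        if len(members) >= 2:
            combined[members[0]] = members
            merged.update(members)
    local_regions = []
    for idx in range(1, num_partial + 1):
        if idx in combined:
            local_regions.append(combined[idx])
        elif idx not in merged:
            local_regions.append([idx])  # kept as its own local region
    return local_regions

# Non-moving partial regions as listed for 2120 in Figure 21.
non_moving = (set(range(1, 10)) | set(range(13, 17)) | set(range(20, 24))
              | set(range(27, 31)) | {34, 35})
# Assumed containment of partial regions in semantic boxes 2101 to 2104.
semantic_of = {1: 2101, 2: 2101, 8: 2101, 9: 2101,
               6: 2102, 7: 2102, 13: 2102, 14: 2102,
               22: 2103, 23: 2103, 27: 2104, 28: 2104}
local_regions = combine_partial_regions(42, non_moving, semantic_of)
```

Under these assumptions, the 42 partial regions reduce to 34 local regions, and the combined regions land at positions By1, By5, By16, and By20 in raster order.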
[0188] Note that in this embodiment, partial regions within the same semantic region obtained in step S2001 are combined. However, such a limitation is not intended. For example, the control unit 101 may combine all "no movement" partial regions in frame By into a single local region, independent of the semantic regions. Furthermore, for example, the control unit 101 may combine "no movement" partial regions that are adjacent in the up-down and left-right directions, independent of the semantic regions. In this case, for example, the set of "non-moving" partial regions indicated by 2120 in Figure 21 is extracted as a local region. Furthermore, for example, the control unit 101 can combine the "non-moving" partial regions such that each combined local region has a rectangular shape. For example, in the case of the "non-moving" partial regions shown in 2120 of Figure 21, three combined local regions are extracted (e.g., 5×2 local regions on the left and right sides and a 1×3 local region in the center).
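The variation that combines adjacent "non-moving" partial regions independent of the semantic regions can be sketched as a 4-connected grouping over a grid of movement flags. The function name `connected_non_moving` and the 7-column by 6-row grid are assumptions for this sketch; the grid layout is inferred from the partial region numbering used with Figure 21.

```python
from collections import deque

def connected_non_moving(mask):
    """Group True cells of a 2D boolean grid by up/down/left/right adjacency.

    Returns a list of groups, each a sorted list of (row, col) positions.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    groups = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search from an unvisited non-moving cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            group = []
            while queue:
                y, x = queue.popleft()
                group.append((y, x))
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            groups.append(sorted(group))
    return groups

COLS, ROWS = 7, 6  # assumed layout of By1' to By42'
non_moving = (set(range(1, 10)) | set(range(13, 17)) | set(range(20, 24))
              | set(range(27, 31)) | {34, 35})
mask = [[(r * COLS + c + 1) in non_moving for c in range(COLS)]
        for r in range(ROWS)]
groups = connected_non_moving(mask)
```

With this assumed layout, all 23 non-moving partial regions are linked through the top row, so they form a single combined local region.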
[0189] The tenth embodiment based on the sixth embodiment has been described above. However, it should be apparent that the combined local region according to the tenth embodiment can also be used in the processes described in the seventh to ninth embodiments. Furthermore, needless to say, the teacher data extracted for learning can be the same as that according to any of the first to fifth embodiments.
[0190] Other embodiments
[0191] The embodiments of the present invention can also be implemented by supplying software (a program) that performs the functions of the above embodiments to a system or device via a network or various storage media, and having a computer, central processing unit (CPU), or microprocessor unit (MPU) of the system or device read out and execute the program.
[0192] Although the invention has been described with reference to exemplary embodiments, it should be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the appended claims is to be interpreted in the broadest sense to include all such modifications, equivalent structures, and functions.
Claims
1. An image processing apparatus that uses a first image group to give high resolution to images of a second image group, the images of the second image group having fewer high-frequency components than the images of the first image group, the image processing apparatus comprising: a selection component for selecting, based on a current image selected as a high-resolution target from the second image group, teacher data to be used in learning from multiple teacher data that each use one of the images included in the first image group as one of an image pair; a computing component for calculating, for each of multiple partial regions obtained by segmenting the current image, the similarity to the corresponding partial region of a previous image that was a high-resolution target preceding the current image; a determining component for determining multiple local regions from the current image by combining a set of one or more partial regions with similarity equal to or greater than a threshold into a local region and treating partial regions with similarity less than the threshold as separate local regions; a model generation component for generating, for each of the plurality of local regions, a learning model for inferring high-frequency components using the teacher data selected by the selection component; an inference component for inferring high-frequency components for each of the plurality of local regions using the learning model; and an image generation component for generating a high-resolution image based on the current image and the high-frequency components inferred by the inference component.
2. The image processing apparatus according to claim 1, wherein the determining component combines, among the partial regions whose similarity is equal to or greater than the threshold, partial regions that are contiguous in any of the up, down, left, or right directions in the current image into a local region.
3. The image processing apparatus according to claim 1, further comprising: a segmentation component for segmenting the current image into semantic regions, wherein the determining component combines the partial regions that belong to the same semantic region obtained by the segmentation component and whose similarity is equal to or greater than the threshold.
4. The image processing apparatus according to claim 1, further comprising: an obtaining component for obtaining, as candidates for the teacher data, image pairs each including a first image selected from the first image group and a third image that is associated with the first image and has fewer high-frequency components than the first image, wherein the selection component selects the teacher data to be used in the learning from the candidates for the teacher data.
5. The image processing apparatus according to claim 4, wherein the obtaining component obtains candidates for the teacher data by obtaining the third image from the second image group.
6. The image processing apparatus according to claim 5, wherein the obtaining component obtains, as the third image, an image from the second image group whose shooting time is the same as that of the first image.
7. The image processing apparatus according to claim 5, wherein the obtaining component obtains, as the third image, an image from the second image group whose shooting time difference from the first image is less than a predetermined threshold.
8. The image processing apparatus according to claim 5, wherein the obtaining component obtains, as the third image, the image in the second image group with the highest similarity to the first image.
9. The image processing apparatus according to claim 8, wherein the obtaining component determines the similarity between the first image scaled down to the resolution of the second image group and the images in the second image group.
10. The image processing apparatus according to claim 4, wherein the obtaining component obtains, as the third image, a reduced, lower-resolution version of the first image.
11. The image processing apparatus according to claim 10, wherein the third image is the first image scaled down to the resolution of the second image group.
12. The image processing apparatus according to claim 4, wherein the selection component selects, as teacher data to be used in the learning, candidates for the teacher data that include images whose shooting time difference from the current image is less than a predetermined threshold.
13. The image processing apparatus according to claim 4, wherein the selection component selects, as teacher data to be used in the learning, candidates for the teacher data that include images whose similarity to the current image is greater than a predetermined threshold.
14. The image processing apparatus according to claim 1, wherein the inference component controls the updating of parameters via backpropagation during the learning, based on the teacher data to be used in the learning and the current image.
15. The image processing apparatus according to claim 14, wherein the inference component determines coefficients based on the teacher data to be used in the learning and the current image, and controls the amount of parameter updates via the backpropagation based on the coefficients.
16. The image processing apparatus according to claim 14, wherein the inference component determines coefficients based on the teacher data to be used in the learning and the current image, and controls the number of repetitions of parameter updates via the backpropagation based on the coefficients.
17. The image processing apparatus according to claim 15, wherein the inference component determines the coefficients based on the difference between the capture time of the image of the teacher data to be used in the learning and the capture time of the current image.
18. The image processing apparatus according to claim 15, wherein the inference component determines the coefficients based on the similarity between the image of the teacher data to be used in the learning and the current image.
19. The image processing apparatus according to claim 1, wherein the model generation component extracts, from the teacher data selected by the selection component, image pairs corresponding to each local region among the plurality of local regions, and uses the extracted image pairs to generate a learning model for the local image of each local region among the plurality of local regions; the inference component uses the learning model to infer the local high-frequency components of the local image; and the image generation component uses the local high-frequency components and the local image to generate a high-resolution image of each local region, and combines the high-resolution images generated for the respective local regions.
20. The image processing apparatus according to claim 19, wherein the model generation component extracts, from the teacher data selected by the selection component, image pairs of regions at the same coordinate positions as the local region.
21. The image processing apparatus according to claim 20, wherein the image generation component combines the high-resolution images of the respective local regions based on coordinate position information to generate a high-resolution image of the current image.
22. The image processing apparatus according to claim 19, wherein the model generation component extracts, from the teacher data selected by the selection component, the image pairs with the highest similarity to the local image.
23. The image processing apparatus according to claim 19, wherein the model generation component extracts image pairs corresponding to the local region from the teacher data selected by the selection component, based on motion vectors set for blocks included in the local region as motion compensation units or based on motion vectors of blocks included in the local region.
24. The image processing apparatus according to claim 19, wherein the model generation component extracts, from the teacher data selected by the selection component, multiple image pairs corresponding to multiple regions identified based on the location of the local region; and the model generation component determines, based on the similarity between the local image and each image pair among the multiple image pairs, the number of times each image pair among the multiple image pairs is to be used for learning when generating the learning model.
25. The image processing apparatus according to claim 24, wherein the multiple regions include a first region corresponding to the location of the local region and a second region adjacent to the first region.
26. The image processing apparatus according to claim 24, wherein the model generation component does not use, for learning, image pairs whose similarity to the local image is equal to or less than a threshold.
27. The image processing apparatus according to claim 1, wherein the first image group and the second image group are two image groups obtained by performing different image processing on an image captured by an image sensor included in a camera device.
28. The image processing apparatus according to claim 1, wherein the first image group and the second image group are image groups captured by two different image sensors.
29. The image processing apparatus according to claim 1, wherein the first image group has a lower frame rate than the second image group.
30. An image processing method that uses a first image group to give high resolution to images of a second image group, the images of the second image group having fewer high-frequency components than the images of the first image group, the image processing method comprising: selecting, based on a current image selected as a high-resolution target from the second image group, teacher data to be used in learning from multiple teacher data that each use one of the images included in the first image group as one of an image pair; calculating, for each of multiple partial regions obtained by segmenting the current image, the similarity to the corresponding partial region of a previous image that was a high-resolution target preceding the current image; determining multiple local regions from the current image by combining a set of one or more partial regions with similarity equal to or greater than a threshold into a single local region and treating partial regions with similarity less than the threshold as separate local regions; generating, for each of the plurality of local regions, a learning model for inferring high-frequency components using the teacher data selected in the selecting; inferring, for each of the plurality of local regions, high-frequency components using the learning model; and generating a high-resolution image based on the current image and the high-frequency components inferred in the inferring.
31. A storage medium storing a program for causing a computer to function as each component of the image processing apparatus according to any one of claims 1 to 29.