Noise reduction model training method, noise reduction method and device
By combining scene semantics and optical parameter vectors to train the denoising model, the problem of obvious noise in RAW image data in low-light environments is solved, achieving adaptive and efficient denoising, avoiding loss of detail, and improving image quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- VIVO MOBILE COMM HANGZHOU CO LTD
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-23
AI Technical Summary
In existing technologies, RAW format image data exhibits significant noise in low-light environments, and using a pre-set noise reduction threshold for denoising results in loss of detail.
By acquiring sample scene semantic vectors and optical parameter vectors from RAW image data, and combining them with a deep learning model for feature fusion, a noise reduction model is trained, and noise reduction parameters are adaptively adjusted to avoid a "one-size-fits-all" noise reduction approach.
It improves the noise reduction effect of RAW image data, avoids loss of detail, adapts to different optical noise and scene types, and improves the accuracy and efficiency of noise reduction.
Figure CN122265074A_ABST
Abstract
Description
Technical Field
[0001] This application belongs to the field of image processing technology, specifically relating to a noise reduction model training method, a noise reduction method, and an apparatus.
Background Technology
[0002] With the rapid development of electronic device photography technology, RAW noise reduction plays an increasingly important role in photography. RAW noise reduction is applied to RAW format image data obtained from the sensor. RAW format image data is rich in image details, but noise is obvious in low-light environments. By setting a noise reduction threshold, noise can be reduced in areas of the image data where the brightness is less than the threshold, while the processing intensity can be reduced in texture areas where the brightness is greater than the threshold.
[0003] The above scheme uses a pre-set noise reduction threshold to denoise RAW format image data, which is a "one-size-fits-all" noise reduction method. This results in the loss of details in the denoised image data.
Summary of the Invention
[0004] The purpose of this application is to provide a noise reduction model training method, a noise reduction method, and an apparatus that can improve the noise reduction effect on RAW image data.
[0005] In a first aspect, embodiments of this application provide a method for training a noise reduction model, the method comprising: acquiring a training data pair, the training data pair comprising first RAW image data, a sample scene semantic vector acquired when the first RAW image data was acquired, a sample optical parameter vector acquired when the first RAW image data was acquired, and reference RAW image data corresponding to the first RAW image data, wherein the first RAW image data comprises at least one frame of RAW image data; extracting a first feature vector of the first RAW image data, a second feature vector of the sample optical parameter vector, and a third feature vector of the sample scene semantic vector; fusing the first feature vector, the second feature vector, and the third feature vector to obtain a fused feature; inputting the fused feature into a first noise reduction model to obtain output RAW image data; and adjusting the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain a second noise reduction model.
[0006] In a second aspect, embodiments of this application provide a noise reduction method, the method comprising: acquiring RAW image data to be processed, a scene semantic vector acquired when the RAW image data to be processed was acquired, and an optical parameter vector corresponding to the acquisition of the RAW image data to be processed, wherein the RAW image data to be processed includes at least one frame of RAW image data; extracting a first target feature vector of the RAW image data to be processed, a second target feature vector of the optical parameter vector, and a third target feature vector of the scene semantic vector; fusing the first target feature vector, the second target feature vector, and the third target feature vector to obtain a target fused feature; and inputting the target fused feature into a second noise reduction model to obtain target output RAW image data.
[0007] In a third aspect, embodiments of this application provide a noise reduction model training device, the device comprising: an acquisition module for acquiring training data pairs, which include first RAW image data, a sample scene semantic vector acquired when the first RAW image data was acquired, a sample optical parameter vector acquired when the first RAW image data was acquired, and reference RAW image data corresponding to the first RAW image data, the first RAW image data including at least one frame of RAW image data; an extraction module for extracting a first feature vector from the first RAW image data, a second feature vector from the sample optical parameter vector, and a third feature vector from the sample scene semantic vector; a fusion module for fusing the first feature vector, the second feature vector, and the third feature vector to obtain fused features; an input module for inputting the fused features into a first noise reduction model to obtain output RAW image data; and a training module for adjusting the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain a second noise reduction model.
[0008] In a fourth aspect, embodiments of this application provide a noise reduction apparatus, applied to the second noise reduction model described in the third aspect, the apparatus comprising: an acquisition module for acquiring RAW image data to be processed, a scene semantic vector acquired when the RAW image data to be processed was acquired, and an optical parameter vector corresponding to the acquisition of the RAW image data to be processed, wherein the RAW image data to be processed includes at least one frame of RAW image data; an extraction module for extracting a first target feature vector of the RAW image data to be processed, a second target feature vector of the optical parameter vector, and a third target feature vector of the scene semantic vector; a fusion module for fusing the first target feature vector, the second target feature vector, and the third target feature vector to obtain target fused features; and a noise reduction module for inputting the target fused features into the second noise reduction model to obtain target output RAW image data.
[0009] In a fifth aspect, embodiments of this application provide an electronic device, the electronic device including a processor and a memory, the memory storing programs or instructions executable on the processor, the programs or instructions, when executed by the processor, implementing the method described in the first aspect and / or the method described in the second aspect.
[0010] In a sixth aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the method described in the first aspect and / or the method described in the second aspect.
[0011] In a seventh aspect, embodiments of this application provide a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being used to run programs or instructions to implement the method described in the first aspect and / or the method described in the second aspect.
[0012] In an eighth aspect, embodiments of this application provide a computer program product stored in a storage medium, which is executed by at least one processor to implement the method described in the first aspect and / or the method described in the second aspect.
[0013] In the embodiments of this application, the first denoising model is trained using the first RAW image data, the sample scene semantic vector acquired when the first RAW image data was acquired, and the sample optical parameter vector corresponding to the acquisition of the first RAW image data. Compared with relying solely on the single-modality information of the first RAW image data, introducing the two additional modalities of scene semantic vector and optical parameter vector allows the trained model to adaptively denoise RAW image data for different sources of optical noise and different scene types. This avoids the loss of detail caused by the "one-size-fits-all" denoising method and improves the denoising effect on RAW image data.
Attached Figure Description
[0014] Figure 1 is a flowchart of a noise reduction model training method provided in some embodiments of this application;
Figure 2 is a flowchart of a noise reduction method provided in some embodiments of this application;
Figure 3 is a schematic diagram of the structure of a noise reduction model training device according to some embodiments of this application;
Figure 4 is a schematic diagram of the structure of a noise reduction device according to some embodiments of this application;
Figure 5 is a schematic diagram of the structure of an electronic device according to some embodiments of this application;
Figure 6 is a schematic diagram of the hardware structure of an electronic device according to some embodiments of this application.
Detailed Implementation
[0015] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.
[0016] The terms "first," "second," etc., used in this application's specification are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class, without limiting the number of objects; for example, the first object can be one or N objects. Furthermore, in the specification, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects have an "or" relationship.
[0017] The terminology used in the embodiments of this invention will be explained below.
[0018] RAW image data: RAW format image data. When the camera is on, the camera will continuously capture RAW format image data, and then perform image processing on the RAW format image data to obtain a preview image.
[0019] High Dynamic Range (HDR) images are created by taking multiple photos with different exposure levels (e.g., underexposed, normally exposed, overexposed) in succession, and then using algorithms to fuse the best details of these images (e.g., bright details from short exposure frames and dark details from long exposure frames) to generate a single image with high dynamic range.
[0020] High dynamic range RAW image data: RAW format image data captured by the camera when taking multiple photos with different exposure levels.
[0021] One-hot encoding is a technique for converting categorical variables into binary vectors. Its core principle is to independently encode N states using an N-bit state register, with only one bit in the binary vector corresponding to each state being valid. In other words, one-hot encoding constructs a binary vector with a dimension equal to the total number of categories. Each category's vector is set to 1 only at its corresponding index position, with the remaining positions being 0. For example, "cat" in the animal category can be encoded as [1,0,0,0], where the vector length is equal to the number of animal categories.
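The "cat" example above can be sketched in a few lines. This is only an illustration: the function name `one_hot` and the list of four animal categories are assumptions introduced here, not part of the application.

```python
def one_hot(category, categories):
    """Return a binary vector with a single 1 at the index of `category`."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if c == category else 0 for c in categories]


# Hypothetical category list; only "cat" and the vector length 4 come from the text.
animals = ["cat", "dog", "bird", "fish"]
cat_vector = one_hot("cat", animals)
print(cat_vector)
```

The vector length always equals the number of categories, so adding a category changes the dimension of every encoded vector.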
[0022] Min-max normalization, also known as deviation normalization, applies a linear transformation that scales the original data proportionally and maps it to a fixed numerical range. The most commonly used target range is [0, 1], but it can be adjusted to other ranges as needed.
[0023] Black level value: The black level value of an imaging device refers to the reference level of the electrical signal output by the image sensor under conditions of no light input (or complete darkness). It is essentially a numerical reference point for the "pure black" area in the image data, used to calibrate the dark area performance of the image.
[0024] The technical solutions of this application can be applied to scenarios where noise reduction of image data during the shooting process is required. For example, a user is playing in a park and wants to photograph a flower. During the shooting process, due to the influence of sensor parameters, the RAW image data will have noise. The user wants to obtain a denoised image A. As another example, a user is watching the sunrise at the beach in the morning and wants to take an image B that shows both the sunrise and the sea surface, with noise removed. Because the sunrise and the sea surface have contrast, the user wants to capture a high dynamic range image B. During the shooting of image B, there will be multiple frames of RAW image data with different exposures, for example, a total of 5 frames of RAW image data with different exposures. These 5 frames of RAW image data with different exposures need to be denoised and then fused to obtain a high dynamic range image B.
[0025] In the embodiments of this application, noise is removed from RAW image data by a denoising model. The denoising model must be trained before it can be used.
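The train-then-use flow can be illustrated with a deliberately tiny sketch. The text here does not specify the feature extractors, the fusion operator, or the model architecture, so everything below is an assumption made for illustration: the inputs are used directly as feature vectors, fusion is plain concatenation, the "model" is a single linear map, and the loss is mean squared error against the reference. All function and variable names are hypothetical.

```python
import numpy as np


def fuse(raw, optical_vec, scene_vec):
    # Assumption: features are the raw inputs themselves, fused by concatenation.
    return np.concatenate([raw.ravel(), optical_vec, scene_vec])


def denoise(weights, fused, shape):
    # Assumption: the noise reduction model is one linear map on the fused feature.
    return (weights @ fused).reshape(shape)


def train(raw, optical_vec, scene_vec, reference, steps=50, lr=0.1):
    fused = fuse(raw, optical_vec, scene_vec)
    weights = np.zeros((reference.size, fused.size))  # untrained "first" model
    losses = []
    for _ in range(steps):
        err = denoise(weights, fused, reference.shape) - reference
        losses.append(float(np.mean(err ** 2)))
        # Gradient of the mean squared error with respect to the weights.
        grad = 2.0 * np.outer(err.ravel(), fused) / err.size
        weights -= lr * grad
    return weights, losses  # adjusted weights play the role of the "second" model


rng = np.random.default_rng(0)
reference = rng.uniform(0.2, 0.8, size=(4, 4))         # stand-in reference RAW data
raw = reference + rng.normal(0.0, 0.05, size=(4, 4))   # stand-in noisy RAW frame
optical_vec = np.array([0.0, 1.0, 0.0, 0.03])          # made-up optical vector
scene_vec = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # made-up scene vector

weights, losses = train(raw, optical_vec, scene_vec, reference)
output = denoise(weights, fuse(raw, optical_vec, scene_vec), reference.shape)
print(losses[0], losses[-1])
```

Fitting a single sample like this only shows the data flow (fuse, predict, compare with the reference, adjust parameters); a real model would be trained on many training data pairs.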
[0026] The noise reduction model training method provided in this application will be described in detail below with reference to the accompanying drawings, through specific embodiments and application scenarios.
[0027] Figure 1 is a flowchart illustrating a noise reduction model training method provided in an embodiment of this application. The subject executing the noise reduction model training method can be an electronic device, which may be, but is not limited to, a personal computer (PC), a smartphone, a tablet computer, or a personal digital assistant (PDA).
[0028] As shown in Figure 1, the noise reduction model training method provided in this application embodiment may include steps 110-150.
[0029] Step 110: Obtain training data pairs, which include first RAW image data, sample scene semantic vectors when acquiring the first RAW image data, sample optical parameter vectors corresponding to the acquisition of the first RAW image data, and reference RAW image data corresponding to the first RAW image data.
[0030] The training data pair can be a data pair used to train the first noise reduction model. This training data pair may include first RAW image data, sample scene semantic vectors acquired when the first RAW image data was acquired, sample optical parameter vectors acquired when the first RAW image data was acquired, and reference RAW image data corresponding to the first RAW image data.
[0031] The aforementioned first RAW image data can be RAW format image data captured by the camera's sensor. This first RAW image data can include at least one frame of RAW image data; that is, the first RAW image data can include one frame of ordinary RAW image data. For example, if a user previously took a selfie image C, the first RAW image data could be one frame of RAW image data captured when taking the user's selfie image C. The first RAW image data can also include multiple frames of RAW image data. These multiple frames of RAW image data can be multiple frames of RAW image data used to obtain a high dynamic range image. For example, if a user previously photographed a sunset, due to the contrast between the sunset and the surrounding scene, the sunset image D taken by the user is a high dynamic range image. During the shooting of the sunset image D, there will be multiple frames of RAW image data with different exposures. For example, there are a total of 6 frames of RAW image data with different exposures, so the first RAW image data here can be the aforementioned 6 frames of RAW image data with different exposures.
[0032] The aforementioned sample optical parameter vector can be a vector of the optical parameters of the acquisition device when acquiring the first RAW image data. In other words, the sample optical parameter vector is a vector obtained by encoding the optical parameters of the acquisition device when acquiring the first RAW image data. The specific method for encoding the optical parameters of the acquisition device when acquiring the first RAW image data to obtain the sample optical parameter vector will be described in detail in later embodiments.
[0033] The aforementioned acquisition device can be a device that acquires the first RAW image data, such as a camera. The optical parameters of the acquisition device can include its own device parameters and exposure control parameters. The device parameters can be inherent to the device itself, such as the sensor model and white balance preset values. The exposure control parameters can be parameters related to exposure when acquiring the first RAW image data, such as, but not limited to, ISO sensitivity, exposure time, aperture size, and focal length. For example, the ISO range could be 100 to 64000, the exposure time range could be 1/8000 s to 30 s, and the aperture size range could be f/1.2 to f/22.
[0034] It should be noted that the optical parameters of the acquisition device mentioned above can be obtained through the hardware application programming interface (API) of the acquisition device, or extracted from RAW image files through an exchangeable image file format (EXIF) parsing tool.
[0035] It should be noted that when the first RAW image data consists of multiple frames of RAW image data used to obtain high dynamic range images, the optical parameters, especially the exposure control parameters, are different when each frame of RAW image data is acquired.
[0036] The sample scene semantic vector can be a vector of scene semantic information acquired during the acquisition of the first RAW image data. In other words, the sample scene semantic vector is a vector obtained by encoding the scene semantic information acquired during the acquisition of the first RAW image data. The specific method for encoding the scene semantic information acquired during the acquisition of the first RAW image data to obtain the sample scene semantic vector will be described in detail in the following embodiments.
[0037] The scene semantic information when the first RAW image data is acquired can be the scene in which the first RAW image data is acquired. For example, if the first RAW image data is acquired in a low-light night scene, then the scene semantic information is a low-light night scene. Or, if the first RAW image data is acquired in an outdoor bright light scene, then the scene semantic information is outdoor bright light.
[0038] In some embodiments of this application, scene categories can be pre-defined to identify the specific scene category of the first RAW image data. For example, when the first RAW image data is a single-frame RAW image data, the pre-defined scene categories may include, but are not limited to, six scene categories: low-light night scene, outdoor strong light, indoor portrait, indoor still life, outdoor landscape, and motion scene. When the first RAW image data is multi-frame RAW image data used to obtain high dynamic range images, the pre-defined scene categories may include, but are not limited to, seven scene categories: low-light night scene, outdoor strong light, indoor portrait, indoor still life, outdoor landscape, motion scene, and high-contrast scene. The aforementioned high-contrast scene may be, for example, a backlight or a scene where light and shadow meet.
[0039] It should be noted that when the first RAW image data consists of multiple frames of RAW image data used to obtain high dynamic range images, the scene semantic information at the time of acquisition of each frame of RAW image data is the same.
[0040] The reference RAW image data corresponding to the first RAW image data can be the ideal RAW image data obtained after denoising the first RAW image data. If the first RAW image data is a single frame, the reference RAW image data is a single frame of denoised RAW image data. If the first RAW image data consists of multiple frames used to obtain a high dynamic range image, the reference RAW image data can also be a single frame of denoised RAW image data. This denoised RAW image data can be obtained by denoising each of the multiple frames of RAW image data used to obtain the high dynamic range image and then fusing the denoised frames.
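The contents of a training data pair described above can be summarized as a small data structure. This is a sketch under assumptions: the class name `TrainingPair`, its field names, and the nested-list frame type are all hypothetical, and the consistency check reflects only what the text states (one or more frames, one set of optical parameters per frame, shared scene semantics, a single reference frame).

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # hypothetical stand-in for one frame of RAW data


@dataclass
class TrainingPair:
    raw_frames: List[Frame]              # first RAW image data: at least one frame
    scene_vector: List[int]              # scene semantics, the same for every frame
    optical_vectors: List[List[float]]   # optical parameters, one vector per frame
    reference: Frame                     # single denoised reference frame

    def __post_init__(self):
        if not self.raw_frames:
            raise ValueError("first RAW image data needs at least one frame")
        if len(self.optical_vectors) != len(self.raw_frames):
            raise ValueError("expected one optical parameter vector per frame")


# Made-up HDR-style pair: three differently exposed frames, seven scene categories.
hdr_pair = TrainingPair(
    raw_frames=[[[0.1, 0.2]], [[0.3, 0.4]], [[0.6, 0.8]]],
    scene_vector=[0, 0, 0, 0, 0, 0, 1],
    optical_vectors=[[0.01], [0.03], [0.09]],
    reference=[[0.3, 0.45]],
)
print(len(hdr_pair.raw_frames))
```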
[0041] In some embodiments of this application, step 110 may specifically include: acquiring the first RAW image data and the optical parameters corresponding to the acquisition of the first RAW image data; encoding the optical parameters corresponding to the first RAW image data to obtain a sample optical parameter vector; downsampling the first RAW image data to obtain a preview image corresponding to the first RAW image data; inputting the preview image into a semantic recognition model to output at least one confidence score; determining the scene information when the first RAW image data was acquired based on the at least one confidence score and a preset confidence threshold; and performing one-hot encoding on the scene information to obtain a sample scene semantic vector corresponding to the first RAW image data.
[0042] The preview image corresponding to the first RAW image data can be obtained by downsampling the first RAW image data and performing image processing. For example, the first RAW image data can be downsampled to a resolution of 256×256, and then image processing can be performed to obtain the preview image.
[0043] It should be noted that if the first RAW image data is a single frame, it can be directly downsampled and processed to obtain a preview image. If the first RAW image data consists of multiple frames of high dynamic range (HDR) RAW image data, the frame with the standard exposure can be downsampled and processed to obtain a preview image.
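A rough sketch of the downsampling step follows. The text only states that the RAW data is downsampled (for example to 256×256) and then image-processed; the RGGB Bayer layout, the naive per-cell colour reconstruction, and block averaging used below are assumptions chosen for illustration, and `raw_to_preview` is a hypothetical name. For multi-frame HDR data, the standard-exposure frame would be passed in.

```python
import numpy as np


def raw_to_preview(raw, size=256):
    """Reduce an RGGB Bayer mosaic (H x W) to a small RGB preview (toy sketch)."""
    h, w = raw.shape
    h, w = h - h % 2, w - w % 2            # crop to whole 2x2 Bayer cells
    raw = raw[:h, :w].astype(np.float32)
    r = raw[0::2, 0::2]
    g = (raw[0::2, 1::2] + raw[1::2, 0::2]) / 2.0
    b = raw[1::2, 1::2]
    rgb = np.stack([r, g, b], axis=-1)     # half-resolution RGB image
    # Block-average by an integer factor so the result is at least `size` wide.
    fy = max(rgb.shape[0] // size, 1)
    fx = max(rgb.shape[1] // size, 1)
    hh = (rgb.shape[0] // fy) * fy
    ww = (rgb.shape[1] // fx) * fx
    rgb = rgb[:hh, :ww]
    small = rgb.reshape(hh // fy, fy, ww // fx, fx, 3).mean(axis=(1, 3))
    return small[:size, :size]             # crop any remainder to size x size


preview = raw_to_preview(np.ones((1024, 1024)))
print(preview.shape)
```

Inputs smaller than twice the target size are returned at their half resolution rather than upsampled; a production pipeline would use the ISP steps described in the text instead of this shortcut.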
[0044] To make it easier to understand, the following details the process from when the camera captures RAW format image data to when the preview image is displayed on the preview interface: When the camera is on, it continuously captures RAW format image data and sends it to the image signal processor (ISP) for image processing such as de-mosaicing, white balance calibration, noise reduction, sharpening, and dynamic range optimization to obtain RGB format image data. The RGB format image data is then converted to YUV format image data, which is then sent to the image data cache space for caching. The display module of the electronic device can retrieve the cached YUV format image data from the image data cache space in real time, render and display the YUV format image data in real time, thus displaying the preview image in real time on the photo preview interface.
[0045] A semantic recognition model can be a pre-trained model that identifies the semantic scene of a preview image. This semantic recognition model can be, for example, a deep learning-based neural network model, a support vector machine model, or a decision tree model. For instance, MobileNetV3 could be used as the backbone network for this semantic recognition model.
[0046] The preset semantic vector can be a vector of pre-set semantic information. For example, in the case where the first RAW image data is a single frame, the preset semantic vector could include vectors for low-light night scenes, outdoor bright light scenes, indoor portrait scenes, indoor still life scenes, outdoor landscape scenes, and motion scenes. When the first RAW image data consists of multiple frames used to obtain high dynamic range images, the preset semantic vector could include vectors for low-light night scenes, outdoor bright light scenes, indoor portrait scenes, indoor still life scenes, outdoor landscape scenes, motion scenes, and high-contrast scenes.
[0047] The preset confidence threshold is a confidence threshold set in advance. For example, it can take a value in the range 0.6 to 1, such as 0.6. The specific value can be selected by the user as needed and is not limited in this embodiment.
[0048] In some embodiments of this application, the first RAW image data is directly acquired by the sensor of the acquisition device, and then the optical parameters at the time of acquisition of the first RAW image data are obtained through the hardware API of the acquisition device, or the optical parameters at the time of acquisition of the first RAW image data are extracted from the RAW image file through the EXIF parsing tool.
[0049] It should be noted that, when the first RAW image data consists of multiple frames used to obtain high dynamic range images, the optical parameters differ from frame to frame. Therefore, the optical parameters must be obtained separately for each frame of RAW image data.
[0050] When the first RAW image data is a single frame, its optical parameters can be encoded to obtain a sample optical parameter vector. When the first RAW image data consists of multiple frames used to obtain high dynamic range images, the optical parameters of each frame are encoded to obtain a sample optical parameter vector corresponding to each frame.
[0051] When the first RAW image data is a single frame, it can be downsampled to obtain a corresponding preview image. This preview image is then input into the semantic recognition model to obtain the confidence levels of the preview image belonging to the different preset scene semantics. Based on at least one confidence level and the preset confidence threshold, the scene information when the first RAW image data was acquired can be determined. This scene information is then one-hot encoded; specifically, the main scene information can be encoded as the value "1" and the other scene information as the value "0". This yields the sample scene semantic vector of the frame of RAW image data. Continuing the example above, if six scene categories are preset, the semantic recognition model outputs six confidence levels for the preview image, each corresponding to one scene category. The scene information is determined by comparing these six confidence levels with the preset confidence threshold and is then one-hot encoded to obtain the sample scene semantic vector of the frame of RAW image data.
[0052] When the first RAW image data consists of multiple frames used to obtain high dynamic range images, the standard-exposure frame can be downsampled to obtain the corresponding preview image. This preview image is then input into the semantic recognition model to obtain the confidence levels of the preview image belonging to the different preset scene semantics. Based on at least one confidence level and the preset confidence threshold, the scene information at the time of acquiring the first RAW image data can be determined. This scene information is then one-hot encoded to obtain the sample scene semantic vector corresponding to the first RAW image data. Continuing the example above, if seven scene categories are preset, the semantic recognition model outputs seven confidence levels for the preview image, each corresponding to one scene category. The scene information is determined by comparing these seven confidence levels with the preset confidence threshold and is then one-hot encoded to obtain the sample scene semantic vector.
[0053] It should be noted that when determining scene information based on the confidence level output by the semantic recognition model and performing one-hot encoding on the scene information, a confidence threshold can be pre-set. If the confidence level of a certain semantic scene output by the semantic recognition model is greater than the confidence threshold, the preview image is determined to belong to that semantic scene, and the scene information is encoded as "1". Semantic scenes corresponding to other confidence levels are encoded as "0". If multiple confidence levels are greater than the confidence threshold, the semantic scene corresponding to the highest confidence level is determined to be the main scene. In subsequent encoding, the main scene is encoded as "1", and semantic scenes corresponding to other confidence levels are encoded as "0". If all confidence levels output by the semantic recognition model are less than the confidence threshold, the preview image is determined to be an unknown scene, and the semantic scene corresponding to each confidence level is encoded as "0". Thus, when the first RAW image data is a single frame of RAW image data, the sample scene semantic vector is a 6-dimensional encoding vector. When the first RAW image data is multiple frames of RAW image data used to obtain high dynamic range images, the sample scene semantic vector is a 7-dimensional encoding vector.
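The three cases above (one confidence above the threshold, several above it, none above it) reduce to a single rule: mark the highest-confidence scene if it exceeds the threshold, otherwise emit an all-zero vector. A minimal sketch follows; the function name, the category labels, and the sample confidences are hypothetical, and treating a confidence exactly equal to the threshold as "not greater" is an assumption, since the text does not cover that case.

```python
# Hypothetical labels for the six single-frame scene categories.
SCENES = ["low_light_night", "outdoor_strong_light", "indoor_portrait",
          "indoor_still_life", "outdoor_landscape", "motion"]


def scene_semantic_vector(confidences, threshold=0.6):
    """One-hot encode the main scene, or return all zeros for an unknown scene."""
    if not confidences:
        raise ValueError("at least one confidence level is required")
    best = max(range(len(confidences)), key=lambda i: confidences[i])
    vector = [0] * len(confidences)
    if confidences[best] > threshold:   # only a confident main scene is marked
        vector[best] = 1
    return vector


# Two scenes exceed 0.6; the higher one (index 3) becomes the main scene.
two_above = scene_semantic_vector([0.05, 0.10, 0.70, 0.90, 0.20, 0.30])
# No scene exceeds 0.6, so the preview is treated as an unknown scene.
unknown = scene_semantic_vector([0.10, 0.20, 0.30, 0.10, 0.10, 0.20])
print(two_above, unknown)
```

Passing seven confidence levels yields the 7-dimensional vector used for multi-frame HDR data without any change to the function.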
[0054] In the embodiments of this application, discrete or continuous optical parameters are converted into numerical vectors to facilitate neural network processing. Furthermore, by downsampling the first RAW image data for subsequent image processing, instead of directly using the first RAW image data, the amount of data processed in subsequent data processing is reduced, computational costs are lowered, and the effectiveness of the semantic recognition model in scene recognition of the preview image is improved, thereby increasing the efficiency of determining the semantic vector of the sample scene.
[0055] In some embodiments of this application, since the aforementioned optical parameters may include device parameters of the acquisition device and exposure control parameters, the two can be encoded separately. Specifically, encoding the optical parameters corresponding to the first RAW image data to obtain a sample optical parameter vector may include: performing one-hot encoding on the device parameters corresponding to the first RAW image data to obtain a device parameter encoding vector; performing normalized encoding on the exposure control parameters corresponding to the first RAW image data to obtain an exposure control parameter encoding vector; and concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain a sample optical parameter vector.
[0056] The device parameter encoding vector can be a vector obtained by one-hot encoding of the device parameters corresponding to the first RAW image data. The exposure control parameter encoding vector can be a vector obtained by normalizing the exposure control parameters corresponding to the first RAW image data. Specifically, the normalization encoding can be min-max normalization encoding.
[0057] It should be noted that one-hot encoding converts device parameters into binary vectors. For example, when there are three sensor models, the sensor model is encoded as a 3-dimensional binary code.
[0058] The exposure control parameters corresponding to the first RAW image data can be normalized and encoded to obtain the exposure control parameter encoding vector, specifically according to the following formula (1):
$$\text{normalized value} = \frac{\text{original parameter value} - \text{minimum parameter value}}{\text{maximum parameter value} - \text{minimum parameter value}} \tag{1}$$
[0059] In the above formula (1), the original parameter value is the value of the exposure control parameter, and the maximum and minimum parameter values can be determined from a preset parameter range of mainstream acquisition devices. For example, if the ISO corresponding to the acquired first RAW image data is 2000 and the ISO range of current mainstream acquisition devices is 100 to 64000, then for the ISO parameter the original parameter value is 2000, the maximum parameter value is 64000, and the minimum parameter value is 100.
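As a worked instance of the min-max normalization described for formula (1), the ISO example can be computed as in the sketch below. The function name `min_max_normalize` is an assumption for illustration. Values outside the preset range are not clamped here, since the text does not say how they are handled.

```python
def min_max_normalize(value, minimum, maximum):
    """Min-max normalization: maps [minimum, maximum] onto [0, 1]."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    return (value - minimum) / (maximum - minimum)


# ISO example from the text: ISO 2000 within the range 100 to 64000.
iso_code = min_max_normalize(2000, 100, 64000)
print(round(iso_code, 4))  # 0.0297, i.e. 1900 / 63900
```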
[0060] After obtaining the device parameter encoding vector and the exposure control parameter encoding vector, they can be concatenated to obtain the sample optical parameter vector. In the example above, if the first RAW image data is a single frame, the sample optical parameter vector is 32-dimensional; if the first RAW image data includes multiple frames used to obtain high dynamic range images, the sample optical parameter vector is 48-dimensional. The sample optical parameter vector can be determined based on the number of optical parameters.
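The concatenation step can be sketched as follows. The sensor model names, the exposure-time range, and the choice of only two exposure control parameters are hypothetical; the resulting vector is therefore much shorter than the 32-dimensional or 48-dimensional vectors mentioned above, whose exact composition is not reproduced here.

```python
def one_hot(value, categories):
    """Binary vector with a single 1 at the position of `value`."""
    return [1 if value == c else 0 for c in categories]


def min_max(value, minimum, maximum):
    return (value - minimum) / (maximum - minimum)


# Hypothetical device parameter categories and exposure parameter ranges.
SENSOR_MODELS = ["sensor_a", "sensor_b", "sensor_c"]
EXPOSURE_RANGES = {
    "iso": (100, 64000),               # range given in the text
    "exposure_time_ms": (0.1, 1000.0)  # assumed range for illustration
}


def encode_optical_parameters(sensor_model, exposure):
    """Concatenate the device code and the exposure control code."""
    device_code = one_hot(sensor_model, SENSOR_MODELS)
    exposure_code = [
        min_max(exposure[name], *EXPOSURE_RANGES[name])
        for name in EXPOSURE_RANGES
    ]
    return device_code + exposure_code


vector = encode_optical_parameters(
    "sensor_b", {"iso": 2000, "exposure_time_ms": 0.1}
)
print(vector)
```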
[0061] In the embodiments of this application, the accuracy of optical parameter encoding is improved by dividing the optical parameters into device parameters and exposure control parameters and encoding them separately, rather than encoding all optical parameters with a single uniform encoding method.
[0062] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, the multiple frames of RAW image data can be multiple frames of high dynamic range RAW image data. Before concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector, the method described above may further include: determining the exposure time weight vector corresponding to each frame of RAW image data based on the exposure time corresponding to each frame of RAW image data; concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector includes: concatenating the device parameter encoding vector, the exposure control parameter encoding vector, and the exposure time weight vector corresponding to each frame of RAW image data to obtain the sample optical parameter vector.
[0063] For each frame of RAW image data, the corresponding exposure time weight vector can be determined based on the exposure time of that frame. Specifically, the longer the exposure time, the higher the weight.
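[0064] Specifically, for each frame of high dynamic range RAW image data, the exposure time weight vector of that frame can be determined based on the following formula (2):
$$\text{exposure time weight} = \frac{\text{exposure time of the frame}}{\text{maximum exposure time among the multiple frames}} \tag{2}$$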
[0065] In the above formula (2), for a certain frame of RAW image data, its exposure time weight vector can be the ratio of the exposure time of the RAW image data of that frame to the maximum exposure time of multiple frames of RAW image data used to obtain high dynamic range images.
[0066] Then, for each frame of RAW image data, the device parameter encoding vector, exposure control parameter encoding vector, and exposure time weight vector corresponding to that frame of RAW image data can be concatenated to obtain the sample optical parameter vector corresponding to that frame of RAW image data.
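The ratio described for formula (2) and the per-frame concatenation can be sketched as below. Treating the exposure time weight vector as a single-element vector, and the names `exposure_time_weights` and `per_frame_optical_vectors`, are assumptions for illustration.

```python
def exposure_time_weights(exposure_times):
    """Weight of each frame = its exposure time / the longest exposure time."""
    longest = max(exposure_times)
    if longest <= 0:
        raise ValueError("exposure times must be positive")
    return [t / longest for t in exposure_times]


def per_frame_optical_vectors(device_code, exposure_codes, exposure_times):
    """Concatenate device code, exposure code and weight for every frame."""
    weights = exposure_time_weights(exposure_times)
    return [
        device_code + code + [weight]
        for code, weight in zip(exposure_codes, weights)
    ]


# Hypothetical dark, standard and bright frames (exposure times in ms).
weights = exposure_time_weights([10, 40, 160])
print(weights)  # [0.0625, 0.25, 1.0]

vectors = per_frame_optical_vectors(
    [0, 1, 0], [[0.1], [0.2], [0.3]], [10, 40, 160]
)
print(vectors[0])  # [0, 1, 0, 0.1, 0.0625]
```

The frame with the longest exposure always receives weight 1, which matches the statement that a longer exposure time gives a higher weight.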
[0067] In the embodiments of this application, when the first RAW image data consists of multiple frames used to obtain high dynamic range images, an exposure time weight vector is added. This facilitates subsequent fusion of multiple RAW image data frames based on this exposure time weight vector, avoiding the problem of highlight clipping or excessive noise in shadows caused by treating all RAW image data frames with equal weight. Furthermore, through the joint judgment of optical parameters, multi-frame exposure information, and scene semantics, it can adaptively match different imaging conditions and exposure sequences without manual parameter adjustment, solving the problems of weak generalization ability and limited scene adaptation in existing models.
[0068] In some embodiments of this application, after step 110, the method described above may further include: preprocessing the first RAW image data to obtain second RAW image data; normalizing the second RAW image data to obtain third RAW image data; dividing the third RAW image data into a preset number of RAW image data blocks to obtain M RAW image data blocks; the step of extracting the first feature vector of the first RAW image data may specifically include: extracting the first feature vector of each RAW image data block in the first RAW image data.
[0069] The second RAW image data can be obtained by preprocessing the first RAW image data, such as black level correction and bad pixel repair. The third RAW image data can be obtained by normalizing the second RAW image data.
[0070] In some embodiments of this application, when the first RAW image data is a single frame of RAW image data, the first RAW image data can be processed in the following manner to obtain M RAW image data blocks, where M is a positive integer greater than 1. When the first RAW image data includes multiple frames of RAW image data used to obtain high dynamic range images, each frame of RAW image data can be processed using the same method as for a single frame of RAW image data to obtain M RAW image data blocks corresponding to each frame of RAW image data.
[0071] The following describes the processing procedure for each frame of RAW image data: The frame of RAW image data is preprocessed to obtain the second RAW image data. Then, the second RAW image data is normalized to obtain the third RAW image data. The third RAW image data is then divided into a predetermined number of RAW image data blocks, resulting in M RAW image data blocks. When extracting the first feature vector of the first RAW image data, each RAW image data block is processed to extract the feature vector. For example, when dividing the data into M RAW image data blocks, it can be divided into 256×256 pixel image data blocks, with a certain number of pixels overlapping between blocks, such as 16 pixels, to avoid the loss of edge pixels.
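One possible way to lay out 256×256 blocks with a 16-pixel overlap is sketched below. The text only gives the block size and overlap as examples; placing the last block flush with the image border, and deriving M from the image size, are assumptions of this sketch. An image smaller than one block would need padding, which is not shown.

```python
def tile_origins(length, tile=256, overlap=16):
    """Start offsets of overlapping tiles along one image axis."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be in [0, tile)")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile ends exactly at the border
    return origins


def tile_boxes(height, width, tile=256, overlap=16):
    """(top, left) corner of every block; M is the length of this list."""
    return [
        (top, left)
        for top in tile_origins(height, tile, overlap)
        for left in tile_origins(width, tile, overlap)
    ]


boxes = tile_boxes(496, 496)
print(len(boxes))  # 4 blocks
print(boxes)       # [(0, 0), (0, 240), (240, 0), (240, 240)]
```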
[0072] It should be noted that dividing the third RAW image data into a preset number of RAW image data blocks is optional and not mandatory. Alternatively, the third RAW image data can be directly input into the first feature extraction network to extract the first feature vector without dividing it into image data blocks. Whether or not to divide the third RAW image data into image data blocks can be set according to user needs and is not limited in this embodiment. In the following embodiments, dividing the third RAW image data into image data blocks will be used as an example for explanation.
[0073] It should be noted that the number of RAW image data blocks is the same for each frame of RAW image data, that is, the value of M is the same for each frame of RAW image data.
[0074] In the embodiments of this application, by preprocessing and normalizing the first RAW image data, interference from undesirable pixels in the first RAW image data to subsequent noise reduction can be avoided. By dividing the third RAW image data into a preset number of RAW image data blocks, resulting in M RAW image data blocks, subsequent processing can be performed on the RAW image data blocks rather than on the entire third RAW image data, reducing computational overhead and improving computational efficiency.
[0075] In some embodiments of this application, the preprocessing of the first RAW image data to obtain the second RAW image data may specifically include: subtracting the black level value of the acquisition device from the pixel value of each pixel in the first RAW image data to obtain the fourth RAW image data; and repairing the noise in the fourth RAW image data to obtain the second RAW image data.
[0076] The fourth RAW image data can be obtained by subtracting the black level value of the acquisition device from the pixel value of each pixel in the first RAW image data.
[0077] In some embodiments of this application, the fourth RAW image data can be obtained by subtracting the black level value of the acquisition device from the pixel value of each pixel in the first RAW image data. This removes from the fourth RAW image data the reference noise caused by the sensor's dark current. Then, noise in the fourth RAW image data is repaired to obtain the second RAW image data. Specifically, the neighborhood difference method can be used to repair the noise in the fourth RAW image data. Here, noise refers to bad pixels, that is, pixels with abnormally high or low pixel values.
[0078] In the embodiments of this application, by subtracting the black level value of the acquisition device from the pixel value of each pixel in the first RAW image data, the reference noise caused by the dark current of the sensor can be eliminated. In addition, by repairing the noise in the fourth RAW image data, the interference of noise on subsequent noise reduction can be avoided, thereby improving the accuracy of noise reduction.
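The two preprocessing steps can be sketched as follows. The text does not detail the neighborhood method, so this sketch substitutes a simple rule: a pixel that deviates from the median of its 3×3 neighborhood by more than a threshold is replaced by that median. That rule, the threshold, the clamping at zero, and the fact that the Bayer color pattern is ignored are all assumptions for illustration.

```python
from statistics import median


def subtract_black_level(image, black_level):
    """Subtract the black level from every pixel (clamped at 0, assumed)."""
    return [[max(p - black_level, 0) for p in row] for row in image]


def repair_bad_pixels(image, threshold):
    """Replace pixels far from the median of their 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    repaired = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            neighbors = [
                image[ny][nx]
                for ny in range(max(y - 1, 0), min(y + 2, height))
                for nx in range(max(x - 1, 0), min(x + 2, width))
                if (ny, nx) != (y, x)
            ]
            if not neighbors:
                continue
            reference = median(neighbors)
            if abs(image[y][x] - reference) > threshold:
                repaired[y][x] = reference
    return repaired


# Hypothetical 3x3 patch with one abnormally bright pixel in the center.
first_raw = [[100, 100, 100], [100, 1000, 100], [100, 100, 100]]
fourth_raw = subtract_black_level(first_raw, black_level=64)
second_raw = repair_bad_pixels(fourth_raw, threshold=200)
print(second_raw)
```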
[0079] In some embodiments of this application, the normalization process of the second RAW image data to obtain the third RAW image data may specifically include: calculating a first difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device; calculating a second difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device; obtaining the normalized pixel value of the second RAW image data based on the first and second differences; and obtaining the third RAW image data based on the normalized pixel value of the second RAW image data.
[0080] The first difference can be the difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device. The second difference can be the difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device.
[0081] In some embodiments of this application, a first difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device can be calculated. Then, a second difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device can be calculated. Based on the first difference and the second difference, the normalized pixel value of the second RAW image data can be obtained as the ratio of the first difference to the second difference, as shown in the following formula (3). Then, based on the normalized pixel value of the second RAW image data, the third RAW image data can be obtained.
$$\text{normalized pixel value} = \frac{\text{original pixel value} - \text{black level value}}{\text{maximum pixel value} - \text{black level value}} \tag{3}$$
[0082] In formula (3) above, the original pixel value is the pixel value of each pixel in the second RAW image data, and the original pixel value minus the black level value is the first difference. The maximum pixel value is the maximum pixel value in the second RAW image data and is determined by the sensor's bit depth. For example, if the sensor's bit depth is 10 bits, the maximum pixel value is 1023. The maximum pixel value minus the black level value is the second difference.
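The normalization described for formula (3) can be sketched as the ratio of the first difference to the second difference. The function name `normalize_raw` and the black level value of 64 are assumptions for illustration; the 10-bit example follows the text.

```python
def normalize_raw(image, black_level, bit_depth):
    """Normalize each pixel as (pixel - black) / (max_value - black)."""
    max_value = (1 << bit_depth) - 1             # e.g. 1023 for 10 bits
    second_difference = max_value - black_level
    if second_difference <= 0:
        raise ValueError("black level must be below the maximum pixel value")
    return [
        [(p - black_level) / second_difference for p in row]  # first / second
        for row in image
    ]


# 10-bit sensor with a hypothetical black level of 64.
third_raw = normalize_raw([[64, 1023]], black_level=64, bit_depth=10)
print(third_raw)  # [[0.0, 1.0]]
```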
[0083] In the embodiments of this application, the second RAW image data is normalized by comparing the pixel value of each pixel in the second RAW image data with the black level value of the acquisition device, thereby eliminating the reference noise offset between different sensors.
[0084] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data used to obtain high dynamic range images, after preprocessing the first RAW image data to obtain the second RAW image data, the method described above may further include: performing feature registration on the second RAW image data corresponding to the multiple frames of RAW image data respectively to obtain registered multiple frames of second RAW image data; the normalization processing of the second RAW image data to obtain the third RAW image data may specifically include: performing normalization processing on each frame of second RAW image data in the registered multiple frames of second RAW image data to obtain the third RAW image data corresponding to the N frames of second RAW image data respectively.
[0085] In some embodiments of this application, after preprocessing the first RAW image data to obtain the second RAW image data, feature registration can be performed on the second RAW image data corresponding to the RAW image data of multiple frames used to obtain high dynamic range images. Specifically, the feature points of the standard exposure frame can be extracted using the scale-invariant feature transform algorithm. Based on the standard exposure frame, feature matching and geometric transformation are performed on the dark exposure frame and the bright exposure frame to obtain the registered multi-frame second RAW image data.
[0086] Then, each frame of the registered second RAW image data is normalized to obtain the third RAW image data corresponding to the N frames of second RAW image data.
[0087] It should be noted that before performing feature registration on the second RAW image data corresponding to the multiple RAW image data used to obtain the high dynamic range image, the sharpness of each RAW image data frame can be calculated. Specifically, the sharpness of each RAW image data frame can be calculated using the variance method; that is, for each RAW image data frame, its sharpness can be the variance of its image grayscale values. Then, a preset sharpness threshold is set, and RAW image data frames with sharpness lower than the preset sharpness threshold are removed, retaining only 3 valid RAW image data frames for subsequent processing. These 3 valid RAW image data frames should include a low-exposure image data frame, a standard-exposure image data frame, and a high-exposure image data frame.
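The variance-based sharpness screening can be sketched as below. The threshold value and the tiny example frames are hypothetical, and the further selection of exactly three frames (low, standard, and high exposure) is not shown.

```python
from statistics import pvariance


def sharpness(gray):
    """Sharpness of a frame = variance of its grayscale values."""
    return pvariance([p for row in gray for p in row])


def keep_sharp_frames(frames, threshold):
    """Drop frames whose sharpness is lower than the preset threshold."""
    return [frame for frame in frames if sharpness(frame) >= threshold]


flat_frame = [[5, 5], [5, 5]]          # no texture at all
textured_frame = [[0, 10], [10, 0]]    # strong local contrast
kept = keep_sharp_frames([flat_frame, textured_frame], threshold=1)
print(sharpness(flat_frame), sharpness(textured_frame))
print(len(kept))  # only the textured frame survives
```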
[0088] In the embodiments of this application, when the first RAW image data is multi-frame RAW image data, feature registration is performed on the second RAW image data corresponding to the multi-frame RAW image data respectively, so as to eliminate inter-frame offset caused by hand shake or device displacement.
[0089] Step 120: Extract the first feature vector of the first RAW image data, the second feature vector of the sample optical parameter vector, and the third feature vector of the sample scene semantic vector.
[0090] The first feature vector can be the feature vector extracted from the first RAW image data. The second feature vector can be the feature vector extracted from the sample optical parameter vector. The third feature vector can be the feature vector extracted from the sample scene semantic vector.
[0091] In some embodiments of this application, step 120 may specifically include: The first RAW image data is input into the first feature extraction network to obtain the first feature vector; the sample optical parameter vector is input into the second feature extraction network to obtain the second feature vector; and the sample scene semantic vector is input into the third feature extraction network to obtain the third feature vector.
[0092] The first feature extraction network can be a network used to extract feature vectors from the first RAW image data. The second feature extraction network can be a network used to extract feature vectors from the sample optical parameter vectors. The third feature extraction network can be a network used to extract feature vectors from the sample scene semantic vectors.
[0093] In some embodiments of this application, the first RAW image data can be input into a first feature extraction network to obtain a first feature vector. Specifically, a 256×256 RAW image data block can be input into the first feature extraction network. This first feature extraction network can be a lightweight deep learning model based on the Transformer architecture, such as the Vision Transformer. The Vision Transformer network contains 12 Transformer encoding layers. Each encoder layer consists of a multi-head self-attention mechanism (8 attention heads, head dimension 64), layer normalization, and a feedforward network (hidden layer dimension 1024). The RAW image data block is converted into sequence features through 16×16 patch embedding (embedding dimension 512), ultimately outputting a pixel-level spatial feature vector of dimension 512, i.e., the first feature vector. This first feature vector contains information such as noise distribution and edge details.
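A quick arithmetic check shows that the figures given above are mutually consistent: a 256×256 block split into 16×16 patches yields 256 patch tokens, and 8 heads of dimension 64 match the embedding dimension of 512. The helper name `patch_sequence_length` is an assumption for illustration.

```python
def patch_sequence_length(block_size=256, patch_size=16):
    """Number of patch tokens produced from one square image block."""
    if block_size % patch_size != 0:
        raise ValueError("block size must be a multiple of the patch size")
    per_side = block_size // patch_size
    return per_side * per_side


tokens = patch_sequence_length()   # 16 patches per side
attention_width = 8 * 64           # heads x head dimension
print(tokens, attention_width)     # 256 512
```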
[0094] The sample optical parameter vector is input into the second feature extraction network to obtain the second feature vector. Specifically, the second feature extraction network can be a three-layer fully connected network that takes the sample optical parameter vector as its input. In this three-layer fully connected network, the first layer has an output dimension of 256 and uses the rectified linear unit (ReLU) as its activation function; the second layer has an output dimension of 512 and uses ReLU as its activation function; and the third layer has an output dimension of 512 and no activation function. This network maps the low-dimensional optical parameters to a high-dimensional feature vector consistent with the feature dimension of the RAW image.
[0095] Thus, the ReLU activations of the first two layers provide nonlinear representation capability and prevent the mapping from collapsing into a purely linear one, while the last layer has no activation function, which prevents excessive compression of the high-dimensional features and alleviates the vanishing gradient problem. In this way, three fully connected layers map the low-dimensional optical parameters to high-dimensional feature vectors consistent with the feature dimension of the RAW image, achieving multimodal feature dimension alignment. This solves the problem of poor fusion results caused by dimensional mismatch when fusing optical parameters and RAW image features in existing technologies. Furthermore, by extracting features progressively through three fully connected layers, the feature dimension is raised step by step from low to intermediate to high, which avoids the information loss caused by raising the dimension of the low-dimensional optical parameters in a single step.
[0096] It should be noted that when the first RAW image data consists of multiple frames of high dynamic range RAW image data, the sample optical parameter vector input to the second feature extraction network is a 48-dimensional sample optical parameter vector that includes the exposure time weight vector. When the first RAW image data consists of a single frame of RAW image data, the sample optical parameter vector input to the second feature extraction network is a 32-dimensional sample optical parameter vector that does not include the exposure time weight vector.
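The layer sizes of the second feature extraction network can be illustrated with a plain-Python forward pass. The random weight initialization, the seed, and the function names are assumptions for illustration; a trained network would use learned weights, and a practical implementation would normally use a deep learning framework.

```python
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(v, 0.0) for v in x]


def init_layer(rng, n_in, n_out):
    """Random weights (assumed scheme) and zero biases for one layer."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_optical_mlp(input_dim, seed=0):
    """Three layers: input_dim -> 256 -> 512 -> 512."""
    rng = random.Random(seed)
    return [init_layer(rng, input_dim, 256),
            init_layer(rng, 256, 512),
            init_layer(rng, 512, 512)]


def forward(layers, x):
    hidden = relu(linear(x, *layers[0]))   # layer 1 with ReLU
    hidden = relu(linear(hidden, *layers[1]))  # layer 2 with ReLU
    return linear(hidden, *layers[2])      # layer 3 without activation


layers = build_optical_mlp(32)             # single-frame case: 32 inputs
features = forward(layers, [0.5] * 32)
print(len(features))  # 512
```

For the multi-frame case, the same sketch applies with `build_optical_mlp(48)`; only the input width of the first layer changes.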
[0097] The sample scene semantic vector is input into the third feature extraction network to obtain the third feature vector. This third feature extraction network can be a feature extractor consisting of two convolutional layers and one fully connected layer, with the sample scene semantic vector as input. The first convolutional layer is a 1×1 convolutional layer with 64 output channels, a stride of 1, padding of 0, and a ReLU activation function. The second convolutional layer is a 1×1 convolutional layer with 128 output channels, a stride of 1, padding of 0, and a ReLU activation function. The fully connected layer has an output dimension of 512 and no activation function. The final output is a feature vector with the same feature dimension as the RAW image data.
[0098] Thus, because the stride and padding of the 1×1 convolutions are chosen to suit the low dimensionality of the sample scene semantic vector, redundant computation and wasted feature dimensions are avoided, which addresses the overfitting of conventional 1×1 convolution on low-dimensional vectors. The two-step mapping of "convolution to expand channels + full connection to raise dimensions" preserves the category association features of the semantic vector and is more semantically representative than raising the dimension directly with a conventional single fully connected layer. Finally, the output scene semantic feature vector, which has the same feature dimension as the RAW image features, combines with the output of the optical parameter feature branch to form multimodal features of unified dimension, solving the problems of inconsistent feature dimensions and difficult fusion in existing technologies.
[0099] In the embodiments of this application, feature vectors of the first RAW image data, sample optical parameter vectors, and sample scene semantic vectors are extracted separately using different feature extraction networks, instead of using a unified feature extraction network to extract feature vectors of the first RAW image data, sample optical parameter vectors, and sample scene semantic vectors. This improves the accuracy of obtaining feature vectors of the first RAW image data, sample optical parameter vectors, and sample scene semantic vectors.
[0100] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, these multiple frames of RAW image data can be multiple frames of RAW image data used to obtain high dynamic range images. In this case, it is necessary to fuse the features of the multiple frames of RAW image data together. Therefore, the Vision Transformer network described above cannot be used for feature extraction of multiple frames of RAW image data, and a feature extraction network used for high dynamic range RAW image data is required. That is, when the first RAW image data includes multiple frames of RAW image data used to obtain high dynamic range images, the first feature extraction network can include a temporal attention layer and an encoding layer. Here, the encoding layer can be the 12-layer Transformer encoding layer described above for processing single-frame RAW image data.
[0101] When the first RAW image data includes multiple frames of RAW image data, the step of inputting the first RAW image data into the first feature extraction network to obtain the first feature vector may specifically include: inputting the multiple frames of RAW image data into the temporal attention layer of the first feature extraction network, fusing the features of the RAW image data according to the acquisition time sequence of each frame of RAW image data to obtain a fused feature map; inputting the fused feature map into the encoding layer, and extracting the features from the fused feature map to obtain the first feature vector.
[0102] In some embodiments of this application, when the first RAW image data consists of multiple frames used to obtain high dynamic range images, the first feature extraction network can employ a temporal attention layer followed by the Vision Transformer network as the feature extractor.
[0103] Then, the multi-frame RAW image data is input into the temporal attention layer of the first feature extraction network. According to the acquisition time sequence of each frame of RAW image data, the features of the multi-frame RAW image data are fused to obtain a fused feature map. Then, the fused feature map is input into the 12-layer Transformer encoding layer to extract the features from the fused feature map to obtain the first feature vector.
[0104] In the embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, a temporal attention mechanism is used to aggregate the details of the multiple frames of RAW image data, solving the problems of insufficient dynamic range in traditional single-frame noise reduction and easy introduction of noise in traditional HDR solutions. In complex high-contrast scenes, the noise suppression rate is improved, while the dynamic range coverage is also improved, resulting in clear details in dark areas and no overexposure in bright areas. Furthermore, the lightweight feature extractor and temporal attention aggregation mechanism in the embodiments of this application avoid the inefficiency caused by simply superimposing multiple frames of data.
[0105] Step 130: Fuse the first feature vector, the second feature vector, and the third feature vector to obtain the fused feature.
[0106] In some embodiments of this application, in order to accurately obtain the fusion features, step 130 may specifically include: Based on the first feature vector, a query vector is determined; based on the second feature vector, a first key vector and a first value vector are determined; based on the third feature vector, a second key vector and a second value vector are determined; based on the similarity between the query vector and the first key vector, a first weight of the second feature vector is determined; based on the similarity between the query vector and the second key vector, a second weight of the third feature vector is determined; based on the first weight and the second weight, the first feature vector, the second feature vector, and the third feature vector are fused to obtain the fused feature.
[0107] The query vector can be determined based on the first feature vector, specifically by using the first feature vector as the query vector.
[0108] The first key vector and the first value vector can be determined based on the second feature vector. Specifically, the first key vector can be obtained by multiplying the second feature vector by matrix A, and the first value vector can be obtained by multiplying the second feature vector by matrix B. Here, matrix A and matrix B are two different matrices, and the specific matrices A and B can be selected by the user.
[0109] The second key vector and the second value vector can be determined based on the third feature vector. Specifically, the second key vector can be obtained by multiplying the third feature vector by matrix C, and the second value vector can be obtained by multiplying the third feature vector by matrix D. Here, matrices C and D are two different matrices, and the specific matrices C and D can be selected by the user.
[0110] The first weight can be determined based on the similarity between the query vector and the first key vector. The second weight can be determined based on the similarity between the query vector and the second key vector.
[0111] In some embodiments of this application, the first feature vector can be used as the query vector (query), and then the first key vector (key1) and the first value vector (value1) can be determined based on the second feature vector. The second key vector (key2) and the second value vector (value2) can be determined based on the third feature vector. Then, the similarity between the query vector and the first key vector, and the similarity between the query vector and the second key vector are determined according to the following formula (4).
[0112] In the above formula (4), the inputs are the query vector and a key vector, where the key vector is either the first key vector key1 or the second key vector key2. When the key vector is key1, the result is the similarity between the query vector and the first key vector; when the key vector is key2, the result is the similarity between the query vector and the second key vector. The remaining term is the feature dimension, whose value here is 512.
[0113] After determining the similarity between the query vector and the first key vector, and the similarity between the query vector and the second key vector, the first weight of the second feature vector can be determined based on the similarity between the query vector and the first key vector; the second weight of the third feature vector can be determined based on the similarity between the query vector and the second key vector. Specifically, there can be a correspondence between similarity and weight. Thus, the first weight W1 of the second feature vector can be determined based on the similarity between the query vector and the first key vector and this correspondence, and the second weight W2 of the third feature vector can be determined based on the similarity between the query vector and the second key vector and this correspondence.
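Formula (4) and the correspondence between similarity and weight are not reproduced in this text, so the following is only a sketch under stated assumptions: similarity is taken to be the scaled dot product commonly used in attention (dot product divided by the square root of the feature dimension), and the correspondence is taken to be a softmax over the two similarities. Both choices, and all function names, are assumptions rather than the method of this application.

```python
import math


def similarity(query, key):
    """Assumed similarity: dot product scaled by sqrt(feature dimension)."""
    if len(query) != len(key):
        raise ValueError("query and key must have the same dimension")
    dot = sum(q * k for q, k in zip(query, key))
    return dot / math.sqrt(len(query))


def weights_from_similarities(sim1, sim2):
    """Assumed correspondence: softmax over the two similarities."""
    largest = max(sim1, sim2)          # subtract the max for stability
    e1 = math.exp(sim1 - largest)
    e2 = math.exp(sim2 - largest)
    return e1 / (e1 + e2), e2 / (e1 + e2)


# Tiny 4-dimensional example; the text uses a feature dimension of 512.
query = [1.0, 1.0, 1.0, 1.0]
key1 = [1.0, 1.0, 1.0, 1.0]
key2 = [0.0, 0.0, 0.0, 0.0]
w1, w2 = weights_from_similarities(similarity(query, key1),
                                   similarity(query, key2))
print(w1 > w2)  # True: key1 is more similar to the query than key2
```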
[0114] Then, based on the first and second weights, the first feature vector, the second feature vector, and the third feature vector are fused to obtain the fused feature.
[0115] In the embodiments of this application, a query vector, a key vector, and a value vector are constructed using a first feature vector, a second feature vector, and a third feature vector. This allows for the alignment of optical parameter feature vectors, scene semantic feature vectors, and RAW image feature vectors through a cross-attention mechanism, avoiding the problems of large differences in feature distributions across different modalities and distortion in similarity calculation, thereby improving the accuracy of feature calculation.
[0116] In some embodiments of this application, the step of fusing the first feature vector, the second feature vector, and the third feature vector based on the first weight and the second weight to obtain the fused feature may specifically include: performing a weighted calculation on the first value vector and the second value vector based on the first weight and the second weight to obtain a weighted vector; and calculating the sum of the weighted vector and the query vector to obtain the fused feature.
[0117] In some embodiments of this application, the first value vector and the second value vector can be weighted based on the first weight and the second weight to obtain a weighted vector. Then, the sum of the weighted vector and the query vector is calculated to obtain the fused feature, as shown in the following formula (5):
$$\text{fused feature} = W_1 \cdot \text{value1} + W_2 \cdot \text{value2} + \text{query} \tag{5}$$
[0118] In the above formula (5), $W_1$ is the first weight, $W_2$ is the second weight, value1 is the first value vector, value2 is the second value vector, and query is the query vector.
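The weighted sum plus residual query described for formula (5) can be sketched element by element as below. The function name `fuse` and the two-dimensional example values are assumptions for illustration; the text uses 512-dimensional vectors.

```python
def fuse(query, value1, value2, w1, w2):
    """Fused feature = w1 * value1 + w2 * value2 + query, per element."""
    if not (len(query) == len(value1) == len(value2)):
        raise ValueError("all vectors must share the same dimension")
    return [w1 * a + w2 * b + q
            for q, a, b in zip(query, value1, value2)]


fused = fuse(query=[1.0, 2.0],
             value1=[2.0, 0.0],
             value2=[0.0, 4.0],
             w1=0.5, w2=0.5)
print(fused)  # [2.0, 4.0]
```

Adding the query back keeps the RAW image features intact even when both weights are small, which is the usual role of a residual connection.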
[0119] In the embodiments of this application, by weighting the first value vector and the second value vector and fusing the first feature vector, the optical parameter feature vector and the scene semantic feature vector can be transformed into the same feature space as the RAW image feature vector, thus avoiding the calculation error caused by the optical parameter feature vector and the scene semantic feature vector being in different feature spaces from the RAW image feature vector.
[0120] Step 140: Input the fused features into the first noise reduction model to obtain the output RAW image data.
[0121] The first noise reduction model can be the noise reduction model to be trained. The first noise reduction model can be a multimodal neural network model, such as a Contrastive Language-Image Pre-training (CLIP) model.
[0122] The output RAW image data can be the noise-removed RAW image data obtained by processing the fusion features based on the first noise reduction model.
[0123] Step 150: Adjust the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain the second noise reduction model.
[0124] The second noise reduction model can be a noise reduction model obtained by training the first noise reduction model.
[0125] In some embodiments of this application, step 150 may specifically include: determining the noise reduction loss function value of the first noise reduction model based on the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data; determining the enhancement loss function value of the first noise reduction model based on the image content of the output RAW image data and the image content of the reference RAW image data; performing a weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model; and adjusting the model parameters of the first noise reduction model based on the target loss function value to obtain the second noise reduction model.
[0126] The noise reduction loss function value can be determined based on the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data. Specifically, it can be the difference between the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data.
[0127] The enhancement loss function value can be determined based on the image content of the output RAW image data and the image content of the reference RAW image data.
[0128] The target loss function value can be the loss function value obtained by weighting the noise reduction loss function value and the enhancement loss function value.
[0129] In some embodiments of this application, the fusion features can be input into the first noise reduction model to obtain the output RAW image data. Then, according to the following formula (6), the difference between the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data is calculated to determine the noise reduction loss function value of the first noise reduction model.
[0130] The quantities in formula (6) are the number of pixels, the pixel value of the i-th pixel in the output RAW image data, and the pixel value of the i-th pixel in the reference RAW image data.
[0131] It should be noted that the number of pixels in the output RAW image data is the same as the number of pixels in the reference RAW image data.
[0132] Then, the enhancement loss function value of the first noise reduction model can be determined based on the image content of the output RAW image data and the image content of the reference RAW image data. The noise reduction loss function value and the enhancement loss function value are then weighted to obtain the target loss function value of the first noise reduction model, as shown in formula (7):
$$L = \alpha L_{\mathrm{dn}} + \beta L_{\mathrm{enh}} \tag{7}$$
[0133] In formula (7), $L$ is the target loss function value, $L_{\mathrm{dn}}$ is the noise reduction loss function value, $L_{\mathrm{enh}}$ is the enhancement loss function value, and $\alpha$ and $\beta$ are the weights of the noise reduction loss function value and the enhancement loss function value, respectively.
[0134] In some embodiments of this application, when adjusting the model parameters of the first noise reduction model according to the target loss function value, the model parameters can be adjusted by comparing the target loss function value with a preset loss function value. Specifically, when the target loss function value is less than the preset loss function value, the adjustment of the model parameters of the first noise reduction model can be considered complete, and the second noise reduction model is obtained.
[0135] In the embodiments of this application, training the first noise reduction model with dual-task loss function values (the noise reduction loss function value and the enhancement loss function value) rather than a single loss function value improves the training accuracy and robustness of the first noise reduction model.
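The loss computation and stopping rule described above can be sketched as follows. This is only an illustration under stated assumptions: the application says the noise reduction loss is based on the per-pixel difference but its exact form is not reproduced here, so the sketch assumes a mean absolute difference. The function names are hypothetical.

```python
from typing import Sequence


def denoise_loss(output_pixels: Sequence[float], reference_pixels: Sequence[float]) -> float:
    """ASSUMPTION: mean absolute per-pixel difference (the source formula may differ)."""
    if len(output_pixels) != len(reference_pixels) or not output_pixels:
        raise ValueError("both images must be non-empty and have the same pixel count")
    total = sum(abs(o - r) for o, r in zip(output_pixels, reference_pixels))
    return total / len(output_pixels)


def target_loss(l_denoise: float, l_enhance: float, w_denoise: float, w_enhance: float) -> float:
    """Weighted sum of the noise reduction loss and the enhancement loss."""
    return w_denoise * l_denoise + w_enhance * l_enhance


def training_finished(loss: float, preset_loss: float) -> bool:
    """Training stops once the target loss drops below the preset loss value."""
    return loss < preset_loss


l_dn = denoise_loss([0.5, 0.25], [0.25, 0.25])
loss = target_loss(l_dn, 0.5, w_denoise=0.5, w_enhance=0.5)
print(l_dn, loss, training_finished(loss, preset_loss=0.5))
```

In a real training loop the same comparison would be evaluated after each parameter update; the sketch only shows the arithmetic.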
[0136] In some embodiments of this application, the enhancement loss function value may include a detail-aware loss function value and a color fidelity loss function value. Specifically, determining the enhancement loss function value of the first denoising model based on the image content of the output RAW image data and the image content of the reference RAW image data may include: determining the detail-aware loss function value of the first denoising model based on the structural similarity index between the output RAW image data and the reference RAW image data; converting the output RAW image data and the reference RAW image data to a preset color space to obtain first color image data and second color image data; determining the color fidelity loss function value of the first denoising model based on the average absolute error of the first color image data and the second color image data in the luminance channel, red-green difference channel, and blue-yellow difference channel; and performing a weighted calculation on the detail-aware loss function value and the color fidelity loss function value to obtain the enhancement loss function value of the first denoising model.
[0137] The value of the detail-aware loss function can be determined based on the structural similarity index between the output RAW image data and the reference RAW image data.
[0138] The first color image data can be the image data obtained by converting the output RAW image data to a preset color space. The second color image data can be the image data obtained by converting the reference RAW image data to the same preset color space. The preset color space can be, for example, the CIELAB color space.
[0139] The color fidelity loss function value can be determined based on the average absolute error of the first color image data and the second color image data in the luminance channel, red-green difference channel, and blue-yellow difference channel.
[0140] In some embodiments of this application, the detail-aware loss function value of the first noise reduction model can be determined according to the structural similarity index between the output RAW image data and the reference RAW image data, as shown in formula (8).
[0141] Formula (8) is expressed in terms of the structural similarity index between the output RAW image data and the reference RAW image data.
[0142] Then, the output RAW image data and the reference RAW image data can be converted to a preset color space to obtain the first color image data and the second color image data. Based on the mean absolute errors of the first and second color image data in the luminance channel, the red-green difference channel, and the blue-yellow difference channel, the color fidelity loss function value of the first noise reduction model can then be determined, as shown in formula (9).
[0143] The three terms in formula (9) are the mean absolute errors of the first color image data and the second color image data in the luminance channel, the red-green difference channel, and the blue-yellow difference channel, respectively.
[0144] Then, the detail-aware loss function value and the color fidelity loss function value can be weighted to obtain the enhancement loss function value of the first noise reduction model, as shown in formula (10):
$$L_{\mathrm{enh}} = 0.5\,L_{\mathrm{detail}} + 0.5\,L_{\mathrm{color}} \tag{10}$$
where $L_{\mathrm{enh}}$ is the enhancement loss function value, $L_{\mathrm{detail}}$ is the detail-aware loss function value, and $L_{\mathrm{color}}$ is the color fidelity loss function value.
[0145] It should be noted that in formula (10), the weights of the detail-aware loss function value and the color fidelity loss function value are both 0.5. In practice, these two weights can be set by the user as needed and are not limited in this embodiment.
[0146] In the embodiments of this application, by decoupling details and color loss, both noise reduction cleanliness and detail preservation can be achieved, while color reproduction can be made accurate and closely match human visual perception, thus improving the noise reduction effect.
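A rough sketch of the enhancement loss is given below. Several points are assumptions rather than the application's definitions: the detail-aware loss is taken as one minus a single-window (global) structural similarity index, the color fidelity loss is taken as the average of the three per-channel mean absolute errors, and the images are assumed to be already converted to CIELAB triples by some other routine. All names are hypothetical.

```python
from typing import Sequence, Tuple

Lab = Tuple[float, float, float]  # (L*, a*, b*) of one pixel


def global_ssim(x: Sequence[float], y: Sequence[float], data_range: float = 1.0) -> float:
    """Single-window SSIM over all pixels (practical SSIM uses local windows)."""
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and equally sized")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) * (a - mean_x) for a in x) / n
    var_y = sum((b - mean_y) * (b - mean_y) for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x * mean_x + mean_y * mean_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


def color_fidelity_loss(lab_out: Sequence[Lab], lab_ref: Sequence[Lab]) -> float:
    """ASSUMPTION: average of the per-channel mean absolute errors."""
    if len(lab_out) != len(lab_ref) or not lab_out:
        raise ValueError("inputs must be non-empty and equally sized")
    n = len(lab_out)
    channel_mae = [
        sum(abs(o[c] - r[c]) for o, r in zip(lab_out, lab_ref)) / n for c in range(3)
    ]
    return sum(channel_mae) / 3


def enhancement_loss(raw_out, raw_ref, lab_out, lab_ref, w_detail=0.5, w_color=0.5) -> float:
    """Weighted sum of the detail-aware loss and the color fidelity loss."""
    detail = 1.0 - global_ssim(raw_out, raw_ref)  # ASSUMPTION: 1 - SSIM
    return w_detail * detail + w_color * color_fidelity_loss(lab_out, lab_ref)
```

Keeping the two terms separate makes it straightforward to change their weights, which the text allows the user to do.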
[0147] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, the reference RAW image data is a single frame of high dynamic range RAW image data.
[0148] Before the weighted calculation of the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model, the above method further includes: determining the high dynamic range fusion loss function value of the first noise reduction model based on the light intensity information of the output RAW image data and the light intensity information of the reference RAW image data. In this case, the step of weighting the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model includes: weighting the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value to obtain the target loss function value of the first noise reduction model.
[0149] The light intensity information in the RAW image data can be information related to the light intensity in the RAW image data. For example, the light intensity information can include dynamic range information in the RAW image data used to characterize different light intensities, and it can also include different brightness information in the RAW image data caused by different light intensities.
[0150] The high dynamic range fusion loss function value can be determined based on the output RAW image data and the reference RAW image data.
[0151] In some embodiments of this application, when the first RAW image data includes multiple frames of high dynamic range RAW image data, the high dynamic range fusion loss function value of the first noise reduction model can be determined based on the output RAW image data and the reference RAW image data. Then, the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value can be weighted to obtain the target loss function value of the first noise reduction model, as shown in formula (11):
$$L = \alpha L_{\mathrm{dn}} + \beta L_{\mathrm{enh}} + \gamma L_{\mathrm{HDR}} \tag{11}$$
[0152] In formula (11), $L$ is the target loss function value, $L_{\mathrm{dn}}$ and $L_{\mathrm{enh}}$ are the noise reduction and enhancement loss function values with weights $\alpha$ and $\beta$, $L_{\mathrm{HDR}}$ is the high dynamic range fusion loss function value, and $\gamma$ is the weight of the high dynamic range fusion loss function value.
[0153] In the embodiments of this application, when the first RAW image data is multi-frame high dynamic range RAW image data, adding the high dynamic range fusion loss function value ensures that the details in the multi-frame high dynamic range RAW image data are not distorted or lost, thereby improving the robustness of the first noise reduction model.
[0154] In some embodiments of this application, determining the high dynamic range fusion loss function value of the first denoising model based on the output RAW image data and the reference RAW image data may specifically include: determining the dynamic range coverage loss function value of the first denoising model based on the dynamic range of the output RAW image data and the dynamic range of the reference RAW image data; determining the detail consistency loss function value of the first denoising model based on the structural similarity index between different brightness regions of the output RAW image data and the corresponding different brightness regions of the reference RAW image data; and performing a weighted calculation on the dynamic range coverage loss function value and the detail consistency loss function value to obtain the high dynamic range fusion loss function value of the first denoising model.
[0155] The dynamic range coverage loss function value can be determined based on the dynamic range of the output RAW image data and the dynamic range of the reference RAW image data.
[0156] The detail consistency loss function value can be determined based on the structural similarity index between different brightness regions of the output RAW image data and the corresponding different brightness regions of the reference RAW image data.
[0157] In some embodiments of this application, the dynamic range coverage loss function value of the first noise reduction model can be determined according to the following formula (12) based on the dynamic range of the output RAW image data and the dynamic range of the reference RAW image data.
[0158] The two quantities in formula (12) are the dynamic range of the output RAW image data and the dynamic range of the reference RAW image data.
[0159] Based on the structural similarity index between different brightness regions of the output RAW image data and the corresponding different brightness regions of the reference RAW image data, the detail consistency loss function value of the first noise reduction model is determined as shown in the following formula (13).
[0160] The three terms in formula (13) are the structural similarity index between the dark region of the output RAW image data and the dark region of the reference RAW image data, the structural similarity index between the mid-brightness region of the output RAW image data and the mid-brightness region of the reference RAW image data, and the structural similarity index between the bright region of the output RAW image data and the bright region of the reference RAW image data.
[0161] It should be noted that, for either the output RAW image data or the reference RAW image data, the brightness regions can be divided according to the brightness values of the different regions. For example, two brightness thresholds can be preset, namely brightness threshold 1 and brightness threshold 2, where brightness threshold 1 is greater than brightness threshold 2. The region with a brightness value higher than brightness threshold 1 is then defined as the bright region, the region with a brightness value between brightness threshold 1 and brightness threshold 2 is defined as the mid-brightness region, and the region with a brightness value lower than brightness threshold 2 is defined as the dark region.
[0162] Then, the dynamic range coverage loss function value and the detail consistency loss function value can be weighted to obtain the high dynamic range fusion loss function value of the first noise reduction model, as shown in formula (14):
$$L_{\mathrm{HDR}} = 0.6\,L_{\mathrm{range}} + 0.4\,L_{\mathrm{consist}} \tag{14}$$
where $L_{\mathrm{HDR}}$ is the high dynamic range fusion loss function value, $L_{\mathrm{range}}$ is the dynamic range coverage loss function value, and $L_{\mathrm{consist}}$ is the detail consistency loss function value.
[0163] In formula (14), the weights of the dynamic range coverage loss function value and the detail consistency loss function value are 0.6 and 0.4, respectively. In practice, these two weights can be set by the user as needed and are not limited in this embodiment.
[0164] In the embodiments of this application, by calculating the dynamic range coverage loss, it can be ensured that the output RAW image data can fully cover the bright and dark details of the high dynamic range, and by using the detail consistency loss, it can be ensured that the details are not lost after HDR fusion, thereby improving the training accuracy of the first noise reduction model.
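The brightness-region split and the final weighting can be sketched as follows. The sketch assigns pixels that are exactly equal to a threshold to the mid-brightness region, which is an assumption because the text does not say how boundary values are handled. The two component losses are taken as already computed inputs, since their exact formulas are not reproduced here. Function names are hypothetical.

```python
from typing import Dict, List, Sequence


def split_by_brightness(
    brightness: Sequence[float], threshold_1: float, threshold_2: float
) -> Dict[str, List[int]]:
    """Return pixel indices of the bright, mid-brightness and dark regions."""
    if threshold_1 <= threshold_2:
        raise ValueError("brightness threshold 1 must be greater than threshold 2")
    regions: Dict[str, List[int]] = {"bright": [], "mid": [], "dark": []}
    for index, value in enumerate(brightness):
        if value > threshold_1:
            regions["bright"].append(index)
        elif value < threshold_2:
            regions["dark"].append(index)
        else:  # ASSUMPTION: boundary values count as mid-brightness
            regions["mid"].append(index)
    return regions


def hdr_fusion_loss(
    range_coverage_loss: float,
    detail_consistency_loss: float,
    w_range: float = 0.6,
    w_consistency: float = 0.4,
) -> float:
    """Weighted sum with the 0.6 / 0.4 weights quoted in the text as defaults."""
    return w_range * range_coverage_loss + w_consistency * detail_consistency_loss


print(split_by_brightness([0.9, 0.5, 0.1, 0.7], threshold_1=0.7, threshold_2=0.3))
```

The index lists can then be used to gather the matching pixels from the output and reference data before a per-region structural similarity index is computed.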
[0165] In some embodiments of this application, the optical parameters of the acquisition device when acquiring the first RAW image data may include the sensitivity (ISO). Before the weighted calculation of the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value to obtain the target loss function value of the first noise reduction model, the method may further include: determining a third weight of the noise reduction loss function value, a fourth weight of the enhancement loss function value, and a fifth weight of the high dynamic range fusion loss function value based on the ISO and the scene semantic information when the first RAW image data was acquired. The weighted calculation then includes: weighting the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value based on the third weight, the fourth weight, and the fifth weight to obtain the target loss function value of the first noise reduction model.
[0166] The third weight can be the weight of the noise reduction loss function value in formula (11) above. The fourth weight can be the weight of the enhancement loss function value in formula (11) above. The fifth weight can be the weight of the high dynamic range fusion loss function value in formula (11) above.
[0167] In some embodiments of this application, the third weight of the noise reduction loss function value, the fourth weight of the enhancement loss function value, and the fifth weight of the high dynamic range fusion loss function value can be determined based on ISO and scene semantic information when the first RAW image data is acquired. Then, the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value can be weighted and calculated based on the third weight, the fourth weight, and the fifth weight to obtain the target loss function value of the first noise reduction model.
[0168] The following uses different scene semantic information and ISO values to illustrate how the third weight of the noise reduction loss function value, the fourth weight of the enhancement loss function value, and the fifth weight of the high dynamic range fusion loss function value are determined. Low-light, high-ISO scenes: when ISO ≥ 3200 and the scene semantic information is "low-light night scene", noise suppression needs to be strengthened. For example, the weight of the noise reduction loss function value can be increased from 0.6 to 0.8. Shadow detail can be enhanced at the same time, for example by adding a shadow enhancement coefficient to the output layer of the first noise reduction model that multiplies shadow pixel values by 1.2 to 1.5. The specific coefficient is adjusted dynamically according to the exposure time: the shorter the exposure time, the larger the coefficient.
[0169] Bright, low-ISO scenes or high-contrast scenes: when ISO ≤ 400 and the scene semantic information is "high-contrast scene", the weight of the high dynamic range fusion loss function value needs to be strengthened, for example by raising it from 0.3 to 0.5, to optimize the dynamic range distribution in bright-dark transition areas. The noise reduction intensity is reduced at the same time, with the corresponding weight lowered from 0.4 to 0.2, to preserve highlight and shadow detail and to prevent halos during HDR fusion.
[0170] Indoor portrait scenes: when the scene semantic information is "indoor portrait", skin texture detail can be preserved in the portrait area, which can first be located using the portrait mask output by a semantic segmentation model. That is, the noise reduction intensity is reduced in the portrait area and increased in the background area. At the same time, the weight of the color fidelity loss function value is adjusted, for example from 0.5 to 0.6, to ensure that skin tones are reproduced naturally.
[0171] When the scene semantic information is "unknown scene", default weights can be used, such as 0.6 and 0.4, and the adaptive judgment mechanism of the first noise reduction model is enabled. The noise variance of the input image is calculated (a noise variance greater than 0.01 is judged as high noise; otherwise it is judged as low noise), and the low-light high-ISO strategy or the bright-light low-ISO strategy is matched automatically.
[0172] In the embodiments of this application, the third weight of the denoising loss function value, the fourth weight of the enhancement loss function value, and the fifth weight of the high dynamic range fusion loss function value are dynamically determined by ISO and scene semantic information when the first RAW image data is acquired, instead of weighting the denoising loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value according to fixed weights. This improves the accuracy of determining the target loss function value and thus improves the training accuracy of the first denoising model.
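The scene-dependent weighting can be sketched as a small lookup. The numbers are the examples quoted in the text, but how they map onto the three weights is partly an assumption: the defaults of 0.6 and 0.4 are read as the noise reduction and enhancement weights, the starting value of 0.3 as the high dynamic range fusion weight, and the reduction to 0.2 as applying to the noise reduction weight. The scene label strings and function names are hypothetical.

```python
from typing import Dict, Optional

# ASSUMPTION: interpretation of the example values quoted in the text.
DEFAULT_WEIGHTS = {"denoise": 0.6, "enhance": 0.4, "hdr": 0.3}
NOISE_VARIANCE_THRESHOLD = 0.01  # above this the input counts as high noise


def select_loss_weights(
    iso: int, scene: str, noise_variance: Optional[float] = None
) -> Dict[str, float]:
    """Pick the third, fourth and fifth weights from ISO and scene semantics."""
    weights = dict(DEFAULT_WEIGHTS)
    low_light = scene == "low-light night scene" and iso >= 3200
    high_contrast = scene == "high-contrast scene" and iso <= 400
    if scene == "unknown scene" and noise_variance is not None:
        # Adaptive fallback: choose a strategy from the measured noise variance.
        low_light = noise_variance > NOISE_VARIANCE_THRESHOLD
        high_contrast = not low_light
    if low_light:
        weights["denoise"] = 0.8  # stronger noise suppression
    elif high_contrast:
        weights["hdr"] = 0.5      # stronger HDR fusion term
        weights["denoise"] = 0.2  # ASSUMPTION: the 0.4 -> 0.2 change targets this weight
    return weights


def target_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the three loss function values."""
    return sum(weights[name] * losses[name] for name in weights)


weights = select_loss_weights(iso=6400, scene="low-light night scene")
print(weights, target_loss({"denoise": 1.0, "enhance": 1.0, "hdr": 1.0}, weights))
```

The shadow enhancement coefficient and the per-region portrait handling are left out because they act on the model output or on image regions rather than on the three loss weights.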
[0173] After training the first denoising model and obtaining the second denoising model, the second denoising model can be used to denoise RAW image data.
[0174] The noise reduction method provided in this application will be described in detail below with reference to the accompanying drawings, through specific embodiments and application scenarios.
[0175] Figure 2 is a flowchart of a noise reduction method provided in an embodiment of this application. As shown in Figure 2, the noise reduction method provided in this embodiment may include steps 210-240.
[0176] Step 210: Obtain the RAW image data to be processed, the scene semantic vector when acquiring the RAW image data to be processed, and the corresponding optical parameter vector when acquiring the RAW image data to be processed.
[0177] The RAW image data to be processed may include at least one frame of RAW image data. In the example above, the RAW image data to be processed may be the single frame of RAW image data captured when image A was taken. The RAW image data to be processed may also include multiple frames of RAW image data used to obtain a high dynamic range image. In the example above, the RAW image data to be processed may be the five frames of RAW image data at different exposures captured when image B was taken.
[0178] The scene semantic vector can be a vector of scene semantic information acquired when collecting RAW image data to be processed. The method for determining the scene semantic vector is the same as the method for determining the sample scene semantic vector in the above embodiment, and will not be repeated here.
[0179] The optical parameter vector can be a vector of the optical parameters of the acquisition device when acquiring RAW image data to be processed. The determination method of the optical parameter vector is the same as that of the sample optical parameter vector in the above embodiment, and will not be repeated here.
[0180] Step 220: Extract the first target feature vector of the RAW image data to be processed, the second target feature vector of the optical parameter vector, and the third target feature vector of the scene semantic vector.
[0181] The first target feature vector can be the feature vector extracted from the RAW image data to be processed. The extraction method for this first target feature vector is the same as that for the first feature vector in the above embodiments, and will not be repeated here.
[0182] The second target feature vector can be a feature vector of the extracted optical parameter vector. The extraction method of the second target feature vector is the same as that of the second feature vector in the above embodiment, and will not be repeated here.
[0183] The third target feature vector can be the feature vector of the extracted scene semantic vector. The extraction method of the third target feature vector is the same as that of the third feature vector in the above embodiment, and will not be repeated here.
[0184] Step 230: Fuse the first target feature vector, the second target feature vector, and the third target feature vector to obtain the target fusion feature.
[0185] The target fusion feature can be the feature obtained by fusing the first target feature vector, the second target feature vector, and the third target feature vector. The determination method for this target fusion feature is the same as that for the fusion feature in the above embodiments, and will not be repeated here.
[0186] Step 240: Input the target fusion features into the second noise reduction model to obtain the target output RAW image data.
[0187] The second noise reduction model can be the second noise reduction model obtained in the above embodiments.
[0188] The target output RAW image data can be the output RAW image data of the second noise reduction model after the target fusion features are input into the second noise reduction model.
[0189] In the embodiments of this application, noise reduction is performed on the RAW image data to be processed using the RAW image data to be processed, the scene semantic vector when the RAW image data to be processed was acquired, and the corresponding optical parameter vector. By introducing two additional modalities, the scene semantic vector and the optical parameter vector, rather than relying solely on the single-modal information of the RAW image data, noise in the RAW image data can be reduced adaptively for different sources of optical noise and different scene types. This avoids the loss of detail in RAW image data caused by a "one-size-fits-all" noise reduction method and improves the noise reduction effect on RAW image data.
[0190] In some embodiments of this application, after obtaining the target output RAW image data, the method described above may further include: converting the target output RAW image data into DNG format image data.
[0191] In some embodiments of this application, the target output RAW image data is in RAW format. To facilitate storage and image signal processor (ISP) processing of the target output RAW image data, it can be converted into DNG format image data.
[0192] In the embodiments of this application, converting the target output RAW image data into DNG format image data makes it convenient to store and to process in the ISP, and avoids the problem that the target output RAW image data cannot be stored or processed by the ISP because of an incompatible format.
[0193] The denoising model training method provided in this application can be executed by a denoising model training device. This application uses the example of a denoising model training device executing the denoising model training method to illustrate the denoising model training device provided in this application.
[0194] Figure 3 is a schematic structural diagram of a noise reduction model training device according to an exemplary embodiment. As shown in Figure 3, the noise reduction model training device 300 may include: an acquisition module 310, used to acquire training data pairs, the training data pairs including first RAW image data, sample scene semantic vectors when the first RAW image data is acquired, sample optical parameter vectors corresponding to the acquisition of the first RAW image data, and reference RAW image data corresponding to the first RAW image data, where the first RAW image data includes at least one frame of RAW image data; an extraction module 320, used to extract the first feature vector of the first RAW image data, the second feature vector of the sample optical parameter vector, and the third feature vector of the sample scene semantic vector; a fusion module 330, used to fuse the first feature vector, the second feature vector, and the third feature vector to obtain fused features; an input module 340, used to input the fused features into the first noise reduction model to obtain output RAW image data; and a training module 350, used to adjust the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain a second noise reduction model.
[0195] In this embodiment, the first denoising model is trained using first RAW image data, sample scene semantic vectors acquired when the first RAW image data was collected, and sample optical parameter vectors acquired when the first RAW image data was collected. By introducing two modalities, scene semantic vectors and optical parameter vectors, to train the first denoising model, compared to relying solely on single-modal information from the first RAW image data, it can adaptively denoise the RAW image data for different sources of optical noise and scene types. This avoids the loss of details in the RAW image data caused by a "one-size-fits-all" denoising method and improves the denoising effect on the RAW image data.
[0196] In some embodiments of this application, the acquisition module is specifically used to: acquire the first RAW image data and the corresponding optical parameters when the first RAW image data is acquired; encode the optical parameters corresponding to the first RAW image data to obtain a sample optical parameter vector; downsample the first RAW image data to obtain a preview image corresponding to the first RAW image data; input the preview image into a semantic recognition model and output at least one confidence level, where each confidence level corresponds to one preset scene semantic; determine the scene information when the first RAW image data was acquired based on the at least one confidence level and a preset confidence threshold; and perform one-hot encoding on the scene information to obtain a sample scene semantic vector corresponding to the first RAW image data.
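The step from confidence levels to a one-hot scene semantic vector can be sketched as below. The decision rule is an assumption: the sketch picks the preset scene with the highest confidence and falls back to "unknown scene" when that confidence is below the preset threshold. The scene list and function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical list of preset scene semantics; the last entry is the fallback.
SCENES = ["low-light night scene", "high-contrast scene", "indoor portrait", "unknown scene"]


def pick_scene(confidences: Dict[str, float], threshold: float) -> str:
    """ASSUMPTION: highest-confidence scene wins if it reaches the threshold."""
    if not confidences:
        return "unknown scene"
    best = max(confidences, key=lambda name: confidences[name])
    if best in SCENES and confidences[best] >= threshold:
        return best
    return "unknown scene"


def one_hot_scene(scene: str) -> List[int]:
    """One-hot encode the scene information over the preset scene list."""
    if scene not in SCENES:
        raise ValueError(f"unknown scene label: {scene}")
    return [1 if name == scene else 0 for name in SCENES]


scene = pick_scene({"low-light night scene": 0.7, "indoor portrait": 0.2}, threshold=0.5)
print(scene, one_hot_scene(scene))
```

Running the recognition model on a downsampled preview rather than the full RAW frame keeps this step cheap, which is why the text downsamples first.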
[0197] In some embodiments of this application, the optical parameters include device parameters of the acquisition device and exposure control parameters; the acquisition module is specifically used to: perform one-hot encoding on the device parameters corresponding to the first RAW image data to obtain a device parameter encoding vector; perform normalized encoding on the exposure control parameters corresponding to the first RAW image data to obtain an exposure control parameter encoding vector; and concatenate the device parameter encoding vector and the exposure control parameter encoding vector to obtain a sample optical parameter vector.
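The encoding of the optical parameters can be sketched as follows. The sketch assumes one categorical device parameter (a sensor model name) and two exposure control parameters (ISO and exposure time), and it assumes min-max normalization against preset ranges with clamping; none of these specifics are stated in the text, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple


def one_hot(value: str, choices: Sequence[str]) -> List[float]:
    """One-hot encode a categorical device parameter."""
    if value not in choices:
        raise ValueError(f"{value!r} is not a known device parameter value")
    return [1.0 if choice == value else 0.0 for choice in choices]


def min_max(value: float, low: float, high: float) -> float:
    """ASSUMPTION: min-max normalization, clamped to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(max((value - low) / (high - low), 0.0), 1.0)


def encode_optical_parameters(
    sensor_model: str,
    iso: float,
    exposure_time_s: float,
    known_sensors: Sequence[str],
    iso_range: Tuple[float, float],
    exposure_range: Tuple[float, float],
) -> List[float]:
    """Concatenate the device encoding and the exposure control encoding."""
    device_vector = one_hot(sensor_model, known_sensors)
    exposure_vector = [min_max(iso, *iso_range), min_max(exposure_time_s, *exposure_range)]
    return device_vector + exposure_vector


vector = encode_optical_parameters(
    "sensor_b", 1600, 0.25, ["sensor_a", "sensor_b"], (100, 3100), (0.0, 1.0)
)
print(vector)  # [0.0, 1.0, 0.5, 0.25]
```

For multi-frame input, the per-frame exposure time weight vector described in the next paragraph would be appended to the same list.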
[0198] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, the acquisition module is further configured to determine the exposure time weight vector corresponding to each frame of RAW image data based on the exposure time corresponding to each frame of RAW image data before concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector. The acquisition module is specifically used to: concatenate the device parameter encoding vector, the exposure control parameter encoding vector, and the exposure time weight vector corresponding to each frame of RAW image data to obtain a sample optical parameter vector.
[0199] In some embodiments of this application, the extraction module is specifically used to: input the first RAW image data into a first feature extraction network to obtain a first feature vector; input the sample optical parameter vector into a second feature extraction network to obtain a second feature vector; and input the sample scene semantic vector into a third feature extraction network to obtain a third feature vector.
[0200] In some embodiments of this application, when the first RAW image data includes multiple frames of RAW image data, the first feature extraction network includes a temporal attention layer and an encoding layer; the extraction module is specifically used to: input multiple frames of RAW image data into the temporal attention layer of the first feature extraction network, fuse the features of the RAW image data according to the acquisition sequence of each frame of RAW image data to obtain a fused feature map; input the fused feature map into the encoding layer, extract the features from the fused feature map to obtain the first feature vector.
[0201] In some embodiments of this application, the apparatus further includes: a preprocessing module, used to preprocess the first RAW image data after the training data pairs are acquired to obtain second RAW image data; a normalization processing module, used to normalize the second RAW image data to obtain third RAW image data; and a segmentation module, used to segment the third RAW image data into a preset number of RAW image data blocks to obtain M RAW image data blocks, where M is a positive integer greater than 1. The extraction module is specifically used to extract the first feature vector of each RAW image data block in the first RAW image data.
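The segmentation step can be illustrated with a short sketch that splits a 2-D image into a fixed grid of blocks. The function name, the row-major block order, and the requirement that the image size be divisible by the grid are assumptions for the example; the application only states that a preset number M of blocks is produced.

```python
from typing import List

Image = List[List[float]]


def split_into_blocks(image: Image, rows: int, cols: int) -> List[Image]:
    """Split a 2-D image into rows * cols equally sized blocks, in row-major order."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the block grid")
    block_h, block_w = height // rows, width // cols
    blocks = []
    for r in range(rows):
        for c in range(cols):
            block = [
                row[c * block_w:(c + 1) * block_w]
                for row in image[r * block_h:(r + 1) * block_h]
            ]
            blocks.append(block)
    return blocks


if __name__ == "__main__":
    image = [[float(4 * y + x) for x in range(4)] for y in range(4)]
    for block in split_into_blocks(image, 2, 2):  # M = 4 blocks
        print(block)
```

For Bayer-pattern RAW data, an implementation would probably keep block heights and widths even so that every block starts on the same colour-filter phase; the sketch does not enforce this.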
[0202] In some embodiments of this application, the normalization processing module is specifically used to: calculate a first difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device; calculate a second difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device; obtain the normalized pixel value of the second RAW image data based on the first difference and the second difference; and obtain the third RAW image data based on the normalized pixel value of the second RAW image data.
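A minimal sketch of this normalization follows. The application says only that the normalized value is obtained "based on" the two differences; taking their ratio, and clamping pixels below the black level to zero, are assumptions made here.

```python
from typing import List

Image = List[List[float]]


def normalize_raw(pixels: Image, black_level: float) -> Image:
    """Scale RAW pixel values to [0, 1] using the black level and the frame maximum."""
    max_value = max(max(row) for row in pixels)
    second_difference = max_value - black_level  # maximum pixel value minus black level
    if second_difference <= 0:
        raise ValueError("maximum pixel value must exceed the black level")
    normalized = []
    for row in pixels:
        # First difference: pixel value minus black level (clamped at 0, an assumption).
        normalized.append(
            [max(p - black_level, 0) / second_difference for p in row]
        )
    return normalized


if __name__ == "__main__":
    frame = [[64, 564], [1064, 314]]
    print(normalize_raw(frame, black_level=64))
```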
[0203] In some embodiments of this application, the fusion module is specifically used for: determining a query vector based on the first feature vector; determining a first key vector and a first value vector based on the second feature vector; determining a second key vector and a second value vector based on the third feature vector; determining a first weight of the second feature vector based on the similarity between the query vector and the first key vector; determining a second weight of the third feature vector based on the similarity between the query vector and the second key vector; and fusing the first feature vector, the second feature vector, and the third feature vector based on the first weight and the second weight to obtain a fused feature.
[0204] In some embodiments of this application, the fusion module is specifically used to: perform weighted calculation on the first value vector and the second value vector based on the first weight and the second weight to obtain a weighted vector; calculate the sum of the weighted vector and the query vector to obtain the fusion feature.
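The fusion described in the two paragraphs above resembles a cross-attention step with a residual connection, which can be sketched as below. The query, key, and value vectors are taken as given inputs (the projections that derive them from the feature vectors are omitted), and the scaled dot-product similarity with a softmax over the two scores is an assumed choice; the application does not name the similarity measure.

```python
import math
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_features(
    query: Sequence[float],
    key_optical: Sequence[float],
    value_optical: Sequence[float],
    key_scene: Sequence[float],
    value_scene: Sequence[float],
) -> Tuple[List[float], Tuple[float, float]]:
    """Weight the optical and scene values by their similarity to the image query."""
    scale = math.sqrt(len(query))
    score_optical = dot(query, key_optical) / scale
    score_scene = dot(query, key_scene) / scale
    # Softmax over the two scores; subtracting the max keeps exp() stable.
    peak = max(score_optical, score_scene)
    exp_optical = math.exp(score_optical - peak)
    exp_scene = math.exp(score_scene - peak)
    first_weight = exp_optical / (exp_optical + exp_scene)
    second_weight = exp_scene / (exp_optical + exp_scene)
    weighted = [
        first_weight * vo + second_weight * vs
        for vo, vs in zip(value_optical, value_scene)
    ]
    # Residual sum of the weighted vector and the query gives the fused feature.
    fused = [w + q for w, q in zip(weighted, query)]
    return fused, (first_weight, second_weight)


if __name__ == "__main__":
    fused, weights = fuse_features(
        [1.0, 0.0], [1.0, 0.0], [2.0, 0.0], [1.0, 0.0], [0.0, 2.0]
    )
    print(fused, weights)
```

Adding the query back means the image features are always preserved, while the optical and scene information only modulates them.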
[0205] In some embodiments of this application, the training module is specifically used to: determine the noise reduction loss function value of the first noise reduction model based on the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data; determine the enhancement loss function value of the first noise reduction model based on the image content of the output RAW image data and the image content of the reference RAW image data; perform a weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model; and adjust the model parameters of the first noise reduction model based on the target loss function value to obtain a second noise reduction model.
[0206] In some embodiments of this application, when the first RAW image data includes multiple frames of high dynamic range RAW image data, the reference RAW image data is one frame of high dynamic range RAW image data; the training module is further configured to: before performing weighted calculation on the denoising loss function value and the enhancement loss function value to obtain the target loss function value of the first denoising model, determine the high dynamic range fusion loss function value of the first denoising model based on the light intensity information of the output RAW image data and the light intensity information of the reference RAW image data; and perform weighted calculation on the denoising loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value to obtain the target loss function value of the first denoising model.
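The weighted combination of loss terms can be sketched as follows. Only the weighted-sum structure comes from the text above. The concrete terms are stand-ins chosen for illustration: mean absolute pixel error for the noise reduction loss, a finite-difference gradient error for the "image content" enhancement loss, and a mean log-intensity difference for the high dynamic range fusion loss. The weight values are likewise hypothetical.

```python
import math
from typing import Sequence, Tuple


def l1_loss(output: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute error between two equally long pixel sequences."""
    if not output or len(output) != len(reference):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(o - r) for o, r in zip(output, reference)) / len(output)


def gradient_loss(output: Sequence[float], reference: Sequence[float]) -> float:
    """L1 error between finite differences, a stand-in for an image-content loss."""
    grad_out = [b - a for a, b in zip(output, output[1:])]
    grad_ref = [b - a for a, b in zip(reference, reference[1:])]
    return l1_loss(grad_out, grad_ref)


def intensity_loss(output: Sequence[float], reference: Sequence[float],
                   eps: float = 1e-6) -> float:
    """Absolute difference of mean log intensity, a stand-in for an HDR fusion loss."""
    mean_out = sum(math.log(o + eps) for o in output) / len(output)
    mean_ref = sum(math.log(r + eps) for r in reference) / len(reference)
    return abs(mean_out - mean_ref)


def target_loss(output: Sequence[float], reference: Sequence[float],
                weights: Tuple[float, float, float] = (1.0, 0.5, 0.1),
                hdr: bool = False) -> float:
    """Weighted sum of the loss terms; the HDR term is added only for HDR input."""
    w_denoise, w_enhance, w_hdr = weights
    loss = (w_denoise * l1_loss(output, reference)
            + w_enhance * gradient_loss(output, reference))
    if hdr:
        loss += w_hdr * intensity_loss(output, reference)
    return loss


if __name__ == "__main__":
    out, ref = [0.2, 0.4, 0.6], [0.1, 0.4, 0.8]
    print(target_loss(out, ref), target_loss(out, ref, hdr=True))
```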
[0207] The noise reduction method provided in this application can be implemented by a noise reduction device. This application uses an example of a noise reduction device performing the noise reduction method to illustrate the noise reduction device provided in this application.
[0208] Figure 4 is a schematic diagram illustrating the structure of a noise reduction device according to an exemplary embodiment. This noise reduction device uses the second noise reduction model provided in the above embodiment. As shown in Figure 4, the noise reduction device 400 may include: an acquisition module 410, used to acquire RAW image data to be processed, the scene semantic vector when the RAW image data to be processed was acquired, and the optical parameter vector corresponding to the acquisition of the RAW image data to be processed, where the RAW image data to be processed includes at least one frame of RAW image data; an extraction module 420, used to extract the first target feature vector of the RAW image data to be processed, the second target feature vector of the optical parameter vector, and the third target feature vector of the scene semantic vector; a fusion module 430, used to fuse the first target feature vector, the second target feature vector, and the third target feature vector to obtain target fusion features; and a noise reduction module 440, used to input the target fusion features into the second noise reduction model to obtain the target output RAW image data.
[0209] In the embodiments of this application, the first RAW image data, the sample scene semantic vector when the first RAW image data was acquired, and the sample optical parameter vector corresponding to the acquisition of the first RAW image data are used to perform noise reduction processing on the first RAW image data. By introducing two modalities, the scene semantic vector and the optical parameter vector, rather than relying solely on the single-modal information of the first RAW image data, noise in RAW image data can be reduced adaptively for different sources of optical noise and scene types. This avoids the loss of detail in RAW image data caused by a "one-size-fits-all" noise reduction method and improves the noise reduction effect on RAW image data.
[0210] The noise reduction model training device and / or noise reduction device in the embodiments of this application can be an electronic device or a component in an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or other devices besides a terminal. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, handheld computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM or self-service machine, etc. The embodiments of this application do not specifically limit it.
[0211] The noise reduction model training device and / or noise reduction device in the embodiments of this application can be a device with an operating system. The operating system can be Android, iOS, or other possible operating systems, and this application embodiment does not specifically limit it.
[0212] The noise reduction model training device and / or noise reduction device provided in the embodiments of this application can implement the various processes of the method embodiments of Figure 1 and / or Figure 3. To avoid repetition, they will not be described again here.
[0213] Optionally, as shown in Figure 5, this application embodiment also provides an electronic device 500, including a processor 501 and a memory 502. The memory 502 stores a program or instructions that can run on the processor 501. When the program or instructions are executed by the processor 501, they implement the various steps of the above-described noise reduction model training method and / or noise reduction method embodiments, and can achieve the same technical effect. To avoid repetition, they will not be described again here.
[0214] It should be noted that the electronic devices in the embodiments of this application include the mobile electronic devices and non-mobile electronic devices described above.
[0215] Figure 6 is a schematic diagram of the hardware structure of an electronic device implementing an embodiment of this application.
[0216] The electronic device 600 includes, but is not limited to, components such as: radio frequency unit 601, network module 602, audio output unit 603, input unit 604, sensor 605, display unit 606, user input unit 607, interface unit 608, memory 609, and processor 610.
[0217] Those skilled in the art will understand that the electronic device 600 may also include a power supply (such as a battery) for supplying power to various components. The power supply may be logically connected to the processor 610 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in Figure 6 does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine certain components, or have different component arrangements, which will not be elaborated here.
[0218] When the electronic device is used as a training device for a noise reduction model: The processor 610 is configured to acquire training data pairs, the training data pairs including first RAW image data, sample scene semantic vectors acquired when the first RAW image data was acquired, sample optical parameter vectors acquired when the first RAW image data was acquired, and reference RAW image data corresponding to the first RAW image data. The first RAW image data includes at least one frame of RAW image data. The processor extracts a first feature vector from the first RAW image data, a second feature vector from the sample optical parameter vectors, and a third feature vector from the sample scene semantic vectors. It then fuses the first feature vector, the second feature vector, and the third feature vector to obtain fused features. The fused features are input into a first noise reduction model to obtain output RAW image data. Based on the output RAW image data and the reference RAW image data, the processor adjusts the model parameters of the first noise reduction model to obtain a second noise reduction model.
[0219] Thus, in this embodiment, the first RAW image data, the sample scene semantic vector when the first RAW image data was acquired, and the sample optical parameter vector corresponding to the acquisition of the first RAW image data are used to train the first denoising model. By introducing two modalities, scene semantic vector and optical parameter vector, to train the first denoising model, compared with relying solely on the single modal information of the first RAW image data, it can adaptively denoise the RAW image data for different sources of optical noise and scene types, avoiding the problem of detail loss in the RAW image data caused by the "one-size-fits-all" denoising method, and improving the denoising effect of the RAW image data.
[0220] Optionally, the processor 610 is further configured to acquire first RAW image data and optical parameters corresponding to the acquisition of the first RAW image data; encode the optical parameters corresponding to the first RAW image data to obtain a sample optical parameter vector; downsample the first RAW image data to obtain a preview image corresponding to the first RAW image data; input the preview image into a semantic recognition model and output at least one confidence level; wherein, one confidence level corresponds to a preset scene semantic; determine the scene information when acquiring the first RAW image data based on the at least one confidence level and a preset confidence threshold; and perform one-hot encoding on the scene information to obtain a sample scene semantic vector corresponding to the first RAW image data.
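The scene-semantic part of this paragraph can be sketched as below. The scene list, the 0.6 threshold, the stride-based downsampling, and the "unknown" fallback when no confidence reaches the threshold are assumptions for the example; the semantic recognition model itself is not shown and its confidences are passed in as a plain dictionary.

```python
from typing import Dict, List

# Hypothetical preset scene semantics; the application does not list them.
SCENES = ["night", "portrait", "landscape", "indoor", "unknown"]


def downsample(pixels: List[List[float]], factor: int) -> List[List[float]]:
    """Build a preview image by keeping every factor-th pixel in each direction."""
    return [row[::factor] for row in pixels[::factor]]


def scene_vector(confidences: Dict[str, float], threshold: float = 0.6) -> List[float]:
    """Pick the most confident scene above the threshold and one-hot encode it."""
    scene = "unknown"
    if confidences:
        best_scene, best_conf = max(confidences.items(), key=lambda kv: kv[1])
        if best_conf >= threshold and best_scene in SCENES:
            scene = best_scene
    return [1.0 if name == scene else 0.0 for name in SCENES]


if __name__ == "__main__":
    preview = downsample([[float(4 * y + x) for x in range(4)] for y in range(4)], 2)
    print(preview)
    # The confidences would come from the semantic recognition model run on the preview.
    print(scene_vector({"night": 0.8, "portrait": 0.1}))
```

An even downsampling factor keeps every preview pixel on the same colour-filter position of a Bayer mosaic, which is one reason a stride-based preview is a plausible choice here.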
[0221] Optionally, the optical parameters include device parameters of the acquisition device and exposure control parameters; the processor 610 is further configured to perform one-hot encoding processing on the device parameters corresponding to the first RAW image data to obtain a device parameter encoding vector; perform normalized encoding on the exposure control parameters corresponding to the first RAW image data to obtain an exposure control parameter encoding vector; and concatenate the device parameter encoding vector and the exposure control parameter encoding vector to obtain a sample optical parameter vector.
[0222] Optionally, the processor 610 is further configured to, when the first RAW image data includes multiple frames of RAW image data, before concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector, determine an exposure time weight vector corresponding to each frame of RAW image data based on the exposure time corresponding to each frame of RAW image data; and concatenate the device parameter encoding vector, the exposure control parameter encoding vector, and the exposure time weight vector corresponding to each frame of RAW image data to obtain the sample optical parameter vector.
[0223] Optionally, the processor 610 is further configured to input the first RAW image data into a first feature extraction network to obtain a first feature vector; input the sample optical parameter vector into a second feature extraction network to obtain a second feature vector; and input the sample scene semantic vector into a third feature extraction network to obtain a third feature vector.
[0224] Optionally, when the first RAW image data includes multiple frames of RAW image data, the first feature extraction network includes a temporal attention layer and an encoding layer; the processor 610 is further configured to input the multiple frames of RAW image data into the temporal attention layer of the first feature extraction network, fuse the features of the multiple frames of RAW image data according to the acquisition sequence of each frame of RAW image data, and obtain a fused feature map; input the fused feature map into the encoding layer, and extract the features from the fused feature map to obtain a first feature vector.
[0225] Optionally, the processor 610 is further configured to preprocess the first RAW image data to obtain second RAW image data; normalize the second RAW image data to obtain third RAW image data; divide the third RAW image data into a preset number of RAW image data blocks to obtain M RAW image data blocks, where M is a positive integer greater than 1; and extract the first feature vector of each RAW image data block in the first RAW image data.
[0226] Optionally, the processor 610 is further configured to calculate a first difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device; calculate a second difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device; obtain the normalized pixel value of the second RAW image data based on the first difference and the second difference; and obtain the third RAW image data based on the normalized pixel value of the second RAW image data.
[0227] Optionally, the processor 610 is further configured to: determine a query vector based on the first feature vector; determine a first key vector and a first value vector based on the second feature vector; determine a second key vector and a second value vector based on the third feature vector; determine a first weight of the second feature vector based on the similarity between the query vector and the first key vector; determine a second weight of the third feature vector based on the similarity between the query vector and the second key vector; and fuse the first feature vector, the second feature vector, and the third feature vector based on the first weight and the second weight to obtain a fused feature.
[0228] Optionally, the processor 610 is further configured to perform a weighted calculation on the first value vector and the second value vector based on the first weight and the second weight to obtain a weighted vector; and calculate the sum of the weighted vector and the query vector to obtain a fusion feature.
[0229] Optionally, the processor 610 is further configured to: determine the noise reduction loss function value of the first noise reduction model based on the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data; determine the enhancement loss function value of the first noise reduction model based on the image content of the output RAW image data and the image content of the reference RAW image data; perform a weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model; and adjust the model parameters of the first noise reduction model based on the target loss function value to obtain a second noise reduction model.
[0230] Optionally, when the first RAW image data includes multiple frames of RAW image data, the reference RAW image data is a single frame of high dynamic range RAW image data; the processor 610 is further configured to, before performing a weighted calculation on the denoising loss function value and the enhancement loss function value to obtain the target loss function value of the first denoising model, determine the high dynamic range fusion loss function value of the first denoising model based on the light intensity information of the output RAW image data and the light intensity information of the reference RAW image data; and perform a weighted calculation on the denoising loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value to obtain the target loss function value of the first denoising model.
[0231] When the electronic device is used as a noise reduction device: The processor 610 is configured to acquire RAW image data to be processed, the scene semantic vector when the RAW image data to be processed was acquired, and the optical parameter vector corresponding to the acquisition of the RAW image data to be processed, wherein the RAW image data to be processed includes at least one frame of RAW image data; extract a first target feature vector of the RAW image data to be processed, a second target feature vector of the optical parameter vector, and a third target feature vector of the scene semantic vector; fuse the first target feature vector, the second target feature vector, and the third target feature vector to obtain target fusion features; and input the target fusion features into a second noise reduction model to obtain target output RAW image data.
[0232] Thus, the first RAW image data, the sample scene semantic vector acquired when the first RAW image data was acquired, and the corresponding sample optical parameter vector acquired when the first RAW image data was acquired are used to perform noise reduction processing on the first RAW image data. By introducing two modalities, scene semantic vector and optical parameter vector, to perform noise reduction processing on the first RAW image data, compared with relying solely on the single modal information of the first RAW image data, it is possible to adaptively reduce noise in RAW image data for different sources of optical noise and scene types, avoiding the problem of detail loss in RAW image data caused by the "one-size-fits-all" noise reduction method, and improving the noise reduction effect on RAW image data.
[0233] It should be understood that, in this embodiment, the input unit 604 may include a graphics processing unit (GPU) 6041 and a microphone 6042. The GPU 6041 processes image data of still images or videos obtained by an image capture device (such as a color camera) in video capture mode or image capture mode. The display unit 606 may include a display panel 6061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 607 includes at least one of a touch panel 6071 and other input devices 6072. The touch panel 6071 is also called a touch screen. The touch panel 6071 may include a touch detection device and a touch controller. Other input devices 6072 may include, but are not limited to, physical keyboards, function keys (such as volume control buttons, power buttons, etc.), trackballs, mice, and joysticks, which will not be described in detail here.
[0234] The memory 609 can be used to store software programs and various data. The memory 609 may primarily include a first storage area for storing programs or instructions and a second storage area for storing data. The first storage area may store the operating system and the application programs or instructions required for at least one function (such as sound playback, image playback, etc.). Furthermore, the memory 609 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), or direct Rambus random access memory (DRRAM). The memory 609 in this embodiment includes, but is not limited to, these and any other suitable types of memory.
[0235] Processor 610 may include one or more processing units; optionally, processor 610 integrates an application processor and a modem processor, wherein the application processor mainly handles operations involving the operating system, user interface, and applications, and the modem processor mainly handles wireless communication signals, such as a baseband processor. It is understood that the aforementioned modem processor may also not be integrated into processor 610.
[0236] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above-described noise reduction model training method and / or noise reduction method embodiments, and can achieve the same technical effect. To avoid repetition, they will not be described again here.
[0237] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.
[0238] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface is coupled to the processor. The processor is used to run programs or instructions to implement the various processes of the above-described noise reduction model training method and / or noise reduction method embodiments, and can achieve the same technical effect. To avoid repetition, it will not be described again here.
[0239] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-a-chip, etc.
[0240] This application provides a computer program product stored in a storage medium. The program product is executed by at least one processor to implement the various processes of the above-described denoising model training method and / or denoising method embodiments, and can achieve the same technical effect. To avoid repetition, it will not be described again here.
[0241] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
[0242] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a computer software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0243] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.
Claims
1. A method for training a noise reduction model, characterized in that the method includes: acquiring training data pairs, wherein the training data pairs include first RAW image data, a sample scene semantic vector when the first RAW image data is acquired, a sample optical parameter vector corresponding to the acquisition of the first RAW image data, and reference RAW image data corresponding to the first RAW image data, wherein the first RAW image data includes at least one frame of RAW image data; extracting a first feature vector of the first RAW image data, a second feature vector of the sample optical parameter vector, and a third feature vector of the sample scene semantic vector; fusing the first feature vector, the second feature vector, and the third feature vector to obtain a fused feature; inputting the fused feature into a first noise reduction model to obtain output RAW image data; and adjusting the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain a second noise reduction model.
2. The method according to claim 1, characterized in that acquiring the training data pairs includes: acquiring the first RAW image data and the optical parameters corresponding to the acquisition of the first RAW image data; encoding the optical parameters corresponding to the first RAW image data to obtain the sample optical parameter vector; downsampling the first RAW image data to obtain a preview image corresponding to the first RAW image data; inputting the preview image into a semantic recognition model, which outputs at least one confidence level, wherein each confidence level corresponds to one preset scene semantic; determining, based on the at least one confidence level and a preset confidence threshold, the scene information when the first RAW image data was acquired; and performing one-hot encoding on the scene information to obtain the sample scene semantic vector corresponding to the first RAW image data.
3. The method according to claim 2, characterized in that the optical parameters include device parameters of the acquisition device and exposure control parameters; and encoding the optical parameters corresponding to the first RAW image data to obtain the sample optical parameter vector includes: performing one-hot encoding on the device parameters corresponding to the first RAW image data to obtain a device parameter encoding vector; performing normalized encoding on the exposure control parameters corresponding to the first RAW image data to obtain an exposure control parameter encoding vector; and concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector.
4. The method according to claim 3, characterized in that, when the first RAW image data includes multiple frames of RAW image data, before concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector, the method further includes: determining, based on the exposure time corresponding to each frame of RAW image data, an exposure time weight vector corresponding to each frame of RAW image data; and concatenating the device parameter encoding vector and the exposure control parameter encoding vector to obtain the sample optical parameter vector includes: concatenating the device parameter encoding vector, the exposure control parameter encoding vector, and the exposure time weight vector corresponding to each frame of RAW image data to obtain the sample optical parameter vector.
5. The method according to claim 1, characterized in that extracting the first feature vector of the first RAW image data, the second feature vector of the sample optical parameter vector, and the third feature vector of the sample scene semantic vector includes: inputting the first RAW image data into a first feature extraction network to obtain the first feature vector; inputting the sample optical parameter vector into a second feature extraction network to obtain the second feature vector; and inputting the sample scene semantic vector into a third feature extraction network to obtain the third feature vector.
6. The method according to claim 5, characterized in that, when the first RAW image data includes multiple frames of RAW image data, the first feature extraction network includes a temporal attention layer and an encoding layer; and inputting the first RAW image data into the first feature extraction network to obtain the first feature vector includes: inputting the multiple frames of RAW image data into the temporal attention layer of the first feature extraction network, and fusing the features of the multiple frames of RAW image data according to the acquisition time sequence of each frame of RAW image data to obtain a fused feature map; and inputting the fused feature map into the encoding layer, and extracting the features in the fused feature map to obtain the first feature vector.
7. The method according to claim 1, characterized in that, after acquiring the training data pairs, the method further includes: preprocessing the first RAW image data to obtain second RAW image data; normalizing the second RAW image data to obtain third RAW image data; and dividing the third RAW image data into a preset number of RAW image data blocks to obtain M RAW image data blocks, where M is a positive integer greater than 1; and the extraction of the first feature vector from the first RAW image data includes: extracting the first feature vector from each RAW image data block in the first RAW image data.
8. The method according to claim 7, characterized in that the step of normalizing the second RAW image data to obtain the third RAW image data includes: calculating a first difference between the pixel value of each pixel in the second RAW image data and the black level value of the acquisition device; calculating a second difference between the maximum pixel value in the second RAW image data and the black level value of the acquisition device; obtaining normalized pixel values of the second RAW image data based on the first difference and the second difference; and obtaining the third RAW image data based on the normalized pixel values of the second RAW image data.
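The normalization in claim 8 and the block division in claim 7 can be sketched as follows. The claim only says the normalized value is obtained "based on" the two differences; taking their ratio, and clamping values below the black level to zero, are assumptions made for this example. The function names and the requirement that the image divide evenly into blocks are likewise hypothetical.

```python
def normalize_raw(raw, black_level):
    """Black-level normalization of a 2-D RAW frame given as a list of rows.

    first difference  = pixel value - black level
    second difference = maximum pixel value - black level
    Assumption: normalized value = first difference / second difference,
    clamped at 0 for pixels that fall below the black level.
    """
    max_value = max(max(row) for row in raw)
    second_difference = max_value - black_level
    if second_difference <= 0:
        raise ValueError("maximum pixel value must exceed the black level")
    return [[max(0.0, (pixel - black_level) / second_difference) for pixel in row]
            for row in raw]


def split_into_blocks(image, block_height, block_width):
    """Divide an image into non-overlapping blocks in row-major order."""
    height, width = len(image), len(image[0])
    if height % block_height or width % block_width:
        raise ValueError("image size must be a multiple of the block size")
    blocks = []
    for top in range(0, height, block_height):
        for left in range(0, width, block_width):
            blocks.append([row[left:left + block_width]
                           for row in image[top:top + block_height]])
    return blocks


# Hypothetical 2x2 frame with a black level of 64.
normalized = normalize_raw([[64, 128], [576, 1088]], black_level=64)
blocks = split_into_blocks(normalized, block_height=1, block_width=2)
```

Here M equals the number of returned blocks, so a 2x2 frame split into 1x2 blocks gives M = 2.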
9. The method according to claim 1, characterized in that the process of fusing the first feature vector, the second feature vector, and the third feature vector to obtain the fused feature includes: determining a query vector based on the first feature vector; determining a first key vector and a first value vector based on the second feature vector; determining a second key vector and a second value vector based on the third feature vector; determining a first weight of the second feature vector based on the similarity between the query vector and the first key vector; determining a second weight of the third feature vector based on the similarity between the query vector and the second key vector; and fusing the first feature vector, the second feature vector, and the third feature vector based on the first weight and the second weight to obtain the fused feature.
10. The method according to claim 9, characterized in that the step of fusing the first feature vector, the second feature vector, and the third feature vector based on the first weight and the second weight to obtain the fused feature includes: weighting the first value vector and the second value vector based on the first weight and the second weight to obtain a weighted vector; and calculating the sum of the weighted vector and the query vector to obtain the fused feature.
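The fusion described in claims 9 and 10 resembles a small cross-attention step with a residual connection. The sketch below is one possible reading, not the claimed implementation: the linear projection matrices, the scaled dot product as the similarity measure, and the softmax that turns the two similarities into weights are all assumptions, and every name is hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_features(first, second, third, proj):
    """Fuse three feature vectors as outlined in claims 9 and 10.

    `proj` maps the hypothetical names "q", "k1", "v1", "k2", "v2" to
    projection matrices (assumed linear projections).
    """
    query = matvec(proj["q"], first)
    key1, value1 = matvec(proj["k1"], second), matvec(proj["v1"], second)
    key2, value2 = matvec(proj["k2"], third), matvec(proj["v2"], third)

    # Assumed similarity: scaled dot product between query and each key.
    scale = math.sqrt(len(query))
    score1 = dot(query, key1) / scale
    score2 = dot(query, key2) / scale

    # Assumed weighting: numerically stable softmax over the two scores.
    peak = max(score1, score2)
    exp1, exp2 = math.exp(score1 - peak), math.exp(score2 - peak)
    weight1, weight2 = exp1 / (exp1 + exp2), exp2 / (exp1 + exp2)

    # Claim 10: weight the value vectors, then add the query vector.
    weighted = [weight1 * a + weight2 * b for a, b in zip(value1, value2)]
    fused = [w + q for w, q in zip(weighted, query)]
    return fused, (weight1, weight2)


identity = [[1.0, 0.0], [0.0, 1.0]]
proj = {name: identity for name in ("q", "k1", "v1", "k2", "v2")}
fused, weights = fuse_features([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], proj)
```

In this example the optical feature is aligned with the query and the scene feature is orthogonal to it, so the first weight comes out larger than the second.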
11. The method according to claim 1, characterized in that the step of adjusting the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain the second noise reduction model includes: determining a noise reduction loss function value of the first noise reduction model based on the pixel value of each pixel in the output RAW image data and the pixel value of each pixel in the reference RAW image data; determining an enhancement loss function value of the first noise reduction model based on the image content of the output RAW image data and the image content of the reference RAW image data; performing a weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain a target loss function value of the first noise reduction model; and adjusting the model parameters of the first noise reduction model based on the target loss function value to obtain the second noise reduction model.
12. The method according to claim 11, characterized in that, when the first RAW image data includes multiple frames of RAW image data, the reference RAW image data is a single frame of high dynamic range RAW image data; before performing the weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model, the method further includes: determining a high dynamic range fusion loss function value of the first noise reduction model based on the light intensity information of the output RAW image data and the light intensity information of the reference RAW image data; and the step of performing the weighted calculation on the noise reduction loss function value and the enhancement loss function value to obtain the target loss function value of the first noise reduction model includes: performing a weighted calculation on the noise reduction loss function value, the enhancement loss function value, and the high dynamic range fusion loss function value to obtain the target loss function value of the first noise reduction model.
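The weighted combination of loss terms in claims 11 and 12 can be illustrated with a short sketch. The claims do not fix the form of any individual loss or the values of the weights; the mean absolute pixel difference used for the noise reduction loss and the example weights below are assumptions, and the enhancement and high dynamic range fusion losses are taken as already computed inputs because the claims leave their form open.

```python
def pixel_l1_loss(output, reference):
    """Mean absolute per-pixel difference between two 2-D frames.

    Assumption: the pixel-wise noise reduction loss is an L1 loss.
    """
    diffs = [abs(o - r)
             for out_row, ref_row in zip(output, reference)
             for o, r in zip(out_row, ref_row)]
    return sum(diffs) / len(diffs)


def target_loss(noise_reduction_loss, enhancement_loss,
                hdr_fusion_loss=None, weights=(1.0, 0.5, 0.25)):
    """Weighted sum of the loss terms (the weights are hypothetical).

    The HDR fusion term is only added for the multi-frame case of
    claim 12, where the reference is a single HDR RAW frame.
    """
    total = weights[0] * noise_reduction_loss + weights[1] * enhancement_loss
    if hdr_fusion_loss is not None:
        total += weights[2] * hdr_fusion_loss
    return total


noise_loss = pixel_l1_loss([[0.0, 1.0]], [[0.5, 0.5]])
single_frame_total = target_loss(noise_loss, enhancement_loss=2.0)
multi_frame_total = target_loss(noise_loss, enhancement_loss=2.0,
                                hdr_fusion_loss=4.0)
```

In a training loop the returned value would be the scalar passed to the optimizer that adjusts the parameters of the first noise reduction model.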
13. A noise reduction method, characterized in that the method is applied to the second noise reduction model according to any one of claims 1-12 and comprises: acquiring RAW image data to be processed, a scene semantic vector at the time the RAW image data to be processed is acquired, and an optical parameter vector corresponding to the RAW image data to be processed, where the RAW image data to be processed includes at least one frame of RAW image data; extracting a first target feature vector of the RAW image data to be processed, a second target feature vector of the optical parameter vector, and a third target feature vector of the scene semantic vector; fusing the first target feature vector, the second target feature vector, and the third target feature vector to obtain a target fusion feature; and inputting the target fusion feature into the second noise reduction model to obtain target output RAW image data.
14. A noise reduction model training device, characterized in that the device includes: an acquisition module, configured to acquire training data pairs, which include first RAW image data, a sample scene semantic vector at the time the first RAW image data is acquired, a sample optical parameter vector corresponding to the acquisition of the first RAW image data, and reference RAW image data corresponding to the first RAW image data, where the first RAW image data includes at least one frame of RAW image data; an extraction module, configured to extract a first feature vector of the first RAW image data, a second feature vector of the sample optical parameter vector, and a third feature vector of the sample scene semantic vector; a fusion module, configured to fuse the first feature vector, the second feature vector, and the third feature vector to obtain a fused feature; an input module, configured to input the fused feature into a first noise reduction model to obtain output RAW image data; and a training module, configured to adjust the model parameters of the first noise reduction model based on the output RAW image data and the reference RAW image data to obtain a second noise reduction model.
15. A noise reduction device, characterized in that the device is applied to the second noise reduction model as described in claim 14 and comprises: an acquisition module, configured to acquire RAW image data to be processed, a scene semantic vector at the time the RAW image data to be processed is acquired, and an optical parameter vector corresponding to the acquisition of the RAW image data to be processed, where the RAW image data to be processed includes at least one frame of RAW image data; an extraction module, configured to extract a first target feature vector of the RAW image data to be processed, a second target feature vector of the optical parameter vector, and a third target feature vector of the scene semantic vector; a fusion module, configured to fuse the first target feature vector, the second target feature vector, and the third target feature vector to obtain a target fusion feature; and a noise reduction module, configured to input the target fusion feature into the second noise reduction model to obtain target output RAW image data.