Model training method, image processing method, device, and program product

By generating and training a facial feature encoder and a latent spatial feature restoration module, the image quality problem when shooting moving objects in dark environments is solved, and high-fidelity face image restoration is achieved.

CN122243816A · Pending · Publication Date: 2026-06-19 · VIVO MOBILE COMM CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
VIVO MOBILE COMM CO LTD
Filing Date
2026-03-11
Publication Date
2026-06-19


Abstract

This application discloses a model training method, an image processing method, an apparatus, and a program product, belonging to the field of image processing technology. The model training method includes: acquiring multiple data pairs, each data pair including a first initial image and a second initial image, wherein the first initial image is an image of a preset object taken in a static state, and the second initial image is an image of the preset object taken in a moving state, and the facial region of the preset object in the second initial image is blurred; generating multiple sets of training data corresponding one-to-one with the multiple data pairs, the training data including original images, reference images, and ground truth images; training an initial feature encoder based on the multiple sets of training data to obtain a facial feature encoder; and training an initial latent spatial feature restoration module based on the multiple sets of training data to obtain a latent spatial feature restoration module.

Description

Technical Field

[0001] This application relates to the field of image processing technology, specifically to a model training method, an image processing method, an apparatus, and a program product.

Background Technology

[0002] In related technologies, multi-frame noise reduction algorithms are typically used to reduce noise in captured images during the operation of a mobile terminal's camera. Multi-frame noise reduction is one of the core methods for improving image quality, especially in low-light, high camera ISO, and portrait night scene scenarios. By acquiring multiple consecutive frames with different exposures and performing temporal fusion, the signal-to-noise ratio of the resulting image can be effectively improved, while restoring and preserving texture and structural details.

[0003] However, multi-frame noise reduction algorithms still face unresolved issues in certain shooting scenarios. For example, when shooting in low-light conditions, the exposure time is often increased to improve the signal-to-noise ratio. However, this leads to greater differences between frames, resulting in severe motion blur, ghosting, and other problems, while also reducing the image's freeze-frame quality. Conversely, decreasing the exposure time results in a poorer input signal-to-noise ratio, causing issues such as distorted faces in dark areas, residual noise, and fake textures. Therefore, in related technologies, the image quality is often poor when shooting moving objects in low-light environments.

Summary of the Invention

[0004] This application provides a model training method, image processing method, apparatus, and program product that can solve the problem of poor image quality when shooting moving objects in dark environments in related technologies.

[0005] In a first aspect, a model training method is provided, including:

[0006] Acquiring multiple data pairs, each data pair including a first initial image and a second initial image. The first initial image is an image of a preset object taken in a static state, and the second initial image is an image of the preset object taken in a moving state. The facial area of the preset object in the second initial image is blurred.

[0007] Generate multiple sets of training data that correspond one-to-one with the multiple data pairs. The training data includes an original image, a reference image, and a ground truth image. The original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it. The reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image. The first intensity is greater than the second intensity. The ground truth image is the first initial image.

[0008] The initial feature encoder is trained based on the multiple sets of training data to obtain the face feature encoder, and the initial latent space feature restoration module is trained based on the multiple sets of training data to obtain the latent space feature restoration module.

[0009] In a second aspect, an image processing method is provided, applied to the face feature encoder and latent spatial feature restoration module described in the first aspect, the method comprising:

[0010] Acquiring a still image and a motion-blurred image, wherein the still image is an image taken when the subject is stationary, the motion-blurred image is an image taken when the subject is in motion, and the facial area of the subject is blurred in the motion-blurred image;

[0011] The still image and the motion-blurred image are respectively input into the face feature encoder to obtain the corresponding first feature map and second feature map; and the still image is input into the variational autoencoder to obtain the first latent space feature map.

[0012] The first feature map, the second feature map, and the first latent space feature map are respectively input into the latent space feature restoration module for feature restoration to obtain the first restored feature.

[0013] The first restored feature is input into the variational autodecoder for decoding to obtain the first restored image.

[0014] In a third aspect, a model training device is provided, comprising:

[0015] The first acquisition module is used to acquire multiple data pairs, each data pair including a first initial image and a second initial image, the first initial image being an image of a preset object in a static state, the second initial image being an image of the preset object in a moving state, and the facial area of the preset object in the second initial image being blurred;

[0016] The generation module is used to generate multiple sets of training data that correspond one-to-one with the multiple data pairs. The training data includes an original image, a reference image, and a ground truth image. The original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it. The reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image. The first intensity is greater than the second intensity. The ground truth image is the first initial image.

[0017] The training module is used to train the initial feature encoder based on the multiple sets of training data to obtain the face feature encoder, and to train the initial latent space feature restoration module based on the multiple sets of training data to obtain the latent space feature restoration module.

[0018] In a fourth aspect, an image processing apparatus is provided, applied to the facial feature encoder and latent spatial feature restoration module described in the first aspect, the apparatus comprising:

[0019] The second acquisition module is used to acquire a still image and a motion-blurred image, wherein the still image is an image taken when the subject is stationary, the motion-blurred image is an image taken when the subject is in motion, and the facial area of the subject is blurred in the motion-blurred image;

[0020] The encoding module is used to input the still image and the motion-blurred image into the face feature encoder to obtain the corresponding first feature map and second feature map, and to input the still image into the variational autoencoder to obtain the first latent space feature map.

[0021] The restoration module is used to input the first feature map, the second feature map and the first latent space feature map into the latent space feature restoration module respectively to restore the features and obtain the first restored feature.

[0022] The decoding module is used to input the first restored feature into the variational autodecoder for decoding to obtain the first restored image.

[0023] In a fifth aspect, an electronic device is provided, comprising a processor and a memory, wherein the memory stores a program or instructions executable on the processor, and the program or instructions, when executed by the processor, implement the steps of the method as described in the first or second aspect.

[0024] In a sixth aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the method described in the first or second aspect.

[0025] In a seventh aspect, embodiments of this application provide a chip, the chip including a processor and a communication interface, the communication interface being coupled to the processor, the processor being used to run programs or instructions to implement the steps of the method described in the first or second aspect.

[0026] In an eighth aspect, embodiments of this application provide a computer program product stored in a storage medium, which is executed by at least one processor to implement the steps of the method described in the first or second aspect.

[0027] In this embodiment, by adding Gaussian noise and darkening the image, images in various scenarios can be simulated as if taken in a dark environment, and training data can be generated based on this, which is beneficial for enriching the diversity of training data. Simultaneously, since a set of training data is generated based on images taken in a static state and motion-blurred images, during training, the initial feature encoder can perform feature encoding on the static and motion-blurred images respectively to obtain the latent features of the face region in the static and motion-blurred images. At the same time, the initial latent spatial feature restoration module can perform feature restoration based on the latent features of the face region in the static and motion-blurred images output by the initial feature encoder, thereby improving the face restoration effect in dark environment sample images. In this way, feature encoding and image restoration can be performed on images taken in dark environments based on the trained face feature encoder and latent spatial feature restoration module, thus facilitating the obtaining of highly realistic face images.

Attached Figure Description

[0028] Figure 1 is a schematic flowchart of a model training method provided in an embodiment of this application;

[0029] Figure 2 is a first schematic flowchart of an image processing method provided in an embodiment of this application;

[0030] Figure 3 is a second schematic flowchart of an image processing method provided in an embodiment of this application;

[0031] Figure 4 is a schematic diagram of the processing flow of the spatial attention mechanism module in an embodiment of this application;

[0032] Figure 5 is a schematic diagram of the processing flow of the latent spatial feature restoration module in an embodiment of this application;

[0033] Figure 6 is a schematic diagram of the structure of a model training device provided in an embodiment of this application;

[0034] Figure 7 is a schematic diagram of the structure of an image processing device provided in an embodiment of this application;

[0035] Figure 8 is a schematic diagram of the structure of an electronic device provided in some embodiments of this application;

[0036] Figure 9 is a schematic diagram of the hardware structure of an electronic device provided in some embodiments of this application.

Detailed Implementation

[0037] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.

[0038] The terms "first," "second," etc., used in this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first" and "second" are generally of the same class, not limited in number; for example, the first object can be one or more. Furthermore, "or" in this application indicates at least one of the connected objects. For example, the scope of protection for "A or B" covers at least three scenarios: Scenario 1: including A but not B; Scenario 2: including B but not A; Scenario 3: including both A and B. In addition, the terms "A and / or B," "at least one of A and B," and "at least one of A or B" also cover at least the above three scenarios. The character " / " generally indicates that the preceding and following objects are in an "or" relationship.

[0039] For ease of understanding, some technical terms in the embodiments of this application are explained below:

[0040] Raw (RAW) images: The original image files generated when a camera or other device captures an image.

[0041] ISO sensitivity: The degree to which a camera is sensitive to light. The name comes from the International Organization for Standardization (ISO), which defines the sensitivity scale.

[0042] Variational Autoencoder (VAE): A generative model based on deep learning that uses probabilistic modeling to achieve data compression, feature extraction, and new sample generation.

[0043] Transformer model: a deep neural network model structure.

[0044] Stable Diffusion (SD) model: A diffusion-based generative model. During training, noise is gradually added to a clear image until it becomes pure random noise; a neural network is then trained to remove the noise step by step and restore the clear original image.

[0045] Prompt: When interacting with systems such as generative large language models, the input instructions or text provided by the user to the model are used to describe the task, provide context and constraints, and thus guide the model to generate the expected output.

[0046] Contrastive Language–Image Pre-training (CLIP) model: A multimodal pre-trained neural network whose core idea is to learn the alignment relationship between vision and language by pre-training with a large amount of paired image and text data, thereby achieving cross-modal understanding.

[0047] The model training method, image processing method, apparatus, and program product provided in this application will be described in detail below with reference to the accompanying drawings and through some embodiments and application scenarios.

[0048] Please see Figure 1, which is a flowchart illustrating the model training method provided in an embodiment of this application. The model training method includes the following steps:

[0049] Step 101: Acquire multiple data pairs, each data pair including a first initial image and a second initial image, the first initial image being an image of a preset object in a static state, the second initial image being an image of the preset object in a moving state, and the facial area of the preset object in the second initial image being blurred;

[0050] Step 102: Generate multiple sets of training data corresponding one-to-one with the multiple data pairs, wherein the training data includes an original image, a reference image and a ground truth image; the original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it; the reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image; the first intensity is greater than the second intensity; and the ground truth image is the first initial image.

[0051] Step 103: Train the initial feature encoder based on the multiple sets of training data to obtain the face feature encoder; and train the initial latent space feature restoration module based on the multiple sets of training data to obtain the latent space feature restoration module.

[0052] The data sources for the aforementioned data pairs can include the following two aspects: First, publicly available high-definition video data can be used to collect 2-3 seconds of data showing a face transitioning from stillness to motion; second, a camera's burst mode can be used to capture clean images of a face transitioning from stillness to motion. The camera can be any type of camera with a burst mode, such as a DSLR camera. Both data sources must ensure the presence of at least one frame of still face data and at least one frame of motion-blurred face data during acquisition. For example, for high-definition video data, all video frames within a short timeframe of a person's face transitioning from stillness to motion can be accurately located, and a still image and a motion-blurred image can be selected to generate a data pair. Since high-definition video data can include images of various skin tones, age groups, and various dimly lit scenes, it can enrich the diversity of the sample data. As another example, multiple images of a specific person transitioning from stillness to motion in a specific scene can be captured by a camera, and a still image and a motion-blurred image can be selected to generate a data pair. This allows for targeted training data, making the trained model more aligned with user needs. The above methods can be used to obtain the aforementioned data pairs.

[0053] To facilitate understanding, this application embodiment uses the generation of a corresponding set of training data from one of the data pairs as an example to further explain the implementation process of "generating multiple sets of training data that correspond one-to-one with the multiple data pairs":

[0054] First, a data pair is randomly selected. The first initial image in this data pair is used as the ground truth for neural network training and named T. Then, relatively strong Gaussian noise, that is, the Gaussian noise of the first intensity mentioned above, is randomly added to the first initial image to simulate the worse signal-to-noise ratio (and stronger freeze-frame capability) that results from shortening the exposure time. The noisy image is then randomly darkened to simulate the low image brightness caused by shortening the exposure time, and the darkened image is used as the original image I. Next, relatively weak Gaussian noise, that is, the Gaussian noise of the second intensity mentioned above, is added to the second initial image, and the result is named the reference image R. R simulates the better signal-to-noise ratio and the motion blur that result from extending the exposure time. To prevent information loss due to overexposure, the image is not brightened.
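The generation of one set of training data described above can be sketched as follows. This is only an illustrative sketch, not the applicant's implementation: the function name `make_training_triplet`, the noise levels `strong_sigma` and `weak_sigma`, and the darkening range `darken_range` are assumptions chosen for the example, and images are assumed to be floating-point arrays in [0, 1].

```python
import numpy as np

def make_training_triplet(first_initial, second_initial, rng,
                          strong_sigma=0.10, weak_sigma=0.02,
                          darken_range=(0.2, 0.6)):
    """Build (I, R, T) from one data pair of float images in [0, 1].

    strong_sigma, weak_sigma and darken_range are hypothetical values;
    the source only requires the first intensity to exceed the second.
    """
    assert strong_sigma > weak_sigma
    # Ground truth T is the still (first initial) image itself.
    T = first_initial.copy()
    # Original image I: strong Gaussian noise, then random darkening.
    noisy = first_initial + rng.normal(0.0, strong_sigma, first_initial.shape)
    gain = rng.uniform(*darken_range)
    I = np.clip(noisy * gain, 0.0, 1.0)
    # Reference image R: weak Gaussian noise, no brightening.
    R = np.clip(second_initial + rng.normal(0.0, weak_sigma, second_initial.shape),
                0.0, 1.0)
    return I, R, T

rng = np.random.default_rng(0)
still = np.full((8, 8, 3), 0.5)     # stand-in for the first initial image
blurred = np.full((8, 8, 3), 0.5)   # stand-in for the second initial image
I, R, T = make_training_triplet(still, blurred, rng)
```

In this sketch the darkening is a single random gain applied to the whole image; the source does not say how the darkening is performed, so other choices (for example a gamma curve) would be equally compatible with the description.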

[0055] Here, I and R can be RAW format images used for subsequent model training. For easier observation, image T can be an RGB image.

[0056] The aforementioned initial feature encoder can be any type of image encoder. The initial feature encoder can be trained using the model training method provided in this application embodiment, so that the trained facial feature encoder has a better encoding effect on facial features.

[0057] The aforementioned initial latent space feature restoration module can be a single-step diffusion model used to restore features based on the received first feature map, second feature map, and first latent space feature map, to obtain the first restored feature.

[0058] During training, the original image and reference image are input into the initial feature encoder to obtain encoded features. These encoded features are then input into the initial latent space feature restoration module for restoration, resulting in a predicted restored image. A loss value is determined based on the difference between the predicted restored image and the ground truth image. The parameters of both the initial feature encoder and the initial latent space feature restoration module are then updated based on this loss value to optimize them.

[0059] In this implementation, by adding Gaussian noise and darkening the image, images from various scenarios can be simulated as if taken in a dark environment, and training data can be generated based on this, thus enriching the diversity of the training data. Simultaneously, since a set of training data is generated based on images taken in a static state and motion-blurred images, during training, the initial feature encoder can encode features in both static and motion-blurred images separately to obtain the latent features of the face region in both images. Meanwhile, the initial latent spatial feature restoration module can perform feature restoration based on the latent features of the face region in the static and motion-blurred images output by the initial feature encoder, thereby improving the face restoration effect in dark environment sample images. In this way, feature encoding and image restoration can be performed on images taken in dark environments based on the trained face feature encoder and latent spatial feature restoration module, resulting in highly realistic face images.

[0060] Optionally, training the initial feature encoder based on the multiple sets of training data to obtain the facial feature encoder includes:

[0061] Alignment and region segmentation are performed on the original image and the reference image in the first training data to obtain N first sub-images and N second sub-images. The first training data is any one of the multiple sets of training data. The N first sub-images are images of N different regions in the original image, and the N second sub-images are images of N different regions in the reference image. The N first sub-images and the N second sub-images correspond one-to-one.

[0062] Based on the N first sub-images and the N second sub-images, N corresponding sub-training data are generated, wherein the sub-training data includes a set of corresponding first sub-images and second sub-images;

[0063] The initial feature encoder is trained based on the N sub-training data to obtain the facial feature encoder.

[0064] The above-mentioned alignment and region segmentation processing of the original image and the reference image in the first training data may include: firstly aligning the original image and the reference image in the first training data, and then performing region segmentation processing on the aligned original image and reference image.

[0065] The alignment process described above can be based on various alignment methods in related technologies. After alignment, the two images are stacked together, so that the original and reference images have the same image size and the same coordinates for all four vertices in the same coordinate system. The aligned, stacked images can then be cut into N blocks along the same direction, with corresponding blocks of the two images coinciding in the coordinate system. One block from the original image serves as a first sub-image, and one block from the reference image serves as a second sub-image. Two blocks at the same position correspond to each other and together form one sub-training data. Thus, N sub-training data can be generated from each set of training data. The generated N sub-training data can be represented as $(I_1, R_1), (I_2, R_2), \ldots, (I_N, R_N)$, where $I_k$ and $R_k$ denote the k-th first sub-image and the k-th second sub-image.
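The region segmentation step can be illustrated with a short sketch. It assumes the two images have already been aligned (the alignment method itself is not specified in the source), and it cuts both images into N horizontal strips; the function name `split_into_sub_pairs` and the choice of horizontal strips are assumptions made for the example.

```python
import numpy as np

def split_into_sub_pairs(original, reference, n_blocks):
    """Cut two aligned images into n_blocks strips and pair them by position."""
    if original.shape != reference.shape:
        raise ValueError("aligned images must have the same size")
    # Cutting along axis 0 yields horizontal strips in both images.
    first_subs = np.array_split(original, n_blocks, axis=0)
    second_subs = np.array_split(reference, n_blocks, axis=0)
    # Blocks at the same position form one sub-training data item.
    return list(zip(first_subs, second_subs))

original = np.arange(12 * 4).reshape(12, 4)   # stand-in for the aligned original image
reference = original + 100                    # stand-in for the aligned reference image
pairs = split_into_sub_pairs(original, reference, n_blocks=3)
```

Each element of `pairs` holds one first sub-image and the second sub-image covering the same coordinates, which is the correspondence the paragraph above relies on.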

[0066] In some embodiments of this application, the above-described alignment and region segmentation processes can be performed on all generated data first, and all generated sub-training data can be used as training datasets. Then, the initial feature encoder can be iteratively trained using the sub-training data in the training datasets, and the finally trained model can be used as the face feature encoder.

[0067] In this embodiment, since the image content in the training data mainly includes images of faces transitioning from stillness to motion, the background content in the images is basically the same in the original and reference images, while there are significant differences between the original and reference images in the face region. Based on this, in this embodiment, the original and reference images are aligned and region-divided to generate sub-training data. Thus, the differences between two sub-images generated for the face region are larger, while the differences between two sub-images generated for other regions are smaller. This helps the model to narrow the distance between two face images in the same sub-training data during training, resulting in a better encoding effect for the face feature encoder.

[0068] Optionally, training the initial feature encoder based on the N sub-training data to obtain the facial feature encoder includes:

[0069] The first sub-image and the second sub-image in the first sub-training data are respectively input into the initial feature encoder for feature encoding to obtain the first sub-encoding information and the second sub-encoding information, wherein the first sub-training data is any one of the N sub-training data;

[0070] Calculate the loss between the first sub-encoded information and the second sub-encoded information to obtain the initial loss value;

[0071] The negative value of the initial loss value is used as the target loss value, and the parameters of the initial feature encoder are optimized based on the target loss value to obtain the face feature encoder.
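The three steps above can be sketched as follows. The source does not name the loss computed between the two encodings, so the mean squared error used here is an assumption, and `toy_encoder` is a hypothetical linear stand-in for the initial feature encoder rather than a real network.

```python
import numpy as np

def toy_encoder(image, weights):
    """Hypothetical stand-in encoder: flatten the image and apply a linear map."""
    return weights @ image.reshape(-1)

def encoder_target_loss(first_sub, second_sub, weights):
    """Return (initial_loss, target_loss) for one sub-training data item."""
    code_a = toy_encoder(first_sub, weights)    # first sub-encoding information
    code_b = toy_encoder(second_sub, weights)   # second sub-encoding information
    # Assumed loss: mean squared error between the two encodings.
    initial_loss = float(np.mean((code_a - code_b) ** 2))
    # The target loss is the negative of the initial loss.
    target_loss = -initial_loss
    return initial_loss, target_loss

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 16))
first_sub = rng.random((4, 4))
second_sub = rng.random((4, 4))
initial_loss, target_loss = encoder_target_loss(first_sub, second_sub, weights)
```

In an actual training loop the target loss would be passed to an optimizer that updates the encoder parameters; that part is omitted here because the source does not specify the optimizer or framework.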

[0072] In some embodiments of this application, during the training of the initial feature encoder, one sub-training data can be randomly selected and input into the initial feature encoder for encoding, yielding two high-dimensional compressed feature maps output by the initial feature encoder. Typically, when shooting moving portraits at night, the two images are exposures of different brightness, so their backgrounds often do not differ significantly. Because the subject moves, however, the position of the face changes considerably between the two images, and the long exposure introduces motion blur, so the face is represented very differently in the two images. Therefore, among the sub-training data, pairs containing the background have higher similarity and a relatively lower loss value, while pairs containing the face have lower similarity and a relatively higher loss value. Based on this, the calculated loss value is negated during training. In this way, the distance between the face information in the two high-dimensional feature maps is drawn closer, while the parts that do not belong to the face are pushed apart, thereby increasing the weight of the face in the feature map and providing prior information for subsequent latent spatial feature restoration.

[0073] In this embodiment, the first sub-image and the second sub-image in the first sub-training data are encoded based on the initial feature encoder to obtain first sub-encoding information corresponding to the first sub-image and second sub-encoding information corresponding to the second sub-image. The first sub-training data is any one of the N sub-training data. The loss between the first sub-encoding information and the second sub-encoding information is calculated to obtain an initial loss value. The negative value of the initial loss value is used as the target loss value, and the parameters of the initial feature encoder are optimized based on the target loss value to obtain the face feature encoder. In this way, the trained face feature encoder can bring the distance between the face information in the two high-dimensional feature maps closer and pull away the parts that do not belong to the face, thereby increasing the weight of the face part in the feature map and providing prior information for subsequent latent spatial feature restoration.

[0074] Optionally, training the initial latent spatial feature restoration module based on the multiple sets of training data to obtain the latent spatial feature restoration module includes:

[0075] The original image and the reference image in the second training data are input into the face feature encoder to obtain the corresponding third feature map and fourth feature map. The original image in the second training data is input into the variational autoencoder to obtain the third latent space feature map. The second training data is any one of the multiple sets of training data.

[0076] Noise is added to the third latent space feature map to obtain the fourth latent space feature map;

[0077] The third feature map, the fourth feature map, and the fourth latent space feature map are respectively input into the initial latent space feature restoration module for feature restoration to obtain the second restored feature;

[0078] The second restored feature is input into the variational autodecoder for decoding to obtain the second restored image;

[0079] Calculate the loss information between the second restored image and the second training data, wherein the loss information includes reconstruction loss, perceptual loss and contrastive loss;

[0080] Based on the loss information, the parameters of the initial latent spatial feature restoration module are optimized to obtain the latent spatial feature restoration module.
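The data flow of the steps above can be sketched as a single forward pass. Every callable passed in below (`face_encoder`, `vae_encode`, `restore`, `vae_decode`) is a toy placeholder standing in for the corresponding network, and `noise_sigma` is a hypothetical noise level; the sketch shows only the order in which the pieces are connected, not how any of them is built.

```python
import numpy as np

def restoration_forward(original, reference, face_encoder, vae_encode,
                        restore, vae_decode, rng, noise_sigma=0.1):
    """One forward pass of the restoration training flow, using stand-in callables."""
    third_map = face_encoder(original)     # third feature map
    fourth_map = face_encoder(reference)   # fourth feature map
    third_latent = vae_encode(original)    # third latent space feature map
    # Add Gaussian noise to obtain the fourth latent space feature map.
    fourth_latent = third_latent + rng.normal(0.0, noise_sigma, third_latent.shape)
    # Single restoration step on the two feature maps and the noised latent.
    second_restored = restore(third_map, fourth_map, fourth_latent)
    # Decode back to image size to obtain the second restored image.
    return vae_decode(second_restored)

# Toy stand-ins (assumptions, not real networks).
downsample = lambda x: x[::2, ::2]
upsample = lambda z: np.repeat(np.repeat(z, 2, axis=0), 2, axis=1)
average = lambda a, b, z: (a + b + z) / 3.0

rng = np.random.default_rng(0)
original = rng.random((8, 8))
reference = rng.random((8, 8))
restored = restoration_forward(original, reference, downsample, downsample,
                               average, upsample, rng)
```

The loss computation and the parameter update of the restoration module would follow this forward pass; they are left out because they depend on the training framework.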

[0081] In some embodiments of this application, after the facial feature encoder has been trained, the following second training step can be performed, as shown in Figure 5. In each iteration of training, a portrait data pair is input. The degraded, bright, moving image is labeled as the reference image R, the degraded, low-brightness, still image is labeled as the original image I, and the original target image T is also input. The entire training adopts the SD model training method. During training, the reference image R and the original image I are processed by the face feature encoder to generate the corresponding feature maps of R and I. At the same time, the original image I is processed by the VAE Encoder to generate a latent space feature map, which is input into the latent spatial feature restoration module together with the feature maps of R and I. The latent spatial feature restoration module is a single-step diffusion model, and its processing flow is shown in Figure 5. In the latent space feature restoration module, Gaussian noise is first added to the latent space feature map; the noised latent feature and the facial feature encodings are then input into the UNet together, and after a single-step iteration, the first restored feature is obtained. In this embodiment, by introducing facial feature encoding into the latent space feature restoration module, the neural network can effectively utilize multi-frame information with different brightness levels to restore faces in dark areas, ensuring that the image is not distorted.

[0082] Finally, the first restored feature is reconstructed to the original image size by the VAE Decoder to obtain the generated image G. Reconstruction loss, perceptual loss, and contrastive loss are used during training. Reconstruction loss keeps the generated image G consistent with T and can be implemented using L1 Loss. Perceptual loss keeps the generated image G perceptually realistic with respect to T and can be implemented using LPIPS Loss. Contrastive loss lets the model use contrastive learning to draw the feature representations of the two images closer together and can be implemented using Contrastive Loss.

[0083] The L1 loss mentioned above is also known as the least absolute deviation (LAD) loss, and its loss function is:

[0084] $$S = \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

[0085] where $y_i$ is the target value, that is, the ground truth image; $\hat{y}_i$ is the estimated value, that is, the second restored image obtained by decoding the second restored feature using the variational autodecoder; and $n$ is the number of samples.
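The L1 loss described above can be illustrated with a minimal Python sketch. The function name `l1_loss` and the flattened pixel lists are hypothetical, and the sketch assumes the loss is the sum of absolute differences over the n samples; a mean-reduced variant is shown alongside, since the reduction used in practice is not fixed by the text.

```python
def l1_loss(targets, estimates):
    """Sum of absolute deviations between target and estimated values."""
    if len(targets) != len(estimates):
        raise ValueError("targets and estimates must have the same length")
    return sum(abs(y - y_hat) for y, y_hat in zip(targets, estimates))


# Flattened pixel values of a ground truth image and a restored image
# (illustrative numbers only, not data from the application).
truth = [0.2, 0.5, 0.9, 0.4]
restored = [0.1, 0.5, 0.7, 0.6]

total = l1_loss(truth, restored)   # summed absolute error
mean = total / len(truth)          # mean-reduced variant
```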

[0086] The calculation of the above LPIPS Loss can be performed through the following process:

[0087] Feature extraction: a pre-trained CNN $F(\cdot)$ is used as the feature extractor. For two input images $x$ and $y$, feature maps are extracted at multiple depth layers (L) of the network, as shown in the formula:

[0088] $$F_i(x), \quad F_i(y), \qquad i \in L$$

[0089] $F_i(x)$ and $F_i(y)$ are each normalized to obtain $\hat{F}_i(x)$ and $\hat{F}_i(y)$, and the squared L2 distance between the normalized feature maps is calculated at each spatial location. The distance is then averaged over the spatial dimensions to obtain the perceptual distance of that layer. Finally, the distances of all layers are weighted and summed to obtain the final LPIPS loss. The formula for calculating the LPIPS loss is as follows:

[0090] $$L_{\mathrm{LPIPS}} = \sum_{i \in L} w_i \, \frac{1}{H_i W_i} \sum_{h,w} \left\| \hat{F}_i(x)_{h,w} - \hat{F}_i(y)_{h,w} \right\|_2^2$$

[0091] where $i$ represents the i-th network depth layer, $H_i$ and $W_i$ are the height and width of the feature map at that layer, and $w_i$ is the weight of that layer.
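The LPIPS-style procedure above (normalize, take the per-location distance, average spatially, then weight and sum over layers) can be sketched as follows. This is a simplified illustration, not the reference LPIPS implementation: the feature maps are supplied directly as nested lists rather than produced by a pre-trained CNN, one scalar weight per layer is assumed, and all function names are hypothetical.

```python
import math


def unit_normalize(vec, eps=1e-10):
    """Scale a channel vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def layer_distance(feat_x, feat_y):
    """Squared L2 distance of normalized features, averaged over positions.

    feat_x and feat_y are lists of channel vectors, one per spatial position.
    """
    total = 0.0
    for vx, vy in zip(feat_x, feat_y):
        nx, ny = unit_normalize(vx), unit_normalize(vy)
        total += sum((a - b) ** 2 for a, b in zip(nx, ny))
    return total / len(feat_x)


def perceptual_distance(feats_x, feats_y, layer_weights):
    """Weighted sum of the per-layer distances."""
    return sum(
        w * layer_distance(fx, fy)
        for w, fx, fy in zip(layer_weights, feats_x, feats_y)
    )


# Two layers, each with two spatial positions and two channels (toy values).
feats_g = [[[1.0, 0.0], [0.5, 0.5]], [[0.2, 0.8], [1.0, 1.0]]]
feats_t = [[[0.9, 0.1], [0.5, 0.5]], [[0.3, 0.7], [1.0, 0.9]]]
distance = perceptual_distance(feats_g, feats_t, layer_weights=[0.5, 0.5])
```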

[0092] The aforementioned contrastive loss can effectively handle the relationship between paired data in such networks. The calculation process of the contrastive loss is as follows:

[0093]

[0094] where n is the number of samples, the similarity measure is cosine similarity, and the paired inputs are the overlapping image pair, that is, the original image and the reference image after alignment.
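The cosine similarity used by the contrastive loss can be sketched as below. Only the similarity measure itself is taken from the text; averaging over the n pairs in `mean_pair_similarity` is an assumption made for illustration, as are the function names and the toy encodings.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def mean_pair_similarity(pairs):
    """Average cosine similarity over n (original, reference) encoding pairs.

    Assumption: a plain mean over the pairs; the exact aggregation used in
    the application is not stated here.
    """
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


# Encodings of aligned original/reference image pairs (toy values).
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
similarity = mean_pair_similarity(pairs)
```

A training objective that pulls paired encodings together would typically minimize the negative of such a similarity.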

[0095] In this embodiment, the original image in the second training data is encoded based on the facial feature encoder to obtain a third feature map; a reference image in the second training data is encoded based on the facial feature encoder to obtain a fourth feature map; and the original image in the second training data is encoded based on the variational autoencoder to obtain a third latent space feature map, wherein the second training data is any one of the multiple sets of training data; noise is added to the third latent space feature map to obtain a fourth latent space feature map; the third feature map, the fourth feature map, and the fourth latent space feature map are respectively input to an initial latent space feature restoration module for feature restoration to obtain a second restored feature; the second restored feature is decoded based on the variational autodecoder to obtain a second restored image; loss information between the second restored image and the second training data is calculated, wherein the loss information includes reconstruction loss, perceptual loss, and contrastive loss; the parameters of the initial latent space feature restoration module are optimized based on the loss information to obtain the latent space feature restoration module. Thus, the trained latent space feature restoration module can have better image restoration performance.

[0096] Referring to Figure 2, Figure 2 shows an image processing method provided in an embodiment of this application, which is applied to the face feature encoder and latent spatial feature restoration module described in the above embodiments. The method includes:

[0097] Step 201: Acquire a still image and a motion-blurred image, wherein the still image is an image taken when the subject is stationary, and the motion-blurred image is an image taken when the subject is in motion, and the face area of the subject is blurred in the motion-blurred image;

[0098] Step 202: Input the still image and the motion-blurred image into the face feature encoder to obtain the corresponding first feature map and second feature map, and input the still image into the variational autoencoder to obtain the first latent space feature map;

[0099] Step 203: Input the first feature map, the second feature map, and the first latent space feature map into the latent space feature restoration module respectively for feature restoration to obtain the first restored feature;

[0100] Step 204: Input the first restored feature into the variational autodecoder for decoding to obtain the first restored image.
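Steps 201 to 204 can be summarized as a short data-flow sketch. The four callables are placeholders for the trained face feature encoder, the VAE Encoder, the latent spatial feature restoration module, and the VAE Decoder; their names and signatures are assumptions made for illustration, not interfaces defined by this application.

```python
def restore_face(still_image, blurred_image,
                 face_encoder, vae_encoder, restorer, vae_decoder):
    """Run the four-step inference flow on one still/motion-blurred pair."""
    # Step 202: encode both images with the same face feature encoder,
    # and encode only the still image with the VAE encoder.
    first_feature_map = face_encoder(still_image)
    second_feature_map = face_encoder(blurred_image)
    first_latent = vae_encoder(still_image)

    # Step 203: restore features in latent space using all three inputs.
    first_restored_feature = restorer(
        first_feature_map, second_feature_map, first_latent
    )

    # Step 204: decode the restored feature back to image space.
    return vae_decoder(first_restored_feature)


# Stand-in callables that only trace the data flow (not real models).
result = restore_face(
    "still", "blurred",
    face_encoder=lambda img: f"feat({img})",
    vae_encoder=lambda img: f"latent({img})",
    restorer=lambda a, b, z: f"restore({a}, {b}, {z})",
    vae_decoder=lambda r: f"decode({r})",
)
```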

[0101] It is understood that the facial feature encoder and the latent spatial feature restoration module can be trained based on the model training method in the above embodiments.

[0102] The aforementioned still images and motion-blurred images can be images captured in various dark environments, such as images taken at night or during the day in relatively dim conditions. Of course, the image processing method described above can be applied not only to shooting scenarios in dark environments but also to other shooting scenarios with normal or excessively bright environments. It should be noted that the image processing method provided in this application embodiment has a good image processing effect on images captured in dark environments, resulting in a first restored image with both a high signal-to-noise ratio and good freeze-frame capability.

[0103] In some embodiments of this application, the still image and the motion-blurred image are images captured when the ambient brightness is less than a preset brightness threshold. The preset brightness threshold can be determined based on actual conditions: when the ambient brightness is lower than the preset brightness threshold, the shooting environment can be considered a dark environment; correspondingly, when the ambient brightness is higher than or equal to the preset brightness threshold, it can be considered a bright environment. For example, in some embodiments of this application, the preset brightness threshold can range from 25 lux to 35 lux.
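The threshold rule above amounts to a single comparison, sketched below. The constant of 30 lux is an assumed value chosen from within the 25 to 35 lux range mentioned in the text, and the function name is hypothetical.

```python
# Assumed value inside the 25-35 lux range given in the text.
PRESET_BRIGHTNESS_THRESHOLD_LUX = 30.0


def is_dark_environment(ambient_lux,
                        threshold_lux=PRESET_BRIGHTNESS_THRESHOLD_LUX):
    """Dark if strictly below the threshold; bright if at or above it."""
    return ambient_lux < threshold_lux
```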

[0104] The aforementioned still images and motion-blurred images can be images continuously captured by an electronic device during a short period of time, from when the subject is at rest to when it begins to move. For example, they can be images captured within 2 to 3 seconds after the subject begins to move. The backgrounds in the still image and the motion-blurred image are essentially the same; the difference lies in the state of the subject. Specifically, the subject in the still image is stationary, so the subject is captured sharply. However, when the subject is captured in a dark environment, the signal-to-noise ratio (SNR) of the still image is relatively low because its exposure time is relatively short. Conversely, the exposure time of the motion-blurred image is longer than that of the still image, resulting in a higher SNR. However, because the subject is in motion during this exposure, the subject is blurred; blurring of the face, which is of particular interest to the user, is generally unacceptable to the user. In other words, the aforementioned still image can be a low-brightness still image, while the aforementioned motion-blurred image is a high-brightness moving image.

[0105] The subject of the photograph can be a person or other animals. For ease of understanding, this application embodiment takes the user's face image as an example to further explain the above image processing method.

[0106] The aforementioned facial feature encoder can be any type of image encoder, and can be a pre-trained encoder that encodes facial features well. Thus, the first feature map and the second feature map obtained by encoding the still image and the motion-blurred image with the facial feature encoder can better represent the latent features of the subject's face.

[0107] The aforementioned variational autoencoder may include an encoder (VAE Encoder) and a decoder (VAE Decoder). Encoding the still image based on the variational autoencoder to obtain the first latent spatial feature map can mean: encoding the still image based on the VAE Encoder to obtain the first latent spatial feature map. Correspondingly, decoding the first restored feature based on the variational autodecoder to obtain the first restored image can mean: decoding the first restored feature based on the VAE Decoder to obtain the first restored image.

[0108] The aforementioned latent spatial feature restoration module can be a single-step diffusion model used to restore features based on the received first feature map, second feature map, and first latent spatial feature map to obtain the first restored feature.

[0109] It should be noted that the image processing method provided in this application embodiment, compared to the SD model in related technologies, does not require a prompt input, but instead utilizes the latent information of long and short exposure images already captured in the image capture chain. These long and short exposure images include the still image and the motion-blurred image.

[0110] In this embodiment, since still images and motion-blurred images acquired in the same continuous time period have similar structural features and high-dimensional semantic features of facial visual features, by acquiring still images and motion-blurred images, encoding the still images based on a facial feature encoder to obtain a first feature map, and encoding the motion-blurred images based on the facial feature encoder to obtain a second feature map, the potential information of the facial region in the still images and motion-blurred images can be obtained. The first feature map, the second feature map, and the first latent spatial feature map are then input into the latent spatial feature restoration module for feature restoration, which helps to improve the face restoration effect in images taken in dark environments, so as to obtain highly realistic face images.

[0111] Optionally, the facial feature encoder includes a feature extraction module, a spatial attention mechanism module, and a feature encoding module connected in sequence. The step of inputting the still image and the motion-blurred image into the facial feature encoder to obtain corresponding first and second feature maps includes:

[0112] The still image is input into the feature extraction module for feature extraction to obtain a first extracted feature, and the motion-blurred image is input into the feature extraction module for feature extraction to obtain a second extracted feature;

[0113] The first extracted feature is input into the spatial attention mechanism module to obtain the first attention feature, and the second extracted feature is input into the spatial attention mechanism module to obtain the second attention feature;

[0114] The first attention feature is input into the feature encoding module for encoding to obtain the first feature map, and the second attention feature is input into the feature encoding module for encoding to obtain the second feature map.

[0115] In some embodiments of this application, the aforementioned facial feature encoder may specifically be a human face feature encoder. Referring to Figure 3, a still image I and a motion-blurred image R can be simultaneously input into the face feature encoder to obtain a first feature map and a second feature map output by the face feature encoder. These first and second feature maps represent high-dimensional face feature encoding information. The first latent space feature map, which is obtained by encoding the still image I with the VAE Encoder, is then fed together with the first feature map and the second feature map into the latent space feature restoration module for feature restoration, yielding a first restored feature. Finally, the first restored feature is decoded by the VAE Decoder to obtain the reconstructed portrait.

[0116] The aforementioned face feature encoder can be obtained based on contrastive learning. Contrastive learning can narrow the distance between similar features by comparing the similarity and differences between samples. Therefore, it can improve the weight of face regions in an image, providing prior information for subsequent latent spatial feature restoration.

[0117] The face feature encoder consists of three modules: a feature extraction module, a spatial attention mechanism module, and a feature encoding module. The feature extraction module can be a feature extractor, and the feature encoding module can be a Transformer feature encoder.

[0118] The feature extraction module can be constructed from multiple layers of convolutional neural networks. The spatial attention mechanism module mainly consists of convolution and matrix operations, as shown in Figure 4. The input to the spatial attention mechanism module is a feature map A whose size is 1/8 of the original image. In some embodiments of this application, this feature map can be represented as C×H×W, where C represents the number of channels, H represents the height of the feature map, and W represents the width of the feature map. Referring to Figure 4, B, C, and D are representations of feature map A. Feature maps B, C, and D are then each reshaped into a matrix of size C×N, where N = H×W. These three matrices correspond to Q, K, and V in the self-attention mechanism. Through the spatial attention module, the relative factors of pixel-to-pixel interactions in deep features can be obtained. This provides information on positional changes of related pixels for the subsequent prediction of similar features, making it easier to establish long-range relationships between pixels.

[0119] The Transformer Feature Encoder is a network architecture based on the Transformer design. This module consists of three parts: a self-attention mechanism, a feedforward neural network, and a multi-layer convolutional neural network. The self-attention mechanism is the core architecture of the Transformer; its core principle is to allow each element in the sequence to interact directly with all other elements to capture global dependencies. The self-attention mechanism can be described by the following formula, where Q, K, and V are feature vectors, the intermediate terms are internal variables of the model structure, and $d_k$ is a constant that does not change during training:

[0120] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
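The self-attention computation referred to above is commonly written as softmax(QKᵀ/√d_k)·V. The following is a minimal pure-Python sketch that assumes this standard scaled dot-product form; Q, K, and V are given as lists of row vectors, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])  # key dimension, fixed by the architecture
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ])
    return output


# Two positions with two-dimensional features (toy values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = attention(Q, K, V)
```

Because every query is compared against every key, each output position mixes information from all positions, which is what lets the mechanism capture global dependencies.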

[0121] In this embodiment, since the self-attention mechanism can further extract potential connections in features, the face feature encoder includes a feature extraction module, a spatial attention mechanism module, and a feature encoding module connected in sequence. The still image is input into the feature extraction module for feature extraction to obtain a first extracted feature; the motion-blurred image is input into the feature extraction module for feature extraction to obtain a second extracted feature; the first extracted feature is input into the spatial attention mechanism module to obtain a first attention feature; the second extracted feature is input into the spatial attention mechanism module to obtain a second attention feature; the first attention feature is input into the feature encoding module for encoding to obtain a first feature map; and the second attention feature is input into the feature encoding module for encoding to obtain a second feature map. This facilitates further extraction of potential features from both the still image and the motion-blurred image, thereby improving the feature encoding effect of the face feature encoder.

[0122] Optionally, the step of inputting the first feature map, the second feature map, and the first latent space feature map into the latent space feature restoration module for feature restoration to obtain the first restored feature includes:

[0123] Gaussian noise is added to the first latent spatial feature map to obtain the second latent spatial feature map;

[0124] The first feature map, the second feature map, and the second latent space feature map are respectively input into the latent space feature restoration module for feature restoration to obtain the first restored feature.

[0125] In related technologies, the basic process of a conventional diffusion model is to diffuse the input latent feature Z by adding noise according to $Z_t = \alpha_t Z + \beta_t \epsilon$, and then to use a neural network to predict the noise at each step t and denoise $Z_t$ to obtain the clean feature $Z_0$, where $Z_t$ is the data after noise is added at step t; $\alpha_t$ is the scalar coefficient controlling the proportion of signal retained, typically satisfying $0 < \alpha_t < 1$; Z is the original clean data, usually $Z_0$; $\beta_t$ is the scalar coefficient controlling the noise intensity, which together with $\alpha_t$ typically satisfies $\alpha_t^2 + \beta_t^2 = 1$ or $\beta_t = 1 - \alpha_t$; and $\epsilon$ is standard Gaussian noise, i.e. $\epsilon \sim \mathcal{N}(0, I)$. In the denoising process, diffusion models in related technologies often lack sufficient prior information, leading to a high probability of mistakenly damaging key information in the original features and resulting in distorted images. However, the latent spatial feature restoration module in this embodiment incorporates the encoding result output by a face feature encoder. This allows the latent spatial feature restoration module to effectively utilize multi-frame information at different brightness levels to restore faces in dark areas, ensuring image integrity. The encoding result output by the face feature encoder is the aforementioned first feature map and second feature map.
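The noising step described above, in which the noised latent is a weighted combination of the clean latent and standard Gaussian noise, can be sketched as follows. The sketch assumes the variant in which the squares of the two coefficients sum to 1; the function name, the flat-list representation of the latent feature, and the sample values are illustrative assumptions.

```python
import math
import random


def add_noise(z, alpha_t, rng=None):
    """Return Z_t = alpha_t * Z + beta_t * eps, with eps ~ N(0, 1).

    Assumes alpha_t**2 + beta_t**2 == 1, one of the two conventions
    mentioned in the text.
    """
    if not 0.0 < alpha_t < 1.0:
        raise ValueError("alpha_t must lie in (0, 1)")
    rng = rng or random.Random()
    beta_t = math.sqrt(1.0 - alpha_t ** 2)
    return [alpha_t * v + beta_t * rng.gauss(0.0, 1.0) for v in z]


# Flattened latent feature (toy values) noised with a fixed seed.
latent = [0.5, -0.2, 1.3, 0.0]
noised = add_noise(latent, alpha_t=0.9, rng=random.Random(0))
```

In a single-step diffusion model, the network receives such a noised latent once and predicts the restored feature directly rather than iterating over many steps.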

[0126] The model architecture of the above diffusion model can be set as needed. For example, in some embodiments of this application, the above diffusion model can be a UNet model generated based on the UNet architecture.

[0127] In this embodiment, Gaussian noise is added to the first latent spatial feature map to obtain a second latent spatial feature map; the first feature map, the second feature map, and the second latent spatial feature map are respectively input to the latent spatial feature restoration module for feature restoration to obtain the first restored feature. In this way, by introducing the encoding result output by the face feature encoder into the input of the single-step diffusion model, the latent spatial feature restoration module can effectively utilize multi-frame information with different brightness to restore faces in dark areas, ensuring that the image is not distorted.

[0128] In some embodiments of this application, after completing the above-described model training process, when it is necessary to reconstruct an image based on the trained model, a set consisting of a bright moving image R and a low-brightness still image I can be simultaneously input into the face feature encoding module to obtain the face feature encoding. The low-brightness still image I to be reconstructed is then input into the VAE Encoder to obtain the latent features of the image. Then, the latent features of the image and the obtained face feature encoding result are simultaneously input into the latent spatial feature restoration module to restore the features of the input image I. The first restored feature is then processed by the VAE Decoder to obtain a clear dark-area face image. This application embodiment can utilize the latent information in bright moving images and low-brightness still images to achieve high-fidelity dark-area face image restoration. The method does not obtain the input prompt through a CLIP model, but instead captures similar structural features, high-dimensional semantic features, and facial visual features in bright moving and low-brightness still images through uniformly captured multi-exposure images. A generative neural network is then used to obtain a more realistic face image more efficiently. Furthermore, the data used for model training in this invention is easier, faster, and less costly to obtain compared to existing methods. The method provided in this application embodiment ensures image freeze-frame capability during nighttime shooting of moving portraits, while simultaneously resolving issues such as facial anomalies, smearing, and noise caused by short exposure times. This method is based on an improved diffusion model. Compared to the traditional SD model, this application embodiment does not require a prompt input; the input is the images of varying exposures already captured along the image capture path. Therefore, this invention provides the model with more accurate prior information, further improving its robustness.

[0129] The model training method provided in this application can be executed by a model training device. This application uses an example of a model training device executing the model training method to illustrate the model training device provided in this application.

[0130] Referring to Figure 6, Figure 6 shows a model training apparatus 600 provided in an embodiment of this application, which includes:

[0131] The first acquisition module 601 is used to acquire multiple data pairs, the data pairs including a first initial image and a second initial image, the first initial image being an image of a preset object in a static state, the second initial image being an image of the preset object in a moving state, and the facial area of the preset object in the second initial image being blurred;

[0132] The generation module 602 is used to generate multiple sets of training data that correspond one-to-one with the multiple data pairs. The training data includes an original image, a reference image, and a ground truth image. The original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it. The reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image. The first intensity is greater than the second intensity. The ground truth image is the first initial image.

[0133] Training module 603 is used to train an initial feature encoder based on the multiple sets of training data to obtain a face feature encoder, and to train an initial latent space feature restoration module based on the multiple sets of training data to obtain a latent space feature restoration module.
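The training-data generation performed by the generation module 602 can be sketched as follows. Several details are assumptions made for illustration and are not specified in the text: pixel values in the range 0 to 1, noise intensity expressed as a Gaussian standard deviation, darkening implemented as multiplication by a constant factor, and all function and parameter names.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise per pixel and clip to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
        for row in image
    ]


def make_training_triplet(first_initial, second_initial,
                          first_sigma=0.10, second_sigma=0.02,
                          darken_factor=0.3, seed=0):
    """Build (original, reference, ground truth) from one data pair."""
    if first_sigma <= second_sigma:
        raise ValueError("first intensity must exceed second intensity")
    rng = random.Random(seed)
    # Original image: stronger noise on the still image, then darkening.
    noisy_still = add_gaussian_noise(first_initial, first_sigma, rng)
    original = [[p * darken_factor for p in row] for row in noisy_still]
    # Reference image: weaker noise on the motion-blurred image.
    reference = add_gaussian_noise(second_initial, second_sigma, rng)
    # Ground truth image: the untouched still image.
    ground_truth = [row[:] for row in first_initial]
    return original, reference, ground_truth


still = [[0.8, 0.6], [0.4, 0.9]]
moving = [[0.7, 0.7], [0.5, 0.8]]
original, reference, ground_truth = make_training_triplet(still, moving)
```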

[0134] Optionally, the training module 603 includes:

[0135] The processing submodule is used to perform alignment processing and region segmentation processing on the original image in the first training data and the reference image in the first training data to obtain N first sub-images and N second sub-images. The first training data is any set of training data in the plurality of training data. The N first sub-images are images of N different regions in the original image, and the N second sub-images are images of N different regions in the reference image. The N first sub-images and the N second sub-images correspond one-to-one.

[0136] A generation submodule is used to generate N corresponding sub-training data based on the N first sub-images and the N second sub-images, wherein the sub-training data includes a set of corresponding first sub-images and second sub-images;

[0137] The training submodule is used to train the initial feature encoder based on the N sub-training data to obtain the facial feature encoder.
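The region segmentation and pairing performed by the processing and generation submodules can be sketched as follows. The sketch assumes the two images are already aligned and uses a regular grid split; how the N regions are actually chosen is not specified in the text, so the grid and the function names are illustrative assumptions.

```python
def split_into_regions(image, rows, cols):
    """Split a 2-D image (list of pixel rows) into rows*cols tiles.

    Tiles are returned in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid")
    tile_h, tile_w = height // rows, width // cols
    return [
        [line[c * tile_w:(c + 1) * tile_w]
         for line in image[r * tile_h:(r + 1) * tile_h]]
        for r in range(rows)
        for c in range(cols)
    ]


def make_sub_training_data(original, reference, rows=2, cols=2):
    """Pair the i-th region of the original with the i-th of the reference."""
    first_subs = split_into_regions(original, rows, cols)
    second_subs = split_into_regions(reference, rows, cols)
    return list(zip(first_subs, second_subs))


# 4x4 toy images whose pixel values encode their position.
original = [[r * 4 + c for c in range(4)] for r in range(4)]
reference = [[100 + r * 4 + c for c in range(4)] for r in range(4)]
sub_training_data = make_sub_training_data(original, reference)
```

Each element of `sub_training_data` holds one first sub-image and its corresponding second sub-image, so N equals `rows * cols` in this sketch.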

[0138] Optionally, the training submodule is configured to perform the following steps:

[0139] The first sub-image and the second sub-image in the first sub-training data are respectively input into the initial feature encoder for feature encoding to obtain the first sub-encoding information and the second sub-encoding information, wherein the first sub-training data is any one of the N sub-training data;

[0140] Calculate the loss between the first sub-encoded information and the second sub-encoded information to obtain the initial loss value;

[0141] The negative value of the initial loss value is used as the target loss value, and the parameters of the initial feature encoder are optimized based on the target loss value to obtain the face feature encoder.

[0142] Optionally, the training module 603 includes:

[0143] The encoding submodule is used to input the original image and the reference image in the second training data into the face feature encoder to obtain the corresponding third feature map and fourth feature map, and to input the original image in the second training data into the variational autoencoder to obtain the third latent space feature map, wherein the second training data is any one set of training data in the plurality of training data;

[0144] The noise submodule is used to add noise to the third latent spatial feature map to obtain the fourth latent spatial feature map.

[0145] The restoration submodule is used to input the third feature map, the fourth feature map, and the fourth latent space feature map into the initial latent space feature restoration module for feature restoration to obtain the second restored feature;

[0146] The decoding submodule is used to input the second restored feature into the variational autodecoder for decoding to obtain the second restored image;

[0147] The calculation submodule is used to calculate the loss information between the second restored image and the second training data, wherein the loss information includes reconstruction loss, perceptual loss and contrastive loss;

[0148] An optimization submodule is used to optimize the parameters of the initial latent spatial feature restoration module based on the loss information, thereby obtaining the latent spatial feature restoration module.

[0149] In this implementation, by adding Gaussian noise and darkening the image, images from various scenarios can be simulated as if taken in a dark environment, and training data can be generated based on this, thus enriching the diversity of the training data. Simultaneously, since a set of training data is generated based on images taken in a static state and motion-blurred images, during training, the initial feature encoder can encode features in both static and motion-blurred images separately to obtain the latent features of the face region in both images. Meanwhile, the initial latent spatial feature restoration module can perform feature restoration based on the latent features of the face region in the static and motion-blurred images output by the initial feature encoder, thereby improving the face restoration effect in dark environment sample images. In this way, feature encoding and image restoration can be performed on images taken in dark environments based on the trained face feature encoder and latent spatial feature restoration module, resulting in highly realistic face images.

[0150] The model training device 600 in this application embodiment can be an electronic device or a component within an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or a device other than a terminal. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM, or self-service machine, etc. This application embodiment does not limit the specific type of device.

[0151] The model training device 600 in this embodiment can be a device with an operating system. This operating system can be Android, iOS, or other possible operating systems; this embodiment does not specifically limit its use.

[0152] The model training device 600 provided in this application embodiment can implement the processes of the method embodiment of Figure 1 and achieve the same technical effect. To avoid repetition, details are not described again here.

[0153] Referring to Figure 7, Figure 7 shows an image processing apparatus 700 provided in an embodiment of this application, which is applied to the face feature encoder and latent spatial feature restoration module described in the above embodiments. The apparatus includes:

[0154] The second acquisition module 701 is used to acquire still images and motion-blurred images, wherein the still image is an image taken when the subject is stationary, and the motion-blurred image is an image taken when the subject is in motion, and the face area of the subject is blurred in the motion-blurred image.

[0155] The encoding module 702 is used to input the still image and the motion-blurred image into the face feature encoder respectively to obtain the corresponding first feature map and second feature map, and to input the still image into the variational autoencoder to obtain the first latent space feature map.

[0156] The restoration module 703 is used to input the first feature map, the second feature map and the first latent space feature map into the latent space feature restoration module respectively to perform feature restoration and obtain the first restored feature.

[0157] The decoding module 704 is used to input the first restored feature into the variational autodecoder for decoding to obtain the first restored image.

[0158] Optionally, the facial feature encoder includes a feature extraction module, a spatial attention mechanism module, and a feature encoding module connected in sequence. The encoding module 702 includes:

[0159] The feature extraction submodule is used to input the still image into the feature extraction module for feature extraction to obtain a first extracted feature, and to input the motion-blurred image into the feature extraction module for feature extraction to obtain a second extracted feature;

[0160] A spatial attention submodule is used to input the first extracted feature into the spatial attention mechanism module to obtain a first attention feature, and to input the second extracted feature into the spatial attention mechanism module to obtain a second attention feature;

[0161] The encoding submodule is used to input the first attention feature into the feature encoding module for encoding to obtain the first feature map, and to input the second attention feature into the feature encoding module for encoding to obtain the second feature map.

[0162] Optionally, the restoration module 703 includes:

[0163] The noise submodule is used to add Gaussian noise to the first latent spatial feature map to obtain the second latent spatial feature map.

[0164] The restoration submodule is used to input the first feature map, the second feature map and the second latent space feature map into the latent space feature restoration module respectively to restore the features and obtain the first restored feature.

[0165] In this embodiment, since still images and motion-blurred images acquired in the same continuous time period have similar structural features and high-dimensional semantic features of facial visual features, by acquiring still images and motion-blurred images, encoding the still images based on a facial feature encoder to obtain a first feature map, and encoding the motion-blurred images based on the facial feature encoder to obtain a second feature map, the potential information of the facial region in the still images and motion-blurred images can be obtained. The first feature map, the second feature map, and the first latent spatial feature map are then input into the latent spatial feature restoration module for feature restoration, which helps to improve the face restoration effect in images taken in dark environments, so as to obtain highly realistic face images.

[0166] The image processing device 700 in this embodiment can be an electronic device or a component within an electronic device, such as an integrated circuit or a chip. The electronic device can be a terminal or a device other than a terminal. For example, the electronic device can be a mobile phone, tablet computer, laptop computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc. It can also be a server, network attached storage (NAS), personal computer (PC), television (TV), ATM, or self-service machine, etc. This embodiment does not limit the specific type of device.

[0167] The image processing device 700 in this embodiment can be a device with an operating system. The operating system can be Android, iOS, or another possible operating system, which this embodiment does not specifically limit.

[0168] The image processing apparatus 700 provided in this application embodiment can implement the various processes of the method embodiment shown in Figure 2 and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0169] In some embodiments, as shown in Figure 8, this application embodiment also provides an electronic device 800, including a processor 801, a memory 802, and a program or instructions stored in the memory 802 and executable on the processor 801. When the program or instructions are executed by the processor 801, they implement the various processes of the above-described image processing method embodiments and achieve the same technical effects. To avoid repetition, they will not be described again here.

[0170] Figure 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of this application.

[0171] The electronic device 900 includes, but is not limited to, components such as: radio frequency unit 901, network module 902, audio output unit 903, input unit 904, sensor 905, display unit 906, user input unit 907, interface unit 908, memory 909, and processor 910.

[0172] In some embodiments of this application, the electronic device 900 can be used to implement the model training method of the embodiment shown in Figure 1, wherein the processor 910 is used to acquire multiple data pairs, each data pair including a first initial image and a second initial image, the first initial image being an image of a preset object in a static state, the second initial image being an image of the preset object in a moving state, and the facial region of the preset object in the second initial image being blurred;

[0173] The processor 910 is used to generate multiple sets of training data that correspond one-to-one with the multiple data pairs. The training data includes an original image, a reference image, and a ground truth image. The original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it. The reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image. The first intensity is greater than the second intensity. The ground truth image is the first initial image.
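The generation of one set of training data described in [0173] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: images are assumed to be grayscale nested lists with values in [0, 1], darkening is assumed to be a simple multiplicative gain, and the names `make_training_triplet`, `sigma_first`, `sigma_second`, and `gain`, as well as their default values, are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to [0, 1]."""
    return [[min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in image]


def darken(image, gain):
    """Darken an image by scaling every pixel (assumed darkening model)."""
    return [[p * gain for p in row] for row in image]


def make_training_triplet(first_initial, second_initial,
                          sigma_first=0.10, sigma_second=0.02,
                          gain=0.3, seed=0):
    """Build (original, reference, ground_truth) from one data pair."""
    # The source requires the first intensity to exceed the second.
    if sigma_first <= sigma_second:
        raise ValueError("first intensity must be greater than second intensity")
    rng = random.Random(seed)
    # Original image: sharp image + stronger noise, then darkened.
    original = darken(add_gaussian_noise(first_initial, sigma_first, rng), gain)
    # Reference image: motion-blurred image + weaker noise.
    reference = add_gaussian_noise(second_initial, sigma_second, rng)
    # Ground truth image: the untouched first initial image.
    ground_truth = [row[:] for row in first_initial]
    return original, reference, ground_truth
```

Under these assumptions the original image simulates a noisy dark-environment capture of the sharp face, while the reference image keeps the brightness of the motion-blurred capture.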

[0174] The processor 910 is configured to train an initial feature encoder based on the multiple sets of training data to obtain a face feature encoder, and to train an initial latent space feature restoration module based on the multiple sets of training data to obtain a latent space feature restoration module.

[0175] Optionally, the processor 910 is configured to perform alignment processing and region segmentation processing on the original image in the first training data and the reference image in the first training data to obtain N first sub-images and N second sub-images, wherein the first training data is any set of training data among the plurality of training data, the N first sub-images are images of N different regions in the original image, the N second sub-images are images of N different regions in the reference image, and the N first sub-images and the N second sub-images correspond one-to-one;

[0176] The processor 910 is configured to generate N corresponding sub-training data based on the N first sub-images and the N second sub-images, wherein the sub-training data includes a set of corresponding first sub-images and second sub-images;

[0177] The processor 910 is used to train the initial feature encoder based on the N sub-training data to obtain the facial feature encoder.
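The pairing of sub-images described in [0175] to [0177] can be sketched as follows. The source does not specify how regions are chosen, so this sketch assumes the two images are already aligned and splits both on the same regular grid; a real system might instead segment facial regions. The function names and the `rows` and `cols` parameters are hypothetical.

```python
def split_into_regions(image, rows, cols):
    """Split a 2-D image (nested lists) into rows * cols equal tiles."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid size")
    tile_h, tile_w = height // rows, width // cols
    regions = []
    for r in range(rows):
        for c in range(cols):
            regions.append([line[c * tile_w:(c + 1) * tile_w]
                            for line in image[r * tile_h:(r + 1) * tile_h]])
    return regions


def make_sub_training_data(original, reference, rows=2, cols=2):
    """Return N = rows * cols pairs (first sub-image, second sub-image)."""
    if (len(original), len(original[0])) != (len(reference), len(reference[0])):
        raise ValueError("original and reference must have the same size")
    first_subs = split_into_regions(original, rows, cols)
    second_subs = split_into_regions(reference, rows, cols)
    # The i-th first sub-image and the i-th second sub-image cover the
    # same region, which gives the one-to-one correspondence.
    return list(zip(first_subs, second_subs))
```

Each returned pair plays the role of one sub-training data item for the initial feature encoder.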

[0178] Optionally, the processor 910 is configured to input the first sub-image and the second sub-image in the first sub-training data into the initial feature encoder for feature encoding to obtain first sub-encoding information and second sub-encoding information, wherein the first sub-training data is any one of the N sub-training data;

[0179] The processor 910 is used to calculate the loss between the first sub-encoding information and the second sub-encoding information to obtain an initial loss value;

[0180] The processor 910 is used to take the negative value of the initial loss value as the target loss value, and optimize the parameters of the initial feature encoder based on the target loss value to obtain the face feature encoder.
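The negation step in [0179] and [0180] can be illustrated as follows. The source does not name the measure computed between the two sub-encodings; this sketch assumes a cosine similarity, in which case minimizing the negated value pulls the encodings of the sharp and blurred versions of the same region together. The encodings are assumed to be flat lists of floats, and `target_loss` is a hypothetical name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("encodings must be non-zero")
    return dot / (norm_a * norm_b)


def target_loss(first_code, second_code):
    """Negate the initial value so that an optimizer which minimizes
    the target loss maximizes the similarity of the two encodings."""
    initial_value = cosine_similarity(first_code, second_code)
    return -initial_value
```

With this assumed measure, perfectly aligned encodings give the lowest possible target loss, so gradient descent on the target loss rewards region encodings that are insensitive to blur and noise.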

[0181] Optionally, the processor 910 is configured to input the original image and the reference image in the second training data into the face feature encoder to obtain the corresponding third feature map and fourth feature map, and to input the original image in the second training data into the variational autoencoder to obtain the third latent space feature map, wherein the second training data is any one set of training data in the plurality of training data;

[0182] The processor 910 is used to add noise to the third latent space feature map to obtain a fourth latent space feature map.

[0183] The processor 910 is used to input the third feature map, the fourth feature map and the fourth latent space feature map into the initial latent space feature restoration module respectively for feature restoration to obtain the second restored feature;

[0184] The processor 910 is used to input the second restored feature into the variational autodecoder for decoding to obtain the second restored image;

[0185] The processor 910 is used to calculate loss information between the second restored image and the second training data, wherein the loss information includes reconstruction loss, perceptual loss and contrast loss;

[0186] The processor 910 is used to optimize the parameters of the initial latent spatial feature restoration module based on the loss information to obtain the latent spatial feature restoration module.
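How the three loss terms in [0185] might be combined can be sketched as follows. The source lists the terms but not their definitions or weights, so this sketch assumes the reconstruction loss is a mean absolute error against the ground truth image, treats the perceptual and contrast losses as already computed scalars, and uses hypothetical weights.

```python
def l1_reconstruction_loss(restored, ground_truth):
    """Mean absolute pixel difference (assumed form of reconstruction loss)."""
    total, count = 0.0, 0
    for restored_row, truth_row in zip(restored, ground_truth):
        for r, t in zip(restored_row, truth_row):
            total += abs(r - t)
            count += 1
    return total / count


def combine_losses(reconstruction, perceptual, contrast,
                   weights=(1.0, 0.1, 0.1)):
    """Weighted sum of the three terms; the weights are illustrative only."""
    w_rec, w_per, w_con = weights
    return w_rec * reconstruction + w_per * perceptual + w_con * contrast
```

In a training loop, the combined value would be the quantity whose gradient updates the parameters of the initial latent spatial feature restoration module.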

[0187] In some embodiments of this application, the electronic device 900 can be used to implement the image processing method of the embodiment shown in Figure 2, wherein the processor 910 is used to acquire a still image and a motion-blurred image. The still image is an image captured when the subject is stationary, and the motion-blurred image is an image captured when the subject is in motion. The facial area of the subject is blurred in the motion-blurred image.

[0188] The processor 910 is configured to input the still image and the motion-blurred image into the face feature encoder to obtain corresponding first feature maps and second feature maps, and to input the still image into the variational autoencoder to obtain a first latent space feature map.

[0189] The processor 910 is used to input the first feature map, the second feature map and the first latent space feature map into the latent space feature restoration module respectively to restore the features and obtain the first restored feature.

[0190] The processor 910 is used to input the first restored feature into the variational autodecoder for decoding to obtain the first restored image.

[0191] Optionally, the processor 910 is configured to input the still image into the feature extraction module for feature extraction to obtain a first extracted feature, and to input the motion-blurred image into the feature extraction module for feature extraction to obtain a second extracted feature;

[0192] The processor 910 is configured to input the first extracted feature into the spatial attention mechanism module to obtain a first attention feature, and to input the second extracted feature into the spatial attention mechanism module to obtain a second attention feature;

[0193] The processor 910 is configured to input the first attention feature into the feature encoding module for encoding to obtain the first feature map, and to input the second attention feature into the feature encoding module for encoding to obtain the second feature map.

[0194] Optionally, the processor 910 is configured to add Gaussian noise to the first latent spatial feature map to obtain a second latent spatial feature map;

[0195] The processor 910 is used to input the first feature map, the second feature map and the second latent space feature map into the latent space feature restoration module respectively to restore the features and obtain the first restored feature.
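The data flow of [0188] to [0195] can be summarized in the following sketch. The four model components are passed in as plain callables because their internals are not specified here; all parameter names are hypothetical, and `add_noise` models the optional Gaussian-noise step of [0194].

```python
def restore_face(still_image, blurred_image,
                 face_encoder, vae_encode, restorer, vae_decode,
                 add_noise=None):
    """Run the restoration pipeline and return the first restored image."""
    # The same facial feature encoder processes both input images.
    first_feature_map = face_encoder(still_image)
    second_feature_map = face_encoder(blurred_image)
    # Only the still image is encoded into the latent space.
    latent = vae_encode(still_image)
    # Optional variant: perturb the latent map before restoration.
    if add_noise is not None:
        latent = add_noise(latent)
    # Restore features from the two feature maps and the latent map.
    restored_feature = restorer(first_feature_map, second_feature_map, latent)
    # Decode the restored feature back into an image.
    return vae_decode(restored_feature)
```

Passing the components as arguments keeps the sketch independent of any particular framework; trained modules would be substituted for the callables.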

[0196] Those skilled in the art will understand that the electronic device 900 may also include a power supply (such as a battery) for supplying power to various components. The power supply may be logically connected to the processor 910 through a power management system, thereby enabling functions such as managing charging, discharging, and power consumption through the power management system. The electronic device structure shown in Figure 9 does not constitute a limitation on the electronic device. The electronic device may include more or fewer components than shown, or combine certain components, or have different component arrangements, which will not be elaborated here.

[0197] It should be understood that, in this embodiment, the input unit 904 may include a graphics processing unit (GPU) 9041 and a microphone 9042. The GPU 9041 processes image data of still images or videos obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The display unit 906 may include a display panel 9061, which may be configured in the form of a liquid crystal display, an organic light-emitting diode, or the like. The user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071 is also called a touch screen. The touch panel 9071 may include a touch detection device and a touch controller. Other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as volume control buttons, power buttons, etc.), a trackball, a mouse, and a joystick, which will not be described in detail here.

[0198] The memory 909 can be used to store software programs and various data. The memory 909 may primarily include a first storage area for storing programs or instructions and a second storage area for storing data. The first storage area may store the operating system and the application programs or instructions required for at least one function (such as sound playback or image playback). Furthermore, the memory 909 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), or direct Rambus random access memory (DRRAM). The memory 909 in the embodiments of this application includes, but is not limited to, these and any other suitable types of memory.

[0199] The processor 910 may include one or more processing units. Optionally, the processor 910 integrates an application processor and a modem processor, wherein the application processor mainly handles operations involving the operating system, user interface, and applications, and the modem processor, such as a baseband processor, mainly handles wireless communication signals. It is understood that the modem processor may also not be integrated into the processor 910.

[0200] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above-described model training method or image processing method embodiments and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0201] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0202] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface is coupled to the processor. The processor is used to run programs or instructions to implement the various processes of the above-described model training method or image processing method embodiments, and can achieve the same technical effect. To avoid repetition, it will not be described again here.

[0203] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-chip.

[0204] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

[0205] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0206] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. A model training method, characterized in that it includes: acquiring multiple data pairs, each data pair including a first initial image and a second initial image, wherein the first initial image is an image of a preset object taken in a static state, the second initial image is an image of the preset object taken in a moving state, and the facial area of the preset object in the second initial image is blurred; generating multiple sets of training data that correspond one-to-one with the multiple data pairs, wherein the training data includes an original image, a reference image, and a ground truth image, the original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it, the reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image, the first intensity is greater than the second intensity, and the ground truth image is the first initial image; and training an initial feature encoder based on the multiple sets of training data to obtain a facial feature encoder, and training an initial latent space feature restoration module based on the multiple sets of training data to obtain a latent space feature restoration module.

2. The method according to claim 1, characterized in that the step of training the initial feature encoder based on the multiple sets of training data to obtain the facial feature encoder includes: performing alignment processing and region segmentation processing on the original image and the reference image in first training data to obtain N first sub-images and N second sub-images, wherein the first training data is any one of the multiple sets of training data, the N first sub-images are images of N different regions in the original image, the N second sub-images are images of N different regions in the reference image, and the N first sub-images and the N second sub-images correspond one-to-one; generating N corresponding sub-training data based on the N first sub-images and the N second sub-images, wherein each sub-training data includes a corresponding first sub-image and second sub-image; and training the initial feature encoder based on the N sub-training data to obtain the facial feature encoder.

3. The method according to claim 2, characterized in that the step of training the initial feature encoder based on the N sub-training data to obtain the facial feature encoder includes: inputting the first sub-image and the second sub-image in first sub-training data into the initial feature encoder for feature encoding to obtain first sub-encoding information and second sub-encoding information, wherein the first sub-training data is any one of the N sub-training data; calculating the loss between the first sub-encoding information and the second sub-encoding information to obtain an initial loss value; and taking the negative value of the initial loss value as a target loss value, and optimizing the parameters of the initial feature encoder based on the target loss value to obtain the facial feature encoder.

4. The method according to claim 1, characterized in that the step of training the initial latent spatial feature restoration module based on the multiple sets of training data to obtain the latent spatial feature restoration module includes: inputting the original image and the reference image in second training data into the facial feature encoder to obtain a corresponding third feature map and fourth feature map, and inputting the original image in the second training data into the variational autoencoder to obtain a third latent space feature map, wherein the second training data is any one of the multiple sets of training data; adding noise to the third latent space feature map to obtain a fourth latent space feature map; inputting the third feature map, the fourth feature map, and the fourth latent space feature map into the initial latent space feature restoration module for feature restoration to obtain a second restored feature; inputting the second restored feature into the variational autodecoder for decoding to obtain a second restored image; calculating loss information between the second restored image and the second training data, wherein the loss information includes reconstruction loss, perceptual loss, and contrast loss; and optimizing the parameters of the initial latent spatial feature restoration module based on the loss information to obtain the latent spatial feature restoration module.

5. An image processing method, characterized in that the method is applied to the facial feature encoder and the latent spatial feature restoration module according to any one of claims 1-4, and comprises: acquiring a still image and a motion-blurred image, wherein the still image is an image taken when the subject is stationary, the motion-blurred image is an image taken when the subject is in motion, and the facial area of the subject is blurred in the motion-blurred image; inputting the still image and the motion-blurred image into the facial feature encoder to obtain a corresponding first feature map and second feature map, and inputting the still image into the variational autoencoder to obtain a first latent space feature map; inputting the first feature map, the second feature map, and the first latent space feature map into the latent space feature restoration module for feature restoration to obtain a first restored feature; and inputting the first restored feature into the variational autodecoder for decoding to obtain a first restored image.

6. The method according to claim 5, characterized in that the facial feature encoder includes a feature extraction module, a spatial attention mechanism module, and a feature encoding module connected in sequence, and the step of inputting the still image and the motion-blurred image into the facial feature encoder to obtain the corresponding first feature map and second feature map includes: inputting the still image into the feature extraction module for feature extraction to obtain a first extracted feature, and inputting the motion-blurred image into the feature extraction module for feature extraction to obtain a second extracted feature; inputting the first extracted feature into the spatial attention mechanism module to obtain a first attention feature, and inputting the second extracted feature into the spatial attention mechanism module to obtain a second attention feature; and inputting the first attention feature into the feature encoding module for encoding to obtain the first feature map, and inputting the second attention feature into the feature encoding module for encoding to obtain the second feature map.

7. The method according to claim 5, characterized in that the step of inputting the first feature map, the second feature map, and the first latent space feature map into the latent space feature restoration module for feature restoration to obtain the first restored feature includes: adding Gaussian noise to the first latent spatial feature map to obtain a second latent spatial feature map; and inputting the first feature map, the second feature map, and the second latent space feature map into the latent space feature restoration module for feature restoration to obtain the first restored feature.

8. A model training device, characterized in that it includes: a first acquisition module, used to acquire multiple data pairs, each data pair including a first initial image and a second initial image, the first initial image being an image of a preset object in a static state, the second initial image being an image of the preset object in a moving state, and the facial area of the preset object in the second initial image being blurred; a generation module, used to generate multiple sets of training data that correspond one-to-one with the multiple data pairs, wherein the training data includes an original image, a reference image, and a ground truth image, the original image is an image obtained by adding Gaussian noise of a first intensity to the first initial image and then darkening it, the reference image is an image obtained by adding Gaussian noise of a second intensity to the second initial image, the first intensity is greater than the second intensity, and the ground truth image is the first initial image; and a training module, used to train an initial feature encoder based on the multiple sets of training data to obtain a facial feature encoder, and to train an initial latent space feature restoration module based on the multiple sets of training data to obtain a latent space feature restoration module.

9. An image processing apparatus, characterized in that the apparatus is applied to the facial feature encoder and the latent spatial feature restoration module according to any one of claims 1-4, and comprises: a second acquisition module, used to acquire a still image and a motion-blurred image, wherein the still image is an image taken when the subject is stationary, the motion-blurred image is an image taken when the subject is in motion, and the facial area of the subject is blurred in the motion-blurred image; an encoding module, used to input the still image and the motion-blurred image into the facial feature encoder to obtain a corresponding first feature map and second feature map, and to input the still image into the variational autoencoder to obtain a first latent space feature map; a restoration module, used to input the first feature map, the second feature map, and the first latent space feature map into the latent space feature restoration module for feature restoration to obtain a first restored feature; and a decoding module, used to input the first restored feature into the variational autodecoder for decoding to obtain a first restored image.

10. A computer program product, characterized in that it includes computer instructions that, when executed by a processor, implement the steps of the model training method according to any one of claims 1-4, or implement the steps of the image processing method according to any one of claims 5-7.