Context-aware image enhancement by an electronic device

A lightweight model leveraging global and local context for image enhancement addresses the challenge of balancing detail and naturalness in high-zoom scenarios, achieving superior image restoration and visual coherence.

WO2026141777A1 · PCT designated stage · Publication Date: 2026-07-02 · SAMSUNG ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
SAMSUNG ELECTRONICS CO LTD
Filing Date
2025-03-25
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing image enhancement methods, particularly in high-zoom scenarios, fail to balance detailed restoration with natural image quality, leading to artifacts and unnatural textures, especially in human faces, due to limitations in current neural network models.

Method used

A lightweight model that integrates both global and local contextual information for adaptive enhancement and restoration, using a context-aware image enhancement controller to generate locally and globally adjusted features, determine confidence scores, and combine them for a cohesive output.

Benefits of technology

The model effectively preserves natural textures and details, ensuring high-quality image restoration that maintains the integrity and authenticity of facial features while enhancing both the subject and background.

✦ Generated by Eureka AI based on patent content.


Abstract

According to an embodiment of the disclosure, a method may include obtaining an input image comprising a first region. The method may include generating a first image comprising a first set of locally adjusted features corresponding to the first region, based on the input image. The method may include generating a second image comprising a second set of globally adjusted features over the input image, based on the first set of locally adjusted features and the input image. The method may include determining a first confidence score for the first image and a second confidence score for the second image, based on the input image. The method may include generating an output image by combining the first image and the second image, based on the first confidence score for the first image and the second confidence score for the second image.

Description

CONTEXT-AWARE IMAGE ENHANCEMENT BY AN ELECTRONIC DEVICE

[0001] The present disclosure relates to image processing technology. More particularly, the present disclosure relates to context-aware image enhancement by an electronic device.

[0002] The rapid advancement of mobile camera technology has significantly altered user expectations regarding image quality, particularly in terms of digital zoom capabilities. In the past, the primary focus of mobile cameras was on basic image capture. However, modern consumers now demand high-quality images even when zoomed in, especially when photographing human subjects. This shift in expectations has underscored the importance of image enhancement and restoration processes within the camera capture pipeline, as these processes greatly influence the final image quality.

[0003] Digital images are typically represented as arrays of pixels, with each pixel containing information such as intensity and color. Enhancing and restoring these images is essential to achieving superior mobile camera quality. Existing methods predominantly leverage advanced neural networks to mitigate issues such as blur and low texture resolution in camera data. However, the shortcomings of these methods highlight the ongoing need for innovative solutions capable of effectively enhancing and restoring images while preserving their natural quality, particularly at high zoom levels.

[0004] Traditional convolutional neural network (CNN) models, while efficient, encounter difficulties when applied to high-zoom scenarios. These challenges often result in artifacts and unnatural textures in the processed images. Further, generative models can produce visually appealing images but face limitations concerning inference times and adaptability to new types of images. Transformer-based models, known for their ability to capture complex relationships within images, are not well-suited for real-time camera processing due to their inherent complexity.

[0005] A specific challenge arises when digitally zoomed images involve human faces. The unique features of human faces often lead to poorly restored facial images, as lightweight discriminative models employed in capture scenarios may suffer from texture loss or over-sharpening, leading to unnatural artifacts. This highlights the inadequacy of current models in balancing the need for detailed restoration with the maintenance of natural image quality.

[0006] Therefore, there is a pressing need for a lightweight model that can effectively comprehend both global and local contexts within an image. Such a model would help achieve a balance between contextual awareness and efficient processing, thereby enhancing image restoration capabilities. This would ultimately provide users with high-quality results that preserve natural texture and detail, meeting the growing demand for superior digital zoom performance in mobile devices.

[0007] According to an embodiment of the disclosure, a method for image enhancement by an electronic device may be provided. The method may include obtaining, by the electronic device, an input image comprising a first region. The method may include generating, by the electronic device, a first image comprising a first set of locally adjusted features corresponding to the first region, based on the input image. The method may include generating, by the electronic device, a second image comprising a second set of globally adjusted features over the input image, based on the first set of locally adjusted features and the input image. The method may include determining, by the electronic device, a first confidence score for the first image and a second confidence score for the second image, based on the input image. The method may include generating, by the electronic device, an output image by combining the first image and the second image, based on the first confidence score for the first image and the second confidence score for the second image.
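The data flow of the method in paragraph [0007] can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the contrast and gain operations standing in for the two enhancement stages, and the distance-to-input confidence heuristic are all assumptions introduced here, since the source does not specify them. Images are assumed to be 2-D grayscale arrays with values in [0, 1].

```python
import numpy as np

def enhance_locally(image, roi_mask):
    """Placeholder local stage (assumption): stretch contrast inside the first region only."""
    out = image.copy()
    if roi_mask.any():
        mean = image[roi_mask].mean()
        out[roi_mask] = np.clip(mean + 1.5 * (image[roi_mask] - mean), 0.0, 1.0)
    return out

def enhance_globally(image, first_image):
    """Placeholder global stage (assumption): mild gain over the whole frame,
    computed from both the input image and the locally adjusted result."""
    base = 0.5 * image + 0.5 * first_image
    return np.clip(1.1 * base, 0.0, 1.0)

def confidence_scores(image, first_image, second_image):
    """Placeholder scores (assumption): staying closer to the input scores higher."""
    c1 = 1.0 / (1.0 + np.abs(first_image - image).mean())
    c2 = 1.0 / (1.0 + np.abs(second_image - image).mean())
    return c1, c2

def enhance(image, roi_mask):
    """Obtain input -> first image -> second image -> scores -> combined output."""
    first = enhance_locally(image, roi_mask)
    second = enhance_globally(image, first)
    c1, c2 = confidence_scores(image, first, second)
    w1 = c1 / (c1 + c2)  # normalise the two scores into blend weights
    return np.clip(w1 * first + (1.0 - w1) * second, 0.0, 1.0)

image = np.random.default_rng(0).random((8, 8))
roi = np.zeros((8, 8), dtype=bool)
roi[2:6, 2:6] = True  # hypothetical first region
first = enhance_locally(image, roi)
output = enhance(image, roi)
```

The point of the sketch is the ordering of the steps: the second image depends on the first, and both confidence scores are computed with reference to the input image before the combination.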

[0008] According to an embodiment of the disclosure, an electronic device for context-aware image enhancement may be provided. The electronic device may comprise a memory storing one or more instructions, and at least one processor. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to obtain an input image comprising a first region. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a first image comprising a first set of locally adjusted features corresponding to the first region, based on the input image. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a second image comprising a second set of globally adjusted features over the input image, based on the first set of locally adjusted features and the input image. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine a first confidence score for the first image and a second confidence score for the second image, based on the input image. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate an output image by combining the first image and the second image, based on the first confidence score for the first image and the second confidence score for the second image.

[0009] According to an embodiment of the disclosure, a computer-readable medium storing one or more instructions may be provided. The one or more instructions, when executed by at least one processor, may cause the at least one processor of an electronic device to perform operations corresponding to the method.

[0010] These and other aspects of the embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating preferred embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the embodiments herein.

[0011] These and other features, aspects, and advantages of the present embodiments are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The embodiments herein will be better understood from the following description with reference to the drawings, in which:

[0012] FIG. 1A and FIG. 1B illustrate examples of limitations of lightweight DNN models in image enhancement according to the prior art.

[0013] FIG. 2A and FIG. 2B illustrate examples of limitations of generative models according to the prior art.

[0014] FIG. 3 illustrates an example image captured using a high-scale zoom with a telephoto camera sensor featuring a human subject according to the prior art.

[0015] FIG. 4 is a block diagram that illustrates the working operation of the electronic device according to the embodiment disclosed herein.

[0016] FIG. 5 is a block diagram that illustrates the working operation of a context-aware image enhancement to provide enhanced output according to the embodiment disclosed herein.

[0017] FIG. 6 is a block diagram that illustrates the working operation of a text-aware image enhancement model according to the embodiment as disclosed herein.

[0018] FIG. 7 is a flowchart that illustrates the working operation of the proposed model according to the embodiment as disclosed herein.

[0019] FIG. 8 is a block diagram that illustrates facial feature localization and context understanding according to the embodiment as disclosed herein.

[0020] FIG. 9 is a block diagram that illustrates the working operation of a face restoration and beautification model by using localized facial features according to the embodiment as disclosed herein.

[0021] FIG. 10 illustrates an example output of facial feature enhancement using a face restoration and beautification model according to the embodiment as disclosed herein.

[0022] FIG. 11 is a block diagram that illustrates the working operation of a context-aware detail enhancement engine according to the embodiment as disclosed herein.

[0023] FIG. 12 illustrates the example output of a context-aware enhancement engine according to the embodiment as disclosed herein.

[0024] FIG. 13 is a block diagram that illustrates a method of determination of a face confidence score according to the embodiment as disclosed herein.

[0025] FIG. 14 is a block diagram that illustrates the working and learning mechanism of a regression weights analyzer according to the embodiment as disclosed herein.

[0026] FIG. 15 is a block diagram that illustrates the working operation of a combiner according to the embodiment as disclosed herein.

[0027] FIG. 16 illustrates a combined output image according to the embodiment as disclosed herein.

[0028] FIGS. 17A-17C illustrate the comparison of human image enhancement between the existing technology and the proposed disclosure applied to the input images according to the embodiment as disclosed herein.

[0029] FIGS. 18A-18C illustrate the comparison of text image enhancement between the existing technology and the proposed disclosure applied to the input images according to the embodiment as disclosed herein.

[0030] FIGS. 19A-19B illustrate the comparison between the input image and the output image generated by the proposed disclosure according to the embodiment as disclosed herein.

[0031] It may be noted that to the extent possible, like reference numerals have been used to represent like elements in the drawing. Further, those of ordinary skill in the art will appreciate that elements in the drawing are illustrated for simplicity and may not have been necessarily drawn to scale. For example, the dimension of some of the elements in the drawing may be exaggerated relative to other elements to help to improve the understanding of aspects of the disclosure. Furthermore, the elements may have been represented in the drawing by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the embodiments of the disclosure so as not to obscure the drawing with details that will be readily apparent to those of ordinary skill in the art having benefit of the description herein.

[0032] The embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. Also, the various embodiments described herein are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments. The term "or" as used herein, refers to a non-exclusive or, unless otherwise indicated. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those skilled in the art to practice the embodiments herein. Accordingly, the examples are not to be construed as limiting the scope of the embodiments herein.

[0033] As is traditional in the field, embodiments are described and illustrated in terms of blocks that carry out a described function or functions. These blocks, which may be referred to herein as managers, units, modules, hardware components or the like, are physically implemented by analog and / or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits and the like, and may optionally be driven by firmware and software. The circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like. The circuits constituting a block may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the proposed method. Likewise, the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the proposed method.

[0034] The accompanying drawings are used to help easily understand various technical features, and it is understood that the embodiments presented herein are not limited by the accompanying drawings. As such, the proposed method should be construed to extend to any alterations, equivalents and substitutes in addition to those which are particularly set out in the accompanying drawings. Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are generally used to distinguish one element from another.

[0035] The principal object of the embodiments herein is to generate a facially enhanced input image by enhancing one or more facial features of the human face in the input image using a semantic facial image.

[0036] According to an embodiment of the disclosure, the object of the disclosure is to generate a globally enhanced input image using the facially enhanced input image and a semantic global context map. This global context enhanced input image restores the naturalness of the human face in the facially enhanced input image.

[0037] According to an embodiment of the disclosure, the object of the disclosure is to determine a confidence score for each of the facially enhanced input image and the globally enhanced input image based on the skin texture and face structure of the human face in the facially enhanced input image and the globally enhanced input image.

[0038] According to an embodiment of the disclosure, the object of the disclosure is to generate an enhanced input image by combining the facially enhanced input image, the globally enhanced input image, and their respective estimated face confidence scores.

[0039] The limitations of lightweight Deep Neural Network (DNN) models in image enhancement, as depicted in FIGS 1A-1B, underscore the challenges faced by current systems in achieving natural image reproduction. FIG. 1A demonstrates how a human face processed by a lightweight DNN model integrated within a camera's Image Signal Processing (ISP) system results in over-enhanced images. This over-enhancement leads to unnatural image reproduction due to the model's inability to comprehend the scene and local image patch context. The highlighted portion (101) in FIG. 1A reveals facial features that appear overly smooth, lacking the natural texture and understanding necessary for realistic image restoration. This results in a "paint-like" quality, which, although satisfactory in restoring the background, fails to maintain the natural appearance of human faces, thereby highlighting the model's limitations in understanding local context.

[0040] In FIG. 1B, further limitations are evident in the improper texture transitions, particularly in the hair (102). The image suffers from a lack of clarity and a poor-quality, overly smoothed effect, emphasizing the need for a more advanced approach to ensure natural image restoration while preserving the integrity of facial features. These issues illustrate the necessity for models that can effectively balance enhancement with the preservation of natural textures and details, especially in complex areas like human hair.

[0041] The challenges extend to generative models, as illustrated in FIGS 2A-2B. These models face issues such as mode collapse and non-convergence during training, and their generalization ability is not guaranteed. This often leads to the generation of unnecessary details not present in the original images. FIG. 2A serves as an input image fed to the generative model, highlighting natural facial details, including the eyes (201), skin texture, and the contours of spectacle glasses (202). This image acts as a baseline for evaluating the generative model's effectiveness. However, FIG. 2B reveals the output image from the generative model, where alterations in facial details, particularly around the spectacles, are evident. The rim of the glasses appears altered (203), lacking precise lines, and the eyes may exhibit unnatural enhancements (204) with colors or textures deviating from the original. These changes underscore the difficulties in maintaining the authenticity of the input image while producing visually captivating outcomes, as the generative model introduces unnecessary details instead of accurately restoring the original features.

[0042] To address these challenges, the present disclosure proposes an innovative image enhancement model architecture that leverages both global and local contextual information for adaptive enhancement and restoration. By integrating a more comprehensive understanding of the scene and local image patches, the proposed model aims to overcome the limitations of both lightweight DNN and generative models. This approach ensures that image enhancement is performed with a nuanced understanding of the context, resulting in more natural and realistic image reproductions. The model's ability to adaptively enhance images while preserving details and textures marks a significant advancement in the field of image processing, offering a solution that bridges the gap between current limitations and the need for high-quality, authentic image restoration.

[0043] FIG. 3 illustrates an example of an input image obtained at a defined zoom level. High-scale zoom may refer to a zoom level greater than a threshold magnification, where the threshold magnification may represent the magnification that requires the use of a telephoto lens. For example, the threshold magnification can be 10x zoom. The image is captured using a high-scale zoom with a telephoto camera sensor. The figure illustrates that, due to the significant or defined zoom level, the highlighted portion (302) in the captured image exhibits noticeable blur and a lack of detail. The human subject, specifically the face Region-of-Interest (ROI), occupies a small area within the frame, resulting in flattened facial features (301) that may not accurately represent the subject's characteristics. For example, the captured image may exhibit reduced detail and depth compared to a general image: texture reduction may cause details to blur, making the image appear flat; contrast and depth may decrease, reducing three-dimensionality; or the subject may appear blurred, leading to a loss of sharpness. In addition, the highlighted portion (302) in the background is not enhanced properly, contributing to the overall lack of sharpness. This highlights the challenges associated with high-zoom photography, where maintaining clarity and detail in facial features becomes increasingly difficult.

[0044] In high-zoom photography, the primary objective is often to capture distant subjects with as much detail and clarity as possible. However, as demonstrated in FIG. 3, achieving this goal is fraught with difficulties. The compression of the subject into a small area within the frame can lead to a loss of the facial details, making it difficult to preserve the subject's true likeness. This problem is exacerbated by the tendency of existing systems to focus enhancement efforts solely on facial regions, often at the expense of the background. As a result, the image can appear unbalanced, with a sharp foreground and a blurred, less detailed background. This imbalance can detract from the overall quality of the photograph, making it less visually appealing and less representative of the scene as a whole.

[0045] The present disclosure addresses these challenges by proposing a lightweight model that comprehends both the overall image context and the local context surrounding specific objects. Unlike existing systems, this disclosure is capable of learning what and where objects are within the scene or image, thereby enabling adaptive enhancement and restoration of images based on their characteristics. By integrating both global and local contexts, the lightweight model provides a complete understanding of the image, ensuring that enhancements are both meaningful and visually appealing. This dual-context capability is the core need addressed by the proposed disclosure, facilitating natural and high-quality image restoration without compromising performance. The disclosure's ability to balance the enhancement of both the subject and the background ensures that the final image is cohesive and representative of the original scene, offering a significant improvement over traditional high-zoom photography techniques.

[0046] FIG. 4 is a block diagram illustrating the operation of the electronic device. The electronic device (401) includes at least one processor (402), an I / O interface (403), a memory (404), and a context-aware image enhancement controller (405). The electronic device (401) connects with the context-aware image enhancement controller (405). For example, the electronic device (401) can include, but is not limited to, a mobile phone, a smartphone, a tablet, a laptop, or an Internet of Things (IoT) device. Further, the processor (402) of the electronic device (401) communicates with the memory (404), the I / O interface (403), and the context-aware image enhancement controller (405). The processor (402) is configured to execute instructions stored in the memory (404) and to perform various processes. The processor (402) can include one or a plurality of processors, and can be a general-purpose processor such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU) or a visual processing unit (VPU), and / or an Artificial Intelligence (AI) dedicated processor such as a neural processing unit (NPU).

[0047] The processor (402) may include various processing circuitry and / or multiple processors. For example, as used herein, including the claims, the term "processor" may include various processing circuitry, including at least one processor, wherein one or more of at least one processor, individually and / or collectively in a distributed manner, may be configured to perform various functions described herein. As used herein, when "a processor", "at least one processor", and "one or more processors" are described as being configured to perform numerous functions, these terms cover situations, for example and without limitation, in which one processor performs some of recited functions and another processor(s) performs other of recited functions, and also situations in which a single processor may perform all recited functions. Additionally, the at least one processor may include a combination of processors performing various of the recited / disclosed functions, e.g., in a distributed manner. At least one processor may execute program instructions to achieve or perform various functions.

[0048] Further, the memory (404) of the electronic device (401) includes storage locations that are addressable through the processor (402). The memory (404) may include, but is not limited to, a volatile memory and / or a non-volatile memory. Further, the memory (404) can include one or more computer-readable storage media. The memory (404) can include non-volatile storage elements. For example, non-volatile storage elements can include magnetic hard disks, optical disks, floppy disks, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable memories (EEPROM).

[0049] The I / O interface (403) transmits information between the memory (404), the electronic device (401), and external peripheral devices. The peripheral devices are the input-output devices associated with the electronic device (401).

[0050] The context-aware image enhancement controller (405) communicates with the I / O interface (403) and the memory (404). The context-aware image enhancement controller (405) is hardware realized through the physical implementation of both analog and digital circuits, including logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive and active electronic components, as well as optical components.

[0051] The context-aware image enhancement controller (405) illustrated in FIG. 4 may be a configuration implemented by the at least one processor (402) included in the electronic device (401) executing a program or command stored in the memory (404) included in the electronic device (401). Accordingly, the operations described below as being performed by the context-aware image enhancement controller (405) of the electronic device (401) may actually be performed by the at least one processor (402) included in the electronic device (401).

[0052] The context-aware image enhancement controller (405) may receive the input image (501) which includes a RoI. The input image (501) may be captured at the defined zoom level, ensuring that the RoI is prominently featured and can be accurately processed. The context-aware image enhancement controller (405) may generate the first image (903), which includes a set of locally enhanced features corresponding to the RoI. The context-aware image enhancement may involve applying advanced image processing techniques such as edge detection and contrast adjustment specifically within the RoI to highlight details. Further, the context-aware image enhancement controller (405) may generate the second image (111) based on the set of locally enhanced features and the input image (501). The second image (111) may include a set of globally enhanced features over the input image (501), achieved by applying global filters that adjust at least one of brightness and saturation across the entire image. The context-aware image enhancement controller (405) may determine the confidence score for each of the set of locally enhanced features based on the first image (903) and the input image (501) and the confidence score for each of the set of globally enhanced features based on the second image (111), and the input image (501). The confidence scores may be determined using a probabilistic model that evaluates the enhancement quality and consistency. The context-aware image enhancement controller (405) may generate the output image (509) having the RoI enhanced over the input image based on the confidence score for each of the set of locally enhanced features and the confidence score for each of the set of globally enhanced features.

[0053] According to an embodiment of the disclosure, the ROI may be enhanced based on confidence score. The remaining enhanced regions may be directly extracted from the second image. The output image is then optimized for visual clarity and stored in the memory (404) of the electronic device (401), ensuring quick access and retrieval for future use.
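The region handling in paragraph [0053] can be illustrated with a small masked composite. This is a sketch under stated assumptions: the function name `composite`, the use of a single scalar confidence for the ROI, and the linear blend inside the ROI are choices made here for illustration, since the source only states that the ROI is enhanced based on the confidence score and that the remaining regions are taken from the second image.

```python
import numpy as np

def composite(first_image, second_image, roi_mask, roi_confidence):
    """Blend the two images inside the ROI by confidence (assumed linear blend);
    copy the second image unchanged everywhere else."""
    w = float(np.clip(roi_confidence, 0.0, 1.0))
    blended = w * first_image + (1.0 - w) * second_image
    return np.where(roi_mask, blended, second_image)

first = np.full((4, 4), 0.8)    # stand-in for the locally enhanced first image
second = np.full((4, 4), 0.4)   # stand-in for the globally enhanced second image
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True           # hypothetical ROI
out = composite(first, second, mask, roi_confidence=0.75)
```

With these stand-in values, pixels inside the ROI land between the two images in proportion to the confidence, while pixels outside the ROI are exactly those of the second image.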

[0054] The context-aware image enhancement controller (405) may be configured to generate the first image (903) having the set of locally enhanced features corresponding to the RoI from the input image (501). This process may involve segmenting the RoI using advanced image segmentation techniques that accurately delineate the boundaries of the RoI. The context-aware image enhancement controller (405) may input the input image (501) to the pre-trained RoI fine semantic model (502). The context-aware image enhancement controller (405) may determine the semantic RoI map. The semantic RoI map may localize RoI portions corresponding to the RoI in the input image (501), providing a detailed representation of the RoI's spatial characteristics. The context-aware image enhancement controller (405) may determine the set of locally enhanced features corresponding to the features in the input image (501) by enhancing at least one of the sharpness and texture of the localized RoI portions indicated in the semantic RoI map using the RoI restoration model (503). The restoration model may employ machine learning techniques to adaptively enhance features while maintaining the natural appearance of the image. The enhancement of the at least one of sharpness and texture of the localized RoI portions may be performed while preserving naturalness, ensuring that the enhanced image does not appear artificial. The context-aware image enhancement controller (405) may generate the first image (903) including the set of locally enhanced features, which is regionally enhanced over the input image (501), providing a visually appealing and detailed representation of the RoI.
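As a rough illustration of enhancing sharpness only within the portions localized by the semantic RoI map, the following sketch applies classical unsharp masking inside a binary map. Unsharp masking is a swapped-in stand-in for the learned restoration model (503), and the binary map stands in for the output of the fine semantic model (502); the names `box_blur`, `sharpen_in_roi`, and `amount` are hypothetical.

```python
import numpy as np

def box_blur(image):
    """3x3 mean filter with edge padding."""
    padded = np.pad(image, 1, mode="edge")
    h, w = image.shape
    acc = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def sharpen_in_roi(image, roi_map, amount=0.5):
    """Unsharp masking restricted to pixels flagged in the RoI map.
    A modest `amount` limits overshoot, which is one simple way to avoid
    an over-sharpened, artificial look."""
    detail = image - box_blur(image)
    sharpened = np.clip(image + amount * detail, 0.0, 1.0)
    return np.where(roi_map > 0, sharpened, image)

# Step edge: dark left half, bright right half.
image = np.full((6, 6), 0.2)
image[:, 3:] = 0.8
roi_map = np.zeros((6, 6), dtype=int)
roi_map[1:5, 1:5] = 1  # hypothetical RoI localization
first_image = sharpen_in_roi(image, roi_map)
```

In this toy input the edge inside the map becomes steeper (the dark side gets darker and the bright side brighter), while rows and columns outside the map are returned unchanged.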

[0055] The context-aware image enhancement controller (405) may be configured to generate the second image (111) having the set of globally enhanced features. This involves applying a series of image processing techniques that enhance the overall image quality, such as noise reduction and detail enhancement. The context-aware image enhancement controller (405) may input the input image (501) into the pre-trained coarse semantic model (504) to determine the semantic global context map. The semantic global context map may categorize the pixels in the input image (501) into a class of the RoI from the plurality of classes, allowing for targeted enhancement strategies. The context-aware image enhancement controller (405) may determine the set of globally enhanced features corresponding to the features in the input image (501) by performing context-aware detail enhancement based on the semantic global context map, the input image (501), local enhancement for pixels inside the RoI, and global enhancement for pixels inside and outside the RoI. This ensures that the image is uniformly enhanced while preserving the integrity of the RoI. The context-aware detail enhancement may be performed with minimal changes to details within the RoI portions, maintaining the original characteristics of the RoI. The context-aware image enhancement controller (405) may generate the second image (111), which includes the set of globally enhanced features, which is globally enhanced over the input image (501), resulting in a balanced and visually coherent image.

[0056] The context-aware image enhancement controller (405) may be configured to generate the output image (509). This involves integrating the locally and globally enhanced features into a single cohesive image. The context-aware image enhancement controller (405) may generate regression weights based on the face confidence score feature map. These regression weights may be calculated using a machine learning model that assesses the importance of each feature in the context of the overall image. The regression weights may ensure the seamless transition in restoration properties between RoI portions and non-RoI portions of the input image, providing a smooth and natural appearance. The context-aware image enhancement controller (405) may combine the set of locally enhanced features weighted by the first parameter ( ) and regression weights with the set of globally enhanced features in the second image (111) weighted by the second parameter ( ) and the regression weights. The first parameter and the second parameter may be configured to restrict the combined output to lie within the predefined domain, ensuring that the final image adheres to specific quality standards. The context-aware image enhancement controller (405) may generate the output image (509) having the RoI based on the combination of the set of locally enhanced features and the set of globally enhanced features, resulting in an image that is both detailed and aesthetically pleasing.
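The weighted combination described in this paragraph can be illustrated with a minimal sketch. It assumes single-channel images stored as nested lists with values in [0, 1], takes the predefined domain to be [0, 1], and represents the first and second parameters as non-negative scalars named `alpha` and `beta`. These names, the normalization step, and the final clamp are illustrative assumptions and are not taken from the disclosure.

```python
def blend_images(first, second, w_first, w_second, alpha=0.5, beta=0.5):
    """Blend two same-sized grayscale images pixel by pixel.

    first, second      : locally / globally enhanced images, values in [0, 1]
    w_first, w_second  : per-pixel regression weights (hypothetical layout)
    alpha, beta        : scalars standing in for the first and second
                         parameters (the values are assumptions)
    """
    out = []
    for r in range(len(first)):
        row = []
        for c in range(len(first[0])):
            a = alpha * w_first[r][c]
            b = beta * w_second[r][c]
            total = a + b
            if total == 0.0:
                # No weight for either image: fall back to the global result.
                value = second[r][c]
            else:
                # Normalized weights make this a convex combination.
                value = (a * first[r][c] + b * second[r][c]) / total
            # Clamp to guard against rounding outside the assumed domain.
            row.append(min(1.0, max(0.0, value)))
        out.append(row)
    return out


first = [[0.8, 0.2], [0.6, 0.4]]
second = [[0.4, 0.2], [0.2, 0.0]]
w_first = [[1.0, 0.0], [0.5, 0.0]]
w_second = [[0.0, 1.0], [0.5, 0.0]]
blended = blend_images(first, second, w_first, w_second)
```

Because the two effective weights are normalized per pixel, each output value lies between the corresponding values of the first and second images, which is one simple way to keep the result inside a fixed range.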

[0057] The context-aware image enhancement controller (405) may be configured to determine the confidence score. This score assesses the effectiveness of the enhancement process and guides further adjustments. The context-aware image enhancement controller (405) may compare the set of locally enhanced features in the first image (903) and the set of globally enhanced features in the second image (111) with the input image (501) using the pre-trained deep neural network (DNN) model. This model may be designed to evaluate the quality of enhancements by analyzing feature consistency and visual coherence. The context-aware image enhancement controller (405) may determine the confidence score for each of the set of locally enhanced features and the confidence score for each of the set of globally enhanced features based on the comparison. The confidence score may provide H*W*n values, where H represents the height of the RoI feature, W represents the width of the RoI features, and n represents the number of features corresponding to the RoI. These values may be used to fine-tune the enhancement process, ensuring that the final image meets the desired quality criteria.

[0058] The context-aware image enhancement controller (405) may be configured to train the RoI restoration model. The training may include receiving the training image, which is pre-processed to normalize lighting conditions and remove noise artifacts that could interfere with model learning. Further, the context-aware image enhancement controller (405) may train the RoI restoration model (503) (the term RoI restoration model is interchangeably referred to as face restoration model) to enhance only the RoI by focusing on features that differ in edge characteristics, details, and texture from the rest of the training images. This involves using advanced convolutional neural networks (CNNs) that can discern subtle variations in facial features, such as wrinkles and blemishes, and apply targeted enhancements. Further, the context-aware image enhancement controller (405) may utilize the fine segmentation map and features from intermediate layers of the fine semantic segmentation model (502) as the attention guide to selectively enhance the RoI. The segmentation map may be refined using a multi-scale approach to ensure that even the facial features are accurately captured and enhanced. Further, the context-aware image enhancement controller (405) may apply custom loss functions during the training. The custom loss functions may assign higher weightage to enhancements within the RoI compared to other regions of the training images, ensuring that the model prioritizes facial features while maintaining a natural appearance.
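A custom loss that assigns higher weightage to the RoI might look like the following minimal sketch. It uses a weighted mean absolute error as a stand-in for whatever loss the model actually uses. The function name, the binary mask layout, and the weights 4.0 and 1.0 are assumptions made for illustration.

```python
def roi_weighted_l1(pred, target, roi_mask, roi_weight=4.0, bg_weight=1.0):
    """Mean absolute error with a larger weight on RoI pixels.

    roi_mask holds 1 for RoI pixels and 0 elsewhere. The default weights
    are illustrative assumptions, not values from the disclosure.
    """
    weighted_error = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, roi_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = roi_weight if m else bg_weight
            weighted_error += w * abs(p - t)
            weight_sum += w
    return weighted_error / weight_sum


pred = [[0.5, 0.5], [0.5, 0.5]]
target = [[1.0, 0.5], [0.5, 0.0]]
mask = [[1, 0], [0, 0]]
loss_weighted = roi_weighted_l1(pred, target, mask)
loss_uniform = roi_weighted_l1(pred, target, mask, roi_weight=1.0)
```

In this example the same error of 0.5 occurs once inside and once outside the RoI, and the weighted loss is larger than the uniform one because the RoI error counts four times as much.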

[0059] The context-aware image enhancement controller (405) may be configured to train a context-aware detail enhancement model (505). Further, the training may include obtaining the generic image dataset that is independent of specific camera sensors, ensuring that the model is robust across various imaging conditions and devices. Further, the context-aware image enhancement controller (405) may train the context-aware detail enhancement model (505) using the generic image dataset to enhance images by performing at least one of improving sharpness, restoring details, and denoising. This involves employing advanced techniques that can differentiate between noise and actual image details, allowing for precise enhancement without over-processing. Further, the context-aware image enhancement controller (405) may apply custom loss functions for minimal enhancement to the RoI within the images. The custom loss functions may prioritize structural fidelity across the entire image while minimizing alterations in the RoI, ensuring that the enhancements do not distort the original intent of the image. Further, the context-aware image enhancement controller (405) may guide the training of the context-aware detail enhancement model (505) using segmentation maps and feature maps derived from the pre-trained RoI fine semantic model (502) and the pre-trained coarse semantic model (504). The segmentation maps and feature maps may act as attention maps to direct focus during the training, allowing the model to selectively enhance areas that contribute to the perceived image quality.
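The idea of prioritizing fidelity across the entire image while minimizing alterations in the RoI can be sketched as a two-term loss. The sketch below is only an illustration under stated assumptions: mean absolute error stands in for the structural fidelity term, the RoI term penalizes any departure from the model input, and `roi_penalty` is a hypothetical hyperparameter.

```python
def detail_enhancement_loss(output, target, source, roi_mask, roi_penalty=2.0):
    """Fidelity over the whole image plus a penalty for changing RoI pixels.

    output   : model prediction
    target   : clean reference image
    source   : model input image
    roi_mask : 1 inside the RoI, 0 outside
    """
    fidelity = 0.0
    n_pixels = 0
    roi_change = 0.0
    roi_pixels = 0
    for o_row, t_row, s_row, m_row in zip(output, target, source, roi_mask):
        for o, t, s, m in zip(o_row, t_row, s_row, m_row):
            fidelity += abs(o - t)
            n_pixels += 1
            if m:
                # Inside the RoI, any change from the input is penalized.
                roi_change += abs(o - s)
                roi_pixels += 1
    loss = fidelity / n_pixels
    if roi_pixels:
        loss += roi_penalty * roi_change / roi_pixels
    return loss


source = [[0.5, 0.5]]
target = [[0.5, 1.0]]
mask = [[1, 0]]
loss_untouched_roi = detail_enhancement_loss([[0.5, 1.0]], target, source, mask)
loss_altered_roi = detail_enhancement_loss([[0.75, 1.0]], target, source, mask)
```

An output that sharpens the background pixel and leaves the RoI pixel alone incurs no loss, whereas an output that also alters the RoI pixel is penalized by both terms.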

[0060] The context-aware image enhancement controller (405) may be configured to train the face confidence score evaluator (506). The training may include receiving the training images and identifying the RoI within the training images, using advanced facial recognition techniques to ensure accurate localization of facial features. The process involves analyzing the training images to generate the RoI-enhanced image and the overall enhanced image, using a combination of machine learning techniques to optimize feature enhancement. Further, the context-aware image enhancement controller (405) may compare the training images, the RoI-enhanced image, and the overall enhanced image to determine the reconstruction fidelity of features within the RoI. This comparison is facilitated by a sophisticated metric that evaluates the similarity between the original and enhanced images, focusing on facial landmarks. Further, the context-aware image enhancement controller (405) may generate the confidence score map and assign the weight to each feature within the RoI. The weight may range from 0 to 1 and may be based on the determined reconstruction fidelity of the respective feature, providing a quantitative measure of enhancement quality. Further, for the application of face restoration, the context-aware image enhancement controller (405) may generate two sets of confidence score maps, one corresponding to the RoI restoration model (503) output and the other corresponding to the context-aware detail enhancement model (505) output. Each set may include score maps for features including at least one of eyes, nose, mouth, and skin texture, with pixel weightages indicating the reconstruction quality of each feature, allowing for targeted adjustments based on confidence levels.
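A per-feature confidence score map with weights between 0 and 1 can be illustrated as follows. In this sketch a simple per-pixel difference stands in for the reconstruction fidelity metric, which the disclosure attributes to a trained evaluator. The linear mapping, the `tolerance` value, and the mask dictionary layout are assumptions.

```python
def confidence_map(reference, enhanced, feature_masks, tolerance=0.25):
    """Return {feature_name: 2-D map of weights in [0, 1]}.

    A pixel's weight falls linearly from 1 (identical to the reference)
    to 0 (absolute difference >= tolerance). Pixels outside a feature's
    mask get weight 0. The linear rule and tolerance are assumptions.
    """
    maps = {}
    for name, mask in feature_masks.items():
        feature_map = []
        for ref_row, enh_row, m_row in zip(reference, enhanced, mask):
            row = []
            for ref, enh, m in zip(ref_row, enh_row, m_row):
                if m:
                    score = 1.0 - abs(enh - ref) / tolerance
                    row.append(min(1.0, max(0.0, score)))
                else:
                    row.append(0.0)
            feature_map.append(row)
        maps[name] = feature_map
    return maps


reference = [[0.5, 0.5]]
enhanced = [[0.5, 0.625]]
masks = {"eyes": [[1, 0]], "mouth": [[0, 1]]}
scores = confidence_map(reference, enhanced, masks)
```

Calling the function once with the RoI-enhanced image and once with the overall enhanced image would yield the two sets of maps, one per model output.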

[0061] The context-aware image enhancement controller (405) may generate the first image (903) with the set of locally enhanced features corresponding to the RoI from the input image (501). This involves applying a series of enhancement filters that are specifically tuned to improve the visual quality of facial features while maintaining their natural appearance. Further, the context-aware image enhancement controller (405) may input the input image (501) to the pre-trained RoI fine semantic model (502) and may determine the semantic RoI map. The semantic RoI map may be generated using a deep learning model that has been trained on a diverse dataset to accurately identify and segment facial regions. Further, the electronic device (401) may determine the set of locally enhanced features in the input image (501) by enhancing at least one of the sharpness and texture of the localized RoI portions indicated in the semantic RoI map using the RoI restoration model (503). The enhancement of the sharpness and texture of the localized RoI portions may be performed while preserving naturalness, using a combination of edge-preserving filters and texture synthesis techniques. Further, the electronic device (401) may generate the first image (903) that includes the set of locally enhanced features, which is regionally enhanced over the input image (501), ensuring that the enhancements are seamlessly integrated with the surrounding image content.

[0062] The context-aware image enhancement controller (405) may generate the second image (111) with the set of globally enhanced features. This involves applying a global enhancement technique that adjusts at least one of the overall brightness, contrast, denoising, texture, pattern restoration, and sharpness improvement of the image to improve its visual appeal. Further, the context-aware image enhancement controller (405) may input the input image (501) into the pre-trained coarse semantic model (504) to determine the semantic global context map. The semantic global context map may be generated using a hierarchical approach that categorizes image regions based on their contextual importance. Further, the semantic global context map may categorize each of the pixels in the input image (501) into one class of the RoI from the plurality of classes, allowing for targeted enhancement strategies. Further, the context-aware image enhancement controller (405) may determine the set of globally enhanced features corresponding to the features in the input image (501) by performing context-aware detail enhancement based on the semantic global context map, the input image (501), local enhancement for pixels inside the RoI, and global enhancement for pixels inside and outside the RoI. The context-aware detail enhancement may be performed with minimal changes to details within the RoI portions, ensuring that the enhancements do not compromise the integrity of the original image. Further, the context-aware image enhancement controller (405) may generate the second image (111), which includes the set of globally enhanced features applied across the input image (501), resulting in a visually cohesive output.

[0063] The context-aware image enhancement controller (405) may generate the output image (509) by generating the regression weights based on the face confidence score feature map. The regression weights may be calculated using a machine learning model that has been trained to optimize the balance between local and global enhancements. Further, the context-aware image enhancement controller (405) may combine the set of locally enhanced features weighted by the first parameter ( ) and regression weights with the set of globally enhanced features in the second image weighted by the second parameter ( ) and the regression weights. The first parameter and the second parameter may be configured to restrict the combined output to lie within the predefined domain, ensuring that the final image maintains a natural appearance. Further, the context-aware image enhancement controller (405) may generate the output image (509), which includes the RoI based on the combination of the set of locally enhanced features and the set of globally enhanced features. This process involves a sophisticated blending technique that seamlessly integrates the enhancements, resulting in an image that is both visually appealing and true to the original content.

[0064] The context-aware image enhancement controller (405) may determine the confidence score by comparing the set of locally enhanced features in the first image (903) with the features in the input image (501) and the set of globally enhanced features in the second image (111) with the features in the input image (501) using a pre-trained deep neural network (DNN) model. The DNN model may be designed to evaluate the quality of enhancements by analyzing the consistency and coherence of features across different image versions. Further, the context-aware image enhancement controller (405) may determine the confidence score for each of the set of locally enhanced features and the confidence score for each of the set of globally enhanced features based on the comparison. The confidence score may provide H*W*n values, where H represents the height of an RoI feature, W represents the width of the RoI features, and n represents the number of features corresponding to the RoI. This multi-dimensional scoring system allows for a detailed assessment of enhancement quality, enabling targeted adjustments to improve the final output.

[0065] The context-aware image enhancement controller (405) processes the set of locally enhanced features, which may include at least one of the texture of the RoI, the main structure of the RoI, the sub-structure of the RoI, the type of the RoI, the size of the RoI, the color of the RoI, and textual regions. This involves using advanced image processing techniques to enhance each aspect of the RoI, ensuring that the enhancements are both subtle and effective. Further, the set of globally enhanced features may include at least one of background regions, foreground regions, the texture of the RoI, the main structure of the RoI, the sub-structure of the RoI, the type of the RoI, the size of the RoI, the color of the RoI, and textual regions. The global enhancement process may be designed to improve the overall image quality while maintaining the integrity of the original content, using a combination of sharpness, texture enhancement, contrast adjustment, and noise reduction techniques.

[0066] FIG. 5 illustrates the working operation of the context-aware image enhancement, detailing the interaction between various models and components that process the input image (501) to produce the enhanced output. The context-aware image enhancement controller (405) includes the input image (501), a fine face semantic model (502) (the term "fine face semantic model" is interchangeably referred to as "pre-trained RoI fine semantic model"), the face restoration and beautification model (503), a pre-trained coarse semantic model (504), a context-aware detail enhancement engine (505), a face confidence score evaluator (506), a regression weight analyzer (507), a combiner (508), and the output image (509). Each component is designed to perform specific tasks that contribute to the overall enhancement process, ensuring that the final output image may be both visually appealing and contextually accurate. The integration of these models allows for a comprehensive approach to image enhancement, addressing both local and global features.

[0067] The context-aware image enhancement controller (405) receives the input image (501), including the RoI. The Region-of-Interest (ROI) may be referred to as a first region. The input image may be captured at the defined zoom level. The input image is processed to ensure that the RoI may be clearly visible and can be accurately processed. The context-aware image enhancement controller (405) may determine a region of interest in the input image, for example, a main subject of the input image. For example, the context-aware image enhancement controller (405) may select a region of interest in the input image that includes a face of the subject. Additionally, the context-aware image enhancement controller (405) may select a region of interest in an input image that includes text.

[0068] According to an embodiment of the disclosure, the context-aware image enhancement controller (405) may identify the main subject by considering the location of the subject included in the input image. For example, the context-aware image enhancement controller (405) may select a subject located in the center of the input image as the main subject.
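Selecting the subject located in the center of the input image can be sketched as a nearest-to-center search over detected subject boxes. The box format `(x_min, y_min, x_max, y_max)` and the function name are assumptions made for illustration. The disclosure does not specify how subjects are detected or represented.

```python
def select_main_subject(boxes, image_width, image_height):
    """Return the index of the box whose center is closest to the image center.

    boxes is a list of (x_min, y_min, x_max, y_max) tuples in pixels.
    Returns None when no subject was detected.
    """
    if not boxes:
        return None
    cx, cy = image_width / 2.0, image_height / 2.0

    def squared_distance(box):
        x_min, y_min, x_max, y_max = box
        bx, by = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    return min(range(len(boxes)), key=lambda i: squared_distance(boxes[i]))


boxes = [(0, 0, 20, 20), (40, 30, 60, 50), (80, 70, 100, 100)]
main_index = select_main_subject(boxes, 100, 80)
```

Here the second box is centered exactly on the image center, so it is chosen as the main subject.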

[0069] The RoI may be a specific area within the image or dataset that is selected for detailed analysis. For example, a geometric shape of RoI may be a rectangle shape that may represent the boundary of an object in the image. The selection of the RoI may determine the focus of the enhancement process, allowing the system to prioritize certain areas over others based on their importance or relevance to the overall image context.

[0070] Further, the input image (501) may contain a plurality of the subjects, such as the human face, text, animals, or other entities, including but not limited to objects, flowers, plants, trees, cars, toys, buildings, vehicles, food, and more. Depending on the type of subject contained in the input image, the category of the input image can be classified. Each type of input image may be enhanced using the proposed model, allowing for adjustments that improve clarity, detail, and overall visual appeal regardless of the subject matter. The system may be designed to recognize and adapt to various types of content, applying specific enhancement techniques that are best suited for each category. This adaptability may ensure that the enhancement process is both efficient and effective, regardless of the diversity of the input images.

[0071] The fine face semantic model (502) (the term "fine face semantic model" is interchangeably referred to as "RoI semantic model") may be a pre-trained model that localizes one or more features of the RoI. For example, the fine face semantic model (502) may localize one or more facial RoI features within the human face in the input image (501), such as the positions of the eyes, nose, or mouth. The output of the fine face semantic model (502) may be the first semantic map. The first semantic map may include the localization information of a plurality of portions of the RoI in the input image. For example, if the RoI is the face, the first semantic map may include position data of the facial features. Thus, the first semantic map may be referred to as the face semantic map. The face semantic map may be generated by detecting and mapping facial features such as the eyes, nose, and mouth. Further, each RoI in the map may be named according to the corresponding facial feature, creating a detailed representation of the face. The fine face semantic model (502) may help to enhance the specific facial attributes based on their spatial relationships. This model may use advanced machine learning techniques to accurately identify and map facial features, ensuring that the enhancement process is precise and tailored to the characteristics of each face.

[0072] The face restoration model (503) (the term "face restoration model" is interchangeably referred to as "RoI restoration model") may receive the first semantic map from the fine face semantic model (502). Based on the first semantic map, such as the localization information, the face restoration model (503) applies specific smoothing and filtering techniques to improve the appearance of the features included in the RoI (e.g., face), providing the output as a first image. The first image may also be referred to as the face restoration image or facially enhanced image. The features included in the RoI may be adjusted. For example, smoothing may be applied to reduce noise, or filtering may be applied to emphasize specific frequencies or features. Since only the features included in the RoI are adjusted within the entire input image, the features may be locally adjusted. Therefore, the adjusted features may be the locally adjusted features, and the set of adjusted features may be referred to as the first set of locally adjusted features. Further, the first image may include the set of locally enhanced features corresponding to the RoI from the input image (501). The model may employ a combination of convolutional neural networks and image processing techniques to achieve a natural and aesthetically pleasing result, enhancing features such as skin texture, eye brightness, and overall facial symmetry.

[0073] The locally enhanced features may include the texture of the RoI. The main structure of the RoI may represent the entire structure of the subject included in the ROI. The sub-structure may represent a part of the structure of the subject included in the ROI. For example, the main structure of the RoI may contain the face, while the substructures within the RoI may include the positions of pixels corresponding to the eyes, nose, mouth, ears, facial contours, and skin texture. The type of RoI, size of the RoI (represented by the number of pixels covering the substructures), color of the RoI, and any associated textual regions are also part of the enhanced features. The model may be capable of distinguishing between different facial features and applying targeted enhancements to each, ensuring that the final image retains a natural appearance while highlighting attributes.

[0074] For non-facial objects, the main structure of the RoI may contain text, while the substructures within the RoI may include the position of pixels indicating alphabets, font size, color of text, and script. The size of the RoI in this case may be represented by the font size or the overall text layout. The model may be designed to handle a wide range of non-facial content, applying specific enhancement techniques that are tailored to the characteristics of each type of object. This may ensure that the final image may be both visually appealing and contextually accurate, regardless of the diversity of the input content.

[0075] Additionally, the face restoration model (503) adjusts sharpness, reducing noise and texture while preserving naturalness, thereby providing a visually appealing result. The model may use advanced noise reduction techniques to minimize unwanted artifacts, while simultaneously enhancing the sharpness and clarity of key features. This ensures that the final image may be both clear and detailed, with a natural appearance that is free from distortion or unnatural artifacts.

[0076] In an embodiment, the face restoration model (503) may use face enhancement techniques such as skin smoothing, color correction, detail enhancement, facial feature refinement, brightness and contrast adjustment, makeup simulation, eye brightening, eyebrows' filters, face restoration, etc. These techniques may be applied in a context-aware manner, ensuring that each enhancement is tailored to the unique characteristics of the input image. This may allow for a highly personalized enhancement process, resulting in a final image that is both visually appealing and contextually accurate.

[0077] The pre-trained coarse semantic model (504) may be the pre-trained coarse scene classifier which receives the input image (501) to determine the semantic global context map. The semantic global context map may be referred to as the second semantic map. The second semantic map may include category information corresponding to a plurality of pixels of the input image (501). The semantic global context map may include the result of categorizing the pixels in the input image (501) into a class of the RoI from the plurality of classes and the result of analyzing the content in the images. Further, the pre-trained coarse semantic model (504) may map the input image (501) to broader categories or a plurality of classes such as indoor or outdoor scenes, sky, grass, trees, common objects, humans, pets, etc., as the globally enhanced features. This pre-trained coarse semantic model (504) may help to understand the overall context of the image, which may be used for subsequent analysis. The model uses deep learning techniques to accurately classify and map the content of the image, providing a comprehensive understanding of the overall scene.
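The per-pixel category information in the second semantic map can be illustrated by picking the highest-scoring class for each pixel. The sketch assumes that some classifier has already produced one score per class for every pixel. The score layout, the class list, and the function name are hypothetical.

```python
def coarse_class_map(class_scores, class_names):
    """Pick the highest-scoring class for every pixel.

    class_scores[r][c] is a list with one score per class, in the same
    order as class_names. Returns a 2-D map of class names.
    """
    label_map = []
    for row in class_scores:
        labels = []
        for scores in row:
            best = max(range(len(scores)), key=scores.__getitem__)
            labels.append(class_names[best])
        label_map.append(labels)
    return label_map


names = ["sky", "grass", "human"]
pixel_scores = [[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]],
                [[0.2, 0.6, 0.2], [0.3, 0.5, 0.2]]]
semantic_map = coarse_class_map(pixel_scores, names)
```

The resulting map has the same height and width as the image and holds one class label per pixel.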

[0078] Further, the first set of locally adjusted features from the face restoration model (503) may be sent to the context-aware detail enhancement engine (505) for applying attention. Simultaneously, the globally enhanced features from the pre-trained coarse semantic model (504) and input image (501) are sent to the context-aware detail enhancement engine (505) to provide a better understanding of context for selective area enhancement in the image. The engine uses advanced attention mechanisms to selectively enhance specific areas of the image, ensuring that the final output is both visually appealing and contextually accurate.

[0079] Attention mechanisms may include operations for obtaining feature data, referred to as query, key, and value, calculating weights corresponding to correlations between the query and the key, and applying the weights to the values. For example, the first set of locally adjusted features may be used in the attention mechanism. The weights for the attention may be adjusted differently between features corresponding to the RoI region and the non-RoI region. Thus, the image characteristics of the input image may be changed.
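The query, key, and value operations can be sketched with scaled dot-product attention, a common formulation that is used here as an assumption since the disclosure does not name a specific variant. The additive `roi_bias` on keys that belong to the RoI is a hypothetical way to treat RoI and non-RoI features differently.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def roi_biased_attention(queries, keys, values, key_in_roi, roi_bias=1.0):
    """Scaled dot-product attention with an additive bias on RoI keys.

    queries, keys, values : lists of feature vectors (keys and queries
                            share one length, values may differ)
    key_in_roi            : one boolean per key
    roi_bias              : hypothetical bias added to scores of RoI keys
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = []
        for k, in_roi in zip(keys, key_in_roi):
            # Correlation between the query and the key.
            score = sum(qi * ki for qi, ki in zip(q, k)) / scale
            scores.append(score + (roi_bias if in_roi else 0.0))
        weights = softmax(scores)
        # Apply the weights to the values.
        out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [1.0, 0.0]]
values = [[1.0], [0.0]]
outputs, weights = roi_biased_attention(queries, keys, values, [True, False])
_, plain_weights = roi_biased_attention(queries, keys, values, [True, False],
                                        roi_bias=0.0)
```

With identical keys, the unbiased weights are equal, while the biased call shifts weight toward the key marked as belonging to the RoI.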

[0080] In addition, the intermediate facial engine or context-aware detail enhancement engine (505) may focus on improving the non-RoI areas (e.g., the remaining areas of the image apart from facial features) by applying denoising and sharpness enhancement in a context-aware approach and provide output as a second image. The second image may be also referred to as context-aware enhancement output. Moreover, the context-aware detail enhancement engine (505) understands scene types and components along with a guide for facial enhancement to produce a sharper, higher resolution image while maintaining naturalness and beautification of the RoI (e.g., including face). The engine may use a combination of machine learning techniques and image processing techniques to achieve a balanced enhancement process, ensuring that both facial and non-facial features are accurately represented in the final image.

[0081] In other words, the context-aware detail enhancement engine (505) may provide the second image. The second image may include the set of globally enhanced features over the first image and the input image (501). The engine may use advanced blending techniques to seamlessly integrate the locally and globally enhanced features, ensuring that the final image is both cohesive and visually appealing.

[0082] Further, the set of globally enhanced features may include background regions, foreground regions, the texture of the RoI, the main structure of the RoI, a sub-structure of the RoI, a type of the RoI, a size of the RoI, a color of the RoI, and textual regions. The engine may be capable of accurately identifying and enhancing a wide range of features, ensuring that the final image is both detailed and contextually accurate.

[0083] The face confidence score evaluator (506) may determine the first confidence score for the first image. For example, the face confidence score evaluator (506) may determine the face confidence score(s) by assigning weightage to each facial feature using a pre-trained DNN model. The face confidence score evaluator (506) determines the scores for face structure integrity and skin texture given for face restored model output and context-aware enhancement output. Simultaneously, the face confidence score evaluator (506) compares between input images, face-only restored output (e.g., first image), and full scene enhanced outputs (e.g., second image), thereby determining the face confidence score(s). The evaluator may use advanced machine learning techniques to accurately assess the quality and integrity of the enhanced features, ensuring that the final image is both visually appealing and contextually accurate.

[0084] Further, the confidence score may contain H*W*n values for the output of face restoration model (503) and the output of context-aware detail enhancement model (505). The H represents height of the image (e.g., the number of vertical pixels corresponding to the image), W represents width of the image (e.g., the number of horizontal pixels corresponding to the image), and n represents channel of the image (e.g., features corresponding to facial RoI like eyes, nose, mouth, skin texture, and so on). The evaluator may use these values to accurately assess the quality and integrity of the enhanced features, ensuring that the final image is both visually appealing and contextually accurate.

[0085] The regression weight analyzer (507) may generate the regression weight for each of the output of face restoration model (503) and the output of context-aware detail enhancement model (505). The regression weights are calculated based on the face confidence score feature map. The face confidence score feature map is used to compute weights for the combination of the first image (e.g., the output of face restored model) and the second image (e.g., the output of context-aware enhancement model). The regression weight ensures the seamless transition in restoration properties between RoI portions and non-RoI portions of the input image. The analyzer uses advanced machine learning techniques to accurately calculate the regression weights, ensuring that the final image is both visually appealing and contextually accurate.
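One simple way to turn the two confidence maps into weights for the combination is per-pixel normalization, sketched below. The disclosure describes the regression weights as computed from the face confidence score feature map without giving a formula, so the normalization rule, the `eps` threshold, and the fallback to the second image are assumptions.

```python
def regression_weights(conf_first, conf_second, eps=1e-6):
    """Convert two confidence maps into per-pixel weights that sum to 1.

    Where both confidences are (near) zero, for example outside the RoI,
    all of the weight goes to the second (globally enhanced) image.
    """
    w_first, w_second = [], []
    for row_a, row_b in zip(conf_first, conf_second):
        wa_row, wb_row = [], []
        for a, b in zip(row_a, row_b):
            total = a + b
            if total < eps:
                wa, wb = 0.0, 1.0
            else:
                wa, wb = a / total, b / total
            wa_row.append(wa)
            wb_row.append(wb)
        w_first.append(wa_row)
        w_second.append(wb_row)
    return w_first, w_second


conf_first = [[0.75, 0.0]]
conf_second = [[0.25, 0.0]]
w_first, w_second = regression_weights(conf_first, conf_second)
```

Because the weights vary continuously with the confidence values, pixels near the RoI boundary receive intermediate weights, which is consistent with the seamless transition between RoI and non-RoI portions described above.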

[0086] The combiner (508) may combine the first image and the second image. For example, the combiner (508) may blend the facial features using pre-defined high confidence scores from the first image and the second image to generate the output image (509). This output image (509) may include the RoI based on the combination of the locally enhanced features and the globally enhanced features. The combiner (508) may use advanced blending techniques to seamlessly integrate the locally and globally enhanced features, ensuring that the final image is both cohesive and visually appealing. The output image may be the result of a comprehensive enhancement process, combining the strengths of each model to produce a final image that is both visually appealing and contextually accurate.

[0087] FIG. 6 illustrates the working operation of the text-aware image enhancement model according to the embodiment as disclosed herein. The text-aware image enhancement controller includes a text-input image (601), a fine text semantic model (602), a text enhancement model (603), pre-trained coarse semantic model (504), the context-aware detail enhancement engine (505), a text confidence evaluator (606), the regression weight analyzer (507), the combiner (508), and the output text image (609).

[0088] The text image enhancement controller may receive the text-input image (601), which includes the RoI containing text. The text-input image (601) may be captured at the defined zoom level as indicated in FIG. 18A. The highlighted portion in FIG. 18A indicates the text present in the text-input image (601).

[0089] The fine text semantic model (602) may detect the textual region and localize the semantic RoI features within the text in the text-input image (601). The text semantic map may be generated by detecting font style, size, orientation, etc. The fine text semantic model (602) may identify the textual regions and may output a text semantic map with bounding boxes for these regions, which are subsequently enhanced by the text enhancement model (603).
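
As an illustration only, the bounding boxes in such a text semantic map could be rasterized into a binary mask that later stages use to restrict enhancement to the textual regions. The sketch below is a minimal example under stated assumptions: the function name `boxes_to_mask` and the (top, left, bottom, right) box format with exclusive bottom/right edges are hypothetical and are not specified by the disclosure.

```python
import numpy as np

def boxes_to_mask(height, width, boxes):
    """Rasterize boxes into a binary H x W mask (1 = text region).

    Assumption: each box is (top, left, bottom, right) in pixels,
    with bottom/right exclusive. This format is hypothetical.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for top, left, bottom, right in boxes:
        # Clamp to the image so partially visible boxes remain valid.
        t, l = max(0, top), max(0, left)
        b, r = min(height, bottom), min(width, right)
        if t < b and l < r:
            mask[t:b, l:r] = 1.0
    return mask

# Two example boxes; the second extends past the image and is clamped.
text_mask = boxes_to_mask(6, 8, [(1, 1, 3, 5), (4, 6, 9, 12)])
```

A mask of this kind may then be passed, together with the input image, to whichever model performs the region-specific enhancement.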

[0090] The text enhancement model (603) may receive the semantic map with bounding boxes for the text regions from the fine text semantic model (602). Based on the received semantic map, the text enhancement model (603) may apply specific smoothing and filtering techniques to improve the appearance of the text and may generate the output as the first text image.

[0091] The pre-trained coarse semantic model (504) may be the pre-trained coarse scene classifier, which receives the text-input image (601) to categorize each pixel or RoI and analyzes the content in the images. Further, the pre-trained coarse semantic model (504) may map the image to broader categories such as indoor or outdoor scenes, sky, grass, trees, common objects, humans, pets, etc., as globally enhanced features.

[0092] As disclosed above, the context-aware detail enhancement engine (505) focuses on improving the remaining areas of the image apart from textual regions by applying denoising and sharpness enhancement in the context-aware approach and provides output as the second image.

[0093] The text confidence evaluator (606) may calculate a pixel-wise confidence score for each character and assign the learning weights to each character accordingly. The regression weight analyzer (507) generates regression weights for the first text image (the text-restored output) and for the second image (the output of the context-aware detail enhancement engine (505)).

[0094] As discussed above, the combiner (508) may combine the first text image and the second image. For example, the combiner (508) may blend the text features using the pre-defined high confidence scores from the first text image and the second image to generate an enhanced output text image (609). The enhanced output text image (609) is shown in FIG. 18C. The highlighted portion in FIG. 18C represents the enhancement of the output text image (609) over the text-input image (601).

[0095] FIG. 7 is the flowchart that illustrates the working operation of the proposed model according to the embodiment as disclosed herein. The context-aware image enhancement model may receive the input image, which includes the RoI, wherein the input image is captured at a defined zoom level (701).

[0096] At step 701, the context-aware detail enhancement model may receive the input image. The input image (501) may include the RoI, which is captured at the defined zoom level. The zoom level may determine the resolution and detail available for processing, allowing the model to focus on specific areas with greater precision. The model may be designed to handle varying zoom levels by dynamically adjusting its processing parameters to maintain consistent enhancement quality across different image scales. This adaptability may ensure that the model can effectively enhance images captured under diverse conditions, such as varying lighting and focus settings.

[0097] At steps 702 and 703, the context-aware detail enhancement model may compute the one or more localized facial RoI features (702) using the fine face semantic model (502), along with the globally enhanced features (703) using the pre-trained coarse semantic model (504). The fine face semantic model may be optimized to detect subtle facial features, such as skin texture variations and micro-expressions, which are used for realistic enhancement. Meanwhile, the pre-trained coarse semantic model (504) may provide a broader context by identifying and enhancing background elements, ensuring that the overall scene remains coherent and visually appealing. This dual approach allows the model to balance detail enhancement with contextual awareness, resulting in a more natural and aesthetically pleasing output.

[0098] At step 704, the facial region may be enhanced based on the localized facial RoI features and the input image (501), generating the first image that focuses on improving facial details. The enhancement process may involve applying advanced image processing techniques, such as edge sharpening and noise reduction, specifically tailored to facial features. These techniques may be guided by the semantic map generated in the previous step, ensuring that enhancements are applied precisely where needed. The result may be a first image that exhibits improved clarity and detail in facial regions, making it suitable for applications like portrait photography and video conferencing.
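
By way of a non-limiting sketch, an RoI-guided enhancement of this kind could be approximated by sharpening the whole image and then keeping the sharpened pixels only where the semantic map marks the RoI. The example below assumes a single-channel float image in [0, 1], a binary RoI mask, and a simple unsharp mask built on a 3x3 mean filter; the function names and the `amount` parameter are hypothetical, and the disclosure's face restoration model is a learned model rather than this fixed filter.

```python
import numpy as np

def box_blur3(img):
    """3x3 mean filter with edge replication for a 2-D float image."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    acc = np.zeros((h, w), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / 9.0

def sharpen_in_roi(img, roi_mask, amount=0.5):
    """Unsharp-mask the image, then keep the result only inside the RoI."""
    blurred = box_blur3(img)
    sharpened = np.clip(img + amount * (img - blurred), 0.0, 1.0)
    # Outside the RoI (mask == 0) the input pixels pass through unchanged.
    return roi_mask * sharpened + (1.0 - roi_mask) * img

image = np.linspace(0.0, 1.0, 25).reshape(5, 5)
roi = np.zeros((5, 5))
roi[1:4, 1:4] = 1.0
first_image = sharpen_in_roi(image, roi)
```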

[0099] At step 705, an entire or complete scene may be enhanced by applying the globally enhanced features to the first image and the input image (501), generating the second image with overall scene improvements. This step may involve integrating enhancements from both the facial and background regions, ensuring a harmonious balance between the two. The model may use a blending technique to seamlessly merge these enhancements, preserving the natural look of the scene while enhancing its visual appeal. The second image thus may represent a comprehensive improvement over the original, with both localized and global enhancements contributing to its quality.

[0100] At step 706, the confidence score may be generated for both the locally and the globally enhanced features. The first confidence score may be generated based on the first image, and the input image (501), facilitating the determination of the quality of enhancement corresponding to RoI. The second confidence score may be generated based on the second image, and the input image (501), facilitating the determination of the quality of enhancement corresponding to entire region of the image. The confidence score is calculated using a combination of statistical analysis and machine learning techniques, which assess the effectiveness of the enhancements in terms of clarity, detail, and naturalness. The confidence score helps in evaluating the performance of the enhancement model and can be used to fine-tune the parameters corresponding to the confidence score for future iterations. By providing a quantitative measure of enhancement quality, the confidence score aids in ensuring consistent and reliable results.

[0101] At step 707, the context-aware detail enhancement model may combine the RoI enhancements based on confidence scores for each set of features (707), creating the output image (509) that incorporates the enhancements from both regions. In other words, the context-aware detail enhancement model may combine the first image and the second image. For example, the context-aware detail enhancement model may blend the first set of locally adjusted features and the second set of globally adjusted features, by using the first confidence score from the first image and the second confidence score from second image to generate the output image (509). The first confidence score may be a pixel weight indicating the reconstruction quality (e.g., similarity of the structure or the texture) between the first image and the input image. The second confidence score may be a pixel weight indicating the reconstruction quality (e.g., similarity of the structure or the texture) between the second image and the input image. The first image may be weighted by the first confidence score, the second image may be weighted by the second confidence score. The weighted first image and the weighted second image may be combined to generate the output image. This step involves a weighted combination of enhancements, where features with higher confidence scores are given more prominence in the final output. The model uses an optimization technique to determine the optimal blend of enhancements, ensuring that the output image is both visually appealing and true to the original scene. The result is an output image that effectively balances detail and context, providing a superior visual experience.
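
The weighted combination described for step 707 may be sketched as follows. This is a minimal illustration, assuming that both confidence scores are per-pixel maps in [0, 1] and that the two weights are normalized so they sum to one at each pixel; the normalization, the fallback to a plain average where both scores are near zero, and the name `blend_by_confidence` are assumptions not stated in the disclosure.

```python
import numpy as np

def blend_by_confidence(first, second, conf1, conf2, eps=1e-6):
    """Blend two H x W x C images using H x W confidence maps in [0, 1]."""
    w1 = conf1[..., None]  # add a channel axis so the maps broadcast
    w2 = conf2[..., None]
    total = w1 + w2
    # Avoid dividing by ~0: where both scores vanish, use an even split.
    safe_total = np.where(total > eps, total, 1.0)
    w1_norm = np.where(total > eps, w1 / safe_total, 0.5)
    w2_norm = 1.0 - w1_norm
    return w1_norm * first + w2_norm * second

first_image = np.zeros((2, 2, 3))
second_image = np.ones((2, 2, 3))
output_image = blend_by_confidence(
    first_image, second_image,
    np.full((2, 2), 0.75), np.full((2, 2), 0.25),
)
```

With these example inputs, each output pixel takes three quarters of its value from the first image and one quarter from the second, so the image with the higher confidence score dominates.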

[0102] At step 708, the generated output image (509) may be enhanced over the input image (501). The enhancement process may be designed to be non-destructive, preserving the original image's integrity while adding value through improved detail and context. The output image may be suitable for a wide range of applications, from professional photography to consumer electronics, where high-quality image enhancement is desired. By leveraging advanced machine learning techniques, the model ensures that the enhancements are both effective and efficient, making it a valuable tool for image processing tasks.

[0103] FIG. 8 may be a diagram that illustrates the facial feature localization and context understanding according to the embodiment as disclosed herein.

[0104] The input image (501) may be provided to the pre-trained fine face semantic model (502) to localize various RoI regions of the face, namely eyes, nose, mouth, and skin. The fine face semantic model (502) generates a facial map that highlights only the facial RoI region (802) while the background remains unchanged. This selective focus allows the model to concentrate computational resources on the areas targeted for enhancement, improving both efficiency and effectiveness. The facial map serves as a guide for subsequent enhancement steps, ensuring that improvements are applied precisely where they are most needed.

[0105] The fine face semantic model (502) may be trained to identify and distinguish features within the RoI for targeted enhancement. During the training process, the fine face semantic model (502) may learn to recognize and semantically classify distinguishable features within the RoI, such as eyes, nose, mouth, skin texture, and facial hair (if required), by analyzing annotated training data. The training dataset may include images with labeled features, ensuring the model can accurately map these features to their corresponding regions. This training process involves the use of advanced machine learning techniques, such as CNN, which are well-suited for image recognition tasks. By leveraging these techniques, the model can achieve high accuracy in feature localization, even in challenging conditions such as low lighting or occlusions.

[0106] The objective of the fine face semantic model (502) training involves minimizing the loss function that measures the accuracy of feature identification and localization within the RoI. This ensures that the semantic map produced by the model correctly reflects the spatial layout and characteristics of the identified features. For applications like face-aware image restoration, this capability may allow the model to focus on enhancing specific areas of interest while preserving or improving their natural appearance. The loss function may be carefully designed to penalize both false positives and false negatives, ensuring that the model achieves a balanced performance across different types of facial features. By refining the model through iterative training, the developers can ensure that it remains robust and reliable in real-world applications.
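
The disclosure does not name the loss function. One known loss that penalizes false positives and false negatives with separate weights is the Tversky loss, shown below purely as an example of the idea; it is not asserted to be the loss used in the embodiment. Here `pred` is a soft mask in [0, 1], `target` is a binary ground-truth mask, and the weight names are illustrative.

```python
import numpy as np

def tversky_loss(pred, target, fp_weight=0.5, fn_weight=0.5, eps=1e-7):
    """Tversky loss for soft segmentation masks.

    With fp_weight == fn_weight == 0.5 this reduces to the Dice loss.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    tp = np.sum(pred * target)           # soft true positives
    fp = np.sum(pred * (1.0 - target))   # soft false positives
    fn = np.sum((1.0 - pred) * target)   # soft false negatives
    return 1.0 - (tp + eps) / (tp + fp_weight * fp + fn_weight * fn + eps)

target_mask = np.array([1.0, 1.0, 0.0, 0.0])
missed_pixel = np.array([1.0, 0.0, 0.0, 0.0])  # one false negative
balanced = tversky_loss(missed_pixel, target_mask)
fn_heavy = tversky_loss(missed_pixel, target_mask, fp_weight=0.1, fn_weight=0.9)
```

Raising `fn_weight` increases the penalty for missed facial pixels, which is one way the balance between the two error types could be tuned.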

[0107] Further, the input image (501) may be provided to the pre-trained coarse semantic model (504), which is a pre-trained coarse scene classifier used to categorize each pixel or RoI in high-level classes such as sky, grass, trees, and common objects, human, pets, etc. These categorizations act as the globally enhanced features. The pre-trained coarse semantic model (504) may be trained to generate a detailed segmentation map by identifying and labeling all objects within the image. The training process may involve using datasets with object-level labels, enabling the model to learn to differentiate between various objects and recognize their spatial relationships to the RoI. The objective is to ensure the pre-trained coarse semantic model (504) accurately segments objects and assigns them to distinct channels or bounding boxes. During training, the loss function evaluates the precision of the segmentation and object labeling to refine the pre-trained coarse semantic model (504) ability to produce outputs for guiding image restoration processes. The segmentation map serves as a foundation for global enhancements, allowing the model to apply context-aware improvements that enhance the overall scene without compromising the integrity of individual objects.
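
As a hedged illustration of assigning categories to distinct channels, a per-pixel label map may be converted into one channel per class. The class list and the function name below are hypothetical; the disclosure lists example categories but does not fix a label set or an encoding.

```python
import numpy as np

# Hypothetical label set, for illustration only.
CLASSES = ["sky", "grass", "tree", "human", "other"]

def labels_to_channels(label_map, num_classes):
    """Convert an H x W integer label map into an H x W x num_classes
    one-hot tensor, one channel per coarse category."""
    label_map = np.asarray(label_map)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype=np.float32)[label_map]

labels = np.array([[0, 3],
                   [1, 4]])
global_context = labels_to_channels(labels, len(CLASSES))
```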

[0108] FIG. 9 is the block diagram that illustrates the working mechanism of the face restoration model by using localized facial RoI features according to the embodiment as disclosed herein. As discussed above, the input image (501) may be processed by the pre-trained fine face semantic model (502) to localize various facial RoI regions (802), including face, nose, eyes, eyebrows, lips, and others. Further, the face restoration model (503) receives both the input image (501) and output from the fine face semantic model (502), which includes various facial RoI regions. The face restoration model (503) enhances the facial features and performs face beautification for each facial region, such as face, nose, eyes, eyebrows, and lips.

[0109] The face restoration model (503) may be trained using training images as the input image (501). The face restoration model (503) focuses on enhancing the RoI by emphasizing features that are different in edge characteristics, details, and texture from the rest of the image. During training, the fine segmentation map and intermediate features from the fine semantic segmentation model help guide the model's attention to selectively enhance the RoI. Custom loss functions are used to give more importance to enhancements in the RoI, ensuring that improvements are prioritized in these areas while maintaining overall image coherence.
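
A custom loss that gives more importance to the RoI may, for example, be a pixel-wise L1 loss in which RoI pixels carry a larger weight. The sketch below is one such assumed form; the choice of L1, the `roi_weight` value, and the function name are illustrative and are not taken from the disclosure.

```python
import numpy as np

def roi_weighted_l1(pred, target, roi_mask, roi_weight=4.0):
    """Mean absolute error in which RoI pixels count roi_weight times
    as much as non-RoI pixels. All arrays are H x W."""
    weights = 1.0 + (roi_weight - 1.0) * roi_mask
    return float(np.sum(weights * np.abs(pred - target)) / np.sum(weights))

roi = np.zeros((4, 4))
roi[1:3, 1:3] = 1.0
target = np.zeros((4, 4))

error_in_roi = target.copy()
error_in_roi[1, 1] = 1.0       # unit error on a face pixel
error_outside = target.copy()
error_outside[0, 0] = 1.0      # same error on a background pixel

loss_in = roi_weighted_l1(error_in_roi, target, roi)
loss_out = roi_weighted_l1(error_outside, target, roi)
```

The same unit error costs four times as much when it falls inside the RoI, which pushes training toward improving the facial region first.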

[0110] For example, in the input image (501), the highlighted portion (905) indicates a blurred face, poor skin texture, and lack of clarity in the eyes and eyebrows. The face restoration model (503) identifies these facial regions and enhances them by applying filters and adjustments to improve clarity, improve skin texture, and the appearance of the eyes, specifically targeting only the affected facial areas. This process results in the generation of the facially enhanced image or the first image (903). Additionally, an intermediate facial map (904) representing facial features and localized context is passed to the context-aware detail enhancement engine (505).

[0111] FIG. 10 illustrates the example output of facial feature enhancement using the face restoration model (503) according to the embodiment as disclosed herein. The highlighted portion (903 and 906) indicates the face-restored part over the input image (501) and non-sharp background, respectively. Here, the face restoration model (503) is only responsible for enhancing the facial RoI region. The model employs advanced neural network architectures, such as convolutional neural networks (CNNs), to detect and enhance specific facial features like eyes, nose, and mouth. It utilizes a multi-layered approach to refine skin texture and tone, ensuring a natural look. The model is trained on a diverse dataset of facial images to learn various facial structures and expressions, allowing the model to adaptively enhance features while maintaining the individual's characteristics.

[0112] FIG. 11 illustrates the working operation of the context-aware detail enhancement engine (505). As disclosed above, the first image generated from the face restoration model (503) and the semantic global context map from the pre-trained coarse semantic model (504) are passed to the context-aware detail enhancement engine (505) for a better understanding of context for selective area enhancement in the image. The engine (505) may employ a combination of semantic segmentation and attention mechanisms to identify and prioritize areas of interest within the image. The engine (505) may use a hierarchical approach to process different image layers, ensuring that both macro and micro details are enhanced appropriately. The engine (505) may be capable of distinguishing between various scene elements, such as foreground and background, and may apply context-specific enhancements to each, resulting in a more cohesive and visually appealing image.

[0113] The context-aware detail enhancement engine (505) may identify scene types and components. Additionally, the context-aware detail enhancement engine (505) may enhance the entire scene by utilizing the global context and the improved facial image from the face restoration model (503), resulting in the second image (111). The second image may be referred to as the globally enhanced image. The engine (505) may incorporate machine learning techniques that analyze the spatial relationships between different image components, allowing the engine (505) to apply enhancements that are consistent with the overall scene composition. The engine (505) may use edge detection and texture analysis techniques to refine details in both the RoI region and non-RoI region (e.g., facial and non-facial regions), ensuring a balanced enhancement across the entire image. The engine's ability to adapt to various lighting conditions and color schemes further enhances its effectiveness in producing high-quality images.

[0114] The context-aware detail enhancement engine (505) may be trained using a generic image dataset that is not specific to any particular camera sensors. The dataset may be used to train the context-aware detail enhancement engine (505) to enhance images by improving sharpness, restoring details, and reducing noise. Custom loss functions may be applied during training to focus on making subtle enhancements to the RoI in the images. The loss functions may prioritize maintaining the overall structure of the image, ensuring that changes to the RoI are kept to a minimum. In addition, segmentation maps and feature maps from both the fine face semantic model (502) and the pre-trained coarse semantic model (504) are used to guide the training process. The maps may act as attention maps, helping the model focus on specific areas during training. The maps ensure that areas outside the RoI receive more significant enhancements while the RoI is preserved with minimal changes. This training method aims to strike a balance between enhancing less critical areas and preserving the integrity of important regions, resulting in a natural and well-balanced second image. The training process also involves data augmentation techniques to simulate various real-world scenarios, enhancing the model's robustness and adaptability.

[0115] FIG. 12 illustrates the example output of the context-aware detail enhancement engine (505) according to the embodiment as disclosed herein. The highlighted portion (121a) indicates an enhanced face. However, the enhancement is not fully applied as the context-aware detail enhancement engine (505) maintains the balance between face and non-face enhancement. The highlighted portion (121b) indicates sharper background enhancement over the input image (501). The engine uses a dynamic thresholding technique to determine the level of enhancement required for different image regions, ensuring that the enhancements are subtle yet effective. It employs a feedback loop mechanism to iteratively refine the enhancements, allowing for real-time adjustments based on the image content. This approach ensures that the final output maintains a natural appearance while significantly improving image quality.

[0116] FIG. 13 illustrates the method of determination of the face confidence score according to the embodiment as disclosed herein. The face confidence score evaluator (506) may determine the confidence score using the input image (501), the facially enhanced image or the first image (903), and the context-aware enhancement output or the second image (111). The evaluator (506) may use a multi-criteria analysis approach to assess the quality of enhancements, considering factors such as feature clarity, color accuracy, and texture consistency. The face confidence score evaluator (506) may employ a weighted scoring system to assign confidence scores, allowing for a nuanced evaluation of image quality. The evaluator's technique may be designed to be adaptive, learning from user feedback to continuously improve its scoring accuracy.

[0117] The face confidence score evaluator (506) may determine the confidence score (131) based on three outputs represented in terms of the width of the image (W), the height of the image (H), and the number of channels (n). Each channel may contain different feature information of the RoI. The face confidence score evaluator (506) may assign H*W*n1 values for the face restoration image or first image to evaluate improvements in sharpness, color, and features compared to the input image (501). Here, n1 represents features corresponding to the facial RoI, such as eyes, nose, mouth, skin texture, and so on. Similarly, the face confidence score evaluator (506) may assign H*W*n2 values for the context-aware enhancement output or second image (111) and evaluate it for coherence in lighting, sharpness, color grading, etc. Here, n2 represents features corresponding to global features such as indoor, outdoor, or background scenes. Further, the face confidence score evaluator (506) may assign weights for each feature for the input image (501), the facially enhanced image (903), and the context-aware enhancement output or second image (111) based on the quality of each RoI feature. The evaluator may use a machine learning model trained on a large dataset of annotated images to predict the optimal weights for each feature, ensuring that the confidence scores accurately reflect the perceived image quality.

[0118] Furthermore, the face confidence score evaluator (506) may determine the confidence score for each of the facially enhanced image (903) and the second image (111) based on the skin texture and face structure of the human face in both the facially enhanced image (903) and the second image (111). The training of the face confidence score evaluator (506) may involve enabling the evaluator (506) to assign confidence scores to the features of the RoI within the image. The face confidence score evaluator (506) may compare the input image (501), and first image (903) to generate the first confidence score map. The face confidence score evaluator (506) may compare the input image (501), and second image (111) to generate the second confidence score map. Based on the comparison, a confidence score map is generated that assigns a weight (e.g., ranging from 0 to 1) to each feature in the RoI. These weights are directly based on the determined reconstruction fidelity of the respective feature, ensuring that features with higher reconstruction quality are given higher weightages. The evaluator's technique may incorporate a feedback mechanism that allows the evaluator to learn from user interactions, continuously refining its scoring criteria to better align with user preferences.
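
To make the comparison step concrete, the sketch below derives a confidence map in the range 0 to 1 by comparing an enhanced image against the input image. The evaluator (506) in the disclosure is a trained component; the mean absolute difference used here is only an assumed stand-in for its notion of reconstruction fidelity, and `confidence_map` is a hypothetical name.

```python
import numpy as np

def confidence_map(input_img, enhanced_img):
    """Per-pixel confidence in [0, 1] for H x W x C images in [0, 1].

    Assumption: fidelity is approximated as one minus the mean absolute
    difference across channels; a learned evaluator would replace this.
    """
    diff = np.abs(enhanced_img - input_img).mean(axis=-1)
    return np.clip(1.0 - diff, 0.0, 1.0)

input_image = np.full((2, 2, 3), 0.5)
first_image = input_image.copy()
first_image[0, 0] = 1.0  # one pixel changed strongly by enhancement
first_confidence = confidence_map(input_image, first_image)
```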

[0119] FIG. 14 illustrates the working and learning mechanism of the regression weight analyzer (507). The regression weight analyzer (507) may determine regression weights for the face-restored output and the context-aware enhancement model output. The face confidence score feature map is used to compute weights for the combination of the face restoration model (503) and the context-aware enhancement engine (505) outputs. The face confidence score feature map may include data that spatially maps the confidence scores across different features of the image, such as the first confidence score and the second confidence score. The analyzer may employ a regression model that uses historical data to predict the optimal weights for combining the two outputs, ensuring that the final image is both visually appealing and contextually accurate. The regression model may be trained using a diverse dataset of images, allowing the regression model to generalize well across different image types and conditions.

[0120] Further, the regression weight analyzer (507) may receive the two types of confidence scores, which evaluate the quality of each RoI feature, and may represent the weights as w1 and w2 for the face restoration image (902) and the second image (111), respectively. These regression weights are not fixed; the weights are learned based on the best possible face images from the dataset. This dataset consists of high-quality facial and pixel RoI data, allowing the analyzer (507) to determine how much weight to assign to the scores from the first image. For example, the first weights (w1) may correspond to a portion of the RoI, such as eye quality weights (e.g., w11, w12, ..., w1n). Herein, n may be the number of channels for the first image. The analyzer (507) may likewise determine how much weight to assign to the scores from the second image. For example, the second weights (w2) may correspond to a portion of the entire image, such as pixel quality weights (e.g., w21, w22, ..., w2n). Herein, n may be the number of channels for the second image. The weights may be adjusted dynamically based on the characteristics of the images. The regression weights generated ensure a seamless transition in restoration properties between face and non-face regions. The analyzer's technique incorporates a feedback loop that allows it to continuously refine the regression weights based on user feedback and new data, ensuring that the model remains up-to-date and effective.
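
One possible reading of the per-channel weights (e.g., w11, w12, ..., w1n) is that they reduce an H x W x n confidence tensor to a single per-pixel weight map. The sketch below illustrates that reading only; the normalization of the channel weights and the name `collapse_confidence` are assumptions, and in the disclosure the weights are learned rather than supplied by hand.

```python
import numpy as np

def collapse_confidence(conf, channel_weights):
    """Reduce an H x W x n confidence tensor to an H x W map.

    channel_weights: n non-negative values (hypothetically w11..w1n),
    normalized here so the result stays within the range of conf.
    """
    w = np.asarray(channel_weights, dtype=np.float64)
    w = w / w.sum()
    # Matrix product over the last axis: (H, W, n) @ (n,) -> (H, W).
    return conf @ w

# Two channels: the first scores 0 everywhere, the second scores 1.
conf_tensor = np.stack([np.zeros((2, 3)), np.ones((2, 3))], axis=-1)
w1_map = collapse_confidence(conf_tensor, [1.0, 3.0])
```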

[0121] FIG. 15 is the block diagram illustrating the working operation of the combiner (508) according to the embodiment as disclosed herein. The combiner (508) integrates the output of the face restoration model (503) with the output of the context-aware detail enhancement model (505). The locally enhanced features from the face restoration model (503) may be multiplied by the first weight and may be represented as (I - A)* *w1. The output obtained from the context-aware detail enhancement model (505) may be multiplied by the second weight and may be represented as (I - B)* *w2. Herein, I represents the input image, A represents the first image, B represents the second image, w1 represents the first weight, and w2 represents the second weight. The parameters and ensure that the combined output remains within the predefined domain. By integrating this information, the combiner (508) generates the output image (509) that represents effective enhancement over the input image (501). The combiner (508) uses a blending technique that intelligently merges the two outputs, ensuring that the final image retains the best features of both models. It employs a multi-scale approach to handle different levels of detail, ensuring that both fine and coarse features are enhanced appropriately.
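
The residual terms (I - A) and (I - B) weighted by w1 and w2 can be illustrated with the sketch below. The disclosure does not state how the two weighted terms are merged into the output image, so this example makes two explicit assumptions: the weighted residuals are subtracted from the input image, and the result is clipped to [0, 1] as a stand-in for keeping the output within the predefined domain. The additional scaling parameters mentioned in the text are omitted, and `combine_residuals` is a hypothetical name.

```python
import numpy as np

def combine_residuals(inp, first, second, w1, w2, lo=0.0, hi=1.0):
    """Assumed combiner: out = I - (w1 * (I - A) + w2 * (I - B)), clipped.

    inp (I), first (A) and second (B) are arrays of the same shape;
    w1 and w2 are scalars or arrays that broadcast against them.
    """
    res1 = (inp - first) * w1    # weighted residual of the first image
    res2 = (inp - second) * w2   # weighted residual of the second image
    return np.clip(inp - (res1 + res2), lo, hi)

I = np.full((2, 2), 0.2)
A = np.full((2, 2), 0.6)
B = np.full((2, 2), 0.4)
output_image = combine_residuals(I, A, B, 0.5, 0.5)
```

Under this assumed form, when w1 + w2 = 1 at a pixel the result equals w1*A + w2*B, that is, an ordinary convex blend of the two enhanced images.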

[0122] FIG. 16 illustrates the combined output image (509) according to the embodiment as disclosed herein. In FIG. 16, the highlighted portion (152a) indicates the face restoration achieved while maintaining non-facial region enhancement. In the highlighted portion (152a), the facial features of the human subject are enhanced by applying the required enhancement technique. The combiner (508) uses a context-aware blending technique that ensures the enhancements are applied seamlessly across different image regions, maintaining a natural appearance. It employs a feedback mechanism that allows it to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0123] The highlighted portion (152b) indicates the sharpened background. In the highlighted portion (152b), the background of the image is naturally enhanced compared to the input image (501). The combiner (508) uses a texture analysis technique to identify and enhance background details, ensuring that the enhancements are consistent with the overall scene composition. It employs a dynamic thresholding technique to determine the level of enhancement required for different background elements, ensuring that the final image is both visually appealing and contextually accurate.

[0124] Further, FIG. 17A, FIG. 17B, and FIG. 17C illustrate the comparison of human image enhancement between the existing system and the proposed disclosure applied to the input images. FIG. 17A indicates the input image (501) with the highlighted portion (171a) and (171b) representing a blurred face and blurred background. The proposed disclosure uses a multi-stage enhancement process that applies targeted enhancements to both the facial and non-facial regions, ensuring that the final image is both visually appealing and contextually accurate. The multi-stage enhancement process may employ a feedback mechanism that allows the process to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0125] FIG. 17B illustrates the enhanced human image output obtained from the existing system according to prior art. The highlighted portion (172a) indicates the face: although enhancement is applied to the input image (501), the face appears unnatural, with a painted effect and noisy regions. Additionally, the background (172b) remains of low quality as the existing system does not focus on enhancing the background. The proposed disclosure uses a context-aware enhancement engine that applies targeted enhancements to both the facial and non-facial regions, ensuring that the final image is both visually appealing and contextually accurate. The context-aware enhancement engine may employ a feedback mechanism that allows the context-aware enhancement engine to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0126] FIG. 17C illustrates the output image (509) obtained from the proposed disclosure. The highlighted portions (173a) and (173b) indicate a sharper facial structure and skin texture as well as better details in the background, respectively. The image in FIG. 17C appears balanced and visually appealing. The proposed disclosure uses a multi-stage enhancement process that applies targeted enhancements to both the facial and non-facial regions, ensuring that the final image is both visually appealing and contextually accurate. The multi-stage enhancement process may employ a feedback mechanism that allows the process to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0127] FIG. 18A, FIG. 18B, and FIG. 18C illustrate the comparison of text image enhancement between the existing system and the proposed disclosure applied to the input images. FIG. 18A displays the text-input image (601) with the highlighted portion (181) illustrating the image with text. The text and background in FIG. 18A appear blurred with a lack of clarity. The proposed disclosure uses a context-aware enhancement engine that applies targeted enhancements to both the text and background regions, ensuring that the final image is both visually appealing and contextually accurate. The context-aware enhancement engine may employ a feedback mechanism that allows the context-aware enhancement engine to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0128] FIG. 18B illustrates the text image output obtained from the existing system according to prior art. The highlighted portion (182) indicates the text: although enhancement is applied, the text still appears blurred, with noisy regions. Additionally, the background remains of low quality as the existing system does not focus on enhancing the background. The proposed disclosure uses a context-aware enhancement engine that applies targeted enhancements to both the text and background regions, ensuring that the final image is both visually appealing and contextually accurate. The context-aware enhancement engine may employ a feedback mechanism that allows the context-aware enhancement engine to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0129] FIG. 18C illustrates the output image obtained from the proposed model. The highlighted portion (183) indicates significantly improved text clarity and distinction, while the background exhibits enhanced details and visual balance, resulting in an enhanced output overall.

[0130] FIG. 19A and FIG. 19B illustrate the comparison between the input image (501) and the output image (509) generated by the proposed disclosure. FIG. 19A represents the input image (501) captured at a defined zoom level, in which both the face and the background appear blurred.

[0131] FIG. 19B illustrates the output image (509) of the proposed model, which represents a naturally enhanced face region along with the background. The enhancement preserves fine textures while significantly reducing over-sharpening in sensitive areas.

[0132] In an embodiment, the proposed disclosure generates a visually appealing face in images captured at a defined zoom level within the capture pipeline. The disclosure uses a context-aware enhancement engine that applies targeted enhancements to both the facial and non-facial regions, ensuring that the final image is both visually appealing and contextually accurate. The context-aware enhancement engine employs a feedback mechanism that allows the context-aware enhancement engine to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0133] In an embodiment, the method involves context-aware image restoration that utilizes both local and global context features. Further, the method combines the local context-based enhanced image and the global context-aware enhanced image based on the confidence scores to generate the desired output.

[0134] In an embodiment, the image quality is enhanced through varied scales of sharpening and texture reproduction based on context. This approach preserves fine details while achieving a more natural appearance in facial regions. The proposed solution demonstrates a +3 MOS (Mean Opinion Score) improvement in perceptual quality compared to existing methods, resulting in a significantly better user experience, particularly at high zoom levels that may be higher than the threshold magnification (e.g., 10x, 100x). The disclosure uses a context-aware enhancement engine that applies targeted enhancements to both the facial and non-facial regions, ensuring that the final image is both visually appealing and contextually accurate. It employs a feedback mechanism that allows it to iteratively refine the enhancements based on user feedback, ensuring that the final image meets user expectations.

[0135] The foregoing description of the specific embodiments will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt such specific embodiments for various applications without departing from the generic concept, and, therefore, such adaptations and modifications are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments herein have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments herein can be practiced with modification within the scope of the embodiments as described herein.

[0136] According to an embodiment of the disclosure, a method for image enhancement by an electronic device may be provided. The method may include obtaining, by the electronic device (401), an input image (501) comprising a first region. The method may include generating, by the electronic device (401), a first image (903) comprising a first set of locally adjusted features corresponding to the first region, based on the input image (501). The method may include generating, by the electronic device (401), a second image (111) comprising a second set of globally adjusted features over the input image (501), based on the first set of locally adjusted features and the input image (501). The method may include determining, by the electronic device (401), a first confidence score for the first image (903) and a second confidence score for the second image (111), based on the input image (501). The method may include generating, by the electronic device (401), an output image (509) by combining the first image (903) and the second image (111), based on the first confidence score for the first image (903) and the second confidence score for the second image (111).
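The data flow of the method above can be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the callables `local_enhance`, `global_enhance`, and `score` are hypothetical placeholders, images are assumed to be grayscale grids with values in [0, 1], and the normalized confidence-weighted average is an assumed way of combining the two images.

```python
from typing import Callable, List

Image = List[List[float]]  # assumed: grayscale pixel grid, values in [0, 1]


def enhance(
    input_image: Image,
    local_enhance: Callable[[Image], Image],
    global_enhance: Callable[[Image, Image], Image],
    score: Callable[[Image, Image], float],
) -> Image:
    """Hypothetical orchestration of the five steps described in the text."""
    # First image: locally adjusted features for the first region.
    first = local_enhance(input_image)
    # Second image: globally adjusted features, conditioned on the
    # locally adjusted result and on the input image.
    second = global_enhance(first, input_image)
    # Confidence scores, both determined with reference to the input image.
    c1 = score(first, input_image)
    c2 = score(second, input_image)
    # Assumed combination rule: confidence-normalized weighted average.
    total = c1 + c2
    w1 = 0.5 if total == 0 else c1 / total
    return [
        [w1 * a + (1.0 - w1) * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]
```

Passing the stages in as callables keeps the sketch independent of any particular enhancement model; only the ordering and the dependencies between steps are taken from the text.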

[0137] According to an embodiment of the disclosure, the method may include determining, by the electronic device (401), a first semantic map of the first region, based on the input image (501). The first semantic map may comprise localization information of a plurality of portions of the first region in the input image (501). The method may include determining, by the electronic device (401), the first set of locally adjusted features by adjusting at least one of sharpness and texture of the plurality of portions, based on the first semantic map. Each locally adjusted feature of the first set comprises an adjusted feature corresponding to each of the plurality of portions of the first region. The method may include generating, by the electronic device (401), the first image (903) comprising the first set of locally adjusted features.
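As a rough illustration of portion-wise adjustment driven by a semantic map, the sketch below applies an unsharp mask whose strength depends on the portion label of each pixel. The unsharp mask, the 3x3 box blur, the integer label encoding (0 meaning "outside the first region"), and the `gains` table are all assumptions made for the example; the disclosure does not specify these details.

```python
from typing import Dict, List

Image = List[List[float]]    # assumed: grayscale values in [0, 1]
LabelMap = List[List[int]]   # assumed: one portion label per pixel, 0 = none


def box_blur(img: Image) -> Image:
    """3x3 mean filter; windows are truncated at the image border."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [
                img[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def adjust_locally(img: Image, portion_map: LabelMap,
                   gains: Dict[int, float]) -> Image:
    """Sharpen each pixel with the gain assigned to its portion label."""
    blurred = box_blur(img)
    out = []
    for y, row in enumerate(img):
        out_row = []
        for x, v in enumerate(row):
            # Labels without an entry in `gains` are left unchanged.
            gain = gains.get(portion_map[y][x], 0.0)
            sharpened = v + gain * (v - blurred[y][x])
            out_row.append(min(1.0, max(0.0, sharpened)))
        out.append(out_row)
    return out
```

For a face, hypothetical labels might distinguish eyes, skin, and hair, with a lower gain on skin to avoid the unnatural textures the disclosure aims to prevent.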

[0138] According to an embodiment of the disclosure, the method may include determining, by the electronic device (401), a second semantic map of the input image (501), based on the input image (501). The second semantic map may comprise category information corresponding to a plurality of pixels of the input image (501). The method may include determining, by the electronic device (401), the second set of globally adjusted features by adjusting sharpness of the first region based on the first semantic map and adjusting sharpness of the input image based on the second semantic map. Each globally adjusted feature of the second set comprises an adjusted feature corresponding to each of the plurality of pixels of the input image (501). The method may include generating, by the electronic device (401), the second image (111) comprising the second set of globally adjusted features.
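One way to picture how the two semantic maps could jointly steer global sharpening is a per-pixel gain map: inside the first region the first semantic map selects the gain, and elsewhere the per-pixel category from the second semantic map does. The function name, the label encodings, and the gain tables below are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List

LabelMap = List[List[int]]


def build_gain_map(
    category_map: LabelMap,            # second semantic map: category per pixel
    portion_map: LabelMap,             # first semantic map: 0 outside first region
    category_gains: Dict[int, float],  # assumed sharpening gain per category
    portion_gains: Dict[int, float],   # assumed sharpening gain per portion
) -> List[List[float]]:
    """Return a sharpening gain for every pixel of the input image."""
    gain_map = []
    for cat_row, por_row in zip(category_map, portion_map):
        row = []
        for cat, por in zip(cat_row, por_row):
            if por != 0:
                # Pixel lies in the first region: the first map drives the gain.
                row.append(portion_gains.get(por, 0.0))
            else:
                # Everywhere else the pixel category drives the gain.
                row.append(category_gains.get(cat, 0.0))
        gain_map.append(row)
    return gain_map
```

The resulting map could then scale any sharpening operator pixel by pixel, so that, for example, foliage and sky receive different treatment while the first region keeps its own settings.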

[0139] According to an embodiment of the disclosure, the method may include generating, by the electronic device (401), a first regression weight based on the first confidence score feature. The method may include generating, by the electronic device (401), a second regression weight based on the first confidence score feature. The first regression weight and the second regression weight may be configured to provide a transition in restoration properties between the first region and non-first region of the input image (501). The method may include combining, by the electronic device (401), the first set of locally adjusted features weighted by a first parameter and the first regression weight with the second set of globally adjusted features in the second image (111) weighted by a second parameter and the second regression weight. The first parameter and the second parameter are configured to restrict the combined output to lie within a predefined domain. The method may include generating, by the electronic device (401), the output image (509) having the first region, based on the combination of the first set of locally adjusted features and the second set of globally adjusted features.
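The weighted combination described above might look like the sketch below. Several choices here are assumptions rather than details from the disclosure: both regression weights are derived from the first confidence score through a logistic curve (giving a smooth transition between regions), the second weight is the complement of the first, and the parameters `alpha` and `beta` are scalars in [0, 1] so that the combined value stays within the domain [0, 1] whenever the inputs do.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # assumed: values in [0, 1]


def regression_weights(confidence: float,
                       steepness: float = 8.0) -> Tuple[float, float]:
    """Map the first confidence score to a complementary pair of weights."""
    # Logistic curve centred at 0.5: a soft, monotonic transition.
    w1 = 1.0 / (1.0 + math.exp(-steepness * (confidence - 0.5)))
    return w1, 1.0 - w1


def combine(local_img: Image, global_img: Image, confidence: Image,
            alpha: float = 1.0, beta: float = 1.0) -> Image:
    """Blend local and global results pixel by pixel."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    out = []
    for loc_row, glo_row, conf_row in zip(local_img, global_img, confidence):
        row = []
        for loc, glo, conf in zip(loc_row, glo_row, conf_row):
            w1, w2 = regression_weights(conf)
            # w1 + w2 = 1 and alpha, beta <= 1, so the result stays in [0, 1]
            # for inputs in [0, 1].
            row.append(alpha * w1 * loc + beta * w2 * glo)
        out.append(row)
    return out
```

With `alpha = beta = 1` this reduces to a convex blend; where the first confidence score is high the locally adjusted features dominate, and where it is low the globally adjusted features take over.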

[0140] According to an embodiment of the disclosure, the method may include determining, by the electronic device (401), the first confidence score for each of the first set of locally adjusted features. The first confidence score indicates a degree of similarity between the first image and the input image. The method may include determining, by the electronic device (401), the second confidence score for each of the second set of globally adjusted features. The second confidence score indicates a degree of similarity between the second image and the input image.

[0141] According to an embodiment of the disclosure, the method may include comparing, by the electronic device (401), the first set of locally adjusted features in the first image (903) and the input image (501) using a pre-trained deep neural network (DNN) model. The comparison output comprises the first confidence score. The method may include comparing, by the electronic device (401), the second set of globally adjusted features in the second image (111) and the input image (501), using the pre-trained deep neural network (DNN) model. The comparison output may comprise the second confidence score.
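To make the notion of a confidence score as a degree of similarity concrete, the sketch below scores each pixel by how closely the enhanced image matches the input within a small window. This is only a stand-in: the text specifies a pre-trained DNN as the comparator, whereas the example substitutes a simple mean-absolute-difference measure, and the function name and `radius` parameter are hypothetical.

```python
from typing import List

Image = List[List[float]]  # assumed: grayscale values in [0, 1]


def confidence_map(enhanced: Image, reference: Image,
                   radius: int = 1) -> Image:
    """Per-pixel similarity in [0, 1]; 1.0 means identical in the window."""
    h, w = len(reference), len(reference[0])
    scores = []
    for y in range(h):
        row = []
        for x in range(w):
            # Absolute differences over a window truncated at the border.
            diffs = [
                abs(enhanced[j][i] - reference[j][i])
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            row.append(1.0 - sum(diffs) / len(diffs))
        scores.append(row)
    return scores
```

Calling the function once with the first image and once with the second image, each against the input image, yields the two score maps; a learned comparator would replace the windowed difference while keeping the same interface.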

[0142] According to an embodiment of the disclosure, the first set of locally adjusted features may comprise at least one of a texture of the first region, a main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions. The second set of globally adjusted features may comprise at least one of background regions, foreground regions, the texture of the first region, the main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions.

[0143] According to an embodiment of the disclosure, an electronic device (401) for context-aware image enhancement may be provided. The electronic device may comprise a memory (404) storing one or more instructions, and at least one processor (402). The instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to obtain an input image (501) comprising a first region. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a first image (903) comprising a first set of locally adjusted features corresponding to the first region, based on the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a second image (111) comprising a second set of globally adjusted features over the input image (501), based on the first set of locally adjusted features and the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine a first confidence score for the first image (903) and a second confidence score for the second image (111), based on the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate an output image (509) by combining the first image (903) and the second image (111), based on the first confidence score for the first image (903) and the second confidence score for the second image (111).

[0144] According to an embodiment of the disclosure, the instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine a first semantic map of the first region, based on the input image (501). The first semantic map may comprise localization information of a plurality of portions of the first region in the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine the first set of locally adjusted features by adjusting at least one of sharpness and texture of the plurality of portions, based on the first semantic map. Each locally adjusted feature of the first set may comprise an adjusted feature corresponding to each of the plurality of portions of the first region. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate the first image (903) comprising the first set of locally adjusted features.

[0145] According to an embodiment of the disclosure, the instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine a second semantic map of the input image (501), based on the input image (501). The second semantic map may comprise category information corresponding to a plurality of pixels of the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine the second set of globally adjusted features by adjusting sharpness of the first region based on the first semantic map and adjusting sharpness of the input image based on the second semantic map. Each globally adjusted feature of the second set may comprise an adjusted feature corresponding to each of the plurality of pixels of the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate the second image (111) comprising the second set of globally adjusted features.

[0146] According to an embodiment of the disclosure, the instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a first regression weight based on the first confidence score feature. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate a second regression weight based on the first confidence score feature. The first regression weight and the second regression weight may be configured to provide a transition in restoration properties between the first region and non-first region of the input image (501). The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to combine the first set of locally adjusted features weighted by a first parameter and the first regression weight with the second set of globally adjusted features in the second image (111) weighted by a second parameter and the second regression weight. The first parameter and the second parameter are configured to restrict the combined output to lie within a predefined domain. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to generate the output image (509) having the first region, based on the combination of the first set of locally adjusted features and the second set of globally adjusted features.

[0147] According to an embodiment of the disclosure, the instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine the first confidence score for each of the first set of locally adjusted features. The first confidence score indicates a degree of similarity between the first image and the input image. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to determine the second confidence score for each of the second set of globally adjusted features. The second confidence score indicates a degree of similarity between the second image and the input image.

[0148] According to an embodiment of the disclosure, the instructions, when executed by the at least one processor individually or collectively, cause the electronic device to compare the first set of locally adjusted features in the first image (903) and the input image (501) using a pre-trained deep neural network (DNN) model. The comparison output may comprise the first confidence score. The instructions, when executed by the at least one processor individually or collectively, cause the electronic device to compare the second set of globally adjusted features in the second image (111) and the input image (501), using the pre-trained deep neural network (DNN) model. The comparison output may comprise the second confidence score.

[0149] According to an embodiment of the disclosure, the first set of locally adjusted features may comprise at least one of a texture of the first region, a main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions. The second set of globally adjusted features may comprise at least one of background regions, foreground regions, the texture of the first region, the main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions.

[0150] According to an embodiment of the disclosure, a computer-readable medium storing one or more instructions may be provided. The one or more instructions, when executed by at least one processor, may cause the at least one processor of an electronic device to perform operations corresponding to the method.

Claims

1. A method for image enhancement by an electronic device, comprises:
obtaining, by the electronic device (401), an input image (501) comprising a first region;
generating, by the electronic device (401), a first image (903) comprising a first set of locally adjusted features corresponding to the first region, based on the input image (501);
generating, by the electronic device (401), a second image (111) comprising a second set of globally adjusted features over the input image (501), based on the first set of locally adjusted features and the input image (501);
determining, by the electronic device (401), a first confidence score for the first image (903) and a second confidence score for the second image (111), based on the input image (501); and
generating, by the electronic device (401), an output image (509) by combining the first image (903) and the second image (111), based on the first confidence score for the first image (903) and the second confidence score for the second image (111).

2. The method as claimed in claim 1, wherein generating the first image (903) comprising the first set of locally adjusted features corresponding to the first region comprises:
determining, by the electronic device (401), a first map of the first region, based on the input image (501), wherein the first semantic map comprises localization information of a plurality of portions of the first region in the input image (501);
determining, by the electronic device (401), the first set of locally adjusted features by adjusting at least one of sharpness and texture of the plurality of portions, based on the first semantic map, wherein each locally adjusted feature of the first set comprises adjusted feature corresponding to each of the plurality of portions of the first region; and
generating, by the electronic device (401), the first image (903) comprising the first set of locally adjusted features.

3. The method as claimed in any one of claims 1 and 2, wherein generating the second image (111) comprising a second set of globally adjusted features over the input image (501) comprises:
determining, by the electronic device (401), a second semantic map of the input image (501), based on the input image (501), wherein the second semantic map comprises category information corresponding to a plurality of pixels of the input image (501);
determining, by the electronic device (401), the second set of globally adjusted features by adjusting sharpness of the first region based on the first semantic map and adjusting sharpness of the input image based on the second semantic map, wherein each globally adjusted feature of the second set comprises adjusted feature corresponding to each of the plurality of pixels of the input image (501); and
generating, by the electronic device (401), the second image (111) comprising the second set of globally adjusted features.

4. The method as claimed in any one of claims 1 to 3, wherein generating, by the electronic device (401), the output image (509) comprises:
generating, by the electronic device (401), a first regression weight based on the first confidence score feature;
generating, by the electronic device (401), a second regression weight based on the first confidence score feature, wherein the first regression weight and the second regression weight are configured to provide a transition in restoration properties between the first region and non-first region of the input image (501); and
combining, by the electronic device (401), the first set of locally adjusted features weighted by a first parameter and first regression weight with the second set of globally adjusted features in the second image (111) weighted by a second parameter and the second regression weight, wherein the first parameter and the second parameter are configured to restrict a combined output lies within a predefined domain; and
generating, by the electronic device (401), the output image (509) having the first region, based on the combination of the first set of locally adjusted features and the second set of globally adjusted features.

5. The method as claimed in any one of claims 1 to 4, wherein determining the first confidence score and the second confidence score comprises:
determining, by the electronic device (401), the first confidence score for each of the first set of locally adjusted features, wherein the first confidence score indicates a degree of similarity between the first image and the input image; and
determining, by the electronic device (401), the second confidence score for each of the second set of globally adjusted features, wherein the second confidence score indicates a degree of similarity between the second image and the input image.

6. The method as claimed in any one of claims 1 to 5, wherein determining the first confidence score and the second confidence score comprises:
comparing, by the electronic device (401), the first set of locally adjusted features in the first image (903) and the input image (501) using a pre-trained deep neural network (DNN) model, wherein the comparison output comprises the first confidence score; and
comparing, by the electronic device (401), the second set of globally adjusted features in the second image (111) and the input image (501), using the pre-trained deep neural network (DNN) model, wherein the comparison output comprises the second confidence score.

7. The method as claimed in any one of claims 1 to 6,
wherein the first set of locally adjusted features comprises at least one of a texture of the first region, a main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions, and
wherein the second set of globally adjusted features comprises at least one of background regions, foreground regions, the texture of the first region, the main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions.

8. An electronic device (401) for context-aware image enhancement, comprises:
memory (404) storing one or more instructions;
at least one processor (402); and
wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
obtain an input image (501) comprising a first region;
generate a first image (903) comprising a first set of locally adjusted features corresponding to the first region, based on the input image (501);
generate a second image (111) comprising a second set of globally adjusted features over the input image (501), based on the second set of locally adjusted features and the input image (501);
determine a first confidence score for the first image (903) and a second confidence score for the second image (111), based on the input image (501); and
generate an output image (509) by combining the first image (903) and the second image (111), based on the first confidence score for the first image (903) and the second confidence score for the second image (111).

9. The electronic device (401) as claimed in claim 8, wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
determine a first semantic map of the first region, based on the input image (501), wherein the first semantic map comprises localization information of a plurality of portions of the first region in the input image (501);
determine the first set of locally adjusted features by adjusting at least one of sharpness and texture of the plurality of portions, based on the first semantic map, wherein each locally adjusted feature of the first set comprises adjusted feature corresponding to each of the plurality of portions of the first region; and
generate the first image (903) comprising the first set of locally adjusted features.

10. The electronic device (401) as claimed in any one of claims 8 to 9, wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
determine a second semantic map of the input image (501), based on the input image (501), wherein the second semantic map comprises category information corresponding to a plurality of pixels of the input image (501);
determine the second set of globally adjusted features by adjusting sharpness of the first region based on the first semantic map and adjusting sharpness of the input image based on the second semantic map, wherein each globally adjusted feature of the second set comprises adjusted feature corresponding to each of the plurality of pixels of the input image (501); and
generate the second image (111) comprising the second set of globally adjusted features.

11. The electronic device (401) as claimed in any one of claims 8 to 10, wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
generate a first regression weight based on the first confidence score feature;
generate a second regression weight based on the first confidence score feature, wherein the first regression weight and the second regression weight are configured to provide a transition in restoration properties between the first region and non-first region of the input image (501); and
combine the first set of locally adjusted features weighted by a first parameter and first regression weight with the second set of globally adjusted features in the second image (111) weighted by a second parameter and the second regression weight, wherein the first parameter and the second parameter are configured to restrict a combined output lies within a predefined domain; and
generate the output image (509) having the first region, based on the combination of the first set of locally adjusted features and the second set of globally adjusted features.

12. The electronic device (401) as claimed in any one of claims 8 to 11, wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
determine the first confidence score for each of the first set of locally adjusted features, wherein the first confidence score indicates a degree of similarity between the first image and the input image; and
determine the second confidence score for each of the second set of globally adjusted features, wherein the second confidence score indicates a degree of similarity between the second image and the input image.

13. The electronic device (401) as claimed in any one of claims 8 to 12, wherein the instructions, when executed by the at least one processor (402) individually or collectively, cause the electronic device (401) to:
compare the first set of locally adjusted features in the first image (903) and the input image (501) using a pre-trained deep neural network (DNN) model; wherein the comparison output comprises the first confidence score; and
compare the second set of globally adjusted features in the second image (111) and the input image (501), using the pre-trained deep neural network (DNN) model, wherein the comparison output comprises the second confidence score.

14. The electronic device (401) as claimed in any one of claims 8 to 13,
wherein the first set of locally adjusted features comprises at least one of a texture of the first region, a main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions, and
wherein the second set of globally adjusted features comprises at least one of background regions, foreground regions, the texture of the first region, the main structure of the first region, a sub-structure of the first region, a type of the first region, a size of the first region, a color of the first region, or textual regions.

15. A computer-readable medium storing one or more instructions, wherein the one or more instructions, when executed by at least one processor, cause the at least one processor of an electronic device to perform the method of any one the claims 1 to 7.