Human body image super-resolution method, super-resolution device, electronic equipment and storage medium
By iteratively using face and body super-resolution diffusion models to process human images, the problem of unnatural transition at the boundary between face and body images is solved, achieving efficient and natural image super-resolution results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HONOR DEVICE CO LTD
- Filing Date
- 2024-03-19
- Publication Date
- 2026-06-26
AI Technical Summary
During the super-resolution process of human images, the large difference in resolution between the initial super-resolution face image and the initial super-resolution human image leads to an unnatural transition at the boundary between the face region image and the human body region image in the target super-resolution human image.
The initial human body image is subjected to iterative super-resolution processing using a face super-resolution diffusion model and a human body super-resolution diffusion model. By learning face noise and human body noise, the face region and human body region images are fused until the transition at the boundary is natural.
It improves the naturalness of the transition at the boundary between the face and body regions in the image, meets the requirements for the sharpness of the face and body, reduces the amount of data processing, and improves the super-resolution efficiency and image quality.
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a super-resolution method, super-resolution device, electronic device and storage medium for human images.
Background Technology
[0002] Super-resolution refers to taking a low-resolution image as input and processing it to obtain a high-resolution image with the same content.
[0003] Currently, users require a higher resolution for the face portion of a human image than for the human image as a whole. When performing super-resolution on a human image, the face image is typically super-resolved separately to obtain an initial super-resolution face image, and the human image is super-resolved separately to obtain an initial super-resolution human image. The initial super-resolution face image is then pasted over the face region image in the initial super-resolution human image, replacing it, to obtain the target super-resolution human image.
[0004] However, due to the significant difference in resolution between the initial super-resolution face image and the initial super-resolution human body image, there is an unnatural transition at the boundary between the face region image and the human body region image in the target super-resolution human body image.
Summary of the Invention
[0005] In view of the above, embodiments of this application provide a human image super-resolution method, super-resolution device, electronic device, and storage medium to overcome the problems of the prior art.
[0006] In a first aspect, embodiments of this application provide a human image super-resolution method, which includes: acquiring an initial human image; performing a first super-resolution process on the initial human image using a super-resolution model to obtain a first super-resolution human image; wherein the super-resolution model includes a face super-resolution diffusion model and a human super-resolution diffusion model, and the first super-resolution process includes: acquiring an initial face feature map and an initial human feature map based on the initial human image; inputting the initial face feature map into the face super-resolution diffusion model to obtain a first super-resolution face image; inputting the initial human feature map into the human super-resolution diffusion model to obtain an initial super-resolution human image, wherein the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human image; fusing the first super-resolution face image and the initial super-resolution human image to obtain a first super-resolution human image; using the first super-resolution human image obtained after the first super-resolution process as the initial human image in the second super-resolution process, and repeating the first super-resolution process until the Tth super-resolution process is completed to obtain the first super-resolution human image of the Tth super-resolution process.
[0007] The solution provided in this application, during the super-resolution of human images, learns the facial noise of the initial human image based on the face super-resolution diffusion model and the human noise of the initial human image based on the human super-resolution diffusion model, and fuses the facial noise and human noise to obtain the first super-resolution human image. The first super-resolution human image is then iteratively super-resolution processed. The face super-resolution diffusion model iteratively learns the facial noise in the previous face super-resolution image, and the human super-resolution diffusion model iteratively learns the human noise in the previous human super-resolution image. Ultimately, this makes the transition between the face region image and the human region image in the super-resolution human image obtained by multiple super-resolution iterations more natural.
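The iteration described in the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: the two diffusion models are replaced by hypothetical callables `face_step` and `body_step`, feature maps are toy single-channel lists of floats, the face bounding box is fixed across iterations, and the face output is kept at the same size as the cropped region so that it can be pasted back directly. All function and type names are illustrative.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # toy single-channel feature map (rows of floats)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(img: Image, box: Box) -> Image:
    """Return the region of img covered by box."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def paste(img: Image, patch: Image, box: Box) -> Image:
    """Return a copy of img with patch written at the top-left corner of box."""
    top, left, _, _ = box
    out = [row[:] for row in img]
    for r, patch_row in enumerate(patch):
        out[top + r][left:left + len(patch_row)] = patch_row
    return out


def iterative_super_resolution(
    initial: Image,
    face_box: Box,
    face_step: Callable[[Image, int], Image],  # stand-in for the face model
    body_step: Callable[[Image, int], Image],  # stand-in for the human model
    num_steps: int,                            # T in the text
) -> Image:
    current = initial
    for t in range(num_steps, 0, -1):
        face_in = crop(current, face_box)   # initial face feature map
        face_out = face_step(face_in, t)    # first super-resolution face image
        body_out = body_step(current, t)    # initial super-resolution human image
        # Fuse: the face result replaces the face region of the body result,
        # and the fused map becomes the input of the next iteration.
        current = paste(body_out, face_out, face_box)
    return current
```

Because the fused result is fed back into both models at every step, the body model sees the face model's output (and vice versa) in each subsequent iteration, which is the mechanism the text credits for the more natural boundary.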
[0008] In some optional embodiments, obtaining an initial face feature map and an initial human feature map based on an initial human body image includes: compressing the initial human body image to obtain a compressed human body image; and obtaining the initial face feature map and the initial human body feature map based on the compressed human body image.
[0009] The solution provided in this embodiment compresses the initial human body image and performs super-resolution based on the compressed human body image, thereby reducing the size of the data involved in the super-resolution process and improving the super-resolution efficiency of the human body image.
[0010] In some optional embodiments, obtaining an initial face feature map and an initial human feature map based on a compressed human image includes: determining the compressed human image as the initial human feature map; and cropping the initial human feature map based on the face bounding box of the initial human feature map to obtain the initial face feature map.
[0011] The solution provided in this embodiment crops the initial human body feature map based on the face bounding box. The cropped initial face feature map always contains face features, which improves the accuracy of obtaining the initial face feature map.
[0012] In some optional embodiments, before cropping the initial human feature map based on the face bounding box of the initial human feature map to obtain the initial face feature map, the human image super-resolution method further includes: performing face detection on the initial human image to obtain the coordinates of the first face bounding box; determining the coordinates of the second face bounding box of the initial human feature map based on the coordinates of the first face bounding box and the mapping relationship between the initial human image and the initial human feature map; and determining the face bounding box of the initial human feature map based on the coordinates of the second face bounding box.
[0013] The solution provided in this embodiment enables the determination of the face bounding box of the initial human feature map based on the coordinates of the first face bounding box of the initial human image and the mapping relationship between the initial human image and the initial human feature map, resulting in a face bounding box with high accuracy.
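As an illustration of the coordinate mapping in the preceding embodiments, the sketch below assumes that the mapping relationship between the initial human image and the initial human feature map is a plain per-axis scale (for example, the downsampling factor of the compression step). The text does not specify the mapping, so this is an assumption, and the names `map_box_to_feature_map` and `crop_feature_map` are hypothetical.

```python
import math
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), end-exclusive


def map_box_to_feature_map(
    box: Box,
    image_size: Tuple[int, int],    # (width, height) of the initial human image
    feature_size: Tuple[int, int],  # (width, height) of the feature map
) -> Box:
    """Scale a first face bounding box into feature-map coordinates."""
    img_w, img_h = image_size
    feat_w, feat_h = feature_size
    sx, sy = feat_w / img_w, feat_h / img_h
    left, top, right, bottom = box
    # Floor the near edges and ceil the far edges so the mapped box
    # still covers the whole face after rounding.
    return (
        max(0, math.floor(left * sx)),
        max(0, math.floor(top * sy)),
        min(feat_w, math.ceil(right * sx)),
        min(feat_h, math.ceil(bottom * sy)),
    )


def crop_feature_map(feature_map: List[list], box: Box) -> List[list]:
    """Crop the initial face feature map out of the human feature map."""
    left, top, right, bottom = box
    return [row[left:right] for row in feature_map[top:bottom]]
```

For a 1024x1024 image compressed to a 128x128 feature map, a face box of (100, 50, 300, 250) maps to (12, 6, 38, 32) under this assumption.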
[0014] In some optional embodiments, fusing the first super-resolution face image and the initial super-resolution human body image to obtain the first super-resolution human body image includes: selecting the second super-resolution face image in the initial super-resolution human body image that corresponds to the coordinates of the second face bounding box; and replacing the second super-resolution face image with the first super-resolution face image in the initial super-resolution human body image to obtain the first super-resolution human body image.
[0015] The solution provided in this embodiment uses the first super-resolution face image, obtained by face super-resolution, to update the face region image in the initial super-resolution human image, obtained by human body super-resolution. The resulting first super-resolution human image satisfies both the face clarity requirement and the human body clarity requirement, thus improving the user experience of human body image super-resolution.
[0016] In some optional embodiments, the first super-resolution face image includes a first super-resolution head image and a first super-resolution non-head image, and the initial super-resolution human body image includes an initial super-resolution head image and an initial super-resolution non-head image; in the initial super-resolution human body image, replacing the second super-resolution face image with the first super-resolution face image to obtain the first super-resolution human body image includes: in the initial super-resolution human body image, replacing the initial super-resolution head image with the first super-resolution head image to obtain the first super-resolution human body image.
[0017] The solution provided in this embodiment uses the first super-resolution head image, obtained by face super-resolution, to update the initial super-resolution head image in the initial super-resolution human body image, obtained by human body super-resolution. This reduces the amount of computation in the super-resolution processing and improves the efficiency of human body image super-resolution.
[0018] In some optional embodiments, the human image super-resolution method further includes: decoding the first super-resolution human image processed in the Tth super-resolution process to obtain a target super-resolution human image, wherein the resolution of the target super-resolution human image is higher than the resolution of the initial human image.
[0019] The solution provided in this embodiment decodes the first super-resolution human image in the Tth super-resolution process, and the resulting target super-resolution human image has the same dimension as the initial human image, thereby improving the image quality of the target super-resolution human image.
[0020] In some optional embodiments, the initial face feature map includes multiple sub-face feature maps. Before inputting the initial face feature map into the face super-resolution diffusion model to obtain the first super-resolution face image, the human image super-resolution method further includes: performing sliding window block processing on the initial face feature map to obtain multiple sub-face feature maps; inputting the initial face feature map into the face super-resolution diffusion model to obtain the first super-resolution face image, including: inputting the multiple sub-face feature maps into the face super-resolution diffusion model one by one, so that the face super-resolution diffusion model outputs the first super-resolution face image based on the multiple sub-face feature maps.
[0021] The solution provided in this embodiment divides the initial face feature map into multiple sub-face feature maps using a sliding window and inputs each sub-face feature map into the face super-resolution diffusion model for face super-resolution. This avoids feeding the entire face feature map into the face super-resolution diffusion model at once, which would increase the amount of computation for face super-resolution, and thus improves the efficiency of face super-resolution on the face feature map.
[0022] In some optional embodiments, the initial human feature map includes multiple sub-human feature maps. Before inputting the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image, the human image super-resolution method further includes: performing sliding window block processing on the initial human feature map to obtain multiple sub-human feature maps; inputting the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image includes: inputting the multiple sub-human feature maps into the human super-resolution diffusion model one by one, so that the human super-resolution diffusion model outputs the initial super-resolution human image based on the multiple sub-human feature maps.
[0023] The solution provided in this embodiment divides the initial human feature map into multiple sub-human feature maps using a sliding window and inputs each sub-human feature map into the human super-resolution diffusion model for human super-resolution. This avoids feeding the entire human feature map into the human super-resolution diffusion model at once, which would increase the amount of computation for human super-resolution, and thus improves the efficiency of human super-resolution on the human feature map.
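Sliding window block processing, as used for both the face and the human feature maps, can be sketched as below. The text does not state the window size, the stride, or how the per-window outputs are combined, so the overlapping windows and the averaging of overlaps in `merge_windows` are assumptions made for illustration only. All names are hypothetical.

```python
from typing import List, Tuple

FeatureMap = List[List[float]]
Tile = Tuple[Tuple[int, int], FeatureMap]  # ((top, left), window contents)


def window_starts(length: int, window: int, stride: int) -> List[int]:
    """Start offsets of windows that together cover [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # last window is shifted back to the edge
    return starts


def split_into_windows(feat: FeatureMap, window: int, stride: int) -> List[Tile]:
    """Cut a feature map into sub-feature maps, remembering each position."""
    height, width = len(feat), len(feat[0])
    tiles = []
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            tile = [row[left:left + window] for row in feat[top:top + window]]
            tiles.append(((top, left), tile))
    return tiles


def merge_windows(tiles: List[Tile], height: int, width: int) -> FeatureMap:
    """Reassemble processed windows, averaging wherever windows overlap."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (top, left), tile in tiles:
        for r, row in enumerate(tile):
            for c, value in enumerate(row):
                total[top + r][left + c] += value
                count[top + r][left + c] += 1
    return [[total[r][c] / count[r][c] for c in range(width)]
            for r in range(height)]
```

In use, each tile returned by `split_into_windows` would be passed to the diffusion model one by one, and the processed tiles would be handed to `merge_windows` to form the full output.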
[0024] In some optional embodiments, before performing the first super-resolution processing on the initial human image using a super-resolution model to obtain the first super-resolution human image, the human image super-resolution method further includes: constructing a super-resolution diffusion model based on a diffusion model and a control network model; training the super-resolution diffusion model based on a first historical face feature map and a second historical face feature map to obtain a face super-resolution diffusion model, wherein the first historical face feature map and the second historical face feature map contain the same face features, and the resolution of the first historical face feature map is higher than that of the second historical face feature map; and training the super-resolution diffusion model based on the first historical human feature map and the second historical human feature map to obtain a human super-resolution diffusion model, wherein the first historical human feature map and the second historical human feature map contain the same human features, and the resolution of the first historical human feature map is higher than that of the second historical human feature map.
[0025] The solution provided in this embodiment constructs a super-resolution diffusion model based on a diffusion model and a control network model. The control network model serves as the conditional control generation model for the super-resolution diffusion model, making the super-resolution process of human body feature maps controllable and improving the user's super-resolution experience for human body feature maps.
[0026] In some optional embodiments, training the super-resolution diffusion model based on the first historical face feature map and the second historical face feature map to obtain the face super-resolution diffusion model includes: inputting the first historical face feature map into the diffusion model of the super-resolution diffusion model, so that the diffusion model outputs the first historical face features to the control network model based on the first historical face feature map; and inputting the second historical face feature map into the control network model of the super-resolution diffusion model, so that the control network model is trained based on the first historical face features and the second historical face feature map to obtain the face super-resolution diffusion model.
[0027] The solution provided in this embodiment trains the control network model in the super-resolution diffusion model based on historical face feature maps, eliminating the need to train the diffusion model in the super-resolution diffusion model. This reduces the amount of computation during training of the super-resolution diffusion model and improves its training speed.
[0028] In some optional embodiments, training the super-resolution diffusion model based on the first historical human feature map and the second historical human feature map to obtain the human super-resolution diffusion model includes: inputting the first historical human feature map into the diffusion model in the super-resolution diffusion model, so that the diffusion model outputs the first historical human feature map to the control network model; and inputting the second historical human feature map into the control network model in the super-resolution diffusion model, so that the control network model is trained based on the first historical human feature map and the second historical human feature map to obtain the human super-resolution diffusion model.
[0029] The solution provided in this embodiment trains the control network model in the super-resolution diffusion model based on historical human feature maps, eliminating the need to train the diffusion model in the super-resolution diffusion model. This reduces the amount of computation during training of the super-resolution diffusion model and improves its training speed.
[0030] In some optional embodiments, obtaining the initial human image includes: obtaining an initial scene image containing the initial human image; performing human detection on the initial scene image to obtain a human bounding box; and cropping the initial scene image based on the human bounding box to obtain the initial human image.
[0031] The solution provided in this embodiment, based on the human body bounding box obtained by human body detection, crops the initial human body image from the initial scene image, thereby improving the accuracy of obtaining the initial human body image.
[0032] In some optional embodiments, the human image super-resolution method further includes: segmenting the first super-resolution human image of the Tth super-resolution process to obtain a human mask image; and pasting the human mask image back to the position corresponding to the initial human image in the initial scene image to obtain the target scene image.
[0033] The solution provided in this embodiment enables super-resolution processing only on the initial human body image in the initial scene image, avoiding the low super-resolution efficiency caused by super-resolution processing of the entire initial scene image, and improving the super-resolution efficiency of super-resolution of human body images.
[0034] In some optional embodiments, the human image super-resolution method further includes: eroding the boundary of the human mask image in the target scene image.
[0035] The solution provided in this embodiment reduces the unnatural transition between the super-resolution human image and the initial scene image by eroding the boundary of the super-resolution human image in the target scene image, thereby improving the fusion degree between the super-resolution human image and the initial scene image.
[0036] In some optional embodiments, the human image super-resolution method further includes: performing Gaussian blurring on the boundary of the human mask image in the target scene image.
[0037] The solution provided in this embodiment reduces the unnatural transition between the super-resolution human image and the initial scene image by performing Gaussian blurring on the boundary of the super-resolution human image in the target scene image, thereby improving the fusion degree between the super-resolution human image and the initial scene image.
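The paste-back with boundary erosion and Gaussian blurring can be sketched as follows. This is a simplified illustration under assumptions: the text applies erosion and Gaussian blurring to the boundary of the human mask image in the target scene image as two separate optional embodiments, whereas this sketch applies both to a binary mask and then uses the softened mask as a blending weight. Images are single-channel lists of floats, pixels outside the mask are treated as background during erosion, and all names and default parameters are hypothetical.

```python
import math
from typing import List

Grid = List[List[float]]


def erode(mask: Grid, radius: int = 1) -> Grid:
    """Shrink a binary mask: a pixel stays 1 only if its whole
    (2*radius+1)^2 neighbourhood is 1 (outside the grid counts as 0)."""
    h, w = len(mask), len(mask[0])
    return [[min(mask[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0.0
                 for rr in range(r - radius, r + radius + 1)
                 for cc in range(c - radius, c + radius + 1))
             for c in range(w)] for r in range(h)]


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalised 1-D Gaussian weights of length 2*radius+1."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(values: List[float], kernel: List[float]) -> List[float]:
    """Convolve one row or column, replicating the edge values."""
    radius, n = len(kernel) // 2, len(values)
    return [sum(k * values[min(max(i + j - radius, 0), n - 1)]
                for j, k in enumerate(kernel)) for i in range(n)]


def gaussian_blur(mask: Grid, sigma: float = 1.0, radius: int = 2) -> Grid:
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    rows = [blur_1d(row, kernel) for row in mask]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def paste_back(scene: Grid, person: Grid, mask: Grid, top: int, left: int) -> Grid:
    """Blend the super-resolved person into the scene, weighted by the mask."""
    out = [row[:] for row in scene]
    for r, mask_row in enumerate(mask):
        for c, alpha in enumerate(mask_row):
            out[top + r][left + c] = (alpha * person[r][c]
                                      + (1.0 - alpha) * scene[top + r][left + c])
    return out
```

A typical call under these assumptions would be `paste_back(scene, person, gaussian_blur(erode(mask)), top, left)`, where `top` and `left` are the position of the human bounding box in the initial scene image.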
[0038] In a second aspect, embodiments of this application provide a human image super-resolution device, comprising: an initial image acquisition module for acquiring an initial human image; a super-resolution processing module for performing a first super-resolution processing on the initial human image using a super-resolution model to obtain a first super-resolution human image; wherein the super-resolution model includes a face super-resolution diffusion model and a human super-resolution diffusion model, and the first super-resolution processing includes: acquiring an initial face feature map and an initial human feature map based on the initial human image; inputting the initial face feature map into the face super-resolution diffusion model to obtain a first super-resolution face image; inputting the initial human feature map into the human super-resolution diffusion model to obtain an initial super-resolution human image, wherein the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human image; fusing the first super-resolution face image and the initial super-resolution human image to obtain a first super-resolution human image; and a repetition execution module for using the first super-resolution human image obtained after the first super-resolution processing as the initial human image in the second super-resolution processing, and repeatedly executing the first super-resolution processing until the Tth super-resolution processing is completed to obtain the first super-resolution human image of the Tth super-resolution processing.
[0039] In a third aspect, embodiments of this application provide an electronic device, which includes: one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code, the computer program code including computer instructions, and the one or more processors call the computer instructions to cause the electronic device to execute the human image super-resolution method provided in the first aspect above.
[0040] In a fourth aspect, embodiments of this application provide a chip system applied to an electronic device. The chip system includes one or more processors, which are used to invoke computer instructions to cause the electronic device to execute the human image super-resolution method provided in the first aspect above.
[0041] In some alternative embodiments, the chip system further includes a memory connected to one or more processors via circuits or wires.
[0042] In some alternative embodiments, the chip system also includes a communication interface.
[0043] In a fifth aspect, embodiments of this application provide a computer-readable storage medium including instructions that, when executed on an electronic device, cause the electronic device to perform the human image super-resolution method as described in the first aspect above.
[0044] In a sixth aspect, embodiments of this application provide a computer program product that, when run on an electronic device, causes the electronic device to execute the human image super-resolution method provided in the first aspect above.
[0045] It is understood that the beneficial effects of the second to sixth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
Attached Figure Description
[0046] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0047] Figure 1 shows a schematic diagram of the software system of an electronic device provided in an embodiment of this application.
[0048] Figure 2 shows a flowchart of a human image super-resolution method provided in an embodiment of this application.
[0049] Figure 3 shows a flowchart of the super-resolution process in the human image super-resolution method provided in an embodiment of this application.
[0050] Figure 4 shows a scenario of sliding window block processing in the human image super-resolution method provided in an embodiment of this application.
[0051] Figure 5 shows a scenario flow diagram of the human image super-resolution method provided in an embodiment of this application.
[0052] Figure 6 shows a scenario flowchart of super-resolution processing in the human image super-resolution method provided in an embodiment of this application.
[0053] Figure 7 shows a flowchart of a super-resolution model construction method provided in an embodiment of this application.
[0054] Figure 8 shows a scenario diagram of the diffusion model in the human image super-resolution method provided in an embodiment of this application.
[0055] Figure 9 shows a structural schematic diagram of a super-resolution diffusion model in the human image super-resolution method provided in an embodiment of this application.
[0056] Figure 10 shows a schematic diagram of a super-resolution diffusion model used for inference in the human image super-resolution method provided in an embodiment of this application.
[0057] Figure 11 shows a structural block diagram of a human image super-resolution device provided in an embodiment of this application.
[0058] Figure 12 shows a schematic diagram of a hardware structure of an electronic device provided in an embodiment of this application.
[0059] Figure 13 shows a functional block diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0060] The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain this application, and should not be construed as limiting this application.
[0061] The following disclosure provides many different implementations or examples for carrying out different structures of this application. To simplify the disclosure, specific examples of components and arrangements are described below. Of course, these are merely examples and are not intended to limit the scope of this application. Furthermore, reference numerals and / or reference letters may be repeated in different examples; such repetition is for simplification and clarity and does not in itself indicate a relationship between the various implementations and / or arrangements discussed.
[0062] Super-resolution refers to taking a low-resolution image as input and processing it to obtain a high-resolution image with the same content.
[0063] Currently, users require a higher resolution for the face portion of a human image than for the human image as a whole. When performing super-resolution on a human image, the face image is typically super-resolved separately to obtain an initial super-resolution face image, and the human image is super-resolved separately to obtain an initial super-resolution human image. The initial super-resolution face image is then pasted over the face region image in the initial super-resolution human image, replacing it, to obtain the target super-resolution human image.
[0064] However, due to the significant difference in resolution between the initial super-resolution face image and the initial super-resolution human body image, there is an unnatural transition at the boundary between the face region image and the human body region image in the target super-resolution human body image.
[0065] To address the aforementioned problems, the human image super-resolution method, super-resolution device, electronic device, and storage medium provided in this application acquire an initial human image and perform a first super-resolution process on the initial human image using a super-resolution model to obtain a first super-resolution human image. The super-resolution model includes a face super-resolution diffusion model and a human super-resolution diffusion model. The first super-resolution process includes acquiring an initial face feature map and an initial human feature map based on the initial human image, inputting the initial face feature map into the face super-resolution diffusion model to obtain a first super-resolution face image, and inputting the initial human feature map into the human super-resolution diffusion model to obtain an initial super-resolution human image. The resolution of the first super-resolution face image is higher than that of the initial super-resolution human image. The first super-resolution face image and the initial super-resolution human image are then fused to obtain the first super-resolution human image. The first super-resolution human image obtained after the first super-resolution process is used as the initial human image in the second super-resolution process, and the first super-resolution process is repeated until the Tth super-resolution process is completed, resulting in the first super-resolution human image of the Tth super-resolution process. In this way, during human image super-resolution, the face noise of the initial human image is learned based on the face super-resolution diffusion model, and the human noise of the initial human image is learned based on the human super-resolution diffusion model. The face noise and human noise are then fused to obtain the first super-resolution human image, which is then iteratively super-resolved. The face super-resolution diffusion model iteratively learns the face noise in the previous face super-resolution image, and the human super-resolution diffusion model iteratively learns the human noise in the previous human super-resolution image. Ultimately, this makes the transition between the face region image and the human region image in the super-resolution human image obtained by multiple super-resolution iterations more natural.
[0066] The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings.
[0067] The human image super-resolution method provided in this application can be applied to electronic devices. Electronic devices may include various terminal devices, which may also be called terminals, user equipment (UE), mobile stations (MS), mobile terminals (MT), etc.
[0068] Terminal devices can include mobile phones, robot vacuum cleaners, drones, smart TVs, wearable devices, personal digital assistants (PDAs), computers with wireless transceiver capabilities, virtual reality (VR) terminal devices, augmented reality (AR) terminal devices, wireless terminals in industrial control, wireless terminals in self-driving, wireless terminals in remote medical surgery, wireless terminals in smart grids, wireless terminals in transportation safety, wireless terminals in smart cities, and wireless terminals in smart homes. The type of terminal device is not limited here; it can be configured according to actual needs.
[0069] The software system of an electronic device can adopt a layered architecture, event-driven architecture, microkernel architecture, microservice architecture, or cloud architecture. This application uses the layered architecture Android system as an example to illustrate the software structure of an electronic device.
[0070] Please refer to Figure 1, which shows a schematic diagram of the software system of an electronic device according to an embodiment of this application. The software system is divided into several layers, each with a clear role and division of labor, and the layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, from top to bottom: the application layer, the application framework layer, the system library, and the kernel layer.
[0071] The application layer can include a range of applications, such as super-resolution applications, camera applications, gallery applications, calling applications, wireless local area network (WLAN) applications, video applications, media provider applications, filesystem in userspace (FUSE) applications, etc.
[0072] Among them, the super-resolution application can be used to acquire human body images and perform super-resolution processing on the human body images to obtain super-resolution human body images.
[0073] Media providers are used to create or access multimedia files within FUSE. Applications in the application layer can create or access multimedia files within FUSE through a MediaProvider.
[0074] FUSE is used to store multimedia files created by media providers. Of course, in other embodiments, FUSE can also be used to store other data.
[0075] The application framework layer provides application programming interfaces (APIs) and a programming framework for applications in the application layer. The application framework layer includes predefined functions. For example, it may include a window manager, content provider, resource manager, view system, package manager service (PMS), and activity manager service (AMS).
[0076] The window manager is used to manage windowed applications. It can retrieve screen size, determine the presence of a status bar, lock the screen, and capture screenshots, among other things.
[0077] Content providers store and retrieve data, making that data accessible to applications. This data can include videos, images, audio, phone calls made and received, browsing history and bookmarks, phone books, and more.
[0078] A view system includes visual controls, such as controls for displaying text and controls for displaying images. View systems can be used to build applications. A display interface can consist of one or more views. For example, a display interface including a text notification icon could include views for displaying text and views for displaying images.
[0079] The resource manager provides applications with various resources, such as localized strings, icons, images, layout files, video files, and more.
[0080] The package manager service (PMS) is primarily responsible for installing, managing, and uninstalling applications on Android devices. It scans specified directories in the system to find files ending in .apk, parses these files to obtain all application information, and stores it in packages.xml.
[0081] When a new application is installed, the package management service identifies all components of the application (such as Activities, Services, and Broadcast Receivers) and assigns appropriate permissions to these components. Simultaneously, the package management service monitors the status of installed applications to ensure their integrity and security.
[0082] The package management service also manages device-encrypted (DE) and credential-encrypted (CE) data. The key for DE data is only available after the system has completed a verified boot. The CE directory encrypts data using a key associated with user authentication (e.g., pattern, password), which is only available after the user has authenticated.
[0083] The application's CE directory may include the application's original UID. The package management service is used to manage the DE and CE data of the super-resolution application in this embodiment.
[0084] The activity manager service (AMS) is primarily responsible for managing and tracking the activity tasks and lifecycle of all applications. When an application is opened, the activity manager service starts the application's process and allocates processor resources and memory to it. When the application is no longer in the foreground or background, or when system memory is insufficient, the activity manager service terminates or kills the application's process.
[0085] For example, the activity management service can be responsible for managing and tracking the activity tasks and lifecycle of a super-resolution application. When a super-resolution application is opened, the activity management service starts the application's process and allocates processor resources and memory to it for super-resolution processing of the acquired human images. When the super-resolution application is no longer in the foreground or background, or when system memory is insufficient, the activity management service will terminate or kill the application's process.
[0086] System libraries may include the Surface Manager, Media Libraries, the Android Runtime, etc.
[0087] The Android runtime consists of the core libraries and the virtual machine. The Android runtime is responsible for scheduling and managing the Android system. The core libraries comprise two parts: one part contains the functionalities that Java needs to call, and the other part consists of the Android core libraries. The application layer and application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
[0088] The Surface Manager is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.
[0089] The media library supports playback and recording of various common audio and video formats, as well as still image files. It supports multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, and PNG.
[0090] The kernel layer can include modules such as camera driver, display driver, Wi-Fi driver, Bluetooth driver, and audio driver.
[0091] It is understood that the layers in the software structure illustrated in Figure 1, and the components contained in each layer, do not constitute a specific limitation on the electronic device. In other embodiments of this application, the electronic device may include more or fewer layers than illustrated, and each layer may include more or fewer components; this application does not impose any limitations.
[0092] It should be noted that although the embodiments in this application use the Android system as an example for illustration, the basic principles are equally applicable to electronic devices based on other operating systems, such as Harmony.
[0093] Please refer to Figure 2, which shows a flowchart of a human image super-resolution method according to an embodiment of this application. In a specific embodiment, the human image super-resolution method can be applied to an electronic device. The process shown in Figure 2 is described in detail below, taking the electronic device as an example. The human image super-resolution method may include the following steps S110 to S130.
[0094] Step S110: Obtain the initial human body image.
[0095] In this embodiment of the application, when a user needs to perform super-resolution of a human image, they can send a super-resolution command to an electronic device, which receives and responds to the super-resolution command to obtain an initial human image.
[0096] The human body image is an image of the entire human body and can include a face image and a torso image. The face image can include a head image and a non-head image. The torso image is the part of the human body image other than the face image.
[0097] In some implementations, when a user needs to perform super-resolution of a human image, a super-resolution command can be sent to an electronic device. The electronic device receives and responds to the super-resolution command, acquires an initial scene image containing an initial human image, performs human detection on the initial scene image to obtain a human bounding box, and crops the initial scene image based on the human bounding box to obtain the initial human image. This achieves cropping the initial human image from the initial scene image based on the human bounding box obtained by human detection, thereby improving the accuracy of acquiring the initial human image.
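The cropping step described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: the image is modeled as a list of rows of grayscale values, the bounding box format (left, top, right, bottom) with exclusive right and bottom edges is an assumption, and the name `crop_to_box` is hypothetical. The human detector that produces the box is not modeled.

```python
from typing import List, Tuple

Image = List[List[int]]          # assumed representation: rows of grayscale pixel values
Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom), right/bottom exclusive


def crop_to_box(image: Image, box: Box) -> Image:
    """Return the part of `image` inside `box`, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    left, top, right, bottom = box
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("bounding box does not overlap the image")
    return [row[left:right] for row in image[top:bottom]]


# A 4x6 "scene" whose pixel value encodes its position: value = 10 * row + column.
scene = [[10 * r + c for c in range(6)] for r in range(4)]
human = crop_to_box(scene, (2, 1, 5, 3))
print(human)  # [[12, 13, 14], [22, 23, 24]]
```

Clamping the box to the image bounds is a design choice of this sketch, so that a detection that slightly overshoots the frame still yields a valid crop.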
[0098] In one implementation, the electronic device may include a camera. When a user needs to perform super-resolution of a human image, they can send a super-resolution command to the electronic device. The electronic device receives and responds to the super-resolution command, controls the camera to capture an image of the environment in which the human is located, obtains an initial scene image, performs human detection on the initial scene image to obtain a human bounding box, and crops the initial scene image based on the human bounding box to obtain an initial human image. The initial scene image contains the initial human image.
[0099] The camera can be any of the following: wide-angle camera, macro camera, ultra-wide-angle camera, or panoramic camera. The type of camera is not limited here, and the specific settings can be made according to actual needs.
[0100] In one implementation, the electronic device can connect to the camera via a network and interact with the camera through the network. When a user needs to perform super-resolution of a human image, they can send a super-resolution command to the electronic device. The electronic device receives and responds to the super-resolution command, sends a capture command to the camera via the network, and the camera receives and responds to the capture command, capturing an image of the environment in which the human is located, obtaining an initial scene image. The camera then sends the initial scene image back to the electronic device via the network. The electronic device receives the initial scene image returned by the camera, performs human detection on the initial scene image, obtains a human bounding box, and crops the initial scene image based on the human bounding box to obtain an initial human image.
[0101] The network can be any of the following: ZigBee network, Bluetooth (BT) network, Wireless Fidelity (Wi-Fi) network, Thread network, Long Range Radio (LoRa) network, Low-Power Wide-Area Network (LPWAN), Infrared network, Narrow Band Internet of Things (NB-IoT), Controller Area Network (CAN), Digital Living Network Alliance (DLNA) network, Wide Area Network (WAN), Local Area Network (LAN), Metropolitan Area Network (MAN), or Wireless Personal Area Network (WPAN). The type of network is not limited here; it can be configured according to actual needs.
[0102] In some implementations, the electronic device pre-stores an initial human body image. When a user needs to perform super-resolution of the human body image, they can send a super-resolution command to the electronic device, which receives and responds to the command, reading the pre-stored initial human body image.
[0103] In some implementations, the server pre-stores initial human images, connects to electronic devices via a network, and interacts with the electronic devices via the network.
[0104] When a user needs to perform super-resolution of a human body image, they can send a super-resolution command to the electronic device. The electronic device receives and responds to the super-resolution command, and sends an acquisition command to the server via the network. The server receives and responds to the acquisition command, and sends the pre-stored initial human body image to the electronic device via the network. The electronic device then receives the initial human body image returned by the server.
[0105] The server can be a standalone physical server, a server cluster or distributed system consisting of multiple physical servers, or any of the following: cloud servers that provide basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), big data or artificial intelligence platforms. The type of server is not limited here, and the specific configuration can be based on actual needs.
[0106] In some implementations, when a user needs to perform super-resolution of a human image, a super-resolution command can be sent to an electronic device. The electronic device receives and responds to the super-resolution command, generates upload prompt information, and receives the initial human image uploaded by the user according to the upload prompt information.
[0107] The upload prompt message can be used to guide the user to upload an initial human image to the electronic device according to the prompt message. The upload prompt message can be at least one of the following: sound prompt, text prompt, or light prompt, etc. The type of upload prompt message is not limited here and can be set according to actual needs.
[0108] In some implementations, when a user needs to perform super-resolution of a human image, a super-resolution instruction carrying an initial human image can be sent to an electronic device. The electronic device receives and responds to the super-resolution instruction and acquires the initial human image according to the super-resolution instruction.
[0109] In some implementations, the electronic device may be equipped with an input panel. When a user needs to perform super-resolution of a human image, they can input a super-resolution command based on the input panel of the electronic device. For example, the super-resolution command can be input by handwriting on the input panel of the electronic device, or by pressing a button on the input panel of the electronic device. The electronic device receives the super-resolution command through the input panel.
[0110] In some implementations, the electronic device may be equipped with a voice recognition module. When a user needs to perform super-resolution of a human image, the user can send voice information within the voice acquisition range of the voice recognition module. The voice recognition module acquires the voice information sent by the user and performs voice recognition on the acquired voice information to obtain a voice recognition result. When it is determined that the voice recognition result contains keywords used to indicate super-resolution of the human image, such as the keyword "human image super-resolution", or the keywords "human image" and "super-resolution", it is determined that a super-resolution instruction has been received.
[0111] As an example, if the user sends a voice message: "Perform super-resolution on the human body image," and the speech recognition result contains the keywords "human body image" and "super-resolution," then it is determined that the super-resolution instruction has been received.
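The keyword check on the speech recognition result can be sketched as follows. The keyword groups and the function name `is_super_resolution_command` are hypothetical, and the speech recognizer itself is not modeled; the sketch assumes it has already produced a text string.

```python
from typing import Sequence

# Hypothetical keyword groups: a command is recognized when every keyword
# of at least one group appears in the recognized text.
KEYWORD_GROUPS: Sequence[Sequence[str]] = (
    ("human body image super-resolution",),
    ("human body image", "super-resolution"),
)


def is_super_resolution_command(
    recognized_text: str,
    groups: Sequence[Sequence[str]] = KEYWORD_GROUPS,
) -> bool:
    """Return True if the text contains all keywords of any one group."""
    normalized = recognized_text.lower()
    return any(all(keyword in normalized for keyword in group) for group in groups)


print(is_super_resolution_command("Perform super-resolution on the human body image"))  # True
print(is_super_resolution_command("Open the gallery"))                                  # False
```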
[0112] In some implementations, the client connects to the electronic device via a network and interacts with the electronic device via the network.
[0113] When a user needs to perform super-resolution of a human image, they can send a super-resolution command to the client. The client receives and responds to the super-resolution command, and forwards the super-resolution command to the electronic device via the network. The electronic device then receives the super-resolution command forwarded by the client.
[0114] The client can be any of the following: a mobile client (e.g., a mobile phone client, a PDA client, a Tablet PC client, a laptop client, a smartwatch client, a smart bracelet client, or a wearable client) or a fixed client (e.g., a desktop computer client, a smart panel client). The type of client is not limited here and can be set according to actual needs.
[0115] Step S120: Perform the first super-resolution processing on the initial human image using a super-resolution model to obtain the first super-resolution human image.
[0116] In this embodiment of the application, after the electronic device acquires the initial human body image, it can perform a first super-resolution process on the initial human body image using a super-resolution model to obtain a first super-resolution human body image.
[0117] The super-resolution model can include a face super-resolution diffusion model and a body super-resolution diffusion model. The face super-resolution diffusion model can be used for super-resolution processing of face images, and the body super-resolution diffusion model can be used for super-resolution processing of body images. As shown in Figure 3, the first super-resolution process may include steps S121 to S124.
[0118] Step S121: Obtain the initial face feature map and the initial human feature map based on the initial human body image.
[0119] In this embodiment, the electronic device can determine the initial human body image as the initial human body feature map, and crop the initial human body feature map according to the face bounding box of the initial human body feature map to obtain the initial face feature map. This realizes the cropping of the initial human body feature map based on the face bounding box. The cropped initial face feature map always contains face features, which improves the accuracy of obtaining the initial face feature map.
[0120] In some implementations, the electronic device can determine an initial human body image as an initial human body feature map, perform face detection on the initial human body image to obtain the coordinates of a first face bounding box, and determine the coordinates of a second face bounding box of the initial human body feature map based on the coordinates of the first face bounding box and the mapping relationship between the initial human body image and the initial human body feature map. Then, based on the coordinates of the second face bounding box, the face bounding box of the initial human body feature map is determined, and the initial human body feature map is cropped based on the face bounding box to obtain the initial face feature map. This achieves the determination of the face bounding box of the initial human body feature map based on the coordinates of the first face bounding box of the initial human body image and the mapping relationship between the initial human body image and the initial human body feature map, resulting in a face bounding box with high accuracy.
[0121] In some implementations, the electronic device can compress the initial human image to obtain a compressed human image, and obtain an initial facial feature map and an initial human feature map based on the compressed human image. By compressing the initial human image and performing super-resolution based on the compressed human image, the size of the data involved in the calculation during the super-resolution process is reduced, and the super-resolution efficiency of the human image is improved.
[0122] The compression process can be latent space coding, and the compressed human image can be a latent space human image.
[0123] The electronic device can determine the compressed human image as the initial human feature map, and crop the initial human feature map based on the face bounding box of the initial human feature map to obtain the initial face feature map. An initial face feature map cropped based on the face bounding box always contains facial features, which improves the accuracy of obtaining the initial face feature map.
[0124] Electronic devices can perform face detection on an initial human body image to obtain the coordinates of a first face bounding box. Based on the coordinates of the first face bounding box and the mapping relationship between the initial human body image and the compressed human body image, the coordinates of a second face bounding box in the initial human body feature map are determined. Based on the coordinates of the second face bounding box, the face bounding box of the initial human body feature map is determined. This achieves the determination of the face bounding box of the initial human body feature map based on the coordinates of the first face bounding box in the initial human body image and the mapping relationship between the initial human body image and the compressed human body image. The obtained face bounding box has high accuracy.
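The mapping from the first face bounding box (image coordinates) to the second face bounding box (feature map coordinates) can be sketched as follows. The sketch assumes that the mapping relationship is a uniform scaling between the two sizes, which this application does not state explicitly; the function name `map_box_to_feature_map` and the example sizes are hypothetical. Rounding the box outward is a choice of this sketch so that the mapped box still contains the whole face.

```python
import math
from typing import Tuple

Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom), right/bottom exclusive
Size = Tuple[int, int]           # (width, height)


def map_box_to_feature_map(box: Box, image_size: Size, feature_size: Size) -> Box:
    """Scale a box from image coordinates to feature map coordinates."""
    left, top, right, bottom = box
    image_w, image_h = image_size
    feature_w, feature_h = feature_size
    scale_x, scale_y = feature_w / image_w, feature_h / image_h
    return (
        max(0, math.floor(left * scale_x)),          # round left/top down
        max(0, math.floor(top * scale_y)),
        min(feature_w, math.ceil(right * scale_x)),  # round right/bottom up
        min(feature_h, math.ceil(bottom * scale_y)),
    )


# Hypothetical example: a 512x768 image encoded to a 64x96 latent (factor 8).
second_box = map_box_to_feature_map((200, 40, 330, 190), (512, 768), (64, 96))
print(second_box)  # (25, 5, 42, 24)
```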
[0125] Step S122: Input the initial face feature map into the face super-resolution diffusion model to obtain the first super-resolution face image.
[0126] In this embodiment, the electronic device can input an initial face feature map into a face super-resolution diffusion model. The face super-resolution diffusion model receives and responds to the initial face feature map, and outputs a first super-resolution face image. The resolution of the first super-resolution face image is higher than the resolution of the initial face image.
[0127] In some implementations, the initial face feature map may include multiple sub-face feature maps. The electronic device can perform sliding window segmentation on the initial face feature map to obtain the multiple sub-face feature maps, and input them one by one into the face super-resolution diffusion model. The face super-resolution diffusion model receives and responds to the multiple sub-face feature maps and outputs the first super-resolution face image. Because the initial face feature map is split into multiple sub-face feature maps that are input one by one, the model never receives a large face feature map in a single pass, which would increase the amount of computation for face super-resolution; this improves the efficiency of face super-resolution.
[0128] The sliding window segmentation process divides the image with a sliding window according to a preset pixel block size and a preset pixel overlap length. For example, the preset pixel block size could be 64*64, and the pixel overlap length could be 16; alternatively, the preset pixel block size could be 25*25, and the pixel overlap length could be 5, etc. The size of the preset pixel block and the pixel overlap length are not limited here and can be set according to actual needs. Figure 4 illustrates a scenario in which an image is processed using sliding window segmentation.
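The sliding window segmentation can be sketched as follows, using the example block size of 64 and overlap length of 16 from the text. This application does not specify how the image edges are handled; in this sketch the last window in each direction is shifted back so that it stays inside the image, which is an assumption. The function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def window_starts(length: int, block: int, overlap: int) -> List[int]:
    """Start offsets of windows of size `block` that overlap by `overlap` pixels."""
    if not 0 <= overlap < block:
        raise ValueError("overlap must be non-negative and smaller than the block size")
    if length <= block:
        return [0]
    stride = block - overlap
    starts = list(range(0, length - block + 1, stride))
    if starts[-1] + block < length:
        starts.append(length - block)  # assumption: shift the last window back to fit
    return starts


def sliding_windows(width: int, height: int, block: int, overlap: int) -> List[Box]:
    """All window boxes covering a width x height image, row by row."""
    return [
        (x, y, min(x + block, width), min(y + block, height))
        for y in window_starts(height, block, overlap)
        for x in window_starts(width, block, overlap)
    ]


windows = sliding_windows(width=160, height=128, block=64, overlap=16)
print(len(windows))   # 9
print(windows[0])     # (0, 0, 64, 64)
print(windows[-1])    # (96, 64, 160, 128)
```

With a block size of 64 and an overlap of 16 the stride is 48, so a width of 160 yields starts 0, 48 and 96, while a height of 128 yields 0, 48 and a shifted final start of 64.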
[0129] Step S123: Input the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image.
[0130] In this embodiment of the application, the electronic device can input an initial human feature map into the human super-resolution diffusion model, and the human super-resolution diffusion model receives and responds to the initial human feature map and outputs an initial super-resolution human image.
[0131] Among them, the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human body image, and the resolution of the initial super-resolution human body image is higher than the resolution of the initial human body image.
[0132] In some implementations, the initial human feature map may include multiple sub-human feature maps. The electronic device can perform sliding window segmentation on the initial human feature map to obtain the multiple sub-human feature maps, and input them one by one into the human super-resolution diffusion model. The human super-resolution diffusion model receives and responds to the multiple sub-human feature maps and outputs the initial super-resolution human image. Because the initial human feature map is split into multiple sub-human feature maps that are input one by one, the model never receives a large human feature map in a single pass, which would increase the amount of computation for human super-resolution; this improves the efficiency of human super-resolution.
[0133] Step S124: Fuse the first super-resolution face image and the initial super-resolution human body image to obtain the first super-resolution human body image.
[0134] In this embodiment of the application, the electronic device can fuse a first super-resolution face image and an initial super-resolution human body image to obtain a first super-resolution human body image, wherein the resolution of the first super-resolution human body image is higher than that of the initial human body image.
[0135] Specifically, the electronic device can select, in the initial super-resolution human body image, a second super-resolution face image corresponding to the coordinates of the second face bounding box, and replace the second super-resolution face image in the initial super-resolution human body image with the first super-resolution face image to obtain the first super-resolution human body image. In this way, the first super-resolution face image obtained by face super-resolution updates the face region image in the initial super-resolution human body image obtained by human body super-resolution. The resulting first super-resolution human body image satisfies both the face clarity requirement and the human body clarity requirement, thus improving the user experience of human body image super-resolution.
[0136] In some implementations, the first super-resolution face image may include a first super-resolution head image and a first super-resolution non-head image, and the initial super-resolution human body image may include an initial super-resolution head image and an initial super-resolution non-head image.
[0137] The electronic device can select, in the initial super-resolution human body image, the initial super-resolution head image corresponding to the coordinates of the second face bounding box, and replace the initial super-resolution head image in the initial super-resolution human body image with the first super-resolution head image to obtain the first super-resolution human body image. In this way, the first super-resolution head image obtained by face super-resolution updates the initial super-resolution head image in the initial super-resolution human body image obtained by human body super-resolution, which reduces the amount of computation in the super-resolution processing and improves the efficiency of human body image super-resolution.
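The region replacement used in the fusion step S124 can be sketched as follows. Images are modeled as lists of rows, the box format (left, top, right, bottom) with exclusive right and bottom edges is an assumption, and the name `replace_region` is hypothetical. The sketch assumes the face patch already has the same size as the box it replaces.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # assumed format: (left, top, right, bottom), right/bottom exclusive


def replace_region(body: Grid, patch: Grid, box: Box) -> Grid:
    """Return a copy of `body` in which the area inside `box` is replaced by `patch`."""
    left, top, right, bottom = box
    if len(patch) != bottom - top or any(len(row) != right - left for row in patch):
        raise ValueError("patch size must match the bounding box size")
    fused = [row[:] for row in body]  # copy so the input image is left untouched
    for dy, patch_row in enumerate(patch):
        fused[top + dy][left:right] = patch_row
    return fused


body_image = [[0.0] * 5 for _ in range(4)]   # stand-in for the body model output
face_image = [[9.0, 9.0], [9.0, 9.0]]        # stand-in for the face model output
fused_image = replace_region(body_image, face_image, (1, 1, 3, 3))
for row in fused_image:
    print(row)
# [0.0, 0.0, 0.0, 0.0, 0.0]
# [0.0, 9.0, 9.0, 0.0, 0.0]
# [0.0, 9.0, 9.0, 0.0, 0.0]
# [0.0, 0.0, 0.0, 0.0, 0.0]
```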
[0138] Step S130: Use the first super-resolution human image obtained after the first super-resolution processing as the initial human image in the second super-resolution processing, and repeat the process of the first super-resolution processing until the Tth super-resolution processing is completed, and obtain the first super-resolution human image of the Tth super-resolution processing.
[0139] It is understood that in the embodiments of this application, the electronic device can perform T super-resolution processing. The process of each super-resolution processing is similar to that of the first super-resolution processing. However, for other super-resolution processing besides the first super-resolution processing, the result obtained from the previous super-resolution processing is used as the input for the next super-resolution processing. This process is repeated iteratively until the Tth super-resolution processing is completed, and the first super-resolution human image of the Tth super-resolution processing is obtained.
[0140] In this embodiment, the electronic device can use the first super-resolution human image obtained after the first super-resolution processing as the initial human image in the second super-resolution processing, and repeat the process of the first super-resolution processing until the Tth super-resolution processing is completed, thereby obtaining the first super-resolution human image of the Tth super-resolution processing. This achieves the following: during the human image super-resolution process, the face noise of the initial human image is learned based on the face super-resolution diffusion model, and the human noise of the initial human image is learned based on the human super-resolution diffusion model. The face noise and human noise are then fused to obtain the first super-resolution human image. The first super-resolution human image is then iteratively subjected to super-resolution processing. The face super-resolution diffusion model iteratively learns the face noise in the previous face super-resolution image, and the human super-resolution diffusion model iteratively learns the human noise in the previous human super-resolution image. Ultimately, this makes the transition between the face region image and the human region image in the super-resolution human image obtained by multiple super-resolution iterations more natural.
[0141] Where T is a positive integer greater than or equal to 2.
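The iteration over steps S121 to S124 can be sketched as follows. The two diffusion models are replaced by hypothetical stand-in functions that only add a constant, so the sketch shows the control flow and not real super-resolution. It also assumes that each model returns an output of the same size as its input, as in the latent space variant; all names are hypothetical.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive
Model = Callable[[Grid], Grid]   # hypothetical stand-in for a diffusion model


def crop(grid: Grid, box: Box) -> Grid:
    left, top, right, bottom = box
    return [row[left:right] for row in grid[top:bottom]]


def paste(grid: Grid, patch: Grid, box: Box) -> Grid:
    left, top, right, _ = box
    fused = [row[:] for row in grid]
    for dy, patch_row in enumerate(patch):
        fused[top + dy][left:right] = patch_row
    return fused


def iterative_super_resolution(
    initial: Grid, face_box: Box, face_model: Model, body_model: Model, rounds: int
) -> Grid:
    """Run the face/body/fuse cycle `rounds` (T) times, feeding each result forward."""
    if rounds < 2:
        raise ValueError("T must be an integer greater than or equal to 2")
    current = initial
    for _ in range(rounds):
        face_out = face_model(crop(current, face_box))  # S121 and S122
        body_out = body_model(current)                  # S123
        current = paste(body_out, face_out, face_box)   # S124
    return current


def add_to_all(amount: float) -> Model:
    """Toy model: adds a constant to every value (not a real diffusion model)."""
    return lambda grid: [[value + amount for value in row] for row in grid]


start = [[0.0] * 4 for _ in range(3)]
result = iterative_super_resolution(start, (1, 0, 3, 2), add_to_all(2.0), add_to_all(1.0), rounds=3)
for row in result:
    print(row)
# [3.0, 6.0, 6.0, 3.0]
# [3.0, 6.0, 6.0, 3.0]
# [3.0, 3.0, 3.0, 3.0]
```

In the toy run the face region accumulates the face model's contribution in every round because the fused result of one round is the input of the next.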
[0142] In some implementations, the electronic device can decode the first super-resolution human image of the Tth super-resolution process to obtain a target super-resolution human image. The resolution of the target super-resolution human image is higher than that of the initial human image. By decoding the first super-resolution human image of the Tth super-resolution process, the obtained target super-resolution human image has the same dimensions as the initial human image, thereby improving the image quality of the target super-resolution human image.
[0143] In some implementations, the electronic device can segment the first super-resolution human image of the Tth super-resolution process to obtain a human mask image, and then paste the human mask image back to the position corresponding to the initial human image in the initial scene image to obtain the target scene image. This achieves super-resolution processing only on the initial human image in the initial scene image, avoiding the low super-resolution efficiency caused by super-resolution processing of the entire initial scene image, and improving the super-resolution efficiency of super-resolution of human images.
[0144] In some implementations, the electronic device can perform erosion processing on the boundary of the human mask image in the target scene image. By performing erosion processing on the boundary of the super-resolution human image in the target scene image, the unnatural transition between the super-resolution human image and the initial scene image can be reduced, and the fusion degree between the super-resolution human image and the initial scene image can be improved.
[0145] In some implementations, the electronic device can perform Gaussian blurring on the boundary of the human mask image in the target scene image. By performing Gaussian blurring on the boundary of the super-resolution human image in the target scene image, the unnatural transition between the super-resolution human image and the initial scene image can be reduced, and the fusion degree between the super-resolution human image and the initial scene image can be improved.
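The boundary treatment described above can be sketched as follows. This application presents erosion and Gaussian blurring as separate options; the sketch applies both in sequence purely for illustration. The 3x3 erosion window, the separable [1, 2, 1] / 4 kernel (a small approximation of a Gaussian), the alpha compositing formula, and all names are assumptions of this sketch.

```python
from typing import List

Grid = List[List[float]]


def erode(mask: Grid) -> Grid:
    """3x3 binary erosion; pixels outside the image are treated as 0."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [mask[ny][nx] for ny in (y - 1, y, y + 1) for nx in (x - 1, x, x + 1)]
            out[y][x] = 1.0 if min(window) >= 1.0 else 0.0
    return out


def blur(mask: Grid) -> Grid:
    """Separable [1, 2, 1] / 4 smoothing with replicated edges."""
    def smooth_rows(grid: Grid) -> Grid:
        w = len(grid[0])
        return [
            [(row[max(x - 1, 0)] + 2 * row[x] + row[min(x + 1, w - 1)]) / 4 for x in range(w)]
            for row in grid
        ]

    def transpose(grid: Grid) -> Grid:
        return [list(column) for column in zip(*grid)]

    return transpose(smooth_rows(transpose(smooth_rows(mask))))


def composite(scene: Grid, human: Grid, mask: Grid) -> Grid:
    """Blend `human` over `scene` using the eroded and blurred mask as alpha."""
    alpha = blur(erode(mask))
    return [
        [a * h + (1.0 - a) * s for a, h, s in zip(alpha_row, human_row, scene_row)]
        for alpha_row, human_row, scene_row in zip(alpha, human, scene)
    ]


scene = [[0.0] * 5 for _ in range(5)]      # stand-in for the initial scene image
human = [[100.0] * 5 for _ in range(5)]    # stand-in for the super-resolution human image
mask = [[1.0 if 1 <= y <= 3 and 1 <= x <= 3 else 0.0 for x in range(5)] for y in range(5)]
blended = composite(scene, human, mask)
print(blended[2][2], blended[1][2], blended[0][0])  # 25.0 12.5 0.0
```

The soft alpha values fall off from the center of the mask toward its boundary, which is what smooths the transition between the pasted human region and the surrounding scene.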
[0146] In one application scenario, as shown in Figure 5, the human image super-resolution method may include steps S201 to S210.
[0147] Step S201: Obtain the initial scene image.
[0148] Step S202: Perform human detection and face detection on the initial scene image to obtain human bounding boxes and face bounding boxes.
[0149] Step S203: Obtain the latent space feature map based on the human body bounding box and the initial scene image.
[0150] Specifically, the initial scene image is cropped based on the human body bounding box to obtain the initial human body image, and then input into the encoder ε for encoding to obtain the latent space feature map.
[0151] Step S204: Perform sliding window segmentation on the latent space feature map to obtain multiple sub-latent space feature maps.
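By way of a non-limiting illustration, the sliding window segmentation of step S204 can be sketched as follows. The window size, the stride, and the function names `window_starts` and `sliding_windows` are assumptions introduced only for this example; the description does not specify them.

```python
from typing import List, Tuple


def window_starts(length: int, window: int, stride: int) -> List[int]:
    """Start offsets of windows of size `window` that together cover [0, length)."""
    if window >= length:
        return [0]
    starts = list(range(0, length - window + 1, stride))
    # Add a final window flush with the edge if the stride left a gap.
    if starts[-1] + window < length:
        starts.append(length - window)
    return starts


def sliding_windows(height: int, width: int, window: int,
                    stride: int) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes tiling a height x width feature map."""
    boxes = []
    for top in window_starts(height, window, stride):
        for left in window_starts(width, window, stride):
            boxes.append((top, left,
                          min(top + window, height),
                          min(left + window, width)))
    return boxes


# Example: a 96 x 64 latent map split into 64 x 64 windows with stride 32.
boxes = sliding_windows(96, 64, window=64, stride=32)
```

With a stride smaller than the window, neighbouring sub-latent space feature maps overlap, which is one common way to reduce visible seams between blocks.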
[0152] Step S205: Obtain human super-resolution image noise $\epsilon_g$ and face super-resolution image noise $\epsilon_f$ based on the multiple sub-latent space feature maps.
[0153] Specifically, the multiple sub-latent space feature maps are input into the human super-resolution diffusion model to obtain the human super-resolution image noise $\epsilon_g$. Among the multiple sub-latent space feature maps, those involving the face region are input into the face super-resolution diffusion model to obtain the face super-resolution image noise $\epsilon_f$.
[0154] Step S206: Fuse the human super-resolution image noise $\epsilon_g$ and the face super-resolution image noise $\epsilon_f$ to obtain the super-resolution human noise $\epsilon$.
[0155] Specifically, based on the mapping coordinates of the face bounding box in the latent space, the face portion of the human super-resolution image noise $\epsilon_g$ is replaced with the face super-resolution image noise $\epsilon_f$ to obtain the super-resolution human noise $\epsilon$.
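As a minimal sketch of the fusion in step S206, the example below maps a face bounding box from pixel coordinates into latent coordinates and overwrites the corresponding region of the human noise with the face noise. It assumes a single-channel noise map stored as nested lists, a downsampling factor of 8, and the hypothetical helper names `map_box_to_latent` and `fuse_noise`; none of these are prescribed by the description.

```python
import math
from typing import List, Tuple

# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]
Grid = List[List[float]]


def map_box_to_latent(box: Box, factor: int = 8) -> Box:
    """Map a pixel-space box to latent coordinates for a given downsampling factor."""
    top, left, bottom, right = box
    return (top // factor, left // factor,
            math.ceil(bottom / factor), math.ceil(right / factor))


def fuse_noise(body_noise: Grid, face_noise: Grid, face_box: Box) -> Grid:
    """Return a copy of body_noise whose face box is replaced by face_noise."""
    top, left, bottom, right = face_box
    if len(face_noise) != bottom - top or any(
            len(row) != right - left for row in face_noise):
        raise ValueError("face_noise must match the size of the face box")
    fused = [row[:] for row in body_noise]  # leave the input untouched
    for r in range(top, bottom):
        fused[r][left:right] = face_noise[r - top]
    return fused


# Example: 4 x 6 human noise, 2 x 2 face noise, face box given in pixels.
body = [[0.0] * 6 for _ in range(4)]
face = [[1.0, 1.0], [1.0, 1.0]]
latent_box = map_box_to_latent((8, 16, 24, 32))
fused = fuse_noise(body, face, latent_box)
```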
[0156] Step S207: Repeat steps S204 to S206 T times to obtain the target super-resolution human body noise z0.
[0157] Specifically, from the second time to the Tth time, the super-resolution human noise $\epsilon$ output by the previous iteration is used as the input for the next sliding window block processing, until the target super-resolution human noise z0 is obtained.
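The iteration of step S207 can be sketched as a simple loop in which the output of one pass becomes the input of the next. The function `run_iterations`, the descending time index passed to each pass, and the stand-in `toy_step` are assumptions made for illustration; in practice each pass would perform steps S204 to S206.

```python
from typing import Callable, List

Latent = List[List[float]]


def run_iterations(z_init: Latent,
                   step: Callable[[Latent, int], Latent],
                   total_steps: int) -> Latent:
    """Feed the output of each pass back in as the input of the next pass."""
    z = z_init
    for t in range(total_steps, 0, -1):  # t = T, T-1, ..., 1 (assumed ordering)
        z = step(z, t)
    return z


visited = []


def toy_step(z: Latent, t: int) -> Latent:
    """Hypothetical stand-in for one pass of S204 to S206: halves every value."""
    visited.append(t)
    return [[v / 2 for v in row] for row in z]


z0 = run_iterations([[8.0, 4.0]], toy_step, total_steps=3)
```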
[0158] Step S208: Decode the target super-resolution human noise z0 to obtain the target super-resolution human image.
[0159] Specifically, the target super-resolution human noise z0 is input to the decoder D. The decoder D receives and responds to the target super-resolution human noise z0, decodes the target super-resolution human noise z0, and obtains the target super-resolution human image.
[0160] Step S209: Fuse the target super-resolution human image and the initial scene image to obtain the current scene image.
[0161] Specifically, the target super-resolution human image is segmented to obtain a human mask image, and the human mask image is then pasted back onto the position corresponding to the initial human image in the initial scene image to obtain the current scene image.
[0162] Step S210: Perform erosion and Gaussian blur processing on the boundary of the human mask image in the current scene image to obtain the target scene image.
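One possible reading of steps S209 and S210 is sketched below: the human mask is eroded, its edge is softened with a Gaussian blur, and the softened mask is used as a blending weight when pasting the super-resolved human image into the scene. Treating the blurred mask as a blending weight, the 3 x 3 erosion, the blur parameters, and all function names are assumptions of this sketch rather than details given in the description. Images are single-channel nested lists of equal size.

```python
import math
from typing import List

Image = List[List[float]]


def erode(mask: Image) -> Image:
    """3 x 3 binary erosion; pixels outside the image count as background."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            keep = all(
                0 <= r + dr < h and 0 <= c + dc < w
                and mask[r + dr][c + dc] >= 0.5
                for dr in (-1, 0, 1) for dc in (-1, 0, 1))
            out[r][c] = 1.0 if keep else 0.0
    return out


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Normalized 1-D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(img: Image, kernel: List[float]) -> Image:
    """Convolve every row with the kernel, clamping indices at the border."""
    radius = len(kernel) // 2
    w = len(img[0])
    return [[sum(k * row[min(max(c + i - radius, 0), w - 1)]
                 for i, k in enumerate(kernel))
             for c in range(w)] for row in img]


def transpose(img: Image) -> Image:
    return [list(col) for col in zip(*img)]


def gaussian_blur(img: Image, sigma: float = 1.0, radius: int = 2) -> Image:
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))


def paste_back(scene: Image, human: Image, mask: Image) -> Image:
    """Blend human into scene using the eroded, blurred mask as the weight."""
    alpha = gaussian_blur(erode(mask))
    return [[a * hu + (1.0 - a) * sc for a, hu, sc in zip(ar, hr, sr)]
            for ar, hr, sr in zip(alpha, human, scene)]


size = 7
scene = [[0.0] * size for _ in range(size)]
human = [[1.0] * size for _ in range(size)]
mask = [[1.0 if 1 <= r <= 5 and 1 <= c <= 5 else 0.0 for c in range(size)]
        for r in range(size)]
result = paste_back(scene, human, mask)
```

In this toy case the blended value falls off smoothly from the centre of the mask towards its edge instead of jumping from 1 to 0.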
[0163] It should be noted that, as shown in Figure 6, when the initial scene image includes multiple human body images, steps S201 to S210 are performed on each human body image to obtain the target scene image.
[0164] Both ends of the diffusion model and the control network model are connected to an autoencoder ε. The autoencoder ε is a variational autoencoder, which downsamples the image through an encoding factor f, where f is a user-preset encoding factor value, preferably f = 8.
[0165] Furthermore, both the diffusion model and the control network model incorporate text input, encoding the user-input text information into a text embedding using a text encoder. This text encoder can be a text encoder from Contrastive Language-Image Pretraining (CLIP). The diffusion model uses a U-Net structure, and the U-Net structure outputs a noise estimate.
[0166] This embodiment provides a solution that acquires an initial human body image and performs a first super-resolution processing on the initial human body image using a super-resolution model to obtain a first super-resolution human body image. The super-resolution model includes a face super-resolution diffusion model and a human body super-resolution diffusion model. The first super-resolution processing involves acquiring an initial face feature map and an initial human body feature map from the initial human body image, inputting the initial face feature map into the face super-resolution diffusion model to obtain a first super-resolution face image, and inputting the initial human body feature map into the human body super-resolution diffusion model to obtain an initial super-resolution human body image. The resolution of the first super-resolution face image is higher than that of the initial super-resolution human body image. The first super-resolution face image and the initial super-resolution human body image are then fused to obtain the first super-resolution human body image. The first super-resolution human body image obtained after the first super-resolution processing is used as the initial human body image in the second super-resolution processing, and the first super-resolution processing is repeated until the Tth super-resolution processing is completed, yielding the first super-resolution human body image of the Tth super-resolution processing. In this way, during human body image super-resolution, the face noise of the initial human body image is learned by the face super-resolution diffusion model, the human body noise of the initial human body image is learned by the human body super-resolution diffusion model, and the two are fused to obtain the first super-resolution human body image. The first super-resolution human body image is then super-resolved iteratively: the face super-resolution diffusion model iteratively learns the face noise in the previous super-resolution face image, and the human body super-resolution diffusion model iteratively learns the human body noise in the previous super-resolution human body image. Ultimately, this makes the transition between the face region image and the human body region image in the super-resolution human body image obtained through multiple super-resolution iterations more natural.
[0167] Figure 7 shows a flowchart of a super-resolution model construction method provided in one embodiment of this application. In a specific embodiment, the super-resolution model construction method can be applied to electronic devices. The process shown in Figure 7 is described in detail below, taking an electronic device as an example. The super-resolution model construction method may include the following steps S310 to S330.
[0168] Step S310: Construct a super-resolution diffusion model based on the diffusion model and the control network model.
[0169] In this embodiment, the electronic device can integrate the diffusion model and the control network model to obtain a super-resolution diffusion model.
[0170] Among them, the diffusion model is a generative model that can learn to approximate an unknown data distribution given independent and identically distributed samples drawn from that distribution.
[0171] The diffusion model mainly includes two processes: noise addition and noise removal. As shown in Figure 8, the noise-adding process refers to the process of gradually adding Gaussian noise to the real images in the dataset; the noise-removing process refers to the process of gradually removing noise from the noisy images to restore the real images.
[0172] As an example, given a data point x0 with a probability distribution q(x0), the noise-adding process gradually destroys the data structure of data point x0 by repeatedly applying the following Markov diffusion kernel.
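[0173] The Markov diffusion kernel is $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I\right)$, where $t \in \{1, 2, \ldots, T\}$ and $\beta_t$ is a predefined or learned noise variance schedule. With a properly designed $\beta_t$, $q(x_t)$ is theoretically guaranteed to converge to a unit spherical Gaussian distribution.
[0174] The marginal distribution at any time t has the following analytical form: $q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t) I\right)$, where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
[0175] The goal of the denoising process is to learn a transfer kernel from $x_t$ to $x_{t-1}$, which is defined by the following Gaussian distribution:
[0176] $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t)\right)$, where θ denotes the learnable parameters.
[0177] Using such a learned transfer kernel, the data distribution q(x0) can be approximated by the following marginal distribution: $p_\theta(x_0) = \int p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)\, dx_{1:T}$, where $p(x_T) = \mathcal{N}(x_T; 0, I)$.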
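The convergence of the noise-adding process can be checked numerically. The sketch below assumes the commonly used Gaussian kernel with mean $\sqrt{1-\beta_t}\,x_{t-1}$ and variance $\beta_t$, and a linear variance schedule from 1e-4 to 0.02 over 1000 steps; the schedule values and function names are assumptions chosen for illustration. It applies the kernel step by step, tracking the coefficient of x0 in the mean and the accumulated variance, and compares them with the closed-form values $\sqrt{\bar{\alpha}_t}$ and $1-\bar{\alpha}_t$.

```python
import math
from typing import List, Tuple


def linear_betas(num_steps: int, beta_start: float = 1e-4,
                 beta_end: float = 0.02) -> List[float]:
    """Linearly spaced noise variances beta_1 ... beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas: List[float]) -> List[float]:
    """Cumulative products of (1 - beta_s) for s = 1 ... t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def iterate_kernel(betas: List[float]) -> Tuple[float, float]:
    """Apply the one-step kernel repeatedly.

    Returns the coefficient of x_0 in the mean of x_T and the variance of x_T
    given x_0.
    """
    coeff, var = 1.0, 0.0
    for beta in betas:
        coeff *= math.sqrt(1.0 - beta)
        var = (1.0 - beta) * var + beta
    return coeff, var


betas = linear_betas(1000)
abar_T = alpha_bars(betas)[-1]
coeff_T, var_T = iterate_kernel(betas)
```

Under this schedule the coefficient of x0 is close to zero and the variance is close to one after T steps, which is consistent with the noised sample approaching a unit Gaussian.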
[0178] In one application scenario, Figure 9 shows a structural schematic of a super-resolution diffusion model, which includes a diffusion model and a control network model. The diffusion model is a stable diffusion model, comprising a first autoencoder (Auto Encoder_1), a Gaussian noise adder, a convolutional layer, a text encoder, a time encoder, multiple first encoder blocks (SD Encoder Blocks), a first intermediate block (SD Middle Block), multiple decoder blocks (SD Decoder Blocks), and an auto decoder. The control network model includes a second autoencoder (Auto Encoder_2), multiple zero convolutional layers (Zero Convolution), multiple second encoder blocks, and a second intermediate block.
[0179] The multiple first encoder blocks (SD Encoder Blocks), the first intermediate block (SD Middle Block), and the multiple decoder blocks (SD Decoder Blocks) form the U-Net model. Each first encoder block can include multiple residual networks (ResNet).
[0180] During the training process of the super-resolution diffusion model, Auto Encoder_1 is used to encode the input high-resolution image HRImage, Text Encoder is used to encode the input prompt word Prompt, Time Encoder is used to encode the time, and the second autoencoder Auto Encoder_2 is used to encode the input low-resolution image LRImage.
[0181] During the training of the super-resolution diffusion model, the parameters of the stable diffusion model are frozen and not trained. Training mainly updates the parameters of the control network model to obtain the target super-resolution diffusion model used for inference, as shown in Figure 10.
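The idea of freezing the stable diffusion parameters while training only the control branch can be illustrated with a deliberately simplified scalar toy. The class `ToyModel`, the squared-error loss, and the learning rate are assumptions made for this example and are far simpler than an actual U-Net with zero convolution layers. The control weight starts at zero, so the initial output equals that of the frozen base alone, and only the control weight is updated.

```python
from dataclasses import dataclass


@dataclass
class ToyModel:
    base_weight: float            # stands in for the frozen stable diffusion parameters
    control_weight: float = 0.0   # stands in for a zero-initialized control branch

    def predict(self, x: float, condition: float) -> float:
        return self.base_weight * x + self.control_weight * condition

    def train_step(self, x: float, condition: float, target: float,
                   lr: float = 0.1) -> float:
        """One gradient step on 0.5 * error**2; only control_weight is updated."""
        error = self.predict(x, condition) - target
        self.control_weight -= lr * error * condition  # base_weight stays frozen
        return 0.5 * error * error


model = ToyModel(base_weight=2.0)
initial_output = model.predict(1.0, 1.0)  # control branch contributes nothing yet
losses = [model.train_step(x=1.0, condition=1.0, target=3.0) for _ in range(50)]
```

Because the base weight never changes, gradients and optimizer state are needed only for the control branch, which is the source of the reduced training cost described above.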
[0182] Step S320: Based on the first historical face feature map and the second historical face feature map, train the super-resolution diffusion model to obtain the face super-resolution diffusion model.
[0183] In this embodiment, the electronic device can acquire a first historical face feature map and a second historical face feature map, and train a super-resolution diffusion model based on the first historical face feature map and the second historical face feature map to obtain a face super-resolution diffusion model. The super-resolution diffusion model is constructed based on the diffusion model and the control network model. The control network model serves as the conditional control generation model of the super-resolution diffusion model, making the face super-resolution of face feature maps controllable, which can improve the user's experience with face super-resolution.
[0184] The first historical face feature map and the second historical face feature map contain the same face features, and the resolution of the first historical face feature map is higher than that of the second historical face feature map.
[0185] Specifically, the electronic device can acquire a first historical face feature map and a second historical face feature map, and input the first historical face feature map into the diffusion model in the super-resolution diffusion model. The diffusion model receives and responds to the first historical face feature map, and outputs first historical face features to the control network model, which receives them. The electronic device also inputs the second historical face feature map into the control network model in the super-resolution diffusion model. The control network model is then trained based on the first historical face features and the second historical face feature map to obtain the face super-resolution diffusion model. In this way, only the control network model in the super-resolution diffusion model is trained on the historical face feature maps, and the diffusion model does not need to be trained, which reduces the amount of computation during training and improves the training speed of the super-resolution diffusion model.
[0186] Step S330: Based on the first historical human body feature map and the second historical human body feature map, train the super-resolution diffusion model to obtain the human body super-resolution diffusion model.
[0187] In this embodiment, the electronic device can acquire a first historical human feature map and a second historical human feature map, and train a super-resolution diffusion model based on the first historical human feature map and the second historical human feature map to obtain a human super-resolution diffusion model. The super-resolution diffusion model is constructed based on the diffusion model and the control network model. The control network model serves as the conditional control generation model of the super-resolution diffusion model, making the human super-resolution of human feature maps controllable, which can improve the user's experience with human super-resolution.
[0188] The first historical human body feature map and the second historical human body feature map contain the same human body features, and the resolution of the first historical human body feature map is higher than that of the second historical human body feature map.
[0189] Specifically, the electronic device can acquire a first historical human feature map and a second historical human feature map, and input the first historical human feature map into the diffusion model in the super-resolution diffusion model. The diffusion model receives and responds to the first historical human feature map, and outputs first historical human features to the control network model, which receives them. The electronic device also inputs the second historical human feature map into the control network model in the super-resolution diffusion model. The control network model is then trained based on the first historical human features and the second historical human feature map to obtain the human super-resolution diffusion model. In this way, only the control network model in the super-resolution diffusion model is trained on the historical human feature maps, and the diffusion model does not need to be trained, which reduces the amount of computation during training and improves the training speed of the super-resolution diffusion model.
[0190] This embodiment provides a solution that constructs a super-resolution diffusion model based on a diffusion model and a control network model. A face super-resolution diffusion model is obtained by training the super-resolution diffusion model based on a first historical face feature map and a second historical face feature map. Similarly, a human body super-resolution diffusion model is obtained by training the super-resolution diffusion model based on a first historical human body feature map and a second historical human body feature map. This achieves the construction of a super-resolution diffusion model based on the diffusion model and the control network model. The control network model serves as the conditional control generation model for the super-resolution diffusion model, making the super-resolution process of face super-resolution on face feature maps and human body super-resolution on human body feature maps controllable, thereby improving the user's super-resolution experience.
[0191] Figure 11 shows a human image super-resolution device 600 according to an embodiment of this application. The human image super-resolution device 600 can be applied to electronic devices. The human image super-resolution device 600 shown in Figure 11 is described in detail below, taking an electronic device as an example. The human image super-resolution device 600 may include an initial image acquisition module 610, a super-resolution processing module 620, and a repeated execution module 630.
[0192] The initial image acquisition module 610 can be used to acquire an initial human body image; the super-resolution processing module 620 can be used to perform the first super-resolution processing on the initial human body image through the super-resolution model to obtain the first super-resolution human body image; the repeated execution module 630 can be used to use the first super-resolution human body image obtained after the first super-resolution processing as the initial human body image in the second super-resolution processing, and repeat the process of the first super-resolution processing until the Tth super-resolution processing is completed to obtain the first super-resolution human body image of the Tth super-resolution processing.
[0193] The super-resolution model may include a face super-resolution diffusion model and a human body super-resolution diffusion model, and the super-resolution processing module 620 may include a first acquisition unit, a first input unit, a second input unit, and a fusion unit.
[0194] The first acquisition unit can be used to acquire an initial face feature map and an initial human body feature map based on an initial human body image; the first input unit can be used to input the initial face feature map into a face super-resolution diffusion model to obtain a first super-resolution face image; the second input unit can be used to input the initial human body feature map into a human body super-resolution diffusion model to obtain an initial super-resolution human body image, wherein the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human body image; the fusion unit can be used to fuse the first super-resolution face image and the initial super-resolution human body image to obtain the first super-resolution human body image.
[0195] In some implementations, the first acquisition unit may include a compression subunit and an acquisition subunit.
[0196] The compression subunit can be used to compress the initial human image to obtain a compressed human image; the acquisition subunit can be used to acquire the initial face feature map and the initial human feature map based on the compressed human image.
[0197] In some implementations, the acquisition subunit may include a determination secondary subunit and a cropping secondary subunit.
[0198] The determination secondary subunit can be used to determine the compressed human image as the initial human feature map; the cropping secondary subunit can be used to crop the initial human feature map based on the face bounding box of the initial human feature map to obtain the initial face feature map.
[0199] In some embodiments, the human image super-resolution device 600 may further include a detection module, a first determination module, and a second determination module.
[0200] The detection module can be used to perform face detection on the initial human body image to obtain the coordinates of the first face bounding box, before the cropping secondary subunit crops the initial human body feature map based on the face bounding box of the initial human body feature map to obtain the initial face feature map. The first determination module can be used to determine the coordinates of the second face bounding box of the initial human body feature map based on the coordinates of the first face bounding box and the mapping relationship between the initial human body image and the initial human body feature map. The second determination module can be used to determine the face bounding box of the initial human body feature map based on the coordinates of the second face bounding box.
[0201] In some implementations, the fusion unit may include a selection subunit and a replacement subunit.
[0202] The selection sub-unit can be used to select the second super-resolved face image corresponding to the coordinates of the second face bounding box in the initial super-resolved human body image; the replacement sub-unit can be used to replace the second super-resolved face image with the first super-resolved face image in the initial super-resolved human body image to obtain the first super-resolved human body image.
[0203] In some implementations, the first super-resolution face image may include a first super-resolution head image and a first super-resolution non-head image, and the initial super-resolution human body image may include an initial super-resolution head image and an initial super-resolution non-head image; the replacement sub-unit may include a replacement secondary sub-unit.
[0204] The replacement secondary sub-unit can be used to replace the initial super-resolution head image in the initial super-resolution human body image with the first super-resolution head image, thus obtaining the first super-resolution human body image.
[0205] In some embodiments, the human image super-resolution device 600 may also include a decoding module.
[0206] The decoding module can be used to decode the first super-resolution human image in the Tth super-resolution process to obtain the target super-resolution human image, the resolution of which is higher than that of the initial human image.
[0207] In some implementations, the initial face feature map may include multiple sub-face feature maps, and the human image super-resolution device 600 may also include a first sliding window module.
[0208] The first sliding window module can be used to perform sliding window block processing on the initial face feature map to obtain multiple sub-face feature maps, before the first input unit inputs the initial face feature map into the face super-resolution diffusion model to obtain the first super-resolution face image.
[0209] In some implementations, the first input unit may include a first input subunit.
[0210] The first input sub-unit can be used to input multiple sub-face feature maps one by one into the face super-resolution diffusion model, so that the face super-resolution diffusion model outputs the first super-resolution face image based on the multiple sub-face feature maps.
[0211] In some implementations, the initial human feature map may include multiple sub-human feature maps, and the human image super-resolution device 600 may also include a second sliding window module.
[0212] The second sliding window module can be used to perform sliding window block processing on the initial human feature map to obtain multiple sub-human feature maps, before the second input unit inputs the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image.
[0213] In some implementations, the second input unit may include a second input subunit.
[0214] The second input sub-unit can be used to input multiple sub-human feature maps into the human super-resolution diffusion model one by one, so that the human super-resolution diffusion model outputs an initial super-resolution human image based on the multiple sub-human feature maps.
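When sub-feature maps are processed one by one, their outputs have to be recombined into a single map. The description does not state how overlapping blocks are merged; the sketch below assumes simple averaging over the overlap, and the function name `merge_tiles` and the box layout are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]
# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]


def merge_tiles(height: int, width: int,
                tiles: List[Tuple[Box, Grid]]) -> Grid:
    """Average per-tile outputs into one height x width map."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for (top, left, bottom, right), values in tiles:
        for r in range(top, bottom):
            for c in range(left, right):
                total[r][c] += values[r - top][c - left]
                count[r][c] += 1
    # Positions not covered by any tile are left at 0.0.
    return [[total[r][c] / count[r][c] if count[r][c] else 0.0
             for c in range(width)] for r in range(height)]


# Example: two 1 x 2 tiles overlapping in the middle column of a 1 x 3 map.
merged = merge_tiles(1, 3, [((0, 0, 1, 2), [[1.0, 1.0]]),
                            ((0, 1, 1, 3), [[3.0, 3.0]])])
```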
[0215] In some embodiments, the human image super-resolution device 600 may further include a construction module, a first training module, and a second training module.
[0216] The construction module can be used to construct a super-resolution diffusion model based on the diffusion model and the control network model, before the super-resolution processing module 620 performs the first super-resolution processing on the initial human image through the super-resolution model to obtain the first super-resolution human image; the first training module can be used to train the super-resolution diffusion model based on the first historical face feature map and the second historical face feature map to obtain a face super-resolution diffusion model, wherein the first historical face feature map and the second historical face feature map contain the same face features, and the resolution of the first historical face feature map is higher than that of the second historical face feature map; the second training module can be used to train the super-resolution diffusion model based on the first historical human feature map and the second historical human feature map to obtain a human super-resolution diffusion model, wherein the first historical human feature map and the second historical human feature map contain the same human features, and the resolution of the first historical human feature map is higher than that of the second historical human feature map.
[0217] In some implementations, the first training module may include a third input unit and a fourth input unit.
[0218] The third input unit can be used to input the first historical face feature map into the diffusion model in the super-resolution diffusion model, so that the diffusion model outputs the first historical face features to the control network model based on the first historical face feature map; the fourth input unit can be used to input the second historical face feature map into the control network model in the super-resolution diffusion model, so that the control network model is trained based on the first historical face features and the second historical face feature map to obtain the face super-resolution diffusion model.
[0219] In some implementations, the second training module may include a fifth input unit and a sixth input unit.
[0220] The fifth input unit can be used to input the first historical human feature map into the diffusion model in the super-resolution diffusion model, so that the diffusion model outputs first historical human features to the control network model based on the first historical human feature map; the sixth input unit can be used to input the second historical human feature map into the control network model in the super-resolution diffusion model, so that the control network model is trained according to the first historical human features and the second historical human feature map to obtain the human super-resolution diffusion model.
[0221] In some implementations, the initial image acquisition module 610 may include a second acquisition unit, a detection unit, and a cropping unit.
[0222] The second acquisition unit can be used to acquire an initial scene image containing an initial human image; the detection unit can be used to perform human detection on the initial scene image to obtain a human bounding box; the cropping unit can be used to crop the initial scene image based on the human bounding box to obtain an initial human image.
[0223] In some embodiments, the human image super-resolution device 600 may also include a segmentation module and a pasting module.
[0224] The segmentation module can be used to segment the first super-resolution human image in the Tth super-resolution process to obtain a human mask image; the pasting module can be used to paste the human mask image back to the position corresponding to the initial human image in the initial scene image to obtain the target scene image.
[0225] In some embodiments, the human image super-resolution device 600 may also include an erosion module.
[0226] The erosion module can be used to erode the boundaries of human mask images in a target scene image.
[0227] In some embodiments, the human image super-resolution device 600 may also include a blurring module.
[0228] The blur module can be used to perform Gaussian blurring on the boundaries of human mask images in a target scene image.
[0229] The solution provided in this embodiment obtains an initial human image and performs a first super-resolution process on the initial human image using a super-resolution model to obtain a first super-resolution human image. The super-resolution model includes a face super-resolution diffusion model and a human body super-resolution diffusion model. The first super-resolution process includes obtaining an initial face feature map and an initial human body feature map from the initial human image, inputting the initial face feature map into the face super-resolution diffusion model to obtain a first super-resolution face image, and inputting the initial human body feature map into the human body super-resolution diffusion model to obtain an initial super-resolution human image. The resolution of the first super-resolution face image is higher than that of the initial super-resolution human image. The first super-resolution face image and the initial super-resolution human image are then fused to obtain the first super-resolution human image. The first super-resolution human image obtained after the first super-resolution process is then used as the initial human image in the second super-resolution process, and the first super-resolution process is repeated until the Tth super-resolution process is completed, yielding the first super-resolution human image of the Tth super-resolution process. In this way, during human image super-resolution, the face noise of the initial human image is learned by the face super-resolution diffusion model, the human noise of the initial human image is learned by the human super-resolution diffusion model, and the two are fused to obtain the first super-resolution human image. The first super-resolution human image is then super-resolved iteratively: the face super-resolution diffusion model iteratively learns the face noise in the previous super-resolution face image, and the human super-resolution diffusion model iteratively learns the human noise in the previous super-resolution human image. Ultimately, this makes the transition between the face region image and the human region image in the super-resolution human image obtained through multiple super-resolution iterations more natural.
[0230] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For device embodiments, since they are basically similar to method embodiments, the description is relatively simple; relevant parts can be referred to in the description of the method embodiments. Any processing method described in the method embodiments can be implemented in the device embodiments through corresponding processing modules, and will not be elaborated upon further in the device embodiments.
[0231] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module. The integrated modules described above can be implemented in hardware or as software functional modules.
[0232] Figure 12 shows a schematic diagram of the hardware structure of an electronic device 700 provided in one embodiment of this application. As shown in Figure 12, the electronic device 700 may include a processor 710, an external memory interface 720, an internal memory 721, a Universal Serial Bus (USB) interface 730, a charging management module 740, a power management module 741, a battery 742, an antenna 1, an antenna 2, a mobile communication module 750, a wireless communication module 760, an audio module 770, a speaker 770A, a receiver 770B, a microphone 770C, a headphone jack 770D, a sensor module 780, buttons 790, a motor 791, an indicator 792, a camera 793, a display screen 794, and a Subscriber Identification Module (SIM) card interface 795, etc. The sensor module 780 may include a pressure sensor 780A, a gyroscope sensor 780B, a barometric pressure sensor 780C, a magnetic sensor 780D, an accelerometer sensor 780E, a distance sensor 780F, a proximity light sensor 780G, a fingerprint sensor 780H, a temperature sensor 780J, a touch sensor 780K, an ambient light sensor 780L, a bone conduction sensor 780M, etc.
[0233] It is understood that the structures illustrated in the embodiments of this application do not constitute a specific limitation on the electronic device 700. In other embodiments of this application, the electronic device 700 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
[0234] For example, the processor 710 shown in Figure 12 may include one or more processing units, such as an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, memory, a video codec, a digital signal processor (DSP), a baseband processor, and / or a neural network processing unit (NPU). These different processing units may be independent devices or integrated into one or more processors.
[0235] The AP can be used to control and manage super-resolution applications. For example, the AP can control the super-resolution application to perform super-resolution processing on the acquired human body images.
[0236] The controller can be the nerve center and command center of the electronic device 700. The controller can generate operation control signals according to the instruction opcode and timing signals to complete the control of instruction fetching and execution.
[0237] The processor 710 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 710 is a cache memory. This memory can store instructions or data that the processor 710 has just used or that are used repeatedly. If the processor 710 needs to use the instruction or data again, it can retrieve it directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 710, and thus improves the efficiency of the system.
[0238] In some embodiments, the processor 710 may include one or more interfaces. Interfaces may include an Inter-Integrated Circuit (I2C) interface, an Inter-Integrated Circuit Sound (I2S) interface, a Pulse Code Modulation (PCM) interface, a Universal Asynchronous Receiver / Transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI) interface, a General Purpose Input / Output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and / or a Universal Serial Bus (USB) interface, etc.
[0239] Electronic device 700 implements display functions through a GPU, a display screen 794, and an application processor. The GPU is a microprocessor for image processing, connected to the display screen 794 and the application processor. The GPU performs mathematical and geometric calculations and is used for graphics rendering. Processor 710 may include one or more GPUs, which execute program instructions to generate or modify display information.
[0240] The display screen 794 is used to display images, videos, etc. The display screen 794 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc. In some embodiments, the electronic device 700 may include one or N displays 794, where N is a positive integer greater than 1.
[0241] Electronic device 700 can implement the shooting function through the ISP, the camera 793, the video codec, the GPU, the display screen 794, and the application processor.
[0242] The ISP (Image Signal Processor) is used to process data fed back from the camera 793. For example, when taking a picture, the shutter is opened, and light is transmitted through the lens to the camera's photosensitive element. The light signal is converted into an electrical signal, and the camera's photosensitive element transmits the electrical signal to the ISP for processing, transforming it into an image visible to the naked eye. The ISP can also perform algorithmic optimization of image noise, brightness, and skin tone. The ISP can also optimize parameters such as exposure and color temperature of the shooting scene. In some embodiments, the ISP can be set in the camera 793.
[0243] Camera 793 is used to capture still images or videos. The lens generates an optical image of an object and projects it onto the photosensitive element. The photosensitive element can be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal, which is then passed to the ISP for conversion into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in standard RGB, YUV, or another format. In some embodiments, electronic device 700 may include one or N cameras 793, where N is a positive integer greater than 1.
[0244] The external memory interface 720 can be used to connect an external memory card, such as a Secure Digital (SD) card, to expand the storage capacity of the electronic device 700. The external memory card communicates with the processor 710 through the external memory interface 720 to perform data storage functions. For example, it can save captured images, videos, and other files to the external memory card.
[0245] Internal memory 721 can be used to store computer executable program code, including instructions. Processor 710 executes various functional applications and data processing of electronic device 700 by running the instructions stored in internal memory 721. Internal memory 721 may include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as audio capture, image capture, etc.). The data storage area may store data created during the use of electronic device 700 (such as audio data, image data, etc.). Furthermore, internal memory 721 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, Universal Flash Storage (UFS), etc.
[0246] Please refer to Figure 13, which illustrates a functional block diagram of an electronic device 800 according to an embodiment of this application. As shown in Figure 13, the electronic device 800 includes one or more processors 810 (only one processor is shown in Figure 13) and a memory 820, which is coupled to the one or more processors 810. The memory 820 is used to store computer program code 830, which includes computer instructions. The one or more processors 810 call the computer instructions to cause the electronic device 800 to perform the steps in any of the above methods.
[0247] Those skilled in the art will understand that Figure 13 is merely an example of electronic device 800 and does not constitute a limitation on electronic device 800. In practice, electronic device 800 may include more or fewer components than shown, or combine certain components, or use different components. For example, it may also include input / output devices, network access devices, etc. Electronic device 800 may also be the same device as electronic device 700 described in the above embodiments.
[0248] The processor 810 can be a Central Processing Unit (CPU), other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.
[0249] In some embodiments, memory 820 may be an internal storage unit of electronic device 800, such as a hard disk or memory of electronic device 800. In other embodiments, memory 820 may be an external storage device of electronic device 800, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on electronic device 800. Optionally, memory 820 may include both internal and external storage units of electronic device 800. Memory 820 is used to store operating system, application programs, bootloaders, data, and other programs, such as program code of computer programs. Memory 820 may also be used to temporarily store data that has been output or will be output.
[0250] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.
[0251] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the above device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0252] This application also provides a chip system applied to a foldable electronic device. The chip system includes one or more processors, which are used to invoke computer instructions to cause the foldable electronic device to implement the steps in any of the above methods.
[0253] In some implementations, the chip system also includes a memory connected to one or more processors via circuitry or wiring.
[0254] In some implementations, the chip system also includes a communication interface.
[0255] This application also provides a computer-readable storage medium including instructions that, when executed on a foldable electronic device, cause the foldable electronic device to perform the methods described in the above-described method embodiments.
[0256] This application also provides a computer program product that, when run on a foldable electronic device, causes the foldable electronic device to perform the aforementioned related steps to implement the methods described in the various method embodiments above.
[0257] If the integrated units described above are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium can include at least: any entity or device capable of carrying computer program code to a photographic device / foldable electronic device, a recording medium, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0258] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0259] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0260] In the embodiments provided in this application, it should be understood that the disclosed apparatus / device and method can be implemented in other ways. For example, the apparatus / device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0261] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0262] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.
[0263] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0264] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0265] References to "one embodiment" or "some embodiments" in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0266] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A super-resolution method for human body images, characterized in that, The method includes: Obtain the initial human body image; The initial human image is subjected to a first super-resolution process using a super-resolution model to obtain a first super-resolution human image; wherein the super-resolution model includes a face super-resolution diffusion model and a human body super-resolution diffusion model, and the first super-resolution process includes: Based on the initial human body image, obtain the initial facial feature map and the initial human body feature map; The initial face feature map is input into the face super-resolution diffusion model to obtain the first super-resolution face image; The initial human feature map is input into the human super-resolution diffusion model to obtain an initial super-resolution human image, wherein the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human image. The first super-resolution face image and the initial super-resolution human body image are fused together to obtain the first super-resolution human body image; The first super-resolution human image obtained after the first super-resolution processing is used as the initial human image in the second super-resolution processing. The process of the first super-resolution processing is repeated until the Tth super-resolution processing is completed, and the first super-resolution human image of the Tth super-resolution processing is obtained.
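The iteration recited in claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: images are toy nested lists, `face_model` and `body_model` are hypothetical callables standing in for the face and human body super-resolution diffusion models, the face region is a fixed box, and image size is kept constant across rounds for simplicity. The sketch only shows the control flow, in which each round's fused output becomes the next round's input until round T.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]        # toy stand-in for an image or feature map
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def crop(img: Image, box: Box) -> Image:
    """Cut the face region out of the human body feature map."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def fuse(body: Image, face: Image, box: Box) -> Image:
    """Replace the face region of `body` with `face` (assumed same size as the box)."""
    top, left, _, _ = box
    out = [row[:] for row in body]  # copy so the input is not modified
    for dy, face_row in enumerate(face):
        out[top + dy][left:left + len(face_row)] = face_row
    return out


def iterative_super_resolution(
    initial: Image,
    face_box: Box,
    face_model: Callable[[Image], Image],  # hypothetical face SR model
    body_model: Callable[[Image], Image],  # hypothetical body SR model
    T: int,
) -> Image:
    current = initial
    for _ in range(T):
        body_features = current
        face_features = crop(body_features, face_box)
        sr_face = face_model(face_features)         # first super-resolution face image
        sr_body = body_model(body_features)         # initial super-resolution human image
        current = fuse(sr_body, sr_face, face_box)  # input of the next round
    return current
```

Because the fused result is fed back into both models in every round, the face and body regions are refined together rather than pasted once at the end, which is the mechanism the specification credits for the more natural boundary transition.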
2. The method according to claim 1, characterized in that, The step of obtaining the initial facial feature map and the initial human feature map based on the initial human image includes: The initial human image is compressed to obtain a compressed human image; The initial facial feature map and the initial human feature map are obtained from the compressed human image.
3. The method according to claim 2, characterized in that, The step of obtaining the initial facial feature map and the initial human feature map based on the compressed human image includes: The compressed human image is used as the initial human feature map; Based on the face bounding box of the initial human body feature map, the initial human body feature map is cropped to obtain the initial face feature map.
4. The method according to claim 3, characterized in that, Before cropping the initial human feature map based on the face bounding box of the initial human feature map to obtain the initial face feature map, the method further includes: Face detection is performed on the initial human image to obtain the coordinates of the first face bounding box; Based on the coordinates of the first face bounding box and the mapping relationship between the initial human body image and the initial human body feature map, the coordinates of the second face bounding box of the initial human body feature map are determined; The face frame of the initial human feature map is determined based on the coordinates of the second face frame.
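As an illustration of the coordinate mapping in claim 4, the following sketch assumes that the mapping relationship between the initial human body image and the initial human body feature map is a uniform spatial scaling (for example, an 8x compression). The claim does not fix the mapping, so the function name `map_box_to_feature_map` and the scaling rule are assumptions made for this example only.

```python
import math


def map_box_to_feature_map(box, image_size, feature_size):
    """Map a face box from image pixels to feature-map cells.

    box:          (x1, y1, x2, y2) in image pixels, end-exclusive
    image_size:   (width, height) of the initial human body image
    feature_size: (width, height) of the initial human body feature map
    """
    img_w, img_h = image_size
    feat_w, feat_h = feature_size
    sx, sy = feat_w / img_w, feat_h / img_h  # assumed uniform scaling
    x1, y1, x2, y2 = box
    # Round outward so the mapped box still covers the whole face,
    # then clamp to the feature-map bounds.
    fx1 = max(0, math.floor(x1 * sx))
    fy1 = max(0, math.floor(y1 * sy))
    fx2 = min(feat_w, math.ceil(x2 * sx))
    fy2 = min(feat_h, math.ceil(y2 * sy))
    return fx1, fy1, fx2, fy2
```

For instance, with a 512x512 image compressed to a 64x64 feature map, a first face bounding box of `(80, 40, 240, 200)` maps to a second face bounding box of `(10, 5, 30, 25)`.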
5. The method according to claim 4, characterized in that, The process of fusing the first super-resolution face image and the initial super-resolution human body image to obtain the first super-resolution human body image includes: Select the second super-resolution face image from the initial super-resolution human image that corresponds to the coordinates of the second face bounding box; In the initial super-resolution human body image, the second super-resolution face image is replaced with the first super-resolution face image to obtain the first super-resolution human body image.
6. The method according to claim 5, characterized in that, The first super-resolution face image includes a first super-resolution head image and a first super-resolution non-head image, and the initial super-resolution human body image includes an initial super-resolution head image and an initial super-resolution non-head image; The step of replacing the second super-resolution face image with the first super-resolution face image in the initial super-resolution human image to obtain the first super-resolution human image includes: In the initial super-resolution human body image, the initial super-resolution human head image is replaced with the first super-resolution human head image to obtain the first super-resolution human body image.
7. The method according to any one of claims 2 to 6, characterized in that, The method further includes: The first super-resolution human image of the Tth super-resolution process is decoded to obtain the target super-resolution human image, the resolution of which is higher than that of the initial human image.
8. The method according to any one of claims 1 to 6, characterized in that, The initial face feature map includes multiple sub-face feature maps. Before inputting the initial face feature map into the face super-resolution diffusion model to obtain the first super-resolution face image, the method further includes: The initial face feature map is divided into multiple sub-face feature maps by sliding window segmentation. The initial face feature map is input to the face super-resolution diffusion model to obtain the first super-resolution face image, including: Multiple sub-face feature maps are input one by one into the face super-resolution diffusion model, so that the face super-resolution diffusion model outputs a first super-resolution face image based on the multiple sub-face feature maps.
9. The method according to any one of claims 1 to 6, characterized in that, The initial human feature map includes multiple sub-human feature maps. Before inputting the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image, the method further includes: The initial human feature map is divided into multiple sub-human feature maps by sliding window segmentation. The process of inputting the initial human feature map into the human super-resolution diffusion model to obtain the initial super-resolution human image includes: Multiple sub-human feature maps are input one by one into the human super-resolution diffusion model, so that the human super-resolution diffusion model outputs an initial super-resolution human image based on the multiple sub-human feature maps.
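The sliding window segmentation in claims 8 and 9 can be sketched as follows. The window size, the stride, and the rule of adding a final window flush with the edge are illustrative assumptions, since the claims do not specify them. The function returns window coordinates only; each window would then be cropped and passed to the corresponding diffusion model one at a time.

```python
def sliding_windows(height, width, window, stride):
    """Return (top, left, bottom, right) boxes that cover a feature map."""

    def starts(length):
        # A map smaller than the window is covered by a single window.
        if length <= window:
            return [0]
        positions = list(range(0, length - window + 1, stride))
        # Add one last window flush with the edge if the stride left a gap.
        if positions[-1] != length - window:
            positions.append(length - window)
        return positions

    return [
        (top, left, min(top + window, height), min(left + window, width))
        for top in starts(height)
        for left in starts(width)
    ]
```

Processing fixed-size windows keeps the per-call input of the model bounded regardless of the size of the feature map; overlapping windows (stride smaller than the window) are a common way to reduce visible seams when the outputs are reassembled.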
10. The method according to any one of claims 1 to 6, characterized in that, Before performing the first super-resolution processing on the initial human body image using a super-resolution model to obtain the first super-resolution human body image, the method further includes: A super-resolution diffusion model is constructed based on the diffusion model and the control network model; The super-resolution diffusion model is trained based on the first historical face feature map and the second historical face feature map to obtain the face super-resolution diffusion model. The first historical face feature map and the second historical face feature map contain the same face features, and the resolution of the first historical face feature map is higher than that of the second historical face feature map. The super-resolution diffusion model is trained based on the first historical human body feature map and the second historical human body feature map to obtain the human body super-resolution diffusion model. The first historical human body feature map and the second historical human body feature map contain the same human body features, and the resolution of the first historical human body feature map is higher than that of the second historical human body feature map.
11. The method according to claim 10, characterized in that, The process of training the super-resolution diffusion model based on the first historical face feature map and the second historical face feature map to obtain the face super-resolution diffusion model includes: The first historical face feature map is input into the diffusion model in the super-resolution diffusion model, so that the diffusion model outputs the first historical face feature to the control network model based on the first historical face feature map; The second historical face feature map is input into the control network model in the super-resolution diffusion model, so that the control network model is trained based on the first historical face feature map and the second historical face feature map to obtain the face super-resolution diffusion model.
12. The method according to claim 10, characterized in that, The process of training the super-resolution diffusion model based on the first historical human feature map and the second historical human feature map to obtain the human super-resolution diffusion model includes: The first historical human feature map is input into the diffusion model in the super-resolution diffusion model, so that the diffusion model outputs the first historical human feature to the control network model based on the first historical human feature map; The second historical human feature map is input into the control network model in the super-resolution diffusion model, so that the control network model is trained based on the first historical human feature map and the second historical human feature map to obtain the human super-resolution diffusion model.
13. The method according to any one of claims 1 to 6 or 11 to 12, characterized in that, The acquisition of the initial human image includes: Obtain the initial scene image containing the initial human body image; Human body detection is performed on the initial scene image to obtain human body bounding boxes; The initial scene image is cropped based on the human body bounding box to obtain the initial human body image.
14. The method according to claim 13, characterized in that, The method further includes: The first super-resolution human image of the Tth super-resolution process is segmented to obtain a human mask image; The human body mask image is then pasted back onto the position corresponding to the initial human body image in the initial scene image to obtain the target scene image.
15. The method according to claim 14, characterized in that, The method further includes: The boundary of the human body mask image in the target scene image is subjected to erosion processing.
16. The method according to claim 14 or 15, characterized in that, The method further includes: Gaussian blurring is applied to the boundary of the human body mask image in the target scene image.
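One possible reading of the paste-back steps in claims 14 to 16 is sketched below in plain Python: the human body mask is eroded, its boundary is softened with a small Gaussian kernel, and the result is used as a blending weight when the super-resolved human body is pasted into the scene. The 3x3 kernel sizes, the single-channel nested-list images, and the helper names `erode`, `gaussian_blur` and `paste_back` are assumptions for illustration; the claims do not prescribe these details.

```python
KERNEL = ((1, 2, 1), (2, 4, 2), (1, 2, 1))  # 3x3 Gaussian approximation, sums to 16
OFFSETS = (-1, 0, 1)


def erode(mask):
    """3x3 binary erosion; pixels outside the mask count as 0."""
    h, w = len(mask), len(mask[0])

    def at(y, x):
        return mask[y][x] if 0 <= y < h and 0 <= x < w else 0

    return [[min(at(y + dy, x + dx) for dy in OFFSETS for dx in OFFSETS)
             for x in range(w)] for y in range(h)]


def gaussian_blur(mask):
    """3x3 Gaussian blur with edge replication."""
    h, w = len(mask), len(mask[0])

    def at(y, x):
        return mask[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    return [[sum(KERNEL[dy + 1][dx + 1] * at(y + dy, x + dx)
                 for dy in OFFSETS for dx in OFFSETS) / 16
             for x in range(w)] for y in range(h)]


def paste_back(scene, human, mask, top, left):
    """Blend `human` into `scene` at (top, left) using the softened mask."""
    alpha = gaussian_blur(erode(mask))
    out = [row[:] for row in scene]  # copy so the scene is not modified
    for y, row in enumerate(human):
        for x, value in enumerate(row):
            a = alpha[y][x]
            out[top + y][left + x] = a * value + (1 - a) * scene[top + y][left + x]
    return out
```

Erosion pulls the mask slightly inside the body outline so that uncertain edge pixels are taken from the original scene, and the blur turns the hard mask edge into a gradual blend, which is one way to avoid a visible seam in the target scene image.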
17. A human body image super-resolution device, characterized in that, The device includes: The initial image acquisition module is used to acquire the initial human body image; The super-resolution processing module is used to perform a first super-resolution processing on the initial human image using a super-resolution model to obtain a first super-resolution human image; wherein, the super-resolution model includes a face super-resolution diffusion model and a human body super-resolution diffusion model, and the first super-resolution processing includes: Based on the initial human body image, obtain the initial facial feature map and the initial human body feature map; The initial face feature map is input into the face super-resolution diffusion model to obtain the first super-resolution face image; The initial human feature map is input into the human super-resolution diffusion model to obtain an initial super-resolution human image, wherein the resolution of the first super-resolution face image is higher than the resolution of the initial super-resolution human image. The first super-resolution face image and the initial super-resolution human body image are fused together to obtain the first super-resolution human body image; The repeat execution module is used to take the first super-resolution human image obtained after the first super-resolution processing as the initial human image in the second super-resolution processing, and repeat the process of the first super-resolution processing until the Tth super-resolution processing is completed, so as to obtain the first super-resolution human image of the Tth super-resolution processing.
18. An electronic device, characterized in that, The electronic device includes: one or more processors, and a memory; The memory is coupled to one or more processors, the memory being used to store computer program code, the computer program code including computer instructions, and the one or more processors calling the computer instructions to cause the electronic device to perform the method as described in any one of claims 1 to 16.
19. A computer-readable storage medium, characterized in that, The computer-readable storage medium includes instructions that, when executed on an electronic device, cause the electronic device to perform the method as described in any one of claims 1 to 16.