Image quality enhancement method and apparatus

By using layout analysis and regional adaptive image quality enhancement processing, the problem of low image quality in BYOM video conferencing has been solved, improving the user experience, especially the image quality of the main screen area, without affecting the quality of non-main screen areas.

WO2026123666A1 · PCT designated stage · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: HUAWEI TECH CO LTD
Filing Date: 2025-07-08
Publication Date: 2026-06-18

Abstract

An image quality enhancement method and apparatus. The method comprises: acquiring an image to be processed; performing layout analysis on said image to obtain a layout analysis result, wherein the layout analysis result indicates a main picture area and a non-main picture area of said image, the main picture area is an image area where a main content picture is located, and the non-main picture area is an image area where a content picture unrelated to the main content picture is located, and the main picture area comprises a plurality of sub-image areas; and using different image quality enhancement strategies to process sub-image areas having different image quality among the plurality of sub-image areas.

Description

An image quality enhancement method and apparatus

[0001] This application claims priority to Chinese Patent Application No. 202411815502.X, filed on December 09, 2024, entitled “A Method and Apparatus for Enhancing Image Quality”, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of computer technology, and in particular to an image quality enhancement method and apparatus.

Background Technology

[0003] BYOM (Bring Your Own Meeting) conferencing is a core function of video conferencing. It is a wireless conferencing solution based on wireless LAN technology, utilizing mobile devices to connect to video conferencing terminals for efficient and convenient communication. In BYOM conferencing, mobile devices can use the camera, speakers, microphone, monitor, and other hardware of the video conferencing terminal to project their screen content onto the video conferencing terminal.

[0004] Existing BYOM video conferencing solutions suffer from low image quality displayed on the video conferencing terminals, which negatively impacts the user's viewing experience.

Summary of the Invention

[0005] The embodiments of this application provide an image quality enhancement method and apparatus that achieves regional adaptive improvement of the displayed image quality, thereby enhancing the user's viewing experience.

[0006] In a first aspect, this application provides an image quality enhancement method, which includes acquiring an image to be processed; performing layout analysis on the image to be processed to obtain layout analysis results, wherein the layout analysis results indicate the main image area and non-main image areas of the image to be processed, wherein the main image area is the image area where the main content screen is located, and the non-main image areas are the image areas where content screens unrelated to the main content screen are located, wherein the main image area includes multiple sub-image areas; and employing different image quality enhancement processing strategies to process sub-image areas of different image quality within the multiple sub-image areas.

[0007] This application uses layout analysis to identify, in the image to be processed, the main image area, where the content that users are mainly interested in is displayed, and the non-main image areas. Generally speaking, the non-main image areas are smaller and have higher image quality, so they are not enhanced, which avoids the distortion that image quality enhancement could introduce. In addition, different image quality enhancement processing strategies are adopted according to the image quality of different areas in the main image area to achieve regional adaptive image quality enhancement, ensuring the image quality enhancement effect of the image to be processed and improving the user viewing experience.

[0008] In one possible implementation, the image to be processed is the first video frame image in the conference video stream; the multiple sub-image regions within the main screen image region are the image regions where the video conference views of multiple participants are located. In other words, the image quality enhancement method provided in this application can be applied to enhance the image quality of the video conference display. The main screen image region is the image region where the video stream content of multiple participants is displayed, which can be referred to as the main screen of the conference. Non-main screen image regions include the windows of the conference software, UI controls (such as various pop-ups, virtual buttons, etc.), the desktop, taskbar, and other non-main screen elements. Non-main screen image regions typically have sufficiently high resolution and do not require image quality enhancement, thus avoiding distortion problems caused by enhancing non-main screen elements.

[0009] In another possible implementation, the layout analysis results also indicate the view layout of the main screen image area. The view layout indicates the positional distribution of the video conferencing views of multiple participants in the main screen image area and the area size of each of those video conferencing views. A specific implementation of using different image quality enhancement strategies for sub-image areas of different image quality among the multiple sub-image areas is as follows: based on the image quality and area size of each video conferencing view, several sub-image areas to be enhanced are determined from the multiple sub-image areas; no image quality enhancement processing is performed on the sub-image areas that are not selected for enhancement; and different degrees of image quality enhancement processing are performed on the sub-image areas to be enhanced according to their different image qualities.

[0010] Typically, video conferencing screens have multiple view modes (i.e. multiple view layouts), and adaptive image quality enhancement processing is performed for different view modes.

[0011] For example, video conferencing view modes include full-screen meeting view, immersive meeting view, grid view, gallery view, and speaker view. For full-screen and immersive meeting views, which occupy a larger image area, adaptive image enhancement is applied. For grid views with more than four grids, where each grid represents one participant's video conference view, each grid is smaller and remains relatively clear even with lower image quality; therefore, no image enhancement is applied to any of the grids in this case. For grid views with four or fewer grids, adaptive image enhancement is applied to all grids. For gallery and speaker views, adaptive image enhancement is applied only to the view with the largest area.
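The per-view-mode policy in the example above can be sketched as a small lookup. This is a minimal illustration, not the claimed implementation: the `View` type, the layout label strings, and the function name `views_to_consider` are all hypothetical names introduced here, and the sketch returns only the candidate views on which adaptive enhancement may then be applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """One participant's video conference view (hypothetical type)."""
    participant: str
    area: int  # area of the view in pixels


def views_to_consider(layout: str, views: list[View]) -> list[View]:
    """Return the views eligible for adaptive enhancement under a layout.

    The layout labels are assumptions made for this sketch.
    """
    if layout in ("full_screen", "immersive"):
        # Large views: adaptive enhancement is applied.
        return list(views)
    if layout == "grid":
        # More than four grids: each grid is small, so none is enhanced.
        return list(views) if len(views) <= 4 else []
    if layout in ("gallery", "speaker"):
        # Only the view with the largest area is a candidate.
        return [max(views, key=lambda v: v.area)] if views else []
    raise ValueError(f"unknown layout: {layout}")
```

Returning candidates rather than enhancing directly keeps the layout policy separate from the later per-region quality check.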

[0012] In another possible implementation, based on the image quality and area size of each video conferencing view, a specific implementation of determining several sub-image regions to be enhanced from multiple sub-image regions is as follows: the video conferencing view of multiple participants whose image quality is lower than a first threshold and whose area size is greater than a second threshold is determined as the target video conferencing view; the image region where the target video conferencing view is located is determined as the sub-image region to be enhanced.

[0013] By analyzing the size of each participant's video conference view (i.e., the area of the image region occupied by each participant's video conference view) and the image quality of each participant's video conference view, the target video conference view that needs image quality enhancement is accurately identified. The image region where the identified target video conference view is located is marked as the region to be enhanced, so that subsequent steps can perform adaptive image quality enhancement processing on the region to be enhanced.
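The two-threshold selection described above can be sketched as a simple filter. The `ViewRegion` type, the score scale, and the threshold values in the usage lines are assumptions made for illustration; the text does not fix any of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewRegion:
    """A participant's view inside the main image area (hypothetical type)."""
    name: str
    quality: float  # image quality score; the scale is an assumption
    area: int       # area of the region in pixels


def select_regions_to_enhance(
    regions: list[ViewRegion],
    first_threshold: float,
    second_threshold: int,
) -> list[ViewRegion]:
    """Keep regions whose quality is below the first threshold and whose
    area is above the second threshold."""
    return [
        r for r in regions
        if r.quality < first_threshold and r.area > second_threshold
    ]


# Usage with made-up values: only "alice" is both low quality and large.
regions = [
    ViewRegion("alice", quality=35.0, area=640 * 360),
    ViewRegion("bob", quality=80.0, area=640 * 360),
    ViewRegion("carol", quality=30.0, area=160 * 90),
]
targets = select_regions_to_enhance(regions, first_threshold=50.0,
                                    second_threshold=320 * 180)
```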

[0014] In another possible implementation, a specific method for performing different levels of image quality enhancement processing on the sub-image regions to be enhanced with different image qualities in several sub-image regions to be enhanced is as follows: based on the image quality of each sub-image region to be enhanced, a target image quality enhancement model is selected for each sub-image region to be enhanced from a set of candidate image quality enhancement models. The set of candidate image quality enhancement models includes multiple image quality enhancement models, and different image quality enhancement models among the multiple image quality enhancement models are used to perform different levels of image quality enhancement processing on images with different image qualities; the target image quality enhancement model corresponding to each sub-image region to be enhanced is called to perform image quality enhancement processing on each sub-image region to be enhanced.

[0015] By selecting the corresponding image enhancement model for each sub-image region to be enhanced based on its image quality, image enhancement processing is performed on each sub-image region. This achieves regional adaptive image enhancement processing of the main image region, rather than enhancing the entire main image region as a whole, thus achieving more detailed image enhancement processing and ensuring the overall effect of image enhancement.
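One way to picture the per-region model selection is a lookup over quality bands. The sketch below is an assumption-laden illustration: the band boundaries, the three model names, and the rule that lower scores map to stronger enhancement are all hypothetical, since the text only says that different models handle different image qualities.

```python
import bisect

# Hypothetical candidate set: scores below 40 use the strongest model,
# scores from 40 up to (but not including) 70 use the medium one,
# and higher scores use the lightest one.
QUALITY_UPPER_BOUNDS = [40.0, 70.0]
MODEL_NAMES = ["strong_enhancer", "medium_enhancer", "light_enhancer"]


def select_model(quality_score: float) -> str:
    """Pick the candidate model whose quality band contains the score."""
    band = bisect.bisect_right(QUALITY_UPPER_BOUNDS, quality_score)
    return MODEL_NAMES[band]


def plan_enhancement(region_scores: dict[str, float]) -> dict[str, str]:
    """Map each sub-image region to be enhanced to its target model."""
    return {region: select_model(score)
            for region, score in region_scores.items()}
```

Each region gets its own model, so the main image area is never enhanced as a single block.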

[0016] In another possible implementation, a specific method for calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region is as follows: A masking operation is performed on the image regions outside each sub-image region to be enhanced in the first video frame image to obtain the masked first video frame image; the masked first video frame image is used as the input to the target image enhancement model corresponding to each sub-image region to be enhanced, and the output is the first video frame image after image enhancement processing for each sub-image region to be enhanced. That is, when enhancing the image quality of each sub-image region to be enhanced, the image regions outside each sub-image region to be enhanced are masked, thereby adjusting the enhancement coefficients of different image regions. The enhancement coefficient of the masked image regions is set to 0, and the enhancement coefficient of the unmasked image regions is set to 1, thus avoiding the influence of the image regions outside each sub-image region to be enhanced on the image enhancement processing.
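The effect of the 0/1 enhancement coefficients can be illustrated with a per-pixel blend. This sketch makes several assumptions: the frame is a grayscale grid of integers, `enhance` is a stand-in for whatever target model is selected, and the coefficients are applied by blending the model output with the original frame, which is one possible reading of the masking step rather than the method as claimed.

```python
from typing import Callable

Frame = list[list[int]]


def build_mask(height: int, width: int,
               box: tuple[int, int, int, int]) -> Frame:
    """Coefficient 1 inside the region to be enhanced, 0 elsewhere.

    box is (top, left, bottom, right) with bottom and right exclusive.
    """
    top, left, bottom, right = box
    return [
        [1 if top <= y < bottom and left <= x < right else 0
         for x in range(width)]
        for y in range(height)
    ]


def apply_masked_enhancement(frame: Frame, mask: Frame,
                             enhance: Callable[[Frame], Frame]) -> Frame:
    """Use enhanced pixels where the coefficient is 1, original pixels
    where it is 0."""
    enhanced = enhance(frame)
    return [
        [m * e + (1 - m) * p for p, e, m in zip(row_f, row_e, row_m)]
        for row_f, row_e, row_m in zip(frame, enhanced, mask)
    ]


def toy_enhance(frame: Frame) -> Frame:
    """Stand-in for a real enhancement model: brightens every pixel."""
    return [[min(255, p + 20) for p in row] for row in frame]
```

Pixels outside the region pass through untouched, so the desktop, taskbar, and UI controls cannot be distorted by the model.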

[0017] In another possible implementation, calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region to be enhanced further includes: determining the image quality score comparison of each image block in the first video frame image after masking before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the image quality score comparison to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.

[0018] Based on feedback obtained after image quality enhancement, this scheme adaptively adjusts the masked areas of the image. For example, positive feedback areas are further enhanced, while the enhancement coefficient of negative feedback areas is gradually reduced (i.e., negative feedback areas are masked).
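The feedback loop can be sketched as a per-block coefficient update. The block identifiers, the score dictionaries, and the `step` size are assumptions for illustration; the quality scores could come from any metric computed before and after enhancement.

```python
def adjust_mask(
    coeffs: dict[str, float],
    score_before: dict[str, float],
    score_after: dict[str, float],
    step: float = 0.5,
) -> dict[str, float]:
    """Update per-block enhancement coefficients from quality feedback.

    Blocks whose score improved (positive feedback) move toward 1,
    blocks whose score dropped (negative feedback) move toward 0, and
    blocks with no change keep their coefficient.
    """
    adjusted = {}
    for block, coeff in coeffs.items():
        delta = score_after[block] - score_before[block]
        if delta > 0:
            adjusted[block] = min(1.0, coeff + step)
        elif delta < 0:
            adjusted[block] = max(0.0, coeff - step)
        else:
            adjusted[block] = coeff
    return adjusted
```

Calling `adjust_mask` repeatedly on a block that keeps degrading drives its coefficient to 0, which corresponds to masking it.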

[0019] In another possible implementation, before applying different image enhancement strategies to sub-image regions of varying quality within the multiple sub-image regions, the process further includes: performing image quality evaluation on the multiple sub-image regions within the main image region to obtain an image quality score for each sub-image region. This image quality score is used to assess the overall image quality of each sub-image region. Thus, obtaining the image quality of each sub-image region through image quality evaluation facilitates subsequent adaptive image enhancement based on image quality.

[0020] Optionally, the image quality score for each sub-image region includes the resolution score for each sub-image region.

[0021] In another possible implementation, calling the target image enhancement model to perform image enhancement processing on the image region to be enhanced also includes: determining the VMAF index of each image block in the first video frame image after masking, before and after image enhancement processing, the index indicating the degree of degradation of each image block before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the VMAF index to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.

[0022] In another possible implementation, before performing layout analysis on the image to be processed and obtaining the layout analysis results, the following steps are also included: performing scene recognition on the first video frame image; and determining that the scene in which the first video frame image is located is a mainstream scene.

[0023] In another possible implementation, the image quality enhancement method provided in this application further includes: if the scene in which the first video frame image is located is a secondary stream scene, then no image quality enhancement processing is performed on the first video frame image.

[0024] Generally speaking, the resolution of video streams in mainstream scenes is usually 360p to 1080p, while the resolution of video streams in auxiliary stream scenes is usually above 1080p. Auxiliary streams suffer less quality loss during encoding, decoding, compression, and transmission. Therefore, after identifying an auxiliary stream, this application does not perform image quality enhancement processing on it, and performs image quality enhancement processing only on video frames in mainstream scenes, to ensure the effectiveness of the processing.

[0025] In another possible implementation, scene recognition of the first video frame image can be specifically achieved by: detecting, among K video frames associated with the first video frame image, the number of video frames whose differences are less than a third threshold, where K is a positive integer greater than 1; and determining the scene in which the first video frame image is located based on the number of video frames whose differences are less than the third threshold. The mainstream scene and the secondary stream scene are effectively identified using the frame difference method.

[0026] In another possible implementation, a specific way to determine the scene of the first video frame image based on the number of video frames with differences less than the third threshold is as follows: when the number of video frames with differences less than the third threshold is less than the fourth threshold, the scene of the first video frame image is determined to be the mainstream scene; when the number of video frames with differences less than the third threshold is greater than or equal to the fourth threshold, the scene of the first video frame image is determined to be the auxiliary stream scene.
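The frame difference decision above can be sketched as follows. Several details are assumptions, since the text does not specify them: frames are flattened lists of grayscale pixel values, the difference is the mean absolute difference between consecutive frames in a window, and the label strings are made up for this sketch.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_scene(frames: list[list[int]],
                   third_threshold: float,
                   fourth_threshold: int) -> str:
    """Classify a window of frames as mainstream or auxiliary stream.

    Counts consecutive-frame differences below third_threshold; a count
    below fourth_threshold means the content changes often (mainstream),
    otherwise the content is mostly static (auxiliary stream).
    """
    low_diff_count = sum(
        1 for prev, cur in zip(frames, frames[1:])
        if mean_abs_diff(prev, cur) < third_threshold
    )
    if low_diff_count < fourth_threshold:
        return "mainstream"
    return "auxiliary_stream"
```

The intuition is that shared slides or documents barely change between frames, while camera video of participants changes continuously.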

[0027] In another possible implementation, a specific approach to perform layout analysis on the image to be processed and obtain the layout analysis result is as follows: perform target detection on the first video frame image to obtain the target detection result; and determine the layout analysis result based on the target detection result.

[0028] In another possible implementation, the target detection result includes information on multiple detection boxes, which includes image regions selected by a first category of detection boxes and image regions selected by a second category of detection boxes. The image regions selected by the first category of detection boxes are the image regions in the first video frame image where the video conferencing view is located, and the image regions selected by the second category of detection boxes are the image regions in the first video frame image where the non-video conferencing view is located. Based on the target detection result, a specific implementation for determining the layout analysis result is as follows: when the number of first category detection boxes is equal to 1, and the proportion of the image region selected by the first category of detection boxes to the first video frame image is equal to 1, the view layout of the first video frame image is determined to be either a full-screen conference view or an immersive conference view. When the number of detection boxes in the first category is greater than 1, the proportion of the image area selected by the smallest detection box in the first category to the first video frame image is greater than or equal to the fifth threshold, and the proportion of the image area selected by the largest detection box in the first category to the first video frame image is less than the sixth threshold, the view layout of the first video frame image is determined to be a gallery view. When the number of detection boxes in the first category is greater than 1 and the proportion of the image area selected by the largest detection box in the first category to the first video frame image is greater than the sixth threshold, the view layout of the first video frame image is determined to be a speaker view.
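The three rules in this paragraph can be sketched as a classifier over the first category of detection boxes. The box format `(x, y, width, height)`, the label strings, and the `"unknown"` fallback for cases the paragraph does not cover are assumptions of this sketch.

```python
Box = tuple[int, int, int, int]  # (x, y, width, height), an assumed format


def classify_layout(view_boxes: list[Box], frame_width: int,
                    frame_height: int, fifth_threshold: float,
                    sixth_threshold: float) -> str:
    """Classify the view layout from first-category detection boxes."""
    frame_area = frame_width * frame_height
    areas = [w * h for (_x, _y, w, h) in view_boxes]

    if len(areas) == 1 and areas[0] == frame_area:
        # A single view covering the whole frame.
        return "full_screen_or_immersive"

    if len(areas) > 1:
        smallest = min(areas) / frame_area
        largest = max(areas) / frame_area
        if largest > sixth_threshold:
            return "speaker"
        if smallest >= fifth_threshold and largest < sixth_threshold:
            return "gallery"

    # Cases not covered by the three rules above.
    return "unknown"
```

Comparing integer areas for the full-frame case avoids relying on exact floating-point equality with 1.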

[0029] In another possible implementation, the first video frame image is obtained by capturing the screen image of a user device, which is the device used by the participant and is connected to the video conferencing terminal.

[0030] In another possible implementation, the image to be processed is the second video frame image in the projected video stream data or the sent video stream data. That is, the image quality enhancement method provided in this application embodiment can also be applied to adaptive image quality enhancement processing in ordinary projected and sent video stream scenarios to improve the user viewing experience.

[0031] In a second aspect, this application also provides an image quality enhancement apparatus, which includes an acquisition module, an analysis module, and an image quality enhancement module. The acquisition module is used to acquire an image to be processed; the analysis module is used to perform layout analysis on the image to be processed to obtain layout analysis results, which indicate the main image area and non-main image areas of the image to be processed. The main image area is the image area where the main content image is located, and the non-main image areas are the image areas where content images unrelated to the main content image are located. The main image area includes multiple sub-image areas; the image quality enhancement module is used to process sub-image areas with different image quality in the multiple sub-image areas using different image quality enhancement processing strategies.

[0032] In one possible implementation, the image to be processed is the first video frame image in the conference video stream; the multiple sub-image regions in the main screen image region are the image regions where the video conference views of multiple participants are located.

[0033] In another possible implementation, the layout analysis results also indicate the view layout of the main screen image area. The view layout indicates the positional distribution of the video conferencing views of multiple participants in the main screen image area and the area size of each of those video conferencing views. The image quality enhancement module is specifically used to: determine several sub-image areas to be enhanced from the multiple sub-image areas based on the image quality and area size of each video conferencing view; not perform image quality enhancement processing on the sub-image areas that are not selected for enhancement; and perform different degrees of image quality enhancement processing on the sub-image areas to be enhanced according to their different image qualities.

[0034] In another possible implementation, based on the image quality and area size of each video conferencing view, a specific implementation of determining several sub-image regions to be enhanced from multiple sub-image regions is as follows: the video conferencing view of multiple participants whose image quality is lower than a first threshold and whose area size is greater than a second threshold is determined as the target video conferencing view; the image region where the target video conferencing view is located is determined as the sub-image region to be enhanced.

[0035] In another possible implementation, a specific method for performing different levels of image quality enhancement processing on the sub-image regions to be enhanced with different image qualities in several sub-image regions to be enhanced is as follows: based on the image quality of each sub-image region to be enhanced, a target image quality enhancement model is selected for each sub-image region to be enhanced from a set of candidate image quality enhancement models. The set of candidate image quality enhancement models includes multiple image quality enhancement models, and different image quality enhancement models among the multiple image quality enhancement models are used to perform different levels of image quality enhancement processing on images with different image qualities; the target image quality enhancement model corresponding to each sub-image region to be enhanced is called to perform image quality enhancement processing on each sub-image region to be enhanced.

[0036] In another possible implementation, a specific method for calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region is as follows: A masking operation is performed on the image regions outside each sub-image region to be enhanced in the first video frame image to obtain the masked first video frame image; the masked first video frame image is used as the input to the target image enhancement model corresponding to each sub-image region to be enhanced, and the output is the first video frame image after image enhancement processing for each sub-image region to be enhanced. That is, when enhancing the image quality of each sub-image region to be enhanced, the image regions outside each sub-image region to be enhanced are masked, thereby adjusting the enhancement coefficients of different image regions. The enhancement coefficient of the masked image regions is set to 0, and the enhancement coefficient of the unmasked image regions is set to 1, thus avoiding the influence of the image regions outside each sub-image region to be enhanced on the image enhancement processing.

[0037] In another possible implementation, calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region to be enhanced further includes: determining the image quality score comparison of each image block in the first video frame image after masking before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the image quality score comparison to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.

[0038] In another possible implementation, before applying different image enhancement strategies to sub-image regions of varying quality within the multiple sub-image regions, the process further includes: performing image quality evaluation on the multiple sub-image regions within the main image region to obtain an image quality score for each sub-image region. This image quality score is used to assess the overall image quality of each sub-image region. Thus, obtaining the image quality of each sub-image region through image quality evaluation facilitates subsequent adaptive image enhancement based on image quality.

[0039] Optionally, the image quality score for each sub-image region includes the resolution score for each sub-image region.

[0040] In another possible implementation, calling the target image enhancement model to perform image enhancement processing on the image region to be enhanced also includes: determining the VMAF index of each image block in the first video frame image after masking, before and after image enhancement processing, the index indicating the degree of degradation of each image block before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the VMAF index to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.

[0041] In another possible implementation, the image quality enhancement processing apparatus provided in this application further includes a scene recognition module, which is used to perform scene recognition on the first video frame image and determine that the scene in which the first video frame image is located is a mainstream scene.

[0042] In another possible implementation, the scene recognition module is also used to: determine that the scene in which the first video frame image is located is a secondary stream scene, and then not perform image quality enhancement processing on the first video frame image.

[0043] In another possible implementation, the scene recognition module is specifically used to: detect, among K video frames associated with the first video frame image, the number of video frames whose differences are less than a third threshold, where K is a positive integer greater than 1; and determine the scene in which the first video frame image is located based on the number of video frames whose differences are less than the third threshold. The mainstream scene and the secondary stream scene are effectively identified through the frame difference method.

[0044] In another possible implementation, a specific way to determine the scene of the first video frame image based on the number of video frames with differences less than the third threshold is as follows: when the number of video frames with differences less than the third threshold is less than the fourth threshold, the scene of the first video frame image is determined to be the mainstream scene; when the number of video frames with differences less than the third threshold is greater than or equal to the fourth threshold, the scene of the first video frame image is determined to be the auxiliary stream scene.

[0045] In another possible implementation, the analysis module is specifically used to: perform target detection on the first video frame image to obtain the target detection result; and determine the layout analysis result based on the target detection result.

[0046] In another possible implementation, the target detection result includes information on multiple detection boxes, which includes image regions selected by a first category of detection boxes and image regions selected by a second category of detection boxes. The image regions selected by the first category of detection boxes are the image regions where the video conferencing view is located in the first video frame image, and the image regions selected by the second category of detection boxes are the image regions in the first video frame image where the non-video conferencing view is located. Based on the target detection result, a specific implementation for determining the layout analysis result is as follows: when the number of first category detection boxes is equal to 1, and the proportion of the image region selected by the first category of detection boxes to the first video frame image is equal to 1, the view layout of the first video frame image is determined to be either a full-screen meeting view or an immersive meeting view. If the number of detection boxes in the first category is greater than 1, and the image areas selected by each detection box in the first category are of the same size, then the view layout of the first video frame image is determined to be a grid view. If the number of detection boxes in the first category is greater than 1, and the proportion of the image area selected by the smallest detection box in the first category to the first video frame image is less than the fifth threshold, then the view layout of the first video frame image is determined to be a gallery view. If the number of detection boxes in the first category is greater than 1, and the proportion of the image area selected by the largest detection box in the first category to the first video frame image is greater than the sixth threshold, then the view layout of the first video frame image is determined to be a speaker view.

[0047] In another possible implementation, the first video frame image is obtained by capturing the screen image of a user device, which is the device used by the participant and is connected to the video conferencing terminal.

[0048] In another possible implementation, the image to be processed is the second video frame image in the projected video stream data or the video stream data sent for display.

[0049] In a third aspect, embodiments of this application provide a computing device, including a memory and a processor, wherein the memory stores instructions that, when executed by the processor, cause the method described in the first aspect or any possible implementation of the first aspect to be implemented.

[0050] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, causes the method described in the first aspect or any possible implementation thereof to be implemented.

[0051] In a fifth aspect, embodiments of this application also provide a computer program or computer program product, which includes instructions that, when executed, cause a computer to perform the method described in the first aspect or any possible implementation thereof.

[0052] In a sixth aspect, embodiments of this application also provide a chip including at least one processor and a communication interface, the processor being configured to perform the method described in the first aspect or any possible implementation thereof.

[0053] It is understandable that the beneficial effects of the second to sixth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here for the sake of brevity.

Attached Figure Description

[0055] Figure 1 shows a schematic diagram of the implementation process of a BYOM meeting;

[0056] Figure 2 shows a system architecture diagram of a video conferencing system to which the image quality enhancement method provided in the embodiments of this application can be applied;

[0057] Figure 3 is a schematic diagram of the implementation process of an image quality enhancement method provided in this application;

[0058] Figure 4 shows a schematic diagram of target detection results for an image captured from the screen;

[0059] Figure 5 is a schematic diagram of a gallery / grid view layout;

[0060] Figure 6 is a schematic diagram of a speaker view layout.

[0061] Figure 7 is a schematic diagram of a view layout in a full-screen view mode;

[0062] Figure 8 is a schematic diagram of a view layout in an immersive view mode;

[0063] Figures 9 to 13 illustrate the super-resolution processing strategies for different view modes derived from the layout analysis.

[0064] Figure 14 illustrates a flowchart of a specific implementation of an image enhancement step;

[0065] Figure 15 shows a schematic diagram of a conference video frame image;

[0066] Figure 16 illustrates the implementation flow of the image quality enhancement steps based on enhanced feedback provided in the embodiments of this application;

[0067] Figure 17 shows a schematic diagram illustrating the specific implementation of the image enhancement steps based on enhanced feedback;

[0068] Figures 18 and 19 show video frame images of the main stream scene and the auxiliary stream scene, respectively.

[0069] Figure 20 is a schematic diagram of the implementation process of another image quality enhancement method provided in this application;

[0070] Figure 21 is a flowchart illustrating the image quality enhancement method provided in an embodiment of this application;

[0071] Figure 22 shows a schematic diagram of the implementation process of the image quality enhancement method provided in this application when applied to a normal screen projection scenario;

[0072] Figure 23 is a schematic diagram of an image quality enhancement device provided in an embodiment of this application;

[0073] Figure 24 is a schematic diagram of the structure of the computing device provided in the embodiment of this application. Detailed Implementation

[0074] The term "and / or" used in this article describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent three cases: A alone, A and B simultaneously, and B alone. The symbol " / " in this article indicates that the related objects are in an "or" relationship; for example, A / B means A or B.

[0075] The terms "first" and "second," etc., used in the specification and claims herein are used to distinguish different objects, not to describe a specific order of objects. It should be understood that such terms are interchangeable where appropriate; this is merely a way of distinguishing objects with the same properties in the description of embodiments of this application. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such processes, methods, systems, products, or apparatus.

[0076] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is being described as an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being better or more advantageous than other embodiments or designs. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner. In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more; for example, multiple processing units means two or more processing units, multiple elements means two or more elements, etc.

[0077] The image quality enhancement method provided in this application embodiment can be applied to various application scenarios that require image quality enhancement, such as ordinary screen projection scenarios, display transmission scenarios, video conferencing scenarios (including BYOM scenarios and screen projection conferencing scenarios), etc., to achieve regional adaptive enhancement of image quality.

[0078] For example, in a typical screen mirroring scenario, content played on a small-screen device is mirrored to a large-screen device for viewing. For instance, content viewed on a computer is mirrored to a smart TV. Lower-resolution content (e.g., 1080P) has little impact on the viewing experience on a smaller computer screen, but becomes noticeably blurry on a larger smart TV, affecting the user's viewing experience. Furthermore, the image quality varies across different areas of the computer screen. For example, the taskbar and desktop background typically have higher resolution (determined by the computer's display settings, which are usually not low), while the content played by the player / webpage is limited by the source image quality and generally has lower resolution. In this case, the image quality enhancement method provided in this application is used. For the content received by the large-screen terminal device, the target enhancement area is determined through layout analysis. Based on the image quality score of the target area, adaptive image quality enhancement processing is applied to the mirrored content, improving its clarity and enhancing the user's viewing experience.

[0079] In display scenarios, when terminal devices (such as computers, mobile phones, or smart TVs) play video content, the viewing experience is poor when the quality of the video source is low (for example, video websites mostly provide low-resolution sources, such as 360P, 540P, or 720P, to users without a membership). In this case, the image quality enhancement method provided in the embodiments of this application is used to obtain the video stream data before display, and then perform regional adaptive image quality enhancement processing on that video stream data to improve the display effect.

[0080] In screen-sharing conferencing scenarios, devices with screen-sharing capabilities, such as mobile phones and computers, project the meeting screen onto a conferencing terminal with a large display screen for convenient viewing of the meeting content. Typically, the image quality of the screen projected onto the large screen of the video conferencing terminal varies in different areas. For example, areas that are not part of the main meeting content (such as UI control areas, including virtual buttons, icons, and the taskbar area) usually have higher image quality and do not require image enhancement processing. However, the image quality of the meeting video stream is usually lower due to various reasons (such as the low resolution of cloud conferencing streams, mostly at 360P and 720P), and the image quality of the video streams from different participants also varies. Therefore, the image quality enhancement method provided in this application is used to acquire the projected video stream from the mobile phone / computer and then perform adaptive image quality enhancement processing on the projected video stream in different regions to improve the display effect.

[0081] In a BYOM (Bring Your Own Meeting) scenario, mobile devices such as smartphones and computers leverage the hardware capabilities of the video conferencing terminal through BYOM. For example, a mobile device uses conferencing software like Tencent Meeting, Lark, or Teams to access the video stream from the conference room's camera, encodes the video stream, and then projects it onto the video conferencing terminal. The terminal receives the projected stream from the mobile device and displays the video conference content on its large screen for easy viewing by participants. Generally, the image quality varies across different areas on the video conferencing terminal's large screen. For instance, areas outside the main conference feed typically have higher image quality and require no enhancement. However, the conference video stream itself often has lower quality due to various reasons (e.g., low resolution in cloud conferencing streams, mostly 360P and 720P). Furthermore, remote conference video streams typically have lower quality, while local conference video streams usually have higher quality. Therefore, the image enhancement method provided in this application is used to acquire the projected video stream from the mobile device and then perform adaptive image enhancement processing on the stream in different regions to improve the display effect.

[0082] The following section uses a video conferencing scenario as an example to detail the specific implementation of the image quality enhancement method and apparatus provided in this application.

[0083] Figure 1 illustrates the implementation process of a BYOM meeting. As shown in Figure 1, mobile devices capture video streams from the camera of the video conferencing terminal through conferencing software such as Tencent Meeting, Lark, and Teams. The video streams are then encoded, transmitted, decoded, and displayed on the large screen of the video conferencing terminal.

[0084] However, existing BYOM video conferencing solutions suffer from insufficient image quality. For example, mobile devices typically support a resolution of up to 1080P rather than 4K, resulting in low clarity of the remote video feed under BYOM projection. Furthermore, cloud conferencing offers different resolution configurations for different mobile devices and accounts, such as 360P, 540P, 720P, and 1080P, leading to varying degrees of image clarity among participants. Additionally, the process of mobile devices requesting video streams from the video conferencing terminal's camera and then transmitting them to the terminal's display involves some quality loss, further contributing to the insufficient image quality.

[0085] The relevant technical solutions mainly involve terminal devices that enhance the quality of the projected image. For example, the receiving terminal device uses deep learning-based artificial intelligence (AI) processing technology to optimize the image quality of the content sent by the projecting device, typically focusing on video-related content. Taking super-resolution reconstruction technology for projected images as an example, the projected data is processed as follows: the projected video stream is received from the mobile device, and screenshots of the video stream are taken to obtain the image information of the video stream. The video frame information is statistically analyzed, and a super-resolution model is selected to perform super-resolution on the video frames. The super-resolved video frames are then sent for display, and the images are upsampled to 4K for rendering.

[0086] The relevant technical solutions have the following problems:

[0087] Most mainstream video conferencing terminals and mobile conferencing software currently support BYOM functionality. However, the bitstream resolution of most conferencing software is 360P or 720P, with a maximum of 1080P after activating a membership. In actual use, the bitstream resolution fluctuates continuously with network or device performance. In scenarios where bitstream information cannot be obtained, related technologies cannot effectively enhance video frames of different resolutions and qualities without causing sharpening distortion or other degradation issues.

[0088] Video conferencing software offers various screen layouts, such as gallery (grid) view, speaker view, and full-screen meeting view. Based on these different view modes, the image quality of different areas on the projected screen varies significantly. For example, the image quality of pop-ups, the desktop, and software UI controls will differ considerably from the content displayed in the cloud meeting. A single image enhancement method cannot adaptively handle all content.

[0089] Video conferencing involves both main stream and auxiliary stream scenarios. The main stream resolution does not exceed 1080P, while the auxiliary stream can be as high as the original resolution, so the image quality of the auxiliary stream is significantly higher than that of the main stream. The auxiliary stream scenario therefore needs to be identified to avoid degradation after super-resolution.

[0090] In view of this, embodiments of this application provide an image quality enhancement method and apparatus, which achieves regional adaptive enhancement of BYOM video conferencing video images through layout analysis and image quality evaluation, thereby improving the video quality of the meeting in different meeting software views and different meeting scenarios on the terminal side without affecting the image quality of other UI areas.

[0091] To facilitate understanding of the image quality enhancement method and apparatus provided in the embodiments of this application, some technical terms involved in the embodiments of this application will be briefly explained below.

[0092] Super-resolution: Super-resolution imaging is a technique to improve the resolution of a video. It refers to the recovery of a high-resolution image from a low-resolution image or image sequence for image quality enhancement. In the embodiments of this application, super-resolution is an implementation algorithm for image quality enhancement.

[0093] BYOM is a meeting room solution that allows participants to host and control meetings using personal devices such as smartphones or laptops. Compared to traditional meeting room-based systems, the core of BYOM is that participants can not only use their own devices but also access the hardware capabilities of the meeting room, such as video conferencing terminals, making the meeting process and structure more convenient.

[0094] 4K resolution: 4K resolution has more than 8 million pixels, which is 4 times the pixel count of 1080P. The commonly used resolution format is 3840*2160, and the commonly used format in the film industry is 4096*2160.

[0095] 1080P resolution: 1080P resolution is approximately two million pixels, and the commonly used resolution format is 1920*1080.

[0096] 720P resolution: 720P resolution is approximately 900,000 pixels, and the commonly used resolution format is 1280*720.

[0097] 360P resolution: 360P resolution is approximately 200,000 pixels, and the commonly used resolution format is 640*360.
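The pixel counts quoted in the definitions above can be checked with a short calculation. The following is a minimal sketch; the dictionary `FORMATS` and the function `pixel_count` are illustrative names that do not appear in this application.

```python
# Commonly used (width, height) formats taken from the definitions above.
# FORMATS and pixel_count are hypothetical names used only for illustration.
FORMATS = {
    "4K": (3840, 2160),
    "1080P": (1920, 1080),
    "720P": (1280, 720),
    "360P": (640, 360),
}


def pixel_count(name):
    """Return the total number of pixels for a named resolution format."""
    width, height = FORMATS[name]
    return width * height


if __name__ == "__main__":
    for name in FORMATS:
        print(name, pixel_count(name))
    # Ratio of the 4K pixel count to the 1080P pixel count.
    print(pixel_count("4K") // pixel_count("1080P"))
```

With these formats, 3840*2160 contains exactly four times as many pixels as 1920*1080, which matches the statement in the 4K definition.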

[0098] (Wireless) screen mirroring: Displaying the screen of a mobile device A (such as a mobile phone, tablet, laptop, or computer) "in real time" onto the screen of another device B (tablet, laptop, computer, TV, all-in-one machine, or projector) using a certain technical method. The output content includes various media information and real-time operation screens. The specific connection method can be wired or wireless.

[0099] Image upsampling: Image magnification, restoring an image from low resolution to high resolution. For an image of size H*W, upsampling it by a factor of s yields a (sH)*(sW) resolution image.

[0100] Image downsampling: Image reduction. For an image of size H*W, downsampling by a factor of s will yield a resolution image of size (H / s)*(W / s).
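As an illustration of the size relationships in the two definitions above, the following minimal sketch resamples a small image stored as a list of rows. Nearest-neighbour replication and strided sampling are assumptions chosen for brevity; this application does not specify an interpolation method, and the function names are hypothetical.

```python
def upsample_nearest(image, s):
    """Upsample an H*W image (list of rows) to (sH)*(sW).

    Each pixel is repeated s times horizontally and each row s times
    vertically (nearest-neighbour replication, an assumed method).
    """
    return [
        [pixel for pixel in row for _ in range(s)]
        for row in image
        for _ in range(s)
    ]


def downsample_stride(image, s):
    """Downsample an H*W image to roughly (H / s)*(W / s).

    Keeps every s-th pixel of every s-th row (an assumed method); when H
    or W is not a multiple of s, the output size is rounded up.
    """
    return [row[::s] for row in image[::s]]


if __name__ == "__main__":
    small = [[1, 2], [3, 4]]          # H = 2, W = 2
    big = upsample_nearest(small, 2)  # H = 4, W = 4
    print(len(big), len(big[0]))
    print(downsample_stride(big, 2) == small)
```

In practice a super-resolution model replaces the simple replication step; the sketch only demonstrates how the output dimensions scale with the factor s.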

[0101] Video conferencing terminal: refers to a display terminal that supports video conferencing software and BYOM function, and is also a receiving and projection terminal.

[0102] The image quality enhancement method and apparatus provided in this application are described in detail below with reference to the accompanying drawings and embodiments.

[0103] Figure 2 illustrates a system architecture diagram of a video conferencing system to which the image quality enhancement method provided in this application embodiment can be applied. As shown in Figure 2, the video conferencing system includes a video conferencing terminal and a mobile device. After the mobile device starts the conferencing software, it shares its screen, then encodes the video stream of the shared image and sends it to the video conferencing terminal for display. The video conferencing terminal includes hardware devices such as a large-screen display, camera, microphone, and speakers, as well as an audio module, a control module, and a video module. The video module includes a decoding module, a scene recognition module, a layout analysis module, an image quality enhancement module, and a display module. The mobile device includes a screen image acquisition module and an encoding module.

[0104] The mobile device can be another local device that supports running conferencing software and connects to the video conferencing terminal via a wired (e.g., HDMI) or wireless (e.g., WiFi) connection. The mobile device can utilize various hardware components of the video conferencing terminal, such as cameras, microphones, and displays, to project its screen onto the large screen of the video conferencing terminal, facilitating viewing for all participants. The mobile device can be any device that supports BYOM (Bring Your Own Meeting) projection, such as a mobile phone, tablet, laptop, or personal computer (PC).

[0105] The screen image acquisition module is used to acquire screen images of mobile devices, for example, by calling the GPU interface.

[0106] The encoding module is used to encode the captured screen image. Optionally, the encoding format can be YUV420.

[0107] The transmission module is used to send the encoded bitstream to the video conferencing terminal. The transmission method can be wired transmission (such as HDMI, Ethernet cable) or wireless transmission (such as WiFi).

[0108] The decoding module is used to decode the received bitstream.

[0109] The scene recognition module is used to identify the scene of the decoded image, mainly to determine whether the bitstream scene of the BYOM screen sharing meeting is the main stream scene or the auxiliary stream scene.

[0110] The layout analysis module is used to asynchronously analyze the layout of decoded images, analyze image content, and detect the main screen image area and other controls, pop-ups, and image areas that do not require image quality enhancement.

[0111] The image quality enhancement module is used to guide the image quality enhancement process based on the layout analysis results. It enhances image areas with poor image quality to a greater extent, and enhances areas with high image quality only slightly or not at all.
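The behaviour of the image quality enhancement module described above can be sketched as a mapping from a per-region quality score to an enhancement strength. The score range of 0 to 1, the two thresholds, and the three strength labels are all assumptions made for illustration; this application does not specify them at this point.

```python
def enhancement_strength(quality_score, low=0.4, high=0.8):
    """Map a region quality score in [0, 1] to an enhancement strength.

    The thresholds `low` and `high` and the returned labels are
    hypothetical values chosen only to illustrate the idea that poorer
    regions receive stronger enhancement.
    """
    if not 0.0 <= quality_score <= 1.0:
        raise ValueError("quality_score must be within [0, 1]")
    if quality_score >= high:
        return "none"      # already high quality: leave untouched
    if quality_score >= low:
        return "light"     # moderate quality: enhance slightly
    return "strong"        # poor quality: enhance to a greater extent


if __name__ == "__main__":
    for score in (0.9, 0.5, 0.1):
        print(score, enhancement_strength(score))
```

A real implementation would likely select between different models or parameter sets rather than string labels; the labels simply make the branching explicit.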

[0112] The display module is used to output and display the enhanced image.

[0113] It should be noted that the system architecture diagram or internal structure diagram of the video conferencing system shown in Figure 2 is only a schematic diagram for the convenience of understanding one possible implementation of this solution, and does not constitute a limitation of this application. For example, the video module may include more or fewer modules; for instance, the video module may not include the scene recognition module.

[0114] Figure 3 is a schematic diagram illustrating the implementation process of an image quality enhancement method provided in this application. As shown in Figure 3, the implementation of this method includes the following steps: screen image acquisition, encoding, transmission, decoding, layout analysis, regional image quality enhancement, and display.

[0115] In the screen image acquisition step, the screen image of the mobile device is acquired. The mobile device can be a PC, mobile phone, or other device that supports BYOM screen sharing. The acquired screenshot can include the meeting screen of the video conferencing software, UI controls, software pop-ups, desktop background, etc. The meeting screen of the video conferencing software can include the main screen (including the video stream captured by the camera of the local conferencing terminal and the video stream captured by the camera of the remote participating terminal, etc.) and the auxiliary screen (screen content shared by the local or remote conferencing terminals, such as text and images). In the embodiments of this application, the resolution of the acquired screen image can be 180P, 360P, 540P, 720P, 1080P, 4K, etc., and is not limited here.

[0116] After acquiring the screen image video stream, the screen image video stream is encoded for transmission.

[0117] The encoded stream is then transmitted to the video conferencing terminal. For example, in a local screen sharing scenario, the encoded stream is transmitted locally to the video conferencing terminal. Transmission can be wired (e.g., HDMI, Ethernet cable) or wireless (e.g., WiFi).

[0118] In the decoding step, the video conferencing terminal, as the receiving end, uses the decoding method corresponding to that of the encoding end to decode the received bitstream and obtain a YUV image.

[0119] In the layout analysis step, target detection is performed on the decoded YUV image, asynchronously detecting the conference feed on the captured screen. The conference feed refers to the video stream from cameras entering the meeting from remote or local devices. In the meeting software's view mode, these camera video streams are combined into different conference displays. The layout analysis identifies these conference feeds and other content such as pop-ups and plugins that obstruct the conference feed.

[0120] In the image quality enhancement step, the target image region (i.e. the image region to be enhanced) identified by the layout analysis is subjected to image quality enhancement processing (e.g., super-resolution enhancement processing can be performed). The image region to be enhanced represents the image region where the bitstream of the conference is located, including the video stream obtained by encoding, transmission and decoding by local or remote conferencing software. For non-target image regions, super-resolution enhancement is not performed to avoid distortion of the region content caused by super-resolution enhancement by the model or enhancement algorithm.
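The idea of enhancing only the target image regions while leaving the rest of the frame untouched can be sketched as follows. The frame is modelled as a list of pixel rows, boxes are assumed to be half-open `(x0, y0, x1, y1)` rectangles, and `enhance` is a placeholder callable that is assumed to return a crop of the same size; none of these conventions come from this application.

```python
def enhance_regions(frame, target_boxes, enhance):
    """Return a copy of `frame` in which only `target_boxes` are enhanced.

    frame        -- list of rows, each row a list of pixel values
    target_boxes -- iterable of (x0, y0, x1, y1), half-open (assumed format)
    enhance      -- callable taking a crop (list of rows) and returning a
                    crop of identical size (placeholder for a real model)
    """
    output = [row[:] for row in frame]  # non-target pixels stay as decoded
    for x0, y0, x1, y1 in target_boxes:
        crop = [row[x0:x1] for row in frame[y0:y1]]
        enhanced = enhance(crop)
        for dy, new_row in enumerate(enhanced):
            output[y0 + dy][x0:x1] = new_row
    return output


if __name__ == "__main__":
    frame = [[0] * 4 for _ in range(4)]
    # Stand-in "enhancement": add 1 to every pixel of the crop.
    brighten = lambda crop: [[p + 1 for p in row] for row in crop]
    for row in enhance_regions(frame, [(1, 1, 3, 3)], brighten):
        print(row)
```

Because a super-resolution model normally changes the output size, a full implementation would also need to resample the non-target regions to the display resolution; the sketch deliberately omits that step.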

[0121] Finally, the decoded and enhanced image is displayed on the video conferencing terminal screen.

[0122] The core steps of the image quality enhancement method provided in this application are the layout analysis step and the image quality enhancement step. The specific implementation of these two steps will be described in detail below.

[0123] In one example, layout analysis can be achieved through object detection. For instance, object detection can be performed on the captured screen image (referred to as the conference video frame image in subsequent steps), and then layout analysis can be performed based on the object detection results to obtain the layout analysis results.

[0124] Figure 4 illustrates a schematic diagram of target detection results for images captured from the screen. As shown in Figure 4, the detection boxes include two categories. The first category of detection boxes selects the image area of the conference video frame image containing the video stream content of the participants (which can be referred to as the main stream content). The second category of detection boxes selects the image area containing the title bar, various UI controls, taskbar display area, software pop-ups, and other elements unrelated to the participants' video stream content. In other words, target detection can detect different categories of content displayed in the image. The first category of detection boxes can be used to select the main screen image area, which mainly displays the participants' video stream content. The second category of detection boxes can be used to select non-main screen image areas, such as the image areas containing the conference software's title bar, various UI controls, taskbar display area, and software pop-ups.

[0125] Optionally, a trained object detection model can be invoked to perform object detection on the conference video frame images. For example, the conference video frame images can be input into a trained object detection model, which will perform inference calculations and output the object detection result image as shown in Figure 4.

[0126] After obtaining the target detection results, layout analysis can be performed based on these results. As shown in Figure 4, the main image area and non-main image areas can be directly determined from the target detection results. Specifically, the image area selected by the first type of detection box is the main image area, and the image area selected by the second type of detection box is the non-main image area. In other words, the layout analysis results will indicate the main image area and the non-main image area of the conference video frame image.
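The mapping from detection box categories to main and non-main image areas can be sketched as below. The `Box` data class, its field names, and the integer category codes 1 and 2 are assumptions introduced for illustration; the actual output format of the object detection model is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A detection box; field names and category codes are hypothetical."""
    x0: int
    y0: int
    x1: int
    y1: int
    category: int  # 1 = first category (main), 2 = second category (non-main)


def split_regions(boxes):
    """Split detection boxes into (main_areas, non_main_areas)."""
    main_areas = [box for box in boxes if box.category == 1]
    non_main_areas = [box for box in boxes if box.category == 2]
    return main_areas, non_main_areas


if __name__ == "__main__":
    detections = [
        Box(0, 40, 1920, 1040, category=1),   # conference picture
        Box(0, 0, 1920, 40, category=2),      # title bar
        Box(0, 1040, 1920, 1080, category=2), # taskbar
    ]
    main_areas, non_main_areas = split_regions(detections)
    print(len(main_areas), len(non_main_areas))
```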

[0127] Optionally, the layout analysis results also indicate the view layout of the video conference (i.e., the view layout of the main image area). Video conferencing software typically supports multiple video layout modes, such as gallery / grid view, speaker view, full-screen meeting view, and immersive view.

[0128] Figure 5 is a schematic diagram of a gallery / grid view layout. As shown in Figure 5, the video conferencing software displays an M*N grid of content, where the grids can be of equal or unequal size, and each grid displays the video stream content of one participant.

[0129] Figure 6 is a schematic diagram of a speaker view layout. As shown in Figure 6, the video conferencing software presents one main screen and N smaller screens arranged around the edge of the main screen. That is, the video stream content of one participant can be set as the main screen, and the video stream content of other participants can be arranged in the form of N smaller screens around the edge of the main screen. For example, the N smaller screens can be arranged on the right side of the main screen, or above the main screen, or in an L-shaped arrangement above and to the right of the main screen, or in an L-shaped arrangement below and to the right of the main screen, etc.

[0130] Figure 7 is a schematic diagram of a view layout in a full-screen view mode. As shown in Figure 7, the conferencing software displays a single video stream (i.e., the video stream content of a selected participant) in full screen, while all other video stream content is closed.

[0131] Figure 8 is a schematic diagram of a view layout in an immersive view mode. As shown in Figure 8, the conferencing software extracts and aggregates the video stream content from multiple participants into a single virtual meeting room for display.

[0132] Of course, FIGS. 5 to 8 only show the view layouts of several common video conference view modes, and do not constitute a limitation on the embodiments of the present application. The types of view modes supported by each video conference software may be different, and the specific number of view modes is determined according to the video modes supported by the actual video conference software.

[0133] Based on the view layout characteristics of each view mode, the view mode of the current conference video frame image can be determined from the object detection result. For example, the view mode of the current conference video frame image is obtained by performing layout analysis on the image according to the number of detection boxes of the first category and their sizes (i.e., the area of the image region enclosed by each detection box).

[0134] The following introduces several implementation methods for obtaining the conference view mode through layout analysis based on the object detection result.

[0135] The first method determines the view mode according to the number of detection boxes of the first category, the total area of the current conference video frame image, and the ratio of the area of the image region enclosed by a single detection box of the first category to the area of the image region enclosed by all detection boxes of the first category. Exemplarily, a determination is first made based on the total pixel area of the detection boxes of the first category: if the ratio Q of this total pixel area to the pixel area of the conference video frame image reaches a set threshold a (e.g., 9 / 16), the conference software is considered to have opened the conference screen. Then, the ratio P of the pixel area of each single detection box of the first category to the total area of the first-category detection boxes, and the number A of detection boxes of the first category, are calculated.

[0136] If the number A of detection boxes of the first category = 1 and Q = 1, it is determined as the full-screen conference view or the immersive conference view.

[0137] If the number A of detection boxes of the first category satisfies 1 < A ≤ 4, and the ratio min(P) of the area of the smallest detection box to the area of the image region enclosed by all detection boxes of the first category is greater than or equal to a set threshold b (e.g., 1 / 6), it is determined as the gallery / grid view with no more than 4 grids.

[0138] If the number A of detection boxes of the first category > 4 and the ratio of the area of a single detection box to the total area of the first-category detection boxes is less than a set threshold c (e.g., 1 / 4), it is determined as the gallery / grid view with 5 or more grids.

[0139] If the ratio of the area of the largest detection box of the first category to the area of the image region enclosed by all detection boxes of the first category is greater than a set threshold d (e.g., 1 / 2) and the number A of detection boxes of the first category > 1, it is determined as the speaker view.
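The first method can be sketched as a small classifier over the pixel areas of the first-category detection boxes. Several points are assumptions rather than statements of this application: the rules are tested in the order in which they are listed (so a layout matching both a gallery rule and the speaker rule is reported as gallery), the condition on "a single detection box" for more than four boxes is read as applying to every box, and the function name and returned labels are hypothetical. Exact fractions are used so that the Q = 1 test is not affected by floating-point rounding.

```python
from fractions import Fraction


def classify_view_method1(box_areas, frame_area,
                          a=Fraction(9, 16), b=Fraction(1, 6),
                          c=Fraction(1, 4), d=Fraction(1, 2)):
    """Classify the view mode from first-category box areas (in pixels).

    box_areas  -- list of positive integer pixel areas, one per box
    frame_area -- positive integer pixel area of the whole frame
    a, b, c, d -- example thresholds from the description
    """
    if not box_areas:
        return "unknown"
    total = sum(box_areas)
    q = Fraction(total, frame_area)       # ratio Q
    if q < a:
        return "unknown"                  # conference screen not opened
    count = len(box_areas)                # number A
    shares = [Fraction(area, total) for area in box_areas]  # ratios P
    if count == 1:
        return "full_screen_or_immersive" if q == 1 else "unknown"
    if count <= 4 and min(shares) >= b:
        return "gallery"                  # at most 4 grids
    if count > 4 and max(shares) < c:
        return "gallery"                  # 5 or more grids
    if max(shares) > d:
        return "speaker"
    return "unknown"


if __name__ == "__main__":
    frame = 1920 * 1080
    print(classify_view_method1([frame], frame))
    print(classify_view_method1([960 * 540] * 4, frame))
    print(classify_view_method1([1440 * 1080] + [480 * 270] * 4, frame))
```

Testing the speaker rule before the gallery rules would be an equally defensible ordering; the appropriate choice depends on the layouts produced by the conferencing software in use.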

[0140] The second method determines the view mode based on the number of detection boxes of the first category, the total area of the current conference video frame image (i.e., the total pixel area of the current conference video frame image), and the ratio of the area of the image region selected by a single detection box of the first category to the total area of the current conference video frame image. For example, when the number A of detection boxes of the first category is equal to 1, and the ratio of the area of the image region selected by the detection box of the first category to the current conference video frame image is equal to 1, then the view mode of the current conference video frame is determined to be either full-screen conference view or immersive conference view.

[0141] When the number A of detection boxes in the first category is greater than 1, the proportion of the image region selected by the smallest detection box in the first category to the current conference video frame image is greater than or equal to a set threshold e (e.g., 3 / 12), and the proportion of the image region selected by the largest detection box in the first category to the current conference video frame image is less than a set threshold f (e.g., 3 / 8), then the view mode of the current conference video frame is determined to be gallery view.

[0142] If the number of detection boxes A in the first category is greater than 1, and the proportion of the image region selected by the largest detection box in the first category to the current conference video frame image is greater than the set threshold g (e.g., 2 / 5), then the view mode of the current conference video frame is determined to be speaker view.
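The second method differs from the first in that each box area is compared with the area of the whole frame rather than with the combined area of the first-category boxes. The following minimal sketch uses the example thresholds e, f, and g given above; the function name and the returned labels are hypothetical, and exact fractions are used to keep the equality test reliable.

```python
from fractions import Fraction


def classify_view_method2(box_areas, frame_area,
                          e=Fraction(3, 12), f=Fraction(3, 8),
                          g=Fraction(2, 5)):
    """Classify the view mode from box-to-frame area ratios.

    box_areas  -- list of positive integer pixel areas, one per
                  first-category detection box
    frame_area -- positive integer pixel area of the whole frame
    e, f, g    -- example thresholds from the description
    """
    ratios = [Fraction(area, frame_area) for area in box_areas]
    count = len(ratios)
    if count == 1 and ratios[0] == 1:
        return "full_screen_or_immersive"
    if count > 1:
        if min(ratios) >= e and max(ratios) < f:
            return "gallery"
        if max(ratios) > g:
            return "speaker"
    return "unknown"


if __name__ == "__main__":
    frame = 1920 * 1080
    print(classify_view_method2([frame], frame))
    print(classify_view_method2([960 * 540] * 4, frame))
    print(classify_view_method2([1440 * 1080] + [480 * 270] * 4, frame))
```

Since the example value of f (3 / 8) is smaller than the example value of g (2 / 5), the gallery and speaker conditions cannot both hold, so the order of the two tests does not matter here. With these example thresholds a grid of more than four equal tiles falls below e and is returned as "unknown", which suggests the thresholds would need tuning for larger galleries.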

[0143] Of course, other methods can also be used for layout analysis. For example, image semantic segmentation can be used to analyze the layout of the current meeting video frame. For instance, image semantic segmentation is performed on the current meeting video frame, dividing it into image regions with different semantic content, such as the main screen image region and non-main screen image regions, as well as image regions within the main screen image region containing the video stream content of different participants. Then, the view mode of the current meeting video frame is determined based on the semantic segmentation results. This application does not specifically limit the specific implementation method of layout analysis; an appropriate method can be selected based on the actual situation.

[0144] After obtaining the layout analysis results of the current meeting video frame image through layout analysis, image quality enhancement processing is performed on the current meeting video frame image based on the layout analysis results. Optionally, image quality enhancement processing can be performed in various ways, such as through super-resolution enhancement processing to increase the resolution of the current meeting video frame image, making the image display clearer; or through color enhancement processing to improve the color accuracy of the current meeting video frame image; or through HDR processing to enhance the dynamic range of the current meeting video frame image. The following uses super-resolution enhancement processing as an example to introduce the image quality enhancement processing of the current meeting video frame image.

[0145] For example, firstly, based on the analysis of the main image area and non-main image area, the non-main image area is not subjected to super-resolution enhancement processing.

[0146] It should be explained that the main screen image area mentioned in the embodiments of this application refers to the image area where the main content screen is located. For example, in a video conferencing scenario, the main content screen refers to the video content screen of the conference (e.g., the main stream screen, including the video stream content screen captured by the camera of the local conference terminal and the video stream content screen captured by the camera of the remote conference terminal, as well as the screen content shared by the local or remote conference terminals, i.e., auxiliary stream screens such as text and images); see the image area selected by the detection box of the first category in Figure 4. In other application scenarios, such as ordinary screen projection or display scenarios, the main content screen refers to the playback content screen (e.g., playing...), such as video content, images, or text. The non-main screen image area refers to the image area containing content unrelated to the main content screen. For example, in a meeting scenario, the non-main screen image area refers to the image area outside the main content screen, such as the image area containing UI controls, software pop-ups, desktop backgrounds, taskbars, etc.; see the image area selected by the detection box of the second category in Figure 4. In other application scenarios, such as ordinary screen projection or display scenarios, the non-main screen image area refers to the image area containing content unrelated to the playback content, such as the image area containing software pop-ups, desktop backgrounds, taskbars, etc.

[0147] Different super-resolution enhancement strategies are adopted for different view modes. For example, when the layout analysis determines that the current meeting software is not in full screen and is in a small window display mode, the meeting software window is relatively small (e.g., smaller than the set threshold h, such as the window being smaller than 1 / 2 of the screen), and the meeting content is also relatively small. Even though the resolution is low, it is still clear. In this case, super-resolution enhancement is not performed (see Figure 9).

[0148] For full-screen or immersive meeting view modes, the entire main screen image area is directly used as the target image area, and global super-resolution enhancement is performed on the entire main screen image area (see Figure 10).

[0149] For gallery / grid view modes with 4 or fewer grids, all grid frames in the main image area that meet the size requirement (e.g., area greater than a set threshold j) are identified as target image areas, and super-resolution enhancement is performed on each grid frame that meets the size requirement (see Figure 11).

[0150] For gallery / grid view modes with more than 4 grids, since the number of grids is too large and each grid view is small, it is clear enough even at low resolution, so no global processing is performed (see Figure 12, i.e., no super-resolution enhancement processing is performed on the current video frame image).

[0151] For the speaker view mode, the largest frame in the speaker view is taken as the target image region, and super-resolution enhancement is performed on that frame, while the surrounding smaller frames are not processed (see Figure 13).
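The per-view-mode strategies above can be summarized as a target-region selector. The sketch below is illustrative only: the function name `select_target_regions`, the view-mode labels, and the `(x, y, width, height)` box format are assumptions, and `min_grid_area` stands in for the set threshold j.

```python
from typing import List, Tuple

# Hypothetical box format: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


def box_area(box: Box) -> int:
    return box[2] * box[3]


def select_target_regions(
    view_mode: str,
    main_boxes: List[Box],
    min_grid_area: int,
    max_grids: int = 4,
) -> List[Box]:
    """Return the main-picture boxes that should be super-resolved."""
    if view_mode == "small_window":
        return []  # small window: content is already clear enough
    if view_mode in ("full_screen", "immersive"):
        return list(main_boxes)  # enhance the whole main picture area
    if view_mode == "gallery":
        if len(main_boxes) > max_grids:
            return []  # too many small cells: skip enhancement
        return [b for b in main_boxes if box_area(b) > min_grid_area]
    if view_mode == "speaker":
        # Only the largest frame is enhanced; thumbnails are left alone.
        return [max(main_boxes, key=box_area)] if main_boxes else []
    return []
```

An empty result means the frame has no target image region and is passed through without super-resolution.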

[0152] Specifically, super-resolution enhancement of the target image region can be achieved by masking the non-target image region in the current video frame to obtain a masked image, and then applying the super-resolution enhancement model to the masked image. The formula is shown below:

$$\mathrm{result}_i = \mathrm{residual\_mask}_i \cdot \mathrm{residuals}_i + \mathrm{upscale}(\mathrm{base})$$

[0153] Where residual_mask represents the region enhancement weights after super-resolution analysis, residuals represents the computed residuals of super-resolution enhancement, upscale represents the upsampling operation, and base represents the input image.
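The masked-residual combination can be illustrated with plain Python lists. This is a sketch under stated assumptions: images are single-channel 2D lists, `residuals` and `residual_mask` are already at the output resolution, and nearest-neighbour repetition stands in for whatever upsampling the actual model uses. The names `upscale_nearest` and `masked_super_resolution` are hypothetical.

```python
from typing import List

Image = List[List[float]]  # assumption: single-channel image as a 2D list


def upscale_nearest(base: Image, factor: int) -> Image:
    """Nearest-neighbour upsampling, used here as a stand-in for upscale()."""
    out: Image = []
    for row in base:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def masked_super_resolution(
    base: Image, residuals: Image, residual_mask: Image, factor: int
) -> Image:
    """Compute result = residual_mask * residuals + upscale(base) per pixel."""
    upscaled = upscale_nearest(base, factor)
    return [
        [m * r + u for m, r, u in zip(mask_row, res_row, up_row)]
        for mask_row, res_row, up_row in zip(residual_mask, residuals, upscaled)
    ]
```

Where the mask is 0 (non-target regions), the output is just the upsampled input; where it is 1, the full super-resolution residual is added.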

[0154] In this embodiment, when enhancing the image quality of the current video frame, a layout analysis is first performed. Based on the layout analysis results, different regions within the current video frame are processed differently. For example, no super-resolution enhancement is performed on non-main screen image regions. For the main screen image region, different image quality enhancement strategies are applied based on the meeting view mode obtained from the layout analysis. For instance, for full-screen / immersive view mode, super-resolution enhancement is performed on the main screen image region; for grid / gallery view mode with four or fewer grids, super-resolution enhancement is performed on each grid frame in the main screen image region that meets the size requirement; for grid / gallery view mode with more than four grids, no super-resolution enhancement is performed on the main screen image region; and for the speaker view, only the image region containing the largest frame is subjected to super-resolution enhancement. This achieves region-specific image quality enhancement processing, rather than enhancing or not enhancing the entire video frame image. The image quality enhancement processing in this embodiment is more granular, resulting in better image quality enhancement effects.

[0155] The video stream quality of multiple participants may differ: remote meeting video streams are usually of lower quality, while local meeting video streams are usually of higher quality. For example, the main stream meeting footage includes a local preview (the image of the local meeting room captured by the camera when the local mobile device accesses the meeting terminal), with an actual resolution of 1080P. There are also camera feeds from remote meeting rooms. If a remote Windows meeting terminal joins via BYOM or its built-in camera, the actual resolution is 720P. If a remote Android meeting terminal joins via a large screen, the actual resolution is 360P. If a remote iOS meeting terminal joins, the actual resolution ranges from 360P to 720P.

[0156] Furthermore, cloud conferencing systems have different resolution configurations across different mobile devices and accounts, such as 360P, 540P, 720P, and 1080P, resulting in varying resolutions of the video stream content from different participants. For instance, some participants have activated premium memberships in the cloud conferencing software, resulting in higher image quality, and therefore do not require image enhancement processing. Conversely, participants without premium memberships will have lower image quality in their video frames, requiring more significant image enhancement processing. Therefore, the image quality enhancement method provided in this application's embodiments, in the image quality enhancement processing step, determines the target image region based on the image quality of the participant's video frame image. Moreover, when performing image quality enhancement processing on the target image region, it also considers the image quality of the video frame image, applying greater image quality enhancement processing to low-quality video frame images and less to high-quality video frame images. For example, performing a significant image enhancement process on a 360P conference video frame image increases its resolution from 360P to 1080P, a resolution increase of 720P; while performing a minor image enhancement process on a 720P conference video frame image increases its resolution from 720P to 1080P, a resolution increase of 360P.

[0157] In another example, this embodiment adds an image quality evaluation operation to the specific implementation of the image quality enhancement step. Through this evaluation, an image quality score is obtained for the image region containing the video stream content of each participant in the main image area (the image quality score is used to distinguish the resolution of the image; a higher score indicates a higher resolution). Then, based on the image quality score of the image region containing the video stream content of each participant and the layout analysis results, the target image region of the meeting video frame image is identified. After identifying the target image region, adaptive image quality enhancement processing is performed on the target image region according to the image quality score.

[0158] For example, the layout of the current meeting video frame is analyzed to determine whether the view mode is full-screen or immersive. Then, the image quality of the main screen area is evaluated to obtain a quality score. Based on this score, it is determined whether the main screen area should be considered the target image area. For instance, if the quality score indicates that the resolution of the main screen area is greater than or equal to 1080P, no image quality enhancement is performed on the current meeting video frame; that is, the current meeting video frame does not have a target image area. If the quality score indicates that the resolution of the main screen area is less than 1080P, the entire main screen area is considered the target image area, and then the corresponding super-resolution enhancement model is called to perform image quality enhancement on the target image area based on the quality score.

[0159] If the layout analysis results indicate that the view mode of the current meeting video frame image is a gallery / grid view mode with more than 4 grids, then it is determined that the current meeting video frame image does not have a target image area, and no image quality enhancement processing will be performed on the current meeting video frame image.

[0160] If the layout analysis indicates that the current meeting video frame image is in a 4-grid gallery / grid view mode, and the area size of each grid cell meets the requirements, then the image quality of each grid cell in the main image area is evaluated to obtain an image quality score for each grid cell. The image areas containing the grid cells with substandard image quality scores are designated as target image areas. For example, an image quality score less than 8 indicates an image resolution less than 1080P, and is therefore considered substandard. Then, based on the image quality score of each substandard grid cell, an appropriate super-resolution enhancement model is applied to that grid cell to perform image quality enhancement processing.

[0161] If the layout analysis results indicate that the view mode of the current meeting video frame image is the speaker view mode, the image quality of the largest frame in the speaker view is evaluated. Based on the image quality score obtained from the evaluation, it is determined whether to use the image region where the largest frame is located as the target image region. If the image quality score of the image region where the largest frame is located does not meet the standard, that region is used as the target image region. Then, based on the image quality score, the appropriate super-resolution enhancement model is called to perform image quality enhancement processing on it. The small frames around the largest frame are not subjected to image quality enhancement processing.
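Combining the layout result with per-region quality scores, target selection might look like the sketch below. The `Region` dataclass, the function name `pick_targets`, the view-mode labels, and the `min_area` parameter are assumptions for illustration; the pass mark of 8 follows the example in which a score below 8 is substandard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical record for one sub-image area of the main picture."""
    name: str
    area: int      # pixel area of the region
    score: float   # image quality score from the evaluation step


def pick_targets(
    view_mode: str,
    regions: List[Region],
    pass_score: float = 8.0,
    max_grids: int = 4,
    min_area: int = 0,
) -> List[str]:
    """Return names of regions that are both eligible and substandard."""
    if not regions:
        return []
    if view_mode in ("full_screen", "immersive"):
        candidates = regions
    elif view_mode == "gallery":
        if len(regions) > max_grids:
            return []  # more than 4 grids: no target image area
        candidates = [r for r in regions if r.area > min_area]
    elif view_mode == "speaker":
        candidates = [max(regions, key=lambda r: r.area)]  # largest frame only
    else:
        return []
    # Only regions whose score is below the pass mark need enhancement.
    return [r.name for r in candidates if r.score < pass_score]
```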

[0162] Figure 14 illustrates a flowchart of a specific image quality enhancement step. As shown in Figure 14, an image quality assessment (IQA) is performed on the main image area of the meeting. Based on the quality assessment result (which can also be called the image quality assessment result), a suitable super-resolution enhancement model is called to perform image quality enhancement processing on the target image area. For example, when the current meeting video frame image is a full-screen meeting view or an immersive meeting view, the target image area only displays the video stream content of one participant. In this case, the quality assessment result includes a quality score (also called an image quality score). For example, if the quality score indicates 360P, super-resolution enhancement model A is called to perform super-resolution enhancement processing on the current video frame image, increasing the resolution of the current meeting video frame image to 1080P.

[0163] Figure 15 illustrates a schematic diagram of a conference video frame image. As shown in Figure 15, the current conference video frame image includes the video stream content of four participants: Participant 1, Participant 2, Participant 3, and Participant 4. The resolution of Participant 1's video stream content is 360P, Participant 2's is 540P, Participant 3's is 720P, and Participant 4's is 1080P. A quality assessment result is obtained by evaluating the quality of each image area in the main screen image region. The quality assessment result includes a quality score for the video stream content of the four participants. The quality scores show that the resolutions of the video stream content of the four participants are 360P for Participant 1, 540P for Participant 2, 720P for Participant 3, and 1080P for Participant 4. Since Participant 4's resolution of 1080P already meets the standard, no image quality enhancement processing is required. Based on the image quality score, appropriate super-resolution enhancement models are applied to enhance the video streams of the other three participants. For example, super-resolution enhancement model A is applied to participant 1's video stream, increasing its resolution from 360P to 1080P; super-resolution enhancement model B is applied to participant 2's video stream, increasing its resolution from 540P to 1080P; and super-resolution enhancement model C is applied to participant 3's video stream, increasing its resolution from 720P to 1080P. This achieves regional adaptive image quality enhancement of the conference video frames, providing more detailed enhancement and ensuring effective image quality enhancement.

[0164] Optionally, this application embodiment can set a candidate super-resolution enhancement model set, which includes super-resolution enhancement models for processing different resolutions, such as super-resolution enhancement model A, super-resolution enhancement model B, and super-resolution enhancement model C. Super-resolution enhancement model A is used to process images with a resolution of 360P, enhancing the resolution of the 360P image to 1080P; super-resolution enhancement model B is used to process images with a resolution of 540P, enhancing the resolution of the 540P image to 1080P; and super-resolution enhancement model C is used to process images with a resolution of 720P, enhancing the resolution of the 720P image to 1080P. This application embodiment can select a target super-resolution enhancement model from the candidate super-resolution enhancement model set based on the image quality score, and use the selected target super-resolution enhancement model to perform super-resolution enhancement processing on the image. This avoids blind super-resolution of the conference video frame image and ensures the image quality enhancement effect.
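Selection from the candidate model set can be sketched as a lookup keyed by input resolution. The identifiers `CANDIDATE_MODELS`, `select_model`, and the string model names are placeholders; the behaviour for resolutions not listed in the text (raising `KeyError`) is an assumption, since the application does not specify it.

```python
from typing import Dict, Optional

TARGET_RESOLUTION = 1080  # all three example models upscale to 1080P

# Candidate set from the description: input resolution -> model name.
CANDIDATE_MODELS: Dict[int, str] = {
    360: "model_A",
    540: "model_B",
    720: "model_C",
}


def select_model(resolution: int) -> Optional[str]:
    """Pick a super-resolution model, or None if no enhancement is needed."""
    if resolution >= TARGET_RESOLUTION:
        return None  # already at or above the target resolution
    # Assumption: an unlisted resolution raises KeyError rather than guessing.
    return CANDIDATE_MODELS[resolution]


# The four participants from the Figure 15 example.
participants = {"participant_1": 360, "participant_2": 540,
                "participant_3": 720, "participant_4": 1080}
plan = {name: select_model(res) for name, res in participants.items()}
```

Here `plan` maps participants 1 to 3 to models A, B, and C, and participant 4 to `None`.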

[0165] Image quality can be evaluated in various ways in the embodiments of this application. For example, an image quality evaluation model can be used to evaluate the image quality. Optionally, the image quality evaluation model can be a trained AI model that can establish a mapping between images and image quality scores. That is, the image is used as input to the AI model, and the output is the image quality score. Optionally, the AI model can be a convolutional neural network (CNN) model. Optionally, the image quality score can directly indicate the resolution, that is, the image is input to the AI model, and the AI model outputs the resolution of the image. Alternatively, the image quality score can be 1-10 points, where 8 points or above indicates an image resolution of 1080P or higher, 6-8 points indicates an image resolution of 720P, 4-6 points indicates an image resolution of 540P, and below 4 points indicates an image resolution of 360P.
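The score bands in the example can be written as a simple mapping. The function name `score_to_resolution` is hypothetical, and assigning the boundary scores 4 and 6 to the higher band is an assumption, because the text leaves the boundaries ambiguous.

```python
def score_to_resolution(score: float) -> int:
    """Map a 1-10 image quality score to a nominal resolution tier (in P)."""
    if score >= 8:
        return 1080  # the top band: no enhancement needed
    if score >= 6:
        return 720
    if score >= 4:
        return 540
    return 360
```

A region scoring 5, for instance, is treated as 540P content and would be routed to the model that upscales 540P input.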

[0166] Of course, in some other examples, other image quality assessment methods can also be used, such as the NR-IQA image quality assessment scheme. This application embodiment does not specifically limit the specific image quality assessment method used, and a suitable image quality assessment method can be selected as needed.

[0167] In another example, to further enhance the image quality of the conference video frame, this application embodiment also provides an image quality enhancement method based on enhancement feedback. That is, after enhancing the conference video frame, the image quality enhancement coefficient is adjusted according to the enhancement feedback, and then the adjusted image is further enhanced. This process is iterated to enhance the image quality.

[0168] Figure 16 illustrates the implementation flow of the image quality enhancement step based on enhancement feedback provided in this application embodiment. As shown in Figure 16, compared to the image quality enhancement step shown in Figure 14, the image quality enhancement step shown in Figure 16 adds a quality evaluation of the image after image quality enhancement processing. Based on the comparison of quality scores before and after enhancement, the enhancement effect is evaluated by region. The enhancement coefficient is corrected according to the evaluation result (i.e., the mask region in the image is corrected). Then, image quality enhancement processing continues on the image with the corrected enhancement coefficient. This process is iterated until the enhancement effect evaluation shows a positive improvement, thus ensuring the image quality enhancement effect.

[0169] Figure 17 illustrates the specific implementation of the image quality enhancement step based on enhancement feedback. As shown in Figure 17, the conference video frame image is split into patches, dividing the conference video frame image into multiple image blocks. The quality score a1 of each image block in the conference video frame image before image quality enhancement is obtained using the NR-IQA image quality assessment method. Then, the aforementioned image quality enhancement method is used to enhance the conference video frame image. The quality score a2 of each image block in the conference video frame image after image quality enhancement is obtained again using the NR-IQA image quality assessment method. Based on the comparison of the quality scores of the same image blocks before and after enhancement (i.e., the comparison of quality score a1 and quality score a2), the enhancement feedback mask is obtained. The positive feedback area continues to be enhanced, while the enhancement coefficient of the negative feedback area is gradually reduced. In this way, the mask area of the conference video frame image is adjusted to make the mask area more accurate, thereby ensuring the image quality enhancement effect of the conference video frame image.
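One feedback update over the image blocks could be sketched as follows. The function name `update_enhancement_coefficients`, the multiplicative `decay` factor, and the choice to treat an unchanged score as negative feedback are all assumptions; the text only states that positive-feedback blocks keep being enhanced and that the coefficient of negative-feedback blocks is gradually reduced.

```python
from typing import List


def update_enhancement_coefficients(
    coefficients: List[float],
    scores_before: List[float],   # a1: per-block score before enhancement
    scores_after: List[float],    # a2: per-block score after enhancement
    decay: float = 0.5,           # hypothetical reduction factor
) -> List[float]:
    """Keep coefficients of improved blocks; shrink those of the others."""
    updated = []
    for coeff, a1, a2 in zip(coefficients, scores_before, scores_after):
        if a2 > a1:
            updated.append(coeff)          # positive feedback: keep enhancing
        else:
            updated.append(coeff * decay)  # negative feedback: reduce strength
    return updated
```

Calling this once per iteration, with fresh a1 and a2 scores each time, gradually lowers the enhancement strength only in blocks where enhancement does not help.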

[0170] Optionally, the enhancement feedback mask can also be derived from a VMAF index, which here indicates the degree of degradation of the enhanced image patch, i.e., how much the quality score of the enhanced image patch is reduced. Image patches with large VMAF indices are identified as negative feedback regions, and image patches with small VMAF indices are identified as positive feedback regions.

[0171] It should be noted that the video stream content of the participants mentioned in the embodiments of this application refers to the video stream content shared by the participants through screen projection from their mobile terminals. This video stream content can be the content of an auxiliary stream scenario or the content of a main stream scenario. An auxiliary stream scenario is a screen-sharing scenario, in which the content is screen content shared by local or remote video conferencing terminals, such as text and images. A main stream scenario is one in which the content is the video stream captured by the participant's camera (usually the camera of the video conferencing terminal called by the participant's mobile device, although the mobile device's own camera is not excluded).

[0172] In main stream scenarios, video stream resolution is typically 360P to 1080P, with images of people and furniture as the main content. In auxiliary stream scenarios, video stream resolution is above 1080P, with text and icons as the main content. Auxiliary stream content experiences minimal quality loss during encoding, decoding, compression, and transmission; therefore, image enhancement processing is unnecessary. To address this, the image enhancement method provided in this application adds a scene recognition step before the layout analysis step. By performing scene recognition on the current meeting video frame image, it identifies whether the displayed image is an auxiliary stream scene or a main stream scene. Once an auxiliary stream scene is identified, super-resolution processing is skipped to prevent image degradation.

[0173] Figures 18 and 19 show video frame images of the main stream scene and the auxiliary stream scene, respectively.

[0174] Figure 20 is a schematic diagram of the implementation flow of another image quality enhancement method provided in this application. As shown in Figure 20, the difference between this method and the method described in Figure 3 above is that a scene recognition step is added. The video frame images identified as an auxiliary stream scene are not subjected to quality enhancement processing, while the video frame images identified as a main stream scene are processed in subsequent steps.

[0175] After the video conferencing terminal decodes the conference video frame images, the content of the main stream and auxiliary stream scenes differs significantly. The main stream scene has low resolution and high frame rate, while the auxiliary stream scene has high resolution and low frame rate. Therefore, a frame difference method can be used to identify the main stream and auxiliary stream scenes. For example, the number of video frames with differences less than a set threshold λ among K-frame conference video frames can be detected. If the number of video frames is greater than a set threshold 0, the current conference video frame is determined to be an auxiliary stream scene; otherwise, it is a main stream scene. Of course, other suitable scene recognition methods can also be used to identify auxiliary stream and main stream scenes, such as using scene category classification algorithms, or combining YUV image distribution features with strategies to identify main and auxiliary stream scenes. This application does not specifically limit the specific method of scene recognition; appropriate scene recognition methods can be adopted as needed.
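The frame difference method can be sketched with pure-Python grayscale frames. This is only an illustration: the mean absolute pixel difference is an assumed difference measure, and the names `mean_abs_diff`, `is_auxiliary_stream`, `diff_threshold` (standing in for λ), and `count_threshold` (standing in for the second set threshold) are hypothetical.

```python
from typing import Sequence

Frame = Sequence[Sequence[int]]  # assumption: grayscale frame as rows of pixels


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two same-sized frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count if count else 0.0


def is_auxiliary_stream(
    frames: Sequence[Frame], diff_threshold: float, count_threshold: int
) -> bool:
    """Classify K consecutive frames as auxiliary stream if enough are static."""
    static_frames = sum(
        1
        for prev, cur in zip(frames, frames[1:])
        if mean_abs_diff(prev, cur) < diff_threshold
    )
    return static_frames > count_threshold
```

Shared slides or documents change rarely, so most consecutive frame pairs fall under the difference threshold, whereas camera video changes almost every frame.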

[0176] Figure 21 is a flowchart illustrating the image quality enhancement method provided in this application embodiment. This method can be executed by any device, equipment, platform, or device cluster with computing capabilities. This application embodiment does not specifically limit the specific computing device executing the method; a suitable computing device can be selected as needed. For example, it can be implemented by the video conferencing terminal shown in Figure 2, meaning the image quality enhancement method provided in this application embodiment can be deployed as software on the video conferencing terminal to provide image quality enhancement services to users. Alternatively, it can be implemented by a server. The server communicates with the video conferencing terminal, sending the enhanced video stream to the video conferencing terminal to clearly display the video stream content on the large screen of the video conferencing terminal. The following describes the specific implementation of the image quality enhancement method provided in this application embodiment, using the video conferencing terminal as the execution subject. As shown in Figure 21, the image quality enhancement method provided in this application embodiment includes at least steps S2101 to S2103.

[0177] In step S2101, the conference video frame image is acquired.

[0178] Generally speaking, participants' mobile devices (such as PCs) and video conferencing terminals cannot obtain the video conferencing stream (such as the original camera video stream). What they can obtain is only the video stream data of the cloud conferencing software's display screen. Without obtaining the video conferencing stream, it is impossible to use conventional methods to enhance the image quality of the conference video frames.

[0179] This application embodiment acquires conference video frame image data through screen image capture. For example, the participant's mobile device uses a screen capture module to capture the screen display content to obtain a screen video stream. Then, the video stream is encoded and sent to the video conferencing terminal. The video conferencing terminal receives the video stream sent by the mobile device and decodes the received stream to obtain the individual conference video frame image data in the conference video frame stream.

[0180] In step S2102, the layout analysis of the conference video frame image is performed to obtain the layout analysis result. The layout analysis result indicates the main screen image area and non-main screen image area of the conference video frame image. The main screen image area includes multiple sub-image areas.

[0181] After acquiring the conference video stream data, the following steps are used to enhance the image quality of each conference video frame, resulting in enhanced conference video frame images and improving the clarity of the image display on the large screen of the video conferencing terminal.

[0182] In one example, after acquiring the meeting video frame images, layout analysis is performed directly on the meeting video frame images to obtain the layout analysis results. The layout analysis results indicate the main image area and non-main image areas of the meeting video frame images. For the specific implementation of layout analysis, please refer to the detailed description of layout analysis above; for the sake of brevity, it will not be repeated here.

[0183] In another example, after acquiring the conference video frame image, scene recognition is performed on the conference video frame image. If the conference video frame image is identified as an auxiliary stream scene, no image quality enhancement processing is performed on it, and processing moves on to the next conference video frame image; if the conference video frame image is identified as a main stream scene, layout analysis is then performed on it.

[0184] For specific scene recognition methods, please refer to the detailed description of scene recognition methods above. For the sake of brevity, it will not be repeated here.

[0185] In step S2103, different image quality enhancement strategies are used to process sub-image regions with different image qualities in multiple sub-image regions.

[0186] In this embodiment, the image quality of multiple sub-image regions can be obtained in various ways, such as determining the image quality of multiple sub-image regions based on the account information of each participant. The account information of each participant includes whether their account has activated a membership, and the membership level (e.g., basic member, intermediate member, advanced member, premium member, etc.). For example, the resolution of the video stream displayed by a participant with an unactivated membership is 360P, that of a participant with an intermediate membership is 540P, that of a participant with an advanced membership is 720P, and that of a participant with a premium membership is 1080P. Therefore, the image quality enhancement method provided in this embodiment can obtain the account information of each participant, and then determine, based on that account information, the image quality of the display screen of each participant's meeting video stream content, that is, the image quality of the multiple sub-image areas.
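The account-based approach amounts to a lookup from membership level to nominal resolution. In the sketch below, the dictionary keys, the name `sub_region_quality`, and the participant identifiers are hypothetical; only the four level-to-resolution pairs come from the example above, and the basic level is omitted because the text gives no resolution for it.

```python
from typing import Dict

# Example mapping from the description (membership level -> resolution in P).
MEMBERSHIP_TO_RESOLUTION: Dict[str, int] = {
    "none": 360,          # membership not activated
    "intermediate": 540,
    "advanced": 720,
    "premium": 1080,
}


def sub_region_quality(accounts: Dict[str, str]) -> Dict[str, int]:
    """Map each participant's sub-image area to a resolution via account level."""
    return {
        participant: MEMBERSHIP_TO_RESOLUTION[level]
        for participant, level in accounts.items()
    }
```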

[0187] In another example, the image quality of multiple sub-image regions can also be obtained through image quality evaluation. In this application embodiment, there are various image quality evaluation methods, such as using an image quality evaluation model. The conference video frame image is input into the image quality evaluation model, and the model outputs the image quality score for each image region in the conference video frame image. This image quality score can be directly the resolution, or it can be a score from 0 to 10. A higher score indicates a higher resolution, and a lower score indicates a lower resolution. Each score or score range can correspond to a resolution, thus achieving the purpose of obtaining the resolution of each image region in the conference video frame image through image quality scoring.

[0188] Optionally, the image quality evaluation model can be a CNN model.

[0189] For example, the NR-IQA image quality assessment scheme can also be used to assess the image quality of conference video frames. This application does not specifically limit the particular image quality assessment method used; a suitable method can be selected as needed.

[0190] Finally, based on the image quality evaluation results, the main image area of the conference video frame is subjected to regional adaptive enhancement processing. The image areas with low image quality are enhanced more significantly, for example, enhancing the 360P video stream content of the participants to 1080P, which is an enhancement of 720P; the image areas with high image quality are enhanced less significantly, for example, enhancing the 720P video stream content of the participants to 1080P, which is an enhancement of 360P.

[0191] For the specific steps involved in image enhancement, please refer to the detailed description above. For the sake of brevity, they will not be repeated here.

[0192] Of course, in some other examples, image quality can be evaluated not by resolution, but by other dimensions, such as color and contrast.

[0193] The image quality enhancement method provided in this application embodiment can be deployed in software on the conferencing video terminal side to enhance the conferencing screen on the terminal side. It solves the common pain point of unclear content display in conferencing software in the industry, and does not rely on third-party conferencing software. It is compatible with existing conferencing software devices and is easy to promote.

[0194] This application uses a video conferencing scenario as an example to illustrate the specific implementation of the image quality enhancement method. The implementation in other scenarios (such as ordinary screen projection scenarios and display sending scenarios) is similar, and the description of the video conferencing scenario can be used as a reference.

[0195] Figure 22 shows a schematic diagram of the implementation process of the image quality enhancement method provided in this application embodiment when applied to a normal screen projection scenario.

[0196] For example, a smaller screen projection terminal (such as a mobile phone or PC) can project its displayed content (such as videos, images, PPT documents, etc.) onto a larger display terminal (such as a smart TV). The display terminal performs layout analysis on the projected content, analyzing the content type (images, software UI, text, etc.) and simultaneously performing a no-reference quality evaluation on different layouts to obtain various quality scores. This allows for adaptive enhancement of low-quality layout areas using appropriate super-resolution models, while high-quality layout areas remain unchanged from the original image. The evaluation criteria for high and low quality need to consider the display resolution. For example, if the display resolution is 1080P, then 360P to 720P is considered low quality, and 1080P and above are considered high quality, without the need for super-resolution.
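
A minimal sketch of the display-relative rule in this example follows. It assumes each layout area has already been assigned a resolution by the no-reference evaluation; the action labels and function names are hypothetical.

```python
from typing import Dict


def needs_super_resolution(region_resolution: int, display_resolution: int) -> bool:
    """A layout area counts as low quality when it is below the display resolution."""
    return region_resolution < display_resolution


def plan_layout_processing(layout_resolutions: Dict[str, int],
                           display_resolution: int) -> Dict[str, str]:
    """Decide per layout area whether to super-resolve or keep the original."""
    return {
        name: "super_resolve"
        if needs_super_resolution(resolution, display_resolution)
        else "keep_original"
        for name, resolution in layout_resolutions.items()
    }
```

For a 1080P display, areas at 360P to 720P are marked for super-resolution, while areas at 1080P and above are left unchanged.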

[0197] Based on the same concept as the aforementioned embodiment of the image quality enhancement method, this application also provides an image quality enhancement device 2300, which can be deployed on a server or terminal device (e.g., a video conferencing terminal) to provide image quality enhancement services. The image quality enhancement device 2300 includes units or modules for implementing the various steps of the image quality enhancement method shown in Figures 3-22.

[0198] Figure 23 is a schematic diagram of an image quality enhancement device provided in an embodiment of this application. As shown in Figure 23, the image quality enhancement device 2300 includes an acquisition module 2301, an analysis module 2302, and an image quality enhancement module 2303. The acquisition module 2301 is used to acquire the image to be processed; the analysis module 2302 is used to perform layout analysis on the image to be processed to obtain layout analysis results. The layout analysis results indicate the main image area and non-main image areas of the image to be processed. The main image area is the image area where the main content screen is located, and the non-main image area is the image area where content screens unrelated to the main content screen are located. The main image area includes multiple sub-image areas; the image quality enhancement module 2303 is used to process sub-image areas with different image quality using different image quality processing strategies.

[0199] In one possible implementation, the image to be processed is the first video frame image in the conference video stream; the multiple sub-image regions in the main screen image region are the image regions where the video conference views of multiple participants are located.

[0200] In another possible implementation, the layout analysis results also indicate the view layout of the main screen image area. The view layout indicates the positional distribution of the video conferencing views of the multiple participants in the main screen image area and the area size of each of those video conferencing views. The image quality enhancement module 2303 is specifically used to: determine several sub-image areas to be enhanced from the multiple sub-image areas based on the image quality and area size of each video conferencing view; perform no image quality enhancement processing on the sub-image areas that are not to be enhanced; and perform different degrees of image quality enhancement processing on sub-image areas to be enhanced that have different image qualities.

[0201] In another possible implementation, based on the image quality and area size of each video conferencing view, a specific implementation of determining several sub-image regions to be enhanced from multiple sub-image regions is as follows: the video conferencing view of multiple participants whose image quality is lower than a first threshold and whose area size is greater than a second threshold is determined as the target video conferencing view; the image region where the target video conferencing view is located is determined as the sub-image region to be enhanced.
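
The two-threshold selection above can be sketched as follows. The `View` record and its field names are hypothetical, and the units of quality and area are left to the caller.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class View:
    participant: str
    quality: float  # image quality of the view, e.g. a resolution or a score
    area: float     # area size of the view, e.g. in pixels


def select_views_to_enhance(views: List[View],
                            first_threshold: float,
                            second_threshold: float) -> List[View]:
    """Keep views whose quality is below the first threshold and
    whose area is above the second threshold."""
    return [v for v in views
            if v.quality < first_threshold and v.area > second_threshold]
```

Small thumbnails are skipped even when their quality is low, since only views that satisfy both conditions are selected.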

[0202] In another possible implementation, a specific method for performing different levels of image quality enhancement processing on the sub-image regions to be enhanced with different image qualities in several sub-image regions to be enhanced is as follows: based on the image quality of each sub-image region to be enhanced, a target image quality enhancement model is selected for each sub-image region to be enhanced from a set of candidate image quality enhancement models. The set of candidate image quality enhancement models includes multiple image quality enhancement models, and different image quality enhancement models among the multiple image quality enhancement models are used to perform different levels of image quality enhancement processing on images with different image qualities; the target image quality enhancement model corresponding to each sub-image region to be enhanced is called to perform image quality enhancement processing on each sub-image region to be enhanced.
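
One way to picture the selection from a candidate set is sketched below. The candidate entries, the model labels, and the rule of picking the first model whose supported input quality covers the region are all assumptions made for illustration; the text does not specify how the candidate set is keyed.

```python
from typing import List, Optional, Tuple

# Hypothetical candidate set: (highest input resolution handled, model label),
# ordered from the strongest enhancement to the lightest.
CANDIDATE_MODELS: List[Tuple[int, str]] = [
    (360, "strong_enhancement"),
    (540, "medium_enhancement"),
    (720, "light_enhancement"),
]


def select_target_model(region_resolution: int) -> Optional[str]:
    """Pick the first candidate that covers the region's resolution.

    Returns None when the region is above every candidate's range,
    meaning that no enhancement model is selected for it.
    """
    for max_input, label in CANDIDATE_MODELS:
        if region_resolution <= max_input:
            return label
    return None
```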

[0203] In another possible implementation, a specific method for calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region is as follows: A masking operation is performed on the image regions outside each sub-image region to be enhanced in the first video frame image to obtain the masked first video frame image; the masked first video frame image is used as the input to the target image enhancement model corresponding to each sub-image region to be enhanced, and the output is the first video frame image after image enhancement processing for each sub-image region to be enhanced. That is, when enhancing the image quality of each sub-image region to be enhanced, the image regions outside each sub-image region to be enhanced are masked, thereby adjusting the enhancement coefficients of different image regions. The enhancement coefficient of the masked image regions is set to 0, and the enhancement coefficient of the unmasked image regions is set to 1, thus avoiding the influence of the image regions outside each sub-image region to be enhanced on the image enhancement processing.
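
The masking step and the 0/1 enhancement coefficients can be illustrated with a small pure-Python sketch on a grayscale frame stored as nested lists. The box convention (top, left, bottom, right, with exclusive bottom and right) and the function names are assumptions; the enhancement model itself is not shown.

```python
from typing import List, Tuple

Frame = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def build_mask(height: int, width: int, box: Box) -> Frame:
    """Coefficient 1 inside the region to be enhanced, 0 everywhere else."""
    top, left, bottom, right = box
    return [[1 if top <= y < bottom and left <= x < right else 0
             for x in range(width)]
            for y in range(height)]


def apply_mask(frame: Frame, mask: Frame) -> Frame:
    """Zero out the pixels outside the region before feeding the model."""
    return [[pixel * m for pixel, m in zip(frame_row, mask_row)]
            for frame_row, mask_row in zip(frame, mask)]


def blend(original: Frame, enhanced: Frame, mask: Frame) -> Frame:
    """Take the enhanced pixel where the coefficient is 1, the original where it is 0."""
    return [[m * e + (1 - m) * o for o, e, m in zip(o_row, e_row, m_row)]
            for o_row, e_row, m_row in zip(original, enhanced, mask)]
```

In this sketch `apply_mask` produces the model input, and `blend` ensures that pixels outside the region are unaffected by whatever the model outputs there.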

[0204] In another possible implementation, calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region to be enhanced further includes: determining the image quality score comparison of each image block in the first video frame image after masking before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the image quality score comparison to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.

[0205] In another possible implementation, before applying different image enhancement strategies to sub-image regions of varying quality within the multiple sub-image regions, the process further includes: performing image quality evaluation on the multiple sub-image regions within the main image region to obtain an image quality score for each sub-image region. This image quality score is used to assess the overall image quality of each sub-image region. Thus, obtaining the image quality of each sub-image region through image quality evaluation facilitates subsequent adaptive image enhancement based on image quality.

[0206] Optionally, the image quality score for each sub-image region includes the resolution score for each sub-image region.

[0207] In another possible implementation, calling the target image enhancement model to perform image enhancement processing on the image region to be enhanced also includes: determining the VMAF index of each image block in the first video frame image after masking, before and after image enhancement processing, the index indicating the degree of degradation of each image block before and after image enhancement processing; adjusting the mask image region of the first video frame image after masking based on the VMAF index to obtain the first video frame image after masking adjustment; and calling the target image enhancement model to perform image enhancement processing on the first video frame image after masking adjustment.
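
The mask adjustment driven by a per-block comparison can be sketched as below. The per-block scores are taken as given inputs (they could come from the image quality score comparison or from a VMAF computation, neither of which is implemented here), and the rule of removing a block from the mask when its score drops after enhancement is an assumption about how the comparison is used.

```python
from typing import List

Grid = List[List[float]]


def adjust_block_mask(block_mask: List[List[int]],
                      score_before: Grid,
                      score_after: Grid,
                      tolerance: float = 0.0) -> List[List[int]]:
    """Keep a block in the mask only if enhancement did not degrade it.

    A block stays at 1 when it was already 1 and its score after enhancement
    is at least its score before enhancement minus the tolerance.
    """
    return [[1 if m == 1 and after >= before - tolerance else 0
             for m, before, after in zip(m_row, b_row, a_row)]
            for m_row, b_row, a_row in zip(block_mask, score_before, score_after)]
```

The adjusted mask would then be used for a second enhancement pass on the first video frame image.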

[0208] In another possible implementation, the image quality enhancement device 2300 provided in this application further includes a scene recognition module 2304, which is used to perform scene recognition on the first video frame image and determine that the scene in which the first video frame image is located is a mainstream scene.

[0209] In another possible implementation, the scene recognition module 2304 is also used to: determine that the scene in which the first video frame image is located is an auxiliary stream scene, and then not perform image quality enhancement processing on the first video frame image.

[0210] In another possible implementation, the scene recognition module 2304 is specifically used to: detect, among K video frames that include the first video frame image, the number of video frames whose differences are less than a third threshold, where K is a positive integer greater than 1; and determine the scene in which the first video frame image is located based on the number of video frames whose differences are less than the third threshold. In this way, the mainstream scene and the auxiliary stream scene are effectively identified through the frame difference method.

[0211] In another possible implementation, a specific way to determine the scene of the first video frame image based on the number of video frames with differences less than the third threshold is as follows: when the number of video frames with differences less than the third threshold is less than the fourth threshold, the scene of the first video frame image is determined to be the mainstream scene; when the number of video frames with differences less than the third threshold is greater than or equal to the fourth threshold, the scene of the first video frame image is determined to be the auxiliary stream scene.
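
The frame difference decision can be sketched as follows. Frames are represented as flat lists of pixel values, and the difference measure (mean absolute difference between consecutive frames) is an assumption, since the text does not fix how the difference is computed.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[int], b: Sequence[int]) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_scene(frames: List[Sequence[int]],
                   third_threshold: float,
                   fourth_threshold: int) -> str:
    """Count near-static frames and classify the stream.

    Few near-static frames suggests live camera video (mainstream scene);
    many suggests shared slides or documents (auxiliary stream scene).
    """
    static_count = sum(
        1 for prev, cur in zip(frames, frames[1:])
        if mean_abs_diff(prev, cur) < third_threshold
    )
    return "mainstream" if static_count < fourth_threshold else "auxiliary"
```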

[0212] In another possible implementation, the analysis module 2302 is specifically used to: perform target detection on the first video frame image to obtain the target detection result; and determine the layout analysis result based on the target detection result.

[0213] In another possible implementation, the target detection result includes information on multiple detection boxes, which includes image regions selected by a first category of detection boxes and image regions selected by a second category of detection boxes. The image regions selected by the first category of detection boxes are the image regions where the video conferencing views are located in the first video frame image, and the image regions selected by the second category of detection boxes are the image regions in the first video frame image where non-video conferencing views are located. A specific implementation for determining the layout analysis result based on the target detection result is as follows: if the number of detection boxes in the first category is equal to 1, and the proportion of the image region selected by that detection box to the first video frame image is equal to 1, then the view layout of the first video frame image is determined to be a full-screen conference view or an immersive conference view; if the number of detection boxes in the first category is greater than 1, and the image regions selected by the detection boxes in the first category are all of the same size, then the view layout of the first video frame image is determined to be a grid view; if the number of detection boxes in the first category is greater than 1, and the proportion of the image region selected by the smallest detection box in the first category to the first video frame image is less than the fifth threshold, then the view layout of the first video frame image is determined to be a gallery view; if the number of detection boxes in the first category is greater than 1, and the proportion of the image region selected by the largest detection box in the first category to the first video frame image is greater than the sixth threshold, then the view layout of the first video frame image is determined to be a speaker view.
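
The layout rules above can be sketched as a small classifier over the first-category detection boxes. Boxes are given as hypothetical (width, height) pairs. Because the conditions as worded can overlap, the order in which they are checked here (full screen, grid, speaker, gallery) is an assumption, as is the "unknown" fallback.

```python
from typing import List, Tuple


def classify_layout(boxes: List[Tuple[int, int]],
                    frame_width: int,
                    frame_height: int,
                    fifth_threshold: float,
                    sixth_threshold: float) -> str:
    """Classify the view layout from first-category boxes given as (width, height)."""
    if not boxes:
        return "unknown"
    frame_area = frame_width * frame_height
    areas = [w * h for w, h in boxes]
    if len(boxes) == 1:
        # A single box covering the whole frame: proportion equal to 1.
        return "full_screen_or_immersive" if areas[0] == frame_area else "unknown"
    if len(set(boxes)) == 1:
        return "grid"      # all views have the same size
    if max(areas) / frame_area > sixth_threshold:
        return "speaker"   # one dominant view
    if min(areas) / frame_area < fifth_threshold:
        return "gallery"
    return "unknown"
```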

[0214] In another possible implementation, the first video frame image is obtained by capturing the screen image of a user device, which is the device used by the participant and is connected to the video conferencing terminal.

[0215] In another possible implementation, the image to be processed is the second video frame image in the projected video stream data or the video stream data sent for display.

[0216] The image quality enhancement device 2300 according to the embodiments of this application can be used to execute the methods described in the embodiments of this application. The above and other operations and/or functions of each module in the image quality enhancement device 2300 respectively implement the corresponding processes of the methods in Figures 3-22. For the sake of brevity, they will not be described again here.

[0217] This application also provides a computing device including at least one processor, a memory, and a communication interface, wherein the processor is used to execute the method described in Figures 3-22.

[0218] Figure 24 is a schematic diagram of the structure of the computing device provided in the embodiment of this application.

[0219] As shown in Figure 24, the computing device 2400 includes at least one processor 2401, a memory 2402, and a communication interface 2403. The processor 2401, memory 2402, and communication interface 2403 are communicatively connected, which can be achieved via a wired (e.g., bus) or wireless connection. The communication interface 2403 is used to send data to and/or receive data from other devices. The memory 2402 stores computer instructions, which the processor 2401 executes to perform the methods described in the aforementioned method embodiments, thereby achieving effective image quality enhancement for display screens with low image quality.

[0220] It should be understood that, in the embodiments of this application, the processor 2401 may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0221] The memory 2402 may include read-only memory and random access memory, and provides instructions and data to the processor 2401. The memory 2402 may also include non-volatile random access memory. Optionally, the random access memory may be, for example, high bandwidth memory (HBM).

[0222] The memory 2402 can be volatile memory or non-volatile memory, or it can include both. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous linked dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).

[0223] It should be understood that the computing device 2400 according to the embodiments of this application can execute the method shown in Figures 3-22 of the embodiments of this application. For a detailed description of the implementation of the method, please refer to the above text. For the sake of brevity, it will not be repeated here.

[0224] Embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, wherein when the computer program is executed by a processor, the aforementioned method is implemented.

[0225] An embodiment of this application provides a chip including at least one processor and an interface, wherein the at least one processor obtains program instructions or data through the interface; the at least one processor is used to execute the program instructions to implement the method mentioned above.

[0226] Embodiments of this application provide a computer program or computer program product that includes instructions that, when executed, cause a computer to perform the methods mentioned above.

[0227] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0228] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented using hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0229] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of this application. It should be understood that the above description is only a specific embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. An image quality enhancement method, characterized in that the method comprises: obtaining an image to be processed; performing layout analysis on the image to be processed to obtain a layout analysis result, wherein the layout analysis result indicates a main screen image area and a non-main screen image area of the image to be processed, the main screen image area is the image area where the main content screen is located, the non-main screen image area is the image area where content screens unrelated to the main content screen are located, and the main screen image area includes multiple sub-image areas; and using different image quality enhancement strategies to process sub-image regions with different image qualities among the multiple sub-image regions.

2. The method according to claim 1, characterized in that, The image to be processed is the first video frame image in the conference video stream; The multiple sub-image areas in the main screen image area are the image areas where the video conference views of multiple participants are located.

3. The method according to claim 2, characterized in that, The layout analysis results also indicate the view layout of the main screen image area, which indicates the positional distribution of the video conferencing views of the multiple participants in the main screen image area and the area size of each video conferencing view in the video conferencing views of the multiple participants. The use of different image quality enhancement strategies to process sub-image regions with different image qualities among the multiple sub-image regions includes: Based on the image quality and area size of each video conferencing view, several sub-image regions to be enhanced are determined from the multiple sub-image regions. No image quality enhancement processing is performed on the sub-image regions, among the multiple sub-image regions, that are not to be enhanced; Different degrees of image quality enhancement processing are performed on sub-image regions to be enhanced that have different image qualities among the several sub-image regions to be enhanced.

4. The method according to claim 3, characterized in that, Based on the image quality and area size of each video conferencing view, several sub-image regions to be enhanced are determined from the plurality of sub-image regions, including: The video conference view with image quality lower than a first threshold and area larger than a second threshold among multiple participants is identified as the target video conference view. The image region where the target video conference view is located is determined as the sub-image region to be enhanced.

5. The method according to claim 3 or 4, characterized in that, The performing of different degrees of image quality enhancement processing on sub-image regions to be enhanced that have different image qualities among the several sub-image regions to be enhanced includes: Based on the image quality of each of the several sub-image regions to be enhanced, a target image quality enhancement model is selected for each sub-image region to be enhanced from a set of candidate image quality enhancement models. The set of candidate image quality enhancement models includes multiple image quality enhancement models, and different image quality enhancement models are used to perform different degrees of image quality enhancement processing on images with different image quality. The target image quality enhancement model corresponding to each sub-image region to be enhanced is invoked to perform image quality enhancement processing on each sub-image region to be enhanced.

6. The method according to claim 5, characterized in that, The step of calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region to be enhanced includes: A masking operation is performed on the image region outside each sub-image region to be enhanced in the first video frame image to obtain the first video frame image after masking. The first video frame image after the masking process is used as the input to the target image enhancement model corresponding to each sub-image region to be enhanced, and the first video frame image after the image enhancement process is applied to each sub-image region to be enhanced is output.

7. The method according to claim 6, characterized in that, The step of calling the target image enhancement model corresponding to each sub-image region to be enhanced to perform image enhancement processing on each sub-image region to be enhanced further includes: The image quality scores of each image block in the first video frame image after the masking process are compared before and after the image quality enhancement process. Based on the image quality score comparison, the mask image region of the first video frame image after masking is adjusted to obtain the first video frame image after masking adjustment. The target image quality enhancement model is invoked to perform image quality enhancement processing on the first video frame image after the mask adjustment.

8. The method according to any one of claims 1-7, characterized in that, The process of applying different image enhancement strategies to sub-image regions of different image quality across the multiple sub-image regions also includes: The image quality of multiple sub-image regions in the main image area is evaluated to obtain an image quality score for each sub-image region. The image quality score of each sub-image region is used to evaluate the image quality of each sub-image region.

9. The method according to claim 8, characterized in that, The image quality score for each sub-image region includes the resolution score for each sub-image region.

10. The method according to any one of claims 2-9, characterized in that, Before performing layout analysis on the image to be processed to obtain the layout analysis results, the process also includes: Scene recognition is performed on the first video frame image; The scene in which the first video frame image is located is determined to be a mainstream scene.

11. The method of claim 10, further comprising: If the scene in which the first video frame image is located is determined to be an auxiliary stream scene, then no image quality enhancement processing is performed on the first video frame image.

12. The method according to claim 10 or 11, characterized in that, The scene recognition of the first video frame image includes: Detect, among K video frames that include the first video frame image, the number of video frames whose differences are less than a third threshold, where K is a positive integer greater than 1; The scene in which the first video frame image is located is determined based on the number of video frames whose differences are less than the third threshold.

13. The method according to claim 12, characterized in that, Determining the scene of the first video frame image based on the number of video frames with a difference less than a third threshold includes: When the number of video frames with a difference less than the third threshold is less than the fourth threshold, the scene in which the first video frame image is located is determined to be the mainstream scene. When the number of video frames with a difference less than the third threshold is greater than or equal to the fourth threshold, the scene in which the first video frame image is located is determined to be an auxiliary stream scene.

14. The method according to any one of claims 2-13, characterized in that, The step of performing layout analysis on the image to be processed to obtain layout analysis results includes: Target detection is performed on the first video frame image to obtain the target detection result; Based on the target detection results, the layout analysis results are determined.

15. The method according to claim 14, characterized in that, The target detection result includes multiple detection box information, which includes the image region selected by a first type of detection box and the image region selected by a second type of detection box. The image region selected by the first type of detection box is the image region where the video conferencing view is located in the first video frame image, and the image region selected by the second type of detection box is the image region where the non-video conferencing view is located in the first video frame image. The determination of the layout analysis result based on the target detection result includes: When the number of detection boxes in the first category is equal to 1, and the proportion of the image area selected by the detection boxes in the first category to the first video frame image is equal to 1, then the view layout of the first video frame image is determined to be a full-screen conference view or an immersive conference view. When the number of detection boxes in the first category is greater than 1, the proportion of the image region selected by the smallest detection box in the first category to the first video frame image is greater than or equal to the fifth threshold, and the proportion of the image region selected by the largest detection box in the first category to the first video frame image is less than the sixth threshold, then the view layout of the first video frame image is determined to be a gallery view. If the number of detection boxes in the first category is greater than 1, and the proportion of the image region selected by the largest detection box in the first category to the first video frame image is greater than the sixth threshold, then the view layout of the first video frame image is determined to be the speaker view.

16. The method according to any one of claims 2-15, characterized in that, The first video frame image is obtained by capturing the screen image of a user device, which is the device used by the participant and is communicatively connected to the video conferencing terminal.

17. The method according to any one of claims 1-16, characterized in that, The image to be processed is the second video frame image in the screen-projected video stream data or the video stream data sent for display.

18. An image quality enhancement device, characterized in that the device comprises: an acquisition module, used to acquire an image to be processed; an analysis module, used to perform layout analysis on the image to be processed to obtain a layout analysis result, wherein the layout analysis result indicates a main screen image area and a non-main screen image area of the image to be processed, the main screen image area is the image area where the main content screen is located, the non-main screen image area is the image area where content screens unrelated to the main content screen are located, and the main screen image area includes multiple sub-image areas; and an image quality enhancement module, used to process sub-image regions with different image qualities among the multiple sub-image regions using different image quality processing strategies.

19. A computing device, comprising a memory and a processor, characterized in that, The memory stores instructions that, when executed by a processor, cause the method described in any one of claims 1-17 to be implemented.

20. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it causes the method as described in any one of claims 1-17 to be implemented.