Quality evaluation method based on self-supervised feature extraction, storage medium and terminal
By using self-supervised training with collaborative dual autoencoders, the problem of low feature extraction efficiency in no-reference quality assessment is solved, achieving efficient extraction of quality-related features, improving assessment accuracy and generalization ability, and promoting the development of the digital image and video production industry.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN UNIV
- Filing Date
- 2023-02-17
- Publication Date
- 2026-06-26
AI Technical Summary
Existing no-reference quality assessment algorithms struggle to achieve high performance on images with complex distortions and rich content. Traditional autoencoders cannot effectively extract distortion features from images, and existing methods rely on scarce subjective label data.
We design a collaborative dual autoencoder that, through self-supervised training, extracts content features from content images and distortion features from distorted images. Collaborative optimization of the two autoencoders achieves efficient extraction of quality-related features and avoids dependence on subjective labels.
It enables efficient extraction of quality-related features from large-scale unlabeled sample data, improving the accuracy and generalization ability of no-reference quality assessment and guiding the development of digital image and video production workflows.
[Figure: abstract drawing, CN116416216B_ABST]
Description
Technical Field
[0001] This invention belongs to the field of digital image processing and computer vision technology, and specifically relates to a quality evaluation method, storage medium, and terminal based on self-supervised feature extraction.
Background Technology
[0002] With the continuous development of society, economy, and science and technology, while enjoying the convenience of digital media dissemination, people are increasingly demanding high-resolution, high-frame-rate, wide-brightness, and wide-color-gamut digital images and videos. To improve the consumer experience, the digital image and video production industry is constantly utilizing advanced hardware acquisition technologies and digital image processing software technologies to produce exquisite and high-quality images and videos. The final experience that the produced digital images and videos bring to consumers is determined by their intuitive visual effects presented on terminal devices. Therefore, consumers' subjective evaluation of visual quality has guiding significance for the development and progress of the digital image and video production industry.
[0003] Objective quality assessment methods aim to use computational models to simulate the human visual system (HVS) and evaluate given images or videos in a manner consistent with human visual characteristics, achieving high real-time performance, high repeatability, and high cost-effectiveness. The significance of objective quality assessment research lies in providing feedback information to the digital image and video production process through appropriate objective quality assessment methods, thereby promoting the rapid development of the production industry to meet the ever-increasing consumer demands.
[0004] Objective evaluation models can be categorized into full-reference (FR), reduced-reference (RR), and no-reference (NR) models, depending on whether they require the complete reference image or video, partial information from it, or no reference information at all. No-reference quality assessment, due to the lack of reference information, is more challenging to study but has the widest range of applications. Compared with reference-based methods, no-reference algorithms must estimate the distortion characteristics from the distorted image alone and use them as the basis for quality assessment. Early no-reference image quality assessment algorithms either focused on a single type of distortion, such as JPEG compression, JPEG2000 compression, or blurring, or combined several quality-affecting factors, such as sharpness, contrast, noise, and blockiness, for quality estimation. These algorithms typically had limited applicability. Later researchers proposed more general evaluation algorithms, mainly following two approaches: Natural Scene Statistics (NSS) and dictionary construction. The first and most mainstream approach, NSS, assumes that neural processing in the HVS is adapted to natural visual information and that the features of natural images follow certain typical statistical distributions. Once a natural image is distorted, its quality changes and the statistical distribution of its features changes accordingly. The statistics of the extracted feature distributions are therefore correlated with quality, and this correlation is used to learn a mapping from features to quality. The most commonly used tool for learning this mapping is Support Vector Regression (SVR).
In recent years, commonly used techniques for NSS-based quality feature extraction include local luminance normalization (mean-subtracted contrast-normalized coefficients, MSCN), the discrete wavelet transform (DWT), the discrete cosine transform (DCT), and mixed-domain transforms. Commonly used fitting distributions include the Generalized Gaussian Distribution (GGD) and the Asymmetric Generalized Gaussian Distribution (AGGD). However, these NSS-based methods rely heavily on manually designed features and on assumed prior statistical distributions, making them difficult to apply to diverse or complex distortions and rich image content. They therefore struggle to perform well in more complex image quality assessment (IQA) tasks. The other general no-reference approach is based on dictionary construction: a dictionary built by unsupervised learning encodes input image patches into features, which are then mapped to quality scores. However, dictionary construction is cumbersome and must be redone from scratch for each new image set, making it unsuitable for practical applications. Furthermore, the dictionary encoding operates only on image patches and has no design specific to distortion characteristics, resulting in low efficiency. Therefore, neither NSS-based nor dictionary-based evaluation algorithms can achieve satisfactory performance in more demanding quality assessment tasks.
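As a rough illustration of the NSS-style features discussed above, the following sketch computes MSCN coefficients with NumPy. It is not taken from the patent: the box-shaped local window, the window radius, and the stabilizing constant `c` are assumptions chosen for brevity (published NSS methods typically use a Gaussian-weighted window), and the function names are hypothetical.

```python
import numpy as np

def local_mean(img, radius):
    """Local mean over a (2*radius+1)^2 box window with edge padding.
    The box window is an assumption made to keep the sketch short."""
    size = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.float64)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (size * size)

def mscn(img, radius=3, c=1.0):
    """Mean-subtracted contrast-normalized coefficients of a grayscale image."""
    img = np.asarray(img, dtype=np.float64)
    mu = local_mean(img, radius)
    var = local_mean(img * img, radius) - mu * mu
    sigma = np.sqrt(np.maximum(var, 0.0))  # clip tiny negative values from rounding
    return (img - mu) / (sigma + c)        # c (assumed value) avoids division by zero
```

An NSS method would then fit a distribution such as the GGD to these coefficients and use the fitted parameters as quality features.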
[0005] With the breakthrough performance improvements achieved by deep neural networks (DNNs) in image recognition and classification tasks, researchers have begun to explore their application to image quality assessment. The most intuitive approach is to use sample data with subjective score labels (mean opinion scores, MOS) for end-to-end network training, simultaneously optimizing feature extraction and score mapping. However, data with subjective score labels is scarce (the largest existing labeled image quality assessment database contains only tens of thousands of images), so it is difficult to design networks deep enough to improve the generalization performance of assessment models under end-to-end training. To alleviate this problem, researchers have proposed various learning strategies to make full use of existing labeled data. Besides the data augmentation techniques commonly used in other tasks, researchers often use additional tasks to assist quality assessment, including multi-task learning of distortion types and degrees, gradient extraction, image inpainting to obtain reference images, content awareness aided by segmentation tasks, ImageNet pre-trained networks, multi-database training, meta-learning tasks, and so on. While these algorithms alleviate the problem of insufficient sample size to some extent, they still fail to achieve targeted training of the network on large-scale samples.
[0006] To design deeper neural networks and obtain no-reference quality assessment algorithms with sufficiently strong generalization performance, some researchers have begun to train the feature extraction module and the score regression module of a deep network separately. The feature extraction module is first trained on a large number of samples without subjective score labels using weakly supervised or unsupervised methods. The fixed feature extractor is then combined with the regression module and fine-tuned on a small number of samples with subjective score labels. This type of algorithm is called the two-stage learning method in this invention. The surrogate labels designed for the feature extraction module are quality correlation maps (such as SSIM maps), pseudo-label scores computed by an FR algorithm from distorted images and their reference images, or ranking labels derived for distorted image pairs from known priors or FR algorithm results, in which case pairwise training is performed with structures such as Siamese networks. Besides serving as the basis for ranking labels, distortion type priors can also be used to pre-train feature extractors for synthetic distortions, or as labels for contrastive feature learning on classification networks. However, surrogate labels based on prior knowledge require the distortion information to be clearly describable, so the labeled data is unlikely to cover complex and diverse distortions, while surrogate labels based on FR algorithm results depend on the accuracy and generalization ability of the selected FR algorithm. If the FR algorithm's estimates are inaccurate, the training of the feature extractor is degraded.
[0007] Therefore, it is of great significance to design a framework that can be fully trained on large-scale unlabeled sample data and effectively extract quality-related features, without the training process being constrained by subjective labels or by manually designed or assigned labels, and then to use the trained quality-feature encoder to train a regressor and obtain an efficient no-reference quality evaluator.
[0008] In a traditional autoencoder, the decoder's input comes only from its own encoder, and the encoder's input serves as the reconstruction target for the decoder's output. In this case, the autoencoder merely performs dimensionality reduction during encoding and decoding, without extracting or separating content features and distortion features. The extracted features are mixed features, and nothing drives this type of autoencoder to pay more attention to the distortion information in the image.
Summary of the Invention
[0009] The purpose of this invention is to provide a quality evaluation method, storage medium, and terminal based on self-supervised feature extraction to solve the above-mentioned technical problems.
[0010] Therefore, this invention provides a quality evaluation method based on self-supervised feature extraction, comprising:
[0011] A collaborative dual autoencoder is constructed. After self-supervised training, the collaborative dual autoencoder extracts content features from the input content image and distortion features from the input distortion image. The input content image and the input distortion image are the same image or have the same content.
[0012] A content feature vector is obtained based on the content features, and a distortion feature vector is obtained based on the distortion features.
[0013] The content feature vector and the distortion feature vector are concatenated to obtain the quality feature vector;
[0014] The quality feature vector is passed through a fully connected layer to obtain a predicted quality score. The fully connected layer is trained on sample data with subjective opinion scores.
[0015] Preferably, the construction of the collaborative dual autoencoder includes constructing a content autoencoder and constructing a distortion autoencoder, wherein the construction of the content autoencoder includes constructing a content encoder and constructing a content decoder, and the construction of the distortion autoencoder includes constructing a distortion encoder and constructing a distortion decoder. The content encoder extracts content features from the input content image, and the content decoder decodes the content features to obtain a reconstructed content image. The distortion encoder extracts distortion features from the input distortion image, and the distortion decoder decodes the content features and distortion features to obtain a reconstructed distortion image.
[0016] Preferably, the self-supervised training of the collaborative dual autoencoder includes self-supervised training of the content autoencoder and self-supervised training of the distortion autoencoder.
[0017] The self-supervised training of the content autoencoder includes:
[0018] The content autoencoder is trained using a distortion-free image as training data, and the output constraint of the content decoder is the distortion-free image itself.
[0019] The content autoencoder is trained using distorted images as training data. Distortion-free images that share the same content as the distorted images serve as constraints on the output of the content decoder, and the content features of the distortion-free images serve as constraints on the content features of the distorted images.
[0020] The self-supervised training of the distortion autoencoder includes:
[0021] The distortion autoencoder is trained using the distorted image as training data. The output constraint of the distortion decoder is the distorted image itself. The loss function guides backpropagation to optimize the distortion autoencoder.
[0022] Preferably, the output constraints of the content decoder and the distortion decoder are functions of:
[0023] $l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r)$ (1)
[0024] Where $l_{overall}$ is the global constraint, $\mu$ is the balance parameter, $l_{pixel}$ is the pixel-level constraint, $I_o$ is the distorted or undistorted image, $I_r$ is the reconstructed content image or the reconstructed distorted image, and $l_{percp}$ is the perceptual constraint.
[0025] [Equation image defining the pixel-level constraint $l_{pixel}$ not available]
[0026] Where N is the total number of pixels, and k is the index of a pixel in the image;
[0027] [Equation image defining the perceptual constraint $l_{percp}$ not available]
[0028] Where the feature mapping in the formula is the j-th selected VGGNet network feature layer, L is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the j-th feature layer.
[0029] Preferably, when performing self-supervised training on the content autoencoder, sparsity constraints and / or distance constraints are also applied to the extracted content features.
[0030] Preferably, the construction of the distortion encoder includes first extracting different feature maps from multiple layers of the input distortion image, then feeding each feature map into a spatial pyramid pooling module for fusion to obtain multiple low-dimensional features of fixed length, and finally concatenating the multiple low-dimensional features as the final output of the distortion encoder.
[0031] Preferably, obtaining the content feature vector based on the content features includes passing the content features through a fusion network composed of convolutional layers and global pooling layers to obtain the content feature vector.
[0032] Preferably, constructing the distortion decoder includes:
[0033] The distortion features are passed through a fully connected layer to obtain extended features, which include the distortion features of each layer and the association information characterizing the relationship between the distortion features of each layer;
[0034] The extended feature is divided into multiple decomposition features, and these multiple decomposition features are used sequentially as one of the inputs to the sub-modulation residual block.
[0035] In addition, the present invention also provides a computer-readable storage medium storing one or more programs that can be executed by one or more processors to implement the steps in the quality assessment method based on self-supervised feature extraction as described above.
[0036] Furthermore, the present invention also provides a terminal, comprising: a processor and a memory, wherein the memory stores a computer-readable program that can be executed by the processor; when the processor executes the computer-readable program, it implements the steps in the quality evaluation method based on self-supervised feature extraction as described above.
[0037] Compared with the prior art, the features and beneficial effects of the present invention are as follows:
[0038] (1) This invention designs a collaborative dual autoencoder. First, a content autoencoder is trained primarily with high-quality, distortion-free sample data to obtain a content encoder that can encode content features weakly correlated with distortion information. Then, a distortion autoencoder is trained primarily with distorted sample data, and the aforementioned content features are used as auxiliary information to assist in the extraction of distortion features. Through the collaborative optimization of the two autoencoders, efficient extraction of quality-related features is achieved. The output of the collaborative dual autoencoder is trained under supervision by the input itself. This process does not require any MOS labels and belongs to a fully self-supervised mode. This invention utilizes a feature extraction method based on self-supervised learning, which reduces the dependence on data samples with subjective score labels and avoids the dependence on manually designed or artificially selected surrogate labels. It utilizes a rich variety of training samples to fully train the feature extractor.
[0039] (2) This invention designs a no-reference quality assessment method based on self-supervised learning feature extraction. It utilizes a content encoder and a distortion encoder trained on a quality feature extraction framework based on a collaborative autoencoder to extract content features and distortion features from the input samples, respectively. A lightweight regression network is then designed based on these two features to predict the quality score. This invention employs an efficient and universal quality assessment model, providing guidance for the digital image and video production process and possessing significant research value in promoting the development of the digital image and video production industry.
Attached Figure Description
[0040] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0041] Figure 1 is a flowchart illustrating the quality evaluation method based on self-supervised feature extraction of the present invention;
[0042] Figure 2(a) is a scatter plot colored by the distortion type of the test images;
[0043] Figure 2(b) is a scatter plot colored by the quality score of the test images;
[0044] Figure 3 is a performance comparison chart of no-reference quality assessment algorithms;
[0045] Figure 4 is a schematic diagram of the overall framework of the collaborative dual autoencoder of the present invention;
[0046] Figure 5 is a schematic diagram of the network structure of the content autoencoder of the present invention;
[0047] Figure 6 is a schematic diagram of the network structure of the distortion encoder of the present invention;
[0048] Figure 7 is a schematic diagram of the network structure of the distortion decoder of the present invention;
[0049] Figure 8 is a schematic diagram of the no-reference quality assessment network framework based on self-supervised feature extraction;
[0050] Figure 9 is a structural schematic diagram of the terminal device provided by the present invention.
[0051] Reference numerals: 11-Content encoder, 12-Content decoder, 21-Distortion encoder, 22-Distortion decoder, 30-Processor, 31-Display screen, 32-Memory, 33-Communication interface, 34-Bus.
Detailed Implementation
[0052] This invention provides a quality evaluation method, storage medium, and terminal based on self-supervised feature extraction. To make the objectives, technical solutions, and effects of this application clearer and more explicit, the following detailed description is provided with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining this application and are not intended to limit this application.
[0053] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this application’s specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when the invention refers to an element as being “connected” or “coupled” to another element, it may be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein may include wireless connections or wireless coupling. The term “and/or” as used herein includes any one of, and all combinations of, one or more of the associated listed items.
[0054] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0055] Existing quality assessment methods that depend on subjective score labels are insufficiently accurate, and traditional autoencoders extract a mixture of content and distortion features without being driven to focus on the distortion information in images. To address these problems, this invention proposes a novel self-supervised learning approach that efficiently extracts quality-related features, and then further trains the resulting feature extractor on labeled data to obtain an efficient no-reference quality assessment model.
[0056] As shown in Figure 1, the quality assessment method based on self-supervised feature extraction includes the following steps:
[0057] S10. Construct a collaborative dual autoencoder. After self-supervised training, the collaborative dual autoencoder extracts content features from the input content image and distortion features from the input distortion image. The input content image and the input distortion image are the same image or have the same content.
[0058] Specifically, constructing a collaborative dual autoencoder includes constructing a content autoencoder and constructing a distortion autoencoder. The construction of the content autoencoder includes constructing a content encoder 11 and a content decoder 12, and the construction of the distortion autoencoder includes constructing a distortion encoder 21 and a distortion decoder 22. The content encoder 11 extracts content features from the input content image, and the content decoder 12 decodes the content features to obtain a reconstructed content image. The distortion encoder 21 extracts distortion features from the input distortion image, and the distortion decoder 22 decodes the content features and distortion features to obtain a reconstructed distortion image.
[0059] The process of constructing the distortion encoder 21 includes first extracting different feature maps from multiple layers of the input distortion image, then feeding each feature map into the spatial pyramid pooling module for fusion to obtain multiple low-dimensional features of fixed length, and finally concatenating the multiple low-dimensional features as the final output of the distortion encoder 21.
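The pooling and concatenation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented network: max pooling, the pyramid levels `(1, 2, 4)`, and the function names are hypothetical, and the multi-layer feature maps are assumed to be supplied as NumPy arrays of shape `(C, H, W)` with H and W at least as large as the finest pyramid level.

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map over n x n grids for each level n.
    The output length C * sum(n*n) does not depend on H or W."""
    c, h, w = fmap.shape
    pooled = []
    for n in levels:
        # Bin edges that cover the whole map even when H or W is not divisible by n.
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = fmap[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)

def distortion_encode(feature_maps, levels=(1, 2, 4)):
    """Concatenate the fixed-length vectors pooled from each layer's feature map."""
    return np.concatenate([spatial_pyramid_pool(f, levels) for f in feature_maps])
```

Because every level pools over a fixed number of cells, feature maps taken from layers with different spatial resolutions all yield vectors whose length depends only on their channel count, which is what allows them to be concatenated into one output.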
[0060] The construction of the distortion decoder 22 includes: 1. The distortion features are passed through a fully connected layer to obtain extended features. The extended features include the distortion features of each layer and the correlation information representing the relationship between the distortion features of each layer. 2. The extended features are divided into multiple decomposition features, and the multiple decomposition features are used as one of the inputs of the sub-modulation residual blocks in turn.
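The two decoder steps above can be sketched as follows. Only the expansion through a fully connected layer and the split into per-block features come from the text; how each block consumes its feature is not specified in this section, so the channel-wise scale-and-shift modulation below, together with all names and sizes, is a hypothetical stand-in.

```python
import numpy as np

def expand_and_split(distortion_feat, weight, bias, num_blocks):
    """Step 1: a fully connected layer expands the distortion feature.
    Step 2: the extended feature is divided into one piece per residual block."""
    extended = weight @ distortion_feat + bias
    return np.split(extended, num_blocks)  # raises if the length is not divisible

def modulated_residual_block(x, code):
    """Hypothetical modulation (assumption): the first half of `code` scales each
    channel of x, the second half shifts it, and the result is added to x."""
    scale, shift = np.split(code, 2)
    return x + scale[:, None, None] * x + shift[:, None, None]

def decode_with_codes(x, codes):
    """Feed the decomposition features to successive blocks in order."""
    for code in codes:
        x = modulated_residual_block(x, code)
    return x
```

In this sketch each code has twice as many entries as `x` has channels; a zero code leaves the block an identity mapping.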
[0061] The self-supervised training of the collaborative dual autoencoder includes self-supervised training of the content autoencoder and self-supervised training of the distortion autoencoder. The self-supervised training of the content autoencoder includes: (1) training the content autoencoder using a distortion-free image as training data, with the output constraint of the content decoder 12 being the distortion-free image itself; (2) training the content autoencoder using a distorted image as training data, with a distortion-free image sharing the same content as the distorted image serving as the constraint on the output of the content decoder 12, and the content features of the distortion-free image serving as the constraint on the content features of the distorted image. The self-supervised training of the distortion autoencoder includes: training the distortion autoencoder using a distorted image as training data, with the output constraint of the distortion decoder 22 being the distorted image itself, and the loss function guiding backpropagation to optimize the distortion autoencoder.
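The three training cases above differ only in which image (and which features) act as the target. The sketch below makes that explicit; it assumes a mean squared error as a stand-in for the output constraint and a hypothetical weight `lam` on the feature constraint, neither of which is taken from the patent.

```python
import numpy as np

def mse(a, b):
    """Mean squared error, used here as a stand-in for the output constraint."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def content_loss_clean(recon, clean):
    """Content autoencoder, case (1): the input is distortion-free and is its own target."""
    return mse(recon, clean)

def content_loss_distorted(recon, clean, feat_distorted, feat_clean, lam=1.0):
    """Content autoencoder, case (2): the input is distorted, the target is the
    distortion-free image with the same content, and the content features of the
    distorted input are pulled toward those of the distortion-free image.
    `lam` is an assumed weighting, not a value from the source."""
    return mse(recon, clean) + lam * mse(feat_distorted, feat_clean)

def distortion_loss(recon, distorted):
    """Distortion autoencoder: the distorted input is its own target."""
    return mse(recon, distorted)
```

None of the three losses uses a subjective score, which is why the procedure is described as self-supervised.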
[0062] The output constraint functions for content decoder 12 and distortion decoder 22 are:
[0063] $l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r)$ (1)
[0064] Where $l_{overall}$ is the global constraint, $\mu$ is the balance parameter, $l_{pixel}$ is the pixel-level constraint, $I_o$ is the distorted or distortion-free image, $I_r$ is the reconstructed content image or the reconstructed distorted image, and $l_{percp}$ is the perceptual constraint.
[0065] [Equation image defining the pixel-level constraint $l_{pixel}$ not available]
[0066] Where N is the total number of pixels, and k is the index of a pixel in the image;
[0067] [Equation image defining the perceptual constraint $l_{percp}$ not available]
[0068] Where the feature mapping in the formula is the j-th selected VGGNet network feature layer, L is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the j-th feature layer.
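Formula (1) can be sketched numerically as follows. Because the exact forms of the pixel-level and perceptual constraints are not recoverable here, the sketch assumes a mean squared error over the N pixels and a squared feature difference normalized by the feature map size and averaged over the L layers. The `feature_layers` callables are hypothetical stand-ins for the selected VGGNet layers, and the default `mu=0.5` is an assumed value.

```python
import numpy as np

def pixel_loss(i_o, i_r):
    """Assumed form: mean squared error over all N pixels."""
    return float(np.mean((i_o - i_r) ** 2))

def perceptual_loss(i_o, i_r, feature_layers):
    """Assumed form: squared feature difference per layer, normalized by the
    feature map size C_j * H_j * W_j and averaged over the L selected layers."""
    total = 0.0
    for phi in feature_layers:
        f_o, f_r = phi(i_o), phi(i_r)
        total += np.sum((f_o - f_r) ** 2) / f_o.size
    return float(total / len(feature_layers))

def overall_loss(i_o, i_r, feature_layers, mu=0.5):
    """Formula (1): mu balances the pixel-level and perceptual constraints."""
    return mu * pixel_loss(i_o, i_r) + (1 - mu) * perceptual_loss(i_o, i_r, feature_layers)
```

Setting `mu` to 1 reduces the constraint to the pixel-level term alone, and setting it to 0 leaves only the perceptual term.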
[0069] When performing self-supervised training on the content autoencoder, sparsity constraints and / or distance constraints are also applied to the extracted content features.
[0070] S20. Obtain the content feature vector based on the content features, and obtain the distortion feature vector based on the distortion features. Obtaining the content feature vector based on the content features involves passing the content features through a fusion network composed of convolutional layers and global pooling layers to obtain the content feature vector.
[0071] S30. Concatenate the content feature vector and the distortion feature vector to obtain the quality feature vector.
[0072] S40. The quality feature vector is passed through a fully connected layer to obtain the predicted quality score. The fully connected layer is trained on sample data with subjective opinion scores.
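Steps S20 to S40 can be sketched as a small regression head. This is a sketch under assumptions rather than the patented network: the fusion network is reduced to a single 1x1 convolution with a ReLU followed by global average pooling, the distortion feature vector is taken as already computed, and all names and shapes are hypothetical.

```python
import numpy as np

def content_vector(content_feat, conv_weight):
    """S20: fuse (C, H, W) content features into a vector.
    A 1x1 convolution (assumed kernel size) mixes channels, then global
    average pooling removes the spatial dimensions."""
    mixed = np.einsum("oc,chw->ohw", conv_weight, content_feat)
    mixed = np.maximum(mixed, 0.0)  # ReLU, an assumed activation
    return mixed.mean(axis=(1, 2))

def predict_quality(content_feat, distortion_vec, conv_weight, fc_weight, fc_bias):
    """S30: concatenate both vectors. S40: a fully connected layer maps the
    quality feature vector to a scalar score."""
    quality_vec = np.concatenate([content_vector(content_feat, conv_weight), distortion_vec])
    return float(fc_weight @ quality_vec + fc_bias)
```

In the method described here only this head needs samples with subjective opinion scores; the encoders that produce `content_feat` and `distortion_vec` are trained beforehand without labels.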
[0073] Effective quality-related feature extraction can be achieved using the distortion encoder 21 of the distortion autoencoder. The distortion features are verified using the feature visualization tool t-SNE. In one specific implementation, 779 distorted images and 29 distortion-free images from the LIVE database are used as test samples. This dataset includes five distortion types (Gaussian blur (GB), white noise (WN), JPEG compression, JP2K compression, and Rayleigh fast fading (FF)), and each image has a corresponding quality score (the higher the score, the lower the quality). t-SNE maps each high-dimensional feature to a point on a 2D plane in an unsupervised manner, thus obtaining a total of 779 points on the 2D plane. The points are then colored according to the distortion type and the quality score of the test images, resulting in the scatter plots shown in Figure 2. As shown in Figure 2(a), the distortion-free images, lacking distortion, exhibit fewer feature activations and are all concentrated near the origin. The remaining points cluster well by distortion type, especially white noise and JPEG compression, which have the greatest impact on human visual perception. This indicates that the extracted distortion features can effectively characterize the distortion type. As shown in Figure 2(b), the points of the best-quality images are concentrated in the central region; the higher the quality of an image, the closer its point is to the center, and the lower the quality, the farther its point is from the center. This indicates a high correlation between the extracted distortion features and the quality score.
[0074] The quality evaluator trained by this invention achieves industry-leading performance. In one specific implementation, the proposed quality evaluation model (QACoAE) is trained on five mainstream image quality evaluation databases. 80% of the data in each database is used as the training set and 20% as the test set, with no shared image content between the two sets. The Pearson linear correlation coefficient (PLCC) and Spearman rank correlation coefficient (SRCC) between the predicted scores and the subjective evaluation scores on the test set are used as metrics of prediction performance; higher coefficient values indicate better linearity and monotonicity of the predicted scores, respectively. The evaluation model is trained 10 times on each database, and the average PLCC and SRCC are calculated and compared with classic and currently mainstream no-reference evaluation algorithms. As shown in Figure 3, the proposed algorithm achieves excellent performance on both synthetic and real distortion databases, and is the best in the final weighted average performance comparison.
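For reference, the two metrics named above can be computed as in the following sketch, which uses only the Python standard library. The function names are hypothetical, and ties are given average ranks, a common convention that the source does not specify.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): PLCC of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A predictor whose scores are a monotonic but nonlinear function of the subjective scores reaches an SRCC of 1 while its PLCC stays below 1, which is why the two metrics are reported together.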
[0075] The specific framework for self-supervised feature extraction is a collaborative dual autoencoder, as shown in Figure 4. Since the final visual quality of an image is related to both its content and the distortion it has suffered, this invention encodes image content and distortion information separately. First, this invention designs a content autoencoder to encode image content information. To extract content features weakly correlated with distortion, the learning target for the input image is a distortion-free image. Next, a distortion autoencoder is designed to encode image distortion information. Unlike in a traditional autoencoder, the distortion decoder 22 of this distortion autoencoder receives not only the distortion features output by the distortion encoder 21 but also the content features encoded by the content encoder 11. With the content features as auxiliary information, and constrained to output low-dimensional features, the distortion encoder 21 can selectively encode distortion information. The content autoencoder and the distortion autoencoder in this invention collaborate on reconstruction; therefore, this novel framework is called a collaborative dual autoencoder. This framework can extract content features and distortion features from the input image separately. These features are highly correlated with image quality and can be used efficiently in subsequent no-reference quality evaluation. The specific structures and designs of the two autoencoders are described below.
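The data flow described above can be sketched with toy linear layers. Everything here is a hypothetical stand-in: the layer sizes, the random weights, and in particular the use of concatenation to hand both feature sets to the distortion decoder are assumptions made for illustration, not the patented architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyLinear:
    """Stand-in for an encoder or decoder network (assumption: one linear map)."""
    def __init__(self, n_in, n_out):
        self.w = rng.standard_normal((n_out, n_in)) * 0.1

    def __call__(self, x):
        return self.w @ x

def collaborative_forward(image, content_enc, content_dec, distortion_enc, distortion_dec):
    """One forward pass through both autoencoders for a flattened input image."""
    content_feat = content_enc(image)
    distortion_feat = distortion_enc(image)          # deliberately low-dimensional
    recon_content = content_dec(content_feat)        # target: distortion-free image
    # The distortion decoder sees both feature sets (concatenation is assumed here).
    recon_distorted = distortion_dec(np.concatenate([content_feat, distortion_feat]))
    return recon_content, recon_distorted            # second target: the input itself

content_enc, content_dec = ToyLinear(64, 16), ToyLinear(16, 64)
distortion_enc, distortion_dec = ToyLinear(64, 4), ToyLinear(16 + 4, 64)
image = rng.random(64)
recon_content, recon_distorted = collaborative_forward(
    image, content_enc, content_dec, distortion_enc, distortion_dec)
```

Because the content features already carry the image content, the narrow distortion code only has to supply what is still missing to reconstruct the distorted input, which is the intuition behind calling the two autoencoders collaborative.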
[0076] Content autoencoder
[0077] Traditional autoencoders extract the main information from the input data, which in the case of images is content information. This invention designs a simple CNN-based autoencoder to extract content features from images. The network structure of the content autoencoder is shown in Figure 5. In this embodiment, the content features are first obtained through a content encoder 11 consisting of three convolutional layers and four basic residual blocks. These content features have 256 channels, so that they can better represent a wide variety of image content. Then, the content features are passed through a content decoder 12 consisting of four basic residual blocks and three deconvolutional layers to reconstruct the input image. It should be noted that Figure 5 illustrates only one implementation. The content encoder 11 and the content decoder 12, taken individually, are existing technologies well known to those skilled in the art. The content encoder 11 is not limited to three convolutional layers and four basic residual blocks, and the content decoder 12 is not limited to four basic residual blocks and three deconvolutional layers. Furthermore, the content features are not limited to 256 channels.
[0078] To ensure that the features extracted by the network for both lossless and distorted images are content features, this invention incorporates two constraints during network training: the first constraint is on the output of the content decoder 12, and the second constraint is on the content features extracted by the content encoder 11. It should be noted that, if the first constraint exists, the second constraint can be added to enhance its effectiveness.
[0079] Whether the image is undistorted or distorted, this invention aims to extract features that are relevant to the content and as unrelated to distortion as possible. Therefore, this invention uses only undistorted images (or the original undistorted version of distorted images) as constraints for the decoder output. In image generation tasks, to obtain a visually better reconstruction, the first constraint typically combines two kinds of constraints, a pixel-level constraint and a perceptual constraint, into an overall constraint, so that the reconstructed image better approximates the target image, as shown in formula (1):
[0080] $l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r)$ (1)
[0081] Here, $l_{overall}$ represents the overall constraint; μ is a balance parameter between the pixel-level and perceptual constraints; $l_{pixel}$ represents the pixel-level constraint, most commonly the mean squared error (MSE); $I_o$ represents the distorted or undistorted image; $I_r$ represents the reconstructed content image or the reconstructed distorted image; and $l_{percp}$ represents the perceptual constraint, typically obtained by comparing the differences between features extracted layer by layer from a pre-trained network such as VGGNet.
[0082]
[0083] Where N is the total number of pixels, and k is the index of a pixel in the image;
[0084]
[0085] In formula (3), the feature term denotes the VGGNet feature layer selected as the j-th layer, L is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the j-th feature layer.
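The way the overall constraint in formula (1) combines its two terms can be sketched as follows. This is a minimal illustration under stated assumptions: images are flattened to lists of pixel values, the pixel term is taken to be the mean squared error over N pixels, and the perceptual term is assumed to be a per-layer squared feature difference normalized by the feature-map size and summed over the L selected layers. The exact forms of formulas (2) and (3) are not reproduced in this text, and the feature lists stand in for precomputed VGGNet outputs; all function names are hypothetical.

```python
def pixel_mse(target, recon):
    """Pixel-level constraint: mean squared error over the N pixels."""
    n = len(target)
    return sum((t - r) ** 2 for t, r in zip(target, recon)) / n

def perceptual(target_feats, recon_feats):
    """Assumed perceptual constraint: squared feature difference per selected
    layer, divided by that layer's size, summed over the selected layers."""
    total = 0.0
    for ft, fr in zip(target_feats, recon_feats):
        size = len(ft)  # stands in for C_j * H_j * W_j of layer j
        total += sum((a - b) ** 2 for a, b in zip(ft, fr)) / size
    return total

def overall_loss(target, recon, target_feats, recon_feats, mu=0.5):
    """Formula (1): mu * l_pixel + (1 - mu) * l_percp."""
    return (mu * pixel_mse(target, recon)
            + (1 - mu) * perceptual(target_feats, recon_feats))

# Toy target image I_o and reconstruction I_r, flattened to four pixels.
i_o = [0.0, 1.0, 0.5, 0.25]
i_r = [0.0, 0.5, 0.5, 0.25]
# Toy flattened feature maps from two hypothetical selected layers.
feats_o = [[1.0, 2.0], [0.0, 0.0, 3.0]]
feats_r = [[1.0, 1.0], [0.0, 0.0, 3.0]]
loss = overall_loss(i_o, i_r, feats_o, feats_r, mu=0.5)
```

With μ = 0.5 the two terms are weighted equally; moving μ toward 1 favors pixel fidelity, and moving it toward 0 favors agreement in the feature space.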
[0086] Regarding the second constraint, firstly, to avoid overcomplete encoding due to high-dimensional representation leading to simple pixel-level copying of content features, this invention applies a sparsity constraint to the extracted features, meaning the average activation value of each channel of the feature is constrained to be within a certain small value. Furthermore, when the input image of the content autoencoder is a distorted image, in addition to constraining its target image to be its distortion-free version, this invention can also use features extracted from the distortion-free version of the distorted image to constrain the features extracted from the distorted image. The loss function is obtained by calculating the distance between features, i.e., a distance constraint.
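The two feature-level constraints described above can be sketched in plain Python. The source does not specify the exact form of either penalty, so the hinge-style sparsity term, the threshold `rho`, and the mean squared feature distance below are assumptions made for illustration only.

```python
def channel_means(feat):
    """Average activation of each channel; feat is a nested list [C][H][W]."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def sparsity_penalty(feat, rho=0.05):
    """Assumed sparsity constraint: penalize each channel whose average
    activation exceeds the small value rho."""
    return sum(max(0.0, m - rho) for m in channel_means(feat))

def distance_penalty(feat_distorted, feat_clean):
    """Assumed distance constraint: mean squared difference between the
    content features of a distorted image and of its distortion-free version."""
    diffs = [(a - b) ** 2
             for cd, cc in zip(feat_distorted, feat_clean)
             for rd, rc in zip(cd, cc)
             for a, b in zip(rd, rc)]
    return sum(diffs) / len(diffs)

# Toy content features with 2 channels of size 2x2.
clean = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
distorted = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 0.0]]]
sparse_term = sparsity_penalty(clean, rho=0.5)
dist_term = distance_penalty(distorted, clean)
```

In training, both terms would be added to the decoder output constraint with weights that the source does not state.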
[0087] Therefore, by constraining the output features of the content encoder 11 and the output image of the content decoder 12, we can train a content encoder that can robustly encode the content information of the image. The extracted content features can play a positive role in the extraction of distortion features and score prediction.
[0088] Distortion autoencoder
[0089] To achieve targeted extraction of distortion information from the input image, a special autoencoder needs to be designed. First, considering that distortion may occur in the overall structure, local details, and locally consistent regions of the image, we design a distortion encoder 21 with multi-layer feature extraction. The specific structure of the distortion encoder 21 is shown in Figure 6. The network first extracts different feature maps from four layers, then feeds these feature maps into a Spatial Pyramid Pooling (SPP) module for fusion to obtain low-dimensional features of fixed length. Finally, the four features are concatenated into a feature vector of length 256, which serves as the final output of the distortion encoder 21. The design of the multi-layer distortion feature extractor lays a solid foundation for subsequent efficient distortion feature extraction.
[0090] There are two main types of existing distortion encoders: one is an encoder that extracts the last layer as a feature after passing through a multi-layer neural network; the other is an encoder that extracts multi-layer feature maps from the neural network but uses the result of simple global average pooling as a feature.
[0091] Unlike existing distortion encoders, the distortion encoder of this invention feeds feature maps from different layers into a spatial pyramid pooling module for fusion to obtain low-dimensional features of a fixed length. Compared to existing distortion encoders, the distortion encoder of this invention has two advantages. First, by extracting features at different layers, it better perceives distortion information in both the local texture and the global structure of the image. Second, spatial pyramid pooling is applied to the extracted features to further analyze information from different receptive fields at multiple scales. Therefore, the features extracted by the distortion encoder of this invention better reflect the distortion information of the image. It should be noted that Figure 6 shows just one implementation of the distortion encoder 21 of the present invention. Feature extraction is not limited to four layers; features can be extracted from as many layers as needed, and the length of the feature vector extracted from each layer is not limited to 64.
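The fixed-length property of spatial pyramid pooling can be illustrated with a minimal single-channel sketch. The pyramid levels (1, 2, 4) and the use of max pooling are assumptions for illustration; the source does not state which levels or pooling operator are used, nor how the pooled values are reduced to a length of 64 per layer.

```python
import math

def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool one 2-D feature map over n x n grids for each pyramid level.

    The output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n
            r1 = max(r0 + 1, math.ceil((i + 1) * h / n))
            for j in range(n):
                c0 = (j * w) // n
                c1 = max(c0 + 1, math.ceil((j + 1) * w / n))
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out

# Two toy feature maps of different spatial sizes.
small = [[float(r * 7 + c) for c in range(7)] for r in range(5)]   # 5 x 7
large = [[float(r * 8 + c) for c in range(8)] for r in range(8)]   # 8 x 8
vec_small = spp_single_channel(small)
vec_large = spp_single_channel(large)
```

Both outputs have the same length even though the inputs differ in size, which is what allows feature maps taken from different layers to be concatenated into one vector of fixed length.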
[0092] Secondly, this invention designs a distortion decoder 22, whose input includes not only the features obtained from the distortion encoder 21 but also the content features obtained from the content encoder 11. To enable the distortion encoder 21 to extract highly representative distortion features, the content features need to provide effective assistance in the distortion decoder 22. Therefore, we propose a distortion decoder 22 that modulates content features based on distortion features; its network structure is shown in Figure 7. First, similar to the content decoder 12, the content features are decoded through four sub-modulation residual blocks and three deconvolutional layers. In each sub-modulation residual block, distortion features are used as modulation information to adjust the content features, embedding the distortion information within them. Specifically, the distortion features first pass through a fully connected (FC) layer to obtain extended features. The extended features include not only the distortion features of each layer but also correlation information representing the relationships between them. Then, the extended features are divided into four decomposed features, which are used in sequence as one of the inputs to the sub-modulation residual blocks. It should be noted that in existing technologies, the input features are generally divided directly into multiple decomposed features. In this invention, the distortion features are first extended by a fully connected layer, and the extended features are then divided into multiple decomposed features.
Compared with the prior art, this has two advantages. First, by setting different parameters for the fully connected layer, the feature length can be changed adaptively to obtain the decomposed feature length most suitable for subsequent reconstruction. Second, the original distortion features are concatenated from multiple scale features of the distortion encoder, with no interaction between them; the fully connected layer enables information exchange between these features, which is more conducive to the reconstruction of distorted images.
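The extend-then-divide step can be sketched as follows. The vector lengths, weights, and helper names are toy assumptions chosen so the arithmetic is easy to follow; a real implementation would learn the fully connected weights and use the feature lengths of the actual network.

```python
def fully_connected(x, weights, bias):
    """One fully connected layer: each output mixes all inputs."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def split_equal(vec, parts):
    """Divide a vector into `parts` contiguous pieces of equal length."""
    if len(vec) % parts:
        raise ValueError("vector length must be divisible by the number of parts")
    size = len(vec) // parts
    return [vec[i * size:(i + 1) * size] for i in range(parts)]

# Toy distortion feature of length 4 (stands in for the concatenated vector).
distortion = [1.0, 2.0, 3.0, 4.0]
# Toy weights: output i sums inputs i % 4 and (i + 1) % 4, so features that
# came from different encoder layers interact in the extended vector.
weights = [[1.0 if j in (i % 4, (i + 1) % 4) else 0.0 for j in range(4)]
           for i in range(8)]
bias = [0.0] * 8

extended = fully_connected(distortion, weights, bias)   # length 8
decomposed = split_equal(extended, 4)                   # one per residual block
```

Changing the number of rows in `weights` changes the extended length, and therefore the length of each decomposed feature, without touching the encoder.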
[0093] In each sub-modulation residual block, the decomposed distortion features are first copied pixel-by-pixel to obtain a distortion feature block with the same spatial size as the content features. This block is then concatenated twice within the residual convolutional structure of the content features to achieve good modulation of the content features. It should be noted that those skilled in the art can divide the extended features into any number of decomposed features as needed, and the number of deconvolutional layers can be adjusted accordingly. The sub-modulation residual block shown in Figure 7 is only one implementation of the present invention; other forms of sub-modulation residual blocks can also be used.
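The pixel-by-pixel copy and concatenation can be sketched on nested lists. This shows only the tiling and channel concatenation; the convolutions of the residual structure are omitted, and the shapes and helper names are toy assumptions.

```python
def tile_vector(vec, height, width):
    """Copy each element of vec to every spatial position: [D] -> [D][H][W]."""
    return [[[v] * width for _ in range(height)] for v in vec]

def concat_channels(a, b):
    """Concatenate two [C][H][W] features along the channel axis."""
    return a + b

# Toy content feature with 2 channels of size 2x3.
content = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
           [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]]
# Toy decomposed distortion feature of length 2.
decomposed = [7.0, 9.0]

height, width = len(content[0]), len(content[0][0])
distortion_block = tile_vector(decomposed, height, width)
modulation_input = concat_channels(content, distortion_block)  # 4 channels
```

In the block described above, a concatenation of this kind would happen twice, each time followed by a convolution that mixes the content and distortion channels.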
[0094] For the constraints on the distortion autoencoder network, only the input image needs to be used as the output constraint of the distortion decoder. The constraint loss function is the same as the first constraint for the content autoencoder, as shown in Equations (1), (2), and (3). Under this constraint, the distortion encoder can encode distortion features in a targeted manner with the help of existing robust content features.
[0095] Independent and Collaborative Training Strategies
[0096] A robust learning strategy is required for the complete training of the collaborative autoencoder. The feature extraction framework proposed in this invention requires two training steps. The first step is training the content autoencoder, and the second step is training the distortion autoencoder.
[0097] In training the content autoencoder, a large number of lossless images are first used as training data, and the output constraints of the content decoder are the input images themselves. Next, distorted images are used as training input data, and lossless images sharing the same content are used as constraints on the content decoder's output. The content features of the lossless images serve as constraints on the content features of the distorted images. Because lossless images are required as constraints, the distorted image data used at this stage must be synthetically distorted. After the content autoencoder completes independent training, its parameters are fixed to ensure that it provides consistent content features, facilitating the training of subsequent modules.
[0098] In training the distortion autoencoder, since a robust content encoder for extracting content features is already available, a large number of distorted images, including images with a single distortion, images with compound distortions, and authentically distorted images, can be used as training data for the distortion autoencoder. The output constraint of the distortion decoder is the input image itself, and the loss function guides backpropagation to optimize the distortion autoencoder. Through these two independent and collaborative training steps, the collaborative autoencoder framework is fully trained. Independent and efficient content features and distortion features can then be extracted from any given image and used in subsequent quality score prediction.
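The two-step strategy can be summarized as a schedule of (input, target) pairs. The sketch below only organizes the data as described above; the image identifiers, the `Sample` record, and the stage labels are hypothetical, and no network is actually trained.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    stage: str                     # which training step the sample belongs to
    input_image: str               # image fed to the encoder(s)
    output_target: str             # image that constrains the decoder output
    feature_target: Optional[str]  # image whose content features constrain
                                   # the content features of the input
    trainable: str                 # module whose parameters are updated

def build_schedule(lossless, synthetic_pairs, distorted_any):
    """lossless: image ids; synthetic_pairs: (distorted id, lossless id);
    distorted_any: ids of single, compound, or authentic distortions."""
    schedule = []
    # Step 1a: lossless images reconstruct themselves.
    for img in lossless:
        schedule.append(Sample("content-1", img, img, None, "content autoencoder"))
    # Step 1b: synthetically distorted images reconstruct their lossless version.
    for dist, ref in synthetic_pairs:
        schedule.append(Sample("content-2", dist, ref, ref, "content autoencoder"))
    # Step 2: content encoder frozen; distorted images reconstruct themselves.
    for img in distorted_any:
        schedule.append(Sample("distortion", img, img, None, "distortion autoencoder"))
    return schedule

schedule = build_schedule(
    lossless=["ref_01"],
    synthetic_pairs=[("ref_01_blur", "ref_01")],
    distorted_any=["ref_01_blur", "camera_17"],
)
```

Only step 1b needs paired lossless references, which is why that step is restricted to synthetic distortions while step 2 can also use authentically distorted images.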
[0099] No-reference quality assessment method based on self-supervised feature extraction
[0100] With the aforementioned feature extractor training method based on collaborative autoencoders, a content encoder 11 and a distortion encoder 21 can be trained on large-scale unlabeled sample data. Since image quality is related to both image content and distortion, the information encoded by both encoders is used as a valid reference for predicting quality scores. However, the extracted content features are high-dimensional, so preliminary fusion is required before regression. The proposed no-reference quality assessment network framework based on self-supervised feature extraction is shown in Figure 8. The input image is first processed by the pre-trained content encoder 11 and distortion encoder 21 to obtain content features and distortion features, respectively. Then, the content features are processed by a fusion network consisting of convolutional layers and global pooling layers to obtain a content feature vector, which is concatenated with the distortion feature vector to obtain a quality feature vector. The quality feature vector is then passed through three fully connected (FC) layers to obtain a quality prediction score. The entire network framework can be trained using existing database data with subjective score labels. The loss function is the MSE between the predicted score and the subjective score; after training, the network outputs the predicted quality score.
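The regression path from features to score can be sketched in plain Python. Several simplifications are assumed: the convolutional part of the fusion network is omitted so that only global average pooling remains, ReLU is assumed between the fully connected layers, and all weights and sizes are toy values.

```python
def global_average_pool(feat):
    """[C][H][W] -> [C]: average each channel over its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def linear(x, weights, bias):
    """One fully connected layer."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def predict_quality(content_feat, distortion_vec, layers):
    """Concatenate the two feature vectors and apply the FC layers."""
    q = global_average_pool(content_feat) + list(distortion_vec)
    for idx, (w, b) in enumerate(layers):
        q = linear(q, w, b)
        if idx < len(layers) - 1:          # assumed ReLU between layers
            q = [max(0.0, v) for v in q]
    return q[0]

def mse(predictions, targets):
    """Training loss: mean squared error against subjective scores."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Toy content feature (2 channels, 2x2) and toy distortion vector.
content_feat = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
distortion_vec = [2.0]
# Three toy FC layers: 3 -> 2 -> 2 -> 1.
layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0]),
    ([[0.5, 0.5]], [0.0]),
]
score = predict_quality(content_feat, distortion_vec, layers)
loss = mse([score], [4.0])   # 4.0 is a hypothetical subjective score
```

During training with labeled data, the loss would be backpropagated through the FC layers and the fusion network; whether the pre-trained encoders are also updated at this stage is not stated here.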
[0101] Based on the above-described quality assessment method based on self-supervised feature extraction, the present invention provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the quality assessment method based on self-supervised feature extraction in the above embodiments.
[0102] Based on the aforementioned quality evaluation method based on self-supervised feature extraction, this invention also provides a terminal, as shown in Figure 9. The terminal includes at least one processor 30, a display screen 31, and a memory 32, and may also include a communication interface 33 and a bus 34. The processor 30, display screen 31, memory 32, and communication interface 33 can communicate with each other via the bus 34. The display screen 31 is configured to display a preset user guide interface in the initial setup mode. The communication interface 33 can transmit information. The processor 30 can call logical instructions in the memory 32 to execute the methods described in the above embodiments.
[0103] Furthermore, the logic instructions in the aforementioned memory 32 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0104] The memory 32, as a computer-readable storage medium, can be configured to store software programs, computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of this disclosure. The processor 30 executes functional applications and data processing by running the software programs, instructions, or modules stored in the memory 32, thereby implementing the methods in the above embodiments.
[0105] The memory 32 may include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 32 may include high-speed random access memory (RAM) and non-volatile memory. Examples include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks; these may also be transient storage media.
[0106] Furthermore, the specific process by which the processor loads and executes the multiple instructions in the aforementioned storage medium and terminal device has been described in detail in the above method and will not be repeated here.
[0107] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A quality evaluation method based on self-supervised feature extraction, characterized in that it includes: A collaborative dual autoencoder is constructed. The collaborative dual autoencoder, trained in a self-supervised manner, extracts content features from the input content image and distortion features from the input distortion image. The input content image and the input distortion image are the same image or have the same content. A content feature vector is obtained based on the content features, and a distortion feature vector is obtained based on the distortion features. The content feature vector and the distortion feature vector are concatenated to obtain the quality feature vector. The quality feature vector is passed through a fully connected layer to obtain a predicted quality score. The predicted quality score is obtained by training on sample data with subjective opinion scores. The construction of the collaborative dual autoencoder includes constructing a content autoencoder and constructing a distortion autoencoder. The construction of the content autoencoder includes constructing a content encoder and constructing a content decoder. The construction of the distortion autoencoder includes constructing a distortion encoder and constructing a distortion decoder. The content encoder extracts content features from the input content image, and the content decoder decodes the content features to obtain a reconstructed content image. The distortion encoder extracts distortion features from the input distortion image, and the distortion decoder decodes the content features and distortion features to obtain a reconstructed distortion image.
The construction of the distortion autoencoder includes: distortion features are passed through a fully connected layer to obtain extended features, the extended features including distortion features from each layer and correlation information representing the relationship between the distortion features from each layer; the extended features are divided into multiple decomposition features, each decomposition feature corresponding to one of multiple sub-modulation residual blocks, the multiple decomposition features being used sequentially as one of the inputs to the corresponding sub-modulation residual block; the multiple sub-modulation residual blocks are cascaded, each sub-modulation residual block taking the current content feature and the corresponding decomposition feature as input and outputting the modulated content feature; the first sub-modulation residual block uses the content feature extracted by the content encoder as the current content feature; each remaining sub-modulation residual block takes the modulated content feature output by the previous sub-modulation residual block as its current content feature; the modulated content feature output by the last sub-modulation residual block is input to a deconvolution layer, and the reconstructed distortion image is output. Self-supervised training of the collaborative dual autoencoder includes self-supervised training of the content autoencoder and self-supervised training of the distortion autoencoder. The self-supervised training of the content autoencoder includes: The content autoencoder is trained using lossless images as training data, and the output constraint of the content decoder is the lossless image itself. The content autoencoder is trained using distorted images as training data.
Distortionless images that share the same content as the distorted images serve as constraints on the output of the content decoder, and the content features of the distortionless images serve as constraints on the content features of the distorted images. The self-supervised training of the distortion autoencoder includes: The distortion autoencoder is trained using the distorted image as training data. The output constraint of the distortion decoder is the distorted image itself. The loss function guides backpropagation to optimize the distortion autoencoder.
2. The quality evaluation method based on self-supervised feature extraction according to claim 1, characterized in that the output constraints of the content decoder and the distortion decoder are defined by the following functions: $l_{overall} = \mu \times l_{pixel}(I_o, I_r) + (1-\mu) \times l_{percp}(I_o, I_r)$ (1) where $l_{overall}$ is the global constraint, μ is the balance parameter, $l_{pixel}$ is the pixel-level constraint, $I_o$ is the distorted or distortion-free image, $I_r$ is the reconstructed content image or the reconstructed distorted image, and $l_{percp}$ is the perceptual constraint; (2) where N is the total number of pixels and k is the index of a specific pixel position in the image; (3) where the feature term is the VGGNet feature layer selected for layer j, L is the total number of selected feature layers, and $C_j H_j W_j$ is the size of the output feature map of the j-th feature layer.
3. The quality evaluation method based on self-supervised feature extraction according to claim 2, characterized in that: When performing self-supervised training on the content autoencoder, sparsity constraints and / or distance constraints are also applied to the extracted content features.
4. The quality evaluation method based on self-supervised feature extraction according to claim 1, characterized in that: The construction of the distortion encoder involves first extracting different feature maps from multiple layers of the input distortion image, then feeding each feature map into a spatial pyramid pooling module for fusion to obtain multiple low-dimensional features of fixed length, and finally concatenating the multiple low-dimensional features as the final output of the distortion encoder.
5. The quality evaluation method based on self-supervised feature extraction according to claim 1, characterized in that: The process of obtaining the content feature vector based on the content features involves passing the content features through a fusion network composed of convolutional layers and global pooling layers to obtain the content feature vector.
6. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores one or more programs, which can be executed by one or more processors to implement the steps in the quality evaluation method based on self-supervised feature extraction as described in any one of claims 1 to 5.
7. A terminal, characterized in that it includes: a processor and a memory, wherein the memory stores a computer-readable program that can be executed by the processor; when the processor executes the computer-readable program, it implements the steps of the quality assessment method based on self-supervised feature extraction as described in any one of claims 1 to 5.