Model training method, video quality evaluation method, device, equipment and medium
By training a deep learning-based video quality assessment model, and utilizing a 3D convolutional neural network and an attention model, the shortcomings of feature extraction and boundary detection in existing video quality assessment technologies are addressed, resulting in higher assessment accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZTE CORP
- Filing Date
- 2021-09-09
- Publication Date
- 2026-06-16
AI Technical Summary
Existing video quality assessment methods such as PSNR, SSIM, and VMAF have shortcomings in feature extraction and boundary differentiation, resulting in poor assessment results. Furthermore, existing objective methods lack accuracy in industrial applications.
An initial video quality assessment model based on deep learning is adopted. By obtaining the MOS values of reference and distorted video data, a three-dimensional convolutional neural network, an attention model, and a data fusion module are trained until the convergence condition is met, thus forming the final video quality assessment model.
It improves the accuracy of video quality assessment, fully extracts image features and accurately detects image boundaries, ensures the independence and diversity of the model, and enhances the precision of video quality assessment.
Figure CN115775218B_ABST
Description
Technical Field
[0001] This invention relates to the field of image processing technology, specifically to a model training method for video quality assessment, a video quality assessment method, a model training device for video quality assessment, a video quality assessment device, an electronic device, and a computer storage medium.
Background Technology
[0002] With the advent of the 5G era, video applications are becoming increasingly widespread, such as live streaming, short videos, and video calls. In the internet age where everything relies on video, the ever-increasing data traffic poses a severe challenge to the stability of video service systems. How to correctly evaluate video quality has become a major bottleneck restricting the development of various technologies. It can even be said that video quality evaluation has become the most fundamental and important issue in the audio and video field, and urgently needs to be solved.
[0003] Currently, video quality assessment methods in the industry fall into two main categories: subjective video quality assessment and objective video quality assessment. Subjective methods allow observers to make intuitive judgments about video quality. While relatively accurate, they are also complex and their results are easily affected by various factors, making them unsuitable for direct application in industrial fields. Therefore, in practice, easily implemented artificial intelligence-based objective methods are typically used. However, current solutions utilizing these technologies, such as PSNR (Peak Signal to Noise Ratio), SSIM (Structural Similarity Index Measure), and VMAF (Video Multi-Method Assessment Fusion), do not yield satisfactory results. Therefore, achieving more accurate video quality assessment remains a pressing challenge.
Summary of the Invention
[0004] This disclosure addresses the aforementioned deficiencies in the prior art by providing a model training method for video quality assessment, a video quality assessment method, a model training device for video quality assessment, a video quality assessment device, an electronic device, and a computer storage medium.
[0005] In a first aspect, embodiments of this disclosure provide a model training method for video quality assessment, comprising:
[0006] Acquire training video data, wherein the training video data includes reference video data and distorted video data;
[0007] Determine the mean opinion score (MOS) of each training video data;
[0008] The initial video quality assessment model is trained based on the training video data and its MOS value until the convergence condition is met, thus obtaining the final video quality assessment model.
[0009] In some embodiments, training a preset initial video quality assessment model based on the training video data and its MOS value until convergence is achieved includes:
[0010] The training set and the validation set are determined according to a preset ratio and the training video data, wherein the intersection of the training set and the validation set is an empty set;
[0011] The parameters of the initial video quality assessment model are adjusted based on the training set and the MOS values of each video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted based on the validation set and the MOS values of each video data in the validation set, until the convergence condition is met.
[0012] In some embodiments, the convergence condition includes that the evaluation error rate of each video data in both the training set and the validation set does not exceed a preset threshold, and the evaluation error rate is calculated using the following formula:
[0013] E = (|S - Mos|) / Mos, where,
[0014] E represents the evaluation error rate of the current video data;
[0015] S is the evaluation score of the current video data output by the initial quality assessment model after adjusting the parameters and hyperparameters;
[0016] Mos is the MOS value of the current video data.
[0017] In some embodiments, the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting motion information from image frames.
[0018] In some embodiments, the initial video quality assessment model further includes an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
[0019] In some embodiments, the attention model includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a hierarchical convolutional network, an upsampling processing module, and an attention mechanism network. The dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
[0020] In some embodiments, the attention mechanism network includes a cascaded attention convolution module, a rectified linear unit (ReLU) activation module, a nonlinear activation module, and an attention upsampling processing module.
[0021] In some embodiments, the hierarchical convolutional network includes a first hierarchical network, a second hierarchical network, a third hierarchical network, and a fourth upsampling processing module. The first hierarchical network includes a cascaded first downsampling processing module and a first hierarchical convolutional module. The second hierarchical network includes a cascaded second downsampling processing module, a second hierarchical convolutional module, and a second upsampling processing module. The third hierarchical network includes a cascaded global pooling module, a third hierarchical convolutional module, and a third upsampling processing module. The first hierarchical convolutional module is also cascaded with the second downsampling processing module. The first hierarchical convolutional module and the second upsampling processing module are cascaded with the fourth upsampling processing module. The fourth upsampling processing module and the third upsampling processing module are also cascaded with the third hierarchical convolutional module.
[0022] In some embodiments, determining the mean opinion score (MOS) of each of the training video data includes:
[0023] The training video data are divided into groups, each group including one reference video data and multiple distorted video data, and the resolution and frame rate of each video data in each group are the same;
[0024] The video data in each group is classified into categories;
[0025] The video data of each category in each group is graded into different levels;
[0026] The MOS value of each training video data is determined based on the grouping, classification, and grading of the training video data.
[0027] Furthermore, this disclosure provides a video quality assessment method, including:
[0028] The final quality assessment model trained according to the method described above is used to process the video data to be assessed, thereby obtaining the quality assessment score of the video data to be assessed.
[0029] In another aspect, this disclosure provides a model training apparatus for video quality assessment, comprising:
[0030] An acquisition module is used to acquire training video data; wherein, the training video data includes reference video data and distorted video data;
[0031] The processing module is used to determine the mean opinion score (MOS) of each of the training video data.
[0032] The training module is used to train a preset initial video quality assessment model based on the training video data and its MOS value until the convergence condition is met, so as to obtain the final video quality assessment model.
[0033] In another aspect, this disclosure provides a video quality assessment device, comprising:
[0034] The evaluation module is used to process the video data to be evaluated according to the final quality evaluation model trained by the model training method for video quality evaluation as described above, and to obtain the quality evaluation score of the video data to be evaluated.
[0035] In another aspect, this disclosure provides an electronic device, comprising:
[0036] One or more processors;
[0037] A storage device on which one or more programs are stored;
[0038] When the one or more programs are executed by the one or more processors, the one or more processors perform at least one of the following methods:
[0039] The model training method for video quality assessment as described above;
[0040] The video quality assessment method described above.
[0041] In another aspect, this disclosure provides a computer storage medium having a computer program stored thereon, wherein the program, when executed, implements at least one of the following methods:
[0042] The model training method for video quality assessment as described above;
[0043] The video quality assessment method described above.
[0044] The video quality assessment model training method provided in this disclosure pre-defines an initial video quality assessment model to fully extract image features and accurately detect boundaries in images. Training video data, including reference video data and distorted video data, is then acquired. The initial video quality assessment model is trained using both the reference and distorted video data to obtain a final video quality assessment model. This method clearly distinguishes between distorted and non-distorted video data (i.e., reference video data), thus ensuring the independence and diversity of the video data used for model training. The final video quality assessment model obtained by training the initial model can fully extract image features and accurately detect boundaries in images. This final model can be directly used to assess the quality of the video data to be evaluated, improving the accuracy of video quality assessment.
Attached Figure Description
[0045] Figure 1 is a flowchart of the model training method for video quality assessment provided in this disclosure;
[0046] Figure 2 is a flowchart of the training process of the initial video quality assessment model provided in this disclosure;
[0047] Figure 3 is a schematic diagram of the three-dimensional convolutional neural network provided in this disclosure;
[0048] Figure 4 is a flowchart of the dense convolutional network provided in this disclosure;
[0049] Figure 5 is a flowchart of the attention mechanism network provided in this disclosure;
[0050] Figure 6 is a flowchart of the hierarchical convolutional network provided in this disclosure;
[0051] Figure 7 is a flowchart of the initial video quality assessment model provided in this disclosure;
[0052] Figure 8a is a schematic diagram of the 3D-PVQA method provided in this disclosure;
[0053] Figure 8b shows screenshots of the reference video data and the distorted video data provided in this disclosure;
[0054] Figure 9 is a flowchart of the process for determining the mean opinion score (MOS) of each training video data provided in this disclosure;
[0055] Figure 10 is a flowchart of the video quality assessment method provided in this disclosure;
[0056] Figure 11 is a schematic diagram of the model training device for video quality assessment provided in this disclosure;
[0057] Figure 12 is a schematic diagram of the modules of the video quality assessment device provided in this disclosure;
[0058] Figure 13 is a schematic diagram of the electronic device provided in this disclosure;
[0059] Figure 14 is a schematic diagram of the computer storage medium provided in this disclosure.
Detailed Implementation
[0060] Exemplary embodiments will be described more fully below with reference to the accompanying drawings; however, these exemplary embodiments may be embodied in different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will enable those skilled in the art to fully understand the scope of this disclosure.
[0061] As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
[0062] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. As used herein, the singular forms “a” and “the” are also intended to include the plural forms unless the context clearly indicates otherwise. It will also be understood that when the terms “comprising” and/or “made of” are used in this specification, the presence of the stated features, integers, steps, operations, elements, and/or components is specified, but the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof is not excluded.
[0063] The embodiments described herein can be described with reference to plan views and/or cross-sectional views by way of idealized schematic diagrams of this disclosure. Accordingly, the example illustrations can be modified according to manufacturing techniques and/or tolerances. The embodiments are therefore not limited to those shown in the drawings, but include modifications to configurations formed based on manufacturing processes. The areas illustrated in the drawings are schematic in nature, and the shapes of the areas shown in the figures illustrate specific shapes of areas of an element, but are not intended to be limiting.
[0064] Unless otherwise specified, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art. It will also be understood that terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the relevant art and this disclosure, and will not be interpreted as having an idealized or overly formal meaning, unless expressly so defined herein.
[0065] Currently, commonly used video quality assessment schemes such as PSNR, SSIM, and VMAF suffer from incomplete feature extraction and unclear boundary differentiation, resulting in unsatisfactory final results. This disclosure proposes a pre-set initial video quality assessment model for fully extracting features and accurately detecting boundaries in images. Reference video data and distorted video data are acquired, and the initial video quality assessment model is trained using the reference video data, distorted video data, and their MOS (Mean Opinion Score) values to obtain the final video quality assessment model, thereby improving the accuracy of video quality assessment.
[0066] As shown in Figure 1, this disclosure provides a model training method for video quality assessment, which may include the following steps:
[0067] In step S11, training video data is acquired, wherein the training video data includes reference video data and distorted video data.
[0068] In step S12, the MOS value of each training video data is determined.
[0069] In step S13, a preset initial video quality assessment model is trained based on the training video data and its MOS value until the convergence condition is met, thus obtaining the final video quality assessment model.
[0070] Reference video data can be regarded as standard video data. Reference video data and distorted video data can be obtained from the open-source datasets LIVE, CSIQ, and IVP and from the self-made dataset CENTER. The MOS (Mean Opinion Score) value is a numerical indicator used to characterize the quality of video data. Video data in the open-source datasets LIVE, CSIQ, and IVP typically carry corresponding MOS values, but video data in the self-made dataset CENTER does not. Therefore, it is necessary to determine the MOS value of each training video data. This can be done by directly obtaining the MOS values of the training video data from the open-source datasets LIVE, CSIQ, and IVP and generating corresponding MOS values for the training video data from the self-made dataset CENTER. Alternatively, MOS values can be generated directly for all training video data. During the training of the initial video quality assessment model, when the convergence condition is met, the model is considered to have met the video quality assessment requirements, and training is stopped, resulting in the final video quality assessment model.
[0071] As can be seen from steps S11-S13 above, the model training method for video quality assessment provided in this disclosure pre-sets an initial video quality assessment model to fully extract image features and accurately detect boundaries in the image. Training video data, including reference video data and distorted video data, is then acquired. Simultaneously, the initial video quality assessment model is trained using both the reference and distorted video data to obtain the final video quality assessment model. This method clearly distinguishes between distorted and non-distorted video data (i.e., reference video data), thus ensuring the independence and diversity of the video data used to train the model. The final video quality assessment model obtained by training the initial model can fully extract image features and accurately detect boundaries in the image. This final model can be directly used to assess the quality of the video data to be evaluated, improving the accuracy of video quality assessment.
[0072] Generally, given a fixed network structure, two factors influence a model's final performance: the model's parameters (e.g., weights, biases) and its hyperparameters (e.g., learning rate, number of layers). Using the same training data to optimize both the parameters and the hyperparameters is likely to cause overfitting. Therefore, two separate datasets can be used to optimize the parameters and the hyperparameters of the initial video quality assessment model independently.
[0073] Correspondingly, as shown in Figure 2, in some embodiments, training a preset initial video quality assessment model based on training video data and its MOS values until convergence is achieved (i.e., as described in step S13) may include the following steps:
[0074] In step S131, a training set and a validation set are determined according to a preset ratio and training video data, wherein the intersection of the training set and the validation set is an empty set.
[0075] In step S132, the parameters of the initial video quality assessment model are adjusted according to the training set and the MOS values of each video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted according to the validation set and the MOS values of each video data in the validation set, until the convergence condition is met.
[0076] This disclosure does not limit the preset ratio; for example, the training video data can be divided into a training set and a validation set in a 6:4 ratio. Of course, other preset ratios such as 8:2 or 5:5 are also possible. For a simple evaluation of the generalization ability of the final video quality assessment model, the training video data can also be divided into training, validation, and test sets. For example, the training video data can be divided into training, validation, and test sets in a 6:2:2 ratio, with the pairwise intersections of the training, validation, and test sets being empty. After partitioning, the training and validation sets are used to train the initial video quality assessment model to obtain the final video quality assessment model, and the test set is used to evaluate the generalization ability of the final video quality assessment model. It should be noted that the more test set data there is, the longer it takes to evaluate the generalization ability of the final video quality assessment model using the test set; the more video data used to train the initial video quality assessment model, the higher the accuracy of the final video quality assessment model. To further improve the efficiency and accuracy of video quality assessment, the amount of training video data and the proportion of the training and validation sets in the training video data can be appropriately increased. For example, the training video data can be divided into training, validation, and test sets according to other ratios such as 10:1:1.
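The ratio-based partition described above can be illustrated with a minimal Python sketch. The function name `split_dataset`, the shuffling step, and the fixed seed are assumptions made for illustration; the disclosure only requires that the subsets follow a preset ratio and do not intersect.

```python
import random

def split_dataset(items, ratios=(6, 2, 2), seed=0):
    """Split items into disjoint subsets whose sizes follow `ratios`.

    Hypothetical helper: the shuffle and the seed are illustrative choices,
    not something the disclosure specifies.
    """
    if not ratios or any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    total = sum(ratios)
    n = len(shuffled)
    subsets, start, acc = [], 0, 0
    for r in ratios:
        acc += r
        end = (n * acc) // total  # cumulative boundary, so every item is used once
        subsets.append(shuffled[start:end])
        start = end
    return subsets

# Ten placeholder video identifiers split 6:2:2 into training, validation, test.
videos = [f"video_{i}" for i in range(10)]
train_set, val_set, test_set = split_dataset(videos, ratios=(6, 2, 2))
```

Passing `ratios=(6, 4)` gives the two-way training/validation split; because the boundaries are cumulative slices of one shuffled list, the subsets cannot overlap.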
[0077] As can be seen from the above steps S131-S132, the model training method for video quality assessment provided in this disclosure determines the training set and validation set with an empty intersection based on a preset ratio and training video data. The parameters of the initial video quality assessment model are adjusted using the MOS values of the training set and each video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted using the MOS values of the validation set and each video data in the validation set. When the convergence condition is met, a final video quality assessment model with high accuracy in fully extracting image features and accurately detecting boundaries in the image can be obtained, thus improving the accuracy of video quality assessment.
[0078] Training an initial video quality assessment model based on training video data and its MOS (Mean Opinion Score) values is a deep learning-based model training process. Essentially, it uses the MOS values of the training video data as a benchmark, striving to make the model's output values continuously approach the MOS values. When the difference between the model's output assessment result and the MOS value is small, the model can be considered to have met the requirements for video quality assessment.
[0079] Accordingly, the convergence condition includes that the evaluation error rate of each video data in both the training and validation sets does not exceed a preset threshold, and the evaluation error rate is calculated using the following formula:
[0080] E = (|S - Mos|) / Mos, where,
[0081] E represents the evaluation error rate of the current video data;
[0082] S is the evaluation score of the current video data output by the initial quality assessment model after adjusting the parameters and hyperparameters;
[0083] Mos is the MOS value of the current video data.
[0084] For any given video data, the MOS value of the current video data is predetermined. After the current video data is input into the initial quality assessment model with adjusted parameters and hyperparameters, the model outputs the evaluation score S of the current video data, so the evaluation error rate E of the current video data can be calculated. When the evaluation error rates of every video data in the training set and in the validation set do not exceed a preset threshold, the difference between the model's output assessment result and the MOS value is small, and the model has met the requirements for video quality assessment. At this point, training can be stopped. It should be noted that this disclosure does not specifically limit the preset threshold; for example, the preset threshold can be 0.28, 0.26, 0.24, etc.
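As a small worked illustration of the convergence condition, the following Python sketch computes E = |S - Mos| / Mos for each video and checks it against the threshold. The function names and the sample scores are hypothetical; 0.28 is one of the example thresholds mentioned above.

```python
def evaluation_error_rate(score, mos):
    """E = |S - Mos| / Mos for one video; assumes a positive MOS value."""
    if mos <= 0:
        raise ValueError("MOS must be positive")
    return abs(score - mos) / mos

def has_converged(scores, mos_values, threshold=0.28):
    """True when no video's error rate exceeds the preset threshold."""
    if len(scores) != len(mos_values):
        raise ValueError("scores and mos_values must have the same length")
    return all(
        evaluation_error_rate(s, m) <= threshold
        for s, m in zip(scores, mos_values)
    )

# Made-up model scores and MOS values for two videos.
error = evaluation_error_rate(3.0, 4.0)              # |3 - 4| / 4
converged = has_converged([3.0, 4.5], [4.0, 5.0])    # both errors are below 0.28
```

In training, the check would be applied to the videos of both the training set and the validation set, and training stops once it holds for all of them.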
[0085] Currently, commonly used video quality assessment schemes such as PSNR, SSIM, and VMAF still suffer from motion information loss, resulting in unsatisfactory final results. This disclosure presupposes that the initial video quality assessment model may include a three-dimensional convolutional neural network for extracting motion information, thereby improving the accuracy of video quality assessment. Accordingly, in some embodiments, the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting motion information from image frames.
[0086] Figure 3 is a schematic diagram of the three-dimensional convolutional neural network provided in this disclosure. This three-dimensional convolutional neural network can stack multiple consecutive image frames into a cube and then apply three-dimensional convolution kernels within the cube. In the structure of the three-dimensional convolutional neural network, each feature map in a convolutional layer (as shown in the right half of Figure 3) is connected to multiple adjacent consecutive frames in the previous layer (as shown in the left half of Figure 3), so motion information between consecutive image frames can be captured.
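To make the idea of convolving over a stacked frame cube concrete, here is a minimal pure-Python sketch of a single-channel 3D convolution (implemented as cross-correlation without padding). The function name, the toy frames, and the hand-picked temporal-difference kernel are assumptions for illustration; a trained network would learn its kernels and use many channels.

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation on nested lists, no padding."""
    t, h, w = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for i in range(t - kt + 1):          # slide along the time axis
        plane = []
        for j in range(h - kh + 1):      # slide along the height axis
            row = []
            for k in range(w - kw + 1):  # slide along the width axis
                acc = 0.0
                for a in range(kt):
                    for b in range(kh):
                        for c in range(kw):
                            acc += volume[i + a][j + b][k + c] * kernel[a][b][c]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out

# Stack five 4x4 frames into a cube; frame i has constant brightness i.
cube = [[[float(i)] * 4 for _ in range(4)] for i in range(5)]
# A 2x1x1 kernel that subtracts each frame from the next one.
temporal_diff = [[[-1.0]], [[1.0]]]
motion = conv3d_valid(cube, temporal_diff)
```

Because the kernel spans two adjacent frames, each output value depends on how a pixel changes over time, which a per-frame two-dimensional convolution cannot see.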
[0087] Using only a 3D convolutional neural network to extract motion information from image frames is insufficient for a complete evaluation of video data. Accordingly, in some embodiments, the initial video quality evaluation model may also include an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, data fusion processing module, 3D convolutional neural network, global pooling module, and fully connected layer are cascaded in sequence.
[0088] In some embodiments, the attention model includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a hierarchical convolutional network, an upsampling processing module, and an attention mechanism network. The dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
[0089] Figure 4 is a schematic diagram of the dense convolutional network provided in this disclosure. The dense convolutional network includes at least two cascaded dense convolutional modules. Each dense convolutional module includes four cascaded densely connected convolutional layers. The input to each densely connected layer is the fusion of feature maps from all preceding densely connected layers of the current dense convolutional module. The feature map after each pooling layer of the encoder passes through a dense convolutional module. Each time the feature map passes through a dense convolutional module, a Batch Normalization (BN) operation, a ReLU (Rectified Linear Unit) activation function operation, and a Convolution (Conv) operation are performed.
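The dense connectivity pattern can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the patented network: fusion is taken to be channel concatenation (as in DenseNet), the convolution is simplified to a 1*1 convolution with fixed averaging weights, and the names `dense_module`, `growth`, and the toy input are hypothetical.

```python
import math

def batch_norm(channel, eps=1e-5):
    """Normalize one channel (a flat list of pixels) to zero mean, unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]

def relu(channel):
    return [max(0.0, v) for v in channel]

def conv1x1(channels, weights):
    """1*1 convolution: each output channel is a weighted sum of input channels."""
    n = len(channels[0])
    return [
        [sum(w * ch[p] for w, ch in zip(row, channels)) for p in range(n)]
        for row in weights
    ]

def dense_module(inputs, num_layers=4, growth=2):
    """Four densely connected layers; each sees all earlier feature maps."""
    features = list(inputs)
    for _ in range(num_layers):
        x = [relu(batch_norm(ch)) for ch in features]   # BN, then ReLU
        # Assumed fixed averaging weights; a trained model would learn these.
        weights = [[1.0 / len(x)] * len(x) for _ in range(growth)]
        features = features + conv1x1(x, weights)        # concatenate new maps
    return features

# Three input channels with four pixels each.
inputs = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0], [2.0, 2.0, 2.0, 2.0]]
outputs = dense_module(inputs)
```

With three input channels and a growth of two channels per layer, the module ends with 3 + 4 * 2 = 11 channels, and the original inputs are still present at the front of the list.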
[0090] In some embodiments, the attention mechanism network includes a cascaded attention convolution module, a rectified linear unit (ReLU) activation module, a nonlinear activation module, and an attention upsampling processing module.
[0091] Figure 5 is a flowchart of the attention mechanism network provided in this disclosure. The inputs to the attention mechanism network are a low-dimensional feature g_i and a high-dimensional feature x_l, where x_l is obtained by upsampling the output x_i of the hierarchical convolutional network by a factor of 2; a portion of the output of the dense convolutional network is processed by the upsampling module and then input into the hierarchical convolutional network to output x_i; g_i is another part of the output of the dense convolutional network. A 1*1 convolution (W_g: Conv 1*1) is performed on g_i, and a 1*1 convolution (W_x: Conv 1*1) is performed on x_l. The two convolution results are matrix-added, processed by the rectified linear unit (ReLU) activation module, and then passed through a 1*1 convolution (ψ: Conv 1*1), non-linear activation (Sigmoid), and upsampling to obtain the linear attention coefficient. Finally, the linear attention coefficient is multiplied element-wise with the low-dimensional feature g_i, retaining the relevant activations, to obtain the attention coefficient.
[0092] The linear attention coefficient can be calculated using the following formula:
[0093] (formula (1); not reproduced in this text)
[0094] The attention coefficient can be calculated using the following formula:
[0095] (formula (2); not reproduced in this text)
[0096] In formulas (1) and (2), W_g corresponds to the 1*1 convolution performed on g_i, W_x corresponds to the 1*1 convolution performed on x_l, T denotes the matrix transpose, and ψ corresponds to the 1*1 convolution performed on the output of the rectified linear unit activation module. All of these are obtained under rectified linear unit activation.
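Since formulas (1) and (2) are not available in this text, the following LaTeX block is only one possible reading of the processing order described in paragraph [0091]. The symbols α (linear attention coefficient), ĝ_i (attention coefficient), and Up (the upsampling operation) are assumed names, and any bias terms are omitted; the original formulas may differ in notation or detail.

```latex
\begin{align}
\alpha &= \mathrm{Up}\Big(\mathrm{Sigmoid}\big(\psi^{T}\,
          \mathrm{ReLU}\big(W_g^{T} g_i + W_x^{T} x_l\big)\big)\Big) \\
\hat{g}_i &= \alpha \odot g_i
\end{align}
```

Here ⊙ denotes element-wise multiplication, matching the final step in which the linear attention coefficient is multiplied element by element with g_i.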
[0097] In some embodiments, the hierarchical convolutional network includes a first hierarchical network, a second hierarchical network, a third hierarchical network, and a fourth upsampling processing module. The first hierarchical network includes a cascaded first downsampling processing module and a first hierarchical convolutional module. The second hierarchical network includes a cascaded second downsampling processing module, a second hierarchical convolutional module, and a second upsampling processing module. The third hierarchical network includes a cascaded global pooling module, a third hierarchical convolutional module, and a third upsampling processing module. The first hierarchical convolutional module is also cascaded with the second downsampling processing module. The first hierarchical convolutional module and the second upsampling processing module are cascaded with the fourth upsampling processing module. The fourth upsampling processing module and the third upsampling processing module are also cascaded with the third hierarchical convolutional module.
[0098] The first and second downsampling processing modules are used to downsample the data. The second, third, and fourth upsampling processing modules are used to upsample the data.
[0099] Figure 6 is a flowchart of the hierarchical convolutional network provided in this disclosure. After data is input into the hierarchical convolutional network, it is processed by the first hierarchical network, the second hierarchical network, and the third hierarchical network, respectively. The outputs of the first and second hierarchical networks are fused and then input into the fourth upsampling processing module. In the third hierarchical network, the input data is processed by the global pooling module and then by the third hierarchical convolutional module. The output X1 of the third hierarchical convolutional module is matrix-multiplied with the output P(X) of the fourth upsampling processing module, and the product is then matrix-added with the output X2 of the third upsampling processing module. Finally, the sum is input into the third hierarchical convolutional module for processing, yielding the output of the hierarchical convolutional network, i.e., the high-dimensional feature x_i.
[0100] In some embodiments, the first hierarchical convolutional module can perform a Conv 5*5 operation (i.e., 5*5 convolution) on the data, the second hierarchical convolutional module can perform a Conv 3*3 operation (i.e., 3*3 convolution) on the data, and the third hierarchical convolutional module can perform a Conv 1*1 operation (i.e., 1*1 convolution) on the data. It should be understood that a single convolutional module can also be used to perform the Conv 5*5, Conv 3*3, and Conv 1*1 operations, respectively.
[0101] Figure 7 is a flowchart of the initial video quality assessment model provided in this disclosure. As shown in Figure 7, the initial video quality assessment model may include a multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a hierarchical convolutional network, an upsampling processing module, an attention mechanism network, a data fusion processing module, a three-dimensional convolutional neural network, a global pooling module, and fully connected layers.
[0102] The video quality assessment model provided in this disclosure can be called the 3D-PVQA (3Dimensions Pyramid Video Quality Assessment) model, and the corresponding method the 3D-PVQA method. In the aforementioned step S132, each video data in the training set and each video data in the validation set is divided into distorted video data and residual video data, which are respectively input into the 3D-PVQA model as Residual-Multi-Input and Distorted-Multi-Input. The residual video data can be obtained by processing the distorted video data and the reference video data using residual frames. The multi-input network outputs two sets of data from the input data: the first set is the original input data, and the second set is the original input data with the frame size reduced by half.
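The two preprocessing ideas in this paragraph, residual frames and a half-size second input, can be illustrated as follows. This is a minimal sketch under stated assumptions: the residual is taken as the pixel-wise difference (distorted minus reference), and the frame is halved by 2x2 average pooling. The disclosure does not specify either operation, and the function names are hypothetical.

```python
def residual_frame(distorted, reference):
    """Pixel-wise difference between a distorted frame and its reference frame.

    Assumption: residual = distorted - reference, for frames of equal size.
    """
    return [[d - r for d, r in zip(d_row, r_row)]
            for d_row, r_row in zip(distorted, reference)]


def halve_frame(frame):
    """Halve height and width with 2x2 average pooling (an assumed method)."""
    height = len(frame) // 2 * 2
    width = len(frame[0]) // 2 * 2
    return [[(frame[y][x] + frame[y][x + 1]
              + frame[y + 1][x] + frame[y + 1][x + 1]) / 4
             for x in range(0, width, 2)]
            for y in range(0, height, 2)]


def multi_input(frame):
    """Return the two outputs: the original frame and its half-size copy."""
    return frame, halve_frame(frame)


# Illustrative 4x4 grayscale frames.
reference = [[10, 10, 20, 20],
             [10, 10, 20, 20],
             [30, 30, 40, 40],
             [30, 30, 40, 40]]
distorted = [[12, 8, 20, 20],
             [10, 10, 24, 20],
             [30, 30, 40, 40],
             [30, 30, 40, 36]]

residual = residual_frame(distorted, reference)
full_size, half_size = multi_input(residual)
```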
[0103] Taking the Distorted Multi-Input (DMI) model in the lower half as an example, the DMI network outputs two sets of data. The first set of data is processed by a 2D convolutional module, then processed by a dense convolutional network, and finally processed by a downsampling module. The second set of data is processed by the 2D convolutional module, then fused (concat) with the output of the downsampling module, and then input into the dense convolutional network again. A portion of the output from the dense convolutional network is then input into the downsampling module again, and the output of the downsampling module is input into a hierarchical convolutional network. The output of the hierarchical convolutional network, along with another portion of the output from the dense convolutional network, is then input into the attention mechanism network for processing. The data fusion processing module performs data fusion processing on the output results of the residual video data obtained from the attention mechanism network and the output results of the distorted video data. The output of the data fusion processing module will be input into two three-dimensional convolutional neural networks. The three-dimensional convolutional neural networks will output a threshold for the perceptibility of frame loss. The threshold for the perceptibility of frame loss will be multiplied by the residual data frames obtained from the residual framework. Finally, it will be input into the global pooling module and the fully connected layer for processing, and the output will be the quality evaluation score of the video data.
[0104] It should be understood that the same modules can be reused. Figure 6 shows two first hierarchical convolutional modules, two second hierarchical convolutional modules, and three third hierarchical convolutional modules. This does not imply that the hierarchical convolutional network contains a total of two first hierarchical convolutional modules, two second hierarchical convolutional modules, and three third hierarchical convolutional modules. The downsampling processing module can be the same as or different from the downsampling processing modules in the hierarchical convolutional network. Similarly, the upsampling processing module can be the same as or different from the upsampling processing modules in the hierarchical convolutional network and the attention upsampling processing module in the attention mechanism network.
[0105] As shown in Figure 8a, the training video data can be divided into a training set, a validation set, and a test set according to a preset ratio. The training set is input into the 3D-PVQA model for training, the validation set is input into the 3D-PVQA model for validation, and the test set is input into the 3D-PVQA model for testing, each yielding corresponding quality assessment scores. As described above, the test set can be used to evaluate the generalization ability of the final video quality assessment model. As shown in Figure 8b, the left side is a screenshot of the reference video data, and the right side is a screenshot of the distorted video data. Table 1 below shows the MOS value of the video data and the quality assessment score corresponding to the video data output by the 3D-PVQA model.
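A split by preset ratio into mutually disjoint sets might look like the following sketch. The 8:1:1 ratio, the fixed seed, and the function name `split_dataset` are illustrative assumptions; the disclosure only requires a preset ratio.

```python
import random


def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and cut them into disjoint train, validation, and test sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # deterministic shuffle for repeatability
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


videos = [f"video_{i:02d}" for i in range(20)]
train_set, val_set, test_set = split_dataset(videos)
```

Because each item lands in exactly one slice of the shuffled list, the pairwise intersections of the three sets are empty by construction.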
[0106] Table 1
[0109] As shown in Figure 9, in some embodiments, determining the mean opinion score (MOS) value of each training video data (i.e., step S12) may include the following steps:
[0110] In step S121, the training video data are divided into groups. Each group includes one reference video data and multiple distorted video data. The resolution of each video data in each group is the same, and the frame rate of each video data in each group is the same.
[0111] In step S122, the distorted video data in each group are classified.
[0112] In step S123, the distorted video data of each category in each group are graded.
[0113] In step S124, the MOS value of each training video data is determined according to the grouping, classification and grading of each training video data.
[0114] Specifically, when classifying the distorted video data in each group, the distorted video data can be divided into different categories such as packet loss distortion and encoding distortion. When grading the distorted video data in each category in each group, the distorted video data can be divided into three different levels of distortion: mild, moderate, and severe.
[0115] After grouping, classifying, and grading the training video data, each group includes one reference video data and multiple distorted video data. The multiple distorted video data belong to different categories, and the distorted video data under each category belong to different distortion levels. Based on the reference video data in each group, the MOS value of each training video data can be determined using the SAMVIQ (Subjective Assessment Methodology for Video Quality) method together with the grouping, classification, and grading results.
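The grouping, classification, and grading of steps S121 to S123 can be pictured as a three-level index. The sketch below is illustrative only: the `Clip` record, its field names, and the sample entries are hypothetical. It does not compute MOS values, since those come from subjective SAMVIQ scoring sessions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    name: str
    reference: str    # name of the reference video the clip was derived from
    resolution: str
    fps: int
    category: str     # e.g. "packet_loss" or "encoding"
    level: str        # "mild", "moderate", or "severe"


def organize(clips):
    """Build group -> category -> level -> [clip names].

    A group is keyed by (reference, resolution, fps), so all clips in a group
    share one reference video, one resolution, and one frame rate.
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for clip in clips:
        group_key = (clip.reference, clip.resolution, clip.fps)
        tree[group_key][clip.category][clip.level].append(clip.name)
    return tree


clips = [
    Clip("a_pl_mild", "ref_a", "1920x1080", 30, "packet_loss", "mild"),
    Clip("a_pl_severe", "ref_a", "1920x1080", 30, "packet_loss", "severe"),
    Clip("a_enc_moderate", "ref_a", "1920x1080", 30, "encoding", "moderate"),
    Clip("b_enc_mild", "ref_b", "1280x720", 25, "encoding", "mild"),
]
index = organize(clips)
```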
[0116] As shown in Figure 10, this disclosure also provides a video quality assessment method, which may include the following steps:
[0117] In step S21, the final quality assessment model trained according to the model training method for video quality assessment as described above is used to process the video data to be assessed, and a quality assessment score for the video data to be assessed is obtained.
[0118] An initial video quality assessment model is pre-set to fully extract image features and accurately detect boundaries in images. Training video data, including reference video data and distorted video data, is acquired. The initial video quality assessment model is then trained using both the reference and distorted video data to obtain the final video quality assessment model, ensuring the independence and diversity of the video data used for training the model. This final video quality assessment model can then be directly used to assess the quality of the video data to be evaluated, improving the accuracy of video quality assessment.
[0119] Based on the same technical concept, as shown in Figure 11, this disclosure also provides a model training apparatus for video quality assessment, which may include:
[0120] The acquisition module 101 is used to acquire training video data; wherein, the training video data includes reference video data and distorted video data.
[0121] Processing module 102 is used to determine the MOS value of each training video data.
[0122] Training module 103 is used to train a preset initial video quality assessment model based on training video data and its MOS value until the convergence condition is met, so as to obtain the final video quality assessment model.
[0123] In some embodiments, the training module 103 is used for:
[0124] The training set and validation set are determined based on a preset ratio and training video data, wherein the intersection of the training set and validation set is an empty set;
[0125] The parameters of the initial video quality assessment model are adjusted based on the training set and the MOS value of each video data in the training set, and the hyperparameters of the initial video quality assessment model are adjusted based on the validation set and the MOS value of each video data in the validation set, until the convergence condition is met.
[0126] In some embodiments, the convergence condition includes that the evaluation error rate of each video data in both the training and validation sets does not exceed a preset threshold, and the evaluation error rate is calculated using the following formula:
[0127] E = (|S - Mos|) / Mos, where,
[0128] E represents the evaluation error rate of the current video data;
[0129] S is the evaluation score of the current video data output by the initial quality assessment model after adjusting the parameters and hyperparameters;
[0130] Mos is the MOS value of the current video data.
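The evaluation error rate E = (|S - Mos|) / Mos and the convergence check built on it can be written directly. This is a minimal sketch: the function names, the example scores, and the thresholds 0.25 and 0.1 are assumptions chosen for illustration, not values from the disclosure.

```python
def evaluation_error_rate(score, mos):
    """E = |S - Mos| / Mos for one video."""
    if mos <= 0:
        raise ValueError("MOS must be positive")
    return abs(score - mos) / mos


def has_converged(score_mos_pairs, threshold):
    """True when every video's error rate does not exceed the preset threshold."""
    return all(evaluation_error_rate(score, mos) <= threshold
               for score, mos in score_mos_pairs)


# (model score S, MOS) pairs drawn from both the training and validation sets.
pairs = [(4.0, 5.0), (3.0, 3.0)]
error = evaluation_error_rate(4.0, 5.0)
loose = has_converged(pairs, threshold=0.25)
strict = has_converged(pairs, threshold=0.1)
```

Note that the check is per video, so a single outlier above the threshold is enough to keep training from being declared converged.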
[0131] In some embodiments, the initial video quality assessment model includes a three-dimensional convolutional neural network for extracting motion information from image frames.
[0132] In some embodiments, the initial video quality assessment model further includes an attention model, a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
[0133] In some embodiments, the attention model includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, a hierarchical convolutional network, an upsampling processing module, and an attention mechanism network. The dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
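To make "densely connected" concrete, the sketch below tracks channel counts through a dense convolutional module with four layers. It assumes DenseNet-style connectivity, where each layer receives the concatenation of the module input and all earlier layer outputs, and each layer adds a fixed number of channels (`growth_rate`). Both the connectivity details and the numbers are assumptions; the disclosure states only that each module has four cascaded densely connected convolutional layers.

```python
def dense_block_channels(in_channels, growth_rate, num_layers=4):
    """Return (input channels seen by each layer, output channels of the module)."""
    layer_inputs = [in_channels + i * growth_rate for i in range(num_layers)]
    out_channels = in_channels + num_layers * growth_rate
    return layer_inputs, out_channels


# Two cascaded dense convolutional modules, with assumed sizes.
first_inputs, first_out = dense_block_channels(in_channels=16, growth_rate=8)
second_inputs, second_out = dense_block_channels(in_channels=first_out, growth_rate=8)
```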
[0134] In some embodiments, the attention mechanism network includes a cascaded attention convolution module, a rectified linear unit (ReLU) activation module, a nonlinear activation module, and an attention upsampling processing module.
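The cascade in the attention mechanism network can be sketched on a one-dimensional feature list. Several choices here are assumptions made purely for illustration: the attention convolution is reduced to a scalar weight and bias, the nonlinear activation is taken to be a sigmoid, and the upsampling is nearest-neighbour repetition by a factor of 2.

```python
import math


def attention_weights(features, weight=1.0, bias=0.0, scale=2):
    """Convolution -> ReLU -> sigmoid -> nearest-neighbour upsampling (all assumed forms)."""
    conv = [weight * x + bias for x in features]          # stand-in for the attention convolution
    relu = [max(0.0, v) for v in conv]                    # rectified linear unit
    gated = [1.0 / (1.0 + math.exp(-v)) for v in relu]    # assumed sigmoid activation
    return [v for v in gated for _ in range(scale)]       # repeat each value `scale` times


weights = attention_weights([-1.0, 0.0, 2.0])
```

With these assumptions, non-positive inputs map to a weight of 0.5 and larger inputs approach 1, so the output can act as a soft mask over the upsampled feature positions.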
[0135] In some embodiments, the hierarchical convolutional network includes a first hierarchical network, a second hierarchical network, a third hierarchical network, and a fourth upsampling processing module. The first hierarchical network includes a cascaded first downsampling processing module and a first hierarchical convolutional module. The second hierarchical network includes a cascaded second downsampling processing module, a second hierarchical convolutional module, and a second upsampling processing module. The third hierarchical network includes a cascaded global pooling module, a third hierarchical convolutional module, and a third upsampling processing module. The first hierarchical convolutional module is also cascaded with the second downsampling processing module. The first hierarchical convolutional module and the second upsampling processing module are cascaded with the fourth upsampling processing module. The fourth upsampling processing module and the third upsampling processing module are also cascaded with the third hierarchical convolutional module.
[0136] In some embodiments, the processing module 102 is configured to:
[0137] Divide the training video data into groups, each group including one reference video data and multiple distorted video data, wherein the resolution and frame rate of each video data in each group are the same;
[0138] Classify the distorted video data in each group;
[0139] Grade the distorted video data of each category in each group;
[0140] Determine the MOS value of each training video data based on the grouping, classification, and grading of each training video data.
[0141] Based on the same technical concept, as shown in Figure 12, this disclosure also provides a video quality assessment apparatus, including:
[0142] The evaluation module 201 is used to process the video data to be evaluated according to the final quality evaluation model trained by the model training method for video quality evaluation as described above, and to obtain the quality evaluation score of the video data to be evaluated.
[0143] In addition, as shown in Figure 13, an embodiment of this disclosure also provides an electronic device, including:
[0144] One or more processors 301;
[0145] Storage device 302, on which one or more programs are stored;
[0146] When the one or more programs are executed by the one or more processors 301, the one or more processors 301 perform at least one of the following methods:
[0147] The model training methods for video quality assessment provided in the embodiments described above;
[0148] The video quality assessment methods provided in the embodiments described above.
[0149] In addition, as shown in Figure 14, this disclosure also provides a computer storage medium having a computer program stored thereon, wherein the program, when executed, implements at least one of the following methods:
[0150] The model training methods for video quality assessment provided in the embodiments described above;
[0151] The video quality assessment methods provided in the embodiments described above.
[0152] It will be understood by those skilled in the art that all or some of the steps in the methods disclosed above, and the functional modules / units in the apparatus, can be implemented as software, firmware, hardware, and suitable combinations thereof. In hardware implementations, the division between functional modules / units mentioned in the above description does not necessarily correspond to the division of physical components; for example, a physical component may have multiple functions, or a function or step may be performed collaboratively by several physical components. Some or all physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on a computer-readable medium, which may include computer storage media (or non-transitory media) and communication media (or transient media). As is known to those skilled in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, disk storage or other magnetic storage devices, or any other medium that can be used to store desired information and can be accessed by a computer. 
Furthermore, it is well known to those skilled in the art that communication media typically contain computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms, and may include any information delivery medium.
[0153] Example embodiments have been disclosed herein, and while specific terminology has been used, it is for illustrative purposes only and should be construed as such, and is not intended to be limiting. In some instances, it will be apparent to those skilled in the art that features, characteristics, and / or elements described in connection with particular embodiments may be used alone, or in combination with features, characteristics, and / or elements described in connection with other embodiments, unless otherwise expressly indicated. Therefore, those skilled in the art will understand that various changes in form and detail may be made without departing from the scope of this disclosure as set forth by the appended claims.
Claims
1. A model training method for video quality assessment, comprising: acquiring training video data, wherein the training video data includes reference video data and distorted video data; determining the mean opinion score (MOS) value of each training video data; and training a preset initial video quality assessment model based on the training video data and its MOS values until a convergence condition is met, thereby obtaining a final video quality assessment model; wherein determining the MOS value of each training video data includes: dividing the training video data into groups, each group including one reference video data and multiple distorted video data, wherein the resolution and frame rate of each video data in each group are the same; classifying each video data in each group; grading each video data in each category in each group; and determining the MOS value of each training video data based on the grouping, classification, and grading of the training video data; wherein training the preset initial video quality assessment model based on the training video data and its MOS values until the convergence condition is met includes: determining a training set and a validation set based on a preset ratio and the training video data, wherein the intersection of the training set and the validation set is an empty set; adjusting the parameters of the initial video quality assessment model based on the training set and the MOS value of each video data in the training set; and adjusting the hyperparameters of the initial video quality assessment model based on the validation set and the MOS value of each video data in the validation set, until the convergence condition is met; wherein the initial video quality assessment model includes an attention model, which comprises a hierarchical convolutional network;
wherein the hierarchical convolutional network includes a first hierarchical network, a second hierarchical network, a third hierarchical network, and a fourth upsampling processing module; the first hierarchical network includes a cascaded first downsampling processing module and a first hierarchical convolutional module; the second hierarchical network includes a cascaded second downsampling processing module, a second hierarchical convolutional module, and a second upsampling processing module; the third hierarchical network includes a cascaded global pooling module, a third hierarchical convolutional module, and a third upsampling processing module; the output of the third hierarchical convolutional module is multiplied by the output of the fourth upsampling processing module to obtain a multiplication result; the multiplication result is added to the output of the third upsampling processing module to obtain a sum result; and the sum result is input into the third hierarchical convolutional module for processing to obtain the output of the hierarchical convolutional network.
2. The method according to claim 1, wherein the convergence condition includes that the evaluation error rate of each video data in both the training set and the validation set does not exceed a preset threshold, and the evaluation error rate is calculated using the following formula: E = (|S - Mos|) / Mos, where E represents the evaluation error rate of the current video data; S is the evaluation score of the current video data output by the initial video quality assessment model after adjusting the parameters and hyperparameters; and Mos is the MOS value of the current video data.
3. The method according to any one of claims 1 to 2, wherein the initial video quality assessment model also includes a three-dimensional convolutional neural network for extracting motion information from image frames.
4. The method according to claim 3, wherein the initial video quality assessment model also includes a data fusion processing module, a global pooling module, and a fully connected layer, wherein the attention model, the data fusion processing module, the three-dimensional convolutional neural network, the global pooling module, and the fully connected layer are cascaded in sequence.
5. The method according to claim 1, wherein the attention model further includes a cascaded multi-input network, a two-dimensional convolutional module, a dense convolutional network, a downsampling processing module, an upsampling processing module, and an attention mechanism network; the dense convolutional network includes at least two cascaded dense convolutional modules, and each dense convolutional module includes four cascaded densely connected convolutional layers.
6. The method according to claim 5, wherein the attention mechanism network includes a cascaded attention convolution module, a rectified linear unit (ReLU) activation module, a nonlinear activation module, and an attention upsampling processing module.
7. A video quality assessment method, comprising: processing video data to be assessed using the final video quality assessment model trained by the method according to any one of claims 1-6, to obtain a quality assessment score for the video data to be assessed.
8. A model training device for video quality assessment, comprising: an acquisition module, configured to acquire training video data, wherein the training video data includes reference video data and distorted video data; a processing module, configured to determine the mean opinion score (MOS) value of each training video data; and a training module, configured to train a preset initial video quality assessment model based on the training video data and its MOS values until a convergence condition is met, so as to obtain a final video quality assessment model; wherein determining the MOS value of each training video data includes: dividing the training video data into groups, each group including one reference video data and multiple distorted video data, wherein the resolution and frame rate of each video data in each group are the same; classifying each video data in each group; grading each video data in each category in each group; and determining the MOS value of each training video data based on the grouping, classification, and grading of the training video data; wherein training the preset initial video quality assessment model based on the training video data and its MOS values until the convergence condition is met includes: determining a training set and a validation set based on a preset ratio and the training video data, wherein the intersection of the training set and the validation set is an empty set; adjusting the parameters of the initial video quality assessment model based on the training set and the MOS value of each video data in the training set; and adjusting the hyperparameters of the initial video quality assessment model based on the validation set and the MOS value of each video data in the validation set, until the convergence condition is met; wherein the initial video quality assessment model includes an attention model, which comprises a hierarchical convolutional network;
wherein the hierarchical convolutional network includes a first hierarchical network, a second hierarchical network, a third hierarchical network, and a fourth upsampling processing module; the first hierarchical network includes a cascaded first downsampling processing module and a first hierarchical convolutional module; the second hierarchical network includes a cascaded second downsampling processing module, a second hierarchical convolutional module, and a second upsampling processing module; the third hierarchical network includes a cascaded global pooling module, a third hierarchical convolutional module, and a third upsampling processing module; the output of the third hierarchical convolutional module is multiplied by the output of the fourth upsampling processing module to obtain a multiplication result; the multiplication result is added to the output of the third upsampling processing module to obtain a sum result; and the sum result is input into the third hierarchical convolutional module for processing to obtain the output of the hierarchical convolutional network.
9. A video quality assessment device, comprising: an evaluation module, configured to process video data to be evaluated according to the final video quality assessment model trained by the model training method for video quality assessment according to any one of claims 1-6, and to obtain a quality evaluation score of the video data to be evaluated.
10. An electronic device, comprising: one or more processors; and a storage device on which one or more programs are stored; wherein, when the one or more programs are executed by the one or more processors, the one or more processors perform at least one of the following methods: the model training method for video quality assessment as described in any one of claims 1-6; the video quality assessment method as described in claim 7.
11. A computer storage medium having a computer program stored thereon, wherein, when the program is executed, it implements at least one of the following methods: the model training method for video quality assessment as described in any one of claims 1-6; the video quality assessment method as described in claim 7.