A blind reference video quality assessment method based on time sequence self-supervision
By constructing a Siamese neural network based on Video Swin Transformer, the temporal feature differences between sequential and out-of-order videos are learned, solving the problem of insufficient utilization of video temporal relationships in existing technologies and achieving accurate prediction of video quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THE 54TH RESEARCH INSTITUTE OF CHINA ELECTRONICS TECHNOLOGY GROUP CORPORATION
- Filing Date
- 2026-03-30
- Publication Date
- 2026-06-12
AI Technical Summary
Existing blind reference video quality assessment methods do not fully consider the temporal relationships of videos, resulting in inaccurate video quality prediction results.
We construct a Siamese neural network based on Video Swin Transformer, capture the temporal relationship of videos by learning the temporal feature differences between sequential and out-of-order videos, and optimize the network parameters using a temporal self-supervised loss function and a quality regression loss function.
It achieves accurate prediction of video quality, improves the model's performance on multiple benchmark datasets, and is particularly robust to video quality assessment in complex scenarios.
Figure CN122200514A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of multimedia processing technology, and in particular relates to a blind reference video quality assessment method based on temporal self-supervision.
Background Technology
[0002] In the field of blind / no-reference video quality assessment (NR-VQA), a variety of deep learning-based methods have emerged in recent years. Smith et al. [Smith L, Johnson M, Lee K, et al. No-reference video quality assessment using deep convolutional neural networks [J]. IEEE Transactions on Image Processing, 2017, 26 (11): 5372-5385.] proposed a no-reference evaluation framework based on convolutional neural networks (CNNs). This framework extracts spatial distortion features within video frames through multi-layer convolutional structures and combines temporal pooling layers to capture abrupt inter-frame changes, significantly improving sensitivity to spatial distortions such as blur and noise. Li et al. [Li X, Wang Z, Bovik AC. Deep learning of temporal features for no-reference video quality assessment [J]. IEEE Signal Processing Letters, 2018, 25 (10): 1510-1514.] further used Long Short-Term Memory (LSTM) networks to model the temporal dynamics of video sequences. By learning the correlation between motion vectors and quality degradation, this approach addresses the evaluation of temporal distortions such as motion blur and frame loss. Wang et al. [Wang Y, Sun C, Liu Z, et al. Transformer-based spatio-temporal modeling for no-reference video quality assessment [C]. Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition, 2021: 8324-8333.] proposed a Transformer-based cross-frame attention mechanism to address coupled spatiotemporal distortion in complex scenes. By dynamically associating key feature regions at different times, it improves the robustness of quality assessment for fast-moving scenes. Zhang et al. [Zhang H, Yang W, An P, et al. Perceptual-aware no-reference video quality assessment via deep reinforcement learning [J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2022, 18 (4): 1-20.] combined deep learning with visual perception mechanisms. By simulating the human visual attention allocation process through reinforcement learning, they prioritized salient regions in the video, reducing the interference of redundant background information on the evaluation results. In addition, Chinese patent CN113837652A [Zhao Ming, Li Na, Wang Bo. A No-Reference Video Quality Assessment Method Based on Generative Adversarial Networks [P]. China: CN113837652A, 2021-12-24.] proposes constructing a distortion simulator using generative adversarial networks (GANs). By comparing the distribution differences between generated samples and the video to be evaluated, it achieves a comprehensive assessment of mixed distortions such as compression distortion and transmission packet loss.
[0003] In recent years, the field of blind reference video quality assessment has seen continuous innovation. Zhao et al. [Zhao Y, Chen X, Zhang Y, et al. Semi-supervised blind video quality assessment via knowledge distillation and incremental learning [C]. Proceedings of the IEEE / CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023: 321-330.] proposed a semi-supervised framework based on knowledge distillation and incremental learning. By assigning pseudo-labels to unlabeled data and selecting representative examples over multiple learning iterations, they improved the model's evaluation performance when labeled data is insufficient, achieving leading results on multiple benchmark datasets and demonstrating its potential to address data scarcity in real-world applications. Liu et al. [Liu Z, Wang H, Li J, et al. Frequency-domain aware blind video quality assessment using deep neural networks [J]. Signal Processing: Image Communication, 2024, 125: 116932.] took a frequency-domain perspective, constructing a deep neural network to mine the relationship between video frequency-domain features and quality degradation. The network captures the loss of frequency-domain information caused by video compression and transmission and thereby evaluates video quality; when processing videos of complex scenes in particular, it shows better accuracy than traditional spatial-domain methods. In 2025, a research team from the Communication University of China designed a multi-branch encoder architecture to address the spatiotemporal distortion unique to AI-generated videos. This architecture models the video along three dimensions: technical quality, motion quality, and semantic content. Combined with a multimodal prompt engineering framework and semantic anchors, it leverages a large language model for associative reasoning and improves prediction accuracy through LoRA fine-tuning. The solution achieved over 60% consistency with the subjective quality labels (mean opinion scores, MOS), opening a new path for quality assessment of AI-generated videos (CVPR 2025 NTIRE Workshop [Liu X, Min X, Hu Q, et al. NTIRE 2025 XGC Quality Assessment Challenge: Methods and Results [C]. Proceedings of the Computer Vision and Pattern Recognition Conference, 2025: 1389-1402.]).
[0004] However, none of the aforementioned existing technologies fully considers the impact of video temporal relationships on video quality or makes effective use of those relationships, which leads to inaccurate video quality predictions and leaves room for further improvement.
Summary of the Invention
[0005] In view of this, this invention proposes a blind reference video quality assessment method based on temporal self-supervision. This method constructs a Siamese neural network based on the Video Swin Transformer and effectively captures the temporal relationship of videos by learning the temporal feature differences between sequential and out-of-order videos, thus achieving accurate prediction of video quality.
[0006] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0007] A blind reference video quality assessment method based on temporal self-supervision includes the following steps:
[0008] Step 1: Obtain sample videos and construct training and testing sets. The training set includes the original training videos and the shuffled videos corresponding to the original training videos, while the testing set only contains the original test videos.
[0009] Step 2: Construct a Siamese neural network based on Video Swin Transformer. The Siamese neural network contains two branches with shared structural parameters, which process the original video and the out-of-order video respectively. The first branch of the Siamese neural network is also connected to a quality regression layer, which takes the temporal features of the original video as input and outputs a single-value quality score through multi-layer mapping.
[0010] Step 3: Construct a total loss function that includes a temporal self-supervised loss function and a quality regression loss function. Use the Adam optimizer to simultaneously optimize the parameters of the Siamese neural network and the quality regression layer. Then, use a test set to test the trained network to obtain a trained video quality assessment neural network.
[0011] Step 4: Input the original video to be evaluated into the first branch of the Siamese neural network. The quality regression layer outputs the quality score, which is the quality evaluation result.
[0012] Furthermore, the method for obtaining the out-of-order video is as follows:
[0013] The original training video is broken down into consecutive video frames;
[0014] The order of video frames is randomly shuffled and then reassembled to obtain a shuffled video with the same duration as the original training video.
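The frame-shuffling step described above can be sketched as follows. This is a minimal illustration, assuming the decoded frames are held in a Python list; the function name `shuffle_frames` and the `seed` parameter are hypothetical and are not part of the disclosure.

```python
import random


def shuffle_frames(frames, seed=None):
    """Return a copy of `frames` in random temporal order.

    The number of frames is unchanged, so at a fixed frame rate the
    shuffled video has the same duration as the original video.
    """
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    shuffled = list(frames)    # copy, so the original order is preserved
    rng.shuffle(shuffled)
    return shuffled


# Stand-in for decoded video frames (hypothetical frame identifiers).
original = [f"frame_{i:03d}" for i in range(8)]
shuffled = shuffle_frames(original, seed=0)
```

A random permutation can, with small probability, reproduce the original order; a practical implementation might redraw the permutation in that case.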
[0015] Furthermore, the Siamese neural network contains two branches with identical structures, each of which is a Video Swin Transformer with the feature mapping layer removed, retaining only multi-layer self-attention modules and temporal convolution modules.
[0016] Furthermore, the quality regression layer is composed of a 512-dimensional fully connected layer, a random dropout layer with a dropout rate of 0.3, a 128-dimensional fully connected layer, and a nonlinear mapping layer using the ReLU activation function, all connected in series; the quality score output by the quality regression layer ranges from [0, 100].
[0017] Furthermore, the videos input to the Siamese neural network are all scaled to a size of 384×384 pixels, and the pixel values are normalized to the range of [0, 1].
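The preprocessing described above can be illustrated with a small sketch. The interpolation method is not stated in the text, so nearest-neighbour sampling is used here purely as an assumption, as are the 8-bit input range and the helper names `resize_nearest` and `normalize`.

```python
def resize_nearest(frame, out_h=384, out_w=384):
    """Nearest-neighbour resize of a 2-D frame given as a list of rows.

    ASSUMPTION: the source does not name an interpolation method.
    """
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def normalize(frame, max_value=255):
    """Map integer pixel values in [0, max_value] to floats in [0, 1]."""
    return [[p / max_value for p in row] for row in frame]


# A tiny 2x2 8-bit grayscale frame standing in for a decoded video frame.
tiny = [[0, 255], [128, 64]]
out = normalize(resize_nearest(tiny))
```

Colour frames would apply the same two operations to each channel.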
[0018] Furthermore, step 3 is performed as follows:
[0019] Step 301: Construct the total loss function $L$:
[0020] $$L = L_{ts} + L_{q}$$
[0021] where $L_{ts}$ is the temporal self-supervised loss function and $L_{q}$ is the quality regression loss function, calculated as follows:
[0022] $$L_{ts} = \ell_1(\Delta T, \Delta F), \quad \Delta T = T_o - T_s, \quad \Delta F = F_o - F_s$$
[0023] $$L_{q} = \mathrm{SmoothL1}(S_{gt}, S)$$
[0024] In the formulas, $\Delta T$ is the difference in temporal information between the original video and the out-of-order video; $T_o$ is the temporal coherence score of the original video, calculated from inter-frame optical flow continuity; $T_s$ is the temporal coherence score of the out-of-order video, calculated from inter-frame optical flow continuity; $\Delta F$ is the feature difference extracted by the Video Swin Transformer; $F_o$ is the temporal feature of the original video output by the first branch; $F_s$ is the temporal feature of the out-of-order video output by the second branch; $\ell_1(\cdot)$ denotes the L1 function; $\mathrm{SmoothL1}(\cdot)$ denotes the Smooth L1 loss function; $S_{gt}$ is the quality score label; and $S$ is the quality score output by the quality regression layer;
[0025] Step 302: Set the number of training epochs to 45 and the initial learning rate to 0.001, and after every 5 epochs reduce the learning rate to 0.9 times its previous value. During training, input the original training videos and their corresponding shuffled videos from the training set into the two branches of the Siamese neural network, respectively, calculate the loss using the total loss function, and simultaneously optimize the parameters of the Siamese neural network and the quality regression layer using the Adam optimizer.
[0026] Step 303: Input the test videos in the test set into the first branch of the Siamese neural network, output the quality scores through the quality regression layer, and calculate the Spearman coefficient between the predicted quality scores and the quality score labels over the entire test set; stop training when the Spearman coefficient reaches its maximum, otherwise return to step 302 to continue training.
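The learning-rate schedule of step 302 can be written out explicitly. The sketch below assumes that the 0.9 factor compounds (each reduction applies to the current rate) and that epochs are counted from 0, so the first reduction takes effect once 5 epochs have completed; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.9, step=5):
    """Learning rate for a 0-indexed epoch under step decay.

    ASSUMPTION: the decay compounds, i.e. lr = base_lr * decay ** (epoch // step).
    """
    return base_lr * decay ** (epoch // step)


# Learning rate used in each of the 45 training epochs.
schedule = [learning_rate(epoch) for epoch in range(45)]
```

Under the same assumption, this corresponds to PyTorch's `torch.optim.lr_scheduler.StepLR` with `step_size=5` and `gamma=0.9`, stepped once per epoch.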
[0027] The beneficial effects of this invention are as follows:
[0028] 1. The neural network structure used in this invention is simple and easy to implement.
[0029] 2. This invention constructs a Siamese neural network based on the Video Swin Transformer and effectively captures the temporal relationship of videos by learning the temporal feature differences between sequential and out-of-order videos. It fully considers the impact of video temporal relationship on video quality and can achieve accurate prediction of video quality.
Attached Figure Description
[0030] Figure 1 is a schematic diagram illustrating the principle of the present invention.
Detailed Implementation
[0031] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings.
[0032] A blind reference video quality assessment method based on temporal self-supervision includes the following steps:
[0033] Step 1: Obtain sample videos and construct training and testing sets. The training set includes the original training videos and the corresponding shuffled videos, while the testing set contains only the original test videos. The shuffled videos are obtained by decomposing the video into a series of video frames and randomly arranging and combining the video frames.
[0034] Step 2: Construct a Siamese neural network based on Video Swin Transformer to support feature extraction of the original video data and the corresponding out-of-order video data. The Siamese neural network contains two branches with shared structural parameters, which process the original video and the out-of-order video respectively. The first branch of the Siamese neural network is also connected to a quality regression layer, which takes the temporal features of the original video as input and outputs a single-value quality score through multi-layer mapping.
[0035] The Siamese neural network is used to extract the temporal features of the original video and the temporal features of the out-of-order video. The backbone network of the Siamese neural network is the Video Swin Transformer with the feature mapping layer removed.
[0036] Step 3: Construct a total loss function that includes a temporal self-supervised loss function and a quality regression loss function. Use the Adam optimizer to simultaneously optimize the parameters of the Siamese neural network and the quality regression layer, and use a test set to test the trained network to obtain a trained video quality assessment neural network. Among them, the temporal self-supervised loss function can be used to learn the temporal differences between sequential and out-of-order videos.
[0037] The temporal self-supervised loss function first calculates the difference in temporal information between the original video and the out-of-order video, and then calculates the difference in temporal information between the features of the original video extracted by the Video Swin Transformer and the features of the out-of-order video. Finally, it constrains the distance between the two differences based on the L1 function.
[0038] The network parameters are updated using backpropagation until the Spearman coefficients on the test set reach their maximum, at which point the training process stops. To optimize the network parameters, this method uses Adam as the optimizer, trains for 45 epochs, and sets the learning rate to 0.9 times the original rate every 5 epochs.
[0039] Step 4: Input the original video to be evaluated into the first branch of the Siamese neural network. The quality regression layer outputs the quality score, which is the quality evaluation result.
[0040] To avoid video memory overflow, the video images are uniformly scaled to 384×384 pixels.
[0041] Here is a more specific example:
[0042] A blind reference video quality assessment method based on temporal self-supervision includes the following steps:
[0043] Step 1: Construct the training set and the test set.
[0044] This method is trained and tested on the KoNViD-1k, LIVE-VQC, and YouTube-UGC video quality assessment datasets. These datasets cover various distortion types, including compression distortion, motion blur, and packet loss. The method employs 5-fold cross-validation for training and testing, selecting 80% of the data as the training set and 20% as the test set each time, and using the average of the five tests as the model's final performance.
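The 5-fold protocol described above can be sketched as follows. The random assignment of videos to folds, the seed, and the function name `five_fold_splits` are assumptions made for illustration; the text only states that each round uses 80% of the data for training and 20% for testing.

```python
import random


def five_fold_splits(video_ids, k=5, seed=0):
    """Yield (train, test) id lists for k-fold cross-validation."""
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)       # ASSUMPTION: random fold assignment
    folds = [ids[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [v for j, fold in enumerate(folds) if j != i for v in fold]
        yield train, test


# 1200 hypothetical video ids; each round trains on 960 and tests on 240.
splits = list(five_fold_splits(range(1200)))
```

The reported performance would then be the mean of the metric computed on the five test folds.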
[0045] For the training set, it is necessary to obtain the shuffled video of each original training video. The specific method is as follows:
[0046] The original training video is decomposed into a series of consecutive video frames, denoted as $\{f_1, f_2, \ldots, f_n\}$ (where $n$ is the total number of video frames); subsequently, the frame sequence is randomly shuffled and reassembled to generate an out-of-order frame sequence $\{f_{\pi(1)}, f_{\pi(2)}, \ldots, f_{\pi(n)}\}$, where $\pi$ is a random permutation of $\{1, \ldots, n\}$. This process yields a scrambled video with the same duration as the original video. By constructing temporal comparison samples between the original video and the scrambled video, a supervisory signal is provided for subsequent temporal feature learning.
[0047] Step 2: Construct a twin feature extraction network.
[0048] To simultaneously extract temporal features from the original video and the out-of-order video, this method constructs a Siamese neural network architecture based on the Video Swin Transformer. This Siamese neural network contains two branches with shared structural parameters, processing the original video and the out-of-order video respectively. The backbone network adopts the basic framework of the Video Swin Transformer. To effectively extract the temporal features of the video, the original feature projection layer is removed, while multiple self-attention modules and temporal convolutional modules are retained to enhance the ability to capture inter-frame dependencies. The two branches output in parallel the temporal features of the original video, $F_o$, and the temporal features of the out-of-order video, $F_s$.
[0049] Step 3: Construct the quality regression layer.
[0050] To predict video quality scores, this method constructs a quality regression layer, which consists of a fully connected layer (512 dimensions), a random dropout layer (dropout rate 0.3), another fully connected layer (128 dimensions), and a nonlinear mapping layer (ReLU activation function) connected in series. This quality regression layer takes the temporal features $F_o$ of the original video as input and outputs a single-value quality score $S$ (in the range [0, 100]) through multi-layer mapping, realizing the mapping from temporal features to quality assessment.
[0051] The complete network is shown in Figure 1. The principle of this network is as follows:
[0052] First, the original videos are randomly shuffled along the time dimension to obtain scrambled videos, and the original videos and their corresponding scrambled videos are paired as training data. Second, a Siamese neural network is used as the model architecture, with Video Swin Transformer as the backbone network. Then, the dual-branch structure of the Siamese neural network is used to extract the temporal features of the original videos and scrambled videos respectively, and the differences between the two sets of features are learned to effectively understand the temporal relationship of the videos. The spatiotemporal features of the original videos extracted by Video Swin Transformer are mapped to quality scores through fully connected layers.
[0053] Step 4: Construct the time-series self-supervised loss function.
[0054] To guide the Siamese network in learning the temporal differences between sequential and out-of-order videos, this method designs a temporal self-supervised loss function. The specific process is as follows:
[0055] First, the difference in temporal information between the original video and the out-of-order video is calculated and defined as:
[0056] $$\Delta T = T_o - T_s,$$
[0057] where $T_o$ is the temporal coherence score of the original video, calculated from inter-frame optical flow continuity, and $T_s$ is the temporal coherence score of the out-of-order video;
[0058] Secondly, the feature differences extracted by the Video Swin Transformer are calculated and defined as follows:
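[0059] $$\Delta F = F_o - F_s,$$ where $F_o$ and $F_s$ are the temporal features of the original video and the out-of-order video output by the two branches;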
[0060] Finally, by constraining the consistency of the two differences using the L1 function, the loss function formula is as follows:
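[0061] $$L_{ts} = \ell_1(\Delta T, \Delta F),$$ where $\Delta T$ and $\Delta F$ are the temporal information difference and the feature difference defined above, and $\ell_1(\cdot)$ denotes the L1 function.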
[0062] This loss function allows the network to learn the correlation between temporal structure and feature representation.
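The temporal self-supervised loss described above can be sketched numerically. The text does not state how the feature difference is reduced to a scalar comparable with the temporal information difference; the sketch assumes the mean absolute element-wise difference between the two feature vectors. All function names are hypothetical, and the coherence scores are taken as given inputs rather than computed from optical flow.

```python
def temporal_info_difference(t_orig, t_shuf):
    """Difference between the temporal coherence scores of the two videos."""
    return t_orig - t_shuf


def feature_difference(f_orig, f_shuf):
    """Scalar difference between the two branch outputs.

    ASSUMPTION: mean absolute element-wise difference; the source does
    not specify this reduction.
    """
    return sum(abs(a - b) for a, b in zip(f_orig, f_shuf)) / len(f_orig)


def temporal_self_supervised_loss(t_orig, t_shuf, f_orig, f_shuf):
    """L1 distance between the temporal difference and the feature difference."""
    delta_t = temporal_info_difference(t_orig, t_shuf)
    delta_f = feature_difference(f_orig, f_shuf)
    return abs(delta_t - delta_f)


# Hypothetical coherence scores and three-dimensional features.
loss = temporal_self_supervised_loss(0.9, 0.4, [1.0, 2.0, 3.0], [0.5, 2.5, 2.0])
```

Minimizing this quantity pushes the feature gap between the two branches to track the measured gap in temporal coherence.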
[0063] Furthermore, a quality regression loss function is constructed based on Smooth L1 loss:
[0064] $$L_{q} = \mathrm{SmoothL1}(S_{gt}, S)$$
[0065] where $S_{gt}$ is the quality score label and $S$ is the score predicted by the network from the original video.
[0066] The temporal self-supervised loss $L_{ts}$ and the quality regression loss function $L_{q}$ are combined to construct the total loss function:
[0067] $$L = L_{ts} + L_{q}$$
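The quality regression loss and the total loss can be sketched as follows. The Smooth L1 function is shown in its common form with threshold `beta = 1.0` (the default of `torch.nn.SmoothL1Loss`); the equal-weight sum of the two terms is an assumption, since no weighting coefficient is given in the text, and the function names are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss for one sample: quadratic near zero, linear elsewhere."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def total_loss(l_ts, pred_score, label_score):
    """Total loss for one sample.

    ASSUMPTION: the temporal self-supervised loss and the quality
    regression loss are added with equal weight.
    """
    return l_ts + smooth_l1(pred_score, label_score)


small_error = smooth_l1(70.0, 70.5)  # quadratic branch
large_error = smooth_l1(60.0, 70.0)  # linear branch
combined = total_loss(0.2, 60.0, 70.0)
```

The linear branch keeps gradients bounded for large score errors, which is the usual reason for preferring Smooth L1 over a squared error in score regression.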
[0068] Step 5: Train the network.
[0069] To avoid memory overflow, all video frames were uniformly scaled to 384×384 pixels, and pixel values were normalized to the range [0, 1]. During the training phase, random horizontal flipping was used to augment the data and improve the model's robustness to changes in video capture conditions. The specific training process is as follows:
[0070] First, features are extracted using a Siamese neural network, taking the original training video and the disordered video pair as input.
[0071] Subsequently, the parameters of the Siamese network and the quality regression layer were simultaneously optimized using the Adam optimizer. The number of training rounds was set to 45, the initial learning rate was 0.001, and the learning rate was reduced to 0.9 times the original value every 5 training rounds.
[0072] Finally, training stops when the Spearman's Rank Correlation Coefficient (SROCC) on the test set reaches its maximum, and the network parameters at that point are saved as the optimal model.
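The model-selection criterion can be illustrated with a self-contained SROCC computation. The per-epoch predictions and labels below are invented toy values, and the helper names are hypothetical; SROCC is computed as the Pearson correlation of average ranks.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def srocc(pred, label):
    """Spearman rank-order correlation between predictions and labels."""
    return pearson(ranks(pred), ranks(label))


# Toy quality labels and hypothetical predictions after three epochs.
labels = [30.0, 55.0, 42.0, 80.0, 67.0]
epoch_preds = {
    1: [50.0, 40.0, 45.0, 60.0, 42.0],
    2: [35.0, 50.0, 52.0, 70.0, 66.0],
    3: [28.0, 57.0, 40.0, 85.0, 60.0],
}
scores = {epoch: srocc(pred, labels) for epoch, pred in epoch_preds.items()}
best_epoch = max(scores, key=scores.get)  # epoch whose parameters would be saved
```

In this toy run the third epoch orders the videos exactly as the labels do, so its parameters would be the ones kept.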
[0073] Step 6: Input the original video to be evaluated into the first branch of the Siamese neural network, and the quality regression layer outputs the quality score, which is the quality evaluation result.
[0074] The following is a simulation test of this method:
[0075] The hardware platform for model training is an NVIDIA 3090 GPU equipped with 24GB of video memory, the running environment is Ubuntu 22.04 LTS, and the deep learning framework used is PyTorch 2.0.
[0076] Table 1 below shows a comparison of the quality prediction performance of different schemes on the KoNViD-1k, LIVE-VQC, and YouTube-UGC datasets:
[0077]
[0078] In the table above, comparison scheme 1 is the scheme proposed by Li et al. [Li D, Jiang T, Jiang M. Quality assessment of in-the-wild videos[C]//Proceedings of the 27th ACM international conference on multimedia. 2019: 2351-2359.], comparison scheme 2 is the scheme proposed by Tu et al. [Tu Z, Yu X, Wang Y, et al. RAPIQUE: Rapid and accurate video quality prediction of user generated content[J]. IEEE Open Journal of Signal Processing, 2021, 2: 425-440.], comparison scheme 3 is the scheme proposed by Ying et al. [Ying Z, Mandal M, Ghadiyaram D, et al. Patch-VQ: 'Patching up' the video quality problem[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 14019-14029.], comparison scheme 4 is the scheme proposed by Li et al. [Li B, Zhang W, Tian M, et al. Blindly assess quality of in-the-wild videos via quality-aware pre-training and motion perception[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(9): 5944-5958.], and comparison scheme 5 is the scheme proposed by Madhusudana et al. [Madhusudana PC, Birkbeck N, Wang Y, et al. CONVIQT: Contrastive video quality estimator[J]. IEEE Transactions on Image Processing, 2023, 32: 5138-5152.].
[0079] In Table 1, SROCC denotes the Spearman rank-order correlation coefficient and PLCC denotes the Pearson linear correlation coefficient. Both coefficients take values in [-1, 1]; the closer the value is to 1, the better the model performance. As shown in Table 1, this method achieved the best performance on all three datasets, demonstrating that this invention can effectively predict video quality.
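For reference, the two coefficients are commonly defined as below. The symbols are introduced here for illustration and do not appear in the original text: $s_i$ and $q_i$ are the predicted and subjective scores of the $i$-th test video, $\bar{s}$ and $\bar{q}$ are their means, $n$ is the number of test videos, and $d_i$ is the difference between the ranks of $s_i$ and $q_i$. The SROCC expression shown assumes no tied ranks.

```latex
\mathrm{PLCC} = \frac{\sum_{i=1}^{n}(s_i-\bar{s})(q_i-\bar{q})}
                     {\sqrt{\sum_{i=1}^{n}(s_i-\bar{s})^{2}}\,\sqrt{\sum_{i=1}^{n}(q_i-\bar{q})^{2}}},
\qquad
\mathrm{SROCC} = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}
```

SROCC measures how well the predicted ordering of videos matches the subjective ordering, while PLCC measures the linear agreement between the score values themselves.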
[0080] This invention can accurately assess video quality and assist streaming media platforms in evaluating the quality of uploaded videos.
Claims
1. A blind reference video quality assessment method based on temporal self-supervision, characterized in that, Includes the following steps: Step 1: Obtain sample videos and construct training and testing sets. The training set includes the original training videos and the shuffled videos corresponding to the original training videos, while the testing set only contains the original test videos. Step 2: Construct a Siamese neural network based on Video Swin Transformer. The Siamese neural network contains two branches with shared structural parameters, which process the original video and the out-of-order video respectively. The first branch of the Siamese neural network is also connected to a quality regression layer, which takes the temporal features of the original video as input and outputs a single-value quality score through multi-layer mapping. Step 3: Construct a total loss function that includes a temporal self-supervised loss function and a quality regression loss function. Use the Adam optimizer to simultaneously optimize the parameters of the Siamese neural network and the quality regression layer. Then, use a test set to test the trained network to obtain a trained video quality assessment neural network. Step 4: Input the original video to be evaluated into the first branch of the Siamese neural network. The quality regression layer outputs the quality score, which is the quality evaluation result.
2. The method for blind reference video quality assessment based on temporal self-supervision as described in claim 1, characterized in that, The method for obtaining the disordered video is as follows: The original training video is broken down into consecutive video frames; The order of video frames is randomly shuffled and then reassembled to obtain a shuffled video with the same duration as the original training video.
3. The blind reference video quality assessment method based on temporal self-supervision according to claim 1, characterized in that, The Siamese neural network contains two branches with identical structures. Each branch is a Video Swin Transformer with the feature mapping layer removed, retaining only multi-layer self-attention modules and temporal convolution modules.
4. The blind reference video quality assessment method based on temporal self-supervision according to claim 1, characterized in that, The quality regression layer consists of a 512-dimensional fully connected layer, a random dropout layer with a dropout rate of 0.3, a 128-dimensional fully connected layer, and a nonlinear mapping layer using the ReLU activation function, all connected in series. The quality score output by the quality regression layer ranges from [0, 100].
5. The blind reference video quality assessment method based on temporal self-supervision according to claim 1, characterized in that, The videos input to the Siamese neural network are all scaled to 384×384 pixels, and the pixel values are normalized to the range of [0,1].
6. The blind reference video quality assessment method based on temporal self-supervision according to claim 1, characterized in that, The specific method for step 3 is as follows: Step 301: Construct the total loss function $L$: $$L = L_{ts} + L_{q}$$ where $L_{ts}$ is the temporal self-supervised loss function and $L_{q}$ is the quality regression loss function, calculated as follows: $$L_{ts} = \ell_1(\Delta T, \Delta F), \quad \Delta T = T_o - T_s, \quad \Delta F = F_o - F_s$$ $$L_{q} = \mathrm{SmoothL1}(S_{gt}, S)$$ In the formulas, $\Delta T$ is the difference in temporal information between the original video and the out-of-order video; $T_o$ is the temporal coherence score of the original video, calculated from inter-frame optical flow continuity; $T_s$ is the temporal coherence score of the out-of-order video, calculated from inter-frame optical flow continuity; $\Delta F$ is the feature difference extracted by the Video Swin Transformer; $F_o$ is the temporal feature of the original video output by the first branch; $F_s$ is the temporal feature of the out-of-order video output by the second branch; $\ell_1(\cdot)$ denotes the L1 function; $\mathrm{SmoothL1}(\cdot)$ denotes the Smooth L1 loss function; $S_{gt}$ is the quality score label; and $S$ is the quality score output by the quality regression layer; Step 302: Set the number of training epochs to 45 and the initial learning rate to 0.001, and after every 5 epochs reduce the learning rate to 0.9 times its previous value. During training, input the original training videos and their corresponding shuffled videos from the training set into the two branches of the Siamese neural network, respectively, calculate the loss using the total loss function, and simultaneously optimize the parameters of the Siamese neural network and the quality regression layer using the Adam optimizer. Step 303: Input the test videos in the test set into the first branch of the Siamese neural network, output the quality scores through the quality regression layer, and calculate the Spearman coefficient between the predicted quality scores and the quality score labels over the entire test set; stop training when the Spearman coefficient reaches its maximum, otherwise return to step 302 to continue training.