A method, system and terminal for no-reference video quality evaluation of frame-interpolated video

By extracting multi-scale features from interpolated videos and performing similarity calculations and pooling, combined with a quality regression network, the accuracy problem of interpolated video quality evaluation in existing technologies is solved, achieving efficient evaluation in the absence of reference.

CN117478973B · Active · Publication Date: 2026-06-16 · SHANGHAI JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI JIAOTONG UNIV
Filing Date
2023-10-31
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing no-reference video quality assessment methods cannot accurately evaluate the quality of interpolated videos, especially when there are unnatural changes such as artifacts or misalignments, which leads to reduced assessment accuracy and fails to meet the needs of real-time detection systems.

Method used

A pre-trained neural network is used to extract multi-scale feature maps from the interpolated video. By calculating the similarity between low-level image features and high-level semantic features and pooling, combined with a quality regression network, an objective quality evaluation score for the interpolated video is obtained.

Benefits of technology

In the absence of a reference, this improves the accuracy of quality evaluation for interpolated videos and effectively reflects the overall perceived quality of the interpolated video.



Abstract

The application provides a no-reference video quality evaluation method, system and terminal for frame-interpolated video, comprising: obtaining a single interpolated frame from the interpolated video and grouping it with its two adjacent frames to form a group of three frames; using a pre-trained neural network to extract multi-scale feature maps from each group of three frames and learn the correlation between adjacent frames, obtaining the feature map of each group; dividing the multi-scale features of each group's feature map into low-level image features and high-level semantic features; calculating the contextual similarity of the low-level image features to obtain the similarity feature vector of the three frames on the low-level features; pooling the high-level semantic features to obtain the feature vector of the three frames in the time dimension; and fusing the feature vectors and inputting them into a quality regression network to obtain the objective quality evaluation score of the interpolated video. The application can effectively evaluate the overall perceived quality of experience of frame-interpolated video.

Description

Technical Field

[0001] This invention relates to the field of multimedia quality assessment technology, and more specifically, to a method, system, and terminal for no-reference video quality assessment of frame-interpolated video.

Background Technology

[0002] With the development of mobile internet and smart devices, video data volume has experienced explosive growth, and video has increasingly become an indispensable part of people's daily lives. The proportion of video in internet streaming data and stored data is increasing daily. Video can be considered as a continuous sequence of images, and the size of video data is determined by the spatial resolution and number of images. The number of images per second, or video frame rate, significantly affects the amount of video data. The frame rate also affects the viewer's perceptual experience; the human eye is accustomed to the continuity of the real world, so a higher video frame rate can provide viewers with a greater sense of realism.

[0003] Due to bandwidth limitations in communication systems, video is compressed during transmission, restricting its frame rate. This introduces various distortions into the video signal, encompassing spatial and temporal dimensions, thus reducing the user's perceived experience. To compensate for these distortions on a temporal scale, video presentation terminals use frame interpolation to increase the frame rate. However, frame interpolation technology is still under development. Current techniques cannot accurately reproduce undistorted video signals and may even introduce new distortions due to information uncertainty, resulting in a lower perceived experience than uninterpolated video. Therefore, to ensure a good user experience, a quality evaluation system for interpolated video needs to be developed. This system should be able to take measures to protect the user experience when the quality of the interpolated video falls below a threshold.

[0004] Past research has proposed a large number of objective quality assessment algorithms, but there are few methods for quality assessment of interpolated videos. Danier et al. first raised the issue of frame-interpolated video quality in "FloLPIPS: A Bespoke Video Quality Metric for Frame Interpolation," 2022 Picture Coding Symposium (PCS), San Jose, CA, USA, 2022, pp. 283-287, but the method had low accuracy. Hou et al. improved performance to a usable level in "Hou, Q., Ghildyal, A., Liu, F. (2022). A Perceptual Quality Metric for Video Frame Interpolation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., Hassner, T. (eds) Computer Vision–ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13675. Springer, Cham." However, both of the above methods are full-reference objective quality assessment methods: they require the corresponding distortion-free video as a reference, which makes them impractical in real-world applications. Currently, there is a lack of research on no-reference video quality assessment for frame-interpolated videos, even though a no-reference method, which needs no reference video, is more easily applied to real-time detection systems.

[0005] In general no-reference video quality assessment studies, due to the redundancy of video information, researchers typically extract video features using the same frame intervals. Furthermore, interpolated videos may exhibit unnatural changes in content, such as artifacts or misalignments. General methods cannot evaluate based on these distortion features, further reducing their accuracy.

Summary of the Invention

[0006] To overcome the aforementioned shortcomings in the prior art, the purpose of this invention is to provide a method, system, and terminal for evaluating the quality of frame-interpolated video without reference.

[0007] The first objective of this invention is to provide a no-reference video quality assessment method for interpolated video, comprising:

[0008] A single interpolated frame is obtained from the interpolated video. At the same time, the two adjacent frames immediately before and after it are taken, so that the three consecutive frames form a ternary frame; multiple groups of ternary frames are obtained in this way.

[0009] The pre-trained neural network extracts multi-scale feature maps from each group of ternary frames, learns the correlation between adjacent frames, and obtains the feature map of each group of ternary frames.

[0010] The multi-scale features of the feature map of each set of ternary frames are divided into low-level image features and high-level semantic features. The context similarity of the low-level image features is calculated to obtain the similarity feature vector of the ternary frames on the low-level features. The high-level semantic features are pooled to obtain the feature vector of the ternary frames in the time dimension.

[0011] The similarity feature vector of the ternary frame on the low-level features and the feature vector of the ternary frame in the time dimension are fused and input into the quality regression network to obtain the objective quality evaluation score of the interpolated video.

[0012] Optionally, the step of extracting multi-scale feature maps from each group of ternary frames using a pre-trained neural network includes:

[0013] For each set of ternary frames, the same neural network is used to extract multi-scale features;

[0014] The neural network is a pre-trained video convolutional neural network, wherein the last fully connected layer and pooling layer of the neural network are removed.

[0015] Optionally, the extraction of multi-scale feature maps, wherein:

[0016] During the feature extraction process, the feature maps generated after each downsampling stage of the video convolutional neural network are retained to form a feature pyramid with multiple scales.

[0017] Optionally, the multi-scale feature pyramid includes feature pyramid extraction at five scales, wherein:

[0018] The first three scales extract low-level image features while preserving the temporal dimension, while the latter two scales extract high-level semantic features and fuse the temporal dimension, learning the correlation between adjacent frames at different scales.

[0019] Optionally, the step of performing contextual similarity calculation on low-level image features yields a similarity feature vector of the ternary frame on the low-level features.

[0020] The contextual similarity calculation is expressed as follows:

[0021]

[0022] where $f_i^t$ and $f_i^{t+1}$ denote two feature maps adjacent in the time dimension at the $i$-th scale; $CS_i(f_i^t, f_i^{t+1})$ denotes the contextual similarity between the adjacent feature maps; the remaining terms are the local standard deviation of each feature map and the local covariance between the adjacent feature maps; $T$ denotes the time scale of the feature map; and $C$ is a constant.

[0023] Optionally, the high-level semantic features are pooled to obtain the feature vector of each group of ternary frames in the time dimension, wherein:

[0024] After obtaining the multi-scale features of the ternary frame, global average pooling is performed on the high-level semantic features fused with the temporal dimension to obtain the temporal feature vector of the semantics of the ternary frame.

[0025] Optionally, the similarity feature vector of the ternary frame on the low-level features and the feature vector of the ternary frame in the time dimension are fused and input into the quality regression network to obtain the objective quality evaluation score of the interpolated video, wherein:

[0026] The fusion method employs a connection approach along the channel dimension;

[0027] The quality regression network consists of two fully connected neural network layers. Each fully connected layer is followed by an activation function, which is either the ReLU function or the Sigmoid function. Finally, the ternary frame quality score is output as the objective quality evaluation score of the interpolated video.

[0028] A second objective of this invention is to provide a no-reference video quality assessment system for interpolated video, comprising:

[0029] Feature extraction module: Obtain a certain number of interpolated frame images from the interpolated video. Each interpolated frame image and its two preceding and following frames form a ternary frame. Use a pre-trained convolutional neural network to extract multi-scale features from each group of ternary frames, learn the correlation between adjacent frames, and obtain the feature map of each group of ternary frames.

[0030] Feature vector calculation module: Divide the multi-scale features of the feature map of each group of ternary frames into low-level image features and high-level semantic features. Perform contextual similarity calculation on the low-level image features to obtain the similarity feature vector of the ternary frames on the low-level features. Perform pooling on the high-level semantic features to obtain the feature vector of each group of ternary frames in the time dimension.

[0031] Feature fusion and regression module: The similarity feature vector of the obtained ternary frame in the low-level features and the feature vector of the ternary frame in the time dimension are fused and input into the quality regression network to obtain the objective quality evaluation score of the interpolated video.

[0032] A third objective of this invention is to provide a no-reference video quality evaluation terminal for interpolated video, comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, wherein the processor executes the program to perform the no-reference video quality evaluation method for interpolated video.

[0033] A fourth objective of the present invention is to provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the no-reference video quality evaluation method for interpolated video.

[0034] Compared with the prior art, the present invention has at least one of the following beneficial effects:

[0035] This invention provides a no-reference video quality evaluation method and system for interpolated video. It utilizes the multi-scale features of ternary frames in the interpolated video to effectively evaluate the experience quality of the interpolated video in the absence of a reference. This method outperforms current cutting-edge no-reference video objective quality evaluation methods. It not only considers the influence of the correlation between adjacent frames on the quality of the interpolated video, but also designs a context similarity calculation to incorporate multi-scale features into the final score, thereby improving the accuracy of no-reference video quality evaluation.

Attached Figure Description

[0036] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:

[0037] Figure 1 is a flowchart of a no-reference video quality evaluation method for frame-interpolated video according to an embodiment of the present invention;

[0038] Figure 2 is a general block diagram of a no-reference video quality evaluation system for frame-interpolated video according to an embodiment of the present invention;

[0039] Figure 3 is a block diagram of the pre-trained ResNet3D network according to an embodiment of the present invention.

Detailed Implementation

[0040] The embodiments of the present invention are described in detail below: These embodiments are implemented based on the technical solution of the present invention, and provide detailed implementation methods and specific operation processes. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.

[0041] Currently, there is a lack of research on no-reference video quality assessment for interpolated videos. Since no reference video is needed, it is easier to apply to real-time detection systems. In current no-reference video quality assessment research, due to the redundancy of video information, features are typically extracted using the same frame interval. However, because the newly generated frames exist between the original frames, adjacent frames have strong correlations. Ignoring this information leads to general no-reference video quality assessment methods failing to accurately monitor and evaluate the quality. Furthermore, interpolated videos may contain unnatural variations in content, such as artifacts or misalignments. General methods cannot evaluate based on these distortion features, further reducing their accuracy.

[0042] Based on the aforementioned research, this invention provides a no-reference video quality assessment method for interpolated videos. The method mainly includes: first, obtaining the data structure of ternary frames from the interpolated video; then, extracting multi-scale feature maps of the ternary frames using a video convolutional neural network, dividing them into low-level image features and high-level semantic features; calculating contextual similarity and average pooling operations on both types of features respectively; learning feature vectors representing the correlation between adjacent frames; and finally, fusing the multi-scale features and inputting them into a quality regression network to obtain the final objective quality assessment score. This invention can effectively evaluate the overall perceived quality of interpolated videos.

[0043] As shown in Figure 1, this embodiment of the invention provides a no-reference video quality evaluation method for frame-interpolated video, including the following steps:

[0044] S1: Obtain a single frame interpolated image from the interpolated video, and take the two adjacent frames before and after the single frame interpolated image to form a continuous three-frame group as a ternary frame, and obtain multiple groups of ternary frames.

[0045] In this step, a certain number of interpolated frame images are obtained from the interpolated video, and each interpolated frame image and its two preceding and following frames form a ternary frame.

[0046] S2: Using the pre-trained neural network, multi-scale feature maps are extracted from each set of ternary frames obtained in S1, and the correlation between adjacent frames is learned to obtain the feature map of each set of ternary frames.

[0047] S3: Divide the multi-scale features of the feature map obtained in S2 into low-level image features and high-level semantic features. Perform contextual similarity calculation on the low-level image features to obtain the similarity feature vector of the ternary frame on the low-level features. Perform pooling on the high-level semantic features to obtain the feature vector of the ternary frame in the time dimension.

[0048] S4: The similarity feature vectors of the ternary frames in the low-level features obtained in S3 and the feature vectors of the ternary frames in the time dimension are fused and input into the quality regression network to obtain the objective quality evaluation score of the interpolated video.

[0049] The embodiments of the present invention utilize the multi-scale features of ternary frames in interpolated videos to effectively evaluate the overall perceptual experience quality of interpolated videos.

[0050] Based on the same concept described above, another embodiment of the present invention also provides a no-reference video quality evaluation system for frame-interpolated video which, as shown in Figure 2, includes:

[0051] Feature extraction module: Obtain a certain number of interpolated frame images from the interpolated video. Each interpolated frame image and its two preceding and following frames form a ternary frame. Use a pre-trained convolutional neural network to extract multi-scale features from each group of ternary frames, learn the correlation between adjacent frames, and obtain the feature map of each group of ternary frames.

[0052] Feature vector calculation module: Divide the multi-scale features of the feature map of each group of ternary frames into low-level image features and high-level semantic features. Perform contextual similarity calculation on the low-level image features to obtain the similarity feature vector of the ternary frames on the low-level features. Perform pooling on the high-level semantic features to obtain the feature vector of each group of ternary frames in the time dimension.

[0053] Feature fusion and regression module: The similarity feature vector of the obtained ternary frames in the low-level features and the feature vector of the ternary frames in the time dimension are fused and input into the quality regression network to obtain the objective quality evaluation score of the interpolated video.

[0054] In order to make the final evaluation more accurate and more effective in evaluating the experience quality of the frame-interpolated video, in some preferred embodiments, the following technical features can be further preferred based on the above embodiments.

[0055] Specifically, as shown in Figure 2, the feature extraction module is an important part of this invention.

[0056] In some possible embodiments, the feature extraction module may include two parts: extraction of the ternary frame image sequence and extraction of the ternary frame feature map.

[0057] (1) Extraction of ternary frame image sequence

[0058] First, extract the interpolated frames from the interpolated video, and extract the frames before and after them simultaneously. Then, arrange them into multiple sets of ternary frames in chronological order, which serve as the input to the feature extraction module.

[0059] In some embodiments, distorted frames are extracted at equal time intervals. Specifically, for each distorted (interpolated) frame taken, the frame preceding it and the frame following it are extracted at the same time. These two frames are combined with the current distorted frame to form a set of ternary frames, resulting in multiple sets of ternary frames at equal time intervals. Preferably, before input to the feature extraction module, all ternary frames are resampled to a spatial resolution of 224×224. This sampling parameter can be adjusted according to the actual situation and is not limited to 224×224.
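The grouping into ternary frames can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `ternary_frame_indices`, the `step` parameter, and the premise that the indices of the interpolated frames are known in advance are all hypothetical.

```python
from typing import List, Sequence, Tuple


def ternary_frame_indices(
    num_frames: int,
    interpolated: Sequence[int],
    step: int = 1,
) -> List[Tuple[int, int, int]]:
    """Return (previous, interpolated, next) frame-index triples.

    num_frames   -- total number of frames in the interpolated video
    interpolated -- indices of the interpolated (distorted) frames, ascending
    step         -- keep every `step`-th interpolated frame (equal intervals)
    """
    triples = []
    for centre in interpolated[::step]:
        # An interpolated frame needs one neighbour on each side.
        if 0 < centre < num_frames - 1:
            triples.append((centre - 1, centre, centre + 1))
    return triples


# Assumption for the example: 2x interpolation, so odd indices are inserted frames.
frames = 9
inserted = list(range(1, frames, 2))  # [1, 3, 5, 7]
print(ternary_frame_indices(frames, inserted, step=2))
```

Each returned triple would then be decoded and resampled (to 224×224 in the preferred embodiment) before feature extraction.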

[0060] (2) Extraction of feature maps of ternary frames

[0061] Take a set of ternary frames and input them into the pre-trained video convolutional neural network. According to the parameters of the convolutional network, take the feature map after each downsampling of the network. In this way, the network can be divided into five stages and five feature map scales can be obtained as multi-scale features corresponding to the interpolated video ternary frames.

[0062] Video convolutional neural networks (CNNs) can be used for video classification, such as ResNet3D. Similar networks include ResNet(2+1)D, ResNet MC, and so on. CNNs used for video classification typically consist of five stages, generating feature maps at five scales. Research has shown that the latter two scales are closer to high-level semantics. Generally, the first three stages represent low-level image features, while the latter two stages represent high-level semantic features.

[0063] For example, in one specific embodiment, as shown in Figure 3, the video convolutional neural network is a ResNet3D network pre-trained on the Kinetics-400 dataset, with the last fully connected layer and pooling layer removed. After the ternary frames are resampled to the input resolution, the feature maps at the five scales have the following sizes (channels × time × height × width): {64×3×112×112}, {64×3×112×112}, {128×2×56×56}, {256×1×28×28}, and {512×1×14×14}.
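The five feature-map sizes listed above can be checked with plain convolution arithmetic. The sketch below assumes a ResNet3D-18-style layout (kernel sizes, strides and paddings per stage); these layer parameters are not stated in the text and are assumptions chosen for illustration, as are the names `conv_out`, `STAGES` and `pyramid_shapes`.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


# (channels, (kT, kS), (sT, sS), (pT, pS)) for the downsampling conv of each stage.
# Spatial axes are square. Values mirror a ResNet3D-18-style network (assumption).
# Other convs inside a stage use kernel 3, stride 1, padding 1 and keep the size.
STAGES = [
    (64, (3, 7), (1, 2), (1, 3)),   # stem: spatial downsampling only
    (64, (3, 3), (1, 1), (1, 1)),   # stage 1: no downsampling
    (128, (3, 3), (2, 2), (1, 1)),  # stage 2: time and space halved
    (256, (3, 3), (2, 2), (1, 1)),  # stage 3
    (512, (3, 3), (2, 2), (1, 1)),  # stage 4
]


def pyramid_shapes(frames: int = 3, size: int = 224):
    """Return (channels, time, height, width) for each of the five scales."""
    shapes = []
    t, s = frames, size
    for channels, (kt, ks), (st, ss), (pt, ps) in STAGES:
        t = conv_out(t, kt, st, pt)
        s = conv_out(s, ks, ss, ps)
        shapes.append((channels, t, s, s))
    return shapes


for shape in pyramid_shapes():
    print("x".join(str(d) for d in shape))
```

Under these assumed layer parameters, a ternary frame (3 frames at 224×224) keeps a time dimension greater than 1 only in the first three scales, which is consistent with treating those scales as the ones that preserve the temporal dimension.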

[0064] In the above embodiments of the present invention, the feature extraction module uses multi-scale features of ternary frames in the interpolated video, which can effectively evaluate the experience quality of the interpolated video without reference.

[0065] After obtaining the extracted features, the present invention further requires feature vector calculation, which can be implemented through a feature vector calculation module. In some embodiments, the feature vector calculation module may include two parts: low-level similarity feature vector calculation and high-level semantic feature vector calculation, thereby enabling...

[0066] (1) Calculation of low-level similarity feature vectors

[0067] After obtaining the multi-scale features of the ternary frames, context similarity is calculated on the low-level image features that retain the time dimension, resulting in similarity feature vectors for the ternary frames on the low-level features. Based on the data structure of the multi-scale feature map, the context similarity calculation is designed as follows:

[0068]

[0069] where $f_i^t$ and $f_i^{t+1}$ denote two feature maps adjacent in the time dimension at the $i$-th scale; $CS_i(f_i^t, f_i^{t+1})$ denotes the contextual similarity between the adjacent feature maps; the remaining terms are the local standard deviation of each feature map and the local covariance between the adjacent feature maps; $T$ denotes the time scale of the feature maps; and $C$ is a constant. The constant is used to prevent division by zero; specifically, it is a small value, such as $10^{-6}$.
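Because the equation itself is not reproduced in the text, the following NumPy sketch only illustrates one plausible reading of the definitions: an SSIM-style structure term, (local covariance + C) / (product of local standard deviations + C), averaged over space and over the T-1 adjacent frame pairs. The exact form of the ratio, the non-overlapping 7×7 windows, and the names `contextual_similarity` and `_local_blocks` are assumptions, not the formula of the patent.

```python
import numpy as np


def _local_blocks(x: np.ndarray, win: int) -> np.ndarray:
    """Split a (C, H, W) map into non-overlapping win x win blocks."""
    c, h, w = x.shape
    bh, bw = h // win, w // win
    x = x[:, : bh * win, : bw * win]  # drop any ragged border
    return x.reshape(c, bh, win, bw, win)


def contextual_similarity(feat: np.ndarray, win: int = 7, c: float = 1e-6) -> np.ndarray:
    """Map a (C, T, H, W) feature map to a (C,) similarity vector.

    Assumed form per local window: (cov + c) / (std_a * std_b + c).
    """
    _, t, _, _ = feat.shape
    if t < 2:
        raise ValueError("need at least two time steps to compare adjacent maps")
    per_pair = []
    for k in range(t - 1):
        a = _local_blocks(feat[:, k], win)
        b = _local_blocks(feat[:, k + 1], win)
        mean_a = a.mean(axis=(2, 4), keepdims=True)
        mean_b = b.mean(axis=(2, 4), keepdims=True)
        std_a = a.std(axis=(2, 4))                              # local standard deviation
        std_b = b.std(axis=(2, 4))
        cov = ((a - mean_a) * (b - mean_b)).mean(axis=(2, 4))   # local covariance
        sim = (cov + c) / (std_a * std_b + c)                   # c avoids division by zero
        per_pair.append(sim.mean(axis=(1, 2)))                  # average over space
    return np.mean(per_pair, axis=0)                            # average over the T-1 pairs


rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 14, 14))
static = np.repeat(frame[:, None], 3, axis=1)  # three identical time steps
print(contextual_similarity(static))
```

With this reading, a feature map whose adjacent time steps are identical scores close to 1 in every channel, while uncorrelated time steps score close to 0.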

[0070] For example, in some specific embodiments, the low-level image features that retain the time dimension are the feature maps generated in the first three stages of the ResNet3D network. The calculated similarity feature vectors have output sizes of {64×1}, {64×1}, and {128×1}, respectively, so that the feature maps are fused along the time scale into feature vectors. Of course, the specific parameters of the feature vector can be adjusted according to the actual situation and are not limited to this specific embodiment.

[0071] (2) Calculation of high-level semantic feature vectors

[0072] After obtaining the multi-scale features of the ternary frame, global average pooling is performed on the high-level semantic features fused in the temporal dimension to obtain the semantic feature vector of the ternary frame in the temporal dimension.

[0073] For example, in some specific embodiments, the feature maps generated in the last two stages of the ResNet3D network have a time scale of 1. These are high-level semantic features obtained through learning and fusion by a convolutional neural network, containing semantic relationships between adjacent frames. Global average pooling is then used to calculate high-level semantic feature vectors, with output sizes of {256×1} and {512×1}, respectively. Of course, the specific parameters of the feature vectors can be adjusted according to actual conditions and are not limited to this specific embodiment.
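Global average pooling of the two high-level feature maps can be sketched in a few lines of NumPy. The array contents are placeholders and the function name `global_average_pool` is hypothetical; only the shapes follow the embodiment.

```python
import numpy as np


def global_average_pool(feat: np.ndarray) -> np.ndarray:
    """Average a (C, T, H, W) feature map over time and space, giving (C,)."""
    return feat.mean(axis=(1, 2, 3))


# Placeholder feature maps with the shapes of the last two stages.
stage4 = np.ones((256, 1, 28, 28), dtype=np.float32)
stage5 = np.full((512, 1, 14, 14), 2.0, dtype=np.float32)

v4 = global_average_pool(stage4)
v5 = global_average_pool(stage5)
print(v4.shape, v5.shape)
```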

[0074] The feature vector calculation module in the above embodiments of the present invention, based on the multi-scale features of the ternary frames, considers the influence of the correlation between adjacent frames on the quality of the interpolated video, and further incorporates the multi-scale features into the final score through the contextual similarity calculation, thereby improving the accuracy of no-reference video quality evaluation.

[0075] Third, feature fusion regression

[0076] After obtaining the feature vectors of the ternary frames at multiple scales (the similarity feature vectors on low-level features and the feature vectors of high-level features in the time dimension), all features are fused and input into a quality regression network to obtain a quality score. Here, "all features" refers to the feature vectors generated by the feature maps in the first three stages (low-level features) and the feature vectors generated by the feature maps in the last two stages (high-level features), which are generated from the same set of ternary frames.

[0077] The fusion method involves connecting features along the channel dimension. For example, the features to be connected include three low-level features (64x1), (64x1), and (128x1) and two high-level features (256x1) and (512x1). The connection rule is to use a matrix connection along the channel dimension to obtain a feature vector of (64+64+128+256+512=1024x1).
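The channel-wise connection can be illustrated as a simple concatenation. The vectors below are zero-filled placeholders with the sizes given in the example, and the helper name `fuse` is hypothetical.

```python
import numpy as np


def fuse(vectors):
    """Concatenate per-scale feature vectors along the channel dimension."""
    return np.concatenate([np.ravel(v) for v in vectors], axis=0)


low_level = [np.zeros((64, 1)), np.zeros((64, 1)), np.zeros((128, 1))]
high_level = [np.zeros((256, 1)), np.zeros((512, 1))]

fused = fuse(low_level + high_level)
print(fused.shape)
```

Keeping the scales in a fixed order matters here, since the regression network that follows sees only positions in the fused vector.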

[0078] The quality regression network consists of two fully connected layers of neural network. Each fully connected layer is followed by an activation function to obtain a ternary frame quality score.

[0079] For example, in some specific embodiments, the size of the feature vector after feature fusion is {1024×1}, and the activation functions are ReLU and Sigmoid, respectively. Finally, the objective quality score of the interpolated video is obtained by averaging the quality scores of each group of ternary frames.
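A two-layer regression head with ReLU and Sigmoid activations, followed by averaging over ternary frames, might look like the sketch below. The hidden width of 128, the random untrained weights, and the names `QualityRegressor` and `video_score` are assumptions made only for illustration; the text specifies just the input size, the number of layers and the activations.

```python
import numpy as np


class QualityRegressor:
    """Two fully connected layers: FC -> ReLU -> FC -> Sigmoid."""

    def __init__(self, in_dim: int = 1024, hidden_dim: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Random weights stand in for trained parameters (assumption).
        self.w1 = rng.normal(0.0, 0.02, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, (hidden_dim, 1))
        self.b2 = np.zeros(1)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        h = np.maximum(x @ self.w1 + self.b1, 0.0)  # first layer + ReLU
        z = h @ self.w2 + self.b2                   # second layer
        return 1.0 / (1.0 + np.exp(-z))             # Sigmoid, score in (0, 1)


def video_score(model: QualityRegressor, fused_vectors: np.ndarray) -> float:
    """Average the per-ternary-frame scores into one video-level score."""
    return float(model(np.asarray(fused_vectors)).mean())


model = QualityRegressor()
batch = np.random.default_rng(1).standard_normal((5, 1024))  # 5 ternary frames
print(video_score(model, batch))
```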

[0080] Furthermore, to reduce computational load, in other preferred embodiments, frame skipping can be performed during training and testing based on the video scene and content. This involves selecting ternary frames from the video frames at equal time intervals for training and testing. In this embodiment, one distorted frame is taken every second to form a ternary frame for perceptual quality calculation. Finally, the sampled video frames are fused to obtain the final quality prediction score. Frame skipping reduces computational load, eliminating the need to calculate for each inserted frame.

[0081] Implementation results:

[0082] To verify the effectiveness of the no-reference video quality assessment method for frame-interpolated videos provided in the above embodiments of the present invention, experimental tests were conducted on the BVI Video Frame Interpolation (BVI-VFI) database. The BVI-VFI database consists of 36 original videos and 540 distorted versions at three spatial resolutions, and a subjective quality assessment was performed on each video. The video signals exhibit five types of distortion: two generated by traditional frame-interpolation algorithms and three generated by deep frame-interpolation algorithms. Each distortion type is provided at three different frame rates. Following the standards proposed by the Video Quality Experts Group (VQEG) in its Phase I Full Reference-TV test, two evaluation criteria were selected to measure the performance of the video quality assessment methods: the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC).

[0083] Specifically, the following video quality assessment model will be used as a comparison method:

[0084] BRISQUE (A. Mittal, A. K. Moorthy and A. C. Bovik, "No-Reference Image Quality Assessment in the Spatial Domain," IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695-4708, Dec. 2012);

[0085] TLVQM (J. Korhonen, "Two-Level Approach for No-Reference Consumer Video Quality Assessment," IEEE Transactions on Image Processing, vol. 28, no. 12, pp. 5923-5938, Dec. 2019);

[0086] VIDEVAL (Z. Tu, Y. Wang, N. Birkbeck, B. Adsumilli and A. C. Bovik, "UGC-VQA: Benchmarking Blind Video Quality Assessment for User Generated Content," IEEE Transactions on Image Processing, vol. 30, pp. 4449-4464, 2021);

[0087] CVQA-NR (W. Sun, T. Wang, X. Min, F. Yi and G. Zhai, "Deep Learning Based Full-Reference and No-Reference Quality Assessment Models for Compressed UGC Videos," 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 2021, pp. 1-6);

[0088] SimpleVQA (W. Sun, X. Min, W. Lu and G. Zhai, "A Deep Learning based No-reference Quality Assessment Model for UGC Videos," in Proceedings of the 30th ACM International Conference on Multimedia (MM '22), Association for Computing Machinery, New York, NY, USA, 2022, pp. 856–865);

[0089] FastVQA (H. Wu et al., "FAST-VQA: Efficient End-to-End Video Quality Assessment with Fragment Sampling," in: Avidan, S., Brostow, G., Cissé, M., Farinella, G. M., Hassner, T. (eds) Computer Vision–ECCV 2022, Lecture Notes in Computer Science, vol. 13666, Springer, Cham, 2022).

[0090] Table 1

[0091]

[0092] The performance test results under no-reference conditions are shown in Table 1. During the experiment, the BVI-VFI dataset was divided into different subsets according to different spatial resolutions and frame rates for testing. The VFIVQA-NR method represents a no-reference video quality evaluation method for interpolated videos proposed in this embodiment of the invention. As can be seen from the table, the method proposed in this embodiment of the invention can effectively evaluate the experience quality of interpolated videos under no-reference conditions.

[0093] This invention provides a no-reference video quality evaluation method for frame-interpolated videos, which can effectively evaluate the experience quality of frame-interpolated videos in the absence of a reference.

[0094] Based on the same concept described above, in another embodiment of the present invention, a no-reference video quality evaluation terminal for interpolated video is also provided, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor. When the processor executes the program, it is used to perform the no-reference video quality evaluation method for interpolated video.

[0095] Optionally, the memory is used to store programs. The memory may include volatile memory, for example random-access memory (RAM) such as static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory stores computer programs (such as application programs and functional modules that implement the above method), computer instructions, and the like.

[0096] The aforementioned computer programs, computer instructions, etc., can be stored in partitions within one or more memory locations. Furthermore, the aforementioned computer programs, computer instructions, data, etc., can be accessed by a processor.

[0097] A processor is used to execute a computer program stored in memory to implement the various steps of the methods involved in the above embodiments. For details, please refer to the relevant descriptions in the preceding method embodiments.

[0098] The processor and memory can be separate structures or integrated structures. When the processor and memory are separate structures, they can be coupled together via a bus.

[0099] In this embodiment of the invention, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the no-reference video evaluation method in any of the above method embodiments.

[0100] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0101] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0102] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0103] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0104] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the invention.

[0105] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A no-reference video quality evaluation method for frame-interpolated video, characterized by comprising: obtaining a single interpolated frame from the frame-interpolated video and taking it together with the frames immediately preceding and following it as a group of three consecutive frames, referred to as a ternary frame, thereby obtaining multiple groups of ternary frames; extracting multi-scale feature maps from each group of ternary frames using a pre-trained neural network, which learns the correlation between adjacent frames, to obtain the feature map of each group of ternary frames; dividing the multi-scale features of the feature map of each group of ternary frames into low-level image features and high-level semantic features; calculating the context similarity of the low-level image features to obtain the similarity feature vector of the ternary frame on the low-level features; pooling the high-level semantic features to obtain the feature vector of the ternary frame in the time dimension; and fusing the similarity feature vector of the ternary frame on the low-level features with the feature vector of the ternary frame in the time dimension and inputting the result into a quality regression network to obtain the objective quality evaluation score of the frame-interpolated video.
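The grouping step of claim 1 can be illustrated with a minimal sketch. It assumes the positions of the interpolated frames are known in advance; the function name `build_triplets` and the 2x interpolation pattern in the example are hypothetical and are not taken from the patent.

```python
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def build_triplets(
    frames: Sequence[Frame], interpolated_idx: Sequence[int]
) -> List[Tuple[Frame, Frame, Frame]]:
    """Group each interpolated frame with its immediate neighbours."""
    triplets = []
    for i in interpolated_idx:
        # Skip positions that lack a preceding or a following frame.
        if 0 < i < len(frames) - 1:
            triplets.append((frames[i - 1], frames[i], frames[i + 1]))
    return triplets


# Hypothetical example: 2x interpolation, where odd indices are interpolated.
frames = [f"frame{k}" for k in range(7)]
triplets = build_triplets(frames, range(1, len(frames), 2))
```

Each resulting tuple plays the role of one ternary frame that is passed to the feature extraction network.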

2. The no-reference video quality evaluation method for frame-interpolated video according to claim 1, characterized in that extracting multi-scale feature maps from each group of ternary frames using a pre-trained neural network comprises: using the same neural network to extract multi-scale features for every group of ternary frames; wherein the neural network is a pre-trained video convolutional neural network from which the final fully connected layer and pooling layer have been removed.

3. The no-reference video quality evaluation method for frame-interpolated video according to claim 2, characterized in that, in extracting the multi-scale feature maps: during feature extraction, the feature map generated after each downsampling stage of the video convolutional neural network is retained, so as to form a feature pyramid with multiple scales.

4. The no-reference video quality evaluation method for frame-interpolated video according to claim 3, characterized in that the multi-scale feature pyramid comprises features extracted at five scales, wherein: the first three scales extract low-level image features while preserving the time dimension, and the last two scales extract high-level semantic features while fusing the time dimension, so that the correlation between adjacent frames is learned at different scales.

5. The no-reference video quality evaluation method for frame-interpolated video according to claim 1, characterized in that the context similarity calculation is performed on the low-level image features to obtain the similarity feature vector of the ternary frame on the low-level features; the context similarity calculation is expressed as follows: where f_i^t and f_i^{t+1} denote two feature maps that are adjacent in the time dimension at the i-th scale; CS_i(f_i^t, f_i^{t+1}) denotes the context similarity between the adjacent feature maps; the remaining terms denote the local standard deviation of a feature map and the local covariance between adjacent feature maps; T denotes the temporal size of the feature map; and C denotes a constant.
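The formula itself is not reproduced in this text, so the following is only a sketch of one plausible form, not the claimed expression. It assumes an SSIM-style structure term, (local covariance + C) / (product of local standard deviations + C), averaged over spatial positions and over the T-1 adjacent pairs. The window size `k`, the value of `c`, and the function names are assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def local_mean(x: np.ndarray, k: int) -> np.ndarray:
    """Mean over every k x k spatial window of the last two axes."""
    windows = sliding_window_view(x, (k, k), axis=(-2, -1))
    return windows.mean(axis=(-2, -1))


def context_similarity(feat: np.ndarray, k: int = 3, c: float = 1e-4) -> np.ndarray:
    """feat has shape (T, C, H, W); returns one similarity value per channel."""
    a, b = feat[:-1], feat[1:]  # the T-1 temporally adjacent pairs
    mu_a, mu_b = local_mean(a, k), local_mean(b, k)
    var_a = local_mean(a * a, k) - mu_a ** 2
    var_b = local_mean(b * b, k) - mu_b ** 2
    cov = local_mean(a * b, k) - mu_a * mu_b  # local covariance
    # Clip tiny negative variances caused by floating-point rounding.
    sigma_a = np.sqrt(np.clip(var_a, 0.0, None))
    sigma_b = np.sqrt(np.clip(var_b, 0.0, None))
    cs_map = (cov + c) / (sigma_a * sigma_b + c)  # assumed SSIM-style form
    return cs_map.mean(axis=(0, 2, 3))  # average over pairs and positions


# Hypothetical low-level feature map of one ternary frame: T=3, 4 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((3, 4, 8, 8))
cs_vector = context_similarity(feat)
```

Under this assumed form, temporally identical feature maps give a similarity of 1 in every channel, and the value drops as adjacent maps decorrelate, which is the behaviour one would expect for detecting interpolation artifacts.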

6. The no-reference video quality evaluation method for frame-interpolated video according to claim 1, characterized in that pooling the high-level semantic features to obtain the feature vector of the ternary frame in the time dimension comprises: after the multi-scale features of the ternary frame are obtained, performing global average pooling on the high-level semantic features that have been fused in the time dimension, to obtain the semantic feature vector of the ternary frame in the time dimension.
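Global average pooling as named in claim 6 reduces each channel of a feature map to its spatial mean. The sketch below assumes the time dimension has already been fused by the network, so each high-level map has shape (channels, height, width). The channel counts, the spatial sizes, and the concatenation of the two high-level scales into one vector are illustrative assumptions.

```python
import numpy as np


def global_average_pool(feat: np.ndarray) -> np.ndarray:
    """Reduce a (C, H, W) feature map to a length-C vector of spatial means."""
    return feat.mean(axis=(-2, -1))


# Hypothetical time-fused feature maps at the two high-level scales.
high_level = [np.ones((256, 14, 14)), np.full((512, 7, 7), 2.0)]

# Assumption: the per-scale vectors are concatenated into one semantic vector.
semantic_vec = np.concatenate([global_average_pool(f) for f in high_level])
```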

7. The no-reference video quality evaluation method for frame-interpolated video according to claim 1, characterized in that, in fusing the similarity feature vector of the ternary frame on the low-level features with the feature vector of the ternary frame in the time dimension and inputting the result into the quality regression network to obtain the objective quality evaluation score of the frame-interpolated video: the fusion is performed by concatenation along the channel dimension; the quality regression network consists of two fully connected neural network layers, each followed by an activation function that is either the ReLU function or the Sigmoid function; and the quality score of the ternary frame is finally output as the objective quality evaluation score of the frame-interpolated video.
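The fusion and regression step of claim 7 can be sketched as a two-layer fully connected network applied to concatenated vectors. The claim allows either ReLU or Sigmoid after each layer; the sketch arbitrarily uses ReLU after the first layer and Sigmoid after the second. The hidden width, the random untrained weights, and the averaging of per-triplet scores into a single video score are assumptions for illustration only.

```python
import numpy as np


def relu(x: np.ndarray) -> np.ndarray:
    return np.maximum(x, 0.0)


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


class QualityRegressor:
    """Two fully connected layers, each followed by an activation."""

    def __init__(self, in_dim: int, hidden_dim: int = 128, seed: int = 0):
        rng = np.random.default_rng(seed)
        # Untrained placeholder weights; a real model would learn these.
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, 1)) * 0.01
        self.b2 = np.zeros(1)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        hidden = relu(x @ self.w1 + self.b1)
        return sigmoid(hidden @ self.w2 + self.b2)


def score_video(sim_vecs: np.ndarray, sem_vecs: np.ndarray,
                model: QualityRegressor) -> float:
    """sim_vecs: (N, D1), sem_vecs: (N, D2) for N ternary frames."""
    fused = np.concatenate([sim_vecs, sem_vecs], axis=-1)  # channel concat
    triplet_scores = model(fused)[:, 0]
    # Assumption: the video score is the mean of the per-triplet scores.
    return float(triplet_scores.mean())


rng = np.random.default_rng(0)
sim_vecs = rng.random((5, 8))    # hypothetical similarity vectors
sem_vecs = rng.random((5, 16))   # hypothetical semantic vectors
model = QualityRegressor(in_dim=8 + 16)
video_score = score_video(sim_vecs, sem_vecs, model)
```

With a Sigmoid on the output layer the score lies in (0, 1), so subjective ratings would need to be normalised to the same range before training.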

8. A no-reference video quality evaluation system for frame-interpolated video, characterized by comprising: a feature extraction module, which obtains a number of interpolated frames from the frame-interpolated video, forms a ternary frame from each interpolated frame together with the frames immediately preceding and following it, and uses a pre-trained convolutional neural network to extract multi-scale features from each group of ternary frames and learn the correlation between adjacent frames, obtaining the feature map of each group of ternary frames; a feature vector calculation module, which divides the multi-scale features of the feature map of each group of ternary frames into low-level image features and high-level semantic features, performs the context similarity calculation on the low-level image features to obtain the similarity feature vector of the ternary frame on the low-level features, and pools the high-level semantic features to obtain the feature vector of each group of ternary frames in the time dimension; and a feature fusion and regression module, which fuses the similarity feature vector of the ternary frame on the low-level features with the feature vector of the ternary frame in the time dimension and inputs the result into a quality regression network to obtain the objective quality evaluation score of the frame-interpolated video.

9. A no-reference video quality evaluation terminal for frame-interpolated video, comprising a memory, a processor, and a computer program stored in the memory and capable of running on the processor, characterized in that, when the processor executes the program, it performs the method described in any one of claims 1-7.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method described in any one of claims 1-7.