Text-generated video quality evaluation method and system based on modular design
By combining modular design with a supervised learning strategy, a quality evaluation method for text-generated (text-to-video) videos is constructed on the basis of multi-dimensional feature fusion. The method addresses the reliance on manual annotation and the poor evaluation consistency of existing text-generated video quality evaluation, and achieves efficient and accurate video quality assessment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2025-10-27
- Publication Date
- 2026-06-23
Figure CN121505508B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision and video quality assessment, and more specifically to a method and system for evaluating the quality of text-generated (text-to-video) videos based on modular design.
Background Technology
[0002] Text-generated video quality assessment is an important research direction in computer vision. It aims to quantify the perceptual quality of text-generated videos and provides an important performance indicator for other visual tasks (such as super-resolution of text-generated videos). Currently, text-generated video quality assessment falls into two main categories: subjective quality assessment and objective quality assessment. Subjective quality assessment relies on human scoring. Although it accurately reflects human perception of video quality, it incurs high labor and time costs and cannot run in real time, which makes large-scale application difficult. Objective quality assessment, as an alternative, uses computational models to evaluate the quality of text-generated videos automatically, offering high efficiency and good scalability.
[0003] Depending on whether a reference video is required, objective quality assessment can be further divided into full-reference quality assessment and no-reference quality assessment, the latter also known as blind quality assessment. Full-reference quality assessment quantifies video quality by directly comparing the test video with the reference video, and related research has made significant progress in recent years. However, in practical applications, especially in super-resolution tasks for text-generated videos, the original reference video is often unavailable, which limits the scenarios in which full-reference quality assessment can be used. In contrast, blind quality assessment does not rely on a reference video and evaluates quality solely from the information in the distorted video itself.
[0004] In recent years, with the rapid development of deep learning, deep learning-based methods for evaluating the quality of text-generated videos have gradually become a research hotspot. However, these methods typically require a large number of manually annotated video and quality score pairs for supervised training, and obtaining high-quality manual annotations is often impractical because of the cost. Furthermore, existing methods often rely on a single model or a single feature dimension, which makes it difficult to capture video quality comprehensively in terms of temporal consistency, spatial structure, semantic alignment, and other aspects, and leads to discrepancies between the evaluation results and human subjective perception.
[0005] Therefore, how to reduce the need for manually labeled data in model training, and how to construct a data-efficient text-generated video quality evaluation method and system that integrates multi-dimensional video features, has become a key problem that urgently needs to be solved in this field.
Summary of the Invention
[0006] To address the shortcomings of existing text-generated video quality assessment methods, such as heavy reliance on manually labeled data, limited evaluation dimensions, and insufficient consistency with human subjective perception, this invention provides a method and system for text-generated video quality assessment based on modular design. The method constructs multi-dimensional feature extraction modules and integrates a supervised ranking learning strategy to achieve efficient, accurate, and data-efficient automatic assessment of text-generated video quality.
[0007] To achieve the above objectives, the present invention adopts the following technical solution:
[0008] A method for evaluating the quality of text-generated videos based on modular design, characterized by the following steps:
[0009] S1. Extract temporal features, spatial features, and image-text matching features from the original text-generated video dataset;
[0010] S2. Use a pre-trained multimodal base model to perform initial quality scoring on video-text pairs to generate base scores;
[0011] S3. Fuse the multi-dimensional features extracted in step S1 with the base scores obtained in step S2 to generate the predicted scores for the video dataset;
[0012] S4. Map the temporal features, spatial features, and image-text matching features to a unified dimension, and split the mapped features into blocks to generate the corresponding weight coefficients and bias coefficients. Use the weight coefficients and bias coefficients to perform weighted correction on the base score to obtain the final quality prediction score.
[0013] A mini-batch learning strategy is adopted, in which the predicted labels generated in step S3 are combined with the ground-truth labels in an existing text-generated video quality assessment dataset to train the text-generated video quality assessment model in a supervised learning manner.
[0014] Furthermore, in step S1, the temporal features capture motion consistency by analyzing the temporal dynamic changes of the video frame sequence; the spatial features characterize structural and texture quality by analyzing the visual content of a single video frame; and the image-text matching feature evaluates the degree of semantic alignment by calculating the semantic similarity between the video content and the text description.
[0015] Furthermore, the pre-trained multimodal base model includes mPLUG-OWL2 or SigLIP2; the input prompt for the base model is formatted text, and the output is a weighted score over preset category labels.
[0016] Furthermore, the weighted correction is specifically implemented as follows:
[0017] The temporal features, spatial features, and image-text matching features are each mapped to a unified dimension by an independent multilayer perceptron;
[0018] The mapped features are split into blocks to generate weight terms and bias terms;
[0019] The weighted average of the base score with the weight terms and bias terms is calculated as follows:
[0020]
[0021] where the symbols denote, in order, the feature weights, the base score, and the final score.
[0022] Furthermore, training the text-generated video quality evaluation model by supervised learning specifically includes:
[0023] For each pair of video samples in the training batch, calculating a binary (0/1) relative quality label;
[0024] Calculating the preference probability between the samples of each pair based on the Thurstone model;
[0025] Calculating the fidelity loss of each sample pair from the relative quality label and the preference probability:
[0026]
[0027] where n is the total number of samples;
[0028] The fidelity loss between the model's output predicted labels and the ground truth labels in the existing video quality assessment dataset is calculated separately. After setting weighting factors, these losses are summed to obtain the total loss. Backpropagation is then performed to update the model parameters. The total loss function is as follows:
[0029]
[0030] where the two symbols denote the model prediction and the true label, respectively, and loss0 and loss1 are the fidelity losses described above.
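The pairwise training objective described above can be illustrated with a minimal sketch. The formulas themselves are not reproduced in this text, so the sketch assumes the commonly used forms: a Thurstone Case V preference probability with unit variance, and the fidelity loss 1 - sqrt(p * p_hat) - sqrt((1 - p) * (1 - p_hat)). The function names, the tie handling, and the weighting factors `lambda0` and `lambda1` are illustrative assumptions, and the text does not spell out what the two terms loss0 and loss1 are computed over, so `total_loss` simply takes two loss values.

```python
import math
from itertools import combinations


def thurstone_prob(q_i, q_j, sigma=1.0):
    """Assumed Thurstone Case V form: P(i preferred over j) = Phi((q_i - q_j) / (sqrt(2) * sigma))."""
    z = (q_i - q_j) / (math.sqrt(2.0) * sigma)
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def relative_label(y_i, y_j):
    """Binary relative quality label: 1 if sample i has the higher ground-truth score.

    Ties are mapped to 0 here; that choice is an assumption.
    """
    return 1.0 if y_i > y_j else 0.0


def fidelity_loss(p, p_hat):
    """Assumed fidelity loss between a label p and a predicted probability p_hat."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))


def batch_fidelity_loss(preds, labels):
    """Mean fidelity loss over all unordered sample pairs in a mini-batch."""
    pairs = list(combinations(range(len(preds)), 2))
    total = 0.0
    for i, j in pairs:
        p = relative_label(labels[i], labels[j])
        p_hat = thurstone_prob(preds[i], preds[j])
        total += fidelity_loss(p, p_hat)
    return total / len(pairs)


def total_loss(loss0, loss1, lambda0=1.0, lambda1=1.0):
    """Weighted sum of two loss terms; the weighting factors are hypothetical."""
    return lambda0 * loss0 + lambda1 * loss1


# A batch whose predictions are ordered like the labels gets a lower loss
# than a batch whose predictions are ordered the opposite way.
print(batch_fidelity_loss([3.0, 2.0, 1.0], [3.0, 2.0, 1.0]))
print(batch_fidelity_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

Because the loss depends only on score differences within a pair, it rewards correct relative ordering rather than matching absolute score values, which is one reason pairwise objectives of this kind are used when datasets have different rating scales.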
[0031] The present invention also provides a text-generated video quality assessment system for implementing the above method, characterized in that it includes:
[0032] A feature extraction module, used to extract temporal features, spatial features, and image-text matching features from text-generated videos;
[0033] A base scoring module, used to generate initial quality scores for video-text pairs based on the base model;
[0034] A score correction module, which applies the weights and biases generated after feature mapping to the base score to obtain the corrected predicted score;
[0035] A training module, used to train the text-generated video quality assessment model in a supervised manner using a mini-batch learning strategy.
[0036] Furthermore, the feature extraction module supports traditional CNN models or mainstream multimodal large models as feature extractors, extracting the output of the last hidden layer as a feature representation.
[0037] The present invention further provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method described above when executing the program.
[0038] The present invention further provides a computer-readable storage medium having a computer program stored thereon, characterized in that the program, when executed by a processor, implements the method described above.
[0039] Compared with the prior art, the present invention has the following beneficial effects:
[0040] 1) The lightweight network performs well, with results comparable to those of large conventional models. It effectively improves on the performance of existing text-generated video quality assessment models and yields predictions that are more consistent with subjective evaluations. The paradigm for constructing text-generated video quality assessment models is analyzed, providing a more efficient approach for subsequent models.
[0041] 2) By comprehensively capturing the key factors that affect the quality of text-generated videos through three modules (temporal, spatial, and image-text matching), the model significantly improves the consistency between evaluation results and human subjective perception. It introduces a feature fusion mechanism that captures both overall and fine-grained local information, reducing dependence on the amount of data and making the model more applicable in data-scarce scenarios. It supports flexible combinations of different base models and feature extractors, which eases adaptation to different task requirements and to technological evolution. Through supervised learning strategies, it significantly improves training efficiency and generalization ability while maintaining model performance.
Description of the Drawings
[0042] Figure 1 is a flowchart of the text-generated video quality evaluation method based on modular design according to the present invention.
Detailed Description
[0043] The embodiments of the present invention are described in detail below. These embodiments are implemented based on the technical solution of the present invention, and provide detailed implementation methods and specific operation processes. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.
[0044] Referring to Figure 1, which shows the flowchart of the text-generated video quality evaluation method based on modular design of the present invention, the method is implemented in five main steps: multi-dimensional feature extraction, base scoring, predicted label generation, score correction and fusion, and semi-supervised model training.
[0045] S1: All data from the original text-generated video dataset are sent to the sub-modules for feature extraction; the original text-generated video dataset is a labeled, no-reference, real-world AIGC video dataset.
[0046] S2: The base model outputs base scores for each video and its corresponding text; two base models are selected, namely mPLUG-OWL2 and SigLIP2.
[0047] For the base model, the input prompt is "{c}. Rate the quality of this video.", where c is the text description of the corresponding video. For different datasets, this can be extended with "Rate the text-video alignment quality of the video." The response is set to "The quality of this video is: r", where r is one of several pre-defined labels, and the raw score is the weighted sum of the labels with their corresponding weights.
[0048] For the base score, the corresponding text has a feature representation and each corresponding video frame has a feature representation; the cosine similarity between them is:
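The prompt format and the weighted label score can be sketched as follows. This is a minimal illustration, not the scoring code of mPLUG-OWL2 or SigLIP2: the label set, the numeric weight attached to each label, and the use of a softmax over per-label logits are all assumptions, since the text only says that the raw score is a weighted sum over several pre-defined labels.

```python
import math

# Hypothetical rating labels and numeric weights (not given in the source).
LABEL_WEIGHTS = {"bad": 1.0, "poor": 2.0, "fair": 3.0, "good": 4.0, "excellent": 5.0}


def build_prompt(caption):
    """Fill the prompt template quoted in the text with the video's caption c."""
    return f"{caption}. Rate the quality of this video."


def weighted_label_score(label_logits):
    """Turn per-label logits into probabilities, then take the weighted sum.

    `label_logits` maps each label to the logit the base model assigns to it
    as the continuation of "The quality of this video is:".
    """
    m = max(label_logits.values())  # subtract the max for numerical stability
    exps = {k: math.exp(v - m) for k, v in label_logits.items()}
    z = sum(exps.values())
    return sum(LABEL_WEIGHTS[k] * e / z for k, e in exps.items())


print(build_prompt("A cat playing the piano"))
print(weighted_label_score({"bad": 0.1, "poor": 0.3, "fair": 1.2, "good": 2.5, "excellent": 1.0}))
```

Taking an expectation over labels rather than the single most likely label gives a continuous score, which is what the later correction step needs as its input.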
[0049]
[0050] The base score is:
[0051] where the temperature parameter is learnable, and K is the total number of keyframes.
[0052] S3: Map the features obtained in S1 and split them into weight blocks and bias blocks;
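As an illustration of a similarity-based base score, the following sketch averages the temperature-scaled cosine similarity between a text feature and K keyframe features. The exact formula is not reproduced in this text, so the averaging over keyframes and the division by the temperature are assumptions, and the temperature is a plain constant here rather than a learned parameter.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def base_score(text_feat, frame_feats, temperature=1.0):
    """Assumed form: mean over K keyframes of cosine similarity divided by the temperature."""
    k = len(frame_feats)
    return sum(cosine_similarity(text_feat, f) / temperature for f in frame_feats) / k


# Two keyframes: one aligned with the text feature, one orthogonal to it.
print(base_score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))
```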
[0053] For different modules, the weights and biases are calculated as follows:
[0054]
[0055]
[0056]
[0057] where the three mapping functions denote the MLPs applied after feature extraction in the spatial, temporal, and text-matching modules, respectively, and the three feature symbols denote the corresponding features; K is the total number of keyframes, y is the continuous video chunk near a keyframe, and x is the keyframe.
[0058] S4: Apply the coefficients obtained in S3 to the base score obtained in S2 to obtain the predicted output:
[0059]
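Steps S3 and S4 can be sketched end to end as follows. This is a minimal, untrained illustration under stated assumptions: each module's feature passes through its own two-layer perceptron to a common output dimension, the output is split into two equal blocks that are reduced to a scalar weight and a scalar bias, and the corrected score is the average of `w * base + b` over the modules. The layer sizes, the ReLU activation, the centering of the weight on 1, and all function names are hypothetical.

```python
import random


def make_mlp(in_dim, hidden_dim, out_dim, rng):
    """Randomly initialised two-layer perceptron (weights only, no training)."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def mlp_forward(params, x):
    """Map a feature vector x to the unified output dimension."""
    w1, w2 = params
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # ReLU
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def split_weight_bias(mapped):
    """Split the mapped vector into two equal blocks and reduce each to a scalar."""
    half = len(mapped) // 2
    weight = 1.0 + sum(mapped[:half]) / half  # centred on 1: zero output leaves the score unchanged
    bias = sum(mapped[half:]) / half
    return weight, bias


def corrected_score(base, module_feats, module_mlps):
    """Average the affine corrections w * base + b over all modules."""
    corrected = []
    for name, feat in module_feats.items():
        w, b = split_weight_bias(mlp_forward(module_mlps[name], feat))
        corrected.append(w * base + b)
    return sum(corrected) / len(corrected)


rng = random.Random(0)
dims = {"spatial": 8, "temporal": 6, "alignment": 4}  # per-module feature sizes (hypothetical)
mlps = {m: make_mlp(d, 16, 8, rng) for m, d in dims.items()}
feats = {m: [rng.random() for _ in range(d)] for m, d in dims.items()}
print(corrected_score(3.2, feats, mlps))
```

Keeping one perceptron per module means a module can be swapped or removed without touching the others, which matches the modular design the method emphasizes.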
[0060] S5: Train the text-generated video quality assessment model in a supervised learning manner by combining the predicted output obtained in S4 with the ground-truth labels of the existing text-generated video quality assessment datasets.
[0061] Given the ground-truth label and the model output, the loss is calculated using the following formulas:
[0062]
[0063]
[0064]
[0065] where the two symbols denote the normalized model output and its corresponding normalized ground-truth label.
[0066] The model is trained using this PLCC loss.
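A PLCC-based loss of the kind mentioned here can be sketched as follows. The patent's own formulas are not reproduced in this text, so the sketch assumes a common form: predictions and labels are mean-centered and scaled to unit norm, their inner product gives the Pearson linear correlation coefficient, and the loss is (1 - PLCC) / 2. The small constant in the denominator and the function names are assumptions.

```python
import math


def normalize(xs, eps=1e-8):
    """Mean-center a batch and scale it to (approximately) unit L2 norm."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / (norm + eps) for c in centered]


def plcc(preds, labels):
    """Pearson linear correlation as the inner product of the normalized batches."""
    return sum(a * b for a, b in zip(normalize(preds), normalize(labels)))


def plcc_loss(preds, labels):
    """Assumed loss form (1 - PLCC) / 2, which lies in [0, 1]."""
    return (1.0 - plcc(preds, labels)) / 2.0


print(plcc_loss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # close to 0: perfectly linearly related
print(plcc_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))  # close to 1: perfectly anti-correlated
```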
[0067] As a preferred embodiment, the existing text-generated video quality assessment datasets selected are FETV, LGVQ, T2VQA-DB, and GenAI-Bench.
[0068] Then, backpropagation of the error is performed and the model parameters are updated.
[0069] Implementation results:
[0070] To verify the effectiveness of the text-generated video quality evaluation method and system based on modular design provided in the above embodiments of the present invention, the SRCC and PLCC of the model under different module selections are calculated from comparative tests.
[0071] The performance test results are shown in Table 1.
[0072] Table 1
[0073]
[0074] To verify the generality of the text-generated video quality assessment method and system based on modular design provided in the above embodiments of the present invention, tests were conducted on multiple AIGC video quality assessment datasets, including FETV, LGVQ, T2VQA-DB, and GenAI-Bench. The experiments used two classic quality assessment metrics to measure model performance: PLCC and SRCC. The performance test results are shown in Tables 2 and 3. Several baseline models were also tested for comparison, and the embodiments of the present invention outperform most of them.
[0075] Table 2
[0076] Table 3
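For reference, the two metrics used in the experiments above can be computed as in the following sketch. PLCC is the Pearson linear correlation between predicted and subjective scores, and SRCC is the Pearson correlation of their ranks, with tied values sharing the average rank. The function names are illustrative, and the sample scores are made up for demonstration and are not results from the patent.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(xs, ys):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(average_ranks(xs), average_ranks(ys))


predicted = [1.0, 2.0, 3.0, 4.0]    # made-up model scores
subjective = [1.0, 4.0, 9.0, 16.0]  # made-up subjective scores
print(pearson(predicted, subjective), srcc(predicted, subjective))
```

In this example the ordering of the two lists agrees exactly, so SRCC reaches 1 while PLCC stays below 1 because the relationship is monotonic but not linear. The two metrics therefore capture different aspects of agreement: SRCC measures ranking consistency and PLCC measures linear prediction accuracy.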
[0077] The above embodiments of the present invention provide a method and system for evaluating the quality of text-generated videos based on modular design. They effectively alleviate the need of text-generated video quality evaluation methods for large amounts of manually labeled data for supervised training, effectively improve on the performance of existing text-generated video quality evaluation models, and yield predictions that are more consistent with subjective evaluation.
[0078] It should be noted that the steps in the method provided by the present invention can be implemented using the corresponding modules, devices, units, etc. in the system. Those skilled in the art can implement the steps of the method by referring to the technical solution of the system. That is, the embodiments in the system can be understood as preferred examples of implementing the method, and will not be elaborated here.
[0079] Those skilled in the art will understand that, in addition to implementing the system and its various devices provided by this invention in the form of purely computer-readable program code, the same functions can be achieved entirely through logical programming of the method steps, making the system and its various devices of this invention function as logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, the system and its various devices provided by this invention can be considered as a hardware component, and the devices contained therein for implementing various functions can also be considered as structures within the hardware component; alternatively, the devices for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
Claims
1. A method for evaluating the quality of text-generated videos based on modular design, characterized in that it includes the following steps:
S1. Extract temporal features, spatial features, and image-text matching features from the original text-generated video dataset;
S2. Use a pre-trained multimodal base model to perform initial quality scoring on video-text pairs to generate base scores;
S3. Fuse the multi-dimensional features extracted in step S1 with the base scores obtained in step S2 to generate the predicted scores for the video dataset;
S4. Map the temporal features, spatial features, and image-text matching features to a unified dimension, and split the mapped features into blocks to generate the corresponding weight coefficients and bias coefficients; use the weight coefficients and bias coefficients to perform weighted correction on the base score to obtain the final quality prediction score;
the weighted correction is specifically implemented as follows: the temporal features, spatial features, and image-text matching features are each mapped to a unified dimension by an independent multilayer perceptron; the mapped features are split into blocks to generate weight terms and bias terms; the weighted average of the base score with the weight terms and bias terms is calculated as follows:
where the symbols denote, in order, the feature weights, the base score, and the final score;
a mini-batch learning strategy is adopted, in which the predicted labels generated in step S3 are combined with the ground-truth labels in an existing text-generated video quality assessment dataset to train the text-generated video quality assessment model in a supervised learning manner;
training the text-generated video quality assessment model by supervised learning specifically includes: for each pair of video samples in the training batch, calculating a binary (0/1) relative quality label; calculating the preference probability between the samples of each pair based on the Thurstone model; calculating the fidelity loss of each sample pair from the relative quality label and the preference probability, where n is the total number of samples; calculating separately the fidelity losses between the predicted labels output by the model and the ground-truth labels of the existing video quality evaluation dataset, setting weight factors, adding the losses together to obtain the total loss, and performing backpropagation to update the model parameters;
the temporal features in step S1 capture motion consistency by analyzing the temporal dynamic changes of the video frame sequence; the spatial features characterize structural and texture quality by analyzing the visual content of a single video frame; and the image-text matching feature evaluates the degree of semantic alignment by calculating the semantic similarity between the video content and the text description.
2. The method for evaluating the quality of text-generated videos based on modular design according to claim 1, characterized in that the pre-trained multimodal base model includes mPLUG-OWL2 or SigLIP2; the input prompt for the base model is formatted text, and the output is a weighted score over preset category labels.
3. The method for evaluating the quality of text-generated videos based on modular design according to claim 1, characterized in that the total loss function is as follows: where the two symbols denote the model prediction and the true label, respectively, and loss0 and loss1 are the fidelity losses described above.
4. A text-generated video quality assessment system for implementing the method of any one of claims 1-3, characterized in that it includes: a feature extraction module, used to extract temporal features, spatial features, and image-text matching features from text-generated videos; a base scoring module, used to generate initial quality scores for video-text pairs based on the base model; a score correction module, which applies the weights and biases generated after feature mapping to the base score to obtain the corrected predicted score; and a training module, used to train the text-generated video quality assessment model in a supervised manner using a mini-batch learning strategy.
5. The text-generated video quality evaluation system according to claim 4, characterized in that the feature extraction module supports traditional CNN models or mainstream multimodal large models as feature extractors, extracting the output of the last hidden layer as the feature representation.
6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the program, it implements the method as described in any one of claims 1-3.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that when the program is executed by a processor, it implements the method as described in any one of claims 1-3.