Flat panel display visual comfort degree prediction method and system based on multi-modal fusion model
By generating diverse samples and fusing multimodal features through generative adversarial networks, and combining physical and physiological characteristics, a multimodal fusion model is constructed. This solves the problems of data scarcity and model adaptability in the prediction of visual comfort of flat panel display devices, and achieves efficient and accurate visual comfort prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING TECH UNIV
- Filing Date
- 2025-04-10
- Publication Date
- 2026-06-26
AI Technical Summary
Existing methods for predicting visual comfort of flat panel display devices suffer from problems such as a scarcity of high-quality labeled samples, lack of coupling relationships in single-modal analysis, and poor scene adaptability of data and models, resulting in insufficient model robustness and inadequate prediction accuracy.
A multimodal fusion model based on generative adversarial networks (GANs) to generate enhanced materials and fuse them with the original images is adopted. Combining physical and physiological features, the multimodal fusion model is constructed through a stacked ensemble framework. The GAN generates diverse samples, extracts multivariate heterogeneous features, and dynamically activates the base classifier to achieve visual comfort prediction.
The robustness and generalization ability of the model have been improved, the accuracy and physiological interpretability of visual comfort prediction have been enhanced, and the model is adapted to practical application scenarios under different data collection conditions.
Figure CN120299099B_ABST
Description
Technical Field
[0001] This invention relates to the intersection of display technology and human-computer interaction, and in particular to a method and system for predicting visual comfort of flat panel displays based on a multimodal fusion model.
Background Technology
[0002] With the widespread use of new flat panel display technologies, the extended usage time of flat panel display devices in recent decades has led to increasingly serious visual discomfort problems, which have become an important issue in the field of public health. Visual discomfort refers to the subjective feelings and corresponding physiological changes caused by prolonged viewing or significant sensory stress during the human visual process, including symptoms such as fatigue, eye strain, and blurred vision.
[0003] The display industry traditionally assesses visual comfort through subjective questionnaires, where participants view displayed content and complete questionnaires to adjust display parameters or determine research directions. However, this method requires numerous repeated trials, is time-consuming, labor-intensive, and inefficient, and cannot meet the demands of rapid iterative research and development. With the development of artificial intelligence technology, task models based on visual comfort indices have been proposed, but existing technologies suffer from the following core shortcomings:
[0004] First, due to limitations such as a shortage of professional personnel and long experimental cycles, the number of high-quality labeled samples is limited. To train models, it is necessary to rely on data augmentation techniques to expand the limited original samples. Commonly used augmentation methods, such as color transformations (for example, adjusting brightness or contrast), synthetic minority oversampling, and random noise injection, essentially generate new samples by modifying the features of the original image while reusing the original label. Such operations have three limitations: they are not based on the probability distribution of the data, which may distort the sample distribution; the transformations may distort the perceptual characteristics of the human visual system in real scenes, so that the actual visual comfort of a new sample no longer matches the original label; and the augmented samples are generated only within the neighborhood of the known data and cannot cover a wider range of visual perception scenarios. Together, these limitations lead to insufficient model robustness.
[0005] Second, existing research largely focuses on single-modal analysis of display characteristics or physiological signals. Models using display characteristics as input rely on subjective user evaluations and are easily affected by individual differences and environmental factors, resulting in lower data quality. While models using physiological signals as input offer high-quality data, they require costly psychophysical experiments. Furthermore, neither of these models captures the coupling relationship between light signals and the response of the human visual system, leading to insufficient accuracy and stability in predictions.
[0006] Third, existing models rely excessively on single-modal data, resulting in a strong binding between the data acquisition paradigm and model design. This manifests as support for only input data in preset formats, making them incompatible with heterogeneous data in industrial scenarios. This rigid constraint on data format and model severely restricts their flexible deployment in actual R&D within the display industry.
[0007] Fourth, most existing models are designed for stereoscopic 3D displays, but the data processing methods of existing 3D display models cannot be transferred to flat panel display scenarios. The main reason for visual discomfort caused by 3D displays is that, due to differences in depth information, the visual system needs to constantly adjust the focal length and convergence angle of both eyes, leading to convergence-accommodation conflict. Over time, it becomes more difficult for viewers to fuse the images from both eyes, resulting in visual discomfort, mainly manifested as dizziness, generally related to EEG and semicircular canal activity. In contrast, the visual system of flat panel display users does not need to maintain a dynamic adjustment process to adapt to the changing depth of the image. Visual discomfort is less related to convergence-accommodation conflict and is mainly manifested as a decline in specific visual functions, such as dry eyes, soreness, and blurred vision, generally related to eye movements and ECG activity. The influencing factors, mechanisms, and physiological manifestations of visual discomfort in 3D displays and flat panel displays are therefore quite different. The two scenarios differ fundamentally in feature engineering and algorithmic logic, so research methods cannot simply be carried over from one to the other.
Summary of the Invention
[0008] The technical problem to be solved by the present invention is to provide a method and system for predicting visual comfort of flat panel displays based on a multimodal fusion model, which addresses the above-mentioned deficiencies of the prior art. The aim is to solve the problems of scarcity of high-quality labeled samples, lack of coupling relationship in single-modal analysis, and poor scene adaptability of data and models in the prior art.
[0009] To solve the above technical problems, the present invention adopts the following technical solutions.
[0010] In a first aspect, a method for predicting visual comfort of flat panel displays based on a multimodal fusion model is provided, including the following steps. Step 1: Expand image samples
[0011] An original image set is collected, which includes several original images. For each original image, a generative adversarial network model is used to convert an input random sequence into augmented material. The augmented material and the original image are then fused according to a preset ratio to obtain several new images. The new images corresponding to the original image set constitute a new image set.
[0012] Step 2: Obtain training samples
[0013] The original image set is played to the viewer, and the viewer's physiological characteristics and visual comfort evaluation are collected; the physical parameters of the flat panel display screen are measured, and the physical characteristics are obtained by inputting the original image set and the new image set; the physical and physiological characteristics are preprocessed to obtain physical-comfort training samples and physiological-comfort training samples.
[0014] Step 3: Train the multimodal fusion model
[0015] The physical-base classifier and the physiological-base classifier are trained based on physical-comfort training samples and physiological-comfort training samples, respectively; the stacked classifier is trained based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier; the base classifier and the stacked classifier together constitute a multimodal fusion model.
[0016] Step 4: Predict visual comfort
[0017] The test data is input into the multimodal fusion model to obtain the visual comfort prediction results.
[0018] In one implementation, in step one, the generative adversarial network model includes a generator and a discriminator:
[0019] The generator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the ReLU activation function;
[0020] The discriminator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the leaky ReLU activation function;
[0021] By using unsupervised deep learning, both the generator and the discriminator are trained simultaneously. The generator is trained to generate augmented material that is topologically equivalent to the original image; the discriminator is trained to accurately distinguish between the augmented material and the original image.
[0022] In one implementation, the original image is rotated, translated, and reflected before being input into the generative adversarial network model.
[0023] In one embodiment, in step two, the preprocessing of physical features involves extracting imaging features and non-imaging features from the physical features; the physical features include at least the brightness, color coordinates, and spectrum of all input signal values of all pixels on the flat panel display screen; the imaging features include the lightness, chroma, and hue matrices in the LCH color space, as well as the phase congruency matrix in the frequency domain; and the non-imaging features include the retinal irradiance maps corresponding to the retinal photoreceptors.
[0024] Thus, the physical-comfort training samples are divided into imaging-comfort training samples and non-imaging-comfort training samples; correspondingly, in step three, the physical-base classifier is divided into a first classifier and a second classifier. The first classifier is trained based on the imaging-comfort training samples, and the second classifier is trained based on the non-imaging-comfort training samples.
[0025] In one implementation, in step two, preprocessing the physiological features involves extracting heart rate variability (HRV) features and eye-tracking features. The physiological features include at least electrocardiogram (ECG) signals and eye-tracking video. The HRV features include: the mean RR interval; the standard deviation of NN intervals (SDNN); the root mean square of successive differences between adjacent NN intervals (RMSSD); the number of pairs of adjacent NN intervals that differ by more than 50 ms (NN50); the proportion of NN50 among all NN intervals (pNN50); the HRV triangular index (HRVI), a measure of total heart rate variability; the triangular interpolation of the NN interval histogram (TINN), i.e., the baseline width of the triangle that approximates the NN interval histogram; the power in three frequency bands, very low frequency (VLF), low frequency (LF), and high frequency (HF); the power ratios LF/HF and VLF/HF; the approximate entropy (ApEn), sample entropy (SampEn), and Shannon entropy (ShanEn); and the SD1 and SD2 values of the Poincaré plot. The eye-tracking features, calculated from the consecutive binocular images of each video frame, include: blink frequency and duration; saccade frequency, duration, amplitude, latency, and velocity; fixation frequency, duration, and dispersion; and pupil diameter.
[0026] Therefore, the physiological-comfort training samples are divided into heart rate-comfort training samples and eye-tracking-comfort training samples; correspondingly, in step three, the physiological-base classifier is divided into a third classifier and a fourth classifier. The third classifier is trained based on the heart rate-comfort training samples, and the fourth classifier is trained based on the eye-tracking-comfort training samples.
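As an illustration of the HRV preprocessing described above, the following is a minimal sketch of a few of the listed time-domain and Poincaré features. It assumes the NN intervals have already been extracted from the ECG signal and are given in milliseconds; the function name `hrv_time_domain` and the sample values are hypothetical and are not part of the patent.

```python
import numpy as np

def hrv_time_domain(nn_ms):
    """Compute a few HRV features from NN intervals given in milliseconds."""
    nn = np.asarray(nn_ms, dtype=np.float64)
    diff = np.diff(nn)                           # successive differences
    sdnn = np.std(nn, ddof=1)                    # standard deviation of NN intervals
    rmssd = np.sqrt(np.mean(diff ** 2))          # root mean square of successive differences
    nn50 = int(np.sum(np.abs(diff) > 50.0))      # adjacent pairs differing by more than 50 ms
    # Follows the wording "proportion of NN50 among all NN intervals";
    # some tools divide by the number of differences instead.
    pnn50 = nn50 / nn.size
    # Poincare plot descriptors from the standard variance identities.
    sd1 = np.sqrt(0.5 * np.var(diff, ddof=1))
    sd2 = np.sqrt(max(2.0 * sdnn ** 2 - sd1 ** 2, 0.0))
    return {"mean_rr": float(np.mean(nn)), "sdnn": float(sdnn),
            "rmssd": float(rmssd), "nn50": nn50, "pnn50": pnn50,
            "sd1": float(sd1), "sd2": float(sd2)}

# Hypothetical NN interval series (ms), for illustration only.
features = hrv_time_domain([800, 810, 790, 860, 800, 805])
```

Frequency-domain and entropy features (VLF, LF, HF, ApEn, SampEn, ShanEn) would need longer recordings and are omitted from this sketch.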
[0027] In one implementation, in step four, the test data includes test images and the corresponding viewer physiological characteristics. Before the test data is input into the multimodal fusion model, the physical parameters of the flat panel display screen are measured, and the physical characteristics are calculated in combination with the input test images. Then, the physical and physiological characteristics corresponding to the test images are processed according to the preprocessing procedure in step two, including extracting imaging and non-imaging features from the physical features, and extracting heart rate variability features and eye-tracking features from the physiological features. Finally, the extracted features are input into the multimodal fusion model to obtain the visual comfort prediction results.
[0028] In one implementation, step four further includes inputting only the test image or only any physiological characteristic of the viewer corresponding to the test image. At this time, depending on the type of input data, imaging features and non-imaging features are extracted from physical features, or heart rate variability features or eye-tracking features are extracted from physiological features, and then input into the multimodal fusion model to obtain the visual comfort prediction result.
[0029] In one implementation, in step four, when only a test image or only any physiological feature of the viewer corresponding to the test image is input, the corresponding base classifier is automatically matched according to the extracted feature type; by disabling the output of the unmatched base classifier, the stacked classifier adaptively fuses the matched base classifier and outputs the visual comfort prediction result.
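The adaptive matching described above can be sketched as follows. The text does not specify how the stacked classifier treats a disabled base classifier, so this sketch assumes the simplest option: unmatched base classifiers are left out and the remaining fusion weights are renormalized. The names `BASE_NAMES`, `LABELS`, `fuse_active`, and the weight values are hypothetical.

```python
import numpy as np

# Hypothetical names for the four base classifiers and three comfort labels.
BASE_NAMES = ("imaging", "non_imaging", "hrv", "eye_tracking")
LABELS = ("uncomfortable", "neutral", "comfortable")

def fuse_active(base_probs, weights):
    """Fuse class probabilities from the matched base classifiers only.

    base_probs maps a base-classifier name to its class-probability vector;
    classifiers missing from the dict are treated as disabled.
    """
    active = [name for name in BASE_NAMES if name in base_probs]
    if not active:
        raise ValueError("at least one modality must be provided")
    w = np.array([weights[name] for name in active], dtype=np.float64)
    w = w / w.sum()  # assumption: renormalize weights over active classifiers
    stacked = np.stack([np.asarray(base_probs[n], dtype=np.float64) for n in active])
    fused = w @ stacked
    return LABELS[int(np.argmax(fused))], fused

weights = {name: 0.25 for name in BASE_NAMES}  # hypothetical fusion weights

# Only HRV features are available: the other three classifiers are disabled.
label_single, fused_single = fuse_active({"hrv": [0.1, 0.2, 0.7]}, weights)

# Test image plus HRV features: two classifiers are active.
label_pair, fused_pair = fuse_active(
    {"imaging": [0.6, 0.3, 0.1], "hrv": [0.2, 0.2, 0.6]}, weights)
```

A trained stacked classifier would replace the fixed weighted average; the masking logic is the part this sketch is meant to show.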
[0030] In a second aspect, a flat panel display visual comfort prediction system based on a multimodal fusion model is provided, including:
[0031] An image sample expansion module, used to acquire an original image set, which includes several original images. For each original image, a generative adversarial network model is used to convert an input random sequence into augmented material. The augmented material and the original image are then fused according to a preset ratio to obtain several new images. The new images corresponding to the original image set constitute a new image set.
[0032] A training sample acquisition module, used to play the original image set to the viewer and collect the viewer's physiological characteristics and visual comfort evaluation; measure the physical parameters of the flat panel display screen and input the original image set and the new image set to obtain physical characteristics; and preprocess the physical characteristics and physiological characteristics to obtain physical-comfort training samples and physiological-comfort training samples.
[0033] The module for training a multimodal fusion model is used to train the physical-base classifier and the physiological-base classifier based on physical-comfort training samples and physiological-comfort training samples, respectively; based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier, the stacked classifier is trained; the base classifier and the stacked classifier together constitute the multimodal fusion model;
[0034] The visual comfort prediction module is used to input test data into a multimodal fusion model to obtain visual comfort prediction results.
[0035] In one implementation, the visual comfort prediction module includes an adaptive unit, which automatically matches the corresponding base classifier based on the extracted feature type when only a test image or only any physiological feature of the viewer corresponding to the test image is input; by disabling the output of unmatched base classifiers, the stacked classifier adaptively fuses the matched base classifiers and outputs the visual comfort prediction result.
[0036] The beneficial effects of this invention are:
[0037] 1. The generative adversarial network (GAN) model learns the probability distribution of the original images to generate samples that retain their core features but are rich in detail. This invention further proposes a proportional fusion strategy, which mixes the generated augmented material with the original image. This not only preserves the visual baseline of the original image but also uses the diversity of the fused samples to prevent the generated content from being overly concentrated on local details, thus overcoming the potential mode collapse of the generator. The fused samples maintain semantic consistency at the visual level, so they have a visual perception similar to that of the original image and can be given the same visual comfort label, which alleviates the scarcity of training data. At the same time, the augmented material introduces new information at the matrix value level, improving the model's generalization ability and its robustness to input perturbations and noise, and achieving a balance between preserving the original features and introducing novelty in the generated samples.
[0038] 2. This invention is based on two types of data sources: physical features and physiological features. It automatically extracts diverse and heterogeneous features that conform to the working principles of the human visual system to construct the base classifiers. Using the complementarity of physical and physiological features, it comprehensively characterizes the factors influencing visual comfort. Then, through a stacked structure, it integrates the outputs of the base classifiers. This not only reduces uncertainty between different data sources, improving prediction accuracy, but also reduces the interference of outliers from a single data source by providing more comprehensive and accurate information, thus enhancing model robustness. Furthermore, because the features are extracted based on the working principles of the human visual system, the predictions of the model trained on these features are physiologically interpretable. Users can analyze the specific factors causing visual discomfort and its specific manifestations based on the contribution of different input features to the results, which provides guidance for display design.
[0039] 3. This invention proposes pre-training each base classifier and then constructing a stacked classifier. During testing, the corresponding base classifier is activated based on the input data type, while the base classifiers for unmatched modalities are frozen. This design preserves the domain knowledge that each modality's base classifier acquired during training, avoids interference caused by missing or changed modalities by freezing the unmatched classifiers, saves computation time through the stacked classifier, and generalizes efficiently across input scenarios. It overcomes the limitation of traditional multimodal models that rely on full data input and is suitable for practical application scenarios with varying data acquisition conditions.
[0040] 4. This invention achieves a positive cycle of improved data quality, enhanced model performance, and expanded scene generalization through the synergy of diverse samples generated by the generative adversarial network (GAN), multimodal feature extraction with a stacking structure, and a dynamic modality activation mechanism. In particular, the diverse samples and the stacking structure complement each other: the diverse samples alleviate the data scarcity problem, while the stacking structure, by integrating the outputs of the base classifiers for the physical and physiological modalities, maps the diverse features generated by the GAN to a shared feature space. This avoids mode collapse in the generated samples and enhances the model's robustness to input perturbations and noise through multimodal information redundancy. The synergy of these two aspects significantly improves the model's accuracy.
Description of the Drawings
[0041] The invention will now be further described with reference to the accompanying drawings.
[0042] Figure 1 is the overall design framework of the flat panel display visual comfort prediction method based on a multimodal fusion model according to an embodiment of the present invention.
[0043] Figure 2 is a flowchart of sample diversification based on a generative adversarial network model according to an embodiment of the present invention.
[0044] Figure 3 is a flowchart illustrating the extraction of imaging features and non-imaging features based on the physical features of a flat panel display, according to an embodiment of the present invention.
Detailed Implementation
[0045] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present invention or its application or use. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0046] This invention provides a method for predicting visual comfort of flat panel displays based on a multimodal fusion model. As shown in Figure 1, it includes the following steps:
[0047] Step 1: Expand image samples
[0048] An original image set is collected, which includes several original images. For each original image, a generative adversarial network model is used to convert an input random sequence into augmented material. The augmented material and the original image are then fused according to a preset ratio to obtain several new images. The new images corresponding to the original image set constitute a new image set.
[0049] Step 2: Obtain training samples
[0050] The original image set is played to the viewer, and the viewer's physiological characteristics and visual comfort evaluation are collected; the physical parameters of the flat panel display screen are measured, and the physical characteristics are obtained by inputting the original image set and the new image set; the physical and physiological characteristics are preprocessed to obtain physical-comfort training samples and physiological-comfort training samples.
[0051] Step 3: Train the multimodal fusion model
[0052] The physical-base classifier and the physiological-base classifier are trained based on physical-comfort training samples and physiological-comfort training samples, respectively; the stacked classifier is trained based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier; the base classifier and the stacked classifier together constitute a multimodal fusion model.
[0053] Step 4: Predict visual comfort
[0054] The test data is input into the multimodal fusion model to obtain the visual comfort prediction results.
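Steps 3 and 4 can be illustrated with a minimal stacking sketch. The text does not name the type of base or stacked classifier, so this example assumes a tiny nearest-centroid classifier for every stage and uses synthetic, well-separated features; `CentroidClassifier`, the feature dimensions, and the data are all hypothetical.

```python
import numpy as np

class CentroidClassifier:
    """Tiny nearest-centroid classifier that outputs softmax class probabilities."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Distance of every sample to every class centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        z = -d
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Synthetic stand-ins for the two training sets (three comfort classes).
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X_physical = rng.normal(loc=y[:, None] * 3.0, scale=0.5, size=(90, 4))
X_physiological = rng.normal(loc=y[:, None] * 3.0, scale=0.5, size=(90, 6))

# Step 3a: train one base classifier per modality.
physical_base = CentroidClassifier().fit(X_physical, y)
physiological_base = CentroidClassifier().fit(X_physiological, y)

# Step 3b: train the stacked classifier on the base classifiers' outputs.
meta_features = np.hstack([physical_base.predict_proba(X_physical),
                           physiological_base.predict_proba(X_physiological)])
stacked = CentroidClassifier().fit(meta_features, y)

# Step 4: predict comfort classes with the fused model.
accuracy = float(np.mean(stacked.predict(meta_features) == y))
```

In practice the meta-features for training the stacked classifier are usually produced from held-out (out-of-fold) predictions, so that the stacked classifier does not simply inherit the base classifiers' overfitting; the sketch skips this for brevity.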
[0055] Specifically, in step one, the generative adversarial network (GAN) model, as a form of generative artificial intelligence, learns the joint probability distribution P(X,Y) of samples (X) and labels (Y) from observed data, and thereby learns sufficient feature representations. Comparative studies have shown that GAN models have several advantages over other generative models: compared to Variational Autoencoders (VAEs), they introduce less bias; compared to Deep Boltzmann Machines and Generative Stochastic Networks (GSNs), samples can be generated in a single pass; and compared to Non-linear Independent Components Estimation (NICE) and real-valued non-volume preserving transformations (Real NVP), there is no restriction on the size of the latent code.
[0056] This invention does not generate new samples directly from the generative adversarial network model after inputting the original image, because directly generated samples are prone to mode collapse, i.e., over-concentration on local data patterns. Instead, this invention uses a proportional fusion strategy, expressed by the following formula:
[0057] $I_S = \mathrm{ratio} \times I_R + (1 - \mathrm{ratio}) \times I_G$
[0058] In the formula, ratio is the fusion ratio, which in this example is a random value in the range 0.1-0.3; $I_S$ is the matrix of new image values; $I_R$ is the matrix of original image values; and $I_G$ is the matrix of augmented material values generated by the GAN model. The purpose of this step is to obtain new samples that do not excessively alter human visual perception but clearly differ from the original images in matrix values, thus avoiding the local optima that the GAN model may fall into. The added augmented material can be viewed as a perturbation of the original image, which helps the trained model become more robust to noise and transformations and therefore perform better in practical applications. The fused image maintains semantic consistency at the visual level and has a visual perception similar to that of the original image; therefore, the new image can be assigned the same visual comfort label as the original image.
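The proportional fusion formula in [0057] can be sketched in a few lines. This is a minimal illustration that assumes both images are arrays of the same shape with values in [0, 1]; the function name `fuse_images` and the random stand-in arrays are hypothetical.

```python
import numpy as np

def fuse_images(original, generated, ratio):
    """Blend two images as ratio * original + (1 - ratio) * generated."""
    original = np.asarray(original, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    if original.shape != generated.shape:
        raise ValueError("original and generated must have the same shape")
    return ratio * original + (1.0 - ratio) * generated

rng = np.random.default_rng(0)
I_R = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # stand-in for an original image
I_G = rng.uniform(0.0, 1.0, size=(4, 4, 3))  # stand-in for GAN augmented material
ratio = rng.uniform(0.1, 0.3)                # random fusion ratio in the stated range
I_S = fuse_images(I_R, I_G, ratio)           # new image, same label as I_R
```

Because the blend is a convex combination, every value of the new image stays within the range spanned by the two inputs, so no extra clipping is needed.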
[0059] Specifically, in step two, playing the original image set to the viewer and collecting the viewer's physiological characteristics and visual comfort evaluation means arranging for the viewer to watch the original image set on a flat panel display in a dark room. For each original image viewed, the electrocardiogram (ECG) signal and eye-tracking video (EM) are recorded throughout the process. After watching, the viewer is asked to fill out a questionnaire to evaluate the degree of visual comfort. This obtains the viewer's physiological characteristics and corresponding visual comfort when watching the original images, which is the physiological-comfort training sample.
[0060] Measuring a flat panel display screen involves using a photometer, colorimeter, and spectrometer to measure the brightness, color coordinates, and spectrum of all levels of input signals for the red, green, and blue primary colors, as well as the full-screen white field, of the flat panel display. This yields the display's true gamma curve, color gamut, and spectral distribution. Based on this, the physical characteristics corresponding to the original and new image sets can be calculated by inputting the signal values of the original and new image sets.
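As a rough illustration of how measured display data can be turned into per-pixel physical characteristics, the sketch below looks up the luminance of each primary from a measured level-to-luminance table and sums the three channels. It assumes the channels are additive and ignores the black level. The measurement values are synthetic (a gamma-2.2 curve with made-up peak luminances), and `pixel_luminance` is a hypothetical helper.

```python
import numpy as np

# Hypothetical measurement table: luminance (cd/m^2) of each primary
# at a few input signal levels. Real values would come from the photometer.
levels = np.array([0.0, 64.0, 128.0, 192.0, 255.0])
peak = {"R": 50.0, "G": 150.0, "B": 20.0}
measured = {ch: peak[ch] * (levels / 255.0) ** 2.2 for ch in "RGB"}

def pixel_luminance(image, levels, measured):
    """Estimate per-pixel luminance by interpolating each channel's measured curve."""
    image = np.asarray(image, dtype=np.float64)
    total = np.zeros(image.shape[:2])
    for idx, ch in enumerate("RGB"):
        channel = image[..., idx].ravel()
        lum = np.interp(channel, levels, measured[ch])  # linear interpolation
        total += lum.reshape(image.shape[:2])
    return total

# A 1x2 test image: one white pixel and one black pixel.
test_image = np.array([[[255, 255, 255], [0, 0, 0]]])
luminance_map = pixel_luminance(test_image, levels, measured)
```

Measuring every signal level, as the text describes, removes the interpolation error that this coarse five-point table would introduce.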
[0061] Since the original images have already been played to the viewer, the physical features corresponding to the original image set can be directly labeled with the visual comfort labels filled in by the viewer, i.e., the physical-comfort training samples corresponding to the original image set. For new images, as mentioned above, they are generated based on the fusion of the original images and maintain semantic consistency with their corresponding original images at the visual level, having similar visual perception. Therefore, they can be labeled with the same visual comfort labels as their corresponding original images, thereby obtaining the physical-comfort training samples corresponding to the new image set. The physical-comfort training samples corresponding to the original image set and the physical-comfort training samples corresponding to the new image set together constitute the physical-comfort training samples, which are used to train the subsequent physical base classifier.
[0062] In this embodiment, visual comfort level labels are divided into three categories: uncomfortable, neutral, and comfortable. In practical applications, the classification method for subjective evaluation of visual comfort can be adjusted as needed, such as being divided into very uncomfortable, uncomfortable, average, comfortable, and very comfortable. Since this embodiment uses the method of expanding image samples in step one, only a small number of viewers' physiological characteristics and visual comfort evaluations need to be collected to build the subsequent multimodal fusion model, and the prediction results have good accuracy.
[0063] In one implementation, as shown in Figure 2, in step one, the generative adversarial network model includes a generator and a discriminator:
[0064] The generator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the ReLU activation function;
[0065] The discriminator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the leaky ReLU activation function;
[0066] By using unsupervised deep learning, both the generator and the discriminator are trained simultaneously. The generator is trained to generate augmented material that is topologically equivalent to the original image, i.e., augmented material that is as similar to the original image as possible. The discriminator is trained to accurately distinguish between the augmented material and the original image.
[0067] Specifically, the design of the generator and discriminator in this embodiment is shown in the table below:
[0068]
[0069] Specifically, the training objective of the generator is to generate augmented material that is as similar as possible to the original image, that is, to make the generator loss $Loss_G$ as small as possible so that the generated augmented material can pass the discriminator's judgment. In mathematical form: $Loss_G = -\mathrm{average}(\log(P_{G\text{-}R}))$. The training objective of the discriminator is to distinguish augmented material from original images as accurately as possible, maximizing the probability of recognizing the original image and minimizing the probability of misclassifying the augmented material, that is, to make the discriminator loss $Loss_D$ as small as possible. In mathematical form: $Loss_D = -\mathrm{average}(\log(P_{R\text{-}R})) - \mathrm{average}(\log(1 - P_{G\text{-}R}))$. In these formulas, $P_{G\text{-}R}$ is the probability that the discriminator identifies the generated augmented material as real, and $P_{R\text{-}R}$ is the probability that the discriminator identifies the original image as real.
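The two loss expressions in [0069] can be evaluated directly from discriminator outputs. The sketch below is a plain NumPy illustration, not a training loop; it assumes the discriminator's probabilities are already available as arrays, and the clipping constant `eps` is an added safeguard against log(0) that is not part of the source.

```python
import numpy as np

def generator_loss(p_generated_real, eps=1e-12):
    """Generator loss: -average(log(P)) over generated samples judged as real."""
    p = np.clip(np.asarray(p_generated_real, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(np.log(p)))

def discriminator_loss(p_real_real, p_generated_real, eps=1e-12):
    """Discriminator loss: -average(log(P_real)) - average(log(1 - P_generated))."""
    p_rr = np.clip(np.asarray(p_real_real, dtype=np.float64), eps, 1.0 - eps)
    p_gr = np.clip(np.asarray(p_generated_real, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(np.log(p_rr)) - np.mean(np.log(1.0 - p_gr)))

# Hypothetical discriminator outputs for a small batch.
loss_d_confident = discriminator_loss([0.9, 0.9], [0.1, 0.1])
loss_d_confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
loss_g_fooling = generator_loss([0.9, 0.9])
loss_g_caught = generator_loss([0.1, 0.1])
```

A discriminator that cannot tell the two sources apart outputs 0.5 everywhere, which gives a loss of 2·ln 2; the generator's loss falls as more of its output is judged real.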
[0070] In one implementation, the original image is rotated, translated, and reflected before being input into the generative adversarial network model.
[0071] In this implementation, rotation (positive or negative 10°), translation (10 pixels to the left, right, up, or down), and reflection (with 50% probability) are applied in sequence to improve the robustness of the generative adversarial network model.
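The rotate, translate, reflect sequence can be sketched with NumPy alone. Several details are assumptions, since the text does not fix them: the rotation angle is taken as exactly +10° or -10°, vacated pixels are filled with 0, sampling is nearest-neighbor, and the reflection is horizontal. All function names are hypothetical.

```python
import numpy as np

def rotate_nearest(img, angle_deg):
    """Rotate about the image center with nearest-neighbor sampling; gaps become 0."""
    h, w = img.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: locate the source pixel for every output pixel.
    src_x = np.rint(np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(img)
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def translate(img, dx, dy):
    """Shift right by dx and down by dy pixels, filling vacated pixels with 0."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    if abs(dx) >= w or abs(dy) >= h:
        return out
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out

def augment(img, rng):
    """Apply rotation, translation, and reflection in sequence."""
    angle = rng.choice([-10.0, 10.0])
    dx, dy = [(-10, 0), (10, 0), (0, -10), (0, 10)][rng.integers(4)]
    out = translate(rotate_nearest(img, angle), dx, dy)
    if rng.random() < 0.5:  # reflection with 50% probability
        out = np.fliplr(out)
    return out

image = np.arange(32 * 32).reshape(32, 32)
augmented = augment(image, np.random.default_rng(0))
```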
[0072] In one embodiment, in step two, the preprocessing of physical features involves extracting imaging features and non-imaging features from the physical features; the physical features include at least the brightness, color coordinates, and spectrum of all input signal values of all pixels on the flat panel display screen; the imaging features include the lightness, chroma, and hue matrices in the LCH color space, as well as the phase congruency matrix in the frequency domain; and the non-imaging features include the retinal irradiance maps corresponding to the retinal photoreceptors.
[0073] Thus, the physical-comfort training samples are divided into imaging-comfort training samples and non-imaging-comfort training samples; correspondingly, in step three, the physical-base classifier is divided into a first classifier and a second classifier. The first classifier is trained based on the imaging-comfort training samples, and the second classifier is trained based on the non-imaging-comfort training samples.
[0074] Specifically, as shown in Figure 3, the features extracted from the physical features are not limited to the following categories; other features can be extracted according to actual needs. The imaging features are divided into two parts:
[0075] (1) Based on the measured RGB gamma curves and color coordinates, the image is converted in sequence from RGB to XYZ, from XYZ to CIE L*a*b*, and from CIE L*a*b* to LCH. In the LCH color space, feature matrices of lightness, chroma, and hue are extracted.
[0076] (2) The image is transformed to the frequency domain by Fast Fourier Transform, and the phase congruency feature matrix is calculated as follows:
[0077]
[0078] Where PC denotes phase congruency, computed from the local phase and the local amplitude (or energy) $A_n$ of each frequency component. When all phases are aligned, PC equals 1.
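The two imaging feature types above can be illustrated for a single pixel or location. This is a sketch under stated assumptions: the conversion shown is only the final CIE L*a*b* to LCH step (the RGB to XYZ step depends on the measured display data and is omitted), and the phase congruency function uses one commonly cited definition, the magnitude of the summed complex components divided by the sum of their amplitudes, which may differ from the exact formula in the original filing. Function names and input values are hypothetical.

```python
import cmath
import math

def lab_to_lch(lightness, a, b):
    """CIE L*a*b* -> (lightness, chroma, hue angle in degrees within [0, 360))."""
    chroma = math.hypot(a, b)
    hue = math.degrees(math.atan2(b, a)) % 360.0
    return lightness, chroma, hue

def phase_congruency(amplitudes, phases, eps=1e-12):
    """|sum_n A_n * exp(i * phi_n)| / sum_n A_n at one location (assumed definition)."""
    energy = abs(sum(A * cmath.exp(1j * phi) for A, phi in zip(amplitudes, phases)))
    return energy / (sum(amplitudes) + eps)

lch = lab_to_lch(50.0, 0.0, 10.0)
pc_aligned = phase_congruency([1.0, 2.0, 3.0], [0.3, 0.3, 0.3])
pc_opposed = phase_congruency([1.0, 1.0], [0.0, math.pi])
```

With all phases equal the ratio approaches 1, matching the statement that PC equals 1 when the phases are aligned; two equal components in opposite phase give a value near 0.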
[0079] For non-imaging features, firstly, based on the measured RGB spectrum, the actual spectrum of each pixel in the flat panel display is calculated by adding the primary color spectra. Then, according to the retinal photoreceptor sensitivity curve provided by the International Commission on Illumination (CIE) S 026:2018 international standard, the characteristic matrix of retinal irradiance for five types of photoreceptor cells on the retina—S-cone cells, M-cone cells, L-cone cells, rod cells, and ipRGC cells—is calculated, as shown in the following formula:
[0080] $E_\alpha = \int E_{e,\lambda}(\lambda)\, s_\alpha(\lambda)\, d\lambda$
[0081] Where α indexes the five types of photoreceptor cells, $E_\alpha$ is the retinal irradiance for that photoreceptor type, $E_{e,\lambda}(\lambda)$ is the spectral irradiance at wavelength λ, and $s_\alpha(\lambda)$ is the spectral sensitivity of that photoreceptor type at wavelength λ.
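The integral above can be approximated numerically once the spectra are sampled at discrete wavelengths. The following is a minimal sketch using the trapezoidal rule; the function names are hypothetical and the numbers are made-up placeholders, not CIE S 026:2018 sensitivity data or measured display spectra.

```python
def pixel_spectrum(red, green, blue):
    """Per-pixel spectrum obtained by adding the primary color spectra."""
    return [r + g + b for r, g, b in zip(red, green, blue)]

def retinal_irradiance(wavelengths_nm, spectral_irradiance, sensitivity):
    """Trapezoidal approximation of the integral of E_e,lambda * s_alpha over wavelength."""
    total = 0.0
    for k in range(len(wavelengths_nm) - 1):
        step = wavelengths_nm[k + 1] - wavelengths_nm[k]
        left = spectral_irradiance[k] * sensitivity[k]
        right = spectral_irradiance[k + 1] * sensitivity[k + 1]
        total += 0.5 * (left + right) * step
    return total

# Placeholder values for illustration only.
wavelengths = [400.0, 500.0, 600.0]
spectrum = pixel_spectrum([0.2, 0.2, 0.6], [0.3, 0.6, 0.3], [0.5, 0.2, 0.1])
sensitivity = [0.0, 1.0, 0.0]
e_alpha = retinal_irradiance(wavelengths, spectrum, sensitivity)
```

Repeating the call with each of the five sensitivity curves and for every pixel would give the five retinal irradiance maps.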
[0082] Specifically, the physical-base classifier is divided into a first classifier and a second classifier, constructed in this embodiment as follows:
[0083] Input the imaging features, construct a convolutional neural network (CNN network), classify the subjective evaluation of visual comfort, and build a first classifier;
[0084] Input the non-imaging features, construct a convolutional neural network (CNN network) to classify the subjective evaluation of visual comfort, and construct a second classifier;
[0085] The first and second classifier models have the same structure, listed below in order from input to output:
[0086]
(1) Input layer
(2) Projection and reshaping layers
(3) 2D convolutional layers (3 layers in total)
(4) Fully connected layer
(5) Weighted Softmax layer (the weight of each class is inversely proportional to the probability of samples of that class)
[0087] Most of the computation is performed in the three 2D convolutional layers. In each layer, a 5×5 filter slides over the entire input matrix, computing the dot product between the filter and each local patch to form the output array. The first layer captures simple features; deeper layers capture progressively more complex features, extracting information at more scales from the input matrix. In the fully connected layer, each output node is connected to every node in the previous layer. The final Softmax layer outputs the class probabilities.
[0088]
[0089] Where P is the output probability, j denotes the j-th category, x is the sample vector, w is the weight vector, and K is related to the number of nodes in the previous layer. The category with the highest probability is taken as the prediction result.
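The output stage described above can be sketched with the standard softmax function, assumed here because it matches the symbols defined in the text. The class weighting shown (weights proportional to the inverse of each class count, applied to a cross-entropy loss) is one plausible reading of the weighted Softmax layer, not a confirmed detail. All names and numbers are hypothetical.

```python
import math

def class_scores(x, weight_vectors):
    """One score per class: the dot product of x with that class's weight vector."""
    return [sum(xi * wi for xi, wi in zip(x, w)) for w in weight_vectors]

def softmax(scores):
    """Convert scores into probabilities that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_frequency_weights(class_counts):
    """Class weights proportional to 1 / count, normalised to sum to 1 (assumed)."""
    inverse = [1.0 / c for c in class_counts]
    total = sum(inverse)
    return [v / total for v in inverse]

def weighted_cross_entropy(probs, true_class, weights):
    """Loss for one sample, scaled by the weight of its true class."""
    return -weights[true_class] * math.log(probs[true_class])

scores = class_scores([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
probs = softmax(scores)
predicted = probs.index(max(probs))  # the category with the highest probability
weights = inverse_frequency_weights([1, 2])
```

Weighting rare classes more heavily is a common way to counter class imbalance such as the uneven label distribution reported for the dataset.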
[0090] In this embodiment, the CNN training strategy employs 10-fold cross-validation: each time, 10% of the dataset is randomly selected as the test set and the remainder is used as the training set, and this is repeated 10 times. The first and second classifier models are trained using stochastic gradient descent with momentum (SGDM). The advantage of this method is that it adjusts the parameters according to the accumulated update trend, which helps avoid stalling at points with small gradients and yields more stable convergence. In practical applications, a suitable model can be selected for training according to the specific circumstances.
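The training procedure above can be sketched in two small pieces: the repeated random 10% hold-out and a single momentum update. This is an illustration under assumptions, not the embodiment's code: the function names, learning rate, momentum coefficient, and the toy objective f(w) = w² are all hypothetical.

```python
import random

def random_holdout_splits(n_samples, repeats=10, test_fraction=0.1, seed=0):
    """Return (train_indices, test_indices) pairs, one per repeat."""
    rng = random.Random(seed)
    n_test = max(1, int(round(n_samples * test_fraction)))
    splits = []
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        splits.append((indices[n_test:], indices[:n_test]))
    return splits

def sgdm_step(weights, grads, velocity, lr=0.1, momentum=0.9):
    """One SGDM update: the velocity accumulates the trend of past gradients."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

splits = random_holdout_splits(50)

# Toy example: minimise f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
first_w, first_v = sgdm_step(w, [2.0 * w[0]], v)
for _ in range(200):
    w, v = sgdm_step(w, [2.0 * w[0]], v)
```

Because the velocity carries over between steps, the update keeps moving through regions where the current gradient is small.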
[0091] In one implementation, in step two, preprocessing the physiological features involves extracting heart rate variability (HRV) features and eye-tracking features. The physiological features include at least electrocardiogram (ECG) signals and eye-tracking video. The HRV features include: the mean RR interval; the standard deviation of NN intervals (SDNN); the root mean square of the differences between adjacent NN intervals (RMSSD); the number of adjacent NN interval pairs differing by more than 50 ms (NN50); the proportion of NN50 among all NN intervals (pNN50); the HRV triangular index (HRVI); the base width of the triangle approximating the histogram of all NN intervals (TINN); the power in three frequency bands, namely very low frequency (VLF), low frequency (LF), and high frequency (HF); the LF/HF and VLF/HF power ratios; approximate entropy (ApEn), sample entropy (SampEn), and Shannon entropy (ShanEn); and the SD1 and SD2 values of the Poincaré plot. The eye-tracking features, calculated from consecutive frames of the binocular images in the video, include: blink frequency and duration; saccade frequency, duration, amplitude, latency, and velocity; fixation frequency, duration, and dispersion; and pupil diameter.
[0092] Therefore, the physiological-comfort training samples are divided into heart rate-comfort training samples and eye-tracking-comfort training samples; correspondingly, in step three, the physiological-base classifier is divided into a third classifier and a fourth classifier. The third classifier is trained based on the heart rate-comfort training samples, and the fourth classifier is trained based on the eye-tracking-comfort training samples.
[0093] Specifically, the features extracted from physiological characteristics are not limited to the following categories; other features can be extracted according to actual needs. The following features are just examples:
[0094] Physiological features are extracted from the ECG signals and eye-tracking video using software such as Matlab (with dedicated toolboxes for ECG analysis, Eyelink, etc.). The features extracted from the ECG signals include heart rate (HR), mean RR interval, standard deviation of NN intervals (SDNN), root mean square of the differences between adjacent NN intervals (RMSSD), number of adjacent NN interval pairs differing by more than 50 ms (NN50), the proportion of NN50 among all NN intervals (pNN50), HRV triangular index (HRVI), the base width of the triangle approximating the histogram of all NN intervals (TINN), very low frequency (VLF), low frequency (LF), and high frequency (HF) power, the LF/HF and VLF/HF power ratios, approximate entropy (ApEn), sample entropy (SampEn), Shannon entropy (ShanEn), and the SD1 and SD2 values of the Poincaré plot. The features extracted from the eye-tracking video include the frequency (Hz) and duration (ms) of blinks, saccades, and fixations; the amplitude (°), latency (ms), and velocity (°/s) of saccades; the dispersion of fixations (px); and the pupil diameter (mm), all calculated from consecutive frames of the binocular images.
[0095] Specifically, the physiological-base classifier is divided into a third classifier and a fourth classifier, constructed in this embodiment as follows:
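A few of the time-domain HRV features listed above can be computed directly from a series of NN intervals. The sketch below is an illustration, not the toolbox code used in the embodiment: the function name and the interval values are hypothetical, SDNN uses the sample standard deviation (an assumption), and pNN50 divides by the number of NN intervals, following the wording of the text.

```python
import math

def hrv_time_domain(nn_ms):
    """Compute basic time-domain HRV features from NN intervals in milliseconds."""
    n = len(nn_ms)
    mean_nn = sum(nn_ms) / n
    # Sample standard deviation of the NN intervals (n - 1 in the denominator).
    sdnn = math.sqrt(sum((x - mean_nn) ** 2 for x in nn_ms) / (n - 1))
    diffs = [nn_ms[k + 1] - nn_ms[k] for k in range(n - 1)]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    nn50 = sum(1 for d in diffs if abs(d) > 50)
    return {
        "heart_rate_bpm": 60000.0 / mean_nn,
        "mean_nn_ms": mean_nn,
        "sdnn_ms": sdnn,
        "rmssd_ms": rmssd,
        "nn50": nn50,
        "pnn50": nn50 / n,
    }

# Hypothetical NN intervals for illustration.
features = hrv_time_domain([800.0, 860.0, 800.0, 810.0])
```

Frequency-domain, entropy, and Poincaré features would need additional signal processing and are left out of this sketch.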
[0096] Input biological features extracted from the electrocardiogram signal, construct a decision tree model, classify the subjective evaluation of visual comfort, and construct a third classifier;
[0097] The biological features extracted from the eye-tracking images are used as input to construct a decision tree model, which is used to classify the subjective evaluation of visual comfort, thus constructing a fourth classifier.
[0098] Decision trees, as tree-structured models, are robust to noisy data sources such as physiological signals. Each non-leaf node represents a test on a feature attribute, each branch represents the outcome of that test over a certain value range, and each leaf node stores a class. Decision tree training also employs 10-fold cross-validation. During training, the training set is recursively split into subsets; the recursion stops when all class labels in a subset are the same. To classify a sample, the feature attributes are tested starting from the root node and the branch matching each value is followed until a leaf node is reached; the class stored in that leaf node is the classification result. In this embodiment, the class with the minimum entropy (maximum probability) is selected as the prediction result.
[0099]
[0100] Where E is the entropy, i denotes the i-th category, and $p_i$ is the probability of the i-th class.
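The entropy used as the split criterion can be illustrated with the standard Shannon form, assumed here because it matches the symbols defined above; the base-2 logarithm and the function names are choices made for this sketch.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def label_entropy(labels):
    """Entropy of the class distribution within one subset of training samples."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return entropy([c / len(labels) for c in counts.values()])

mixed = label_entropy(["comfort", "discomfort", "comfort", "discomfort"])
pure = label_entropy(["comfort", "comfort", "comfort"])
```

A subset whose labels are all the same has zero entropy, which corresponds to the stopping condition for the recursive split.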
[0101] In practical applications, a suitable model can be selected for training based on the actual situation.
[0102] In one implementation, in step four, the test data includes test images and corresponding viewer physiological characteristics. Before inputting the test data into the multimodal fusion model, the physical parameters of the flat panel display screen are measured first. The physical characteristics are calculated by combining the signal values of the input test images. Then, the physical and physiological characteristics corresponding to the test images are processed according to the preprocessing procedure in step two, including extracting imaging and non-imaging features from the physical features, and extracting heart rate variability features and eye-tracking features from the physiological features. Finally, the data is input into the multimodal fusion model to obtain the visual comfort prediction results.
[0103] Specifically, most existing studies input the RGB value matrix of the displayed image and raw physiological data into the model without processing for training, leading to poor model classification performance. This application, based on existing experimental research results on visual comfort, extracts meaningful features from measurement results and physiological data to construct a basic model. The advantages of this approach are that it allows for the measurement of influencing factors and physiological manifestations of visual comfort from more dimensions, extracts diverse and heterogeneous information from limited data, and thus improves the prediction accuracy of stacked classifiers. Furthermore, because the features are extracted based on the working principles of the human visual system, the prediction results of the model trained on these features are physiologically interpretable. Users can analyze the specific factors causing visual discomfort and its specific manifestations based on the contribution of different inputs to the results, providing guidance for display design.
[0104] In one implementation, step four further includes inputting only the test image or only any physiological characteristic of the viewer corresponding to the test image. At this time, depending on the type of input data, imaging features and non-imaging features are extracted from physical features, or heart rate variability features or eye-tracking features are extracted from physiological features, and then input into the multimodal fusion model to obtain the visual comfort prediction result.
[0105] For example, by inputting only the user's ECG signal or only the displayed image matrix, the stacked classifier can use the corresponding classifier as the base model to output the predicted visual comfort. The more input data and the more base models there are, the higher the prediction accuracy will be. This application achieves information complementarity based on multimodal feature fusion. Compared to other models that can only predict visual comfort by inputting specific data, this application only needs to use existing experimental data and one or more features corresponding to the input screen image. The algorithm automatically extracts optical and biological information, and the stacked classifier can output the predicted visual comfort, significantly broadening its applicability. For example, when other laboratories only collect one or a few features, they can apply the method of this application to predict visual comfort based on limited experiments. Furthermore, this application also significantly improves the accuracy and stability of the stacked classifier's predictions through ensemble learning.
[0106] In one implementation, in step four, when only a test image or only any physiological feature of the viewer corresponding to the test image is input, the corresponding base classifier is automatically matched according to the extracted feature type; by disabling the output of the unmatched base classifier, the stacked classifier adaptively fuses the matched base classifier and outputs the visual comfort prediction result, so as to flexibly adapt to different scenarios.
[0107] Specifically, stacking is most effective when the predictions (or errors) of the base models are uncorrelated or only weakly correlated. By drawing on more comprehensive information, the stacked classifier can prevent errors in a single modality from misleading the classification result. This embodiment uses a random forest model, commonly used in related fields, as the meta-model, with the first to fourth classifiers as the base models. The input to the meta-model is the prediction scores of the base models for the training set samples, i.e., the class probabilities of the three labels. Training also uses 10-fold cross-validation. When fewer types of test data are input, the stacked classifier disables the outputs of the unmatched base classifiers and assigns different weights to the outputs of the remaining base classifiers according to their performance and characteristics, fusing their results flexibly into a final prediction. This adaptive capability makes the stacked classifier more robust to changes in the base classifiers or the data distribution.
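The idea of disabling unmatched base classifiers can be illustrated with a deliberately simplified fusion rule. Note the substitution: the embodiment uses a random forest as the meta-model, whereas this sketch replaces it with a weighted average of the class probabilities from whichever base classifiers are active. The modality names, weights, and probabilities are hypothetical.

```python
def fuse(base_probabilities, weights):
    """Weighted average over active base classifiers; None marks a disabled one."""
    active = {name: p for name, p in base_probabilities.items() if p is not None}
    if not active:
        raise ValueError("at least one base classifier output is required")
    total_weight = sum(weights[name] for name in active)
    n_classes = len(next(iter(active.values())))
    return [
        sum(weights[name] * p[c] for name, p in active.items()) / total_weight
        for c in range(n_classes)
    ]

weights = {"imaging": 1.0, "non_imaging": 1.0, "hrv": 1.0, "eye": 1.0}

# Only the ECG-derived (HRV) classifier has input; the other three are disabled.
ecg_only = fuse(
    {"imaging": None, "non_imaging": None, "hrv": [0.2, 0.5, 0.3], "eye": None},
    weights,
)

# Two physiological classifiers are active.
two_modalities = fuse(
    {"imaging": None, "non_imaging": None,
     "hrv": [0.2, 0.5, 0.3], "eye": [0.4, 0.3, 0.3]},
    weights,
)
predicted_class = two_modalities.index(max(two_modalities))
```

Renormalising by the total weight of the active classifiers keeps the fused output a valid probability distribution regardless of how many modalities are present.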
[0108] Furthermore, ensemble learning, as a new artificial intelligence technology, has already been applied in fields such as medical data analysis and security detection. Bagging considers homogeneous learners for parallel learning, while Boosting considers homogeneous learners for sequential learning. In this embodiment, firstly, heterogeneous learners are designed as base models for datasets of different modalities, and trained in parallel to save computation time. Then, based on the prediction scores of the trained base models, a stacked classifier is trained to learn complementary patterns between modalities. These advantages are beyond the reach of majority voting or averaging in bagging and boosting, making it more practical.
[0109] This invention also provides a flat panel display visual comfort prediction system based on a multimodal fusion model, comprising:
[0110] An expanded image sample module is used to acquire a raw image set, which includes several raw images. For each raw image, a generative adversarial network model is used to convert the input random sequence into augmented material. The augmented material and the raw image are then fused together according to a preset ratio to obtain several new images. The new images corresponding to the raw image set constitute a new image set.
[0111] The training sample acquisition module is used to play the original image set to the viewer and collect the viewer's physiological characteristics and visual comfort evaluation; to measure the physical parameters of the flat panel display screen and obtain physical characteristics by inputting the original image set and the new image set; and to preprocess the physical characteristics and physiological characteristics to obtain physical-comfort training samples and physiological-comfort training samples.
[0112] The module for training a multimodal fusion model is used to train the physical-base classifier and the physiological-base classifier based on physical-comfort training samples and physiological-comfort training samples, respectively; based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier, the stacked classifier is trained; the base classifier and the stacked classifier together constitute the multimodal fusion model;
[0113] The visual comfort prediction module is used to input test data into a multimodal fusion model to obtain visual comfort prediction results.
[0114] In one implementation, the visual comfort prediction module includes an adaptive unit, which automatically matches the corresponding base classifier based on the extracted feature type when only a test image or only any physiological feature of the viewer corresponding to the test image is input; by disabling the output of unmatched base classifiers, the stacked classifier adaptively fuses the matched base classifiers and outputs the visual comfort prediction result.
[0115] The advantages of this invention are:
[0116] 1. Generative Adversarial Network (GAN) models learn the statistical patterns of original images to generate samples that retain their core features but are rich in detail. This invention further proposes a proportional fusion strategy, which mixes the generated enhanced samples with the original image. This not only preserves the visual baseline of the original image but also uses the diversity of the fused samples to keep the generated content from focusing excessively on local details, thus overcoming the potential mode collapse limitation of the generator. The fused samples maintain semantic consistency at the visual level, ensuring that they have a similar visual perception to the original image and can be labeled with the same visual comfort label to alleviate the problem of scarce training data. At the same time, new information is introduced through the enhanced samples at the matrix value level, thereby improving the model's robustness to input perturbations and noise and its generalization ability, achieving a balance between preserving original features and generating innovative samples.
[0117] 2. This invention is based on two types of data sources: physical features and physiological features. It automatically extracts diverse and heterogeneous features that conform to the working principles of the human visual system to construct a basic classifier. Utilizing the complementarity of physical and physiological features, it comprehensively characterizes the factors influencing visual comfort. Then, through a stacked structure, it integrates the output of the basic classifier. This not only reduces uncertainty between different data sources, improving prediction accuracy, but also reduces the interference of outliers from a single data source by providing more comprehensive and accurate information, thus enhancing model robustness. Furthermore, because the features are extracted based on the working principles of the human visual system, the prediction results of the model trained on these features are physiologically interpretable. Users can analyze the specific factors causing visual discomfort and its specific manifestations based on the contribution of different input features to the results, providing guidance for display design.
[0118] 3. This invention proposes pre-training each basic classifier and then constructing a stacked classifier. During testing, the corresponding basic classifier is activated based on the input data type, while the basic classifiers for unmatched modalities are frozen. This design preserves the domain knowledge of each modality's basic classifier during pre-training, avoids cross-modal interference by freezing the classifiers, saves computation time by stacking the classifiers, and achieves efficient generalization to the input scenario. It overcomes the limitation of traditional multimodal models that rely on full data input and is suitable for practical application scenarios with varying data acquisition conditions.
[0119] 4. This invention achieves a positive cycle of improved data quality, enhanced model performance, and expanded scenario generalization through the synergistic effect of diverse samples generated by generative adversarial networks (GANs), multimodal feature extraction and stacking structures, and dynamic modality activation mechanisms. In particular, the complementary effect of diverse samples and stacking structures is significant: diverse samples alleviate the data scarcity problem, while the stacking structure, by integrating the outputs of the basic classifiers for physical and physiological modalities, maps the diverse features generated by the GAN to a shared feature space. This avoids mode collapse in the generated samples and enhances the model's robustness to input perturbations through multimodal information redundancy. The synergy of these two aspects significantly improves the model's accuracy.
[0120] The effectiveness of the embodiments of the present invention was verified as follows. Common evaluation metrics, namely accuracy (ACC), precision (Pre), recall (Rec), and weighted F1 score ($F1_W$), were used to evaluate this application; the closer a value is to 1, the higher the prediction accuracy of the multimodal fusion model.
[0121]
[0122] (i for each category)
[0123] where:
[0124] $Pre_W = W_1 \times Pre_1 + W_2 \times Pre_2 + W_3 \times Pre_3$
[0125] $Rec_W = W_1 \times Rec_1 + W_2 \times Rec_2 + W_3 \times Rec_3$
[0126] Where i indexes the categories. For category i, $TP_i$ (True Positive) is the number of samples of that class that the model predicts correctly; $TN_i$ (True Negative) is the number of samples of other classes that the model correctly predicts as not belonging to that class; $FP_i$ (False Positive) is the number of samples of other classes that the model predicts as that class; $FN_i$ (False Negative) is the number of samples of that class that the model predicts as another class; and $W_i$ is the weight of the class, proportional to the frequency of samples of class i, with the weights summing to 1.
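The weighted metrics above can be computed from lists of true and predicted labels. This sketch rests on two assumptions because the metric formulas did not survive in this text: per-class precision and recall take the usual forms TP/(TP+FP) and TP/(TP+FN), and the weighted F1 score is the harmonic mean of the weighted precision and weighted recall. The function name and the label lists are hypothetical.

```python
def weighted_metrics(y_true, y_pred, classes):
    """Return accuracy, weighted precision, weighted recall, and weighted F1."""
    n = len(y_true)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    pre_w = 0.0
    rec_w = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        weight = (tp + fn) / n  # class frequency; the weights sum to 1
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        pre_w += weight * precision
        rec_w += weight * recall
    f1_w = 2 * pre_w * rec_w / (pre_w + rec_w) if pre_w + rec_w else 0.0
    return accuracy, pre_w, rec_w, f1_w

# Hypothetical labels for three classes (illustration only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, pre_w, rec_w, f1_w = weighted_metrics(y_true, y_pred, classes=[0, 1, 2])
```

With frequency-based weights, the weighted recall always equals the accuracy, which is a quick sanity check on any implementation.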
[0127] A saturation-visual comfort dataset containing 1120 samples (database source: Yunyang Shi, Yan Tu, Lili Wang, Xin Gao. 2019. P-33: Effects of luminance, contrast and saturation of HDR QLED display on visual system based on eye movement. SID Symposium Digest of Technical Papers, 50) was used for training and testing. The corresponding experiment involved 14 participants. Each participant viewed images from a television display at five different saturation settings in a dark room in random order. Electrocardiogram (ECG) signals and eye-tracking videos were recorded throughout the process, and participants provided psychological assessment scores for visual comfort at different settings after viewing. Each experiment lasted approximately 50 minutes. The resulting dataset samples included the displayed content, ECG signals, eye-tracking videos, and subjective evaluations of visual comfort as labels. The label distribution was: 21.52% discomfort, 49.37% insensitivity, and 29.11% comfort.
[0128] The prediction performance of the trained model on the test set is shown in the table below:
[0129]
[0130] In the table, "no sample diversity" means that $I_R$ (original samples not mixed with GAN output) rather than $I_G$ (diverse samples mixed with GAN output) is used directly as input to the "Display Feature Extraction" section. "Does not contain human information" means only physical features (imaging and non-imaging features) are input; "Does not contain display information" means only physiological features are input. "Does not employ ensemble learning" means using only a single-model classifier with four inputs for prediction, rather than a stacked fusion of multiple models.
[0131] To verify the model's performance under different display usage scenarios, the model corresponding to the table above was used as a pre-trained model for transfer learning on the luminance-display quality dataset (dataset source: Xin Gao, Yan Tu, Lili Wang, Yunyang Shi, Wei Zhang. 2019. The effect of luminance on visual perception based on eye movement and ECG. 2019 3rd International Conference on Circuits, System and Simulation (ICCSS), 221-224). In the experiment corresponding to the dataset, another 26 participants viewed images on a television monitor at five brightness settings and assessed their visual fatigue level (other experimental settings were the same as for the saturation-visual comfort dataset). The resulting dataset samples contained display information, ECG signals, and eye movement (EM) signals, with the following label distribution: 8.15% moderate fatigue, 43.70% mild fatigue, and 48.15% no fatigue. The model's prediction performance was: ACC = 0.78 ± 0.17, $F1_W$ = 0.72 ± 0.24. This indicates that the model in this embodiment of the invention is flexible, stable, and adaptable to different environments and tasks.
Claims
1. A method for predicting visual comfort of flat panel displays based on a multimodal fusion model, comprising the following steps:
Step 1: Expand image samples. A raw image set is collected, which includes several raw images. For each raw image, a generative adversarial network model is used to convert the input random sequence into augmented material. The augmented material and the raw image are then fused together according to a preset ratio to obtain several new images. The new images corresponding to the original image set constitute a new image set.
Step 2: Obtain training samples. The original image set is played to the viewer, and the viewer's physiological characteristics and visual comfort evaluation are collected; the physical parameters of the flat panel display screen are measured, and the physical characteristics are obtained by inputting the original image set and the new image set. Physical and physiological characteristics are preprocessed to obtain physical-comfort training samples and physiological-comfort training samples.
Step 3: Train the multimodal fusion model. The physical-base classifier and the physiological-base classifier are trained based on physical-comfort training samples and physiological-comfort training samples, respectively; the stacked classifier is trained based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier; the base classifier and the stacked classifier together constitute a multimodal fusion model.
Step 4: Predict visual comfort. The test data is input into the multimodal fusion model to obtain the visual comfort prediction results.
2. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 1, characterized in that: In step one, the generative adversarial network model includes a generator and a discriminator: The generator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the ReLU activation function; The discriminator consists of several cascaded deep 2D convolutional blocks, with each convolutional layer employing the leaky ReLU activation function; By using unsupervised deep learning, both the generator and the discriminator are trained simultaneously. The generator is trained to generate augmented material that is topologically equivalent to the original image; the discriminator is trained to accurately distinguish between the augmented material and the original image.
3. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 2, characterized in that: The original image is rotated, translated, and reflected before being input into the generative adversarial network model.
4. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 1, characterized in that: In step two, the preprocessing of physical features involves extracting imaging features and non-imaging features from the physical features; physical features include at least the luminance, color coordinates, and spectrum for all input signal values of all pixels on the flat panel display screen; Imaging features include the lightness, chroma, and hue matrices in the LCH color space, as well as the phase congruency matrix in the frequency domain; Non-imaging features include retinal irradiance maps corresponding to retinal photoreceptors; Thus, the physical-comfort training samples are divided into imaging-comfort training samples and non-imaging-comfort training samples; correspondingly, in step three, the physical-base classifier is divided into a first classifier and a second classifier. The first classifier is trained based on the imaging-comfort training samples, and the second classifier is trained based on the non-imaging-comfort training samples.
5. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 1, characterized in that: In step two, the preprocessing of physiological features involves extracting heart rate variability features and eye-tracking features; the physiological features include at least electrocardiogram signals and eye-tracking videos. Heart rate variability features include the mean RR interval; the standard deviation of NN intervals (SDNN); the root mean square of the differences between adjacent NN intervals (RMSSD); the number of adjacent NN interval pairs differing by more than 50 ms (NN50); the proportion of NN50 among all NN intervals (pNN50); the HRV triangular index (HRVI); the base width of the triangle approximating the histogram of all NN intervals (TINN); the power of the three frequency bands: very low frequency (VLF), low frequency (LF), and high frequency (HF); the power ratios LF/HF and VLF/HF; approximate entropy (ApEn), sample entropy (SampEn), and Shannon entropy (ShanEn); and the SD1 and SD2 values of the Poincaré plot; eye-tracking features, calculated from consecutive frames of the binocular images in the video, include blink frequency and duration; saccade frequency, duration, amplitude, latency, and velocity; fixation frequency, duration, and dispersion; and pupil diameter; Therefore, the physiological-comfort training samples are divided into heart rate-comfort training samples and eye-tracking-comfort training samples; correspondingly, in step three, the physiological-base classifier is divided into a third classifier and a fourth classifier. The third classifier is trained based on the heart rate-comfort training samples, and the fourth classifier is trained based on the eye-tracking-comfort training samples.
6. A method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 4 or 5, characterized in that: In step four, the test data includes test images and corresponding viewer physiological characteristics. Before inputting the test data into the multimodal fusion model, the physical parameters of the flat panel display screen are measured first, and the physical characteristics are calculated in combination with the test images. Then, the physical and physiological characteristics corresponding to the test images are processed according to the preprocessing procedure in step two, including extracting imaging and non-imaging features from the physical features, and extracting heart rate variability features and eye-tracking features from the physiological features. Finally, the data is input into the multimodal fusion model to obtain the visual comfort prediction results.
7. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 6, characterized in that: step four further includes the case in which only the test image, or only any one physiological feature of the viewer corresponding to the test image, is input. In this case, depending on the type of input data, either imaging features and non-imaging features are extracted from the physical features, or heart rate variability features or eye-tracking features are extracted from the physiological features. The extracted features are then input into the multimodal fusion model to obtain the visual comfort prediction result.
8. The method for predicting visual comfort of flat panel displays based on a multimodal fusion model as described in claim 7, characterized in that: in step four, when only a test image or only any one physiological feature of the viewer corresponding to the test image is input, the corresponding base classifiers are automatically matched according to the extracted feature type; the outputs of the unmatched base classifiers are disabled, and the stacked classifier adaptively fuses the outputs of the matched base classifiers and outputs the visual comfort prediction result.
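The claims do not state how the stacked classifier fuses the remaining outputs once unmatched base classifiers are disabled. One possible reading, sketched below purely as an assumption, is a weighted average of the base classifiers' class-probability vectors in which missing modalities are masked out and the remaining weights are renormalised. The names `BASE_NAMES`, `fuse_active`, and the weight values are hypothetical and do not come from the patent.

```python
import numpy as np

# Hypothetical labels for the four base classifiers (imaging, non-imaging,
# heart rate variability, eye tracking).
BASE_NAMES = ["imaging", "non_imaging", "hrv", "eye"]

def fuse_active(base_probs, weights):
    """Fuse only the base classifiers whose modality is present.

    base_probs: dict mapping a name to a class-probability vector, or None
                when that modality was not supplied (output disabled).
    weights:    dict mapping a name to a non-negative fusion weight, assumed
                to have been learned when the stacked classifier was trained.
    Returns the fused probability vector and the predicted class index.
    """
    active = [n for n in BASE_NAMES if base_probs.get(n) is not None]
    if not active:
        raise ValueError("at least one modality must be provided")
    w = np.array([weights[n] for n in active], dtype=float)
    w = w / w.sum()                          # renormalise over matched classifiers
    stacked = np.stack([np.asarray(base_probs[n], dtype=float) for n in active])
    fused = w @ stacked                      # weighted average of probabilities
    return fused, int(np.argmax(fused))

# Illustrative weights and an image-only input (physiological outputs disabled).
weights = {"imaging": 0.4, "non_imaging": 0.1, "hrv": 0.3, "eye": 0.2}
probs = {"imaging": [0.7, 0.3], "non_imaging": [0.2, 0.8], "hrv": None, "eye": None}
fused, label = fuse_active(probs, weights)
```

A trained meta-learner that accepts a modality mask alongside the base outputs would be another way to realise the same behaviour; the weighted average is chosen here only because it is easy to inspect.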
9. A flat panel display visual comfort prediction system based on a multimodal fusion model, comprising: an image sample expansion module, used to acquire an original image set comprising several original images, wherein, for each original image, a generative adversarial network model converts an input random sequence into augmented material, the augmented material and the original image are fused according to a preset ratio to obtain several new images, and the new images corresponding to the original image set constitute a new image set; a training sample acquisition module, used to display the original image set to the viewer and collect the viewer's physiological features and visual comfort evaluations, to measure the physical parameters of the flat panel display screen and calculate physical features in combination with the original image set and the new image set, and to preprocess the physical features and physiological features to obtain physical-comfort training samples and physiological-comfort training samples; a multimodal fusion model training module, used to train the physical-base classifier and the physiological-base classifier on the physical-comfort training samples and the physiological-comfort training samples, respectively, and to train the stacked classifier based on the stacked ensemble framework and the trained physical-base classifier and physiological-base classifier, wherein the base classifiers and the stacked classifier together constitute the multimodal fusion model; and a visual comfort prediction module, used to input test data into the multimodal fusion model to obtain the visual comfort prediction result.
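The fusion of augmented material with an image "according to a preset ratio" is not specified further in the claims. A minimal sketch, assuming a pixel-wise linear blend of two 8-bit images of identical shape, is given below. The function name `fuse_with_ratio` and the default ratio are hypothetical, and the generative adversarial network that would produce the augmented material is outside the scope of the sketch.

```python
import numpy as np

def fuse_with_ratio(original, augmented, ratio=0.2):
    """Blend augmented material into an image at a preset ratio.

    Assumption for illustration: new = (1 - ratio) * original + ratio * augmented,
    applied pixel-wise to uint8 images with the same shape.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    original = np.asarray(original, dtype=np.float32)
    augmented = np.asarray(augmented, dtype=np.float32)
    if original.shape != augmented.shape:
        raise ValueError("images must have the same shape")
    blended = (1.0 - ratio) * original + ratio * augmented
    # Round and clip back to the valid 8-bit range.
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)

# Several new images can be produced from one image by varying the ratio.
base = np.full((4, 4, 3), 100, dtype=np.uint8)
material = np.full((4, 4, 3), 200, dtype=np.uint8)
new_images = [fuse_with_ratio(base, material, r) for r in (0.1, 0.25, 0.5)]
```

Because the blend is linear, the ratio directly controls how far each new image departs from its source.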
10. The flat panel display visual comfort prediction system based on a multimodal fusion model as described in claim 9, characterized in that: the visual comfort prediction module includes an adaptive unit which, when only a test image or only any one physiological feature of the viewer corresponding to the test image is input, automatically matches the corresponding base classifiers according to the extracted feature type; the outputs of the unmatched base classifiers are disabled, and the stacked classifier adaptively fuses the outputs of the matched base classifiers and outputs the visual comfort prediction result.