An artistic material aesthetic evaluation method and system
By simulating the human visual scanning path with a ConvNeXt-based feature extractor and a visual attention module, and combining multiple expert analysis modules, the invention addresses the insufficient accuracy of existing techniques for aesthetic evaluation of artistic images, achieving efficient, multi-dimensional aesthetic evaluation of artistic materials.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2025-08-05
- Publication Date
- 2026-06-19
AI Technical Summary
Existing image aesthetic computational models are unable to accurately evaluate images with highly artistic styles, especially paintings. The evaluation results deviate significantly from human subjective aesthetics and lack the ability to explain aesthetic differences in local details of images.
The method uses a shared visual feature extraction module based on the ConvNeXt architecture, combined with a visual attention module and multiple expert analysis modules. It generates a simulated human visual scanning path through visual working memory and priority maps, segments the image into local regions, performs multi-layer encoding with a fusion Transformer module, outputs local and global aesthetic feature vectors, and finally scores each local region through a linear regression layer and takes a weighted average of the scores.
It significantly improves the accuracy of aesthetic evaluation of art materials, enabling comprehensive analysis of multi-dimensional features such as composition, color, and content, providing local aesthetic scores, and enhancing the transparency of the model and the interpretability of the results.
Figure CN120913036B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of aesthetic evaluation of art materials, and in particular relates to a method and system for aesthetic evaluation of art materials.
Background Technology
[0002] With the rapid development of computer vision and artificial intelligence technologies, computational aesthetics—a field that demands significant attention and presents numerous challenges—has become a key area of research. It involves enabling computer models to simulate human aesthetic abilities and automate the aesthetic evaluation of images and videos. Current computational models for image aesthetics generally rely on extracting multi-layered visual features (such as color, composition, and semantic information) from images and employing data-driven methods like deep learning for aesthetic scoring. While these methods perform well when processing natural or photographic images, they often struggle to accurately grasp the complex artistic language and aesthetic intentions of highly stylized images (such as Van Gogh's *Sunflowers* and Piet Mondrian's *Composition with Red, Yellow, and Blue*), leading to significant discrepancies between the evaluation results and human subjective aesthetic judgment. From a psychological perspective, image aesthetics is a highly subjective and dynamic perceptual process involving multiple dimensions such as emotions, experiences, interests, and attention. The visual attention mechanism is a crucial bridge connecting external image features with internal subjective feelings; it determines which areas a person first focuses on when viewing an image and how information is processed in the brain, thus profoundly influencing the final aesthetic judgment. Therefore, visual attention is a core manifestation of the aesthetic psychological mechanism, revealing the perceptual path, focus of interest, and aesthetic preferences when viewing images, and largely determining how image content is understood, perceived, and evaluated. Current models have failed to effectively model this mechanism, becoming a common key problem hindering image aesthetic computation from further aligning with human subjective feelings.
[0003] Existing computational models of image aesthetics include:
- Feature-based aesthetic evaluation methods. Early research generally relied on manually designed image features, such as brightness, contrast, color composition, compositional balance, and texture distribution, and scored image aesthetics through statistical analysis or traditional machine learning models (such as support vector machines and logistic regression).
- Convolutional neural network-based aesthetic evaluation methods. Deep learning methods such as convolutional neural networks (CNNs) are widely used in image aesthetics evaluation. Representative models, such as NIMA (Neural Image Assessment) and A-Lamp, typically predict image aesthetic scores or levels through regression or classification.
- Contrastive learning and multi-task learning methods. Some studies attempt to model the subjective aesthetic judgment process using contrastive learning, ranking learning, or by fusing subjective labels (such as preference levels). For example, image pairs (A is more beautiful than B) are used to train aesthetic ranking models, and multi-task networks simultaneously predict aesthetic scores and image content labels.
- Aesthetic evaluation methods incorporating visual attention. Some studies simulate human visual attention mechanisms, extracting saliency maps or attention heatmaps from images and combining them with features of salient regions for weighted aesthetic scoring. These methods typically use saliency detection models or attention mechanisms (such as self-attention or the Vision Transformer) to capture the local regions of an image that matter most for aesthetic judgment, enhancing the understanding of local details and overall structure and thereby improving the accuracy and interpretability of aesthetic evaluation.
[0004] Existing technologies have several drawbacks:
- Image feature-based aesthetic evaluation methods rely heavily on the quality of feature selection and extraction, making them ill-suited to diverse art styles. When faced with stylized artistic images, especially paintings with strong stylistic expression (such as Van Gogh's *Sunflowers* and Mondrian's *Composition with Red, Yellow, and Blue*), they often fail to accurately reflect aesthetic value and artistic characteristics.
- Convolutional neural network-based aesthetic scoring methods rely on data fitting and lack in-depth simulation of the human subjective aesthetic cognitive process; in particular, they fail to accurately reflect the viewer's visual attention and scanning behavior. Furthermore, these methods typically provide only an overall aesthetic score, neglecting the aesthetic differences in local image details, which results in a lack of fine-grained interpretability.
- Contrastive learning and multi-task learning methods do not consider human perception mechanisms. They model aesthetic behavior only at the level of user profiles and rating preferences, ignoring psychological characteristics such as the actual perceptual path and attention distribution during image viewing. This leads to poor stability and limited generalization ability when dealing with art materials of varying styles.
- Aesthetic evaluation methods that incorporate visual attention often focus only on salient areas and lack a comprehensive analysis of multi-dimensional aesthetic features such as composition, color, and content. This makes it difficult for the evaluation results to fully reflect the overall artistic beauty of the image.
Summary of the Invention
[0005] To address the aforementioned technical problems, this invention proposes a method and system for aesthetic evaluation of artistic materials, which effectively enhances the expressive power and evaluation accuracy of aesthetic features.
[0006] To achieve the above objectives, this invention provides a method for aesthetic evaluation of artistic materials, including:
[0007] The input image is fed into a shared visual feature extraction module, which is based on the ConvNeXt architecture and consists of multiple stacked ConvNeXt blocks with an inverted bottleneck structure, and outputs a multi-scale feature map.
[0008] The feature map is input to the visual attention module, which includes a visual system processing unit, a visual working memory processing unit, and a priority map generation unit, to generate a time-series suppression label map and output the probability distribution of the next gaze point.
[0009] The visual feature map output by the visual attention module is divided into 9 equal parts of local region features;
[0010] The feature maps at different levels are input into multiple parallel expert analysis modules, including composition and layout expert module, color and lighting expert module, image quality expert module and content subject expert module, respectively, to extract feature vectors of the corresponding aesthetic dimensions;
[0011] The aesthetic feature vectors output by the expert analysis modules are taken together as an input sequence and fed into the fusion Transformer module along with the nine equal local visual features segmented by the visual attention module. The fusion Transformer module performs multi-layer encoding on the input sequence and outputs a comprehensive aesthetic feature vector for each of the nine local regions.
[0012] The nine local comprehensive aesthetic feature vectors output by the Transformer module are input into the scoring module, and the nine local aesthetic scores are output by the linear regression layer respectively.
[0013] The final overall aesthetic score of the image is obtained by weighting and averaging the nine local aesthetic scores.
[0014] On the other hand, to achieve the above objectives, the present invention also provides an aesthetic evaluation system for artistic materials, comprising:
[0015] A shared visual feature extraction module is used to receive input images and extract multi-scale feature maps based on the ConvNeXt architecture;
[0016] The visual attention module generates a time-series suppression label map and outputs the probability distribution of the next fixation point, while dividing the visual feature map into nine equal parts of local region features.
[0017] The expert analysis module is used to extract feature vectors corresponding to the aesthetic dimensions from the multi-scale feature maps;
[0018] The fusion Transformer module performs multi-layer encoding on the input sequence formed by the nine equal local visual features and the expert feature vectors, and outputs a comprehensive aesthetic feature vector for each of the nine local regions.
[0019] The scoring module is used to output nine local aesthetic scores through a linear regression layer and perform a weighted average to obtain the final overall aesthetic score.
[0020] Technical advantages of this invention: This invention discloses a method and system for aesthetic evaluation of artistic materials. It simulates the human visual attention mechanism, integrates multi-dimensional design rules, and significantly improves the accuracy of aesthetic evaluation of artistic materials. It is robust to artistic images with complex compositions and diverse styles. The model architecture is efficient, with low computational overhead in the feature extraction and fusion modules, making it suitable for efficient image aesthetic evaluation. It supports the output of multi-dimensional interpretable aesthetic indicators based on composition, color, content, and image quality, improving model transparency and result understandability. It supports local aesthetic score output, helping users understand specific "good-looking" or "bad-looking" locations in the image, providing clear guidance for subsequent aesthetic optimization and editing.
Attached Figure Description
[0021] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:
[0022] Figure 1 is a flowchart of an aesthetic evaluation method for artistic materials according to an embodiment of the present invention;
[0023] Figure 2 is a schematic diagram of the structure of an aesthetic evaluation system for artistic materials according to an embodiment of the present invention.
Detailed Implementation
[0024] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.
[0025] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.
[0026] As shown in Figure 1, this embodiment provides a method for aesthetic evaluation of artistic materials, including:
[0027] The input image is fed into a shared visual feature extraction module, which is based on the ConvNeXt architecture and consists of multiple stacked ConvNeXt blocks with an inverted bottleneck structure, and outputs a multi-scale feature map.
[0028] The feature map is input to the visual attention module, which includes a visual system processing unit, a visual working memory processing unit, and a priority map generation unit, to generate a time-series suppression label map and output the probability distribution of the next gaze point.
[0029] The visual feature map output by the visual attention module is divided into 9 equal parts of local region features;
[0030] The feature maps at different levels are input into multiple parallel expert analysis modules, including composition and layout expert module, color and lighting expert module, image quality expert module and content subject expert module, respectively, to extract feature vectors of the corresponding aesthetic dimensions;
[0031] The aesthetic feature vectors output by the expert analysis modules are taken together as an input sequence and fed into the fusion Transformer module along with the nine equal local visual features segmented by the visual attention module. The fusion Transformer module performs multi-layer encoding on the input sequence and outputs a comprehensive aesthetic feature vector for each of the nine local regions.
[0032] The nine local comprehensive aesthetic feature vectors output by the Transformer module are input into the scoring module, and the nine local aesthetic scores are output by the linear regression layer respectively.
[0033] The final overall aesthetic score of the image is obtained by weighting and averaging the nine local aesthetic scores.
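The final aggregation step can be illustrated with a minimal sketch. The source does not say how the nine weights are chosen, so the uniform default and the center-heavy example below are assumptions, and the function name `overall_aesthetic_score` is hypothetical.

```python
def overall_aesthetic_score(local_scores, weights=None):
    """Weighted average of nine local aesthetic scores (row-major 3x3 order)."""
    if len(local_scores) != 9:
        raise ValueError("expected nine local scores, one per region")
    if weights is None:
        weights = [1.0] * 9  # assumption: uniform weights when none are given
    if len(weights) != 9 or sum(weights) <= 0:
        raise ValueError("expected nine weights with a positive sum")
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, local_scores)) / total


# Hypothetical local scores and a weighting that doubles the central region.
scores = [6.0, 7.0, 6.5, 7.5, 9.0, 7.0, 5.5, 6.0, 6.5]
center_heavy = [1, 1, 1, 1, 2, 1, 1, 1, 1]
print(overall_aesthetic_score(scores, center_heavy))
```

In practice the weights might come from the predicted fixation distribution, so that regions viewers look at more contribute more; that choice is likewise an assumption here.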
[0034] Furthermore, the shared visual feature extraction module includes four stages, each of which is implemented by stacking multiple ConvNeXt blocks;
[0035] The ConvNeXt block includes a 7×7 depthwise convolutional layer, layer normalization, a 1×1 pointwise convolutional layer, a GELU activation function, and a residual connection structure.
[0036] Furthermore, in the visual attention module, the visual working memory processing unit locally suppresses the previous gaze position through a spatial mask;
[0037] The spatial mask is applied to the feature map in the form of a Gaussian function, and a time-related attention feature map is generated through a convLSTM unit.
[0038] Furthermore, the composition layout expert module divides the high-level semantic feature map into 3×3 regions; each region extracts a local structural representation through region average pooling and feeds it into a multilayer perceptron with shared parameters to output a composition feature vector.
[0039] Furthermore, the color and lighting expert module divides the input image into 3×3 regions;
[0040] After color space conversion for each region, a statistical histogram is calculated; the histograms are then concatenated to form a color distribution vector, which is input into a shallow MLP.
[0041] Furthermore, the fusion Transformer module includes a four-layer Transformer encoder;
[0042] Each encoder layer contains an 8-head multi-head self-attention mechanism and a feedforward network, with an input sequence dimension of 512.
[0043] Furthermore, the method employs a multi-task joint training strategy; the joint optimization includes loss functions for aesthetic scoring, visual attention path prediction, composition classification, color regression, image quality regression, and content classification.
[0044] Specifically, the core function of the shared visual feature extraction module is to transform the raw input image pixels into a feature representation with rich hierarchical information that subsequent analysis modules can process. This module receives a preprocessed image tensor as input and consists of four consecutive, hierarchical stages, each implemented by stacking multiple ConvNeXt blocks as the basic unit. Each ConvNeXt block employs an inverted bottleneck structure and contains a main branch and a shortcut connection for residual learning. In the main branch, the feature map first passes through a depthwise convolutional layer with a large kernel (e.g., 7×7), designed to expand the model's receptive field and capture richer contextual information. This is followed by a layer normalization layer to stabilize the training process. Next, a 1×1 pointwise convolutional layer expands the feature channel dimension (typically by a factor of 4), and the GELU activation function applies a non-linear transformation, forming a wide intermediate layer that learns richer features. Finally, another 1×1 pointwise convolutional layer compresses the channel dimension back to its original size, completing the inverted bottleneck. Meanwhile, the shortcut connection fuses the original input features of the block with the output of the main branch by element-wise addition. This residual learning mechanism makes it possible to train very deep networks.
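The block structure described above can be sketched in NumPy as follows. This is an illustrative forward pass only, under simplifying assumptions that are not in the source: biases, the learnable scale and shift of layer normalization, and any layer-scale or stochastic-depth terms are omitted, GELU uses the common tanh approximation, and all function and parameter names are hypothetical.

```python
import numpy as np


def depthwise_conv7x7(x, kernels):
    """Depthwise 7x7 convolution: one kernel per channel, zero padding of 3."""
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (3, 3), (3, 3)))
    out = np.zeros_like(x)
    for dy in range(7):
        for dx in range(7):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + height, dx:dx + width]
    return out


def layer_norm_channels(x, eps=1e-6):
    """Normalize each spatial position across the channel axis."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def convnext_block(x, dw_kernels, w_expand, w_project):
    """x: (C, H, W); w_expand: (4C, C); w_project: (C, 4C)."""
    y = depthwise_conv7x7(x, dw_kernels)          # large-kernel spatial mixing
    y = layer_norm_channels(y)
    y = np.einsum("ec,chw->ehw", w_expand, y)     # 1x1 conv, C -> 4C
    y = gelu(y)
    y = np.einsum("ce,ehw->chw", w_project, y)    # 1x1 conv, 4C -> C
    return x + y                                  # shortcut (residual) connection


rng = np.random.default_rng(0)
channels = 8
x = rng.standard_normal((channels, 16, 16))
out = convnext_block(
    x,
    dw_kernels=rng.standard_normal((channels, 7, 7)) * 0.1,
    w_expand=rng.standard_normal((4 * channels, channels)) * 0.1,
    w_project=rng.standard_normal((channels, 4 * channels)) * 0.1,
)
print(out.shape)
```

Because the block preserves the input shape, any number of such blocks can be stacked within a stage; the resolution changes only between stages.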
[0045] By stacking such ConvNeXt blocks in four stages, the module systematically increases its channel dimension while progressively reducing the spatial resolution of the feature maps. Ultimately, the module outputs a set of multi-scale feature maps. Among them, the low-level feature maps with higher resolution retain rich details and textures and are supplied primarily to the image quality expert module; the feature maps with medium to low resolution combine some local structures and contours; the feature maps with medium resolution and strong semantics are supplied mainly to the content subject expert module; and the low-resolution high-level feature map contains the richest global semantic information and is supplied to the composition layout expert module and the core fusion Transformer module.
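To make the multi-scale output concrete, the sketch below computes the feature map shape at each of the four stages. The source gives no channel widths or strides; the values used here (a stride-4 stem, 2× downsampling between stages, and widths 96/192/384/768 as in the publicly described ConvNeXt-T configuration) are assumptions for illustration only.

```python
def stage_shapes(height, width, channels=(96, 192, 384, 768)):
    """Return (channels, height, width) of the feature map after each stage."""
    shapes = []
    stride = 4  # assumption: the stem downsamples by 4
    for stage_channels in channels:
        shapes.append((stage_channels, height // stride, width // stride))
        stride *= 2  # assumption: each later stage halves the resolution
    return shapes


for index, shape in enumerate(stage_shapes(224, 224), start=1):
    print(f"stage {index}: {shape}")
```

Under these assumptions a 224×224 input yields a 56×56 map at the first stage, suited to texture-level quality analysis, and a 7×7 map at the last stage, which carries the global semantics used for composition.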
[0046] The core function of the visual attention module is to simulate the human visual perception mechanism, generating a visual attention scanning path for the input image to be evaluated. The visual attention module comprises three steps: visual system processing, visual working memory processing, and priority map generation. The visual system encodes scene information (such as contrast, color, orientation, and brightness). Visual working memory processing adds selected attentional targets; due to the limited processing capacity of the brain, only one or a few targets can be noticed at a time. At this point, selective attention determines which regions in the image can pass through the processing bottleneck. Priority map generation generates a gaze distribution map, guided by bottom-up saliency based on scene information, the perceptual value of previously focused regions, and gaze history; all these guiding factors are combined in a weighted manner to generate a dynamic priority map. Visual system processing uses a modified VGG-16 model trained on the ImageNet dataset to generate a feature representation F0 with an 8x downsampling ratio. To maintain appropriate spatial resolution, an upsampling layer is added to generate the feature map F1.
[0047] Visual working memory processing uses a convolutional LSTM (convLSTM) to simulate the human inhibition of return (IOR) mechanism. IOR lets humans suppress the processing of targets they have already visited, which allows rapid analysis of the environment. Unlike other convLSTM models, ScanpathNet uses the convLSTM to learn the spatiotemporal dependencies of gaze sequences rather than to refine the saliency map. Furthermore, it does not compute the inhibition map directly but learns the IOR mechanism implicitly by suppressing previously gazed feature locations. The suppression is applied by a spatial mask placed before the convLSTM layer and built from the previously viewed position. The spatial mask is a function centered on the previous fixation point, typically a Gaussian mask or another form of suppression mask. It suppresses the region surrounding the previous fixation point (e.g., by setting the values in that region close to 0 or reducing their weights), thereby reducing the influence of previously fixated regions on the next decision. The specific formula is as follows:
[0048] $\tilde{F}_t = F_1 \odot M_{t-1}$;
[0049] where $M_{t-1}$ is the spatial mask built from the previous fixation point, $\tilde{F}_t$ is the masked feature map passed to the convLSTM, and $\odot$ represents element-wise multiplication, which suppresses previous gaze locations in the feature map.
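The element-wise suppression described above can be sketched as follows. The mask form (one minus a Gaussian centered on the previous fixation) and the `sigma` value are assumptions; the source only says the mask is typically Gaussian and drives values near the fixation toward 0. The convLSTM itself is not shown.

```python
import numpy as np


def inhibition_mask(height, width, fixation, sigma):
    """Mask close to 0 around the previous fixation and close to 1 far from it."""
    ys, xs = np.mgrid[0:height, 0:width]
    fy, fx = fixation
    gauss = np.exp(-((ys - fy) ** 2 + (xs - fx) ** 2) / (2.0 * sigma ** 2))
    return 1.0 - gauss


def suppress_previous_fixation(features, fixation, sigma=2.0):
    """features: (C, H, W). Multiplies every channel by the same spatial mask."""
    _, height, width = features.shape
    mask = inhibition_mask(height, width, fixation, sigma)
    return features * mask[None, :, :]


features = np.ones((3, 32, 32))
masked = suppress_previous_fixation(features, fixation=(5, 5), sigma=2.0)
print(masked[0, 5, 5], masked[0, 31, 31])
```

Applying the mask before the recurrent unit means the network never has to output an explicit inhibition map; it simply sees less evidence at locations that were already fixated.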
[0050] Priority map generation uses a Mixture Density Network (MDN) to explicitly model the stochasticity of human attention. The MDN generates the probability density of the next fixation point using K Gaussian distributions. The MDN takes a flattened hidden representation $h_t$ from the IOR map as input and outputs a parameterized Gaussian mixture model. The model is composed of the means $\mu_k$ of the Gaussian components, the standard deviations $\sigma_k$, the correlation coefficients $\rho_k$, and the mixture weights $\pi_k$. Mathematically, this can be represented as:
[0051] $\{\mu_k, \sigma_k, \rho_k, \pi_k\}_{k=1}^{K} = \mathrm{MDN}(h_t)$;
[0052] To obtain a valid probability distribution, the parameters of the MDN are constrained and normalized. The probability distribution of the next gaze position $x_{t+1}$ can be represented as:
[0053] $p(x_{t+1}) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_{t+1} \mid \mu_k, \sigma_k, \rho_k)$;
[0054] where $\mathcal{N}$ is a bivariate normal distribution.
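A minimal NumPy sketch of the mixture model follows: it constrains raw network outputs, evaluates the mixture density at a point, and samples a next fixation. The specific constraints (softmax for weights, exponential for standard deviations, tanh for correlations) are a common MDN convention and an assumption here, since the source only says the parameters are "constrained and normalized". All names are hypothetical.

```python
import numpy as np


def constrain(raw_weights, raw_stds, raw_rhos):
    """Map unconstrained outputs to valid mixture parameters (assumed convention)."""
    shifted = np.exp(raw_weights - np.max(raw_weights))
    weights = shifted / shifted.sum()          # softmax: positive, sums to 1
    return weights, np.exp(raw_stds), np.tanh(raw_rhos)


def bivariate_normal_pdf(point, mean, std, rho):
    """Density of a 2D Gaussian with per-axis std and correlation rho."""
    dx = (point[0] - mean[0]) / std[0]
    dy = (point[1] - mean[1]) / std[1]
    z = dx * dx - 2.0 * rho * dx * dy + dy * dy
    norm = 2.0 * np.pi * std[0] * std[1] * np.sqrt(1.0 - rho ** 2)
    return np.exp(-z / (2.0 * (1.0 - rho ** 2))) / norm


def mixture_pdf(point, weights, means, stds, rhos):
    """Weighted sum of the K component densities."""
    return sum(w * bivariate_normal_pdf(point, m, s, r)
               for w, m, s, r in zip(weights, means, stds, rhos))


def sample_fixation(rng, weights, means, stds, rhos):
    """Pick a component by its weight, then draw a point from that Gaussian."""
    k = rng.choice(len(weights), p=weights)
    sx, sy = stds[k]
    cov = np.array([[sx * sx, rhos[k] * sx * sy],
                    [rhos[k] * sx * sy, sy * sy]])
    return rng.multivariate_normal(means[k], cov)


rng = np.random.default_rng(0)
weights, stds, rhos = constrain(np.array([0.2, -0.1]),
                                np.array([[0.0, 0.0], [0.5, 0.5]]),
                                np.array([0.0, 0.3]))
means = np.array([[0.3, 0.3], [0.7, 0.6]])
next_fixation = sample_fixation(rng, weights, means, stds, rhos)
print(next_fixation, mixture_pdf(next_fixation, weights, means, stds, rhos))
```

Sampling rather than taking the density maximum is what lets the model reproduce the variability between human observers.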
[0055] To generate an initial priority map, the model extracts rich representations from the image and applies a Gaussian blur at the center of the feature space. To generate the location of the next gaze point, the model randomly samples from the output probability distribution. For the following gaze point, the model takes the feature map and masks it at the sampled location. Finally, the feature map is evenly divided into 9 regions of equal size (for example, a 3×3 grid by rows and columns: top left, top middle, top right, middle left, ..., bottom right), and each region corresponds to a local visual area of the input image.
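The nine-way split at the end of this step can be sketched as below. The sketch assumes the feature map height and width are divisible by 3 so that all regions have exactly equal size; the function name is hypothetical.

```python
import numpy as np


def split_into_nine(feature_map):
    """Split a (C, H, W) map into nine equal regions, in row-major order."""
    _, height, width = feature_map.shape
    if height % 3 or width % 3:
        raise ValueError("height and width must be divisible by 3")
    rh, rw = height // 3, width // 3
    return [feature_map[:, row * rh:(row + 1) * rh, col * rw:(col + 1) * rw]
            for row in range(3) for col in range(3)]


feature_map = np.arange(2 * 6 * 9, dtype=float).reshape(2, 6, 9)
regions = split_into_nine(feature_map)
print(len(regions), regions[0].shape)
```

Index 0 is the top-left region, index 4 the center, and index 8 the bottom-right, matching the row-and-column ordering in the text.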
[0056] Multiple parallel expert analysis modules are used to independently and meticulously deconstruct the input image from different aesthetic dimensions, thereby providing multi-faceted structured aesthetic information. This part consists of at least four independently designed sub-networks, which model key dimensions such as composition layout, color and lighting, image quality, and subject matter. These sub-modules all adopt lightweight, specialized neural network structures, do not share weights with each other, and can run efficiently under a parallel computing framework.
[0057] The composition layout expert receives the highest semantic layer output (i.e., the high-level feature map, representing the global structural information of the entire image) from the shared visual feature extraction module. This feature map preserves spatial distribution information. To achieve local composition awareness, this module does not perform global average pooling but instead divides the high-level feature map into nine spatial regions arranged in a 3×3 grid. Each region extracts its local structural representation through region average pooling and feeds it into a multilayer perceptron (MLP, containing one or two fully connected layers) with shared parameters, which outputs the corresponding composition feature vector. Finally, this module outputs nine local composition feature vectors, which can be used to measure how well each region of the image matches composition rules such as the rule of thirds, symmetry, and visual balance.
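Region average pooling followed by a shared MLP can be sketched as follows. The two-layer MLP with ReLU, the layer sizes, and the random weights are assumptions for illustration; `np.array_split` is used so that maps whose side is not divisible by 3 (for example 7×7) still split into a 3×3 grid of near-equal regions.

```python
import numpy as np


def region_average_pool(feature_map):
    """(C, H, W) -> (9, C): mean of each cell of a 3x3 grid, row-major order."""
    _, height, width = feature_map.shape
    row_groups = np.array_split(np.arange(height), 3)
    col_groups = np.array_split(np.arange(width), 3)
    pooled = [feature_map[:, r[0]:r[-1] + 1, c[0]:c[-1] + 1].mean(axis=(1, 2))
              for r in row_groups for c in col_groups]
    return np.stack(pooled)


def shared_mlp(x, w1, b1, w2, b2):
    """Same two-layer MLP applied to every region vector (rows of x)."""
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU is an assumption
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
high_level = rng.standard_normal((16, 7, 7))       # hypothetical 16-channel map
pooled = region_average_pool(high_level)
composition_vectors = shared_mlp(
    pooled,
    rng.standard_normal((16, 32)) * 0.1, np.zeros(32),
    rng.standard_normal((32, 8)) * 0.1, np.zeros(8),
)
print(pooled.shape, composition_vectors.shape)
```

Sharing the MLP parameters across regions keeps the expert lightweight and makes the nine output vectors directly comparable.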
[0058] The color and lighting expert takes the original input image as input and first divides the image into nine regions arranged in a 3×3 grid. Each region undergoes color space conversion (e.g., RGB to HSV or Lab), and statistical histograms are calculated separately for channels such as hue, saturation, and brightness. The channel histograms of each region are concatenated to form a fixed-length color distribution vector, which is then input into a shallow MLP to extract perceptual color features (such as dominant hue, warm / cool tendency, contrast, and harmony). Finally, the module outputs nine local color feature vectors, each corresponding to one of the nine equal regions of the image, used to model the color aesthetics of each region.
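The histogram stage of this expert can be sketched with the standard-library `colorsys` module and NumPy. The choice of HSV, 8 bins per channel, and normalization by pixel count are assumptions; the shallow MLP that follows is not shown. The per-pixel loop is written for clarity and would be slow on large images.

```python
import colorsys

import numpy as np


def region_hsv_histogram(region, bins=8):
    """region: (H, W, 3) RGB floats in [0, 1] -> concatenated H, S, V histograms."""
    pixels = region.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(*pixel) for pixel in pixels])
    histograms = [np.histogram(hsv[:, channel], bins=bins, range=(0.0, 1.0))[0]
                  for channel in range(3)]
    return np.concatenate(histograms).astype(float) / len(pixels)


def color_distribution_vectors(image, bins=8):
    """(H, W, 3) image -> (9, 3 * bins) array, one row per cell of a 3x3 grid."""
    height, width, _ = image.shape
    row_groups = np.array_split(np.arange(height), 3)
    col_groups = np.array_split(np.arange(width), 3)
    return np.stack([
        region_hsv_histogram(image[r[0]:r[-1] + 1, c[0]:c[-1] + 1], bins)
        for r in row_groups for c in col_groups
    ])


image = np.zeros((6, 6, 3))
image[..., 0] = 1.0                      # a pure red test image
vectors = color_distribution_vectors(image)
print(vectors.shape)
```

Each row has a fixed length regardless of region size, which is what allows it to be fed to an MLP with a fixed input dimension.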
[0059] The image quality expert evaluates image sharpness, noise, and compression artifacts at the technical level. This module takes low-level, high-resolution feature maps from the backbone network (such as the outputs of the first few layers of ResNet) as input, preserving rich local texture and edge information. The feature maps are first divided into nine sub-regions arranged in a 3×3 grid, and each region undergoes feature extraction via a convolutional encoding network. Specifically, each sub-region is fed into three 2D convolutional layers with shared parameters (3×3 kernels; 32, 64, and 128 output channels; stride 1), with Batch Normalization and ReLU applied in between. Finally, each region's feature map is compressed using 2×2 max pooling and flattened into a one-dimensional vector, which is then fed into two MLP layers to generate a region-level image quality feature vector (e.g., 64-dimensional). The final output consists of nine local image quality vectors, describing the degree of blur, compression artifacts, and texture integrity in each region.
[0060] The content subject expert is responsible for analyzing the semantic category and subject structure of the image. The input is a mid-level semantic feature map from the backbone network (such as the output of a ResNet intermediate layer), which contains object-level semantic information. This feature map is first divided into nine regions arranged in a 3×3 grid. Each region is compressed into a channel-level feature vector through average pooling and then fed into a multilayer perceptron (MLP) with shared parameters to transform it into a region-level semantic representation vector. This feature vector can be used to determine whether each region contains semantic subjects such as people, animals, buildings, or natural landscapes, and can also estimate the proportion, number, and distribution of subjects within the region. Finally, this module outputs nine region-level content semantic vectors, representing the subject perception information of each region.
[0061] The fusion Transformer module is the core component for multi-dimensional aesthetic information fusion and final scoring decisions. Its structure is based on the standard Transformer encoder architecture. The input to this module is a set of feature vectors output by the other sub-modules, forming a unified sequence of "expert opinions." This includes the global visual vector extracted from the visual attention module, as well as the feature representations output by the expert modules in four dimensions: composition, color, image quality, and image content. To aggregate global information, a learnable classification token vector ([CLS] token) is prepended to this sequence. Therefore, the input sequence length of the Transformer is fixed at 6.
[0062] Before constructing the input, the deep semantic feature map output by the backbone network is first weighted pixel-by-pixel using the visual attention scanning path output by the visual attention module. Then, a 512-dimensional attention-weighted visual feature vector is extracted through global average pooling. Each expert module also outputs a 512-dimensional feature vector to ensure dimensionality consistency of the sequence.
[0063] The sequence is then fed into the fusion Transformer encoder. The encoder consists of four stacked Transformer encoding layers (L=4), each containing an 8-head multi-head attention mechanism (h=8). Attention is computed using scaled dot products, with each head having a dimension of 64 (512 / 8). Each layer also includes a feed-forward network (FFN) with a hidden dimension of 2048 and an output dimension of 512, using ReLU as the activation function. Each layer is equipped with Layer Normalization and residual connections to enhance training stability and modeling depth. Positional encoding employs learnable absolute position embeddings to ensure that the order of the expert vectors in the sequence is explicitly represented.
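The attention computation inside one encoder layer can be sketched in NumPy with the stated sizes (sequence length 6, model dimension 512, 8 heads of dimension 64). This covers only the multi-head self-attention sub-layer; the feed-forward network, layer normalization, residual connections, and position embeddings are omitted, and the random weights are placeholders rather than trained parameters.

```python
import numpy as np


def multi_head_self_attention(x, wq, wk, wv, wo, num_heads=8):
    """x: (seq_len, d_model). Returns the output and per-head attention weights."""
    seq_len, d_model = x.shape
    head_dim = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, head_dim)
        return t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split_heads(x @ wq), split_heads(x @ wk), split_heads(x @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # scaled dot product
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    heads = weights @ v                                      # (heads, seq, head_dim)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ wo, weights


rng = np.random.default_rng(0)
d_model, seq_len = 512, 6            # [CLS] + visual vector + four expert vectors
sequence = rng.standard_normal((seq_len, d_model))
scale = 1.0 / np.sqrt(d_model)
wq, wk, wv, wo = (rng.standard_normal((d_model, d_model)) * scale for _ in range(4))
attended, attention = multi_head_self_attention(sequence, wq, wk, wv, wo)
print(attended.shape, attention.shape)
```

Row 0 of each head's attention matrix shows how strongly the [CLS] position draws on each expert vector, which is one way the relative weight of the aesthetic dimensions could be inspected.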
[0064] In the sequence output by the Transformer, the final output vector corresponding to the first [CLS] marker is considered the fused comprehensive aesthetic representation. This 512-dimensional feature vector is fed into a fully connected linear regression layer (output dim=1) to output the final image aesthetic score. While maintaining sequence simplicity, the overall architecture leverages the Transformer's powerful feature modeling capabilities to learn the contextual dependencies, complementary relationships, and relative weights between various aesthetic dimensions, significantly improving the accuracy and discriminative power of score prediction.
[0065] The scoring module, located at the output of the entire system, is the unit that maps the comprehensive aesthetic feature vector obtained from the Transformer module to the final aesthetic score. This module employs a simple and efficient linear regression head, whose input is the high-dimensional feature vector corresponding to the [CLS] position in the Transformer encoder output sequence. This [CLS] vector has 512 dimensions and contains a deep fusion representation of information from the expert modules (composition, color, quality, content) and visual attention, representing the overall aesthetic perception of the image. Internally, the scoring module consists of a single fully connected layer with an input dimension of 512 and an output dimension of 1; the output represents the aesthetic score of the artistic material.
[0066] To ensure the accurate aesthetic evaluation of artistic materials described in this invention, a multi-task joint training method is required. This method aims to improve the model's understanding and generalization ability of the main task (aesthetic scoring) by simultaneously optimizing multiple related sub-tasks. The detailed training steps are as follows:
[0067] The first step is the construction and preparation of the dataset. A multi-attribute training dataset is constructed. Each image sample in the dataset must contain the following types of annotation information: a global aesthetic score (main task label), a continuous numerical value, such as 1 to 10, representing the overall aesthetic quality of the image; a visual attention scan path (attention task label), a grayscale image of the same size as the original image, generated from real human eye-tracking data; and expert task labels (auxiliary task labels): composition label (the composition category to which the image belongs, including "rule of thirds," "central symmetry," etc.), color label (a numerical value or category describing color harmony, color temperature, etc.), quality label (a numerical value describing image sharpness, noise level, etc.), and content label (the content category to which the artistic material belongs, including "clothing texture," "drawing," etc.). Since a single existing dataset usually cannot contain all of the above labels, this multi-attribute dataset needs to be constructed by fusing multiple existing public datasets (such as using aesthetic scores from the AVA dataset, or visual attention scan path annotations from the SALICON dataset), or through manual annotation.
[0068] The second step is model initialization. To accelerate model convergence and improve performance, pre-trained weights are used for initialization: the shared visual feature extraction module (ConvNeXt) is loaded with its pre-trained weights on a large image classification dataset (such as ImageNet-22K); the visual attention module is loaded with model weights pre-trained on a large visual attention scan path prediction dataset; and other modules, including the newly added network layers in the expert analysis module, the fusion Transformer module, and the scoring module, use standard random initialization methods (such as Kaiming initialization or Xavier initialization).
[0069] The third step is to design the joint loss function. A joint loss function L_total is designed as the overall objective of model optimization. This function is a weighted sum of the loss functions of all subtasks:
[0070] $L_{total} = \lambda_1 L_{aes} + \lambda_2 L_{att} + \lambda_3 L_{comp} + \lambda_4 L_{cont} + \lambda_5 L_{color} + \lambda_6 L_{qual}$;
[0071] where $L_{aes}$ is the aesthetic scoring loss of the main task, which uses L1 loss (MAE loss) to calculate the difference between the model's predicted score and the true score; $L_{att}$ is the visual attention scanning path prediction loss, which uses binary cross-entropy loss to calculate the pixel-level difference between the predicted visual attention scanning path and the ground-truth image; $L_{comp}$ and $L_{cont}$ are the classification losses for composition and content, respectively, calculated using cross-entropy loss; $L_{color}$ and $L_{qual}$ are the regression losses for color and quality, respectively, calculated using mean squared error loss or L1 loss; and $\lambda_1, \lambda_2, \ldots, \lambda_6$ are a set of preset hyperparameters used to balance the contribution of the different tasks to the total loss, so that the main task is fully optimized while the auxiliary tasks still provide effective regularization and guidance.
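The weighted sum of sub-task losses described above can be sketched in plain Python as follows. The task keys ("aesthetic", "attention", and so on) and the weight values are hypothetical placeholders; the patent only states that the weights are preset hyperparameters.

```python
import math


def l1_loss(pred, target):
    """Absolute error, as used for the main aesthetic scoring task."""
    return abs(pred - target)


def mse_loss(pred, target):
    """Squared error, one option named for the color and quality regressions."""
    return (pred - target) ** 2


def bce_loss(pred_map, target_map, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between two flat probability maps."""
    total = 0.0
    for p, t in zip(pred_map, target_map):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred_map)


def cross_entropy(logits, target_index):
    """Cross-entropy of one sample from raw class logits (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def joint_loss(losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())


losses = {
    "aesthetic": l1_loss(6.8, 7.5),
    "attention": bce_loss([0.9, 0.2], [1.0, 0.0]),
    "composition": cross_entropy([2.0, 0.5, 0.1], 0),
    "content": cross_entropy([0.3, 1.2], 1),
    "color": mse_loss(0.6, 0.8),
    "quality": mse_loss(0.4, 0.5),
}
weights = {name: 1.0 for name in losses}  # placeholder weights
weights["attention"] = 0.5
print(joint_loss(losses, weights))
```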
[0072] The fourth step is iterative optimization training. A mini-batch stochastic gradient descent strategy is used for end-to-end training of the model. First, a mini-batch of images and their corresponding multi-attribute labels is randomly selected from the prepared dataset. This batch of images is input into MAP-Net, and a forward pass is performed to obtain the model's prediction outputs for all tasks (including the final aesthetic score, the visual attention scanning path, and the feature/classification results for each expert dimension). Then, the joint loss function L_total defined above is used to calculate the total loss between all predicted outputs of the model and the true labels. One backpropagation pass is performed to calculate the gradients of all trainable parameters in the model based on the total loss, and the SGD with momentum optimizer updates the model weights based on these gradients. Finally, the above steps are repeated until the model's performance on the validation set converges or the preset number of training epochs is reached. During training, strategies such as learning rate warm-up and cosine annealing are used to dynamically adjust the learning rate to obtain better training results.
[0073] As shown in Figure 2, this embodiment also provides an art material aesthetic evaluation system, including:
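To illustrate the learning rate warm-up, cosine annealing, and momentum update mentioned above, here is a minimal sketch. The linear warm-up shape, the default values, and the function names are assumptions; the patent does not specify the schedule parameters.

```python
import math


def learning_rate(step, total_steps, base_lr=0.01, warmup_steps=100, min_lr=0.0):
    """Linear warm-up followed by cosine annealing (illustrative defaults)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update: v = momentum * v + g, then w = w - lr * v."""
    new_velocity = [momentum * v + g for v, g in zip(velocity, grads)]
    new_weights = [w - lr * v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


weights, velocity = [1.0, -2.0], [0.0, 0.0]
for step in range(3):
    lr = learning_rate(step, total_steps=1000)
    grads = [2.0 * w for w in weights]  # gradient of a toy loss sum(w^2)
    weights, velocity = sgd_momentum_step(weights, grads, velocity, lr)
print(weights)
```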
[0074] A shared visual feature extraction module is used to receive input images and extract multi-scale feature maps based on the ConvNeXt architecture;
[0075] The visual attention module generates a time-series suppression label map and outputs the probability distribution of the next fixation point, while dividing the visual feature map into nine equal parts of local region features.
[0076] The expert analysis module is used to extract feature vectors corresponding to the aesthetic dimensions from the multi-scale feature maps;
[0077] The fusion Transformer module is used to perform multi-layer encoding on the input sequence of 9 equal parts of local visual features and expert feature vectors, and to output comprehensive aesthetic feature vectors for the 9 local regions.
[0078] The scoring module is used to output nine local aesthetic scores through a linear regression layer and perform a weighted average to obtain the final overall aesthetic score.
[0079] The core of this invention is a multi-module cascaded neural network system, which mainly consists of the following functional modules: a shared visual feature extraction module, a visual attention module, multiple parallel expert analysis modules, a fusion Transformer module, and a scoring module.
[0080] The shared visual feature extraction module is responsible for converting the preprocessed input image into a multi-scale visual feature representation. Specifically, it is implemented as a ConvNeXt backbone consisting of four stacked stages, each built from ConvNeXt blocks. Each block employs an inverted bottleneck structure, including large-kernel depthwise convolutions, layer normalization, pointwise convolutions, and residual connections, progressively extracting multi-level feature maps from low-level texture details to high-level semantic information. The multi-scale features output by this module provide rich foundational representations for the subsequent expert modules and the visual attention module.
[0081] The visual attention module simulates the human visual perception mechanism. It extracts salient features of the scene based on a pre-trained VGG-16 model and uses a convolutional LSTM to simulate the inhibition of return (IOR) mechanism of vision. Through spatial masking, it dynamically suppresses areas that have already been fixated, so that the focus of attention is generated dynamically. A mixture density network (MDN) is used to model the gaze probability distribution and outputs a gaze path priority map. The priority map is then divided into nine equal parts (3×3) to obtain local gaze path priority maps, which guide subsequent feature weighting and attention enhancement.
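The spatial masking step can be illustrated with a small sketch that suppresses a priority map around the previous fixation using a Gaussian-shaped mask and then renormalizes it. This is a simplified stand-in: the mask formula, the `sigma` value, and the function names are assumptions, and the convolutional LSTM and mixture density network of the actual module are not modeled here.

```python
import math


def inhibition_mask(height, width, fixation, sigma=1.0):
    """Mask that is 0 at the previous fixation and approaches 1 far from it."""
    fy, fx = fixation
    return [
        [1.0 - math.exp(-((y - fy) ** 2 + (x - fx) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def suppress(priority, fixation, sigma=1.0):
    """Apply the mask to a priority map and renormalize it to sum to 1."""
    height, width = len(priority), len(priority[0])
    mask = inhibition_mask(height, width, fixation, sigma)
    masked = [[priority[y][x] * mask[y][x] for x in range(width)]
              for y in range(height)]
    total = sum(sum(row) for row in masked)
    if total == 0.0:
        return masked
    return [[value / total for value in row] for row in masked]


uniform = [[1.0 / 25.0] * 5 for _ in range(5)]
updated = suppress(uniform, fixation=(2, 2))
print(updated[2][2], updated[0][0])
```

After suppression the previously fixated cell has zero priority, so the next fixation is drawn from the remaining regions.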
[0082] Multiple parallel expert analysis modules include the following four sub-modules, each addressing different aesthetic dimensions for spatial perception modeling and semantic deconstruction of images: The composition layout expert, based on the high-level semantic feature map of the shared feature module, divides the image into nine equal parts (3×3), extracts regional composition features through local average pooling and a shared multilayer perceptron, and evaluates the matching degree of classic composition rules in each region; the color and lighting expert directly processes the nine spatial sub-blocks of the image, performs color space transformation, extracts regional color statistical vectors, and uses a shallow MLP to model the color harmony and brightness features of each region; the image quality expert utilizes the low-level high-resolution feature map from the shared module, divides the image into nine-grid sections, and extracts the image quality features of each region through a shared small convolutional network and fully connected layers, reflecting its sharpness, compression artifacts, and texture consistency; the content subject expert divides the image into nine regions from the mid-level semantic feature map, performs average pooling and multilayer perceptron modeling on each region, extracts the semantic category and subject distribution information of each region, thereby achieving spatial representation of the content dimension.
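The nine-part (3×3) division shared by the expert modules can be sketched as below, together with two of the per-region operations named above: local average pooling and a color statistic after color space conversion. The choice of HSV, the hue histogram with 8 bins, and all function names are illustrative assumptions; the patent does not name the color space or the statistic.

```python
import colorsys


def split_grid(array2d, grid=3):
    """Split an H x W nested list into grid * grid row-major blocks (flat lists)."""
    height, width = len(array2d), len(array2d[0])
    if height < grid or width < grid:
        raise ValueError("input is smaller than the grid")
    blocks = []
    for gy in range(grid):
        rows = range(gy * height // grid, (gy + 1) * height // grid)
        for gx in range(grid):
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            blocks.append([array2d[y][x] for y in rows for x in cols])
    return blocks


def region_means(feature_map, grid=3):
    """Local average pooling of a single-channel feature map per region."""
    return [sum(block) / len(block) for block in split_grid(feature_map, grid)]


def hue_histogram(pixels, bins=8):
    """Normalized hue histogram of RGB pixels with components in [0, 1]."""
    hist = [0] * bins
    for r, g, b in pixels:
        hue, _, _ = colorsys.rgb_to_hsv(r, g, b)
        hist[min(int(hue * bins), bins - 1)] += 1
    return [count / len(pixels) for count in hist]


def color_vector(image, grid=3, bins=8):
    """Concatenate the per-region hue histograms into one color vector."""
    vector = []
    for block in split_grid(image, grid):
        vector.extend(hue_histogram(block, bins))
    return vector


feature_map = [[(y // 2) * 3 + (x // 2) for x in range(6)] for y in range(6)]
print(region_means(feature_map))
red_image = [[(1.0, 0.0, 0.0)] * 3 for _ in range(3)]
print(len(color_vector(red_image)))
```

In the described system the pooled values and color statistics would then pass through the shared multilayer perceptron or shallow MLP of each expert, which this sketch omits.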
[0083] The Transformer module serves as the core for multi-dimensional aesthetic feature fusion. It receives attention-weighted visual features from the visual attention module and 512-dimensional feature vectors from various expert modules, forming a fixed-length expert feature sequence of 6. Through a standard Transformer encoder with 4 stacked layers (including multi-head self-attention and feedforward networks), it fuses information from various aesthetic dimensions, captures the contextual dependencies and complementary relationships between features, and outputs local aesthetic features of 9 regions.
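The fusion step relies on self-attention, in which every token is updated with a weighted mix of all tokens. The sketch below shows single-head scaled dot-product self-attention with identity query, key, and value projections. This is a deliberate simplification for illustration; the learned projections, multiple heads, feedforward network, and layer stacking of a standard Transformer encoder are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention (identity projections)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[i] for w, value in zip(weights, tokens))
            for i in range(dim)
        ])
    return outputs


tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]  # toy 3-dim tokens
for row in self_attention(tokens):
    print([round(v, 3) for v in row])
```

The output has the same number of tokens and the same dimension as the input, which is what allows several such layers to be stacked.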
[0084] The scoring module is located at the end of the system. It receives the comprehensive aesthetic feature vector (the 512-dimensional vector corresponding to the [CLS] Token) output by the fusion Transformer module. Through a single-layer linear regression network, the comprehensive features are mapped to aesthetic evaluation scores of 9 local regions to achieve quantitative evaluation of the art material. Finally, the overall aesthetic score is obtained by weighted averaging.
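The final weighted averaging of the nine local scores can be sketched as follows. The patent does not state where the region weights come from, so the `weights` argument here is a hypothetical input; uniform weights are used when none are given.

```python
def overall_score(local_scores, weights=None):
    """Weighted average of nine local aesthetic scores (row-major 3x3 order)."""
    if len(local_scores) != 9:
        raise ValueError("expected nine local scores")
    if weights is None:
        weights = [1.0] * 9  # assumption: uniform weighting by default
    if len(weights) != 9:
        raise ValueError("expected nine weights")
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(local_scores, weights)) / total


local_scores = [6.0, 7.0, 6.5, 7.5, 9.0, 7.0, 5.5, 6.0, 6.5]
center_heavy = [1.0, 1.0, 1.0, 1.0, 4.0, 1.0, 1.0, 1.0, 1.0]  # illustrative
print(overall_score(local_scores))
print(overall_score(local_scores, center_heavy))
```

Keeping the nine local scores alongside the overall score is what lets the result be inspected region by region.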
[0085] To improve the model's generalization ability and scoring accuracy, the multi-task joint training module employs a multi-task joint training strategy, optimizing the joint loss function that includes aesthetic scoring, visual attention path prediction, composition, color, image quality, and content classification. During training, multi-attribute labeled data is used, pre-trained weights are used to initialize some modules, and stochastic gradient descent and learning rate scheduling strategies are employed to achieve end-to-end joint optimization.
[0086] The art material aesthetic evaluation model described in this invention presents a data flow architecture of "main branch distribution - parallel parsing - central fusion". This structure aims to efficiently deconstruct and process complex visual information, and its specific inter-component connections and data flow are as follows.
[0087] Structurally, after receiving the image to be evaluated at the system's input, the data stream first enters the shared visual feature extraction module (based on ConvNeXt). This module serves as the core visual backbone of the entire system, and its output is a set of multi-scale feature maps. It is designed to form a "one-to-many" fan-out connection with multiple downstream modules.
[0088] Specifically, this set of multi-scale feature maps is distributed to the corresponding analysis modules: the output of the low-level, high-resolution feature map is unidirectionally connected to the input of the image quality expert module, providing it with rich detailed texture information; the output of the mid-level, strongly semantic feature map is unidirectionally connected to the input of the content subject expert module, providing it with the intermediate semantics required for parsing objects; and the output of the high-level, global semantic feature map is connected simultaneously to the input of the composition and layout expert module and to the input preprocessing unit of the subsequent fusion Transformer module. The color and lighting expert module is structurally relatively independent, with its input directly connected to the system's original image input, ensuring that its analysis is based on the original color information.
[0089] At the same time, the input of the visual attention module also shares the mid-to-high-level feature maps of the backbone. The visual attention scanning path generated at its output is connected to the input preprocessing unit of the fusion Transformer module, which is used to generate a visual attention-weighted visual feature vector.
[0090] The core of this system architecture lies in the Fusion Transformer module. This module's input plays a "many-to-one" information aggregation role. It aggregates information streams from five different sources: 1) an attention-weighted global visual vector; 2) the output vector from the composition and layout expert module; 3) the output vector from the color and lighting expert module; 4) the output vector from the image quality expert module; and 5) the output vector from the content subject expert module. These vectors are structurally arranged into a fixed-length sequence (6 tokens), serving as the unified input to the Fusion Transformer module.
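The "many-to-one" aggregation can be sketched as assembling a fixed-length token list. The sketch assumes that the sixth token is the [CLS] token referred to elsewhere in the description, placed first; the source names used as dictionary keys are hypothetical.

```python
TOKEN_DIM = 512  # token dimension stated for the expert feature vectors

# Hypothetical keys for the five information sources listed above.
SOURCES = ("attention_visual", "composition", "color", "quality", "content")


def build_sequence(cls_token, features):
    """Arrange [CLS] plus the five source vectors into a 6-token sequence."""
    sequence = [cls_token]
    for name in SOURCES:
        vector = features[name]
        if len(vector) != TOKEN_DIM:
            raise ValueError(f"{name} must have {TOKEN_DIM} dimensions")
        sequence.append(vector)
    return sequence


cls_token = [0.0] * TOKEN_DIM  # placeholder; learned in a real model
features = {name: [0.1] * TOKEN_DIM for name in SOURCES}
sequence = build_sequence(cls_token, features)
print(len(sequence), len(sequence[0]))
```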
[0091] In the final output path, the aggregated output of the fused Transformer module (i.e., the feature vector corresponding to the [CLS] token) is unidirectionally connected to the input of the scoring module. The scoring module, as a simple linear mapping unit, outputs the final aesthetic evaluation value of the entire system.
[0092] In summary, the system architecture of this invention achieves efficient and independent parsing of multi-dimensional image information through initial parallel design; then, through a centralized fusion Transformer hub, it enables deep, context-sensitive correlation and integration of this discrete information, ultimately drawing conclusions through a concise output path. This "split-convergence" structure ensures the comprehensiveness of the system analysis and the intelligence of the decision-making.
[0093] Alternative Solution 1: This solution constructs a unified deep image feature extraction backbone (such as ResNet, DenseNet, or EfficientNet) to achieve a global representation of image content. Multiple independent branch modules (i.e., multiple regression heads or classifiers) are connected after the backbone output to predict aesthetic dimensions such as composition, color, content, and sharpness. Finally, a weighted or joint mechanism is used to obtain the overall score. The module structure includes an image feature extractor (Backbone) for extracting general image features; a multi-branch head structure, with each head processing one aesthetic dimension; and a score fusion module that fuses the outputs of each dimension into the final score through simple weighting, logistic regression, or a small neural network.
[0094] Alternative Solution 2: Utilize pre-trained large-scale image-text joint models (such as CLIP, BLIP, Florence, etc.) to convert images into semantic feature vectors via an image encoder. Simultaneously, input predefined aesthetic dimension text (such as "This is a vividly colored image") into a text encoder. Calculate the semantic matching degree (usually cosine similarity) between the two, which serves as the scoring criterion for the image in that aesthetic dimension. The module structure includes an image encoder from the CLIP or BLIP image branch (such as ViT-B / 32); a text encoder from the CLIP text branch, corresponding to multiple aesthetic dimension prompts; a matching module that calculates the cosine similarity of each image-text pair to obtain a score; and a score aggregation module that weights the scores across multiple dimensions to output the final score.
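The matching and aggregation modules of this alternative can be sketched with plain vectors. The embeddings below are made-up placeholders standing in for the outputs of an image encoder and a text encoder; no real CLIP or BLIP API is called, and the dimension names and weights are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def dimension_scores(image_vector, prompt_vectors):
    """Score the image against each aesthetic-dimension prompt embedding."""
    return {name: cosine_similarity(image_vector, vector)
            for name, vector in prompt_vectors.items()}


def aggregate(scores, weights):
    """Weighted average of the per-dimension scores."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


image_vector = [0.6, 0.8, 0.0]              # placeholder image embedding
prompt_vectors = {                          # placeholder prompt embeddings
    "color": [1.0, 0.0, 0.0],
    "composition": [0.0, 1.0, 0.0],
}
scores = dimension_scores(image_vector, prompt_vectors)
print(scores, aggregate(scores, {"color": 1.0, "composition": 1.0}))
```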
[0095] This invention discloses a method and system for aesthetic evaluation of artistic materials, addressing the problem of insufficient accuracy in evaluating stylized artistic images using existing methods. The proposed model consists of a shared visual feature extraction backbone, a visual attention module, a multi-path parallel expert analysis module, and an attention-guided feature fusion and scoring module. This method extracts multi-level visual information from images, simulates the human visual attention path, and combines four interpretable aesthetic dimensions—composition, color, content, and image quality—for collaborative analysis to extract local and overall aesthetic representations. Each expert module outputs intermediate aesthetic indicators based on visual features and classic design rules; guided by the visual attention mechanism, these indicators are fused into a unified comprehensive aesthetic representation. Finally, the model outputs aesthetic scores for nine local regions, and a weighted average is used to calculate the overall aesthetic evaluation score of the image. This system employs an end-to-end multi-task joint training strategy, effectively enhancing the expressive power and evaluation accuracy of aesthetic features.
[0096] The above are merely preferred embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. A method for evaluating the aesthetics of artistic materials, characterized in that it includes: the input image is fed into a shared visual feature extraction module, which is based on the ConvNeXt architecture and consists of multiple stacked ConvNeXt blocks with an inverted bottleneck structure, and which outputs multi-scale feature maps; the feature maps are input to the visual attention module, which includes a visual system processing unit, a visual working memory processing unit, and a priority map generation unit, to generate a time-series suppression label map and output the probability distribution of the next gaze point; the visual feature map output by the visual attention module is divided into 9 equal local region features; the feature maps at different levels are input into multiple parallel expert analysis modules, namely the composition and layout expert module, the color and lighting expert module, the image quality expert module, and the content subject expert module, to extract feature vectors of the corresponding aesthetic dimensions; the aesthetic feature vectors output by the expert analysis modules are taken together as an input sequence and input into the fusion Transformer module along with the 9 local visual features segmented by the visual attention module; the fusion Transformer module performs multi-layer encoding on the input sequence and outputs the comprehensive aesthetic feature vectors corresponding to the 9 local regions; the nine local comprehensive aesthetic feature vectors output by the fusion Transformer module are input into the scoring module, which outputs nine local aesthetic scores through a linear regression layer; and the final overall aesthetic score of the image is obtained by a weighted average of the nine local aesthetic scores.
2. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that the shared visual feature extraction module includes four stages, each of which is implemented by stacking multiple ConvNeXt blocks; and the ConvNeXt block includes a 7×7 depthwise convolutional layer, layer normalization, a 1×1 pointwise convolutional layer, a GELU activation function, and a residual connection structure.
3. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that, in the visual attention module, the visual working memory processing unit locally suppresses the previous gaze position through a spatial mask; and the spatial mask is applied to the feature map in the form of a Gaussian function, and a time-related attention feature map is generated through a convLSTM unit.
4. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that the composition and layout expert module divides the high-level semantic feature map into 3×3 regions; and each region extracts a local structural representation through region average pooling and feeds it into a multilayer perceptron with shared parameters to output a composition feature vector.
5. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that the color and lighting expert module divides the input image into 3×3 regions; and after color space conversion for each region, a statistical histogram is calculated, and the histograms are concatenated to form a color distribution vector that is input into a shallow MLP.
6. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that the fusion Transformer module includes a four-layer Transformer encoder; and each encoder layer contains an 8-head multi-head self-attention mechanism and a feedforward network, with an input sequence dimension of 512.
7. The method for evaluating the aesthetics of artistic materials as described in claim 1, characterized in that the method employs a multi-task joint training strategy; and the joint optimization includes loss functions for aesthetic scoring, visual attention path prediction, composition classification, color regression, image quality regression, and content classification.
8. A system for the aesthetic evaluation method of art materials according to any one of claims 1-7, characterized in that it includes: a shared visual feature extraction module, used to receive input images and extract multi-scale feature maps based on the ConvNeXt architecture; a visual attention module, used to generate a time-series suppression label map and output the probability distribution of the next fixation point, while dividing the visual feature map into nine equal local region features; an expert analysis module, used to extract feature vectors corresponding to the aesthetic dimensions from the multi-scale feature maps; a fusion Transformer module, used to perform multi-layer encoding on the input sequence of 9 local visual features and expert feature vectors, and to output comprehensive aesthetic feature vectors for the 9 local regions; and a scoring module, used to output nine local aesthetic scores through a linear regression layer and perform a weighted average to obtain the final overall aesthetic score.