A few-sample no-reference binaural spatial audio quality evaluation method
By combining feature transfer from a pre-trained mono audio feature extraction module with a dual-attention fusion module and a lightweight prediction network, the method addresses the evaluation bias and computational complexity of binaural spatial audio quality assessment in few-sample scenarios, achieving efficient and accurate quality assessment suitable for resource-constrained devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NINGBO UNIV
- Filing Date
- 2026-05-14
- Publication Date
- 2026-06-12
AI Technical Summary
Existing binaural spatial audio quality assessment methods rely on reference signals, making them difficult to apply in real-world scenarios such as streaming media. Furthermore, they exhibit weak generalization ability in scenarios with few samples, high computational complexity, and an inability to effectively capture the spatial characteristics of binaural spatial audio. This results in significant discrepancies between the assessment results and subjective perception, making them difficult to deploy on resource-constrained devices.
We employ a few-sample, no-reference binaural spatial audio quality assessment method. By transferring features through a pre-trained mono audio feature extraction module, combined with a dual-attention fusion module and a lightweight prediction network, we perform feature extraction and quality prediction. This includes feature calibration, cross-channel spatial dependency modeling, and lightweight encoder design to reduce computational resource requirements.
It achieves high-precision audio quality assessment under no-reference conditions, is suitable for resource-constrained devices, significantly improves the consistency between assessment results and subjective perception, reduces training costs and computational complexity, and broadens the scope of applications.
Figure CN122201349A_ABST
Description
Technical Field
[0001] This invention belongs to the field of audio signal processing and relates to an audio quality assessment technique, particularly a method for assessing the quality of binaural spatial audio with few samples and no reference.
Background Technology
[0002] With the popularization of virtual reality (VR), augmented reality (AR) and immersive audio-visual experiences, binaural spatial audio, as the core technology for presenting spatial sound, directly determines the user's immersive experience.
[0003] Currently, subjective hearing tests remain the gold standard for assessing binaural spatial audio quality. However, these tests suffer from high implementation costs, long processing times, and complex procedures. Furthermore, they are affected by two-channel spatial cues, making it difficult to meet the needs of large-scale, real-time assessments. Therefore, objective assessment methods have become a key research focus in this field.
[0004] Existing objective evaluation methods suffer from the following limitations: First, most methods are intrusive, relying on clean reference signals, which are difficult to obtain in real-world scenarios such as streaming media, limiting their practicality. Second, some methods employ no-reference techniques, but these are mostly designed for mono audio, focusing only on traditional distortions such as timbre and loudness. They fail to effectively capture key spatial features unique to binaural spatial audio, such as spatial localization accuracy, binaural masking effects, and inter-channel coherence, leading to significant discrepancies between the evaluation results and subjective perception of binaural spatial audio quality. Third, deep learning-based evaluation models typically rely on massive amounts of labeled data for training. However, high-quality subjective labeling of binaural spatial audio requires specific acoustic environments, which is extremely costly. Furthermore, samples are scarce in specific scenarios (such as binaural speech in niche dialects or audio specific to certain devices), causing existing models to overfit and exhibit weak generalization ability in low-sample scenarios. In addition, some optimization schemes have a large number of parameters and are computationally complex, making them difficult to deploy efficiently in resource-constrained scenarios such as portable VR devices and in-vehicle terminals, and they cannot fully leverage their advantages in low-sample scenarios.
Summary of the Invention
[0005] The technical problem to be solved by the present invention is to provide a method for evaluating the quality of binaural spatial audio with few samples and no reference. This method can accurately capture the spatial characteristics of binaural spatial audio signals with no reference signal and only a small number of samples. It also has low computational resource consumption and is suitable for deployment in resource-constrained scenarios such as VR / AR devices and vehicle terminals.
[0006] The technical solution adopted by this invention to solve the above-mentioned technical problems is: a method for assessing binaural spatial audio quality with few samples and no reference, comprising the following steps:
S1. Obtain an audio dataset containing a small number of samples, each of which includes a binaural spatial audio signal and its corresponding subjective quality rating label.
S2. Divide the audio dataset into a training set, a validation set and a test set.
S3. Construct an audio quality assessment model, which decomposes the input binaural spatial audio signal into a left channel signal and a right channel signal, and then performs the following stages in sequence:
In the feature extraction stage, the left channel signal and the right channel signal are each input to the pre-trained mono audio feature extraction module, which outputs frame-level left channel features and frame-level right channel features accordingly.
In the dual attention fusion stage, the frame-level left channel features and the frame-level right channel features are input into the dual attention fusion module, where channel-level calibration is performed on each of them separately, and cross-channel spatial dependency modeling is then performed on the calibrated two-channel features to output fused features.
In the quality prediction stage, the fused features are input into a lightweight prediction network, which outputs the quality evaluation results.
S4. Based on the training set, train the dual attention fusion module and the lightweight prediction network, freezing the parameters of the mono audio feature extraction module during the training process; select the optimal model parameters based on the validation set using an early stopping mechanism to obtain the trained audio quality assessment model.
S5. Using the trained audio quality assessment model, perform quality assessment on each binaural spatial audio signal in the test set.
[0007] The mono audio feature extraction module employs a self-supervised music representation learning model based on Mel residual vector quantization.
[0008] The implementation process of the dual attention fusion module is as follows: The frame-level left channel feature and the frame-level right channel feature are respectively input into the channel attention submodule for channel-level calibration, and the calibrated frame-level left channel feature and calibrated frame-level right channel feature are output accordingly. The calibrated frame-level left channel features and the calibrated frame-level right channel features are input into the multi-head self-attention submodule to perform cross-channel spatial dependency modeling in order to capture spatial-specific cues of the input binaural spatial audio signals and output dual-attention calibration features. The dual-attention calibration features are input into the fusion layer for dimensionality reduction, and the nonlinear fusion relationship between the calibrated frame-level left channel features and the calibrated frame-level right channel features is learned, and the fusion features are output.
[0009] The implementation process of the channel attention submodule is as follows: Global average pooling and global max pooling operations are performed on a single channel feature in the time dimension to obtain a global statistical context descriptor and a global extreme value context descriptor, respectively; wherein, the single channel feature is the frame-level left channel feature or the frame-level right channel feature. The global statistical context descriptor and the global extreme value context descriptor are each input into a multilayer perceptron with shared weights and subjected to nonlinear transformation to obtain the first channel importance vector and the second channel importance vector. The first and second channel importance vectors are added element by element, and the sum is normalized to obtain the channel attention weights. The single channel feature is weighted along the channel dimension according to the channel attention weights to obtain the calibrated single channel feature; wherein, the calibrated single channel feature is either the calibrated frame-level left channel feature or the calibrated frame-level right channel feature.
[0010] The implementation process of the multi-head self-attention submodule is as follows: The calibrated frame-level left channel feature and the calibrated frame-level right channel feature are concatenated along the channel dimension to obtain a joint feature. The joint feature is mapped to a query matrix, a key matrix and a value matrix, each of which is split into multiple query submatrices, multiple key submatrices and multiple value submatrices, respectively, according to a preset number of attention heads. For each attention head, the attention weights are calculated from its query submatrix and key submatrix, and weighted aggregation is performed on the corresponding value submatrix according to the attention weights. The outputs of all attention heads are then concatenated in the channel dimension to obtain the concatenated and fused features. The concatenated and fused features and the joint feature are passed through a residual connection and layer normalization to output the dual-attention calibration features.
[0011] The implementation process of the lightweight prediction network is as follows: The fused features are input into a lightweight encoder to extract features of local temporal patterns and output frame-level encoded features. The frame-level encoded features are input into a feedforward neural network regressor, which learns the mapping between the frame-level encoded features and the frame-level quality scores and outputs the frame-level quality scores. The global quality prediction score of the input binaural spatial audio signal is obtained by performing time-averaged pooling on all the frame-level quality scores through a pooling layer. The global quality prediction score is mapped to a preset scoring range through a mapping layer to obtain the quality assessment result.
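The last two steps of the prediction network (time-averaged pooling and mapping to a preset scoring range) can be sketched as follows. This is only an illustrative sketch: the text does not say how the mapping layer works, so the sigmoid squashing and the 0-100 range (taken from the MUSHRA scale used for the labels) are assumptions, and `global_quality_score` is a hypothetical name.

```python
import math

def global_quality_score(frame_scores, lo=0.0, hi=100.0):
    """Time-average frame-level scores, then map the mean into [lo, hi]."""
    if not frame_scores:
        raise ValueError("need at least one frame-level score")
    # Pooling layer: time-averaged pooling over all frame-level quality scores.
    mean_score = sum(frame_scores) / len(frame_scores)
    # Mapping layer (assumed): sigmoid squashing followed by linear rescaling.
    squashed = 1.0 / (1.0 + math.exp(-mean_score))
    return lo + (hi - lo) * squashed

# Hypothetical frame-level scores produced by the regressor.
score = global_quality_score([0.2, -0.1, 0.4, 0.3])
```

Averaging before the mapping keeps the final score inside the preset range regardless of how many frames the signal has.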
[0012] The lightweight encoder is constructed based on inverted residual convolutional blocks, which sequentially include a first pointwise convolution for expanding dimensions, a depthwise convolution for extracting spatial features, and a second pointwise convolution for compressing dimensions.
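A minimal NumPy sketch of one inverted residual block in the order described above (pointwise expansion, depthwise convolution, pointwise compression). The ReLU activations, the odd kernel length, the zero padding, the residual addition, and the choice to run the depthwise filter along the time axis are assumptions made for illustration; the text does not specify them.

```python
import numpy as np

def inverted_residual_block(x, w_expand, w_depth, w_project):
    """x: (F, C) frame features; returns (F, C).

    w_expand:  (C, E) first pointwise convolution, expands C -> E
    w_depth:   (K, E) depthwise kernel, one length-K filter per channel (K odd)
    w_project: (E, C) second pointwise convolution, compresses E -> C
    """
    n_frames = x.shape[0]
    h = np.maximum(x @ w_expand, 0.0)           # pointwise expand + ReLU (assumed)
    k = w_depth.shape[0]
    pad = k // 2
    h_pad = np.pad(h, ((pad, pad), (0, 0)))     # zero-pad the time axis
    d = np.zeros_like(h)
    for t in range(k):                          # each channel uses its own filter
        d += h_pad[t:t + n_frames] * w_depth[t]
    d = np.maximum(d, 0.0)                      # ReLU (assumed)
    y = d @ w_project                           # pointwise compress, linear output
    return x + y                                # residual connection (assumed)

# Hypothetical sizes, chosen small for illustration.
rng = np.random.default_rng(0)
n_f, n_c, n_e, n_k = 125, 8, 32, 3
x = rng.standard_normal((n_f, n_c))
w_e = rng.standard_normal((n_c, n_e)) * 0.1
w_d = rng.standard_normal((n_k, n_e)) * 0.1
w_p = rng.standard_normal((n_e, n_c)) * 0.1
out = inverted_residual_block(x, w_e, w_d, w_p)
```

Because the depthwise step has only K weights per channel, the block needs far fewer parameters than a full convolution of the same width, which is the usual motivation for this design.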
[0013] The training process employs an early stopping mechanism, using the Spearman rank correlation coefficient on the validation set as the selection criterion for the optimal model parameters. When the Spearman rank correlation coefficient does not improve after a preset number of consecutive rounds, training is stopped and the historically optimal model parameters are loaded.
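The early stopping rule can be sketched in plain Python as below. The helper names and the example SRCC sequence are hypothetical, and the patience value is only a placeholder for the "preset number of consecutive rounds", which the text does not state.

```python
def rankdata(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5   # undefined if either input is constant

def early_stopping(val_srcc_per_round, patience):
    """Return (best_round, stop_round) for a sequence of validation SRCC values."""
    best, best_round, waited = float("-inf"), -1, 0
    for rnd, srcc in enumerate(val_srcc_per_round):
        if srcc > best:
            # A real training loop would snapshot the model parameters here.
            best, best_round, waited = srcc, rnd, 0
        else:
            waited += 1
            if waited >= patience:
                return best_round, rnd
    return best_round, len(val_srcc_per_round) - 1

# Hypothetical validation SRCC per round, with patience 3.
best_round, stop_round = early_stopping([0.5, 0.6, 0.7, 0.69, 0.68, 0.66, 0.9], 3)
```

Note that a later improvement (the 0.9 in this example) is never seen once the patience budget is exhausted, which is the trade-off of any patience-based rule.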
[0014] During the training process, the RMSprop optimizer is used to update the parameters of the dual attention fusion module and the lightweight prediction network.
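For reference, one common form of the RMSprop update for a single scalar parameter is sketched below. The learning rate, decay factor, and epsilon are assumed defaults; the text does not give the hyperparameters used.

```python
def rmsprop_step(param, grad, sq_avg, lr=1e-3, alpha=0.99, eps=1e-8):
    """One RMSprop update; returns the new parameter and running average."""
    # Exponential moving average of the squared gradient.
    sq_avg = alpha * sq_avg + (1.0 - alpha) * grad * grad
    # Scale the step by the root of that running average.
    param = param - lr * grad / (sq_avg ** 0.5 + eps)
    return param, sq_avg

# Hypothetical starting values.
p, v = rmsprop_step(param=1.0, grad=2.0, sq_avg=0.0)
```

Only the parameters of the dual attention fusion module and the lightweight prediction network would go through such an update, since the feature extractor is frozen.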
[0015] Compared with the prior art, the advantages of the present invention are as follows: 1. No-reference evaluation design, eliminating dependence on reference signals. The method of this invention only requires the binaural spatial audio signal to be evaluated, without any reference signal. This allows the method to be directly applied to real-world scenarios such as streaming media, VR, and AR where reference signals are difficult to obtain, significantly improving its practicality. Experiments show that on the test set of the publicly available Open Dataset of Audio Quality (ODAQ), the method of this invention achieves an LCC (linear correlation coefficient) of 0.9165 under no-reference conditions, significantly outperforming existing intrusive methods (QASTAnet LCC=0.7812, Ambiqual LCC=0.4399, eMoBi-Q LCC=0.4028) and thus combining no-reference operation with high accuracy.
[0016] 2. Cross-domain transfer strategy to reduce data dependency. This invention employs a pre-trained mono audio feature extraction module, transferring the feature extraction capabilities pre-trained on large-scale mono audio to binaural spatial audio evaluation tasks. By freezing the parameters of the mono audio feature extraction module during training, this method requires only a small number of samples to train the dual-attention fusion module and the lightweight prediction network, reducing training costs and addressing the scarcity of labeled binaural spatial audio data. Ablation experiments demonstrate that removing the mono audio feature extraction module significantly degrades the performance of the audio quality evaluation model, validating the crucial role of the pre-trained mono audio feature extraction module in training with limited samples. This strategy enables the audio quality evaluation model to exhibit good generalization ability in scenarios with scarce samples, such as binaural speech in niche dialects and device-specific audio, which significantly broadens its application scope.
[0017] 3. Dual attention fusion mechanism for precise capture of spatial features. This invention employs a dual-attention fusion module that performs channel-level calibration on each channel separately and then models the cross-channel spatial dependence of the calibrated two-channel features. Specifically, it enhances the internal features of each single channel through channel attention and captures spatially specific cues between the two channels through multi-head self-attention, accurately characterizing inter-channel consistency and other features unique to the binaural spatial audio signal. Experiments show that the method of this invention significantly outperforms existing mono and intrusive methods on three metrics: LCC (linear correlation coefficient), SRCC (Spearman rank correlation coefficient), and KTAU (Kendall rank correlation coefficient). Specifically, the LCC is 0.4085 higher than that of the existing mono method PESQ and 17.3% higher than that of the intrusive method QASTAnet, demonstrating a higher consistency between the quality assessment results and human subjective perception.
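The 17.3% figure is consistent with reading it as a relative LCC gain over QASTAnet, using the LCC values reported above (0.9165 for the proposed method, 0.7812 for QASTAnet). The short check below assumes that reading.

```python
lcc_proposed = 0.9165   # LCC stated for the proposed method on the ODAQ test set
lcc_qastanet = 0.7812   # LCC stated for QASTAnet

# Relative gain: difference divided by the baseline value.
relative_gain = (lcc_proposed - lcc_qastanet) / lcc_qastanet
print(f"{relative_gain:.1%}")
```

The 0.4085 figure for PESQ, by contrast, is stated as an absolute difference in LCC rather than a percentage.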
[0018] 4. Lightweight architecture design, adaptable to resource-constrained scenarios. The lightweight prediction network of the present invention significantly reduces the number of parameters in the audio quality assessment model compared to existing deep learning-based assessment models, enabling efficient deployment in resource-constrained scenarios such as VR devices and in-vehicle terminals, and meeting the needs of real-time audio quality assessment.
[0019] 5. Efficient training strategy with fast convergence. The method of this invention uses a pre-trained mono audio feature extraction module and freezes its parameters during training. It also uses an early stopping mechanism based on the validation set to select the optimal model parameters, enabling the audio quality assessment model to converge stably with only 240 samples and at most 100 training rounds. This is 5-10 times fewer training rounds than existing deep learning-based assessment models (which typically require 500-1000 training rounds).
Attached Figure Description
[0020] Figure 1 is a flowchart illustrating the overall implementation of the method of the present invention;
Figure 2 is a flowchart illustrating the implementation of the feature extraction stage in the method of the present invention;
Figure 3 is a flowchart illustrating the implementation of the channel attention submodule in the dual attention fusion stage of the method of the present invention;
Figure 4 is a flowchart illustrating the implementation of the multi-head self-attention submodule in the dual attention fusion stage of the method of the present invention;
Figure 5 is a flowchart illustrating the implementation of the quality prediction stage in the method of the present invention.
Detailed Implementation
[0021] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0022] This invention addresses the following problems in existing binaural spatial audio quality assessment methods: intrusive methods rely on reference signals, resulting in insufficient practicality in real-world scenarios such as streaming media; mono methods lack spatial feature modeling capabilities, leading to significant discrepancies between quality assessment results and human subjective perception; deep learning-based assessment models have high requirements for labeled data and computational resources and weak generalization ability in scenarios with few samples; and some optimization schemes have a large number of parameters, making them difficult to deploy on resource-constrained devices. Therefore, this invention proposes a binaural spatial audio quality assessment method with few samples and no reference.
[0023] This invention proposes a method for assessing binaural spatial audio quality with few samples and no reference. As shown in Figure 1, it includes the following steps: S1. Obtain an audio dataset containing a small number of samples, each of which includes a binaural spatial audio signal and its corresponding subjective quality rating label.
[0024] In this embodiment, the audio dataset used is the publicly available Open Dataset of Audio Quality (ODAQ). The ODAQ dataset is currently the only publicly available dataset in the field of binaural spatial audio quality assessment. It contains 240 binaural spatial audio signals, all in WAV format, with sampling frequencies of 44.1 kHz or 48 kHz. These 240 binaural spatial audio signals consist of: 14 music clips, including 8 professionally recorded solo compositions and 6 live performance recordings; and 11 film soundtrack clips, integrating dialogue, background music, and environmental sound effects (such as scene sound effects and action sound effects), comprehensively reproducing the audio characteristics of real-life audio-visual scenes. Each binaural spatial audio signal was subject to standardized MUSHRA (Multiple Stimuli with Hidden Reference and Anchor) subjective scoring by 26 listeners with professional acoustic assessment qualifications. The scoring range was 0-100 points, with a 1-point interval. To ensure the authority and accuracy of the subjective quality rating labels corresponding to each binaural spatial audio signal, the average of the 26 subjective ratings for each binaural spatial audio signal is calculated to generate the corresponding subjective quality rating label. This method of generating subjective quality rating labels can effectively offset individual subjective rating biases (such as some listeners' subjective ratings being too strict or too lenient), making the subjective quality rating labels more reflective of the objective perceived quality of binaural spatial audio. Therefore, in this embodiment, each sample includes a binaural spatial audio signal and its corresponding subjective quality rating label, resulting in a total of 240 samples, constituting an audio dataset.
[0025] S2. Divide the audio dataset into training set, validation set and test set.
[0026] In this embodiment, a stratified random partitioning strategy is used to divide the 240 samples obtained in step S1 into a training set, a validation set, and a test set, with a partitioning ratio of 70:15:15. That is, the training set contains 168 samples, the validation set contains 36 samples, and the test set contains 36 samples. During the partitioning process, it is ensured that the distribution ratio of music clips and movie soundtrack clips in the three sets is consistent with that of the original dataset. This stratified partitioning keeps model training from being biased towards one class of samples by an uneven class distribution, which would harm the generalization performance of the model. To ensure the reproducibility of experimental results, this embodiment sets the random seed to 42 so that the partitioning results are the same each time.
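A stratified 70:15:15 split with a fixed seed can be sketched as below. The function name is hypothetical, and the per-class counts in the usage example (140 music, 100 movie soundtrack) are invented purely for illustration, since the text does not state how the 240 samples divide between the two classes.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=42):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, val, test = [], [], []
    for label in sorted(by_class):          # sorted for a deterministic order
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_train = round(len(idxs) * ratios[0])
        n_val = round(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]      # remainder goes to the test set
    return train, val, test

# Hypothetical class counts: 140 "music" and 100 "movie" samples.
labels = ["music"] * 140 + ["movie"] * 100
train, val, test = stratified_split(labels)
```

Splitting each class separately and then merging is what keeps the music-to-movie ratio the same in all three sets.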
[0027] S3. Construct an audio quality assessment model, which decomposes the input binaural spatial audio signal into a left channel signal and a right channel signal, and then performs the following stages in sequence: In the feature extraction stage, as shown in Figure 2, the left channel signal and the right channel signal are each input into the pre-trained mono audio feature extraction module, which outputs frame-level left channel features and frame-level right channel features, respectively.
[0028] In the dual attention fusion stage, the frame-level left channel features and frame-level right channel features are input into the dual attention fusion module, where channel-level calibration is performed on each of them separately. Cross-channel spatial dependency modeling is then performed on the calibrated two-channel features to output the fused features.
[0029] In the quality prediction stage, the fused features are input into a lightweight prediction network, which outputs quality assessment results.
[0030] In this embodiment, existing technology is used to decompose the binaural spatial audio signal into left channel signal and right channel signal according to the audio channel. The decomposition process uses torchaudio.load to read the multi-channel data of the binaural spatial audio signal to ensure the integrity and independence of the left channel signal and the right channel signal.
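The channel decomposition itself is simple de-interleaving. The sketch below uses only the Python standard library `wave` module instead of `torchaudio.load`, so that it runs without extra dependencies; it assumes 16-bit stereo PCM, and `split_stereo_wav` is a hypothetical helper name. With `torchaudio.load`, the returned waveform tensor has the channel axis first by default, so indexing that axis separates the left and right signals in the same way.

```python
import io
import struct
import wave

def split_stereo_wav(fileobj):
    """Read a 16-bit stereo PCM WAV and return (left, right, sample_rate)."""
    with wave.open(fileobj, "rb") as wav:
        if wav.getnchannels() != 2 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Samples are interleaved L, R, L, R, ... so take every other one.
    return list(samples[0::2]), list(samples[1::2]), rate

# Build a tiny in-memory stereo file to exercise the function.
buf = io.BytesIO()
with wave.open(buf, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(48000)
    out.writeframes(struct.pack("<6h", 1, -1, 2, -2, 3, -3))
buf.seek(0)
left, right, rate = split_stereo_wav(buf)
```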
[0031] In this embodiment, the mono audio feature extraction module used in the feature extraction stage is an existing, pre-trained Self-Supervised Music Representation Learning with Mel Residual Vector Quantization (MuQ) model. The MuQ model is a self-supervised audio feature extraction model whose core structure includes: a Mel spectrum extraction layer, a residual convolutional block, a Mel residual vector quantization (Mel-RVQ) layer, and a prediction head. The pre-training data for this MuQ model covers 90,000 hours of open-source mono audio, encompassing various audio types and scenarios, enabling it to learn general audio feature representations with strong universality and robustness. The pre-trained MuQ model has been validated in multiple downstream tasks such as mono audio quality assessment and audio classification, demonstrating excellent feature extraction capabilities and generalization performance. Its feature output effectively captures audio characteristics and quality-related information.
[0032] In this embodiment, the pre-trained MuQ model is used as a fixed feature extractor for transfer learning. The specific implementation is as follows: as shown in Figure 2, the left and right channel signals are input into the same pre-trained MuQ model for feature extraction. The pre-trained MuQ model uses the same set of model parameters for feature extraction of both the left and right channel signals. During training, all parameters of the pre-trained MuQ model are kept unchanged through parameter freezing; that is, in the deep learning framework, these model parameters do not participate in gradient updates during subsequent training. The advantages of parameter freezing are: it avoids destroying the robust feature representation learned by the pre-trained MuQ model on large-scale mono audio data in few-shot training scenarios; it significantly reduces the number of parameters that need to be trained (the trainable parameters in this invention are only the parameters of the dual attention fusion module and the lightweight prediction network), effectively alleviating the overfitting problem in few-shot training scenarios; and it reduces the computational cost of backpropagation, accelerating model convergence. It should be noted that the pre-trained MuQ model, when transferred to the field of binaural spatial audio quality assessment, can effectively compensate for the problem of insufficient binaural spatial audio feature learning in scenarios with few training samples. Experiments show that after removing the MuQ model, the LCC, SRCC, and KTAU of the audio quality assessment model decreased by 59.1%, 59.2%, and 59.0%, respectively, fully verifying the key role of pre-trained feature transfer. At the same time, the feature extraction process of the MuQ model does not require manual feature design and can automatically learn high-order features related to quality assessment, avoiding the limitations of manual features.
[0033] In this embodiment, when extracting frame-level features using the pre-trained MuQ model, the frame shift is set to 40 milliseconds. During training, the frame-level left channel and frame-level right channel features output by the MuQ model have dimensions of B×F×D. Here, B represents the batch size, which is 4 (experiments have shown that this batch size balances GPU memory usage and training efficiency); F represents the number of frames in the time dimension, adaptively adjusted according to the duration of the input binaural spatial audio signal. It is calculated as the duration (in seconds) of the input left or right channel signal to the MuQ model divided by the frame shift (40 milliseconds, or 0.04 seconds). For example, for a 5-second binaural spatial audio signal, where both the left and right channel signals are 5 seconds long, the MuQ model will output frame-level features for F = 5 / 0.04 = 125 time frames; D represents the channel dimension of each time frame, with a value of 1024. This feature dimension, verified by the pre-trained MuQ model, effectively captures key audio features while avoiding excessively high computational complexity.
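The frame-count arithmetic above can be written as a small helper. Rounding to the nearest integer is an assumption; the exact number of frames a real model emits may differ slightly depending on how it handles signal boundaries.

```python
def num_frames(duration_s, frame_shift_ms=40):
    """Number of time frames F for a signal of the given duration."""
    return round(duration_s * 1000 / frame_shift_ms)

def feature_shape(batch_size, duration_s, dim=1024):
    """Shape B x F x D of the frame-level features for one channel."""
    return (batch_size, num_frames(duration_s), dim)

# The 5-second example from the text, with batch size 4 and D = 1024.
shape = feature_shape(4, 5)
```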
[0034] In this embodiment, the implementation process of the dual attention fusion module is as follows: a1) Input the frame-level left channel feature and the frame-level right channel feature into the channel attention (CA) submodule respectively, perform channel-level calibration, and output the calibrated frame-level left channel feature and the calibrated frame-level right channel feature accordingly.
[0035] a2) Input the calibrated frame-level left channel features and the calibrated frame-level right channel features into the multi-head self-attention (MHSA) submodule to perform cross-channel spatial dependency modeling in order to capture the spatial specific cues of the input binaural spatial audio signals and output dual-attention calibrated features.
[0036] a3) Input the dual-attention calibration features into the fusion layer, perform dimensionality reduction, and learn the nonlinear fusion relationship between the calibrated frame-level left channel features and the calibrated frame-level right channel features, and output the fusion features.
[0037] In this embodiment, as shown in Figure 3, the implementation process of the channel attention submodule is as follows: b1) Perform global average pooling (AvgPool) and global max pooling (MaxPool) operations over the time dimension for each individual channel feature, resulting in a global statistical context descriptor and a global extreme value context descriptor. The individual channel feature is either a frame-level left channel feature or a frame-level right channel feature. In this embodiment, the global average pooling operation calculates the average over the time dimension for each channel, and the global max pooling operation calculates the maximum over the time dimension for each channel, capturing the global statistical information and global extreme value information of each individual channel feature to achieve a comprehensive evaluation of channel importance. The size of the global statistical context descriptor and the global extreme value context descriptor is B×D.
[0038] b2) The global statistical context descriptor and the global extreme value context descriptor are input into a multilayer perceptron (MLP) with shared weights, and a nonlinear transformation is performed to obtain the first channel importance vector and the second channel importance vector. In this embodiment, the MLP is structured as follows: input layer (D=1024 dimensions) → hidden layer (D / 16=64 dimensions) → output layer (D=1024 dimensions). The hidden layer dimension is set to D / 16 to further reduce computational complexity while maintaining the model's expressive power, thus meeting lightweight requirements. The hidden layer uses the ReLU activation function, while the output layer has no activation function, and the dimensionality reduction ratio is set to 16. The role of the MLP is to perform a nonlinear transformation on the global statistical context descriptor and the global extreme value context descriptor to learn the dependencies between channels, thereby calculating the channel importance vector.
[0039] b3) Add the first and second channel importance vectors element-wise, and normalize the sum to obtain the channel attention weights. In this embodiment, the normalization is achieved through the Sigmoid activation function, so the channel attention weights lie in the interval (0,1). The formula for calculating the channel attention weights is: a=σ(MLP(AvgPool(H))+MLP(MaxPool(H))), where a represents the channel attention weights, σ() represents the Sigmoid activation function, which maps its input to the (0,1) interval so that the channel attention weights can be used directly for channel-level weighting of features, MLP() represents the multilayer perceptron, AvgPool() represents the global average pooling operation, MaxPool() represents the global max pooling operation, and H represents a single channel feature.
[0040] b4) The individual channel features are weighted along the channel dimension according to the channel attention weight to obtain the calibrated individual channel features. Through this weighting operation (multiplying the channel attention weight with the individual channel features element-wise along the channel dimension), the features of the quality perception-related channels are enhanced (weight close to 1), and the features of irrelevant or interfering channels are suppressed (weight close to 0), thereby improving the discriminative ability of the features; where the calibrated individual channel features are either the calibrated frame-level left channel features or the calibrated frame-level right channel features.
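Steps b1) to b4) can be sketched in NumPy as follows, using the stated sizes (B=4, F=125, D=1024, reduction ratio 16). The random weights stand in for trained parameters, and omitting bias terms in the MLP is an assumption made for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(h, w1, w2):
    """h: (B, F, D); w1: (D, D // r); w2: (D // r, D).

    Returns the calibrated features (B, F, D) and the weights a (B, D).
    """
    avg_desc = h.mean(axis=1)                   # b1) global average pooling over time
    max_desc = h.max(axis=1)                    # b1) global max pooling over time

    def mlp(v):
        # b2) shared-weight MLP: ReLU hidden layer, linear output layer.
        return np.maximum(v @ w1, 0.0) @ w2

    a = sigmoid(mlp(avg_desc) + mlp(max_desc))  # b3) element-wise sum, then Sigmoid
    return h * a[:, None, :], a                 # b4) weight along the channel dimension

rng = np.random.default_rng(0)
n_b, n_f, n_d, ratio = 4, 125, 1024, 16
h = rng.standard_normal((n_b, n_f, n_d))
w1 = rng.standard_normal((n_d, n_d // ratio)) * 0.01   # 1024 -> 64
w2 = rng.standard_normal((n_d // ratio, n_d)) * 0.01   # 64 -> 1024
calibrated, a = channel_attention(h, w1, w2)
```

The same function would be called once for the left channel features and once for the right channel features.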
[0041] In this embodiment, as shown in Figure 4, the implementation process of the multi-head self-attention submodule is as follows: c1) The calibrated frame-level left channel features and the calibrated frame-level right channel features are concatenated along the channel dimension to obtain the joint feature. In this embodiment, the size of the joint feature is B×F×2D. The concatenation operation realizes the preliminary fusion of the calibrated two-channel features, providing a foundation for subsequent cross-channel spatial dependency modeling.
[0042] c2) The joint features are mapped to a query matrix, a key matrix, and a value matrix through three independent linear transformation layers. Each of these matrices is then split into multiple independent query submatrices, multiple independent key submatrices, and multiple independent value submatrices according to a preset number of attention heads. In this embodiment, the input dimension of each linear transformation layer is 2D=2048, and the output dimension is 2048. The preset number of attention heads is 8, meaning the query matrix, key matrix, and value matrix are each split into 8 submatrices, each with a size of B×F×256.
[0043] c3) Calculate the attention weights from the query submatrix and the key submatrix of each attention head, and perform weighted aggregation on the corresponding value submatrix based on the attention weights. Then concatenate the outputs of all attention heads along the channel dimension to obtain the concatenated and fused features. In this embodiment, the output of the i-th attention head is calculated as: $\mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(Q_i K_i^{T} / P\right) V_i$, where $\mathrm{Attention}(Q_i, K_i, V_i)$ represents the output of the i-th attention head, $Q_i$ represents the query submatrix of the i-th attention head, $K_i$ represents the key submatrix of the i-th attention head, $V_i$ represents the value submatrix of the i-th attention head, $K_i^{T}$ represents the transpose of $K_i$, and i=1,2,...,8. The value of P is the square root of the channel dimension of a single attention head. Since the channel dimension of a single attention head is 256, P=16. P scales the dot product so that the softmax function does not saturate (which would cause vanishing gradients) when the dimension is large. Through the multi-head design, the model can simultaneously capture cross-channel spatial dependencies of different scales and types, and fully explore the spatial features of binaural spatial audio signals. The size of the concatenated and fused features is B×F×2048.
[0044] c4) The concatenated and fused features are combined with the joint features through a residual connection and layer normalization to output the dual-attention calibration features. In this embodiment, the concatenated and fused features are first passed through a linear transformation layer that keeps their channel dimension (2048 dimensions), then added to the joint features through a residual connection, followed by layer normalization (LayerNorm, with the epsilon parameter set to 1e-5), and finally passed through a dropout layer with a probability of 0.1 to obtain the dual-attention calibration features. Residual connections alleviate the vanishing gradient problem in deep networks, layer normalization accelerates model convergence, and dropout improves the generalization performance of the model. Together, the three ensure the stability and effectiveness of model training.
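Steps c3) and c4) can be sketched for a single attention head and a single feature vector as follows. This is a simplified illustration rather than the embodiment's implementation: the learnable linear layers, the learnable LayerNorm scale and shift, and dropout are omitted, and the toy inputs are hypothetical. Only the head count (8), the model dimension (2048), and the LayerNorm epsilon (1e-5) come from the text.

```python
import math

def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(Q, K, V):
    """Q, K, V: lists of frames x d_head for one head; returns frames x d_head."""
    P = math.sqrt(len(Q[0]))  # scaling factor, 16 when d_head is 256
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / P for k in K]  # q K^T / P
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_layer_norm(projected, joint, eps=1e-5):
    # Residual connection followed by layer normalization.
    return layer_norm([p + j for p, j in zip(projected, joint)], eps)

heads, d_model = 8, 2048
d_head = d_model // heads  # 256 channels per head
P = math.sqrt(d_head)      # 16.0

# Zero queries and keys give uniform weights, so the output is the mean of V.
head_out = attention_head([[0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]],
                          [[1.0, 2.0], [3.0, 4.0]])
normed = residual_layer_norm([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0])
```

The uniform-weight case is a convenient sanity check: when all scores are equal, each head simply averages its value rows.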
[0045] In this embodiment, the fusion layer employs a multilayer perceptron with the following structure: input layer (2D=2048 dimensions) → hidden layer (D=1024 dimensions) → dropout layer (probability 0.1) → output layer (D=1024 dimensions). The hidden layer uses the ReLU activation function to introduce nonlinearity, while the output layer has no activation function and directly outputs the 1024-dimensional fused features. This multilayer perceptron reduces the dimensionality of the high-dimensional features output by the multi-head self-attention submodule while learning the nonlinear fusion relationship between the calibrated two-channel features. The resulting fused features retain the spectral-temporal characteristics of each single channel and also incorporate the spatial dependency information between the two channels, providing core support for quality prediction. The size of the fused features is B×F×D.
[0046] In this embodiment, the channel attention submodule and the multi-head self-attention submodule complement each other: the channel attention submodule operates on each channel individually, enhancing quality-related features and suppressing irrelevant features by adaptively adjusting the attention weights; the multi-head self-attention submodule operates on the two channels jointly, capturing spatially specific cues of binaural spatial audio signals by modeling cross-channel and cross-temporal dependencies. The combination of these two submodules gives the fused features both spectral-temporal and spatial characteristics, overcoming the inability of traditional fusion methods to capture spatial features effectively and bringing the quality assessment results closer to human subjective perception.
[0047] In this embodiment, as shown in Figure 5, the implementation process of the lightweight prediction network is as follows: d1) The fused features are input into the lightweight encoder for local temporal pattern feature extraction, and the output is frame-level encoded features. In this embodiment, the lightweight encoder uses MobileNetV3, which is built from multiple consecutive inverted residual convolutional blocks. The structure of each inverted residual convolutional block is: 1×1 pointwise convolution (expansion) → 3×3 depthwise convolution → 1×1 pointwise convolution (compression) → BatchNorm2d (batch normalization of two-dimensional data) → HardSwish activation. The specific configuration is as follows: the input channels are expanded to a higher dimension by the first 1×1 pointwise convolution, spatial features are then extracted by the 3×3 depthwise convolution, and the result is compressed back to the original number of channels by the second 1×1 pointwise convolution, forming a "bottleneck" structure. The depthwise convolution has a kernel size of 3×3, a stride of 1 or 2 (adjusted as needed), and a padding of 1. Each pointwise convolution has a kernel size of 1×1, a stride of 1, and a padding of 0. The number of channels starts at 16, gradually expands to 96 over multiple inverted residual convolutional blocks, and is finally mapped to 256. The epsilon parameter of BatchNorm2d is set to 1e-5, and the momentum parameter is set to 0.1 (default value). The HardSwish activation function is applied in place to reduce memory usage, and it performs better than the ReLU activation function on mobile devices.
The inverted residual structure decomposes the standard convolution into pointwise and depthwise convolutions, which significantly reduces the number of parameters and the computational cost (to approximately 1/8 of the parameters of a standard convolution) while maintaining feature extraction capability, thus making the encoder lightweight. The core objective of this encoder is to capture local temporal patterns related to audio distortion in the fused features (such as local feature anomalies caused by spectral distortion, spatial positioning bias, etc.), providing targeted features for subsequent quality score prediction.
[0048] d2) The frame-level encoded features are input into a feedforward neural network (FFN) regressor to learn the mapping relationship between the frame-level encoded features and the frame-level quality score, and output the frame-level quality score. In this embodiment, the regressor adopts a lightweight design, containing two fully connected layers. The structure of the fully connected layer is: input layer (256 dimensions) → hidden layer (64 dimensions) → dropout layer (probability 0.3) → output layer (1 dimension). The hidden layer uses the ReLU activation function to introduce non-linearity and improve the model's fitting ability, the dropout layer is used to prevent overfitting, and the output layer has no activation function and directly outputs the frame-level quality score. The FFN regressor has a simple structure, fast computation speed, and small memory footprint, making it suitable for use in scenarios with few samples, and it can quickly learn the mapping relationship between frame-level encoded features and frame-level quality scores.
[0049] The combination of MobileNetV3 and FFN regressors reduces the overall parameter count of the audio quality assessment model compared to existing deep learning assessment models, while increasing inference speed. This enables the model to meet the real-time assessment needs of resource-constrained scenarios such as VR devices and in-vehicle terminals.
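The parameter savings mentioned above can be checked with a short worked example. The sketch compares a standard 3×3 convolution with a plain depthwise plus pointwise factorization at 96 input and 96 output channels, and counts the parameters of the 256 → 64 → 1 regressor. Ignoring convolution biases, ignoring the expansion layer of the inverted residual block, and including biases in the fully connected layers are simplifying assumptions made for this illustration.

```python
def standard_conv_params(c_in, c_out, k=3):
    # One k x k kernel per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    # k x k depthwise kernel per input channel, then a 1 x 1 pointwise convolution.
    return k * k * c_in + c_in * c_out

def linear_params(n_in, n_out):
    # Weights plus biases of one fully connected layer.
    return n_in * n_out + n_out

std = standard_conv_params(96, 96)        # 82944
sep = depthwise_separable_params(96, 96)  # 10080
ratio = sep / std                         # 1/9 + 1/96, roughly 0.12

regressor_params = linear_params(256, 64) + linear_params(64, 1)  # 16513
```

Under these assumptions the ratio is close to the "approximately 1/8" figure stated for the encoder, and the regressor contributes only about sixteen thousand parameters.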
[0050] d3) Perform time-averaged pooling on all frame-level quality scores through a pooling layer to obtain the global quality prediction score of the input binaural spatial audio signal. In this embodiment, time-averaged pooling can smooth fluctuations in frame-level quality scores (such as some frames having low or high frame-level quality scores due to local distortion), more accurately reflect the overall perceived quality of the audio, and make the global quality prediction score more consistent with the overall evaluation logic of subjective scoring.
[0051] d4) The global quality prediction score is mapped to a preset scoring range through a mapping layer to obtain the quality assessment result. In this embodiment, the global quality prediction score is mapped to 0-100 points so that the final quality assessment result is consistent with the MUSHRA subjective score (0-100 points).
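Steps d3) and d4) can be sketched as follows. The text does not specify how the mapping layer brings the score into the 0-100 range, so the clamping used here is an assumption chosen for illustration; the function name `global_score` is likewise hypothetical.

```python
def global_score(frame_scores, lo=0.0, hi=100.0):
    """Average frame-level scores over time, then map into [lo, hi]."""
    pooled = sum(frame_scores) / len(frame_scores)  # time-average pooling
    return min(hi, max(lo, pooled))                 # assumed mapping: clamp

smooth = global_score([80.0, 60.0, 70.0])  # a locally low frame is averaged out
capped = global_score([120.0, 110.0])      # out-of-range values are clamped
```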
[0052] In this embodiment, to address the training instability caused by inconsistent lengths of binaural spatial audio signals, a batch-based zero-padding strategy is employed. Specifically, during training, samples are read in batches, with each batch containing four samples. The length T_max of the longest binaural spatial audio signal in the current batch is calculated. Zero-padding (0) is applied to the ends of the remaining binaural spatial audio signals in the current batch, ensuring that the length of all binaural spatial audio signals in the current batch is T_max. This avoids gradient instability or feature alignment errors during model training due to length differences.
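The batch-based zero-padding strategy can be illustrated with a minimal sketch. Each binaural signal is represented here as a pair of equal-length sample lists [left, right]; this data layout and the function name `pad_batch` are assumptions for illustration.

```python
def pad_batch(batch):
    """batch: list of binaural signals, each [left, right] as sample lists.

    Returns the zero-padded batch and the length T_max of the longest signal.
    """
    t_max = max(len(sig[0]) for sig in batch)
    padded = [[ch + [0.0] * (t_max - len(ch)) for ch in sig] for sig in batch]
    return padded, t_max

batch = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],  # longest signal in the batch
    [[7.0], [8.0]],                      # shorter signal, padded at the end
]
padded, t_max = pad_batch(batch)
```

Because T_max is computed per batch, padding never exceeds what the longest signal in that batch requires.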
[0053] S4. Based on the training set, train the dual attention fusion module and the lightweight prediction network while freezing the parameters of the mono audio feature extraction module; and select the optimal model parameters on the validation set using an early stopping mechanism to obtain the trained audio quality assessment model.
[0054] In this embodiment, the specific process of step S4 is as follows: S41. Initialize model parameters: Load the pre-trained weights (downloaded from the public model repository) into the mono audio feature extraction module and freeze them. The parameters of the dual attention fusion module and the lightweight prediction network are initialized using PyTorch's default uniform distribution method, which is suitable for the initialization needs of most deep learning models.
[0055] S42. Configure training parameters: The RMSprop optimizer is selected for parameter updates of the dual attention fusion module and the lightweight prediction network. This optimizer has good stability when dealing with non-stationary targets and is suitable for training scenarios with few samples. The specific parameters of the RMSprop optimizer are set as follows: the learning rate is fixed at 0.001, which has been experimentally verified to achieve a balance between convergence speed and convergence effect; the alpha parameter is set to 0.9 to control the decay rate of the exponentially weighted average; the batch size B=4; and the maximum number of training epochs is set to 100 to ensure that the model has enough training time to converge.
[0056] S43. Define the loss function: Use mean squared error (MSE) loss to measure the difference between the quality assessment result and the corresponding subjective quality rating label. MSE loss can effectively optimize the absolute deviation between the quality assessment result and the corresponding subjective quality rating label, enabling the model to learn an accurate rating mapping relationship, which meets the optimization goal of the audio quality assessment task.
[0057] S44. Execute the training process, which includes: 1) Reading samples from the training set in batches using a custom data loader, which uses a custom function to pad binaural audio signals of different lengths; 2) Inputting the batch samples into the audio quality assessment model to obtain the corresponding quality assessment results; 3) Calculating the MSE loss of the batch; 4) Calling a function to calculate the gradients. The gradients are calculated for the entire audio quality assessment model except the MuQ model with frozen parameters (i.e., the mono audio feature extraction module); 5) Calling a function to update the parameters, and clearing the gradients through a function to avoid gradient accumulation leading to training instability. During the training process, mixed precision training is also enabled to improve training efficiency, and NaN/Inf values (invalid or non-finite values) are cleaned from the input audio, intermediate features, and model output throughout the process to ensure numerical stability.
[0058] S45. An early stopping mechanism is introduced, using the Spearman Rank Correlation Coefficient (SRCC) on the validation set as the selection criterion for optimal model parameters. After each epoch, a model evaluation process is performed on the validation set, calculating the MSE loss between the quality assessment results and subjective quality score labels on the validation set, and simultaneously calculating three correlation indicators: LCC, SRCC, and KTAU. If the SRCC on the validation set in the current epoch is higher than the historical best value, the current model parameters are saved, and the historical best indicators are updated. If there is no improvement in SRCC on the validation set for a preset number of epochs, the early stopping mechanism is triggered, training is stopped, and the historical best model parameters are loaded as the final training result. The preset number of epochs is 20 epochs. This early stopping mechanism can effectively avoid the decline in generalization performance caused by overtraining, ensuring that the model achieves optimal performance on the validation set.
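The early stopping logic of step S45 can be sketched as follows. The function takes a list of per-epoch validation SRCC values in place of a real training loop; this interface and the function name `early_stopping` are assumptions for illustration. The patience of 20 and the maximum of 100 epochs follow the embodiment.

```python
def early_stopping(val_srcc, patience=20, max_epochs=100):
    """Return (best_epoch, best_srcc, last_epoch_run) for a validation SRCC history."""
    best, best_epoch, stale, last = float("-inf"), -1, 0, -1
    for epoch, srcc in enumerate(val_srcc[:max_epochs]):
        last = epoch
        if srcc > best:
            # New best: this is where the model parameters would be saved.
            best, best_epoch, stale = srcc, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best, last

# SRCC peaks at epoch 2 and never improves again, so training stops at epoch 22.
history = [0.5, 0.6, 0.7] + [0.65] * 30
result = early_stopping(history)
```

After the loop ends, the parameters saved at `best_epoch` would be reloaded as the final model.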
[0059] This embodiment employs a combined training strategy of "pre-trained feature transfer + parameter freezing + early stopping mechanism," enabling the audio quality assessment model to achieve stable convergence and excellent performance on the ODAQ dataset with only 240 samples. The parameter freezing strategy significantly reduces the number of trainable parameters, lowering the risk of overfitting; the early stopping mechanism effectively avoids overtraining of the audio quality assessment model; and the combination of the RMSprop optimizer and MSE loss ensures rapid convergence and accurate fitting of the audio quality assessment model. This training strategy fully considers the characteristics of the few-sample scenario, providing strong support for the performance of the audio quality assessment model.
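For reference, a single RMSprop update of one scalar parameter with the stated settings (learning rate 0.001, alpha 0.9) can be sketched as follows. The epsilon value of 1e-8 is an assumption (it is the PyTorch default and is not stated in the embodiment), and momentum and weight decay are assumed to be zero.

```python
import math

def rmsprop_step(param, grad, square_avg, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSprop update for a scalar parameter.

    square_avg is the exponentially weighted average of squared gradients.
    """
    square_avg = alpha * square_avg + (1.0 - alpha) * grad * grad
    param = param - lr * grad / (math.sqrt(square_avg) + eps)
    return param, square_avg

# First step from param=1.0 with gradient 2.0 and an empty gradient history.
new_param, new_avg = rmsprop_step(1.0, 2.0, 0.0)
```

Dividing by the running root-mean-square of the gradient keeps the step size on the order of the learning rate even when gradient magnitudes vary between batches.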
[0060] S5. Using the trained audio quality assessment model, perform quality assessment on each binaural spatial audio signal in the test set.
[0061] To further verify the feasibility and effectiveness of the method of the present invention, experiments were conducted on the method of the present invention.
[0062] I. Experimental Setup
1.1 Dataset
This experiment uses the publicly available ODAQ dataset, which contains 240 binaural spatial audio signals. Detailed information such as the composition of audio content, subjective scoring methods, and scoring range has been described in step S1 and will not be repeated here.
[0063] 1.2 Data Preprocessing
Before being input into the audio quality assessment model, all binaural spatial audio signals are processed using an in-batch zero-padding strategy to address the inconsistency in the length of binaural spatial audio signals within the current batch, where the batch size B=4.
[0064] 1.3 Evaluation Indicators
This experiment uses three correlation metrics to evaluate model performance. LCC measures the linear fit between the quality assessment results output by the model and the subjective quality score labels. Its value lies in [-1, 1], and the closer the value is to 1, the stronger the linear correlation.
[0065] SRCC measures the consistency between the quality assessment results output by the model and the ranking of subjective quality rating labels, focusing on the accuracy of the ranking of quality levels, and its value ranges from [-1, 1].
[0066] KTAU, compared to SRCC, places greater emphasis on the consistency of pairwise sample ranking and is used to supplement the validation of the correlation between model predictions and the level of subjective quality rating labels.
[0067] LCC focuses on numerical fit, while SRCC and KTAU focus on ranking consistency. Combining the three evaluates model performance comprehensively, avoids the limitations of a single indicator, and meets the core requirements of human subjective evaluation.
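The three indicators can be computed with a minimal sketch such as the following. It assumes that LCC is computed as Pearson's correlation coefficient, that SRCC is Pearson's coefficient applied to average ranks, and that KTAU is Kendall's tau without tie correction (tau-a); the embodiment does not state these implementation details.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # 1-based ranks; tied values share their average rank.
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    # (concordant pairs - discordant pairs) / total number of pairs
    n, s = len(x), 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += (prod > 0) - (prod < 0)
    return s / (n * (n - 1) / 2)

predicted = [1.0, 2.0, 3.0, 4.0]   # hypothetical model outputs
labels = [1.0, 3.0, 2.0, 4.0]      # hypothetical subjective scores
srcc = spearman(predicted, labels)     # 0.8
ktau = kendall_tau(predicted, labels)  # 4/6
```

With one swapped pair out of six, KTAU drops further than SRCC, which illustrates why KTAU is the stricter measure of pairwise ranking consistency.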
[0068] II. Comparison of Experimental Results
2.1 Baseline Models
This experiment selected the reproducible Ambiqual, QASTAnet, and eMoBi-Q models from the field of spatial audio quality assessment as intrusive (reference-based) baselines for comparison. In addition, results were obtained for PESQ and SI-SDR, the benchmark models in the ODAQ dataset; these two are widely used monophonic audio quality assessment models. All five baselines are intrusive and require a reference signal to complete the assessment.
[0069] 2.2 Analysis of Experimental Results
Table 1 shows the performance comparison of the method of the present invention with other baseline models on the ODAQ dataset.
[0070] Table 1. Performance comparison of the method of this invention with other baseline models on the ODAQ dataset.
[0071] In Table 1, Ambiqual refers to the method disclosed in AMBIQUAL: Towards a Quality Metric for Headphone Rendered Compressed Ambisonic Spatial Audio, published in Applied Sciences; QASTAnet refers to the method disclosed in QASTAnet: A DNN-based Quality Metric for Spatial Audio, published in the 2026 IEEE International Conference on Acoustics, Speech and Signal Processing; eMoBi-Q refers to the method disclosed in A Computationally Efficient Model for Combined Assessment of Monaural and Binaural Audio Quality, published in the Journal of the Audio Engineering Society; SI-SDR refers to the method disclosed in SDR – Half-baked or Well Done?, published in the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing; and PESQ refers to the method disclosed in Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs, published in the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing.
[0072] Experimental results show that: 1. The method of this invention outperforms all intrusive models on all metrics (LCC=0.9165, SRCC=0.8941, KTAU=0.7323), indicating that it is more consistent with human judgment in terms of both scoring accuracy and ranking consistency.
[0073] 2. Compared with QASTAnet, the method of this invention improves LCC from 0.7812 to 0.9165, with a relative gain of 17.3%; SRCC and KTAU are improved by 0.0828 and 0.0667, respectively.
[0074] 3. Compared with Ambiqual and eMoBi-Q, the method of the present invention has more obvious advantages, with LCC increased by 0.4766 and 0.5137 respectively.
[0075] 4. Compared to PESQ, the method of this invention achieves an absolute improvement of 0.4085 in LCC; compared to SI-SDR, the method of this invention achieves 0.8941 in SRCC, corresponding to an absolute improvement of 0.5251. This significant performance gap indicates that the degradation of spatial audio quality is closely related to the impairment of spatial structure consistency, rather than simple waveform-level distortion.
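The gains quoted above can be reproduced with simple arithmetic. The sketch below uses only the values stated in the text; the baseline values for SI-SDR and PESQ are not quoted directly here and are derived from the stated absolute improvements, so they should be read as implied values rather than figures copied from Table 1.

```python
ours_lcc, ours_srcc = 0.9165, 0.8941
qastanet_lcc = 0.7812

# Relative LCC gain over QASTAnet: (0.9165 - 0.7812) / 0.7812
relative_gain = (ours_lcc - qastanet_lcc) / qastanet_lcc

# Baseline values implied by the stated absolute improvements.
implied_sisdr_srcc = ours_srcc - 0.5251
implied_pesq_lcc = ours_lcc - 0.4085

gain_percent = round(relative_gain * 100, 1)
```

The relative gain evaluates to about 17.3%, matching the figure stated for QASTAnet.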
[0076] III. Ablation Experiment Results
To verify the positive role of each module in the method of this invention, an ablation experiment was conducted, and the results are shown in Table 2.
[0077] Table 2 Performance comparison of the method of the present invention on the ODAQ dataset after removing each module.
[0078] The results of the ablation experiment showed that: 1. The removal of the mono audio feature extraction module resulted in a significant performance degradation, with LCC, SRCC, and KTAU decreasing by 59.1%, 59.2%, and 59.0%, respectively. This underscores the indispensable role of this module in extracting perceptually relevant features and confirms the effectiveness of transferring the MuQ model to the binaural spatial audio domain.
[0079] 2. Removal of the multi-head self-attention submodule resulted in a 13.1%, 15.2%, and 25.2% decrease in LCC, SRCC, and KTAU, respectively. The particularly significant decrease in KTAU indicates that the module's inter-channel modeling ability is crucial for learning robust ordinal relationships between quality scores.
[0080] 3. The ablation of the channel attention submodule and the lightweight encoder has a relatively small impact on performance, indicating that they make an auxiliary contribution to the overall quality prediction.
[0081] IV. Comparison Experiment of Data Processing Methods To verify that the feedforward neural network regressor and data processing method in the lightweight prediction network of the present invention can most effectively handle the case of few samples, the feedforward neural network regressor was replaced with an RNN (Recurrent Neural Network) decoder and an FFN (feedforward neural network) decoder, and different data processing methods were used for experimental verification. The experimental results are shown in Table 3.
[0082] Table 3. Comparison of Adjusted Regressor and Data Processing Methods on the ODAQ Dataset
[0083] 4.1 Three Data Processing Methods
Average score method: The scores from the 26 judges for each audio file are aggregated to generate a consensus score. Across all regressor architectures, this method yields the best performance, validating its effectiveness and stability in mapping audio features to a unified quality score. This method best aligns with the design intent of the present invention.
[0084] Full-score method: Each judge's score is treated as an independent sample. Except with the RNN-decoder, this method performs poorly on all regressors and metrics. Its poor performance stems from the subjective bias introduced within the same audio sample, which obscures the feature-score mapping and increases the difficulty of model learning.
[0085] Identity encoding method: Judge identities are used in the encoder to model personal bias. Compared with the average score method, the LCC performance of RNN-decoder and FFN-decoder is improved, but neither can outperform the average score method in the prediction network scenario.
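The difference between the three data processing methods lies in how training samples are built from the judges' ratings, which can be sketched as follows. The dictionary layout, the clip names, and the function names are hypothetical, and how the judge identity is consumed by the model is not specified in the text, so the third function only keeps the judge index alongside the score.

```python
def average_score_samples(ratings):
    """One (audio_id, consensus score) sample per audio file."""
    return [(audio, sum(scores) / len(scores)) for audio, scores in ratings.items()]

def full_score_samples(ratings):
    """One (audio_id, score) sample per individual judge rating."""
    return [(audio, s) for audio, scores in ratings.items() for s in scores]

def identity_samples(ratings):
    """One (audio_id, judge_index, score) sample per rating, keeping judge identity."""
    return [(audio, j, s) for audio, scores in ratings.items()
            for j, s in enumerate(scores)]

# Hypothetical ratings from two judges (the dataset described above uses 26).
ratings = {"clip_a": [80.0, 60.0], "clip_b": [30.0, 50.0]}
avg = average_score_samples(ratings)
full = full_score_samples(ratings)
ident = identity_samples(ratings)
```

In the full-score variant the same audio features are paired with several different targets, which is the source of the label noise described for that method.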
[0086] 4.2 Comparison of Regressor Architectures
Experimental results show that, under the same data processing method, the regressor architectures exhibit different performance patterns. When using the average score method, the variant based on the lightweight prediction network outperforms the other variants on all metrics, confirming its superiority in capturing the mapping relationship between frame-level encoded features and frame-level quality scores, thus justifying its selection as the final regressor architecture.
[0087] In summary, the method of this invention addresses the data scarcity challenge of few-sample training by transferring the robust feature representation capabilities of a mono pre-trained model; it captures spatially specific cues in binaural spatial audio through innovative dual-attention fusion, improving the consistency between quality assessment results and human subjective perception; and it reduces model computational complexity and resource consumption through a lightweight prediction network architecture design, enabling efficient deployment. Ultimately, it achieves the goal of reference-free, high-precision, few-sample-adaptive, and lightweight binaural spatial audio quality assessment, effectively filling the gap in the practical application of existing technologies.
Claims
1. A method for assessing binaural spatial audio quality with few samples and no reference, characterized in that it includes the following steps:
S1. Obtain an audio dataset containing a small number of samples, each of which includes a binaural spatial audio signal and its corresponding subjective quality rating label;
S2. Divide the audio dataset into a training set, a validation set, and a test set;
S3. Construct an audio quality assessment model, which decomposes the input binaural spatial audio signal into a left channel signal and a right channel signal, and then performs the following stages in sequence: in the feature extraction stage, the left channel signal and the right channel signal are respectively input into the pre-trained mono audio feature extraction module, which outputs frame-level left channel features and frame-level right channel features accordingly; in the dual attention fusion stage, the frame-level left channel features and the frame-level right channel features are input into the dual attention fusion module, channel-level calibration is performed on each, and cross-channel spatial dependency modeling is performed on the calibrated two-channel features to output fused features; in the quality prediction stage, the fused features are input into a lightweight prediction network, which outputs the quality assessment result;
S4. Based on the training set, train the dual attention fusion module and the lightweight prediction network, and freeze the parameters of the mono audio feature extraction module during the training process; and select the optimal model parameters based on the validation set using an early stopping mechanism to obtain the trained audio quality assessment model;
S5. Using the trained audio quality assessment model, perform quality assessment on each binaural spatial audio signal in the test set.
2. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 1, characterized in that the mono audio feature extraction module employs a self-supervised music representation learning model based on Mel residual vector quantization.
3. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 1, characterized in that the implementation process of the dual attention fusion module is as follows: The frame-level left channel features and the frame-level right channel features are respectively input into the channel attention submodule for channel-level calibration, and the calibrated frame-level left channel features and the calibrated frame-level right channel features are output accordingly; The calibrated frame-level left channel features and the calibrated frame-level right channel features are input into the multi-head self-attention submodule to perform cross-channel spatial dependency modeling, in order to capture spatially specific cues of the input binaural spatial audio signal and output dual-attention calibration features; The dual-attention calibration features are input into the fusion layer for dimensionality reduction, the nonlinear fusion relationship between the calibrated frame-level left channel features and the calibrated frame-level right channel features is learned, and the fused features are output.
4. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 3, characterized in that the implementation process of the channel attention submodule is as follows: Global average pooling and global max pooling operations are performed on a single channel feature in the time dimension to obtain a global statistical context descriptor and a global extreme value context descriptor, respectively; wherein the single channel feature is the frame-level left channel feature or the frame-level right channel feature; The global statistical context descriptor and the global extreme value context descriptor are respectively input into a multilayer perceptron with shared weights and subjected to nonlinear transformation to obtain the first channel importance vector and the second channel importance vector; The first channel importance vector and the second channel importance vector are added element by element, and the result is normalized to obtain the channel attention weights; The single channel feature is weighted along the channel dimension according to the channel attention weights to obtain the calibrated single channel feature; wherein the calibrated single channel feature is either the calibrated frame-level left channel feature or the calibrated frame-level right channel feature.
5. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 3, characterized in that the implementation process of the multi-head self-attention submodule is as follows: The calibrated frame-level left channel features and the calibrated frame-level right channel features are concatenated along the channel dimension to obtain joint features; The joint features are mapped to a query matrix, a key matrix, and a value matrix, respectively, and the query matrix, key matrix, and value matrix are split into multiple query submatrices, multiple key submatrices, and multiple value submatrices according to a preset number of attention heads; The attention weights are calculated from the query submatrix and key submatrix of each attention head, and weighted aggregation is performed on the corresponding value submatrix according to the attention weights; the outputs of all attention heads are then concatenated along the channel dimension to obtain the concatenated and fused features; The concatenated and fused features and the joint features are subjected to a residual connection and layer normalization to output the dual-attention calibration features.
6. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 1, characterized in that the implementation process of the lightweight prediction network is as follows: The fused features are input into a lightweight encoder to extract features of local temporal patterns and output frame-level encoded features; The frame-level encoded features are input into a feedforward neural network regressor to learn the mapping relationship between the frame-level encoded features and the frame-level quality scores, and the frame-level quality scores are output; Time-averaged pooling is performed on all frame-level quality scores through a pooling layer to obtain the global quality prediction score of the input binaural spatial audio signal; The global quality prediction score is mapped to a preset scoring range through a mapping layer to obtain the quality assessment result.
7. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 6, characterized in that the lightweight encoder is constructed based on inverted residual convolutional blocks, each of which sequentially includes a first pointwise convolution for expanding dimensions, a depthwise convolution for extracting spatial features, and a second pointwise convolution for compressing dimensions.
8. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 1, characterized in that the training process employs an early stopping mechanism, using the Spearman rank correlation coefficient on the validation set as the selection criterion for the optimal model parameters; when the Spearman rank correlation coefficient does not improve for a preset number of consecutive epochs, training is stopped and the historically optimal model parameters are loaded.
9. The method for assessing binaural spatial audio quality with few samples and no reference, as described in claim 1, characterized in that, during the training process, the RMSprop optimizer is used to update the parameters of the dual attention fusion module and the lightweight prediction network.