Quality evaluation method and electronic device

CN116110433B · Active · Publication Date: 2026-06-19 · BEIJING YOUZHUJU NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING YOUZHUJU NETWORK TECH CO LTD
Filing Date: 2023-01-16
Publication Date: 2026-06-19

Application Information

IPC: G10L25/60; G10L25/03; G10L25/27; G10L17/04

AI Tagging

Application Domain

Speech analysis

AI Technical Summary

Technical Problem

Existing subjective evaluation methods for speech quality require extensive manual annotation, resulting in high time and labor costs, and existing predictive MOS schemes are not accurate enough.

Method used

A quality assessment model is constructed from at least two pre-trained sub-models, each obtained through pairwise training that considers the relative ordering of the outputs corresponding to a pair of inputs. The MOS predictions of the sub-models are concatenated and passed through a fully connected layer to obtain the final MOS value. Each sub-model may include a self-supervised learned (SSL) model and a fully connected sub-layer.

Benefits of technology

It reduces labor and time costs, improves the accuracy of MOS prediction and speech quality ranking, and enhances the generalization performance of the model.

✦ Generated by Eureka AI based on patent content.

Patent Text Reader

Abstract

This disclosure relates to quality assessment methods, apparatus, electronic devices, computer-readable storage media, and computer program products. The method includes: acquiring a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method; and inputting input speech into the trained quality assessment model to obtain model output. In this manner, embodiments of this disclosure can determine the MOS value of input speech through the quality assessment model without manual annotation, reducing labor and time costs. Furthermore, in embodiments of this disclosure, since each pre-trained sub-model in the quality assessment model is obtained using a pairwise training method, the MOS values of different speech samples can be accurately ranked.

Description

Technical Field

[0001] This disclosure generally relates to the field of computers, and more specifically to quality assessment methods and electronic devices.

Background Technology

[0002] Speech quality evaluation metrics aim to measure the quality of speech, where the speech being measured can be synthesized speech, such as speech synthesized from text or speech that has been converted. Speech quality evaluation metrics include objective evaluation metrics and subjective evaluation metrics.

[0003] Objective evaluation metrics can quantitatively assess various parameters of speech, thereby measuring the quality of speech. However, due to differences in human subjective perception, there is a gap between objective evaluation metrics and human experience. Therefore, using subjective evaluation metrics can better reflect human perception of speech quality.

[0004] Generally, subjective evaluation metrics require a large number of professionally trained annotators to annotate speech data, which results in significant time and manpower costs.

Summary of the Invention

[0005] According to an example embodiment of this disclosure, a quality assessment scheme is provided, wherein each pre-trained sub-model in the quality assessment model is obtained through a pairwise training method.

[0006] In a first aspect of this disclosure, a quality assessment method is provided, comprising: obtaining a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method; and inputting input speech into the trained quality assessment model to obtain a model output.

[0007] In a second aspect of this disclosure, an electronic device is provided, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions causing the electronic device to perform the method described in the first aspect of this disclosure when executed by the at least one processing unit.

[0008] In a third aspect of this disclosure, a quality assessment apparatus is provided, comprising: a model acquisition unit configured to acquire a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method; and an output determination unit configured to input input speech into the trained quality assessment model to obtain a model output.

[0009] In a fourth aspect of this disclosure, a computer-readable storage medium is provided having machine-executable instructions stored thereon, which, when executed by a device, cause the device to perform the method described in the first aspect of this disclosure.

[0010] In a fifth aspect of this disclosure, a computer program product is provided, including computer-executable instructions, wherein the computer-executable instructions, when executed by a processor, implement the method described in the first aspect of this disclosure.

[0011] In a sixth aspect of this disclosure, an electronic device is provided, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions causing the electronic device to perform a method according to a first aspect of this disclosure when executed by the at least one processing unit.

[0012] The summary section is provided to introduce a series of concepts in a simplified form, which will be further described in the detailed description below. The summary section is not intended to identify key or essential features of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0013] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

[0014] Figure 1 shows a schematic diagram of the model structure of a quality assessment model according to some embodiments of the present disclosure;

[0015] Figure 2 shows a schematic flowchart of the model training process according to some embodiments of the present disclosure;

[0016] Figure 3 shows a schematic diagram of pairwise training according to some embodiments of the present disclosure;

[0017] Figure 4 shows another schematic flowchart of the model training process according to some embodiments of the present disclosure;

[0018] Figure 5 shows a schematic diagram of determining out-of-set data items according to some embodiments of the present disclosure;

[0019] Figure 6 shows a flowchart of the model usage process according to some embodiments of this disclosure;

[0020] Figure 7 shows a block diagram of an example apparatus according to some embodiments of the present disclosure; and

[0021] Figure 8 shows a block diagram of an example device that can be used to implement embodiments of the present disclosure.

Detailed Implementation

[0022] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0023] As mentioned earlier, subjective evaluation of speech (also known as audio) better reflects human perception of speech quality. The mean opinion score (MOS) is a subjective evaluation metric for speech quality. For example, for a given piece of speech, multiple annotators can assign integer scores between 1 and 5, and the MOS is then calculated based on the average of these scores. However, subjective evaluation metrics require a large number of annotators to annotate the data, which undoubtedly incurs significant time and manpower costs.
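The MOS calculation described above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `mean_opinion_score` and the example ratings are assumptions made for this sketch and do not come from the disclosure.

```python
def mean_opinion_score(ratings):
    """Average the integer scores (1 to 5) assigned by several annotators."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return sum(ratings) / len(ratings)


# Hypothetical scores from four annotators for one piece of speech.
mos = mean_opinion_score([4, 5, 3, 4])
```

Every additional rated utterance needs its own set of annotator scores, which is the cost the disclosure aims to avoid.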

[0024] One cost-reducing approach is to build a model that predicts the MOS (mean opinion score) to fit subjective evaluation metrics, thereby enabling automated evaluation. For example, the predicted MOS can be used as annotation information for individual data items in a dataset, eliminating the need for manual annotation. However, current methods for predicting MOS rely on predicting single speech segments or, after incorporating annotator information, predicting a specific annotator's score, resulting in inaccurate MOS values.

[0025] To address the aforementioned problems and other potential issues, embodiments of this disclosure provide a quality assessment model for predicting MOS values for speech. Speech can be input in parallel to at least two pre-trained sub-models in the quality assessment model to obtain at least two corresponding MOS prediction values, and these at least two MOS prediction values can be further concatenated to obtain the final MOS value of the speech. Each pre-trained sub-model is obtained through a pairwise training method, in which the sub-model is trained by considering the relative ordering of the outputs corresponding to a pair of inputs to the sub-model. In this way, the solution of embodiments of this disclosure eliminates the need for manual MOS annotation, reducing labor and time costs. Furthermore, since each pre-trained sub-model is trained through a pairwise training method, the accuracy of the MOS values output by the quality assessment model can be ensured.

[0026] The quality assessment model in the embodiments of this disclosure may also be referred to as a MOS prediction model or a MOS prediction based Pair Comparison (MOSPC) model, or may be referred to by other names, which are not limited in this disclosure.

[0027] Figure 1 shows a schematic diagram of the model structure of a quality assessment model 100 according to some embodiments of the present disclosure. The quality assessment model 100 includes multiple pre-trained sub-models, such as the seven pre-trained sub-models 110 to 170 shown in Figure 1. The quality assessment model 100 also includes a fully connected (FC) layer 180. As shown in Figure 1, the seven outputs corresponding to the seven pre-trained sub-models 110 to 170 can be concatenated and input into the FC layer 180 to obtain the output of the quality assessment model 100.

[0028] In some embodiments, each pre-trained sub-model may include a self-supervised learned (SSL) model and a fully connected (FC) sub-layer. For example, the k-th pre-trained sub-model among the seven pre-trained sub-models may include an SSL model $f_k$ and an FC sublayer $fc_k$, where k is any positive integer from 1 to 7.

[0029] For example, the SSL model $f_k$ may also be called a self-supervised pre-trained sub-model, self-supervised learning pre-trained sub-model, SSL pre-trained sub-model, or other names, which are not limited in this disclosure.

[0030] For example, SSL models $f_1$ to $f_7$ can be, in sequence: the wav2vec small model (wav2vec_small), the wav2vec large model (wav2vec_large), the HuBERT base model (hubert_base), an extended wav2vec large model (wav2vec_large(lv60)), the WavLM base model (wavlm_base), the WavLM base+ model (wavlm_base+), and the WavLM large model (wavlm_large). It is understood that SSL models $f_1$ to $f_7$ can have different model structures, numbers of parameters, etc. For example, wav2vec_large and wav2vec_small have similar network structures, but wav2vec_large has more network parameters. As another example, wavlm_base, wavlm_base+, and wavlm_large have similar network structures, but wavlm_base+ has more network parameters than wavlm_base, and wavlm_large has more network parameters than wavlm_base+.

[0031] It is understood that the quality assessment model 100 in the embodiments of this disclosure is constructed as an improvement on the Fusion of Self-Supervised Learned (Fusion-SSL) model. It should be noted that the model 100 shown in Figure 1 is merely illustrative. In actual model structures, the quality assessment model may also include other modules or network layers, which is not limited in this disclosure.

[0032] In some embodiments of this disclosure, each pre-trained sub-model can be generated using a pairwise training method. Exemplarily, the pairwise training method may be referred to as a pairwise comparison training method or other names, which are not limited in this disclosure.

[0033] Figure 2 A schematic flowchart of a model training process 200 according to some embodiments of the present disclosure is shown. In block 210, a first training dataset is constructed, which may include multiple batches, each batch including multiple data items, each data item including speech samples and labels. In block 220, a pre-trained sub-model is trained based on the first training dataset using a pairwise training method.

[0034] It can be understood that each pre-trained sub-model in Figure 1 can be generated through the training process 200. In other words, each pre-trained sub-model is obtained independently through training.

[0035] For example, each data item in the first training dataset can be represented as (x1, y1), where x1 can represent a speech sample and y1 can represent a label, such as y1 being the MOS value of x1. Optionally, the labels in the data items can be manually labeled or predicted by other models, and this disclosure is not limited in this regard.

[0036] In some embodiments of this disclosure, pre-trained sub-models can be generated using a pairwise training method. Specifically, the pairwise training method involves using two data items from the same batch as a pair of inputs during training. For example, a pair of inputs can be fed separately into the sub-model, and a loss function can be constructed based on the corresponding pair of outputs and the labels corresponding to this pair of inputs.

[0037] Specifically, two data items from the same batch of the first training dataset can be obtained; two speech samples from the two data items can be input into the pre-trained sub-model to be trained to obtain two outputs; based on the two outputs and the two labels corresponding to the two speech samples of the two data items, a loss function can be constructed; and the pre-trained sub-model can be trained based on the loss function. Taking the k-th pre-trained sub-model as an example, an example training process is described below with reference to Figure 3, which shows a schematic diagram of pairwise training 300 according to some embodiments of the present disclosure.
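The first step above, taking two data items from the same batch as a pair of inputs, can be sketched as follows. The disclosure does not state how pairs are selected within a batch; enumerating every unordered pair is an assumption of this sketch, as are the function name `make_pairs` and the file names.

```python
from itertools import combinations


def make_pairs(batch):
    """Return every unordered pair of data items drawn from a single batch.

    Assumption: all pairs are used. The source only requires that both
    items of a pair come from the same batch.
    """
    return list(combinations(batch, 2))


# Hypothetical batch of (speech sample, MOS label) data items.
batch = [("a.wav", 3.5), ("b.wav", 4.2), ("c.wav", 2.1)]
pairs = make_pairs(batch)
```

A batch of n data items yields n(n-1)/2 pairs under this assumption, so each pair contributes one relative ranking term during training.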

[0038] As shown in Figure 3, the k-th pre-trained sub-model includes the SSL model $f_k$ and the FC sublayer $fc_k$. Suppose that the two speech samples from two data items in the same batch are $x_i$ and $x_j$. Then $x_i$ and $x_j$ can be used as a pair of inputs, each fed to the k-th pre-trained sub-model, as shown in the figure. Assume the corresponding pair of outputs is $m_{ki}$ and $m_{kj}$.

[0039] In embodiments of this disclosure, a loss function can be constructed, and training can be performed based on the loss function. Exemplarily, the loss function can be referred to as a pairwise training loss function, denoted as $L_{pair}$. The loss function may include a relative ranking loss corresponding to a pair of inputs. Optionally, the loss function may also include a first-order (L1) loss corresponding to each input.

[0040] For example, a relative ranking loss can be constructed based on the two outputs and the two labels corresponding to the two speech samples of the two data items; and a loss function can be constructed based on the relative ranking loss and two first-order losses corresponding to the two speech samples.

[0041] The relative ranking loss can be expressed as $L_{rank}$, and the two first-order losses as $L_{d1}$ and $L_{d2}$. Therefore, the loss function in this embodiment can be expressed as equation (1) as follows:

[0042] $L_{pair} = (1-\beta) L_{rank} + \beta (L_{d1} + L_{d2})$ (1)

[0043] In equation (1), β is a hyperparameter that can represent weights, for example, a value between 0 and 1. L1 loss, also known as mean absolute error (MAE), can be expressed as the average of the absolute differences between the model's predicted values (output) and the true values (labels).
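Equation (1) can be illustrated with a short Python sketch that combines a given relative ranking loss with the two L1 losses. The function names, the default β of 0.5, and the numeric values are assumptions chosen for illustration; the disclosure only states that β can be a weight between 0 and 1.

```python
def l1_loss(prediction, label):
    """Absolute difference between one predicted MOS value and its label."""
    return abs(prediction - label)


def pair_loss(rank_loss, m_i, m_j, y_i, y_j, beta=0.5):
    """Equation (1): (1 - beta) * L_rank + beta * (L_d1 + L_d2)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is expected to lie between 0 and 1")
    return (1.0 - beta) * rank_loss + beta * (l1_loss(m_i, y_i) + l1_loss(m_j, y_j))


# Hypothetical values: ranking loss 0.4, outputs 3.0 and 4.0, both labels 3.5.
loss = pair_loss(0.4, 3.0, 4.0, 3.5, 3.5, beta=0.5)
```

Setting β closer to 1 emphasizes fitting the absolute MOS values, while β closer to 0 emphasizes the relative ordering of the pair.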

[0044] It can be understood that, during training, optimizing $L_{d1}$ and $L_{d2}$ lets the model progressively learn to predict MOS values, while optimizing $L_{rank}$ lets the model progressively learn to rank the two speech samples.

[0045] For example, two outputs can be mapped to probabilities using a logistic function; a relative value can be determined by the relative magnitude of the two labels corresponding to the two speech samples of the two data items; and a relative ranking loss can be constructed based on the probability and the relative value. For instance, the probability can be represented as P, and the relative value can be represented as L. In some embodiments, the relative ranking loss $L_{rank}$ can be expressed as the following equation (2):

[0046] $L_{rank} = -L \log(P) - (1-L) \log(1-P)$ (2)

[0047] Referring to equation (2), the relative ranking loss $L_{rank}$ has a form similar to cross-entropy loss. The relative ranking loss uses a logistic function to map a pair of outputs $m_{ki}$ and $m_{kj}$ to the probability P, as shown in equation (3) below:

[0048]

[0049] In equation (2), L can be determined based on the relative size of the labels corresponding to a pair of inputs. The two speech samples that form a pair of inputs are $x_i$ and $x_j$. Assume the labels of these two speech samples are $y_i$ and $y_j$. Then L in equation (2) can be determined using equation (4):

[0050]
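Equations (3) and (4) are not reproduced in this text, so the following Python sketch of the relative ranking loss in equation (2) rests on two stated assumptions: that P is the logistic function of the difference between the two outputs, and that L is 1 when the first label is larger, 0 when it is smaller, and 0.5 for equal labels. Both choices follow common pairwise ranking practice and are not confirmed by the source; all function names are hypothetical.

```python
import math


def pair_probability(m_i, m_j):
    """Assumed form of equation (3): logistic function of the output difference.

    Outputs are assumed to be on the MOS scale, so the exponent stays small.
    """
    return 1.0 / (1.0 + math.exp(-(m_i - m_j)))


def relative_value(y_i, y_j):
    """Assumed form of equation (4): 1 if y_i > y_j, 0 if y_i < y_j, 0.5 if tied."""
    if y_i > y_j:
        return 1.0
    if y_i < y_j:
        return 0.0
    return 0.5


def rank_loss(m_i, m_j, y_i, y_j, eps=1e-12):
    """Equation (2): -L * log(P) - (1 - L) * log(1 - P)."""
    p = pair_probability(m_i, m_j)
    p = min(max(p, eps), 1.0 - eps)  # keep both logarithms finite
    target = relative_value(y_i, y_j)
    return -target * math.log(p) - (1.0 - target) * math.log(1.0 - p)
```

Under these assumptions the loss is small when the order of the two outputs agrees with the order of the two labels and grows when the outputs are ordered the wrong way round.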

[0051] In this way, pre-trained sub-models can be obtained by training based on the loss function $L_{pair}$. By applying process 200, seven pre-trained sub-models can be obtained, and the quality assessment model 100 illustrated in Figure 1 can then be used to determine the MOS value of the input speech. Furthermore, in the embodiments of this disclosure, due to the use of pairwise training, the MOS values of different speech samples can be ranked. In other words, the solution of the embodiments of this disclosure can be used to accurately rank the speech quality of different speech samples.

[0052] In addition, to enhance the generalization performance of the model, embodiments of this disclosure can further train the pre-trained sub-models obtained through pairwise training. Figure 4 shows a schematic flowchart of a model training process 400 according to some embodiments of the present disclosure. In block 410, a second training dataset is constructed, comprising multiple data items, each including an in-set speech sample and an in-set label. In block 420, an out-of-set data item is constructed based on two data items from the second training dataset, comprising an out-of-set speech sample and an out-of-set label. In block 430, a pre-trained sub-model is trained based on the second training dataset and the out-of-set data item.

[0053] For example, after process 200 shown in Figure 2, the pre-trained sub-models obtained through pairwise training can be further trained via process 400. It is understood that the second training dataset and the first training dataset can be independent of each other or related to each other; this disclosure does not limit this.

[0054] In some embodiments, the pre-trained sub-model may include a feature extraction module and an encoder module. Specifically, the two in-set speech samples corresponding to two data items can be input into the feature extraction module, and the out-of-set speech sample of the out-of-set data item can be determined based on the output of the feature extraction module.

[0055] For example, assuming two data items include a first data item and a second data item, the second data item can be determined based on the sampling probability of a symmetric Gaussian kernel for the first data item. This is how the two data items used to determine out-of-set data items are determined.

[0056] As an example, suppose the first data item is represented as $(x1_i, y1_i)$. Then, based on the first data item, a sampling probability distribution based on a symmetric Gaussian kernel can be constructed to determine the second data item $(x1_j, y1_j)$ from the second training dataset. The sampling probability distribution is as shown in equation (5):

[0057]

[0058] In equation (5), d(i,j) represents the distance between $y1_i$ and $y1_j$, and the bandwidth of the kernel can be represented by a hyperparameter. In some examples, the probability values can be normalized so that they sum to 1. The second data item can then be obtained by sampling according to these normalized probabilities.
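Equation (5) is not reproduced in this text. The sketch below assumes the usual Gaussian kernel form, exp(-d(i,j) / (2σ²)), with d(i,j) taken as the squared difference between the two labels and σ as the bandwidth hyperparameter. It also assumes that the first data item is excluded from being its own partner. These choices, and the function names, are assumptions made for illustration.

```python
import math
import random


def sampling_probabilities(labels, i, bandwidth=1.0):
    """Normalized Gaussian-kernel probabilities of picking each item j as the partner of item i."""
    weights = []
    for j, y_j in enumerate(labels):
        if j == i:
            weights.append(0.0)  # assumption: an item is never paired with itself
        else:
            distance = (labels[i] - y_j) ** 2  # assumed d(i, j)
            weights.append(math.exp(-distance / (2.0 * bandwidth ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no candidate partner has non-zero weight")
    return [w / total for w in weights]  # probabilities sum to 1


def sample_partner(labels, i, bandwidth=1.0, rng=random):
    """Sample the index of the second data item for the first data item i."""
    probs = sampling_probabilities(labels, i, bandwidth)
    return rng.choices(range(len(labels)), weights=probs, k=1)[0]
```

With this kernel, data items whose labels are close to that of the first data item are sampled more often, and a smaller bandwidth concentrates the sampling on the nearest labels.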

[0059] In some embodiments of this disclosure, the two in-set speech samples of two data items can be input into the feature extraction module of the pre-trained sub-model to be trained to obtain two intermediate outputs; based on the two intermediate outputs, the embedding representation of the out-of-set speech sample can be determined; and based on the two in-set labels corresponding to the two in-set speech samples of the two data items, the out-of-set label can be determined. Figure 5 shows a schematic diagram of determining out-of-set data items 500 according to some embodiments of the present disclosure.

[0060] As shown in Figure 5, the in-set speech samples $x1_i$ and $x1_j$ of the first and second data items are respectively input into the feature extraction module of the SSL model $f_k$ to obtain intermediate outputs, namely the embedding representations $e_i$ and $e_j$. Furthermore, the weighted sum of the two intermediate outputs can be used as the embedding representation of the out-of-set speech sample, as shown in equation (6):

[0061]

[0062] $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ (7)

[0063] The embedding representation of the out-of-set speech sample is obtained through equation (6), that is, the out-of-set speech sample of the out-of-set data item is obtained. In equation (6), λ is a parameter drawn from the beta distribution shown in equation (7), and α is a parameter of the beta distribution.

[0064] For example, the out-of-set label of an out-of-set data item can be obtained based on the weighted sum of the in-set labels of the two data items, as shown in equation (8):

[0065] $y_o = \lambda\, y1_i + (1-\lambda)\, y1_j$ (8)
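The construction of an out-of-set data item can be sketched as follows. Equation (6) is not reproduced in this text, so the sketch assumes that it mirrors equation (8), with the same λ weighting the two embeddings. Embeddings are simplified to flat lists of floats, and the function name `mix_pair` and the default α are hypothetical.

```python
import random


def mix_pair(e_i, e_j, y_i, y_j, alpha=2.0, rng=random):
    """Build an out-of-set embedding and label from two in-set data items."""
    if len(e_i) != len(e_j):
        raise ValueError("embeddings must have the same length")
    lam = rng.betavariate(alpha, alpha)  # equation (7): lambda ~ Beta(alpha, alpha)
    # Assumed form of equation (6): weighted sum of the two embeddings.
    e_o = [lam * a + (1.0 - lam) * b for a, b in zip(e_i, e_j)]
    # Equation (8): weighted sum of the two in-set labels.
    y_o = lam * y_i + (1.0 - lam) * y_j
    return e_o, y_o, lam


# Hypothetical two-dimensional embeddings with labels 4.0 and 2.0.
e_o, y_o, lam = mix_pair([1.0, 0.0], [0.0, 1.0], 4.0, 2.0, rng=random.Random(0))
```

Because the same λ is applied to the embeddings and the labels, the out-of-set label stays consistent with how much each in-set sample contributes to the mixed embedding.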

[0066] Furthermore, by inputting the embedding representation of the out-of-set speech sample into the encoder module of the pre-trained sub-model to be trained, the model output corresponding to the out-of-set data item can also be obtained. Referring to Figure 5, the embedding representation of the out-of-set speech sample can be further input into the encoder module of the SSL model $f_k$ and then passed through the FC sublayer $fc_k$ to obtain the model output corresponding to the out-of-set data item.

[0067] In this way, training can be performed based on both in-set and out-of-set data items. For example, a first-order loss can be constructed based on the out-of-set label $y_o$ of the out-of-set data item and the corresponding model output; a loss function can be constructed based at least on this first-order loss of the out-of-set data item; and the model can be trained using the constructed loss function.

[0068] Exemplarily, in embodiments of this disclosure, the process described in conjunction with Figure 4 and Figure 5 can be based on the C-Mixup algorithm, which can simulate the construction of out-of-set data items based on the combination of in-set data (two data items), thereby increasing the generalization performance of the model.

[0069] Figure 6 shows a flowchart of an example use process 600 according to some embodiments of the present disclosure. At block 610, a trained quality assessment model is obtained, wherein the trained quality assessment model includes at least two pre-trained sub-models, each pre-trained sub-model being obtained through a pairwise training method. At block 620, input speech is fed into the trained quality assessment model to obtain the model output. In some embodiments, the model output may represent the mean opinion score (MOS) value of the input speech.

[0070] In embodiments of this disclosure, the quality assessment model obtained at block 610 may have the model structure shown in Figure 1. For example, each pre-trained sub-model in this model can be obtained via process 200 shown in Figure 2, or via process 400 shown in Figure 4.

[0071] Suppose the input speech is represented as X. Then X can be input into at least two pre-trained sub-models. Suppose the output of the at least two pre-trained sub-models is at least two intermediate MOS prediction values. Then the at least two intermediate MOS prediction values can be concatenated and input into the FC layer to obtain the output result, which is represented as Y.

[0072] With reference to Figure 1, the input speech X is fed into the seven pre-trained sub-models 110 to 170, yielding seven intermediate MOS prediction values, denoted as M1 to M7. These seven intermediate MOS prediction values are concatenated, and the concatenated result is fed into the FC layer 180 to obtain the model output Y. This model output represents the MOS value of the input speech X.
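The inference flow above can be sketched in Python with stand-in sub-models. Real sub-models would be SSL networks with FC sublayers; here each is a plain callable returning an intermediate MOS value, and the FC layer is reduced to a single weighted sum plus bias. The names `fuse_predictions` and `predict_mos` and the weights are assumptions of this sketch.

```python
def fuse_predictions(intermediate_mos, weights, bias=0.0):
    """Apply a single-output FC layer to the concatenated sub-model outputs."""
    if len(intermediate_mos) != len(weights):
        raise ValueError("one weight is required per intermediate MOS value")
    return sum(w * m for w, m in zip(weights, intermediate_mos)) + bias


def predict_mos(speech, sub_models, weights, bias=0.0):
    """Feed the same speech to every sub-model, concatenate, then fuse."""
    intermediate = [model(speech) for model in sub_models]  # M1 ... Mn
    return fuse_predictions(intermediate, weights, bias)


# Hypothetical stand-ins for two trained sub-models and a trained FC layer.
stub_models = [lambda speech: 3.0, lambda speech: 4.0]
y = predict_mos("input.wav", stub_models, weights=[0.5, 0.5])
```

In the seven-sub-model structure of Figure 1, `sub_models` would hold seven callables and `weights` seven learned coefficients.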

[0073] In this way, embodiments of the present disclosure can determine the MOS value of input speech through a quality assessment model. Furthermore, in embodiments of the present disclosure, since each pre-trained sub-model in the quality assessment model is obtained using a pairwise training method, the MOS values of different speech samples can be ranked. That is, the solution of embodiments of the present disclosure can be used to accurately rank the speech quality of different speech samples.

[0074] The following table shows a comparison between the scheme of the present disclosure (denoted as MOSPC) and schemes such as Mean-Bias Network (MBNet), Unified Listener Dependent Network (LDNET), MOS Network (MOSNET), Fusion-SSL, etc.

[0075] Table 1 shows the comparison results of mean square error (MSE), linear correlation coefficient (LCC), Spearman rank-order correlation coefficient (SRCC), and Kendall Tau rank correlation (KTAU) for utterance-level and system-level data obtained from the Voice Conversion Challenge 2018 (VCC2018) dataset.
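For reference, the four metrics named above can be computed as in the following standard-library sketch. The disclosure does not specify the exact variants used; the Kendall coefficient here is tau-a (no tie correction), which is an assumption, and the example values are invented for illustration.

```python
import math
from itertools import combinations


def mse(pred, true):
    """Mean square error between predicted and labeled MOS values."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)


def lcc(x, y):
    """Linear (Pearson) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Ranks from 1 to n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            result[k] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def srcc(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return lcc(ranks(x), ranks(y))


def ktau(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    score = 0
    for (a1, b1), (a2, b2) in combinations(list(zip(x, y)), 2):
        s = (a1 - a2) * (b1 - b2)
        score += (s > 0) - (s < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)


# Invented predictions and labels for four utterances.
pred = [1.0, 2.0, 3.0, 5.0]
true = [1.5, 2.5, 3.5, 4.0]
```

MSE measures how close the predicted values are to the labels, while LCC, SRCC, and KTAU measure how well the predicted values track the labels linearly and in rank order.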

[0076] Table 1

[0077]

[0078] Table 2 shows the comparison results of the accuracy of fine-grained fractional speech quality ranking for the VCC2018 dataset.

[0079] Table 2

[0080]

[0081] Table 3 shows the comparison results of the generalization test for the VCC2016 dataset.

[0082] Table 3

[0083]

[0084] The comparison results shown in Tables 1 to 3 above demonstrate that the sorting accuracy of the schemes in the embodiments of this disclosure is higher, and they have stronger generalization performance.

[0085] It should be understood that in the embodiments of this disclosure, "first," "second," "third," etc., are only used to indicate that multiple objects may be different, but at the same time, it does not exclude that two objects are the same, and should not be interpreted as any limitation on the embodiments of this disclosure.

[0086] It should also be understood that the manner, situation, category, and division of embodiments in the present disclosure are for the convenience of description only and should not constitute a special limitation. Various manners, categories, situations, and features in the embodiments can be combined with each other where logically consistent.

[0087] It should also be understood that the foregoing is merely to help those skilled in the art better understand the embodiments of this disclosure, and is not intended to limit the scope of the embodiments of this disclosure. Those skilled in the art can make various modifications, variations, or combinations based on the foregoing. Such modifications, variations, or combinations are also within the scope of the embodiments of this disclosure.

[0088] It should also be understood that the above description focuses on highlighting the differences between the various embodiments. Similarities or commonalities can be referenced or learned from each other, and for the sake of brevity, they will not be repeated here.

[0089] Figure 7 shows a schematic block diagram of an example apparatus 700 according to some embodiments of the present disclosure. Apparatus 700 can be implemented by software, hardware, or a combination of both. As shown in Figure 7, the apparatus 700 includes a model acquisition unit 710 and an output determination unit 720.

[0090] The model acquisition unit 710 is configured to acquire a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of which is obtained through a pairwise training method. The output determination unit 720 is configured to input the input speech into the trained quality assessment model to obtain the model output.

[0091] For example, the model output represents the MOS value of the input speech. It can be understood that this MOS value is a MOS value predicted by a trained quality assessment model.

[0092] In some embodiments of this disclosure, the apparatus 700 may further include a training unit 705 configured to obtain various pre-trained sub-models through training.

[0093] For example, training unit 705 can be configured to train each pre-trained sub-model by: constructing a first training dataset, the first training dataset comprising multiple batches, each batch comprising multiple data items, and each data item in the first training dataset comprising a speech sample and a label; and training the pre-trained sub-model based on the first training dataset using a pairwise training method.

[0094] In some examples, training unit 705 is configured to perform training using the pairwise training method by: acquiring two data items from the same batch of the first training dataset; inputting the two speech samples of the two data items into the pre-trained sub-model to be trained to obtain two outputs; constructing a loss function based on the two outputs and the two labels corresponding to the two speech samples of the two data items; and training the pre-trained sub-model based on the loss function.
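The pair selection step can be sketched as follows. This is a minimal Python illustration, assuming that pairs are drawn uniformly at random within each batch; the disclosure does not fix the selection rule at this point. The names `DataItem`, `sample_pairs`, and `pairs_per_batch`, and the file names, are hypothetical.

```python
import random
from typing import Iterator, List, Tuple

# A data item is (speech_sample, label); the sample is kept abstract here.
DataItem = Tuple[str, float]


def sample_pairs(
    batches: List[List[DataItem]], pairs_per_batch: int, rng: random.Random
) -> Iterator[Tuple[DataItem, DataItem]]:
    """Yield pairs of distinct data items, both drawn from the same batch."""
    for batch in batches:
        if len(batch) < 2:
            continue  # a pair needs two items
        for _ in range(pairs_per_batch):
            first, second = rng.sample(batch, 2)  # two distinct positions
            yield first, second


batches = [
    [("a.wav", 4.2), ("b.wav", 3.1), ("c.wav", 1.8)],
    [("d.wav", 2.5), ("e.wav", 4.9)],
]
pairs = list(sample_pairs(batches, pairs_per_batch=2, rng=random.Random(0)))
```

Each yielded pair would then be fed through the sub-model to be trained, and the two outputs together with the two labels would drive the loss.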

[0095] In some examples, training unit 705 is configured to construct the loss function by: constructing a relative ranking loss based on the two outputs and the two labels corresponding to the two speech samples of the two data items; and constructing the loss function based on the relative ranking loss and the two first-order losses corresponding to the two speech samples.

[0096] In some examples, training unit 705 is configured to construct the relative ranking loss by: mapping the two outputs to a probability using a logistic function; determining a relative value from the relative magnitude of the two labels corresponding to the two speech samples of the two data items; and constructing the relative ranking loss based on the probability and the relative value.
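A possible form of this loss is sketched below in Python. Several points are assumptions rather than statements of the disclosure: the logistic function is applied to the difference of the two outputs, the relative value is encoded as 1, 0, or 0.5 (tie), the ranking loss is a cross-entropy between the probability and the relative value (a RankNet-style construction), and the "first-order losses" are read as absolute (L1) errors. The weighting factor `alpha` is likewise hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def relative_value(label_a: float, label_b: float) -> float:
    """Assumed encoding: 1.0 if a is rated higher, 0.0 if lower, 0.5 if tied."""
    if label_a > label_b:
        return 1.0
    if label_a < label_b:
        return 0.0
    return 0.5


def ranking_loss(out_a: float, out_b: float, label_a: float, label_b: float,
                 eps: float = 1e-12) -> float:
    """Cross-entropy between the predicted and the labelled ordering."""
    p = sigmoid(out_a - out_b)  # probability that a ranks above b
    t = relative_value(label_a, label_b)
    return -(t * math.log(p + eps) + (1.0 - t) * math.log(1.0 - p + eps))


def pair_loss(out_a: float, out_b: float, label_a: float, label_b: float,
              alpha: float = 1.0) -> float:
    """Ranking loss plus the two per-sample L1 losses (assumed reading)."""
    l1 = abs(out_a - label_a) + abs(out_b - label_b)
    return alpha * ranking_loss(out_a, out_b, label_a, label_b) + l1


# A correctly ordered pair is penalised less than a wrongly ordered one.
loss_correct = ranking_loss(4.0, 2.0, 4.5, 2.0)
loss_wrong = ranking_loss(2.0, 4.0, 4.5, 2.0)
```

The ranking term only constrains the order of the two outputs, while the per-sample terms keep each output close to its own label.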

[0097] For example, training unit 705 may also be configured to: construct a second training dataset, the second training dataset including multiple data items, each data item in the second training dataset including an in-set speech sample and an in-set label; construct an out-of-set data item based on two data items in the second training dataset, the out-of-set data item including an out-of-set speech sample and an out-of-set label; and train the pre-trained sub-model based on the second training dataset and the out-of-set data item.

[0098] In some examples, training unit 705 is configured to construct out-of-set data items by inputting two in-set speech samples of two data items in the second training dataset into the feature extraction module of the pre-trained sub-model to be trained to obtain two intermediate outputs; determining the embedding representation of the out-of-set speech samples based on the two intermediate outputs; and determining the out-of-set labels based on the two in-set labels corresponding to the two in-set speech samples of the two data items.

[0099] Optionally, the training unit 705 can also be configured to: input the embedding representation of the out-of-set speech sample into the encoder module of the pre-trained sub-model to be trained to obtain the model output corresponding to the out-of-set data item.
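The construction of an out-of-set data item can be illustrated with the following Python sketch. It assumes that the embedding representation and the out-of-set label are both obtained by linear interpolation with a shared mixing coefficient `lam`, in the manner of mixup; this paragraph only states that they are determined "based on" the two intermediate outputs and the two in-set labels, so the interpolation rule is an assumption. `toy_extractor` and `toy_encoder` are hypothetical stand-ins for the feature extraction module and the encoder module.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mix_out_of_set(
    sample_a: Sequence[float],
    label_a: float,
    sample_b: Sequence[float],
    label_b: float,
    feature_extractor: Callable[[Sequence[float]], Vector],
    lam: float,
) -> Tuple[Vector, float]:
    """Build an out-of-set embedding and label from two in-set data items."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    h_a = feature_extractor(sample_a)  # first intermediate output
    h_b = feature_extractor(sample_b)  # second intermediate output
    if len(h_a) != len(h_b):
        raise ValueError("intermediate outputs must have the same length")
    embedding = [lam * x + (1.0 - lam) * y for x, y in zip(h_a, h_b)]
    label = lam * label_a + (1.0 - lam) * label_b
    return embedding, label


# Toy stand-ins (assumptions): a two-feature extractor and a trivial encoder.
def toy_extractor(x: Sequence[float]) -> Vector:
    return [sum(x) / len(x), max(x) - min(x)]


def toy_encoder(h: Vector) -> float:
    return 1.0 + sum(h)


embedding, label = mix_out_of_set(
    [0.0, 2.0], 4.0, [2.0, 4.0], 2.0, toy_extractor, lam=0.5
)
output = toy_encoder(embedding)  # model output for the out-of-set item
```

Because mixing happens after the feature extraction module, no out-of-set waveform needs to be synthesized; only the embedding is passed on to the encoder module.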

[0100] In some embodiments, the two data items include a first data item and a second data item, and the training unit 705 can also be configured to determine, for the first data item, the second data item based on a sampling probability given by a symmetric Gaussian kernel.
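One possible reading of this sampling step is sketched below in Python. It assumes that the Gaussian kernel is evaluated on the difference between the labels of the two data items, so that items with similar labels are paired more often, and that the first item is never paired with itself. The kernel argument, the bandwidth `sigma`, and the function names are assumptions and are not specified in this paragraph.

```python
import math
import random
from typing import List, Sequence


def gaussian_kernel_weights(
    labels: Sequence[float], first: int, sigma: float
) -> List[float]:
    """Unnormalised sampling weights for the second item, given the first."""
    weights = []
    for j, y in enumerate(labels):
        if j == first:
            weights.append(0.0)  # do not pair an item with itself
        else:
            d = y - labels[first]
            # Symmetric in d: the weight depends only on |d|.
            weights.append(math.exp(-(d * d) / (2.0 * sigma * sigma)))
    return weights


def sample_second(
    labels: Sequence[float], first: int, sigma: float, rng: random.Random
) -> int:
    """Draw the index of the second data item (needs at least two items)."""
    weights = gaussian_kernel_weights(labels, first, sigma)
    return rng.choices(range(len(labels)), weights=weights, k=1)[0]


labels = [1.0, 2.0, 3.0, 4.5]
weights = gaussian_kernel_weights(labels, first=1, sigma=1.0)
second = sample_second(labels, first=1, sigma=1.0, rng=random.Random(0))
```

A smaller `sigma` concentrates the pairing on near neighbours in label space, while a larger one approaches uniform pairing.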

[0101] The device 700 of Figure 7 can be used to implement the processes described above in conjunction with Figures 1 to 6. For the sake of brevity, they are not repeated here.

[0102] The division of modules or units in the embodiments of this disclosure is illustrative and only represents one logical functional division. In actual implementation, there may be other division methods. Furthermore, the functional units in the disclosed embodiments may be integrated into one unit, exist as separate physical entities, or two or more units may be integrated into one unit. The integrated unit described above can be implemented in hardware or as a software functional unit.

[0103] Figure 8 shows a block diagram of an example device 800 that can be used to implement embodiments of the present disclosure. It should be understood that the device 800 shown in Figure 8 is merely exemplary and should not be construed as limiting the functionality and scope of the implementations described herein. For example, device 800 can be used to perform the processes described above in conjunction with Figures 1 to 6.

[0104] As shown in Figure 8, device 800 is in the form of a general-purpose computing device. Components of computing device 800 may include, but are not limited to, one or more processors or processing units 810, memory 820, storage devices 830, one or more communication units 840, one or more input devices 850, and one or more output devices 860. Processing unit 810 may be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 820. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device 800.

[0105] Computing device 800 typically includes multiple computer storage media. Such media can be any available media accessible to computing device 800, including but not limited to volatile and non-volatile media, and removable and non-removable media. Memory 820 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Storage device 830 can be removable or non-removable media and may include machine-readable media, such as flash drives, disks, or any other media capable of storing information and/or data (e.g., training data for training) and accessible within computing device 800.

[0106] The computing device 800 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 8, a disk drive for reading from or writing to a removable, non-volatile disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, non-volatile optical disk can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces. Memory 820 may include a computer program product 825 having one or more program modules configured to perform various methods or actions of various implementations of this disclosure.

[0107] The communication unit 840 enables communication with other computing devices via a communication medium. Additionally, the functions of the components of the computing device 800 can be implemented in a single computing cluster or in multiple computing machines that are capable of communicating via communication connections. Therefore, the computing device 800 can operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or other network nodes.

[0108] Input device 850 can be one or more input devices, such as a mouse, keyboard, trackball, etc. Output device 860 can be one or more output devices, such as a monitor, speaker, printer, etc. As needed, computing device 800 can also communicate via communication unit 840 with one or more external devices (not shown), such as storage devices and display devices, with one or more devices that enable a user to interact with computing device 800, or with any device (e.g., a network card or modem) that enables computing device 800 to communicate with one or more other computing devices. Such communication can be performed via an input/output (I/O) interface (not shown).

[0109] According to an exemplary implementation of this disclosure, a computer-readable storage medium is provided that stores computer-executable instructions thereon, wherein the computer-executable instructions are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions, which are executed by a processor to implement the methods described above. According to an exemplary implementation of this disclosure, a computer program product is provided that stores a computer program thereon, which, when executed by a processor, implements the methods described above.

[0110] Various aspects of this disclosure are described herein with reference to flowchart illustrations and / or block diagrams of methods, apparatuses, devices, and computer program products implemented according to this disclosure. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer-readable program instructions.

[0111] These computer-readable program instructions can be provided to a processing unit of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine such that, when executed by the processing unit of the computer or other programmable data processing apparatus, they create means for implementing the functions / actions specified in one or more blocks of the flowchart and / or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium that causes a computer, programmable data processing apparatus, and / or other device to operate in a particular manner. Thus, the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions for implementing aspects of the functions / actions specified in one or more blocks of the flowchart and / or block diagram.

[0112] Computer-readable program instructions can be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowchart and/or block diagram.

[0113] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of an instruction, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0114] Various implementations of this disclosure have been described above. These descriptions are exemplary rather than exhaustive, and are not limited to the disclosed implementations. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described implementations. The terminology used herein was chosen to best explain the principles of the implementations, their practical application, or their technical improvement over technologies in the market, or to enable others skilled in the art to understand the various implementations disclosed herein.

Claims

1. A quality assessment method, comprising: obtaining a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method, wherein the pairwise training method includes: during training, using two data items from the same batch as a pair of inputs, inputting the pair of inputs into the pre-trained sub-models respectively, and constructing a loss function based on the corresponding pair of outputs and the labels corresponding to the pair of inputs; and inputting input speech into the trained quality assessment model to obtain a model output.

2. The method of claim 1, further comprising training each of the pre-trained sub-models by: constructing a first training dataset, the first training dataset including multiple batches, each batch including multiple data items, and each data item in the first training dataset including a speech sample and a label; and training the pre-trained sub-model based on the first training dataset using the pairwise training method.

3. The method according to claim 2, wherein training using the pairwise training method comprises: obtaining two data items from the same batch of the first training dataset; inputting the two speech samples of the two data items respectively into the pre-trained sub-model to be trained to obtain two outputs; constructing a loss function based on the two outputs and the two labels corresponding to the two speech samples of the two data items; and training the pre-trained sub-model based on the loss function.

4. The method of claim 3, wherein constructing the loss function comprises: constructing a relative ranking loss based on the two outputs and the two labels corresponding to the two speech samples of the two data items; and constructing the loss function based on the relative ranking loss and the two first-order losses corresponding to the two speech samples.

5. The method of claim 4, wherein constructing the relative ranking loss comprises: mapping the two outputs to a probability using a logistic function; determining a relative value from the relative magnitude of the two labels corresponding to the two speech samples of the two data items; and constructing the relative ranking loss based on the probability and the relative value.

6. The method according to claim 2, further comprising: constructing a second training dataset, the second training dataset including multiple data items, each of which includes an in-set speech sample and an in-set label; constructing an out-of-set data item based on two data items in the second training dataset, the out-of-set data item including an out-of-set speech sample and an out-of-set label; and training the pre-trained sub-model based on the second training dataset and the out-of-set data item.

7. The method of claim 6, wherein constructing the out-of-set data item comprises: inputting the two in-set speech samples of the two data items in the second training dataset respectively into the feature extraction module of the pre-trained sub-model to be trained to obtain two intermediate outputs; determining an embedding representation of the out-of-set speech sample based on the two intermediate outputs; and determining the out-of-set label based on the two in-set labels corresponding to the two in-set speech samples of the two data items.

8. The method according to claim 7, further comprising: inputting the embedding representation of the out-of-set speech sample into the encoder module of the pre-trained sub-model to be trained to obtain a model output corresponding to the out-of-set data item.

9. The method of claim 6, wherein the two data items include a first data item and a second data item, and the method further comprises: for the first data item, determining the second data item based on a sampling probability of a symmetric Gaussian kernel.

10. The method of claim 1, wherein the model output represents the mean opinion score (MOS) value of the input speech.

11. An electronic device, comprising: at least one processing unit; and at least one memory coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform actions including: obtaining a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method, wherein the pairwise training method includes: during training, using two data items from the same batch as a pair of inputs, inputting the pair of inputs into the pre-trained sub-models respectively, and constructing a loss function based on the corresponding pair of outputs and the labels corresponding to the pair of inputs; and inputting input speech into the trained quality assessment model to obtain a model output.

12. A quality assessment device, comprising: a model acquisition unit configured to acquire a trained quality assessment model, wherein the trained quality assessment model includes at least two pre-trained sub-models, each of the two pre-trained sub-models being obtained through a pairwise training method, wherein the pairwise training method includes: during training, using two data items from the same batch as a pair of inputs, inputting the pair of inputs into the pre-trained sub-models respectively, and constructing a loss function based on the corresponding pair of outputs and the labels corresponding to the pair of inputs; and an output determination unit configured to input input speech into the trained quality assessment model to obtain a model output.

13. A computer-readable storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the method according to any one of claims 1 to 10.