Method, device and system for speech quality evaluation of existing speech files

CN115512718B · Active · Publication Date: 2026-06-30 · ZHONGKE YOUSHENG (SUZHOU) TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHONGKE YOUSHENG (SUZHOU) TECH CO LTD
Filing Date: 2022-09-14
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

In the existing technology, there is a lack of convenient, accurate and real-time measurement methods for evaluating the recording quality of existing voice files. The evaluation results of professionals are not accurate and cannot achieve a unified measurement standard. Ordinary users cannot carry out tests, and existing equipment is expensive and time-consuming.

Method used

By receiving the target speech signal, the system extracts features from the target speech signal using a pre-trained speech quality model, calculates the speech quality evaluation results, and displays them in real time. The system uses octave band filtering and a neural network model to evaluate speech quality.

Benefits of technology

It enables convenient, accurate, and real-time evaluation of the voice quality of existing voice files, can display the intelligibility of the language in real time, and supports the optimization and adjustment of the voice file recording environment and parameters.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method, apparatus, and system for evaluating the voice quality of existing voice files. The evaluation method includes: receiving a target voice signal, which includes a target existing voice file played by a target sound-producing device; calculating the corresponding voice quality evaluation result in real time based on the target voice signal; and displaying the voice quality evaluation result at the front end. The method can evaluate the voice quality of any existing voice file and display, in real time, the speech intelligibility obtained when the file is played. Furthermore, because the recording quality of existing voice files is evaluated in real time, the recording environment and parameters of the voice files can be further optimized and adjusted.

Description

Technical Field

[0001] This application relates to the field of acoustic measurement technology, and in particular to methods, apparatus and systems for evaluating the speech quality of existing speech files.

Background Technology

[0002] The recording quality of existing audio files is usually evaluated by the intelligibility of the language during playback. When the intelligibility of the language during playback is low, it indicates that the current audio recording quality is low and may even require re-recording.

[0003] In existing technologies, the intelligibility of existing audio files is typically evaluated manually by professionals. However, the accuracy of these evaluations is low, and there is no unified measurement standard, resulting in inconsistent quality among existing audio files. Alternatively, some methods require professionals to use specialized playback equipment, such as an artificial mouth or talkbox, to play specific standard modulation test signals. Other methods involve using specialized equipment to collect room impulse responses for complex calculations. These methods are not only time-consuming and expensive, but also inaccessible to ordinary users, making it impossible to obtain test results in real time, thus presenting significant limitations.

[0004] Therefore, a measurement method is needed that can determine the speech intelligibility of existing speech files accurately, conveniently, and in real time.

Summary of the Invention

[0005] The purpose of this application is to provide a method, apparatus and system for evaluating the voice quality of existing voice files, which can conveniently and accurately evaluate and display the voice quality of existing voice files in real time.

[0006] To achieve the above-mentioned objectives, this application proposes the following technical solution:

[0007] Firstly, a method for evaluating speech quality is provided, the evaluation method comprising:

[0008] Receive a target voice signal, wherein the target voice signal includes a target existing voice file played by a target sound-producing device;

[0009] The corresponding speech quality evaluation result is obtained in real time based on the target speech signal;

[0010] The voice quality evaluation results are displayed on the front end.

[0011] In a preferred embodiment, the step of calculating the corresponding speech quality evaluation result in real time based on the target speech signal includes:

[0012] At least one set of target speech signals is obtained by performing feature extraction on the target speech signal;

[0013] Using the at least one set of target speech signals as input, the corresponding speech quality evaluation results are obtained through a pre-trained speech quality model.

[0014] In a preferred embodiment, the speech quality evaluation result includes, but is not limited to, the speech transmission index (STI) or the mean opinion score (MOS).

[0015] In a preferred embodiment, when the speech quality evaluation result includes a speech transmission index, the step of calculating the corresponding speech quality evaluation result in real time based on the target speech signal includes:

[0016] The target speech signal is subjected to feature extraction according to p different octave band filtered signal bands to obtain p sets of target features, where p ≥ 2;

[0017] Based on the p groups of target features, the speech quality evaluation result corresponding to the target speech signal at the first target location is obtained.

[0018] In a preferred embodiment, the step of extracting features from the target speech signal according to p different octave band filtered signal bands to obtain corresponding p sets of target features includes:

[0019] The target speech signal is filtered to obtain p different octave band filtered signal bands, and each octave band filtered signal band includes n modulation frequencies f_m, where n ≥ 1 and m ≥ 1;

[0020] The envelopes of the p groups of different octave band filtered signal bands are extracted to obtain the envelope features of the p groups of sub-bands.

[0021] Taking the octave-band filtered signal band corresponding to any one group of sub-band envelope features as input, the p groups of reverberation times T are obtained through the pre-trained speech quality model corresponding to that octave-band filtered signal band;

[0022] Based on the reverberation time T of each of the p groups, the corresponding target features of the p groups are obtained, and the target features are modulation transfer function values.

[0023] In a preferred embodiment, the step of extracting the envelope features of the p groups of different octave band filtered signal bands to obtain the envelope features of the p groups of sub-bands includes:

[0024] The envelope characteristics of the p groups of sub-bands are obtained by performing half-wave envelope detection on the different octave bands of the p groups of filtered signals.

[0025] In a preferred embodiment, the step of taking any set of octave-band filtered signal bands corresponding to the envelope features of the sub-bands as input, and obtaining p sets of reverberation times T through a pre-trained speech quality model corresponding to the corresponding octave-band filtered signal bands, includes:

[0026] Divide any of the octave band filtered signals into N consecutive speech segments of equal duration, where N≥2;

[0027] For any of the N speech segments included in any octave band filtered signal band, feature extraction is performed on any speech segment using a combination structure of one or more of the following: convolutional neural network, linear connection layer, activation layer, and normalization layer, to obtain a matrix of shape [P,Q], thereby obtaining the corresponding N speech segment features;

[0028] Any speech segment feature among the N obtained speech segment features is interacted through a combination of one or more of the following structures: Long Short-Term Memory module, multi-head / single-head attention module, linear connection layer, activation layer, and normalization layer, to obtain the corresponding speech segment interaction feature;

[0029] Based on the obtained N speech segment interaction features, the N reverberation times T_N corresponding to the N speech segment interaction features are predicted by a linear regression layer or a classification layer;

[0030] The N reverberation times T_N corresponding to any octave band filtered signal band are averaged to obtain the p groups of reverberation times T corresponding to the respective octave band filtered signal bands.

[0031] In a preferred embodiment, obtaining the corresponding p groups of target features based on the reverberation times T of the p groups includes:

[0032] Based on any modulation frequency f_m and the corresponding reverberation time T, the modulation transfer function value m_{k,f_m} at that modulation frequency f_m of any octave band filtered signal band is obtained.

[0033] In a preferred embodiment, obtaining the speech quality evaluation result of the first target location corresponding to the target speech signal based on the p sets of target features includes:

[0034] Based on any of the modulation transfer function values m_{k,f_m}, obtain the effective signal-to-noise ratio SNReff_{k,f_m} at that modulation frequency f_m of the corresponding octave-band filtered signal band k;

[0035] Based on any of the aforementioned effective signal-to-noise ratios SNReff_{k,f_m}, obtain the transmission index TI_{k,f_m} at that modulation frequency f_m of the corresponding octave-band filtered signal band k;

[0036] Calculate the mean of the n transmission indices TI_{k,f_m} of any octave-band filtered signal band k to obtain the modulation transfer index M_k of that band;

[0037] Based on the modulation transfer indices M_k of the p octave band filtered signal bands, the speech quality evaluation result corresponding to the target speech signal at the first target location is calculated.

[0038] In a preferred embodiment, the evaluation method further includes pre-training p speech quality models corresponding to the p different octave band filtered signal bands, including:

[0039] Based on any existing speech file in the existing speech file sample set, obtain p different octave band filtered signal band sample sets. Each octave band filtered signal band sample set includes q modulation frequency samples and corresponding q impulse response samples. Each impulse response sample includes a reverberation time sample T0, where q ≥ 2.

[0040] Using the q modulation frequency samples as input and the corresponding q reverberation time samples T0 as output, p speech quality models corresponding to p octave band filtered signal bands are obtained by training based on a neural network.

[0041] In a preferred embodiment, the voice quality evaluation results are displayed on the front end, and the display methods include, but are not limited to:

[0042] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic mobile-signal-strength indicator; or

[0043] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic Wi-Fi signal indicator; or

[0044] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic dashboard; or

[0045] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic range bar.

[0046] Secondly, a device for evaluating the voice quality of existing voice files is provided, characterized in that the device comprises:

[0047] A receiving module is configured to receive a target voice signal, wherein the target voice signal includes a target existing voice file played by a target sound-producing device.

[0048] The processing module is used to calculate and obtain the corresponding speech quality evaluation result in real time based on the target speech signal;

[0049] The display module is used to display the voice quality evaluation results on the front end.

[0050] Thirdly, a voice quality evaluation system for existing voice files is provided, the evaluation system comprising:

[0051] At least one voice receiving device, the voice receiving device being used to receive a target voice signal, the target voice signal including a target existing voice file played by a target sound-producing device;

[0052] At least one display device, the at least one display device being used to display the speech quality evaluation results on a front end;

[0053] A smart device, wherein the smart device is configured to receive a target voice signal sent by the at least one voice receiving device, perform an operation as described in any one of the first aspects based on the target voice signal to calculate and obtain a corresponding voice quality evaluation result in real time, and send the voice quality evaluation result to the at least one display device for front-end display.

[0054] Fourthly, an electronic device is provided, comprising:

[0055] One or more processors; and

[0056] A memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the operations described in any one of the first aspects.

[0057] Fifthly, a computer-readable storage medium is provided having a computer program stored thereon, wherein the program, when executed by a processor, implements the method as described in any one of the first aspects.

[0058] Compared with the prior art, this application has the following beneficial effects:

[0059] This application provides a method, apparatus, and system for evaluating the voice quality of existing voice files. The evaluation method includes: receiving a target voice signal, which includes a target existing voice file played by a target sound-producing device; calculating the corresponding voice quality evaluation result in real time based on the target voice signal; and displaying the voice quality evaluation result at the front end. The method can evaluate the voice quality of any existing voice file and display, in real time, the speech intelligibility obtained when the file is played. Furthermore, because the recording quality of existing voice files is evaluated in real time, the recording environment and parameters of the voice files can be further optimized and adjusted.

Attached Figure Description

[0060] Figure 1 explains the meaning of STI scores;

[0061] Figure 2 is a flowchart of the speech quality evaluation method for existing speech files in this embodiment;

[0062] Figure 3 is a schematic diagram of envelope extraction to obtain the envelope boundary in this embodiment;

[0063] Figure 4 is a diagram of the half-wave envelope detector circuit in this embodiment;

[0064] Figure 5 is a schematic diagram of a neural network structure;

[0065] Figures 6a-6d are examples of the display content when showing the voice quality evaluation results on the interface;

[0066] Figure 7 is a system architecture diagram of a voice quality evaluation system for existing voice files.

Detailed Implementation

[0067] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0068] In the description of this application, it should be understood that the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, unless otherwise stated, "multiple" means two or more.

[0069] To address the current limitations of real-time measurement and feedback in assessing the intelligibility of existing speech files, this embodiment provides a speech quality evaluation method with real-time feedback and accurate measurement results. The following detailed description, in conjunction with specific embodiments, provides a method, apparatus, and system for evaluating the speech quality of existing speech files.

[0070] Example

[0071] As shown in Figure 2, this embodiment provides a method for evaluating the speech quality of existing speech files, which is applicable to evaluating the speech intelligibility of existing speech files.

[0072] Specifically, the speech quality evaluation method for existing speech files in this embodiment includes the following steps:

[0073] S1. Receive the target audio signal. The target audio signal includes the target existing audio file played by the target audio-producing device. This embodiment does not limit the target audio-producing device.

[0074] S2. Calculate the corresponding speech quality evaluation results in real time based on the target speech signal.

[0075] Typically, step S2 above includes the following steps:

[0076] S21. Perform feature extraction on the target speech signal to obtain at least one set of target speech signals;

[0077] S22. Using at least one set of target speech signals as input, obtain the corresponding speech quality evaluation results through a pre-trained speech quality model. It should be noted that the speech quality evaluation results include, but are not limited to, one of the following: Speech Transmission Index (STI) or Mean Opinion Score (MOS).

[0078] For ease of description, this embodiment uses the Speech Transmission Index (STI) as the example speech quality evaluation result, but is not limited to it. Typically, the STI is obtained by sending a specific test signal through the transmission path and analyzing the received signal to derive the speech transmission quality of that path, which is expressed as a score between 0 and 1 (see Figure 1).

[0079] Step S21 specifically involves extracting features from the target speech signal according to p different octave band filtered signal bands k to obtain corresponding p groups of target features, where p ≥ 2 and 1 ≤ k ≤ p.

[0080] Furthermore, step S21 includes:

[0081] S21a. Filter the target speech signal to obtain p different octave band filtered signal bands k, where each octave band filtered signal band k includes n modulation frequencies f_m, n ≥ 1, m ≥ 1.

[0082] It should be noted that human speech is typically divided into seven frequency bands; therefore, in this embodiment, p = 7 is preferred, i.e., 1 ≤ k ≤ 7. Thus, filtering the target speech yields octave-band filtered signal bands k with center frequencies f_c of 125 Hz, 250 Hz, 500 Hz, 1 kHz, 2 kHz, 4 kHz, and 8 kHz respectively, and the upper limit frequency f_u and lower limit frequency f_l of each octave-band filtered signal band k are given by formulas (1) and (2) below respectively:

[0083]

[0084]
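
Formulas (1) and (2) are not reproduced in this text. As a minimal sketch, assuming the conventional base-2 octave relation in which each band spans a factor of two around its center frequency (f_u = √2·f_c and f_l = f_c/√2), the band limits for the seven center frequencies listed in [0082] could be computed as follows. The function name `octave_band_limits` is hypothetical.

```python
import math

# Center frequencies f_c of the seven octave bands named in [0082], in Hz.
CENTER_FREQUENCIES_HZ = [125.0, 250.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0]


def octave_band_limits(f_c: float) -> tuple[float, float]:
    """Return (f_l, f_u) for an octave band centered at f_c.

    Assumption: base-2 octave bands, so f_u / f_l == 2 and
    f_c is the geometric mean of the two limits.
    """
    f_l = f_c / math.sqrt(2.0)
    f_u = f_c * math.sqrt(2.0)
    return f_l, f_u


if __name__ == "__main__":
    for f_c in CENTER_FREQUENCIES_HZ:
        f_l, f_u = octave_band_limits(f_c)
        print(f"f_c = {f_c:7.1f} Hz  ->  f_l = {f_l:8.2f} Hz, f_u = {f_u:8.2f} Hz")
```

Under this assumption adjacent bands tile the spectrum without gaps, because the upper limit of one band equals the lower limit of the next.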

[0085] S21b. Envelope extraction is performed on the p groups of different octave band filtered signal bands k to obtain the envelope features of p groups of sub-bands. The envelope extraction result is the envelope boundary shown in Figure 3.

[0086] This embodiment does not limit the envelope extraction algorithm, but preferably obtains the envelope features of the p groups of sub-bands by performing half-wave envelope detection on the p different octave band filtered signal bands k (as shown in Figure 4), expressed as the difference equations (3) and (4):

[0087]

[0088]
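
Equations (3) and (4) are not reproduced in this text, so the sketch below is not the detector of Figure 4. It assumes a generic half-wave envelope detector: half-wave rectification followed by a one-pole low-pass smoother written as a difference equation, y[i] = a·y[i-1] + (1-a)·r[i]. The cutoff frequency and all names are illustrative assumptions.

```python
import math


def half_wave_envelope(signal, sample_rate_hz, cutoff_hz=30.0):
    """Half-wave rectify `signal`, then smooth it with a one-pole low-pass.

    Assumed difference equations (illustrative, not taken from the patent):
        r[i] = max(x[i], 0)
        y[i] = a * y[i-1] + (1 - a) * r[i],  with a = exp(-2*pi*cutoff/fs)
    """
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    envelope = []
    y_prev = 0.0
    for x in signal:
        rectified = x if x > 0.0 else 0.0             # half-wave rectification
        y_prev = a * y_prev + (1.0 - a) * rectified   # low-pass smoothing
        envelope.append(y_prev)
    return envelope


if __name__ == "__main__":
    fs = 8000
    # 1 kHz carrier whose amplitude is modulated at 4 Hz.
    x = [
        (1.0 + 0.5 * math.sin(2 * math.pi * 4 * i / fs))
        * math.sin(2 * math.pi * 1000 * i / fs)
        for i in range(fs)
    ]
    env = half_wave_envelope(x, fs)
    print(f"envelope range after settling: {min(env[fs // 4:]):.3f} .. {max(env):.3f}")
```

The smoothing cutoff is chosen above the highest modulation frequency of interest (12.5 Hz) so that the slow amplitude fluctuations survive while the carrier is suppressed.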

[0089] S21c. Taking the octave-band filtered signal band k corresponding to any group of sub-band envelope features as input, the reverberation time T of p groups is obtained by using the pre-trained speech quality model corresponding to the corresponding octave-band filtered signal band k.

[0090] Any speech quality model sequentially includes a data preprocessing module, a feature extraction module, a temporal interaction module, and a prediction module. Specifically:

[0091] Data preprocessing module: Divides any octave-band filtered signal band k into N consecutive speech segments of a preset duration (e.g., x seconds each), shown as the speech slices in Figure 5, where N ≥ 2.

[0092] Feature extraction module: For any octave-band filtered signal band k, features are extracted from each of the N speech segments using a combination of one or more of the following structures: convolutional neural network, linear connection layer, activation layer, and normalization layer, yielding a matrix of shape [P,Q] per segment and thus the corresponding N speech segment features, shown as the speech slice features in Figure 5.

[0093] Temporal interaction module: Each of the N obtained speech segment features is made to interact through a combination of one or more of the following structures: Long Short-Term Memory (LSTM) module, multi-head / single-head attention module, linear connection layer, activation layer, and normalization layer, to obtain the corresponding speech segment interaction features, shown as the slice interaction features in Figure 5.

[0094] Prediction module: Based on the N obtained speech segment interaction features, it predicts the N corresponding reverberation times T_N through a linear regression layer or a classification layer, shown as the slice reverberation times T_N in Figure 5. The N values of T_N are then averaged to obtain the reverberation time T of the corresponding octave-band filtered signal band k. The methods for calculating the average include, but are not limited to, any one of the following: simple averaging, weighted averaging, or harmonic averaging.
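
The preprocessing and averaging steps of the four modules can be sketched without committing to a network architecture. In the minimal sketch below, `predict_segment_t` is a hypothetical stand-in for the trained feature extraction, temporal interaction, and prediction layers; only the segmentation into N equal-duration slices and the three averaging options named in [0094] are shown. Dropping a trailing partial segment is an assumption.

```python
from statistics import fmean, harmonic_mean


def split_into_segments(samples, sample_rate_hz, segment_seconds):
    """Split `samples` into consecutive, equal-duration segments.

    A trailing remainder shorter than one segment is dropped (assumption).
    """
    seg_len = int(round(segment_seconds * sample_rate_hz))
    if seg_len <= 0:
        raise ValueError("segment duration is too short for this sample rate")
    count = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(count)]


def average_reverberation_time(times, method="simple", weights=None):
    """Combine the per-segment reverberation times T_N into one band value T."""
    if method == "simple":
        return fmean(times)
    if method == "weighted":
        if weights is None or len(weights) != len(times):
            raise ValueError("weighted averaging needs one weight per segment")
        return sum(w * t for w, t in zip(weights, times)) / sum(weights)
    if method == "harmonic":
        return harmonic_mean(times)
    raise ValueError(f"unknown averaging method: {method}")


def band_reverberation_time(band_signal, sample_rate_hz, segment_seconds,
                            predict_segment_t, method="simple", weights=None):
    """Estimate T for one octave band from N >= 2 per-segment predictions."""
    segments = split_into_segments(band_signal, sample_rate_hz, segment_seconds)
    if len(segments) < 2:
        raise ValueError("need at least two segments (N >= 2)")
    times = [predict_segment_t(segment) for segment in segments]
    return average_reverberation_time(times, method, weights)
```

Harmonic averaging gives less influence to occasional large per-segment estimates than simple averaging does, which may matter when a few slices contain little speech.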

[0095] Therefore, prior to step S21c, the evaluation method further includes: Sa, pre-training p speech quality models corresponding to p different octave-band filtered signal bands k, including:

[0096] Sa1. Based on any existing speech file in the existing speech file sample set, obtain p different octave band filtered signal band sample sets. Each octave band filtered signal band sample set includes q modulation frequency samples and corresponding q impulse response samples. Each impulse response sample includes a reverberation time sample T0, where q ≥ 2.

[0097] Sa2. Taking q modulation frequency samples as input and q corresponding reverberation time samples T0 as output, p speech quality models corresponding to p octave band filtered signal bands k are obtained by training based on neural networks.
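
The text does not state how the impulse response samples and their reverberation time labels T0 are obtained. One common way to build labeled samples, shown here purely as an assumed illustration, is to synthesize an impulse response as noise shaped by an exponential envelope that decays by 60 dB after T0 seconds. All function names are hypothetical.

```python
import math
import random


def decay_envelope(t_seconds, t0_seconds):
    """Amplitude envelope that falls by 60 dB after t0_seconds.

    20*log10(exp(-k*t0)) = -60  =>  k = 3*ln(10) / t0.
    """
    return math.exp(-3.0 * math.log(10.0) * t_seconds / t0_seconds)


def synthetic_impulse_response(t0_seconds, sample_rate_hz, seed=0):
    """Gaussian noise shaped by the decay envelope, lasting t0_seconds."""
    rng = random.Random(seed)
    length = int(round(t0_seconds * sample_rate_hz))
    return [
        rng.gauss(0.0, 1.0) * decay_envelope(i / sample_rate_hz, t0_seconds)
        for i in range(length)
    ]


def make_training_pairs(t0_values, sample_rate_hz):
    """Pair each synthetic impulse response with its label T0 (q >= 2 pairs)."""
    if len(t0_values) < 2:
        raise ValueError("need at least two samples (q >= 2)")
    return [
        (synthetic_impulse_response(t0, sample_rate_hz, seed=i), t0)
        for i, t0 in enumerate(t0_values)
    ]
```

Convolving a dry speech file with such an impulse response would give a reverberant training signal whose target reverberation time is known by construction.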

[0098] S21d, based on the reverberation time T of the p groups, respectively obtain the corresponding p groups of target features, and the target features are the modulation transfer function values.

[0099] The modulation transfer function (MTF) describes the degree to which the modulation m is transmitted from the target object (sound source) to the receiving sensor, and is a function m_{k,f_m} of the modulation frequency f_m. The MTF determines the degree of modulation reduction in the target speech signal. Specifically, the modulation frequency f_m ranges from 0.63 Hz to 12.5 Hz. The MTF value depends on the system environment characteristics and the background noise. The MTF calculation process is shown in equation (5) below:

[0100]
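
Equation (5) is not reproduced in this text. A minimal sketch, assuming the widely used closed form for an exponentially decaying reverberant field, m = 1/sqrt(1 + (2π·f_m·T/13.8)²), optionally multiplied by the stationary-noise factor 1/(1 + 10^(-SNR/10)). The 14 one-third-octave modulation frequencies are the values conventionally used in STI measurement for the 0.63 Hz to 12.5 Hz range and are an assumption here, as are all names.

```python
import math

# Assumed modulation frequencies f_m in Hz (one-third-octave steps).
MODULATION_FREQUENCIES_HZ = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                             3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def mtf_from_reverberation(f_m, t_seconds, snr_db=None):
    """Modulation transfer function value for one modulation frequency.

    Assumed model: exponential reverberant decay with reverberation time T,
    plus an optional stationary-noise factor when snr_db is given.
    """
    m = 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t_seconds / 13.8) ** 2)
    if snr_db is not None:
        m *= 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return m


def mtf_row(t_seconds, snr_db=None):
    """MTF values m_{k,f_m} of one octave band k for every assumed f_m."""
    return [mtf_from_reverberation(f, t_seconds, snr_db)
            for f in MODULATION_FREQUENCIES_HZ]
```

In this model a longer reverberation time or a higher modulation frequency always lowers m, which matches the description of the MTF as a measure of modulation reduction.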

[0101] Step S22 specifically involves obtaining the speech quality evaluation result of the target speech signal corresponding to the first target position based on p groups of target features.

[0102] Specifically, step S22 includes:

[0103] S221. Based on any modulation transfer function value m_{k,f_m}, obtain the effective signal-to-noise ratio SNReff_{k,f_m} at that modulation frequency f_m of the corresponding octave-band filtered signal band k. Specifically, the effective signal-to-noise ratio SNReff_{k,f_m} is calculated using the following formula (6):

[0104]
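
Formula (6) is not reproduced in this text. A minimal sketch, assuming the standard STI conversion SNReff = 10·log10(m/(1-m)) with the result limited to ±15 dB. The clipping range and the function name are assumptions.

```python
import math


def effective_snr_db(m, limit_db=15.0):
    """Convert an MTF value m (0..1) to an effective SNR in dB, clipped."""
    if m <= 0.0:
        return -limit_db          # no modulation survives
    if m >= 1.0:
        return limit_db           # modulation fully preserved
    snr = 10.0 * math.log10(m / (1.0 - m))
    return max(-limit_db, min(limit_db, snr))
```

For example, m = 0.5 corresponds to 0 dB, the point where the preserved modulation and the lost modulation are equal.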

[0105] S222. Based on any effective signal-to-noise ratio SNReff_{k,f_m}, obtain the transmission index TI_{k,f_m} at that modulation frequency f_m of the corresponding octave-band filtered signal band k. Specifically, the transmission index TI_{k,f_m} is calculated using the following formula (7):

[0106]
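
Formula (7) is not reproduced in this text. A minimal sketch, assuming the standard linear mapping TI = (SNReff + 15)/30, which sends an effective SNR of -15 dB to 0 and +15 dB to 1. The function name and the `limit_db` parameter are hypothetical.

```python
def transmission_index(snr_eff_db, limit_db=15.0):
    """Map an effective SNR in [-limit, +limit] dB linearly onto [0, 1]."""
    clipped = max(-limit_db, min(limit_db, snr_eff_db))
    return (clipped + limit_db) / (2.0 * limit_db)
```

Inputs outside the assumed range are clipped first, so the returned index always lies between 0 and 1.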

[0107] S223. Calculate the mean of the n transmission indices TI_{k,f_m} of any octave-band filtered signal band k to obtain the modulation transfer index M_k of that band. The modulation transfer index M_k ranges from -15 dB to +15 dB. Specifically, the modulation transfer index M_k is calculated using the following formula (8):

[0108]
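
Formula (8) is not reproduced in this text. Going by the description in S223, it is an arithmetic mean over the n modulation frequencies of one band, M_k = (1/n)·Σ TI_{k,f_m}. A minimal sketch under that reading, with a hypothetical function name:

```python
def modulation_transfer_index(transmission_indices):
    """M_k: arithmetic mean of the n transmission indices of one band k."""
    n = len(transmission_indices)
    if n < 1:
        raise ValueError("need at least one transmission index (n >= 1)")
    return sum(transmission_indices) / n
```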

[0109] S224. Based on the modulation transfer indices M_k of the p octave band filtered signal bands, calculate the speech quality evaluation result (STI) of the target speech signal at the first target position. Specifically, the STI is calculated by the following formula (9):

[0110]

[0111] where α_k is the gender-specific weighting factor for octave-band filtered signal band k;

[0112] β_k is the gender-specific redundancy factor between octave-band filtered signal band k and octave-band filtered signal band k+1;

[0113] M_k is the modulation transfer index of octave-band filtered signal band k.

[0114] It should be noted that while the STI method can distinguish between male and female speech signals, in practice, to simplify the measurement process, only male speech is used to evaluate the speech transmission path. Table 1 shows the male STI weighting factor α and redundancy factor β as functions of the octave band.

[0115] Table 1

[0116]
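
Formula (9) and Table 1 are not reproduced in this text. A minimal sketch, assuming the standard STI combination STI = Σ α_k·M_k - Σ β_k·sqrt(M_k·M_{k+1}) and the male weighting and redundancy factors published in IEC 60268-16. These numbers are quoted from that standard as an assumption, not taken from the patent's Table 1, and should be checked against it before use. Clamping the result to [0, 1] is also an assumption.

```python
import math

# Assumed male factors for the 125 Hz ... 8 kHz octave bands (IEC 60268-16).
ALPHA_MALE = [0.085, 0.127, 0.230, 0.233, 0.309, 0.224, 0.173]
# Redundancy factor between band k and band k+1 (one fewer than the bands).
BETA_MALE = [0.085, 0.078, 0.065, 0.011, 0.047, 0.095]


def speech_transmission_index(mti, alpha=ALPHA_MALE, beta=BETA_MALE):
    """Combine the per-band modulation transfer indices M_k into one STI."""
    if len(mti) != len(alpha) or len(beta) != len(alpha) - 1:
        raise ValueError("expected one M_k and alpha per band, one beta per band pair")
    weighted = sum(a * m for a, m in zip(alpha, mti))
    redundancy = sum(
        b * math.sqrt(m_k * m_next)
        for b, m_k, m_next in zip(beta, mti, mti[1:])
    )
    return min(1.0, max(0.0, weighted - redundancy))
```

With these assumed factors the α values sum to 1.381 and the β values to 0.381, so a signal with M_k = 1 in every band yields an STI of 1.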

[0117] S3. Display the voice quality evaluation results obtained in step S2 on the front end.

[0118] As shown in Figures 6a-6d, the interface display methods used to present the voice quality evaluation results on the front end include, but are not limited to:

[0119] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic mobile-signal-strength indicator; or

[0120] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic Wi-Fi signal indicator; or

[0121] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic dashboard; or

[0122] The voice quality evaluation results are displayed on the interface as numerical values together with a dynamic range bar.
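
One of the listed display options, a numerical value with a range bar, might be rendered as follows. This is a text-only sketch; the qualitative labels and thresholds are the commonly cited STI rating bands and are assumed here, since Figure 1 is not reproduced in this text. Both function names are hypothetical.

```python
def sti_label(sti):
    """Qualitative rating for an STI score (assumed thresholds)."""
    if sti < 0.30:
        return "bad"
    if sti < 0.45:
        return "poor"
    if sti < 0.60:
        return "fair"
    if sti < 0.75:
        return "good"
    return "excellent"


def render_range_bar(sti, width=20):
    """Return a one-line text bar showing the score and its rating."""
    sti = min(1.0, max(0.0, sti))
    filled = int(round(sti * width))
    bar = "#" * filled + "-" * (width - filled)
    return f"STI {sti:.2f} [{bar}] {sti_label(sti)}"


if __name__ == "__main__":
    for score in (0.25, 0.50, 0.82):
        print(render_range_bar(score))
```

Calling `render_range_bar` each time a new result arrives would give the dynamic, continuously updated display that the embodiment describes.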

[0123] In summary, the speech quality evaluation method for existing speech files provided in this embodiment can evaluate the speech quality of any existing speech file. Compared with the existing methods that require professional equipment and standard methods to complete the evaluation, this application has the advantages of being more universal, convenient, and providing real-time feedback.

[0124] Furthermore, the speech quality evaluation method for existing speech files provided in this embodiment adopts a method of constructing speech quality models for different octave bands to obtain reverberation time when calculating the speech quality evaluation results, which has strong robustness and reproducibility.

[0125] Corresponding to the above-described speech quality evaluation method, this embodiment further provides a speech quality evaluation device corresponding to the evaluation method, which implements the method through various functional modules. The speech quality evaluation device includes:

[0126] The receiving module is used to receive the target voice signal, which includes the target existing voice file played by the target sound-producing device;

[0127] The processing module is used to calculate and obtain the corresponding speech quality evaluation result in real time based on the target speech signal;

[0128] The display module is used to display the voice quality evaluation results on the front end.

[0129] The model training module is used to pre-train p speech quality models corresponding to the p different octave band filtered signal bands.
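
The module split described above can be sketched as a small class that wires callables together. All names are hypothetical, and the stubs stand in for real audio capture, model inference, and user-interface code. The model training module is assumed to run offline beforehand, so it is not part of the real-time loop shown here.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class SpeechQualityEvaluationDevice:
    """Receiving, processing, and display modules chained for one evaluation."""

    receive: Callable[[], Sequence[float]]        # receiving module
    process: Callable[[Sequence[float]], float]   # processing module
    display: Callable[[float], None]              # display module

    def run_once(self) -> float:
        signal = self.receive()        # capture the played-back speech file
        result = self.process(signal)  # compute the evaluation result
        self.display(result)           # show it on the front end
        return result


if __name__ == "__main__":
    device = SpeechQualityEvaluationDevice(
        receive=lambda: [0.0, 0.1, -0.1],          # stub microphone input
        process=lambda signal: 0.62,               # stub model output
        display=lambda sti: print(f"STI {sti:.2f}"),
    )
    device.run_once()
```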

[0130] The processing module includes:

[0131] The feature extraction unit is used to extract features from the target speech signal according to p different octave band filtered signal bands to obtain corresponding p sets of target features, where p≥2;

[0132] An evaluation unit is used to obtain a speech quality evaluation result of the first target position corresponding to the target speech signal based on the p groups of target features.

[0133] Furthermore, the feature extraction unit specifically includes:

[0134] The first processing subunit is configured to filter the target speech signal to obtain p different octave band filtered signal bands, each of which includes n modulation frequencies f_m, where n ≥ 1 and m ≥ 1;

[0135] The second processing subunit is used to extract the envelope features of the p groups of different octave band filtered signal bands respectively.

[0136] The third processing subunit is used to take the octave band filtered signal band corresponding to any one of the sub-band envelope features as input, and obtain the reverberation time T of p groups respectively through the pre-trained speech quality model corresponding to the corresponding octave band filtered signal band;

[0137] The fourth processing subunit is used to obtain the corresponding p groups of target features based on the reverberation time T of the p groups, wherein the target features are modulation transfer function values.

[0138] Specifically, the second processing subunit is used to perform half-wave envelope detection on the p groups of different octave band filtered signal bands to obtain the envelope characteristics of the p groups of sub-bands.

[0139] The third processing subunit is specifically used for:

[0140] Divide any of the octave band filtered signals into N consecutive speech segments of equal duration, where N≥2;

[0141] For any of the N speech segments included in any octave band filtered signal band, feature extraction is performed on any speech segment using a combination structure of one or more of the following: convolutional neural network, linear connection layer, activation layer, and normalization layer, to obtain a matrix of shape [P,Q], thereby obtaining the corresponding N speech segment features;

[0142] Any speech segment feature among the N obtained speech segment features is interacted through a combination of one or more of the following structures: Long Short-Term Memory module, multi-head / single-head attention module, linear connection layer, activation layer, and normalization layer, to obtain the corresponding speech segment interaction feature;

[0143] Based on the obtained N speech segment interaction features, N reverberation times $T_N$ corresponding to the N speech segment interaction features are respectively predicted by a linear regression layer or a classification layer;

[0144] The N reverberation times $T_N$ corresponding to each octave band filtered signal band are averaged to obtain the p groups of reverberation times T corresponding to the respective octave band filtered signal bands.
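
The segmentation and averaging steps above can be sketched as follows. The neural network itself is not reproduced: `dummy_predictor` is a hypothetical stand-in for the trained per-band model, and the helper names are assumptions made for illustration only.

```python
from statistics import mean

def split_into_segments(band_signal, n_segments):
    """Split a band signal into N consecutive segments of equal length (tail samples are dropped)."""
    if n_segments < 2:
        raise ValueError("N must be at least 2")
    seg_len = len(band_signal) // n_segments
    if seg_len == 0:
        raise ValueError("signal is too short for the requested number of segments")
    return [band_signal[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]

def band_reverberation_time(band_signal, n_segments, predict_segment_t):
    """Average the N per-segment predictions into one reverberation time T for the band."""
    segments = split_into_segments(band_signal, n_segments)
    return mean(predict_segment_t(segment) for segment in segments)

def dummy_predictor(segment):
    """Hypothetical stand-in for the trained model: maps mean magnitude to a time in seconds."""
    return 0.5 + 0.1 * (sum(abs(x) for x in segment) / len(segment))

signal = [0.0] * 400 + [1.0] * 400
t_band = band_reverberation_time(signal, 4, dummy_predictor)
```

Averaging over segments reduces the influence of any single segment whose content gives an unreliable prediction.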

[0145] The fourth processing subunit is specifically used to obtain, based on the value of any modulation frequency $f_m$ and the corresponding reverberation time T, the modulation transfer function value $m_{k,f_m}$ at that modulation frequency $f_m$ of any octave band filtered signal band.
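
A minimal sketch of this step, assuming the classical relation between reverberation time and the modulation transfer function for an exponentially decaying, noise-free reverberant field, $m(f_m) = 1/\sqrt{1 + (2\pi f_m T / 13.8)^2}$. Whether this application uses exactly this relation is an assumption. The list of 14 modulation frequencies is the set used by the full STI method of IEC 60268-16, and the function name is hypothetical.

```python
import math

# Modulation frequencies in Hz (assumed: the 14 one-third-octave values of the full STI method).
MODULATION_FREQS_HZ = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                       3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]

def mtf_from_reverberation(f_m, t_reverb):
    """Modulation transfer function value at modulation frequency f_m for reverberation time T."""
    return 1.0 / math.sqrt(1.0 + (2.0 * math.pi * f_m * t_reverb / 13.8) ** 2)

# One row of MTF values for a band whose predicted reverberation time is 0.8 s.
mtf_row = [mtf_from_reverberation(f, 0.8) for f in MODULATION_FREQS_HZ]
```

Under this relation a longer reverberation time or a higher modulation frequency lowers the value, meaning the speech envelope is flattened more strongly.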

[0146] Furthermore, the evaluation units specifically include:

[0147] The fifth processing subunit is used to obtain, based on any of the modulation transfer function values $m_{k,f_m}$, the effective signal-to-noise ratio $\mathrm{SNR}_{\mathrm{eff},k,f_m}$ at the modulation frequency $f_m$ of the corresponding octave band filtered signal band k;

[0148] The sixth processing subunit is used to obtain, based on any of the effective signal-to-noise ratios $\mathrm{SNR}_{\mathrm{eff},k,f_m}$, the transmission index $\mathrm{TI}_{k,f_m}$ at the modulation frequency $f_m$ of the corresponding octave band filtered signal band k;

[0149] The seventh processing subunit is used to average the n transmission indices $\mathrm{TI}_{k,f_m}$ of any octave band filtered signal band k to obtain the modulation transfer index $M_k$ of that octave band filtered signal band k;

[0150] The eighth processing subunit is used to calculate, based on the modulation transfer indices $M_k$ of the p octave band filtered signal bands, the speech quality evaluation result corresponding to the target speech signal at the first target location.
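
The chain from modulation transfer function value to modulation transfer index can be sketched as follows. The sketch assumes the conventions of IEC 60268-16: $\mathrm{SNR}_{\mathrm{eff}} = 10\log_{10}(m/(1-m))$ limited to the range from -15 dB to +15 dB, and $\mathrm{TI} = (\mathrm{SNR}_{\mathrm{eff}} + 15)/30$. These conventions, the function names, and the small `m_floor` guard are assumptions, not text from this application.

```python
import math
from statistics import mean

def effective_snr_db(m, m_floor=1e-6):
    """Effective signal-to-noise ratio in dB for one MTF value, limited to [-15, 15] dB."""
    m = min(max(m, m_floor), 1.0 - m_floor)   # keep the logarithm argument finite
    snr = 10.0 * math.log10(m / (1.0 - m))
    return min(max(snr, -15.0), 15.0)

def transmission_index(snr_db):
    """Map an effective SNR in [-15, 15] dB linearly onto a transmission index in [0, 1]."""
    return (snr_db + 15.0) / 30.0

def modulation_transfer_index(mtf_values):
    """Mean of the transmission indices over the n modulation frequencies of one band."""
    return mean(transmission_index(effective_snr_db(m)) for m in mtf_values)

m_k = modulation_transfer_index([0.9, 0.7, 0.5, 0.3])
```

An MTF value of 0.5 corresponds to 0 dB and a transmission index of 0.5; values close to 1 saturate at a transmission index of 1.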

[0151] Specifically, the eighth processing subunit performs the following calculation (9) to obtain the speech quality evaluation result corresponding to the target speech signal at the first target position:

[0152]

[0153] where $\alpha_k$ represents the gender-specific weighting factor for octave band filtered signal band k;

[0154] $\beta_k$ represents the gender-specific redundancy factor between octave band filtered signal band k and octave band filtered signal band k+1;

[0155] $M_k$ represents the modulation transfer index of octave band filtered signal band k.
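
The definitions of the weighting factor, the redundancy factor between adjacent bands, and the modulation transfer index match the band-weighted sum used in IEC 60268-16, $\sum_k \alpha_k M_k - \sum_k \beta_k \sqrt{M_k M_{k+1}}$. The sketch below assumes that calculation (9) takes this form. The function name is hypothetical, and the weights in the example are illustrative placeholders, not the standard's gender-specific values.

```python
import math

def speech_transmission_index(mti, alpha, beta):
    """Weighted band sum minus adjacent-band redundancy correction (assumed form of calculation (9))."""
    if len(alpha) != len(mti) or len(beta) != len(mti) - 1:
        raise ValueError("need one alpha per band and one beta per adjacent band pair")
    weighted = sum(a * m for a, m in zip(alpha, mti))
    redundancy = sum(b * math.sqrt(mti[k] * mti[k + 1]) for k, b in enumerate(beta))
    return weighted - redundancy

# Illustrative placeholder weights for three bands; sum(alpha) - sum(beta) equals 1.
mti = [0.6, 0.6, 0.6]
alpha = [0.4, 0.4, 0.4]
beta = [0.1, 0.1]
sti = speech_transmission_index(mti, alpha, beta)
```

When the weights satisfy sum(alpha) - sum(beta) = 1, identical modulation transfer indices in every band yield a result equal to that common value, which is a convenient sanity check for any weight table.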

[0156] The model training module is specifically used for:

[0157] Based on any existing speech file in the existing speech file sample set, obtain p different octave band filtered signal band sample sets. Each octave band filtered signal band sample set includes q modulation frequency samples and corresponding q impulse response samples. Each impulse response sample includes a reverberation time sample T0, where q ≥ 2.

[0158] Using the q modulation frequency samples as input and the corresponding q reverberation time samples T0 as output, p speech quality models corresponding to p octave band filtered signal bands are obtained by training based on a neural network.

[0159] When displaying the voice quality evaluation results on the front end, the display module may use the following display methods, including but not limited to displaying the voice quality evaluation results as numerical values and dynamic motion signals on the interface; or displaying the voice quality evaluation results as numerical values and dynamic Wi-Fi signals on the interface; or displaying the voice quality evaluation results as numerical values and dynamic dashboards on the interface; or displaying the voice quality evaluation results as numerical values and dynamic progress bars on the interface.
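
As one possible illustration of the numerical-value-plus-progress-bar display, the sketch below renders a score as text. It assumes the evaluation result lies in the range 0 to 1; the function name and the text rendering are hypothetical and stand in for whatever graphical front end is actually used.

```python
def render_progress_bar(score, width=10):
    """Format a score in [0, 1] as a numerical value followed by a text progress bar."""
    clamped = min(max(score, 0.0), 1.0)   # guard against out-of-range results
    filled = int(clamped * width)         # number of filled cells
    return f"{clamped:.2f} [{'#' * filled}{'-' * (width - filled)}]"

# Re-render each time a new evaluation result arrives to obtain a dynamic display.
print(render_progress_bar(0.62))
```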

[0160] It should be noted that the voice quality evaluation device for existing voice files provided in the above embodiments is only illustrated by the division of the above functional modules when performing voice quality evaluation services. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the system can be divided into different functional modules to complete all or part of the functions described above. In addition, the voice quality evaluation device and the voice quality evaluation method embodiments provided in the above embodiments belong to the same concept, that is, the device is based on the method, and its specific implementation process is detailed in the method embodiments, which will not be repeated here.

[0161] In addition, as shown in Figure 7, this embodiment also provides a voice quality evaluation system for existing voice files, the evaluation system comprising:

[0162] At least one voice receiving device is provided for receiving a target voice signal, the target voice signal including a target existing voice file played by a target sound-producing device; preferably, the voice receiving device is a voice sensor.

[0163] At least one display device, the display device being used to display the speech quality evaluation results at the front end;

[0164] A smart device configured to receive the target voice signal sent by the at least one voice receiving device, to perform real-time calculation based on the target voice signal using the voice quality evaluation method for existing voice files to obtain the corresponding voice quality evaluation result, and to send the voice quality evaluation result to the at least one display device for front-end display.

[0165] Furthermore, this embodiment also provides an electronic device, including:

[0166] One or more processors; and

[0167] A memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform operations as described in any one of the methods for evaluating the speech quality of existing speech files.

[0168] The specific execution details and corresponding beneficial effects of the speech quality evaluation method executed by the program instructions are consistent with the description of the aforementioned method, and will not be repeated here.

[0169] Furthermore, this embodiment also provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method as described in any one of the methods for evaluating the voice quality of existing voice files.

[0170] All the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application. That is, any number of embodiments can be combined to meet the needs of different application scenarios. All of them are within the protection scope of this application and will not be described in detail here.

[0171] It should be noted that the above description is only a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A method for evaluating the speech quality of existing speech files, characterized in that the evaluation method includes: receiving a target voice signal, wherein the target voice signal includes a target existing voice file played by a target sound-producing device; calculating a corresponding speech transmission index in real time based on the target speech signal, including: extracting features from the target speech signal to obtain at least one set of target speech signals; filtering each of the at least one set of target speech signals to obtain p groups of different octave band filtered signal bands, each group of octave band filtered signal bands including n modulation frequencies $f_m$, where n≥1 and m≥1; performing envelope extraction on the p groups of different octave band filtered signal bands to obtain p groups of sub-band envelope features; taking the octave band filtered signal band corresponding to any group of sub-band envelope features as input, and obtaining p groups of reverberation times T through the pre-trained speech quality model corresponding to that octave band filtered signal band; obtaining the corresponding p groups of target features based on the p groups of reverberation times T, the target features being modulation transfer function values; and obtaining, based on the p groups of target features, the speech transmission index corresponding to the target speech signal at the first target position; and displaying the speech transmission index on the front end.

2. The evaluation method as described in claim 1, characterized in that performing envelope extraction on the p groups of different octave band filtered signal bands to obtain the p groups of sub-band envelope features includes: performing half-wave envelope detection on the p groups of different octave band filtered signal bands to obtain the p groups of sub-band envelope features.

3. The evaluation method as described in claim 1, characterized in that taking the octave band filtered signal band corresponding to any group of sub-band envelope features as input, and obtaining p groups of reverberation times T through the pre-trained speech quality model corresponding to that octave band filtered signal band, includes: dividing any of the octave band filtered signal bands into N consecutive speech segments of equal duration, where N≥2; for any of the N speech segments included in any octave band filtered signal band, performing feature extraction on the speech segment using a combination of one or more of the following structures: convolutional neural network, linear connection layer, activation layer, and normalization layer, to obtain a matrix of shape [P,Q], thereby obtaining the corresponding N speech segment features, where P corresponds to the number p of the octave band filtered signal bands k, and Q corresponds to the number q of the modulation frequency samples; subjecting each of the N obtained speech segment features to feature interaction through a combination of one or more of the following structures: Long Short-Term Memory module, multi-head or single-head attention module, linear connection layer, activation layer, and normalization layer, to obtain the corresponding speech segment interaction feature; based on the obtained N speech segment interaction features, respectively predicting N reverberation times $T_N$ corresponding to the N speech segment interaction features by a linear regression layer or a classification layer; and averaging the N reverberation times $T_N$ corresponding to each octave band filtered signal band to obtain the p groups of reverberation times T corresponding to the respective octave band filtered signal bands.

4. The evaluation method as described in claim 1, characterized in that obtaining the corresponding p groups of target features based on the p groups of reverberation times T includes: obtaining, based on the value of any modulation frequency $f_m$ and the corresponding reverberation time T, the modulation transfer function value $m_{k,f_m}$ at that modulation frequency $f_m$ of any octave band filtered signal band.

5. The evaluation method as described in claim 4, characterized in that obtaining the speech transmission index corresponding to the target speech signal at the first target position based on the p groups of target features includes: obtaining, based on any of the modulation transfer function values $m_{k,f_m}$, the effective signal-to-noise ratio $\mathrm{SNR}_{\mathrm{eff},k,f_m}$ at the modulation frequency $f_m$ of the corresponding octave band filtered signal band k; obtaining, based on any of the effective signal-to-noise ratios $\mathrm{SNR}_{\mathrm{eff},k,f_m}$, the transmission index $\mathrm{TI}_{k,f_m}$ at the modulation frequency $f_m$ of the corresponding octave band filtered signal band k; averaging the n transmission indices $\mathrm{TI}_{k,f_m}$ of any octave band filtered signal band k to obtain the modulation transfer index $M_k$ of that octave band filtered signal band k; and calculating, based on the modulation transfer indices $M_k$ of the p octave band filtered signal bands, the speech transmission index corresponding to the target speech signal at the first target position.

6. The evaluation method according to any one of claims 1 to 5, characterized in that the evaluation method further includes pre-training p speech quality models corresponding to the p groups of different octave band filtered signal bands, including: obtaining, based on any existing speech file in the existing speech file sample set, p different octave band filtered signal band sample sets, wherein each octave band filtered signal band sample set includes q modulation frequency samples and corresponding q impulse response samples, and each impulse response sample includes a reverberation time sample T0, where q≥2; and training, based on a neural network, with the q modulation frequency samples as input and the corresponding q reverberation time samples T0 as output, to obtain p speech quality models corresponding to the p octave band filtered signal bands.

7. The evaluation method as described in claim 1, characterized in that the speech transmission index is displayed on the front end in ways that include, but are not limited to: displaying the speech transmission index on the interface as a numerical value and a dynamic movement signal; or displaying the speech transmission index on the interface as a numerical value and a dynamic Wi-Fi signal; or displaying the speech transmission index on the interface as a numerical value and a dynamic dashboard; or displaying the speech transmission index on the interface as a numerical value and a dynamic bar.

8. A device for evaluating the voice quality of existing voice files, characterized in that the device includes: a receiving module used to receive a target voice signal, the target voice signal including a target existing voice file played by a target sound-producing device; a processing module used to calculate a corresponding speech transmission index in real time based on the target speech signal, including: extracting features from the target speech signal to obtain at least one set of target speech signals; filtering each of the at least one set of target speech signals to obtain p groups of different octave band filtered signal bands, each group of octave band filtered signal bands including n modulation frequencies $f_m$, where n≥1 and m≥1; performing envelope extraction on the p groups of different octave band filtered signal bands to obtain p groups of sub-band envelope features; taking the octave band filtered signal band corresponding to any group of sub-band envelope features as input, and obtaining p groups of reverberation times T through the pre-trained speech quality model corresponding to that octave band filtered signal band; obtaining the corresponding p groups of target features based on the p groups of reverberation times T, the target features being modulation transfer function values; and obtaining, based on the p groups of target features, the speech transmission index corresponding to the target speech signal at the first target position; and a display module used to display the speech transmission index on the front end.

9. A voice quality evaluation system for existing voice files, characterized in that the evaluation system includes: at least one voice receiving device, the voice receiving device being used to receive a target voice signal, the target voice signal including a target existing voice file played by a target sound-producing device; at least one display device, the display device being used to display the speech transmission index at the front end; and a smart device configured to receive the target voice signal sent by the at least one voice receiving device, to perform the voice quality evaluation method for existing voice files as described in any one of claims 1 to 7 so as to calculate the corresponding speech transmission index in real time based on the target voice signal, and to send the speech transmission index to the at least one display device for front-end display.

10. An electronic device, characterized in that it includes: one or more processors; a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, perform the speech quality evaluation method for existing speech files as described in any one of claims 1 to 7; and a display associated with the one or more processors, the display being used to display in real time the speech transmission index obtained after the one or more processors execute the program instructions.

11. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the speech quality evaluation method for existing speech files as described in any one of claims 1 to 7.