Speech understanding model unified evaluation method, storage medium and device

By introducing a unified standard prediction output format and text normalization processing, the consistency and fairness issues in speech understanding model evaluation are resolved, stable evaluation is achieved in complex environments, and the comparability of evaluation results and the accuracy of model selection are improved.

CN122201253A · Pending · Publication Date: 2026-06-12 · AISPEECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: AISPEECH CO LTD
Filing Date: 2026-03-31
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing evaluation methods for speech understanding models lack consistency and fairness, and they do not adequately reflect the robustness and generalization ability of models in complex environments. Furthermore, model training is difficult to reproduce, so evaluation results lack stability and comparability.

Method used

A unified standard prediction output format is introduced, and a correspondence between the predicted text and the reference text is established based on sample identifiers and metadata information. A unified text normalization process and scoring script are then executed to generate a standardized evaluation report that covers the complex factors of real deployment environments.

Benefits of technology

It improves the consistency and fairness of the evaluation process, enhances the stability and comparability of evaluation results, supports horizontal comparisons under multi-task and multi-model conditions, and enhances the supporting value for model selection and engineering deployment.

Generated by Eureka AI based on patent content.

Abstract

The application discloses a unified evaluation method and device for speech understanding models. The method comprises: obtaining model prediction files output by a plurality of speech understanding models for an evaluation task, together with the corresponding standard reference files, wherein the model prediction files comply with a preset standard prediction output format; parsing the model prediction files to extract sample identifiers, metadata information and predicted texts, establishing a correspondence between the predicted texts and the reference texts, and identifying the speech understanding task type by combining the metadata information with the standard reference files; calling the text normalization component and the unified scoring interface corresponding to the task type to perform unified normalization and performance scoring on the predicted texts and the reference texts, obtaining task performance indicators for each model; performing normalization conversion according to preset reference performance values to obtain relative performance scores; and generating a standardized evaluation report. Unified evaluation and comparable analysis of different speech understanding models are thereby realized.

Description

Technical Field

[0001] This application relates to the field of intelligent driving system technology, and in particular to a unified evaluation method, storage medium and device for speech understanding models.

Background Technology

[0002] With the continuous development of basic speech models, large-scale speech models, and various speech processing architectures, speech understanding technology has been widely applied to multiple tasks such as automatic speech recognition, speech translation, emotion recognition, and semantic understanding. Conducting performance testing, result comparison, and capability evaluation of different models has become a crucial part of research validation and engineering deployment in the field of speech understanding. Related work typically relies on publicly available benchmark datasets, evaluation platforms, or experimental procedures built by the research entity itself to measure and report the model's performance on specific tasks.

[0003] Currently, while existing evaluation methods can support model testing and result presentation to some extent, they still lack sufficient consistency constraints on the evaluation process itself. Different research entities or systems often differ in data processing methods, result organization, and pre- and post-scoring processing rules. This makes the results obtained for the same task susceptible to the influence of different implementation paths, thereby weakening the fairness and comparability of comparisons between different models. For applications focused on model selection and capability analysis, these differences further affect the stability and reference value of evaluation conclusions.

[0004] Meanwhile, most currently widely used evaluation systems still primarily focus on performance measurements under standard test conditions, offering limited coverage of the complexities encountered in real-world deployment environments. For instance, in actual voice interaction scenarios, multiple factors often interact, including noise interference, far-field acquisition, multi-speaker interaction, accent differences, and contextual changes. Relying solely on test results under relatively ideal conditions often fails to fully reflect the model's robustness, generalization ability, and practical applicability boundaries in complex environments. Therefore, the supporting role of evaluation results in engineering deployment and model selection still has room for improvement.

[0005] Furthermore, as speech model training processes become increasingly complex, model performance is not only related to model structure but also closely related to the organization of training data, training process configuration, and engineering implementation conditions. Differences in training environments and implementation paths among different research entities can easily lead to high difficulty in model reproduction, making it difficult to establish a consistent and reproducible basis for comparing the capabilities of different models.

Summary of the Invention

[0006] This application provides a unified evaluation method, device, storage medium, and program product for speech understanding models, which is used to solve at least one of the above-mentioned technical problems.

[0007] In a first aspect, embodiments of this application provide a unified evaluation method for speech understanding models, comprising: acquiring model prediction files output by multiple speech understanding models to be evaluated for an evaluation task, and a standard reference file corresponding to the evaluation task, wherein the model prediction files follow a preset standard prediction output format; parsing the model prediction files based on the standard prediction output format to extract sample identifiers, metadata information, and predicted text; establishing a correspondence between the predicted text and the reference text in the standard reference file based on the sample identifiers; identifying the speech understanding task type corresponding to the current evaluation task based on the metadata information, and, when the metadata information is insufficient to determine the speech understanding task type, determining it in combination with the standard reference file; calling, according to the identified speech understanding task type, the text normalization component corresponding to that task type to perform unified text normalization on the extracted predicted text and the reference text in the standard reference file, obtaining normalized predicted text and normalized reference text; calling, through a preset unified scoring interface, a scoring script corresponding to the speech understanding task type, and calculating, from the normalized predicted text and the normalized reference text, the task performance indicators of each of the multiple speech understanding models under the current evaluation task; normalizing and converting the task performance indicators of each speech understanding model according to the preset reference performance value corresponding to each task performance indicator, obtaining a relative performance score for each speech understanding model; and aggregating the task performance indicators and the relative performance scores of each speech understanding model to generate a standardized evaluation report.

[0008] In a second aspect, embodiments of this application provide a storage medium storing one or more programs including execution instructions, which can be read and executed by electronic devices (including but not limited to computers, servers, or network devices) to perform the steps of the method described in any of the above claims of this application.

[0009] In a third aspect, a computer device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the steps of the method described in any of the preceding claims of this application.

[0010] In a fourth aspect, embodiments of this application also provide a computer program product, the computer program product including a computer program stored on a storage medium, the computer program including program instructions which, when executed by a computer, cause the computer to perform the steps of any of the methods described above.

[0011] The beneficial effects of the embodiments of this application are as follows. A unified standard prediction output format is imposed on the prediction results of multiple speech understanding models, the correspondence between predicted text and reference text is established based on sample identifiers, and the evaluation task type is identified automatically by combining metadata information with the standard reference file; unified text normalization and scoring script invocation are then performed for the identified task. This places uniform constraints on the differences in result organization, text pre-processing and post-processing rules, and scoring execution among different models, systems, and research entities in the evaluation chain, reducing the interference of factors unrelated to model capability and improving the consistency, fairness, and repeatability of the evaluation process. At the same time, the performance indicators of each task are normalized and converted against preset reference performance values to form relative performance scores, and a standardized evaluation report is generated from these scores together with the original task performance indicators. This enables unified characterization and horizontal comparison of evaluation results under multi-task and multi-model conditions, improving the stability and comparability of evaluation conclusions and their supporting value for model selection, capability analysis, and engineering deployment.

Attached Figure Description

[0012] To illustrate the technical solutions of the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0013] Figure 1 is a flowchart of an example of a unified evaluation method for speech understanding models according to an embodiment of this application; Figure 2 is a flowchart of an example of performing an automated training process conversion operation in a method according to an embodiment of this application; Figure 3 is a system operation principle diagram of an example of a unified evaluation method for speech understanding models according to an embodiment of this application; Figure 4 is a schematic diagram illustrating the effect of an example of the evaluation scope of the unified evaluation and agent-assisted training conversion workflow according to an embodiment of this application; Figure 5 is an overview architecture diagram of the unified evaluation and agent-assisted training conversion workflow according to an embodiment of this application; Figure 6 is a schematic diagram of the structure of an embodiment of the electronic device of this application.

Detailed Implementation

[0014] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of this application can be combined with each other.

[0015] It should also be noted that, in this document, the terms "comprising" and "including" are non-exclusive: a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0016] It should be noted that in current related technologies, a technical approach has emerged for evaluating speech understanding models, characterized by public benchmark systems, experimental platforms, and datasets with accompanying evaluation scripts. For example, benchmark systems and dynamic speech evaluation systems for general speech processing tasks, as well as evaluation platforms for audio language models or multimodal audio understanding, can all support model capability testing and result display to a certain extent. These solutions typically compare the performance of different models on tasks such as automatic speech recognition, speech translation, emotion recognition, and speech understanding by providing standardized speech data, pre-setting task types, selecting corresponding evaluation metrics, and outputting leaderboards or experimental result reports. In addition, some research entities use public speech datasets in conjunction with custom evaluation scripts to conduct specialized tests on target models.

[0017] Furthermore, the evaluation processes in current related technologies are generally not strictly standardized. Although different approaches may use the same or similar datasets and evaluation metrics, significant implementation differences often exist before and after scoring. For example, different research entities may employ different text normalization methods, numerical expression processing methods, punctuation processing methods, label mapping rules, segmentation rules, and task-specific post-processing strategies for processing prediction results and reference results; different result organization formats and script interfaces are also frequently used for the model output itself. These differences directly affect the final metric calculation process, making test results under the same task susceptible to the influence of implementation path differences, thereby weakening the fairness and consistency of horizontal comparisons between different models.

[0018] Furthermore, most current evaluation schemes in related technologies still primarily focus on performance measurement under standard conditions, failing to adequately cover the complex factors in real-world deployment environments. While some schemes attempt to improve coverage by expanding data scale, adding task types, or supplementing with new evaluation sets, these approaches primarily address the issue of expanding the evaluation object, rather than simultaneously establishing a unified stress testing mechanism. Especially in scenarios involving noise interference, far-field acquisition, multi-speaker interaction, dialectal accent differences, code-switching, and contextual hot word intervention, model performance is often affected by a combination of factors. Relying solely on results under conventional testing conditions is insufficient to fully characterize the model's robustness and applicability boundaries in complex real-world environments.

[0019] At the training and reproducibility level, current technologies typically rely on paper descriptions, open-source code, or open data to support model comparisons, but it remains difficult to establish a unified and reproducible foundation for architectural comparison. On the one hand, modern speech models, especially large-scale speech models, often involve heterogeneous data mixing, multi-stage training processes, specific training strategies, and complex engineering dependencies, making it difficult for different research entities to complete training and reproducibility under consistent conditions. On the other hand, the model structures and training approaches disclosed in papers often fail to fully cover the configuration details and engineering implementations required for actual training, resulting in an implementation gap between the paper descriptions and the training code. Even with publicly available code, limitations such as complex operating environments, intricate dependencies, and inconsistent training frameworks often hinder the effective reduction of implementation variance. Therefore, how to further improve the consistency, realism, and reproducibility of the experimental process based on unified evaluation has become a noteworthy issue in the evaluation and comparison of speech understanding models.

[0020] It should be understood that the above description of the relevant technologies is intended only to help the public better understand the inventive spirit and motivation of this application, and is not intended to limit this application. Furthermore, the technical solutions described in the above-mentioned relevant technologies are not prior art, and may also be undisclosed technical solutions, such as those under research or in the laboratory stage.

[0021] The technical solutions in this application, including the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information, comply with relevant laws and regulations and do not violate public order and good morals.

[0022] Figure 1 shows a flowchart of an example of a unified evaluation method for speech understanding models according to an embodiment of this application.

[0023] Regarding the execution subject of the method in the embodiments of this application, it can be any controller or processor with computing or processing capabilities. In some examples, the method in the embodiments of this application can be integrated and configured in an electronic device or terminal through software, hardware or a combination of software and hardware. The type of terminal or electronic device can be diverse, such as a server, desktop computer, workstation or cloud evaluation node.

[0024] For example, the execution entity of the method in the embodiments of this application can be an evaluation controller integrated in the unified evaluation platform for speech understanding models. The evaluation controller executes computer-executable instructions stored in the memory to realize data processing and evaluation control functions related to the unified evaluation of speech understanding models.

[0025] As shown in Figure 1, this method constructs a unified evaluation data processing flow to consistently receive, process, and score the prediction results of multiple speech understanding models to be evaluated, thereby achieving cross-model performance comparison under a unified protocol. In step S110, the model prediction files output by the multiple speech understanding models to be evaluated for the evaluation task, as well as the standard reference file corresponding to the evaluation task, are obtained. The model prediction files follow a preset standard prediction output format.

[0026] In practical applications, different speech understanding models often exhibit significant differences in their original output results due to variations in modeling paradigms, output mechanisms, and engineering implementation methods. For example, different models may differ in the organization of result fields, the way output content is carried, and the file structure. To facilitate unified processing later, this embodiment applies a unified format constraint to the model prediction files during the data access stage, ensuring that each model to be evaluated organizes and encapsulates its prediction results according to a preset standard prediction output format before entering the evaluation process.

[0027] By establishing uniform format constraints at the input end, the evaluation system can process prediction results from different models using a consistent data reception method, without needing to design separate parsing processes for different models. This not only helps reduce the complexity of subsequent data processing but also helps reduce additional biases caused by differences in output format, inconsistent field organization, or non-standard result expression.

[0028] In some embodiments, to ensure that the prediction results output by heterogeneous models can be parsed by the evaluation system without loss or ambiguity, the aforementioned standard prediction output format is designed as a unified data encapsulation structure. Specifically, the standard prediction output format includes at least a sample identifier field, a metadata field, and a prediction text field.

[0029] The metadata fields include at least one of the following: model identifier, dataset identifier, task identifier, scene identifier, language identifier, and timestamp information. By pre-encapsulating these multi-dimensional attribute tags in the metadata fields, the evaluation system can not only accurately complete task routing and strategy matching during multi-task concurrent processing, but also support the subsequent system in performing fine-grained slice analysis and drill-down comparison based on specific scenarios (such as noisy environments), specific languages, or specific models when generating evaluation reports.

[0030] Meanwhile, the aforementioned sample identifier field is used to uniquely identify the corresponding original test audio segment, while the predicted text field is purely used to carry the string content of the actual decoding output of the model. By strictly decoupling "data identity tracking (sample identifier)," "context attribute description (metadata)," and "core prediction result (predicted text)" at the field level, the standardization of evaluation data flow is improved, providing reliable and unified underlying data support for accurate text alignment and multi-dimensional comprehensive evaluation in subsequent steps.
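The three-field structure described above can be illustrated with a minimal sketch. The application does not specify a concrete serialization, so the JSON Lines layout, the field names `sample_id`, `metadata` and `prediction`, and the helper `parse_prediction_line` are all assumptions made for illustration only.

```python
import json
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PredictionRecord:
    """One entry of a model prediction file (hypothetical JSON Lines layout)."""
    sample_id: str    # uniquely identifies the original test audio segment
    prediction: str   # raw decoded text output by the model
    # Optional attribute tags: model, dataset, task, scene, language, timestamp.
    metadata: dict = field(default_factory=dict)


# Assumed field names; the source only requires that these three fields exist.
REQUIRED_FIELDS = ("sample_id", "metadata", "prediction")


def parse_prediction_line(line: str) -> PredictionRecord:
    """Parse one JSON line and reject records that lack a required field."""
    obj = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in obj]
    if missing:
        raise ValueError(f"missing required field(s): {missing}")
    return PredictionRecord(
        sample_id=str(obj["sample_id"]),
        prediction=str(obj["prediction"]),
        metadata=dict(obj["metadata"]),
    )


example_line = json.dumps({
    "sample_id": "utt_0001",
    "metadata": {"model": "model_a", "task": "asr", "scene": "noisy", "language": "en"},
    "prediction": "Turn on the lights.",
})
record = parse_prediction_line(example_line)
print(record.sample_id, record.metadata["task"])
```

Keeping identity, attributes and predicted text in separate fields means a single parser can serve every model, and the metadata tags remain available later for per-scene or per-language slicing.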

[0031] In step S120, the model prediction file is parsed based on the standard prediction output format to extract sample identifiers, metadata information and prediction text. The correspondence between the prediction text and the reference text in the standard reference file is established based on the sample identifiers. The speech understanding task type corresponding to the current evaluation task is identified based on the metadata information. When the metadata information is insufficient to determine the speech understanding task type, the speech understanding task type is determined by combining the standard reference file.

[0032] Here, the system first performs structured parsing on the model prediction file according to the standard prediction output format, extracting key content such as sample identifiers, metadata information, and prediction text. The sample identifier represents the specific evaluation sample corresponding to the prediction result, the prediction text represents the model's output result for that sample, and the metadata information describes task attributes, data source, or other information that helps identify the characteristics of the current evaluation task. Based on the extracted sample identifiers, the system can establish a one-to-one correspondence between the prediction text and the reference text in the standard reference file, thereby ensuring that subsequent processing and scoring are performed on the same data instance.

[0033] In some implementation scenarios, the metadata information in the model prediction file can directly indicate the speech understanding task type to which the current evaluation task belongs. However, when the metadata information is missing, incomplete, or poorly labeled, the system can further supplement the task type determination by combining the content structure, label format, or result organization method of the standard reference file. Through this task type identification mechanism, the system can provide a basis for subsequent component calls and scoring process selection, and improve its automatic processing capabilities and process continuity when the input data is somewhat non-standardized.
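A minimal sketch of step S120 follows, assuming predictions and references have already been loaded into dictionaries keyed by sample identifier. The task names in `KNOWN_TASKS` and the reference-file keys `label`, `target_language` and `transcript` are hypothetical; the application only states that the structure or label format of the reference file may be consulted when metadata is insufficient.

```python
def align_by_sample_id(predictions: dict, references: dict):
    """Pair predicted and reference text by sample identifier.

    Returns (pairs, missing_ids, extra_ids):
      pairs       maps sample_id -> (predicted text, reference text)
      missing_ids are reference samples with no prediction
      extra_ids   are predictions with no matching reference
    """
    pairs = {
        sid: (predictions[sid], references[sid])
        for sid in references
        if sid in predictions
    }
    missing_ids = sorted(set(references) - set(predictions))
    extra_ids = sorted(set(predictions) - set(references))
    return pairs, missing_ids, extra_ids


# Hypothetical task identifiers used only in this sketch.
KNOWN_TASKS = {"asr", "translation", "emotion"}


def identify_task_type(metadata: dict, reference_entry: dict) -> str:
    """Prefer the metadata task tag; fall back to the reference entry's shape."""
    task = str(metadata.get("task", "")).strip().lower()
    if task in KNOWN_TASKS:
        return task
    # Fallback: infer the task from which (assumed) keys the reference carries.
    if "label" in reference_entry:
        return "emotion"
    if "target_language" in reference_entry:
        return "translation"
    if "transcript" in reference_entry:
        return "asr"
    raise ValueError("task type cannot be determined from metadata or reference")


pairs, missing_ids, extra_ids = align_by_sample_id(
    {"utt_1": "hello", "utt_3": "stray output"},
    {"utt_1": "hello there", "utt_2": "goodbye"},
)
print(pairs, missing_ids, extra_ids)
print(identify_task_type({}, {"label": "happy"}))
```

Reporting missing and extra identifiers, instead of silently dropping them, is one way to keep the one-to-one correspondence auditable; how unmatched samples are scored is a policy choice that the source leaves open.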

[0034] In step S130, the text normalization component corresponding to the identified speech understanding task type is called to perform unified text normalization on the extracted predicted text and the reference text in the standard reference file, obtaining normalized predicted text and normalized reference text.

[0035] Different models often follow different writing habits and formatting rules when presenting results, for example in capitalization, numerical representation, symbol retention, and label writing. If the original predicted text and reference text were submitted directly to the scoring process, these discrepancies, which are not directly related to the task objective, could easily affect the final scoring result. Therefore, this embodiment introduces a unified text normalization process before scoring and calls the corresponding text normalization component based on the task type to apply consistent normalization rules to both the predicted and reference texts.

[0036] In practice, the system performs symmetrical processing on paired prediction and reference texts, meaning that format standardization, symbol processing, and expression unification are completed simultaneously under the same rule system. This transforms prediction and reference results, which initially exhibit superficial differences in expression, into standardized texts more suitable for comparison, thereby reducing the interference of non-essential format factors on the scoring results.
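The symmetric processing described above can be sketched as a per-task registry of normalizers, where the same function is applied to both sides of each pair. The specific rules shown here (Unicode NFKC folding, lowercasing, punctuation removal, whitespace collapsing) are illustrative assumptions; the application does not enumerate its normalization rules, and number rewriting is deliberately left out.

```python
import re
import unicodedata


def normalize_asr(text: str) -> str:
    """Illustrative transcript rules: fold width/compat forms, lowercase, drop punctuation."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s']", " ", text)   # keep word characters, spaces, apostrophes
    return " ".join(text.split())            # collapse repeated whitespace


def normalize_label(text: str) -> str:
    """Illustrative label rules for classification-style tasks."""
    return text.strip().lower()


# Hypothetical registry: task type -> text normalization component.
NORMALIZERS = {"asr": normalize_asr, "emotion": normalize_label}


def normalize_pair(task_type: str, prediction: str, reference: str):
    """Apply the same normalizer to prediction and reference (symmetric processing)."""
    normalizer = NORMALIZERS[task_type]
    return normalizer(prediction), normalizer(reference)


print(normalize_pair("asr", "Turn on the lights!", "turn on  the lights"))
```

Because one function handles both texts, a rule change can never be applied to the prediction without also being applied to the reference.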

[0037] In step S140, based on the speech understanding task type, a scoring script corresponding to the speech understanding task type is called through a preset unified scoring interface. Based on the normalized predicted text and the normalized reference text, the task performance indicators of multiple speech understanding models under the current evaluation task are calculated respectively.

[0038] After normalization, the system enters the task performance indicator calculation stage. Because different speech understanding tasks often correspond to different evaluation rules and scoring implementations, this embodiment sets up a unified scoring interface at the scoring layer. This interface exposes a consistent calling method externally and, internally, calls the scoring script that corresponds to the currently identified speech understanding task type. In this way, each task retains its own suitable evaluation method, while access, scheduling, and execution are all handled within a unified framework.

[0039] Through this unified scoring interface, the system can process the normalized predicted text and normalized reference text of multiple speech understanding models under the same evaluation task, and output the corresponding task performance indicators. This design helps to reduce the additional differences caused by different scoring implementation methods used by different research subjects, and also makes the entire evaluation system highly scalable. When it is necessary to support new speech understanding tasks, only the calling logic of the corresponding scoring script needs to be added to the unified scoring interface, without making major changes to the overall evaluation process.
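One way to picture the unified scoring interface is a registry that maps task types to scoring functions behind a single `score` entry point. The sketch below is an assumption about the shape of such an interface, not the implementation described in the application; word error rate and accuracy are used only as familiar example metrics, and all function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Pairs = List[Tuple[str, str]]  # (normalized prediction, normalized reference)
SCORERS: Dict[str, Callable[[Pairs], dict]] = {}


def register_scorer(task_type: str):
    """Decorator that plugs a scoring function into the unified interface."""
    def decorator(fn):
        SCORERS[task_type] = fn
        return fn
    return decorator


def edit_distance(hyp: list, ref: list) -> int:
    """Word-level Levenshtein distance using a rolling one-row table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@register_scorer("asr")
def score_asr(pairs: Pairs) -> dict:
    """Corpus-level word error rate: total edits / total reference words."""
    errors = sum(edit_distance(h.split(), r.split()) for h, r in pairs)
    ref_words = sum(len(r.split()) for _, r in pairs)
    return {"metric": "wer", "value": errors / ref_words if ref_words else 0.0}


@register_scorer("emotion")
def score_emotion(pairs: Pairs) -> dict:
    """Fraction of samples whose predicted label equals the reference label."""
    correct = sum(1 for h, r in pairs if h == r)
    return {"metric": "accuracy", "value": correct / len(pairs) if pairs else 0.0}


def score(task_type: str, pairs: Pairs) -> dict:
    """Single entry point: dispatch to the scorer registered for the task type."""
    if task_type not in SCORERS:
        raise KeyError(f"no scoring script registered for task type {task_type!r}")
    return SCORERS[task_type](pairs)


asr_result = score("asr", [("turn the light", "turn on the lights")])
print(asr_result)
```

Under this design, supporting a new task amounts to adding one decorated function; callers keep using `score` unchanged.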

[0040] In step S150, the task performance indicators of each speech understanding model are normalized and converted according to the task performance indicators and the preset reference performance values corresponding to each task performance indicator, so as to obtain the relative performance score corresponding to each speech understanding model.

[0041] When comprehensively evaluating multiple speech understanding tasks, the original performance indicators for different tasks often have different numerical meanings and evaluation directions. For example, for some indicators, a larger value represents better performance, while for others, a smaller value represents better performance; at the same time, the value range and units of measurement may also differ between different indicators. Therefore, directly comparing or comprehensively analyzing the original task performance indicators horizontally is easily affected by inconsistencies in evaluation scales. This embodiment introduces preset reference performance values corresponding to the performance indicators of each task to normalize and convert the original task performance indicators, thereby obtaining relative performance scores in a unified expression form.

[0042] Through this conversion process, the system transforms absolute indicators, originally belonging to different task evaluation systems, into relative results that can be compared on a unified scale. This preserves the evaluation significance reflected in the original scoring rules of each task while providing a unified numerical basis for comprehensive comparisons across tasks and models. Consequently, it reduces the difficulties arising from direct comparisons between heterogeneous evaluation indicators and improves the overall interpretability and usability of multi-task evaluation results.
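The application does not disclose the conversion formula, so the following is only one plausible sketch. It assumes the relative performance score is the signed relative change against the reference value, offset so that matching the reference yields 1.0 and a better result yields a larger score in either metric direction. The function name `relative_score` and the example numbers are hypothetical.

```python
def relative_score(value: float, reference: float, higher_is_better: bool) -> float:
    """Assumed conversion: 1.0 means 'equal to the reference', larger is better.

    higher_is_better=True  -> 1 + (value - reference) / reference
    higher_is_better=False -> 1 + (reference - value) / reference
    """
    if reference <= 0:
        raise ValueError("reference performance value must be positive")
    direction = 1.0 if higher_is_better else -1.0
    return 1.0 + direction * (value - reference) / reference


# Illustrative numbers only; they are not results from the application.
wer_score = relative_score(0.08, reference=0.10, higher_is_better=False)
acc_score = relative_score(0.72, reference=0.80, higher_is_better=True)
print(round(wer_score, 3), round(acc_score, 3))
```

With this choice, an error rate below the reference and an accuracy above the reference both map to scores above 1.0, which gives the common scale the paragraph above calls for. Other conversions, such as ratio-based or clipped variants, would fit the description equally well.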

[0043] In step S160, the task performance indicators and relative performance scores corresponding to each speech understanding model are aggregated to generate a standardized evaluation report.

[0044] After calculating the task performance metrics and converting them into relative performance scores, the system summarizes the results for multiple speech understanding models and generates a standardized evaluation report according to preset report organization rules. The report retains the original task performance metrics of each model in the current evaluation task and also provides the uniformly converted relative performance scores, so that the presentation serves both intra-task evaluation and cross-task comparison.

[0045] By automatically generating standardized evaluation reports, this embodiment constructs a complete evaluation loop, from the access of heterogeneous model prediction results, unified parsing, unified normalization, unified scoring to unified result output. This evaluation report provides consistent, comparable, and easily verifiable results for model capability analysis, model architecture comparison, and model selection in engineering deployment, thereby improving the standardization of speech understanding model evaluation and the usability of the results.

[0046] In some optional embodiments of this application, regarding the acquisition of model prediction files in step S110, the system can first construct a scenario stress test suite oriented towards a real deployment environment, and use this scenario stress test suite as the data source for the current evaluation task. In actual speech application scenarios, the model operating environment is usually more complex than standard test conditions. In addition to regular speech content, it may also be affected by factors such as background noise, changes in acquisition distance, overlapping of multiple speakers, dialect accent shifts, and enhanced contextual dependencies. Therefore, if the evaluation is based solely on relatively ideal speech data, it is often difficult to fully reflect the robustness level and applicability boundaries of the model under real deployment conditions. Based on this, this embodiment introduces complex interference factors corresponding to the deployment environment during the evaluation data access stage by constructing a scenario stress test suite, thereby enhancing the evaluation process's coverage of real application conditions.

[0047] Specifically, the scenario stress test suite can include two parts: an acoustic stress test set and a language stress test set. The acoustic stress test set can include audio samples with background noise, audio samples with far-field reverberation, and conference audio samples with overlapping multiple speakers, used to characterize distortion, overlap, or attenuation of speech signals under complex acoustic environments. The language stress test set can include audio samples with dialect variant features, audio samples with mixed Chinese-English code-switching features, and audio samples carrying contextual hot words, used to characterize deviations from standard spoken language, cross-language switching, and enhanced contextual dependence. By constructing test data from both acoustic perturbation factors and linguistic complexity factors, the system's adaptability to different types of complex conditions can be examined within a unified evaluation framework.

[0048] Subsequently, for each audio sample in the scenario stress test suite, the system controls multiple speech understanding models to be evaluated to perform inference prediction operations, generating corresponding raw inference output results. Considering that different models may organize their outputs differently under complex stress scenarios, after obtaining the raw inference output results, the system can perform unified structured encapsulation processing according to a preset standard prediction output format, thereby generating multiple model prediction files corresponding to different real-world stress scenarios. During this encapsulation process, the system can associate and record the predicted text, sample identifiers, and scenario tags or other metadata information related to the current stress scenario, so that subsequent evaluation processes can continue to perform parsing, normalization, and scoring processing under a unified format.
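
As a non-limiting illustration of the structured encapsulation described above, the following minimal Python sketch wraps raw inference outputs into one record per sample and serializes them as a JSON Lines prediction file. The field names (`sample_id`, `prediction`, `metadata`), the scenario tags, and the JSON Lines layout are assumptions made for this example; the application does not fix a concrete file layout.

```python
import json


def wrap_prediction(sample_id, model_id, raw_output, task_type, scenario_tag):
    """Wrap one raw model output into a hypothetical standard prediction record."""
    return {
        "sample_id": sample_id,
        "prediction": str(raw_output).strip(),
        # Metadata keeps the scenario attribute so later stages can group by it.
        "metadata": {
            "model_id": model_id,
            "task_type": task_type,
            "scenario": scenario_tag,
        },
    }


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records
    )


records = [
    wrap_prediction("utt_0001", "model_a", "  turn on the light ", "asr", "background_noise"),
    wrap_prediction("utt_0002", "model_a", "open the window", "asr", "far_field_reverb"),
]
prediction_file = to_jsonl(records)
```

Keeping the scenario tag next to each prediction is what allows later scoring and reporting stages to separate results by stress condition without consulting the original audio.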

[0049] Through the above implementation methods, the system can not only introduce various stress factors related to the real deployment environment at the data source, but also simultaneously retain scenario attribute information corresponding to specific stress scenarios when generating model prediction files. This allows subsequent evaluation processes to obtain prediction results in a unified format while distinguishing the test conditions under which different prediction results are generated. Therefore, within a standardized evaluation framework, the system can further reflect the performance changes of the model under different complex environments, providing more targeted evaluation criteria for subsequent result aggregation, scenario-based analysis, and model selection. It also helps to enhance the reference value of evaluation conclusions for actual deployment scenarios.

[0050] In some optional embodiments of this application, for the text normalization process in step S130, the system does not uniformly apply the same set of cleaning rules to all speech understanding tasks. Instead, based on the identified speech understanding task type, it calls the text normalization component corresponding to that task type to apply appropriate normalization processing to the predicted text and the reference text. Since different tasks differ in result representation, evaluation focus, and scoring sensitivities, if uniform processing rules are applied indiscriminately during the normalization stage, expression differences unrelated to the task objective can easily affect the final scoring result. Therefore, this embodiment establishes a multi-task branch dynamic invocation mechanism to ensure that the normalization processing matches the task attributes, thereby providing consistent and more comparable input text for subsequent scoring.
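
The multi-task branch invocation can be sketched, under stated assumptions, as a lookup table from task type to normalization function. The task-type keys and the three stand-in normalizers below are hypothetical placeholders chosen for this example; the point illustrated is only that the same component is applied to both the predicted text and the reference text.

```python
def normalize_asr(text):
    # Stand-in: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def normalize_label(text):
    # Stand-in: trim and lowercase a category label.
    return text.strip().lower()


def normalize_translation(text):
    # Stand-in: collapse whitespace only.
    return " ".join(text.split())


# Hypothetical mapping from task type to its normalization component.
NORMALIZERS = {
    "asr": normalize_asr,
    "emotion": normalize_label,
    "gender": normalize_label,
    "classification": normalize_label,
    "translation": normalize_translation,
}


def normalize_pair(task_type, prediction, reference):
    """Apply the task-specific normalizer to prediction and reference alike."""
    try:
        normalizer = NORMALIZERS[task_type]
    except KeyError:
        raise ValueError(f"unsupported task type: {task_type}") from None
    return normalizer(prediction), normalizer(reference)
```

Rejecting an unknown task type, rather than falling back to a default rule set, is a design choice of this sketch intended to avoid silently scoring a task with mismatched rules.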

[0051] In some implementations, when the identified speech understanding task type is an automatic speech recognition task, the system calls the basic text processing component to perform case unification, language-related numerical expression standardization, and punctuation normalization on the predicted and reference texts, respectively, to obtain the corresponding normalized predicted and reference texts. The evaluation of automatic speech recognition tasks focuses on measuring the model's transcription accuracy of speech content. However, different models may use different writing conventions when decoding output, such as the way numbers are written, letter case, and punctuation retention. If the original output is directly used to calculate the word error rate or character error rate, these differences in expression may be mistakenly included in the recognition bias. Therefore, the basic text processing component can use preset text standardization rules to perform a consistent standardization mapping on the predicted and reference texts, making them more consistent in format. This reduces the interference of non-essential writing differences on recognition evaluation indicators and helps the scoring results more comprehensively reflect the model's transcription ability.
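
A minimal sketch of such a basic text processing component for English is given below. The digit table and the decision to remove all ASCII punctuation are assumptions for illustration; real rule sets would be language-specific and would also cover multi-digit numbers and non-ASCII punctuation, which this example leaves untouched.

```python
import re
import string

# Hypothetical spelled-out forms for isolated single digits (English only).
_DIGITS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def normalize_asr(text):
    """Case unification, single-digit standardization, punctuation removal."""
    text = text.lower()
    # Replace a digit only when it stands alone as a token.
    text = re.sub(r"\b\d\b", lambda m: _DIGITS[m.group()], text)
    text = text.translate(_STRIP_PUNCT)
    return " ".join(text.split())
```

With these rules, a prediction written as "Turn on 2 lights, please!" and a reference written as "turn on two lights please" map to the same string, so the writing difference no longer contributes to the error rate.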

[0052] In other implementations, when the identified speech understanding task type is a speech emotion recognition task, a gender recognition task, or a speech input-based classification task, the system invokes a label processing component to perform classification label cleaning, task-irrelevant special character removal, and category format unification processing on the predicted and reference texts, respectively, to obtain the corresponding normalized predicted and reference texts. Unlike long text transcription tasks, the output of such tasks is essentially a discrete category determination result. However, in actual implementations, different models may express classification results in different ways. In particular, generative models sometimes output results containing explanatory statements, wrapping text, or formatting symbols, while traditional classification models usually output category labels directly. To avoid the impact of differences in output packaging formats on accuracy calculations, the label processing component can perform label extraction, irrelevant character cleaning, and category format unification processing on the predicted and reference texts, so that the core category information output by different models can be compared within a unified label space. This processing helps reduce the impact of differences in generative output formats on classification evaluation results, thereby improving the consistency of results comparison between different models in classification tasks.
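
By way of example only, label extraction from a wrapped generative output could be sketched as follows. The label set `LABELS` is hypothetical, and returning the first token that matches a known label is a simplification: it would, for instance, misread a negated answer such as "not happy", which a production rule set would need to handle.

```python
import re

# Hypothetical label space for a speech emotion recognition task.
LABELS = ("happy", "sad", "angry", "neutral")


def extract_label(text, labels=LABELS):
    """Return the first known label found in the text, or None."""
    # Lowercase, then turn every non-letter character into a space so that
    # markup, quotes, and punctuation do not stick to the label token.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    for token in cleaned.split():
        if token in labels:
            return token
    return None
```

Applying the same function to the reference label keeps both sides in one label space before the accuracy is computed.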

[0053] In some implementations, when the identified speech understanding task is a speech translation task, the system invokes the translation text processing component to perform language-related format normalization, tag cleaning, and cross-language punctuation normalization on the predicted and reference texts, respectively, to obtain the corresponding normalized predicted and reference texts. Speech translation tasks involve both speech-to-text conversion and cross-language mapping from the source language to the target language. Different models may employ different formatting rules when outputting in the target language, such as character spacing, punctuation styles, and local formatting expressions. If these differences are not uniformly processed before scoring, the translation evaluation metrics are easily affected by differences in the surface expression habits of the target language. Therefore, the translation text processing component can perform consistent format normalization and punctuation normalization on the predicted and reference texts according to the text representation characteristics of the target language, thereby making the text entering the translation evaluation script more uniform in expression and helping to improve the stability and comparability of cross-language task scoring results.
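
One possible form of the translation text processing component, assuming English as the target language, is sketched below. The tag pattern, the full-width punctuation table, and the spacing rule are illustrative assumptions and not rules stated in the application.

```python
import re

# Hypothetical mapping from full-width punctuation to ASCII equivalents.
_FULLWIDTH_TO_ASCII = str.maketrans({
    "，": ",", "。": ".", "！": "!", "？": "?", "：": ":", "；": ";",
    "（": "(", "）": ")", "“": '"', "”": '"',
})


def normalize_translation(text):
    """Tag cleaning, punctuation unification, and spacing normalization."""
    text = re.sub(r"<[^>]+>", "", text)            # drop markup-style tags
    text = text.translate(_FULLWIDTH_TO_ASCII)     # unify punctuation style
    text = re.sub(r"\s+([,.!?;:])", r"\1", text)   # no space before punctuation
    return " ".join(text.split())
```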

[0054] Through the embodiments of this application, the system can apply normalization rules adapted to the task attributes to both the predicted and reference texts based on the characteristics of different speech understanding tasks. This establishes a more consistent text comparison basis before scoring, making normalization processing not merely a general format cleaning process, but a pre-processing step dynamically coupled with task type. Consequently, it reduces additional scoring fluctuations caused by heterogeneous models due to differences in output expression habits, label organization methods, or cross-language format differences. This allows subsequent task performance indicators to better reflect the model's actual processing capability in the corresponding task and improves the consistency and reference value of evaluation results in cross-model comparison scenarios.

[0055] In some optional embodiments of this application, regarding the task performance index calculation process in step S140, the system does not use the same scoring method for all speech understanding tasks. Instead, it first calls the scoring script corresponding to the identified speech understanding task type through a preset unified scoring interface, and then calculates the corresponding task performance index based on the normalized predicted text and normalized reference text. Since different speech understanding tasks differ in output format, evaluation objectives, and index definitions, directly using a single evaluation logic to uniformly calculate all types of tasks could easily lead to a mismatch between the evaluation results and the task's objectives. Therefore, this embodiment uses a unified interface scheduling and task-specific script execution method to enable all types of tasks to adopt index calculation methods adapted to their evaluation objectives, while maintaining the consistency of the overall evaluation process.

[0056] In some implementations, if the speech understanding task is an automatic speech recognition task, the corresponding error rate calculation script is invoked, and the calculated word error rate or character error rate is used as the task performance indicator. The evaluation of automatic speech recognition tasks focuses on measuring the accuracy of the model's transcription of speech content. Therefore, the system can perform word-by-word or character-by-character alignment comparisons based on the sequence correspondence between the normalized predicted text and the normalized reference text, and statistically analyze the deviations of the prediction results relative to the reference results, such as substitutions, deletions, and insertions. Based on this, the corresponding word error rate or character error rate is calculated to characterize the model's recognition deviation level in the speech transcription task. By employing such error rate calculation scripts, the difference between the model's transcription results and the reference results can be quantified at a finer-grained text unit level, thereby providing an evaluation indicator corresponding to the task objectives of automatic speech recognition.
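
The alignment-based calculation can be illustrated with a standard Levenshtein edit distance over token sequences, divided by the reference length. The function name `error_rate` is hypothetical; passing word lists yields a word error rate and passing character lists yields a character error rate. This is a sketch of the well-known metric, not the scoring script of the application.

```python
def error_rate(ref_tokens, hyp_tokens):
    """Edit distance (substitutions, deletions, insertions) / reference length."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, 1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, 1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_tokens)


wer = error_rate("turn on the light".split(), "turn on light".split())
cer = error_rate(list("开灯"), list("关灯"))
```

In this example one reference word out of four is deleted, and one of two characters is substituted.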

[0057] In other implementations, if the speech understanding task is a speech translation task, the corresponding machine translation evaluation script is invoked, and the translation evaluation score calculated for the normalized predicted text against the normalized reference text is used as the task performance indicator. Speech translation tasks require the model not only to recognize the source speech content but also to render it as a reasonable expression in the target language. Since the same semantic content may have multiple acceptable expressions in the target language, evaluating with strict character-by-character or word-by-word matching can easily penalize translations that are worded differently but have essentially the same meaning. Therefore, the system can invoke the machine translation evaluation script to calculate the degree of word overlap, local expression consistency, or overall correspondence between the predicted translation and the reference translation, and obtain a translation evaluation score accordingly. This type of scoring helps, to a certain extent, to balance diversity of expression against consistency of evaluation in translation tasks, making the evaluation results of speech translation tasks more consistent with the characteristics of cross-language mapping scenarios.
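
To make the notion of "degree of word overlap" concrete, the sketch below computes a clipped n-gram precision between a predicted and a reference translation. It is a deliberately simplified stand-in: it is not BLEU or any other specific machine translation metric, and the function name is hypothetical.

```python
from collections import Counter


def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(prediction, reference, n=1):
    """Share of predicted n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference,
    so repeating a correct word does not inflate the score.
    """
    pred = _ngrams(prediction.split(), n)
    ref = _ngrams(reference.split(), n)
    total = sum(pred.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / total
```

Unlike a strict sequence match, a prediction that shares most of its words with the reference still receives partial credit.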

[0058] In some implementations, if the speech understanding task is a speech emotion recognition task, a gender recognition task, or a speech input-based classification task, the corresponding classification accuracy calculation script is invoked, and the accuracy obtained by matching the normalized predicted text against the normalized reference text is used as the task performance indicator. The output of such tasks is typically a category label drawn from a finite set of categories, and the evaluation focuses on whether the model assigns the correct category to the input speech sample. Therefore, the system can compare the normalized predicted category label with the reference category label for consistency, and count the proportion of samples in the current evaluation set that the model judges correctly to obtain the overall accuracy. By adopting an accuracy calculation method matched to the characteristics of classification tasks, the model's classification ability in tasks such as speech attribute recognition and semantic category determination can be reflected more directly.
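
A minimal sketch of the accuracy calculation follows, with predictions and references keyed by sample identifier. Counting a sample with no prediction as incorrect is an assumption of this example; the application does not state how missing predictions are treated.

```python
def classification_accuracy(predictions, references):
    """Proportion of reference samples whose predicted label matches."""
    if not references:
        raise ValueError("reference set must not be empty")
    correct = sum(
        1 for sample_id, label in references.items()
        if predictions.get(sample_id) == label
    )
    return correct / len(references)


references = {"u1": "happy", "u2": "sad", "u3": "angry", "u4": "neutral"}
predictions = {"u1": "happy", "u2": "sad", "u3": "neutral"}  # u4 is missing
accuracy = classification_accuracy(predictions, references)
```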

[0059] Through the embodiments of this application, the system can select appropriate scoring scripts and evaluation metrics based on the characteristics of different speech understanding tasks within a unified evaluation framework. This allows automatic speech recognition, speech translation, and classification tasks to complete performance calculations under adapted evaluation logic. Therefore, it avoids the mismatch problem caused by using a single metric to evaluate heterogeneous tasks and helps ensure that the calculated performance metrics for different tasks remain consistent with their task objectives. This provides more reasonable basic data for subsequent relative performance score conversion and cross-model comprehensive comparison.

[0060] Regarding the implementation details of step S150, in some optional embodiments of this application, the system first obtains the best benchmark score corresponding to the current evaluation task from a preset evaluation leaderboard record, and uses this best benchmark score as a preset reference performance value. In multi-task speech understanding evaluation, the evaluation indicators of different tasks often have different dimensions, different numerical ranges, and different task difficulty characteristics. If a unified fixed value is used as the conversion benchmark, it is usually difficult to reflect the current evaluation level of each task. Based on this, this embodiment reads the evaluation leaderboard record corresponding to the current evaluation task and selects the best benchmark score recorded for that task as the reference value, so that the subsequent conversion process is based on a benchmark level that matches the current task. In this way, the relative performance of different models in the current task can be measured with reference to the same benchmark, thereby improving the interpretability of relative performance scores and the consistency of intra-task comparisons.

[0061] After obtaining the preset reference performance value, the system further identifies the polarity of the current task's performance indicators. Indicator polarity includes positive and negative indicators. Positive indicators represent performance indicators where a larger value signifies better performance, while negative indicators represent performance indicators where a smaller value signifies better performance. In actual speech understanding evaluation, the direction of indicators is often inconsistent across different tasks. For example, classification accuracy or translation evaluation scores are usually positive indicators, while word error rate or character error rate are usually negative indicators. If the polarity of the indicators is not identified before conversion, subsequent normalization calculations may result in inconsistent evaluation directions. Therefore, this step identifies the indicator polarity in advance, providing a basis for subsequent conversions based on different mathematical relationships, thus enabling the system to be compatible with multiple evaluation indicators in an automated process.

[0062] After identifying the polarity of the indicators, the system selects the appropriate normalization conversion method based on the indicator type. On one hand, when the task performance indicator is a positive indicator, the system calculates the ratio between the current speech understanding model's task performance indicator and the best benchmark score; on the other hand, when the task performance indicator is a negative indicator, the system calculates the ratio between the best benchmark score and the current speech understanding model's task performance indicator. Through this conversion method, different indicators that originally had opposite evaluation directions can be mapped to the same relative comparison logic, so that the converted ratio parameters can all characterize the degree of performance closeness of the model relative to the corresponding benchmark level. Therefore, although the original evaluation indicators of different tasks differ in their definition, after this step, they can continue to participate in the subsequent result aggregation on a unified relative comparison basis.

[0063] Furthermore, based on the ratio parameter, the system determines the relative performance score for each speech understanding model. In practice, the system can directly generate the corresponding relative performance score based on the ratio parameter, or it can generate the relative performance score after further numerical processing of the ratio parameter according to specific implementation requirements. Through this step, task performance indicators that originally belonged to different task evaluation systems and had different directions are transformed into relative result representations with a unified expression form, thus providing a consistent data foundation for cross-model comparisons and cross-task result summaries in subsequent standardized evaluation reports.
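
The polarity-dependent conversion of paragraphs [0061] to [0063] can be written as $r = m / b$ for a positive indicator and $r = b / m$ for a negative indicator, where $m$ is the model's task performance indicator and $b$ is the best benchmark score. The sketch below uses the ratio directly as the relative performance score. Raising an error on a zero denominator is an assumption of this example, since the application does not specify that case, and the sample values are hypothetical.

```python
def relative_score(metric, best_benchmark, higher_is_better):
    """Ratio of the model's metric to the best benchmark, oriented by polarity."""
    if higher_is_better:
        numerator, denominator = metric, best_benchmark
    else:
        numerator, denominator = best_benchmark, metric
    if denominator == 0:
        # Not covered by the source text; this sketch refuses to guess.
        raise ValueError("ratio undefined for a zero denominator")
    return numerator / denominator


# Positive indicator: accuracy 0.90 against a best benchmark of 0.95.
acc_score = relative_score(0.90, 0.95, higher_is_better=True)
# Negative indicator: word error rate 0.08 against a best benchmark of 0.05.
wer_score = relative_score(0.08, 0.05, higher_is_better=False)
```

In both cases a score of 1 means the model equals the benchmark and a score below 1 means it falls short, so indicators with opposite directions become comparable on one scale.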

[0064] Through the embodiments of this application, while retaining the significance of various original task performance indicators, a normalization conversion mechanism based on the best benchmark score and indicator polarity is introduced. This enables the absolute indicator results under different tasks to be further transformed into comparable relative performance scores. This not only helps reduce the difficulties of direct horizontal comparison between heterogeneous evaluation indicators but also enhances the organizational capacity and analytical value of multi-task evaluation results within a unified reporting framework, thereby providing more stable data support for subsequent comprehensive model comparisons.

[0065] In some preferred embodiments of this application, after generating a standardized evaluation report, a dynamic leaderboard calibration operation can be further performed. Specifically, the system first detects whether the task performance indicators of each speech understanding model under the current evaluation task are better than the best benchmark score in the evaluation leaderboard record. In the scenario of continuously iterating speech understanding model evaluation, the best performance level of different tasks may change continuously with the introduction of new models, new training methods, or new implementation schemes. If the evaluation system maintains a constant benchmark score for a long time, the relative performance scores calculated based on the benchmark score may gradually lose their discriminative power, thereby affecting the leaderboard results' ability to characterize the relative level of the models. Based on this, this embodiment adds a benchmark detection step after each round of evaluation, enabling the system to update and judge the existing best benchmark score based on the task performance indicators of the newly added models.

[0066] When a task performance metric superior to the best benchmark score is detected, the system updates the best benchmark score to that task performance metric and triggers a refresh of the evaluation leaderboard. In other words, when a model in the current batch of tests produces an evaluation result better than the historical best, the system treats this result as the new benchmark level for the current task and replaces the original best benchmark score. With this update method, the benchmark values in the evaluation system are no longer fixed static parameters but are adjusted dynamically as model performance improves. Subsequent relative performance scores are thus calculated against a benchmark score closer to the latest level for the current task, which improves the leaderboard's adaptability to the current state of the technology.

[0067] After updating the best benchmark score, the system further re-performs normalization based on the updated best benchmark score and historical evaluation records. This dynamically calibrates the relative performance scores of other speech understanding models already recorded on the leaderboard for the current evaluation task and updates the leaderboard record corresponding to the current task. Since the relative performance scores of historical models were originally calculated relative to the old best benchmark score, these scores no longer perfectly correspond to the new benchmark conditions after the best benchmark score is updated. To maintain consistency in the calculation of relative performance scores across models in the leaderboard, the system can call upon task performance metrics stored in historical evaluation records and uniformly re-perform normalization based on the new best benchmark score. In this way, model results submitted at different times in the leaderboard can be realigned under the same benchmark conditions, thus making the updated leaderboard record more consistently reflect the relative performance of each model in the current task.
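
The detection, update, and recalibration steps of paragraphs [0065] to [0067] can be sketched together as follows. The leaderboard structure (a dictionary with `best` and `raw` entries) and the model names are hypothetical; the sketch assumes that the raw task performance indicators of historical models are stored, as the paragraph above describes.

```python
def is_better(candidate, best, higher_is_better):
    return candidate > best if higher_is_better else candidate < best


def recalibrate(leaderboard, new_results, higher_is_better):
    """Merge new results, update the best benchmark, and rescore every model."""
    raw = dict(leaderboard["raw"])
    raw.update(new_results)
    best = leaderboard["best"]
    for metric in new_results.values():
        if is_better(metric, best, higher_is_better):
            best = metric
    # Historical and new models are rescored against the same benchmark.
    relative = {
        model: (metric / best if higher_is_better else best / metric)
        for model, metric in raw.items()
    }
    return {"best": best, "raw": raw, "relative": relative}


# Word error rate leaderboard (smaller is better); model_c sets a new best.
board = {"best": 0.05, "raw": {"model_a": 0.05, "model_b": 0.10}}
updated = recalibrate(board, {"model_c": 0.04}, higher_is_better=False)
```

After the update, `model_a`, which previously scored 1.0 against the old benchmark, is rescored to approximately 0.8 against the new one, which keeps all entries aligned to the same benchmark condition.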

[0068] Through the embodiments of this application, a dynamic calibration process linked to the leaderboard records is further established based on the relative performance score calculation mechanism. This allows the best benchmark score to be updated according to the evaluation results of the new model, and the relative performance scores of historical models to be recalculated synchronously. This reduces the problem of inconsistencies between historical and current results due to a long-term fixed benchmark, and improves the comparability, consistency, and reference value of the leaderboard results in continuously iterative evaluation scenarios.

[0069] Regarding the implementation details of generating a standardized evaluation report in step S160, in some optional embodiments of this application, the system can further organize the task performance indicators and relative performance scores obtained in the preceding evaluation process into a standardized evaluation report that can be used for comparative analysis through grouping, summarizing, structured organization, and unified output. Since the performance of the same model may vary significantly under different task types and different real-world environmental pressure scenarios, simply summarizing all evaluation results would not be conducive to identifying performance changes of the model in a specific task or scenario. Therefore, this embodiment introduces a grouping dimension in the report generation stage to further segment and organize the evaluation results.

[0070] Specifically, the system first groups and summarizes the task performance indicators and relative performance scores of each speech understanding model according to at least one grouping dimension. The grouping dimension can include a speech understanding task type dimension and a real-world stress scenario dimension. The real-world stress scenario dimension can include at least one of the following: background noise scenario, far-field reverberation scenario, multi-speaker overlapping conference scenario, dialect variant scenario, mixed Chinese-English code-switching scenario, and contextual hot word scenario. Through this grouping method, the system can further break down the originally uniformly summarized evaluation results into result sets with clear task and scenario attributes, thus facilitating the observation of the model's performance under different tasks and stress conditions. This reduces the information masking problem caused by judging solely based on the overall average result and improves the analytical granularity of the evaluation results.

[0071] After grouping and summarizing, the system further generates structured evaluation results including model identifier, task type, scenario identifier, task performance metrics, relative performance scores, and model ranking results. In practice, the system can organize multiple fields involved in the evaluation in a related manner, ensuring that the original metric results and relative performance results of the same model under the corresponding task type and scenario conditions are recorded consistently. Within each group, the system further determines the model ranking results according to preset ranking rules. This structured organization method transforms scattered evaluation calculation results into unified result data with clearly defined fields and relationships, thus providing a standardized data foundation for subsequent report generation, result querying, and external access.
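
As an illustration of the grouping and in-group ranking, the sketch below groups result rows by task type and scenario and ranks models within each group by relative performance score in descending order. The row field names and the ranking rule are assumptions for this example; the application only requires that some preset ranking rule be applied.

```python
from collections import defaultdict


def rank_within_groups(results):
    """Group rows by (task_type, scenario) and attach a rank inside each group."""
    groups = defaultdict(list)
    for row in results:
        groups[(row["task_type"], row["scenario"])].append(row)
    ranked = []
    for key in sorted(groups):
        ordered = sorted(groups[key], key=lambda r: r["relative_score"], reverse=True)
        for rank, row in enumerate(ordered, 1):
            ranked.append({**row, "rank": rank})
    return ranked


results = [
    {"model": "model_a", "task_type": "asr", "scenario": "background_noise",
     "metric": 0.08, "relative_score": 0.625},
    {"model": "model_b", "task_type": "asr", "scenario": "background_noise",
     "metric": 0.05, "relative_score": 1.0},
    {"model": "model_a", "task_type": "asr", "scenario": "far_field_reverb",
     "metric": 0.10, "relative_score": 0.9},
]
ranked = rank_within_groups(results)
```

Because ranking is done per group, a model can lead in one stress scenario and trail in another, which an overall average would hide.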

[0072] Furthermore, the system outputs a standardized evaluation report based on the structured evaluation results. This standardized evaluation report can uniformly organize the structured evaluation results according to preset display rules, including the model's task performance indicators, relative performance scores, and ranking in the corresponding group for each individual task. Through this output process, the system can transform the underlying evaluation calculation results into a unified report format that is easy to view and compare, thereby supporting a comprehensive comparison of the performance of different models under different task types and different real-world environmental pressure scenarios. Therefore, the generated standardized evaluation report not only reflects the overall competitiveness of the models but also further demonstrates the performance characteristics of the models under specific scenario conditions, providing more granular results for subsequent model analysis, model selection, and deployment evaluation.

[0073] Through the embodiments of this application, a grouping organization mechanism combining task type and real-world environmental stress scenarios is established in the evaluation result output stage. This enables the standardized evaluation report to not only provide the overall evaluation results of the model, but also to further demonstrate the model's performance in different tasks and scenarios. This helps improve the readability, comparability, and analytical value of the evaluation results, and provides more targeted results support for model defect localization, performance diagnosis, and scenario adaptation analysis.

[0074] In some examples of embodiments of this application, after generating a standardized evaluation report, the system can not only output the performance comparison results of multiple speech understanding models under a unified evaluation protocol, but can also perform an automated training process conversion operation to support controlled training and reproduction of a target speech understanding model. Figure 2 shows a flowchart of an example of performing the automated training process conversion operation according to an embodiment of this application. Through this process, the system can, based on the evaluation results, convert the paper description and open-source implementation of the target model into a standardized training process suitable for execution within a unified training framework, thereby supporting subsequent controlled reproduction and architecture comparison.

[0075] As shown in Figure 2, in step S210, a controlled training reproduction request for the target speech understanding model is received.

[0076] In some implementation scenarios, when evaluators, developers, or automated management modules determine that a particular model has further reproducibility analysis value based on a standardized evaluation report, they can submit a controlled training and reproducibility request for that target speech understanding model through an interactive interface, task scheduling interface, or application programming interface. This request can carry the target model's unique identifier, code repository address, model name, or other retrieval information capable of locating the target model. Through this step, the system can further enter the training process conversion stage corresponding to the target model after completing the unified evaluation, establishing a clear connection between the evaluation results and the subsequent reproducibility process.

[0077] In step S220, based on the controlled training reproduction request, the intelligent agent module automatically parses the paper description documents and open source code repositories associated with the target speech understanding model to extract the network architecture features, training dataset dependencies, and optimizer hyperparameter configurations of the target speech understanding model.

[0078] In traditional replication processes, developers typically need to read paper descriptions, search configuration files, and understand code implementations to gradually extract the key elements required for model training. This embodiment introduces an intelligent agent module to automatically and jointly parse these heterogeneous information sources. Paper descriptions usually contain unstructured information such as model structure design, training ideas, and experimental settings, while open-source code repositories contain engineering information such as model implementation, configuration organization, and dependencies. By collaboratively analyzing both, the system can extract key parameters directly related to training replication, such as network hierarchy, core module configuration, dataset information, optimizer type, and hyperparameters. This step reduces the workload of manually identifying the correspondence between papers and code and improves the consistency of training element extraction.

[0079] In step S230, based on the extracted network architecture features, training dataset dependencies, and optimizer hyperparameter configurations, and in conjunction with the preset unified training framework specifications, a standardized unified training configuration file is generated.

[0080] Since the original implementations of different models typically rely on different deep learning frameworks, toolchains, and training script organization methods, directly reproducing them in the original engineering environment is easily affected by differences in the underlying implementation, thus increasing the possibility of inconsistent training conditions between models. Therefore, this embodiment, after extracting the key training elements of the target model, further maps and reorganizes the above information according to a preset unified training framework specification to generate a standardized unified training configuration file. This unified training configuration file can serve as the configuration basis for subsequent training process scheduling and training environment construction, enabling different target models to complete parameter organization and process orchestration under consistent training framework constraints when entering the reproduction stage. This provides a unified configuration entry point for subsequent controlled training and reduces the additional impact caused by differences in the original engineering environment.
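The mapping step described above can be illustrated with a minimal sketch. It assumes the extracted training elements arrive as plain Python dictionaries and sets; the section names (`model`, `data`, `optimizer`), the `framework_version` field, and the helper `build_unified_config` are hypothetical and are not taken from this application.

```python
import json


def build_unified_config(architecture, datasets, optimizer, framework_version="1.0"):
    """Reorganize extracted training elements into one framework-neutral config.

    All field names here are illustrative assumptions, not a defined schema.
    """
    sections = {
        "model": dict(architecture),
        "data": sorted(datasets),
        "optimizer": dict(optimizer),
    }
    # Refuse to emit a config when any extracted element is empty.
    missing = [name for name, value in sections.items() if not value]
    if missing:
        raise ValueError(f"missing training elements: {missing}")
    return {
        "framework_version": framework_version,
        "model": sections["model"],
        "data": {"train_sets": sections["data"]},
        "optimizer": sections["optimizer"],
    }


config = build_unified_config(
    architecture={"encoder_layers": 12, "hidden_size": 768},
    datasets={"librispeech", "aishell1"},
    optimizer={"name": "adamw", "lr": 1e-4},
)
print(json.dumps(config, indent=2, sort_keys=True))
```

Sorting the dataset names keeps the generated file stable across runs, which makes two configurations easier to compare.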

[0081] In step S240, the training process verification is performed based on the unified training configuration file, and after the training process verification is passed, the executable training process corresponding to the target speech understanding model is output.

[0082] Here, the system first performs verification processing on the target training process based on the unified training configuration file to confirm the executability and consistency between the configuration content and the unified training framework. After successful verification, the system further outputs an executable training process corresponding to the target speech understanding model, enabling developers or subsequent automation modules to conduct controlled training reproduction based on this executable training process. Through this step, the system does not merely remain at the abstract configuration layer, but further implements the transformed training elements into actual execution processes that can be invoked by the unified training environment, thereby completing the closed loop from discovering the target model through evaluation results to forming the entry point for controlled training reproduction.

[0083] This application's embodiments, in addition to a unified evaluation mechanism, further establish an automated training process conversion procedure for target speech understanding models, enabling the system to extend to the training and reproduction stage based on evaluation results. This implementation method, by automatically parsing paper description documents and open-source code repositories, extracting key training elements, generating a unified training configuration file, and outputting an executable training process, helps reduce reliance on manual experience and engineering work during model reproduction, and provides a foundation for controlled reproduction of different models within a unified training framework. Therefore, beyond unified evaluation, it can further enhance the analytical capabilities regarding the relationship between model architecture, training conditions, and reproduction processes, thereby improving the system's support for model comparison and subsequent reproduction verification.

[0084] Regarding the implementation details of step S240, in some optional embodiments of this application, after generating the unified training configuration file, the system does not directly use the configuration file for the formal training task. Instead, it first performs training process verification to confirm the executability and consistency between the unified training configuration file and the unified training framework. Specifically, the training process verification may first include at least one of the following static checks: code dependency resolution check, parameter consistency check, and loss function and evaluation metric correspondence check. Through this static check process, the system can pre-verify the matching between configuration content, dependencies, and key parameters before the training task enters the running phase.

[0085] In some implementations, code dependency resolution checks are used to confirm whether the network modules, operator components, or training dependencies declared in the unified training configuration file can be recognized and invoked by the current unified training framework's runtime environment, thereby reducing training startup failures due to missing dependencies, version incompatibility, or incorrect component references. Parameter consistency checks are used to verify the relationships between key parameters in the model configuration, such as the input-output dimension relationships between network layers and the matching relationship between training batch parameters and resource configurations, to reduce training process anomalies caused by inconsistent parameter settings. Loss function and evaluation metric correspondence checks are used to confirm that there is no significant mismatch between the optimization objective selected in the configuration file and the evaluation method corresponding to the current task, for example, avoiding configuring loss functions that are not suitable for the current task type in the target training process. Through the above static checks, the system can identify some potential configuration-level problems at an early stage, thereby reducing invalid execution in subsequent formal training.
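The three static checks can be sketched as follows. This is an illustration under stated assumptions: the configuration layout (`dependencies`, `layers`, `task`, `loss`), the `COMPATIBLE_LOSSES` table, and all function names are hypothetical. The dependency check uses the standard-library `importlib.util.find_spec` and is limited to top-level module names.

```python
import importlib.util

# Hypothetical table of which loss functions suit which task types.
COMPATIBLE_LOSSES = {
    "asr": {"ctc", "cross_entropy"},
    "ser": {"cross_entropy"},
}


def check_dependencies(modules):
    """Return declared top-level modules that cannot be located in this environment."""
    return [m for m in modules if importlib.util.find_spec(m) is None]


def check_layer_dimensions(layers):
    """Return indices i where layer i's output size differs from layer i+1's input size."""
    return [i for i in range(len(layers) - 1) if layers[i]["out"] != layers[i + 1]["in"]]


def check_loss_matches_task(task, loss):
    """Return True when the configured loss is listed as suitable for the task."""
    return loss in COMPATIBLE_LOSSES.get(task, set())


def run_static_checks(config):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = [f"unresolved dependency: {m}" for m in check_dependencies(config["dependencies"])]
    problems += [
        f"dimension mismatch between layer {i} and layer {i + 1}"
        for i in check_layer_dimensions(config["layers"])
    ]
    if not check_loss_matches_task(config["task"], config["loss"]):
        problems.append(f"loss {config['loss']!r} is not listed for task {config['task']!r}")
    return problems


example = {
    "dependencies": ["json", "no_such_module_xyz"],
    "layers": [{"in": 80, "out": 256}, {"in": 256, "out": 512}, {"in": 256, "out": 30}],
    "task": "ser",
    "loss": "ctc",
}
for problem in run_static_checks(example):
    print(problem)
```

Returning a list of problems rather than stopping at the first failure lets a single pass report every configuration-level issue before any training resources are used.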

[0086] After the static checks are passed, the system further performs integration and runtime tests on a pre-set small-scale test set. Static checks are primarily used to identify structural problems in the configuration and dependency layers; however, whether the training process can complete key steps such as data loading, forward computation, loss calculation, backpropagation, and parameter updates during actual operation still needs to be verified through real-world testing. Therefore, this embodiment introduces a small-scale test set to drive the target model to perform a limited number of training steps under the constraints of a unified training configuration file. Through this integration and runtime test, the system can further verify the connectivity of the data pipeline, the effectiveness of underlying operator calls, and the basic operability of the training process, while reducing the time and computational overhead of directly performing trial and error on the full training data.
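A limited-step runtime test of this kind can be sketched as below. The one-parameter linear model is only a stand-in for a real speech model, and `smoke_test`, its arguments, and the result fields are hypothetical names chosen for illustration.

```python
import math


def smoke_test(samples, steps=5, lr=0.01):
    """Run a few SGD steps on a toy model y = w * x and report whether they completed."""
    w = 0.0
    losses = []
    for step in range(steps):
        x, y = samples[step % len(samples)]   # data loading
        pred = w * x                          # forward computation
        loss = (pred - y) ** 2                # loss calculation
        if not math.isfinite(loss):
            return {"passed": False, "failed_step": step, "losses": losses}
        grad = 2.0 * (pred - y) * x           # backpropagation (analytic gradient)
        w -= lr * grad                        # parameter update
        losses.append(loss)
    return {"passed": True, "failed_step": None, "losses": losses}


# A tiny stand-in for the pre-set small-scale test set.
tiny_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
result = smoke_test(tiny_set)
print(result["passed"], len(result["losses"]))
```

The test only confirms that each stage runs and produces finite values; it makes no claim about convergence or final model quality.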

[0087] Furthermore, after successful integration and testing, the system outputs an executable training process corresponding to the target speech understanding model. This executable training process can be generated based on a unified training configuration file and standard training organization methods within a unified training framework, enabling subsequent developers or automated execution modules to directly conduct controlled training and reproduction based on this training process. Through this step, the system further translates the results of the aforementioned configuration extraction, configuration conversion, and process verification into actual callable training execution units, thereby completing the closed-loop processing from target model information parsing to executable output of the training process.

[0088] In this embodiment, after the unified training configuration file is generated, a training process verification procedure consisting of static checks and integration runtime tests is further established. This ensures that the output executable training process not only originates from the automated configuration conversion but also undergoes executability verification. This helps reduce the risk of subsequent training failures caused by configuration mismatch, dependency anomalies, or incomplete runtime processes, and improves the usability and stability of the automated training process conversion results in a unified and controlled training scenario, thereby providing a more reliable execution foundation for the reproducible training of the target speech understanding model.

[0089] Figure 3 shows a schematic diagram of the system operating principle of an example of the unified evaluation method for speech understanding models according to an embodiment of this application.

[0090] As shown in Figure 3, the system proposed in this application is not a single benchmark set or a collection of scattered evaluation scripts, but a unified experimental framework for speech understanding models. The system can comprise two interconnected parts: a unified evaluation pipeline for speech understanding models in Phase One, and an agent-assisted automated training conversion pipeline in Phase Two. Through a unified evaluation protocol, a unified prediction output format, a unified post-processing workflow, a unified scoring interface, and an automated training workflow conversion mechanism, the system can support performance evaluation, result comparison, and controlled training reproduction of heterogeneous speech understanding models under consistent conditions.

[0091] In the unified evaluation pipeline of Phase One, the system first introduces scenario test suites and standard reference files as input data sources. To more comprehensively evaluate the performance of the model in real deployment environments, the system constructs a multi-dimensional real-world test set, including test scenarios such as noisy speech scenarios, dialect speech scenarios, code-switching language scenarios, multi-speaker conference scenarios, and contextual hot word recognition scenarios. After the speech understanding model to be evaluated performs inference in the above scenarios, it outputs a model prediction file that follows a unified standard. To reduce the impact of differences in the organization of outputs from different models on subsequent evaluations, the system organizes the prediction results using a unified prediction data format, ensuring that the output results can at least carry speech identifiers, prediction results, and metadata information, and establish a correspondence with the standard reference file.
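As an illustration of a unified prediction data format, the sketch below assumes one JSON object per line carrying a sample identifier, a prediction, and metadata. The field names `id`, `prediction`, and `metadata` and the helper `parse_predictions` are hypothetical; this application only requires that these three kinds of information be carried.

```python
import json

REQUIRED_FIELDS = ("id", "prediction", "metadata")  # hypothetical field names


def parse_predictions(lines):
    """Parse JSON Lines prediction records and index them by sample identifier."""
    records = {}
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"line {line_no}: missing fields {missing}")
        if record["id"] in records:
            raise ValueError(f"line {line_no}: duplicate id {record['id']!r}")
        records[record["id"]] = record
    return records


lines = [
    '{"id": "utt1", "prediction": "hello world", "metadata": {"task": "asr", "lang": "en"}}',
    '{"id": "utt2", "prediction": "happy", "metadata": {"task": "ser"}}',
]
records = parse_predictions(lines)
print(sorted(records))
```

Rejecting duplicate identifiers at parse time keeps the later correspondence with the standard reference file one-to-one.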

[0092] Subsequently, the system performs format parsing and sample identifier alignment, and identifies the current task type based on the metadata information, such as automatic speech recognition, speech translation, speech emotion recognition, spoken language understanding, or gender recognition. Before entering the scoring stage, the system calls the text processing component corresponding to the task type to apply consistent normalization to the prediction results and reference results, such as case unification, number expression normalization, label cleanup, and special symbol processing. Next, the system calls the scoring script corresponding to the current task type through a unified scoring interface to calculate the absolute performance metric of each model on that task. The system can also convert results across different tasks, based on the task performance metrics and the corresponding benchmark values, to obtain relative performance scores that can be used for cross-task comparison, and generate a standardized evaluation report accordingly.
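Sample identifier alignment followed by consistent normalization can be sketched as follows. The normalization rules shown (lowercasing, removing bracketed tags, stripping ASCII punctuation, collapsing whitespace) are a simplified assumption; number expression normalization and language-specific rules are omitted, and the function names are hypothetical.

```python
import re
import string

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_text(text):
    """Lowercase, drop bracketed tags such as <noise>, strip ASCII punctuation, collapse spaces."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)   # label cleanup
    text = text.translate(_PUNCT_TABLE)    # special symbol processing
    return " ".join(text.split())


def align(predictions, references):
    """Pair normalized prediction and reference texts by sample identifier."""
    missing = sorted(set(references) - set(predictions))
    if missing:
        raise KeyError(f"no prediction for samples: {missing}")
    return [
        (sid, normalize_text(predictions[sid]), normalize_text(references[sid]))
        for sid in sorted(references)
    ]


predictions = {"utt1": "Hello, <noise> World!", "utt2": "Good  Morning"}
references = {"utt1": "hello world", "utt2": "good morning."}
for sid, hyp, ref in align(predictions, references):
    print(sid, hyp == ref)
```

Because the same function is applied to both sides, differences in casing or punctuation between model outputs no longer count as errors. Note that this simple rule also removes apostrophes, which a production normalizer would likely handle more carefully.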

[0093] When standardized evaluation reports indicate that certain target models have further analytical value, the system can enter the second phase of the Agent-Assisted Automated Training Transformation Pipeline based on controlled training reproduction requests. In this phase, the intelligent agent module automatically accesses and parses the paper description documents and open-source code repositories associated with the target model, extracting model structure information, training data dependencies, and training parameter configurations. Based on the extracted multi-source information, the system further generates a unified training configuration file, enabling the target model to be mapped to a pre-defined unified training framework.

[0094] To ensure the executable nature of the automatically generated training configuration, the system then performs training process verification. This verification process may include static checks such as parameter consistency checks, dependency checks, and checks on the correspondence between loss functions and evaluation metrics. After the static checks pass, a small-scale test run can be performed to verify the feasibility of the training process in a real-world operating environment. Upon successful verification, the system outputs an executable training process corresponding to the target model to support subsequent reproducible training under unified training conditions.

[0095] Through the aforementioned system mechanism, a speech understanding experimental framework combining unified evaluation with automated training conversion is established. This framework helps improve the consistency of the evaluation process for heterogeneous models and the fairness of result comparison; it also extends evaluation coverage under real-world conditions and provides unified support for model training reproduction and architecture comparison. It therefore provides a relatively complete technical foundation for the research, analysis, performance comparison, and engineering selection of speech understanding models.

[0096] It should be noted that while speech foundation models and large-scale speech language models possess advanced speech understanding capabilities, deployment-oriented model selection is hampered by mismatched post-processing in evaluation and by the difficulty of reproducing training results across different data scales and workflows. This application proposes SURE (Speech Understanding Reproducible Evaluation), a unified experimental framework that standardizes prediction formats, normalization, and scoring. SURE evaluates representative tasks across various paradigms, from traditional pipelines to large-scale speech language models, under realistic acoustic and linguistic stress conditions. In addition to evaluation, SURE introduces an agent-assisted training conversion pipeline that maps papers and code to versioned, runnable training pipelines based on a unified protocol, suitable for matched subsets of open data. Overall, SURE improves the comparability and reproducibility of deployment-oriented evaluations.

[0097] More specifically, SURE is a unified and reproducible experimentation framework for speech understanding. It is not a static benchmark, but a deployment-oriented model selection framework that is coupled in three ways: (i) conducting scenario-driven evaluations under fixed protocols; (ii) providing a broad range of coverage for diverse model families and rich evaluation dimensions; and (iii) enabling controlled training through agent-assisted transformation workflows to support fairer architecture comparisons.

[0098] Table 1: Comparison of speech understanding benchmarks from a general perspective. This application compares benchmarks on three aspects: dataset stress factors, model family breadth, and controlled training for architecture evaluation. Here, "Families" denotes the number of different modeling paradigms (e.g., CTC / AED, cascaded pipelines, and Speech LLMs) that are explicitly evaluated side by side in the main results of each benchmark.

[0099] In recent years, benchmarks such as SUPERB, Dynamic SUPERB, AIR-Bench, and AudioBench have significantly expanded task coverage and driven large-scale evaluation. However, as summarized in Table 1, these benchmarks typically provide only limited model family breadth for speech understanding tasks: evaluations tend to focus on a narrower class of models, usually Speech LLMs, with fewer comparisons alongside strong traditional paradigms. For some tasks, Speech LLMs are not necessarily the best-performing system class, and the lack of architectural diversity makes task-specific conclusions less meaningful. Furthermore, even when a model is evaluated, the results are usually reported only under a single standard test condition, making it difficult to accurately pinpoint the model's robustness and generalization ability under real-world stress. In contrast, SURE emphasizes scenario-deep evaluation: probing each model under real-world stress conditions. It also supports evaluation of multiple model types under unified protocols and controlled training conditions.

[0100] This application summarizes the released SURE package, including its tracks, data suites, and unified interfaces for training, inference, and scoring, and then introduces the evaluation protocol and metrics. Results are reported on three tracks, covering scenario-oriented deep evaluation, cross-task comparison, and architecture research based on controlled training. The main contributions are as follows: (i) this application releases SURE as a reproducible experimental closed loop for deployment-oriented model selection, within which the evaluation process is unified through a consistent protocol; (ii) this application constructs a scenario-oriented data suite and benchmarks diverse model families and capability dimensions to support cross-paradigm comparisons under real acoustic and linguistic stress conditions; (iii) this application takes an initial step toward more controlled comparisons by introducing an agent-assisted transformation pipeline for controlled training, which reduces implementation variance.

[0101] Figure 4 shows a schematic diagram of an example of the evaluation scope of the unified evaluation and agent-assisted training conversion workflow (i.e., the SURE framework) according to an embodiment of this application.

[0102] SURE is an end-to-end experimental package for implementing reproducible speech understanding evaluation and supporting controlled training research through an agent-assisted transformation workflow. It provides the following: (i) a project website for documentation and updates; (ii) a unified evaluation stack that includes post-processing and scoring; (iii) an agent-assisted pipeline for training conversion; and (iv) test and training set suites for all tracks.

[0103] SURE contains three tracks: two are used for evaluation (Tracks I–II; see Figure 4), and one is used for controlled training research (Track III).

[0104] Track I: Stress Testing for Front-End Speech Tasks. This track builds a suite of scenarios based on open-source data to approximate real-world conditions, covering both acoustic and linguistic stress factors.

[0105] Track II: Full-Stack Speech Understanding Evaluation. This track provides a unified assessment of a wide range of capabilities, from signal-level perception to lightweight semantic processing and transformation, and then to information-based deep reasoning.

[0106] Track III: Preliminary Exploration of Controlled Training. This track is an initial step toward more controlled comparisons, involving training from scratch on a matched subset of open data; this process is supported by an agent-assisted workflow that transforms "paper + code" into runnable Swift pipelines.

[0107] Figure 5 shows an overview architecture diagram of the unified evaluation and agent-assisted training conversion workflow (i.e., the SURE framework) according to an embodiment of this application.

[0108] As shown in Figure 5, the workflow provided in this embodiment generally includes a unified evaluation pipeline on the left and an agent-assisted training conversion pipeline on the right. Through a unified evaluation protocol and an automated agent mechanism, this architecture transforms model prediction files and reference files into standardized evaluation reports, and supports the transformation of open-source model papers and code into reproducible training recipes.

[0109] The unified evaluation pipeline on the left of Figure 5 follows a fixed standard workflow of "input-preprocessing-normalization-evaluation-output". In the input and preprocessing stages, the system receives a labeled JSON file and a predicted text file via command-line parameters. After loading the data, it performs task recognition, task mapping, and alias resolution, mapping the user-specified task to the system's internal standard evaluator. In the normalization stage, the system executes fixed text processing components for text-centric tasks, with calls differentiated by task type: for Automatic Speech Recognition (ASR) tasks, it calls the preprocessor and executes language-specific rules such as digit-to-word conversion; for Speech Emotion Recognition (SER), Gender Recognition (GR), and Speech-to-Text Translation (S2TT) tasks, it executes rules such as label normalization, digit / currency conversion, and translated text processing; for Speaker Diarization (SD) and Speaker-Attributed ASR (SA-ASR) tasks, it performs the corresponding input preprocessing. In the evaluation and output stage, the system calls the task-specific scoring scripts through a unified interface to generate metric results (such as word / character error rate, WER / CER, for ASR tasks; translation scores such as BLEU / chrF2 for S2TT tasks; and accuracy for SER / GR / SLU tasks). Finally, the results of all tasks are formatted, printed to the console, and saved as a unified JSON evaluation report.
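Alias resolution and dispatch to a task-specific scorer behind one interface can be sketched as follows. The alias table, the evaluator registry, and the function names are hypothetical; only WER and accuracy are shown, and inputs are assumed to be already normalized (prediction, reference) pairs.

```python
import json

# Hypothetical alias table mapping user-facing task names to internal evaluators.
TASK_ALIASES = {"asr": "asr", "speech_recognition": "asr", "ser": "ser", "emotion": "ser"}


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def score_wer(pairs):
    """Corpus-level word error rate in percent over (prediction, reference) pairs."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for hyp, ref in pairs)
    words = sum(len(ref.split()) for _, ref in pairs)
    return {"wer": 100.0 * errors / words}


def score_accuracy(pairs):
    """Label accuracy in percent over (prediction, reference) pairs."""
    correct = sum(hyp == ref for hyp, ref in pairs)
    return {"accuracy": 100.0 * correct / len(pairs)}


EVALUATORS = {"asr": score_wer, "ser": score_accuracy}


def evaluate(task_name, pairs):
    """Resolve the task alias and call the matching scorer through one entry point."""
    task = TASK_ALIASES.get(task_name.lower())
    if task is None:
        raise ValueError(f"unknown task: {task_name}")
    return {"task": task, "metrics": EVALUATORS[task](pairs)}


report = [
    evaluate("speech_recognition", [("the cat sat", "the cat sat down"), ("hello word", "hello world")]),
    evaluate("emotion", [("happy", "happy"), ("sad", "angry")]),
]
print(json.dumps(report, indent=2))
```

Keeping the scorers in a registry means a new task can be supported by registering one function and one alias, without changing the entry point.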

[0110] In the evaluation protocol, to enable intuitive and updatable comparisons of model performance across heterogeneous tasks, this application introduces a compact comprehensive evaluation metric in addition to individual metrics: the Relative Performance Score (RPS). Since the SURE framework covers heterogeneous tasks and different evaluation metrics, the unified Relative Performance Score (RPS) reported in this application has its value range normalized to the [0, 1] interval.

[0111] Specifically, within the same evaluation pipeline, the metric for each task is normalized relative to the best score for that task on the current leaderboard, recorded here as $S_t^{\mathrm{best}}$. The conversion takes the form

$$\mathrm{RPS}_t = \min\left(1,\ \rho_t\right),$$

where $\rho_t$ is the initially calculated ratio between $S_t$, the actual score of the currently evaluated model on task $t$, and the best score $S_t^{\mathrm{best}}$, and $\varepsilon$ is a small constant added to the denominator of that ratio to ensure numerical stability (preventing the denominator from being zero). Comparing $\rho_t$ with 1 and taking the smaller value ensures that the final score is capped at 1.
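The normalization and capping described above can be sketched numerically. The orientation of the ratio is an assumption made for this sketch: the model score is divided by the best score for higher-is-better metrics, and the best score is divided by the model score for lower-is-better metrics such as error rates, so that a better result always yields a larger value. The function name, the `higher_is_better` flag, and the default `eps` are hypothetical.

```python
def relative_performance_score(score, best, higher_is_better, eps=1e-8):
    """Normalize a task score against the leaderboard best and cap the result at 1.

    The ratio orientation per metric direction is an assumption of this sketch.
    """
    if higher_is_better:
        ratio = score / (best + eps)
    else:
        ratio = best / (score + eps)
    return min(1.0, ratio)


# Accuracy of 72% against a best of 80% (higher is better).
print(round(relative_performance_score(72.0, 80.0, higher_is_better=True), 3))
# WER of 5.0% against a best of 4.0% (lower is better).
print(round(relative_performance_score(5.0, 4.0, higher_is_better=False), 3))
```

The cap matters when a newly evaluated model beats the recorded best score before the leaderboard has been updated: its ratio exceeds 1 and is truncated to 1.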

[0112] Furthermore, the RPS metric is dynamic in two dimensions. First, the leaderboard can be updated: when a new strong model is added, the system can update the best score by rerunning the published evaluation script and recalibrate the RPS of all models already in the database accordingly. Second, the task set can be expanded: new tasks can be incorporated by supplementing standardized evaluation scripts, allowing the RPS to be summarized over a wider set of tasks over time. To support fair interpretation across heterogeneous tasks, the system provides the individual metrics for each task in addition to reporting the RPS summary results.

[0113] The agent-assisted training conversion pipeline on the right of Figure 5 provides a workflow for mapping open-source "papers + code" to target training recipes. When the conversion agent (ReproAgent) is started, the system first initializes the workspace and registers the associated tools. The workflow is divided into two phases. Phase 1 is the architecture analysis phase (Architect Agent). This agent explores the code repository structure, model, and data, identifies core components and strategies, and creates a conversion plan (e.g., the conversion_plan.md file). If the plan is not successfully generated and the maximum number of retries is exceeded, the conversion is considered to have failed; if the plan is successfully generated, the workflow proceeds to the next phase. Phase 2 is the implementation loop phase (Worker & Validator Agent). In this phase, the worker agent reads the plan, implements the code for the model and training scripts, and submits it; the validator agent then performs syntax, import, and architecture verification on the submitted code, runs the training process in the script as an integration test, and generates a detailed feedback report. If a problem is found during verification, the feedback is returned to the worker agent for iterative optimization, forming a feedback loop; if the verification passes, a success flag is output, completing the reproducible conversion and verification of the unified training configuration.
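Recalibration after a leaderboard update can be sketched as below. The data layout (`{task: {model: score}}`), the `recalibrate` helper, and the ratio orientation for lower-is-better metrics are assumptions of this sketch rather than details given in this application.

```python
def recalibrate(leaderboard, higher_is_better, eps=1e-8):
    """Recompute per-task RPS for every model from the current best score per task.

    leaderboard: {task: {model: score}}; returns {model: {task: rps}}.
    """
    rps = {}
    for task, scores in leaderboard.items():
        best = max(scores.values()) if higher_is_better[task] else min(scores.values())
        for model, score in scores.items():
            if higher_is_better[task]:
                ratio = score / (best + eps)
            else:
                ratio = best / (score + eps)
            rps.setdefault(model, {})[task] = min(1.0, ratio)
    return rps


leaderboard = {"asr_wer": {"model_a": 5.0, "model_b": 4.0}}
direction = {"asr_wer": False}  # error rate: lower is better
before = recalibrate(leaderboard, direction)

# A new, stronger model joins the leaderboard; every existing RPS shifts.
leaderboard["asr_wer"]["model_c"] = 2.0
after = recalibrate(leaderboard, direction)
print(round(before["model_a"]["asr_wer"], 3), round(after["model_a"]["asr_wer"], 3))
```

Since every RPS depends on the current best score, the raw per-task metrics remain the stable quantity to archive, and the RPS values are derived from them on demand.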
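The control flow of the two phases can be sketched with plain callables standing in for the agents. The function `run_conversion`, its parameters, and the stub agents are hypothetical; in particular, the cap on implementation iterations (`max_iterations`) is an assumption, since the description above only states a retry limit for plan generation.

```python
def run_conversion(make_plan, implement, validate, max_plan_retries=3, max_iterations=5):
    """Phase 1: obtain a plan with retries. Phase 2: implement and validate in a loop."""
    plan = None
    for _ in range(max_plan_retries):
        plan = make_plan()
        if plan is not None:
            break
    if plan is None:
        return {"status": "failed", "reason": "no conversion plan"}

    feedback = None
    for iteration in range(1, max_iterations + 1):
        code = implement(plan, feedback)   # worker writes or revises the code
        passed, feedback = validate(code)  # validator returns a verdict and a report
        if passed:
            return {"status": "success", "iterations": iteration, "report": feedback}
    return {"status": "failed", "reason": "validation did not pass", "report": feedback}


# Stub agents: the first implementation fails validation, the revised one passes.
def stub_plan():
    return "conversion plan"


def stub_worker(plan, feedback):
    return "v1" if feedback is None else "v2"


def stub_validator(code):
    return code == "v2", f"checked {code}"


outcome = run_conversion(stub_plan, stub_worker, stub_validator)
print(outcome)
```

Passing the validator's report back into the worker is what closes the feedback loop; the worker never has to guess why a previous attempt was rejected.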

[0114] The following section describes scenario-based stress testing for front-end speech tasks, specifically Track I. Track I focuses on front-end speech perception and examines whether the ASR system remains reliable under real-world deployment conditions. This application introduces this track because many evaluations are conducted on highly controlled and narrow test sets, thus failing to characterize the system's behavior under the complex stress conditions commonly encountered in real-world deployments.

[0115] The stress tests in this application cover two complementary families of scenarios. The first family consists of challenging acoustic and scene conditions that are prone to causing recognition failures, including background noise, far-field reverberation, and multi-person conference speech. Specifically, this application uses VoxPopuli-en to evaluate English recordings in natural noise environments, AISHELL-5 to evaluate Mandarin in-vehicle speech with noise and reverberation, and conference corpora AMI and AliMeeting to evaluate speaker attribution transcription ability in English and Mandarin, respectively. The second family consists of language and context-related conditions that require explicit biasing or grounding, including mixed Chinese-English code-switching (CS-Dialogue), dialect variants (KeSpeech), and context / hot word sensitive recognition (ContextASR).

[0116] This application benchmarks modern speech foundation models and widely used traditional baselines, covering open-source systems and commercial APIs. For ASR, this application reports WER (Word Error Rate) for English and CER (Character Error Rate) for Mandarin. For conference transcription, this application also reports permutation-invariant cpWER and DER.
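The permutation-invariant idea behind cpWER can be illustrated with a brute-force sketch: each speaker's utterances are concatenated, every assignment of hypothesis speakers to reference speakers is tried, and the assignment with the fewest word errors is kept. This is a simplified illustration rather than the scoring script used here; the function names are hypothetical, and trying all permutations is only practical for a small number of speakers.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def cp_wer(ref_by_speaker, hyp_by_speaker):
    """Minimum-permutation WER in percent over per-speaker concatenated transcripts."""
    refs = [text.split() for text in ref_by_speaker]
    hyps = [text.split() for text in hyp_by_speaker]
    # Pad with empty transcripts so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return 100.0 * best_errors / total_ref_words


reference = ["hello there", "good morning everyone"]
hypothesis = ["good morning everyone", "hello their"]  # speaker labels are swapped
print(cp_wer(reference, hypothesis))
```

In the example the hypothesis uses the opposite speaker labels, yet only the single substituted word is counted, because the best permutation undoes the label swap.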

[0117] Table 2: Track I: Speaker-aware ASR performance (DER and cpWER / cpCER, in %; ↓ indicates that lower values are better). "–" indicates no results yet. Collar is set to 0.

[0118] The experiments revealed two complementary failure modes, which motivated SURE's scenario-driven design. First, for speaker-aware conference transcription (Table 2), the cascaded pipeline remains highly competitive with end-to-end systems (such as VibeVoice-ASR), highlighting the difficulty of conference scenarios due to the interaction of far-field acoustics, interference, and speaker attribution. This gap indicates that conference scenarios are not simply more challenging ASR settings, but rather a complex problem comprised of acoustic stress factors and structural requirements (speaker arrangement and attribution), thus necessitating specialized evaluation beyond single-speaker benchmarks.

[0119] Table 3: Track I: Front-end perception evaluation under scenario stress test, error rate reported as % (the lower the value, the better).

[0120] This application also reports RPS (↑), where the state-of-the-art (SOTA) score for each task is taken from the best score in the same table (therefore, the RPS of the best model = 1). "–" indicates no results are available. For ContextASR, this application reports two settings: with hot words injected (left value) and without hot words injected (right value); RPS is calculated using the left values.

[0121] Secondly, the ASR stress test suite (Table 3) shows a clear trade-off between different stress factor families: systems with stronger context / biasing capabilities tend to perform better in code-switching and context recognition tasks, but this does not necessarily provide a general advantage under conditions of severe acoustic degradation or dialect variations. This application also observes that applying SURE's unified normalization and scoring methods substantially alters reported results. For example, on LibriSpeech, rerunning a representative system using the evaluation pipeline of this application resulted in an RPS change of approximately 0.3 compared to the originally reported value, highlighting the necessity of using a unified script for fair comparison.

[0122] In summary, these results demonstrate that SURE provides significant value for model selection by offering scenario-specific diagnostic results under a unified protocol.

[0123] The following section describes the full-stack speech understanding evaluation using Track II. Building upon the scenario stress tests of Track I, Track II provides a horizontal comparison of representative speech understanding tasks under a unified protocol. This application benchmarks strong models across different paradigms, including end-to-end Speech LLMs and cascaded pipelines as a supplementary reference, evaluating them using the same prediction format, normalization method, and scoring script. As shown in Table 4, Track II covers a wide range of tasks, from basic recognition and translation to paralinguistic and semantic understanding tasks.

[0124] Table 4: Track II: Horizontal comparison on speech understanding tasks. All scores are expressed as %.

[0125] For ASR, this application reports the WER (clean / other) for LibriSpeech and the CER (↓) for AISHELL-1. GR, SER, and SLU (Spoken Language Understanding) are presented with accuracy (↑). S2TT reports the character-level BLEU (↑) for the English-to-Chinese / Chinese-to-English translation task on CoVoST2. “–” indicates no results are available.

[0126] Three observations stand out. First, with fixed post-processing and scoring methods, cascaded pipelines remain competitive on core perception tasks, indicating that combining a strong front-end with a robust language back-end remains a viable design approach under clean conditions. Second, sentiment recognition remains challenging for all systems, suggesting that current models do not fully utilize sentiment and prosodic cues. Third, this application observes that some Speech LLMs with instruction-following capabilities suffer from format adherence issues crucial for evaluation on relatively simple tasks (such as ASR and S2TT): even if the generated content appears reasonable, automatic evaluation metrics drop significantly if it deviates from the required output pattern.

[0127] The following section describes the initial exploration of controlled training in Track III. Following Tracks I–II, Track III provides an initial exploration of controlled training as a step towards more reproducible training-oriented research. The goal of this application is not to draw broad architectural conclusions, but rather to turn "paper + code" into an executable and comparable training process under a unified protocol by publishing an agent-assisted conversion workflow. This workflow enables the migration of a training pipeline to the open-source framework Swift. With a limited open data budget, this application trains each model from scratch under a matched protocol and evaluates it at its best checkpoint using the same scoring script, thereby reducing the variance introduced by heterogeneous training pipelines and different reporting methods.

[0128] Regarding task and data partitioning, this application reuses the task hierarchy from Track II, constructing training partitions whose sources are tied to the evaluation benchmarks while also incorporating explicit generalization tests. For example, this application trains SER on IEMOCAP and evaluates it on MELD, and trains SLU on SLURP and evaluates it on MMSU-Reason. All evaluation metrics follow the settings of Track II.

[0129] Regarding the agent-assisted conversion workflow, as shown on the right of Figure 5, the agent pipeline of this application generates a Swift training recipe and simultaneously outputs a versioned conversion plan and a verifier report. The pipeline analyzes model specifications from papers and code repositories, generates executable configurations, validation data, and loss / metric connections, and runs integration checks before training officially starts.

[0130] Specifically, the agent generates the following three artifacts: (i) versioned Swift recipes (covering the model, data, optimizer, and training scheduler); (ii) an executable conversion plan; and (iii) a verifier report.

[0131] The verifier performs static checks (such as dependency resolution, configuration sanity checks, and loss / metric signature checks) and integration checks (such as a short trial run on mini-batch data) to ensure that the converted pipeline is runnable before full training.
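As a rough illustration of the two-stage verification described above, the following Python sketch runs the static checks first and attempts the integration trial run only if they pass. The recipe keys, the `verify` function, and the `train_step` callable are hypothetical names chosen for this example; they are not taken from Swift or from this application.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical recipe sections, mirroring "models, data, optimizers, schedulers".
REQUIRED_KEYS = ("model", "data", "optimizer", "scheduler")


@dataclass
class VerifierReport:
    passed: list[str] = field(default_factory=list)
    failed: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.failed


def check_recipe_keys(recipe: dict) -> bool:
    """Static check: every required recipe section is present."""
    return all(key in recipe for key in REQUIRED_KEYS)


def check_learning_rate(recipe: dict) -> bool:
    """Static check: the learning rate is a number in a plausible range."""
    lr = recipe.get("optimizer", {}).get("lr")
    return isinstance(lr, (int, float)) and 0 < lr < 1


def smoke_run(train_step: Callable[[list], float], mini_batch: list) -> bool:
    """Integration check: one step on a tiny batch must return a finite loss."""
    try:
        loss = train_step(mini_batch)
    except Exception:
        return False
    return isinstance(loss, float) and math.isfinite(loss)


def verify(recipe: dict, train_step: Callable[[list], float], mini_batch: list) -> VerifierReport:
    report = VerifierReport()
    static_checks = {"recipe_keys": check_recipe_keys, "learning_rate": check_learning_rate}
    for name, check in static_checks.items():
        (report.passed if check(recipe) else report.failed).append(name)
    # The integration trial run is attempted only when all static checks pass.
    if report.ok:
        target = report.passed if smoke_run(train_step, mini_batch) else report.failed
        target.append("smoke_run")
    return report
```

Running the cheap static checks first means a malformed recipe is rejected without spending any compute on a trial run.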

[0132] Regarding the initial model coverage and results, as a proof of concept, this application migrated a small group of representative models to Swift. It is worth noting that Qwen2-Audio can complete the end-to-end conversion without manual patching, while other models may require lightweight manual modifications due to incomplete or non-standardized releases. Table 5 reports the results for Qwen2-Audio-7B and TASU(SFT)-2B, both of which were trained from scratch using the same protocol.

[0133] Table 5: Task coverage of Track III controlled training.

[0134] Table 5 shows the evaluation of ASR on AISHELL-1 (Chinese) and LibriSpeech test-clean (English); GR on LibriSpeech; SER on MELD; SLU on MMSU-Reason; and S2TT on CoVoST2. All evaluation metrics follow the definitions in Table 4, and results are reported as percentages (%).

[0135] Overall, TASU lags behind Qwen2-Audio on paralinguistic tasks (such as GR and SER), but remains competitive on semantic tasks (such as SLU and S2TT), which is consistent with its design, which places greater emphasis on language supervision.

[0136] The SURE framework proposed in this application provides a unified and reproducible experimental framework for speech understanding. SURE standardizes prediction formats, normalization, and scoring to support consistent comparisons across different model types, and provides a scenario-driven data suite covering real acoustic and language stress conditions. SURE also releases an agent-assisted conversion workflow that converts "paper + code" into versioned, runnable Swift pipelines to support controlled training research. SURE is open-source and extensible, making it suitable for deployment-oriented model selection.

[0137] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of combined actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Secondly, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application. In the above embodiments, the descriptions of each embodiment have their own emphasis; for parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0138] In some embodiments, this application also provides a computer program product, the computer program product including a computer program stored on a non-volatile computer-readable storage medium, the computer program including program instructions, which, when executed by a computer, cause the computer to execute any of the above-described unified evaluation methods for speech understanding models.

[0139] In some embodiments, this application also provides an electronic device, which includes: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to execute a unified evaluation method for speech understanding models.

[0140] The apparatus described in the embodiments of this application can be used to execute the unified evaluation method for speech understanding models in the embodiments of this application, and accordingly achieve the technical effects achieved by the unified evaluation method for speech understanding models in the embodiments of this application, which will not be elaborated further here. In the embodiments of this application, the relevant functional modules can be implemented using a hardware processor.

[0141] Figure 6 is a schematic diagram of the hardware structure of an electronic device that performs the unified evaluation method for speech understanding models, as provided in another embodiment of this application. As shown in Figure 6, the device includes one or more processors 610 and a memory 620; Figure 6 takes one processor 610 as an example.

[0142] The device for implementing the unified evaluation method for speech understanding models may further include: an input device 630 and an output device 640.

[0143] The processor 610, memory 620, input device 630, and output device 640 can be connected via a bus or by other means; Figure 6 takes connection via a bus as an example.

[0144] The memory 620, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions / modules corresponding to the unified evaluation method for speech understanding models in the embodiments of this application. The processor 610 executes various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 620, thereby implementing the unified evaluation method for speech understanding models in the above-described method embodiments.

[0145] The memory 620 may include a program storage area and a data storage area, wherein the program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the device, etc. Furthermore, the memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 620 may optionally include memory remotely located relative to the processor 610, and these remote memories may be connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0146] Input device 630 can receive input digital or character information and generate signals related to user settings and function control of the device. Output device 640 may include display devices such as a display screen.

[0147] The one or more modules are stored in the memory 620, and when executed by the one or more processors 610, they execute the unified evaluation method for speech understanding models in any of the above method embodiments.

[0148] The above-described product can perform the methods provided in the embodiments of this application, and has the corresponding functional modules and beneficial effects for performing the methods. Technical details not described in detail in this embodiment can be found in the methods provided in the embodiments of this application.

[0149] The electronic devices in the embodiments of this application exist in various forms, including but not limited to: (1) Mobile communication devices: These devices are characterized by their mobile communication capabilities and primarily aim to provide voice and data communication. Such terminals include smartphones (e.g., iPhones), multimedia phones, feature phones, and low-end phones.

[0150] (2) Ultra-mobile personal computer devices: These devices fall under the category of personal computers, possessing computing and processing capabilities, and generally also have mobile internet access features. These terminals include PDAs, MIDs, and UMPCs, such as the iPad.

[0151] (3) Portable entertainment devices: These devices can display and play multimedia content. This category includes audio and video players (such as iPods), handheld game consoles, e-book readers, as well as smart toys and portable car navigation devices.

[0152] (4) Server: A device that provides computing services. The components of a server include a processor, hard disk, memory, system bus, etc. Servers are similar to general computer architectures, but because they need to provide highly reliable services, they have higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.

[0153] (5) Other electronic devices with data interaction functions.

[0154] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0155] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented using software plus a general-purpose hardware platform, or of course, using hardware. Based on this understanding, the above technical solutions, in essence or the parts that contribute to the related technology, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0156] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A unified evaluation method for speech understanding models, comprising: obtaining model prediction files output by multiple speech understanding models to be evaluated for an evaluation task, and a standard reference file corresponding to the evaluation task, wherein the model prediction files follow a preset standard prediction output format; parsing the model prediction files based on the standard prediction output format to extract sample identifiers, metadata information, and predicted text; establishing a correspondence between the predicted text and the reference text in the standard reference file based on the sample identifiers; identifying the speech understanding task type corresponding to the current evaluation task based on the metadata information, and, when the metadata information is insufficient to determine the speech understanding task type, determining the speech understanding task type in combination with the standard reference file; based on the identified speech understanding task type, invoking the text normalization component corresponding to the speech understanding task type to perform unified text normalization processing on the extracted predicted text and the reference text in the standard reference file, so as to obtain normalized predicted text and normalized reference text; based on the speech understanding task type, calling a scoring script corresponding to the speech understanding task type through a preset unified scoring interface, and calculating, based on the normalized predicted text and the normalized reference text, the task performance indicators of the multiple speech understanding models under the current evaluation task respectively; based on the task performance indicators and the preset reference performance values corresponding to each task performance indicator, performing normalization conversion on the task performance indicators of each speech understanding model to obtain the relative performance score corresponding to each speech understanding model; and aggregating the task performance indicators and relative performance scores corresponding to each speech understanding model to generate a standardized evaluation report.
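The parsing, alignment, and task-identification steps of claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the claim does not fix a concrete file layout, so the JSON Lines encoding and the field names `id`, `meta`, `task`, and `text` are hypothetical.

```python
import json


def parse_jsonl(payload: str) -> dict:
    """Map sample identifier -> record for a JSON Lines payload (assumed layout)."""
    records = {}
    for line in payload.splitlines():
        if line.strip():
            record = json.loads(line)
            records[record["id"]] = record
    return records


def align(preds: dict, refs: dict):
    """Pair predicted and reference text by sample identifier.

    Returns (pairs, missing), where missing lists reference IDs that have
    no prediction, so that incomplete prediction files are visible.
    """
    pairs = [
        (sid, preds[sid]["text"], ref["text"])
        for sid, ref in refs.items()
        if sid in preds
    ]
    missing = [sid for sid in refs if sid not in preds]
    return pairs, missing


def identify_task(preds: dict, refs: dict) -> str:
    """Use prediction metadata first; fall back to the reference file."""
    for source in (preds, refs):
        tasks = {rec.get("meta", {}).get("task") for rec in source.values()} - {None}
        if len(tasks) == 1:
            return tasks.pop()
    raise ValueError("task type could not be determined")
```

Iterating over the reference file rather than the prediction file means extra predictions are ignored and missing ones are reported, which keeps the scored sample set identical across models.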

2. The method according to claim 1, wherein, The process of obtaining model prediction files output by the multiple speech understanding models to be evaluated for the evaluation task includes: A scenario stress test suite oriented towards a real deployment environment is constructed as the data source for the evaluation task. The scenario stress test suite includes an acoustic stress test set and a language stress test set. The acoustic stress test set includes audio samples with background noise, audio samples with far-field reverberation, and conference audio samples with multiple speakers overlapping. The language stress test set includes audio samples with dialect variant features, audio samples with Chinese-English mixed code switching features, and audio samples carrying contextual hot word information. For each audio sample in the scenario stress test suite, the multiple speech understanding models to be evaluated are controlled to perform inference prediction operations to generate inference output results. The inference output results are then structured and encapsulated according to the standard prediction output format to generate multiple model prediction files corresponding to different real-world stress scenarios.

3. The method according to claim 1, wherein, The step of calling the text normalization component corresponding to the identified speech understanding task type to perform unified text normalization processing on the extracted predicted text and the reference text in the standard reference file, to obtain normalized predicted text and normalized reference text, includes: When the identified speech understanding task type is an automatic speech recognition task, the basic text processing component is invoked to perform case unification processing, language-related numerical expression standardization processing, and punctuation normalization processing on the predicted text and the reference text, so as to obtain the corresponding normalized predicted text and normalized reference text respectively. When the identified speech understanding task type is a speech emotion recognition task, a gender recognition task, or a speech input-based classification task, the label processing component is invoked to perform classification label cleaning, task-irrelevant special symbol removal, and category format unification processing on the predicted text and the reference text, so as to obtain the corresponding normalized predicted text and normalized reference text respectively. When the identified speech understanding task type is a speech translation task, the translation text processing component is invoked to perform language-related format normalization, tag cleaning, and cross-language punctuation normalization on the predicted text and the reference text, so as to obtain the corresponding normalized predicted text and normalized reference text respectively.
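A minimal Python sketch of the task-dependent normalization in claim 3 is given below. The task keys (`asr`, `ser`, `gr`, `slu`, `s2tt`) and function names are hypothetical, and the language-related numerical expression standardization is deliberately omitted because its rules are language-specific and not detailed here.

```python
import re
import unicodedata


def normalize_asr(text: str) -> str:
    """Case unification and punctuation removal (number handling omitted)."""
    text = unicodedata.normalize("NFKC", text).lower()
    # Drop every character whose Unicode category is a punctuation class.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(text.split())


def normalize_label(text: str) -> str:
    """Label cleaning: keep letters, digits, underscores, and spaces only."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch in "_ ")
    return " ".join(kept.split())


def normalize_translation(text: str) -> str:
    """Tag cleaning and cross-language punctuation unification via NFKC."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]*>", "", text)  # strip tags such as <|en|>
    return " ".join(text.split())


# Hypothetical task keys mapped to their normalization component.
NORMALIZERS = {
    "asr": normalize_asr,
    "ser": normalize_label,
    "gr": normalize_label,
    "slu": normalize_label,
    "s2tt": normalize_translation,
}


def normalize(task: str, text: str) -> str:
    return NORMALIZERS[task](text)
```

The same function is applied to both the predicted text and the reference text, so neither side gains an advantage from formatting differences.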

4. The method according to claim 1, wherein, Based on the speech understanding task type, a scoring script corresponding to the speech understanding task type is invoked through a preset unified scoring interface. Based on the normalized predicted text and the normalized reference text, the task performance metrics of the multiple speech understanding models under the current evaluation task are calculated, including: If the speech understanding task type is an automatic speech recognition task, then the corresponding error rate calculation script is called, and the calculated word error rate or character error rate is used as the task performance indicator. If the speech understanding task type is a speech translation task, then the corresponding machine translation evaluation script is invoked, and the translation evaluation score of the normalized predicted text compared with the normalized reference text is used as the task performance index. If the speech understanding task type is a speech emotion recognition task, a gender recognition task, or a speech input-based classification task, then the corresponding classification accuracy calculation script is invoked, and the accuracy value of the normalized predicted text matched with the normalized reference text is used as the task performance index.
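The unified scoring interface of claim 4 could be sketched in Python as a dispatch table, as below. The task keys and function names are hypothetical. The error rate is computed with a standard Levenshtein edit distance, and the translation scorer is left out because in practice it would be delegated to an established machine translation evaluation tool.

```python
def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def error_rate(pairs: list, unit: str) -> float:
    """WER (unit='word') or CER (unit='char') in percent over (hyp, ref) pairs."""
    split = str.split if unit == "word" else (lambda s: list(s.replace(" ", "")))
    errors = total = 0
    for hyp, ref in pairs:
        ref_tokens, hyp_tokens = split(ref), split(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        total += len(ref_tokens)
    if total == 0:
        raise ValueError("reference text is empty")
    return 100.0 * errors / total


def accuracy(pairs: list) -> float:
    """Exact-match accuracy in percent over (hyp, ref) pairs."""
    return 100.0 * sum(hyp == ref for hyp, ref in pairs) / len(pairs)


# Hypothetical task keys; a translation scorer would be registered the same way.
SCORERS = {
    "asr_wer": lambda pairs: error_rate(pairs, "word"),
    "asr_cer": lambda pairs: error_rate(pairs, "char"),
    "ser": accuracy,
    "gr": accuracy,
    "slu": accuracy,
}


def score(task: str, pairs: list) -> float:
    return SCORERS[task](pairs)
```

Because every model is scored through the same `score` entry point, adding a new task only requires registering one more scorer.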

5. The method according to claim 1, wherein the step of performing normalization conversion on the task performance indicators of each speech understanding model, based on the task performance indicators and the preset reference performance values corresponding to each task performance indicator, to obtain the relative performance score corresponding to each speech understanding model includes: obtaining the best benchmark score corresponding to the current evaluation task from preset evaluation leaderboard records, and using the best benchmark score as the preset reference performance value; identifying the polarity of the task performance indicator, wherein task performance indicators include positive indicators, for which larger values represent better performance, and negative indicators, for which smaller values represent better performance; when the task performance indicator is a positive indicator, calculating a ratio parameter of the task performance indicator of the current speech understanding model to the best benchmark score; when the task performance indicator is a negative indicator, calculating a ratio parameter of the best benchmark score to the task performance indicator of the current speech understanding model; and determining, based on the ratio parameter, the relative performance score corresponding to each speech understanding model.
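The polarity-aware ratio in claim 5 can be illustrated with a short Python sketch. The function name is hypothetical, and scaling the ratio by 100 is an assumption made for readability; the claim only states that the relative performance score is determined from the ratio parameter.

```python
def relative_score(metric: float, best: float, higher_is_better: bool) -> float:
    """Relative performance score against the best benchmark score.

    Positive indicators (e.g. accuracy, BLEU) use metric / best.
    Negative indicators (e.g. WER, CER) use best / metric.
    Either way, a model that matches the best benchmark scores 100.
    """
    if metric <= 0 or best <= 0:
        raise ValueError("metric and best benchmark score must be positive")
    ratio = metric / best if higher_is_better else best / metric
    return 100.0 * ratio  # the factor 100 is an illustrative choice
```

For example, a WER of 4.0 against a best benchmark WER of 2.0 gives a relative score of 50.0, while an accuracy of 45.0 against a best benchmark accuracy of 60.0 gives 75.0, so indicators of opposite polarity land on one comparable scale.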

6. The method according to claim 5, wherein, After generating the standardized evaluation report, the method further includes performing a dynamic leaderboard calibration operation, specifically including: Detect whether the performance metrics of each of the speech understanding models under the current evaluation task are better than the best benchmark score in the evaluation leaderboard record; If a task performance metric that is better than the best benchmark score exists, the best benchmark score will be updated to that task performance metric to trigger the evaluation leaderboard refresh mechanism. Based on the updated best benchmark score and historical evaluation records, the normalization transformation is re-executed to dynamically calibrate the relative performance scores of other speech understanding models recorded on the evaluation leaderboard under the current evaluation task, and to update the evaluation leaderboard records corresponding to the current evaluation task.
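The dynamic leaderboard calibration of claim 6 might look like the following Python sketch. The data layout (a mapping from model name to raw task performance indicator) and the function name are hypothetical, and the factor of 100 is an illustrative scaling rather than something the claim specifies.

```python
def recalibrate(history: dict, best: float, higher_is_better: bool):
    """Update the best benchmark score and recompute all relative scores.

    history maps model name -> raw task performance indicator, including
    the newly evaluated models. Returns (new_best, relative_scores).
    """
    pick = max if higher_is_better else min
    new_best = pick([best, *history.values()])

    def rel(metric: float) -> float:
        ratio = metric / new_best if higher_is_better else new_best / metric
        return 100.0 * ratio

    return new_best, {name: rel(metric) for name, metric in history.items()}
```

Recomputing every stored model from its raw indicator, rather than rescaling old relative scores, keeps the leaderboard consistent however many times the best benchmark score has been updated.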

7. The method according to claim 2, wherein, The process aggregates the task performance metrics and relative performance scores corresponding to each of the speech understanding models to generate a standardized evaluation report, including: The task performance indicators and relative performance scores corresponding to each of the speech understanding models are grouped and summarized according to at least one grouping dimension; wherein, the grouping dimension includes the speech understanding task type dimension and the real-world stress scenario dimension; the real-world stress scenario dimension includes any one of the following: background noise scenario, far-field reverberation scenario, multi-speaker overlapping conference scenario, dialect variant scenario, Chinese-English mixed code switching scenario, and context hot word scenario. Generate structured evaluation results that include model identifier, task type, scenario identifier, task performance metrics, relative performance score, and model ranking results; A standardized evaluation report is output based on the structured evaluation results.
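A minimal Python sketch of the grouping and ranking in claim 7 follows. The row fields (`model`, `task`, `scenario`, `metric`, `relative_score`) are hypothetical names that mirror the structured evaluation result described in the claim, and ranking by relative score is an assumption made so that indicators of either polarity sort the same way.

```python
from collections import defaultdict


def build_report(rows: list) -> list:
    """Group rows by (task, scenario) and rank models within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["task"], row["scenario"])].append(row)

    report = []
    for key in sorted(groups):
        # A higher relative score is better regardless of indicator polarity.
        ranked = sorted(groups[key], key=lambda r: r["relative_score"], reverse=True)
        for rank, row in enumerate(ranked, 1):
            report.append({**row, "rank": rank})
    return report
```

Each output row carries the model identifier, task type, scenario identifier, task performance indicator, relative performance score, and rank, so the report can be serialized directly.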

8. The method according to claim 1, wherein, After generating a standardized evaluation report, the method further includes performing an automated training process transformation operation, specifically including: Receive controlled training and reproduction requests for the target speech understanding model; Based on the controlled training reproduction request, the intelligent agent module automatically parses the paper description documents and open source code repositories associated with the target speech understanding model to extract the network architecture features, training dataset dependencies and optimizer hyperparameter configurations of the target speech understanding model. Based on the extracted network architecture features, training dataset dependencies, and optimizer hyperparameter configurations, and combined with the preset unified training framework specifications, a standardized unified training configuration file is generated. The training process is verified based on the unified training configuration file, and after the training process verification is successful, an executable training process corresponding to the target speech understanding model is output.

9. The method according to claim 8, wherein the step of performing training process verification based on the unified training configuration file, and outputting an executable training process corresponding to the target speech understanding model after the training process verification is successful, includes: performing at least one of the following static checks: a code dependency resolution check, a parameter consistency check, and a check of the correspondence between the loss function and the evaluation metrics; after the static checks pass, performing an integration test run on a preset small-scale test suite; and after the integration test run passes, outputting the executable training process corresponding to the target speech understanding model.

10. A computer device comprising a memory, a processor, and a computer program stored in the memory, wherein the processor executes the computer program to implement the steps of the method according to any one of claims 1-9.