Method, apparatus and medium for evaluating speech recognition intermediate results

By comparing characters one by one and calculating the degree of fluctuation, the method addresses the distortion that affects the evaluation of intermediate speech recognition results and quantifies the quality of those results, so that the evaluation more accurately reflects the real user experience.

CN122201347A · Pending · Publication Date: 2026-06-12 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
IFLYTEK CO LTD
Filing Date
2026-03-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing methods for evaluating intermediate results in speech recognition cannot accurately measure the quality of these results, leading to distorted evaluation results that fail to reflect the user's true experience.

Method used

The method acquires the transcribed text of the audio to be evaluated and the intermediate speech recognition results, then compares the actual written characters with the transcribed text character by character to determine correctness marks. Intermediate process characters are extracted and their degree of fluctuation is calculated. Finally, the evaluation result is determined from the correctness marks and the degree of fluctuation.

Benefits of technology

It achieves objective and accurate quantification of intermediate results in speech recognition, avoiding the problem of traditional evaluation methods that only focus on smoothness while ignoring accuracy and stability, and significantly improves the accuracy of evaluation results in reflecting real user experience.



Abstract

The present application relates to the technical field of speech recognition, and provides a speech recognition intermediate result evaluation method, device, equipment and medium, the method comprising: obtaining a transcription text of an audio to be evaluated, and a speech recognition intermediate result output in a recognition process of the audio to be evaluated; comparing actual writing characters in the speech recognition intermediate result with the transcription text to determine correctness marks of the actual writing characters; determining intermediate process characters from the actual writing characters, and obtaining correctness marks corresponding to the intermediate process characters; calculating fluctuation degrees of the intermediate process characters in an output process according to the correctness marks corresponding to the intermediate process characters; and determining an evaluation result of the speech recognition intermediate result based on the correctness marks of the actual writing characters and the fluctuation degrees. The present application introduces fluctuation degree calculation for intermediate process characters and combines content correctness marks in evaluation, thereby avoiding index distortion problems caused by multiple word omissions or repeated modification flickering.

Description

Technical Field

[0001] This invention relates to the field of speech recognition technology, and in particular to a method, apparatus, device, and medium for evaluating intermediate results of speech recognition.

Background Technology

[0002] When speech recognition decodes in real time, intermediate results are displayed on the screen. Objectively measuring the quality of this display is key to improving the interactive experience.

[0003] Currently, all intermediate results are typically recorded, and the quality of the displayed text is measured by calculating metrics such as overall smoothness. The closer the smoothness is to 1, the higher the quality is considered. However, the smoothness metric mainly focuses on the continuity of the text output in terms of the rhythm of increasing word count. Even if the displayed text exhibits repeated modifications and flickering phenomena such as "outputting typos first and then replacing and correcting them," the smoothness metric will still judge the process as extremely smooth and give it a perfect score because the characters on the screen are still being output continuously and coherently. This leads to a serious distortion of the evaluation results.

Summary of the Invention

[0004] This invention provides a method, apparatus, device, and medium for evaluating intermediate results of speech recognition, in order to address the deficiencies in the prior art.

[0005] This invention provides a method for evaluating intermediate results of speech recognition, comprising the following steps: obtaining the transcribed text of the audio to be evaluated, and the intermediate speech recognition results output during recognition of that audio; comparing the actual written characters in the intermediate speech recognition results with the transcribed text to determine the correctness marks of the actual written characters; determining intermediate process characters from the actual written characters, and obtaining the correctness marks corresponding to the intermediate process characters; calculating, based on those correctness marks, the degree of fluctuation of the intermediate process characters during output; and determining the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation.

[0006] According to the method for evaluating intermediate results of speech recognition provided by the present invention, calculating the degree of fluctuation of the intermediate process characters during output based on their correctness marks includes: determining the process accuracy based on the correctness marks corresponding to all intermediate process characters; calculating the degree of difference between the correctness mark of each intermediate process character and the process accuracy; and obtaining the degree of fluctuation by combining the degrees of difference corresponding to all intermediate process characters.

[0007] According to the method for evaluating intermediate results of speech recognition provided by the present invention, obtaining the degree of fluctuation by combining the degrees of difference corresponding to all intermediate process characters includes: squaring the degree of difference corresponding to each intermediate process character to obtain its squared difference; and obtaining the degree of fluctuation by summing the squared differences of all intermediate process characters and combining this sum with the first character count, that is, the number of intermediate process characters.

[0008] According to the method for evaluating intermediate results of speech recognition provided by the present invention, determining the process accuracy based on the correctness marks corresponding to all intermediate process characters includes: counting the intermediate process characters to obtain a first character count; if the first character count is zero, using a preset value as the process accuracy; and if the first character count is greater than zero, counting the intermediate process characters whose correctness mark indicates a correct character to obtain a second character count, and taking the ratio of the second character count to the first character count as the process accuracy.
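The process accuracy and degree of fluctuation described in [0006] to [0008] can be sketched as follows. This is only an illustration under stated assumptions: marks are encoded as 1 (correct) and 0 (wrong), the preset value defaults to 1.0, "combining the sum with the first character count" is read as dividing by that count, and an empty set of intermediate characters is given a fluctuation of 0.0. None of these choices is fixed by the text, and the function names are hypothetical.

```python
from typing import List


def process_accuracy(middle_marks: List[int], preset: float = 1.0) -> float:
    """Share of intermediate process characters marked correct.

    `preset` is returned when there are no intermediate characters; the
    default of 1.0 is an assumption, the text only says "preset value".
    """
    first_count = len(middle_marks)       # first character count
    if first_count == 0:
        return preset
    second_count = sum(middle_marks)      # characters marked correct
    return second_count / first_count


def fluctuation_degree(middle_marks: List[int], preset: float = 1.0) -> float:
    """Mean squared difference between each mark and the process accuracy.

    Assumption: the sum of squared differences is divided by the first
    character count, which makes this a population variance.
    """
    if not middle_marks:
        return 0.0                        # assumption: nothing to fluctuate
    acc = process_accuracy(middle_marks, preset)
    squared = [(m - acc) ** 2 for m in middle_marks]
    return sum(squared) / len(middle_marks)


print(fluctuation_degree([1, 1, 1, 1]))   # every intermediate character correct
print(fluctuation_degree([1, 0, 1, 0]))   # half of them wrong
```

Under this reading, and with binary marks, the value equals accuracy × (1 − accuracy), so it is 0 when all intermediate characters agree and peaks at 0.25 when half of them are wrong.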

[0009] According to the method for evaluating intermediate results of speech recognition provided by the present invention, determining the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation includes: determining, based on the degree of fluctuation, a stability coefficient used to apply a score penalty to the intermediate process characters; determining the stage accuracy based on the correctness marks of the actual written characters; using the stability coefficient to numerically adjust the stage accuracy, and calculating the effective compliance rate in combination with the weight allocation value corresponding to the stage accuracy; and determining the evaluation result based on the effective compliance rate.

[0010] According to the method for evaluating intermediate results of speech recognition provided by the present invention, determining intermediate process characters from the actual written characters includes: determining the first character at the start position and the last character at the end position of the actual written characters; and determining the remaining characters, excluding the first and last characters, as the intermediate process characters.

[0011] According to the method for evaluating intermediate results of speech recognition provided by the present invention, the stage accuracy includes the head accuracy, the process accuracy, and the tail accuracy. Adjusting the stage accuracy using the stability coefficient and calculating the effective compliance rate in combination with the weight allocation value corresponding to the stage accuracy includes: multiplying the process accuracy by the stability coefficient and combining the product with the process weight to obtain the process weighted score; calculating the head weighted score from the head accuracy and the head weight; calculating the tail weighted score from the tail accuracy and the tail weight; and summing the process weighted score, the head weighted score, and the tail weighted score to obtain the effective compliance rate.
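A minimal sketch of the weighted combination in [0011] follows. It assumes that "combining with the weight" means multiplication, and the default weights (0.25, 0.5, 0.25) are hypothetical placeholders; the text does not give concrete weight values.

```python
def effective_compliance_rate(
    head_acc: float,
    process_acc: float,
    tail_acc: float,
    stability: float,
    w_head: float = 0.25,      # hypothetical head weight
    w_process: float = 0.5,    # hypothetical process weight
    w_tail: float = 0.25,      # hypothetical tail weight
) -> float:
    """Sum of the head, process, and tail weighted scores.

    Only the process accuracy is scaled by the stability coefficient,
    so flicker in the middle of the text lowers the result.
    """
    process_score = process_acc * stability * w_process
    head_score = head_acc * w_head
    tail_score = tail_acc * w_tail
    return process_score + head_score + tail_score


# Correct head and tail, half-correct middle, stability coefficient 0.5.
print(effective_compliance_rate(1.0, 0.5, 1.0, 0.5))
```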

[0012] According to the method for evaluating intermediate results of speech recognition provided by the present invention, determining, based on the degree of fluctuation, a stability coefficient for penalizing the intermediate process characters includes: obtaining the fluctuation tolerance attribute of the application scenario corresponding to the audio to be evaluated; determining a penalty adjustment coefficient based on the fluctuation tolerance attribute; weighting the degree of fluctuation with the penalty adjustment coefficient to obtain the target degree of fluctuation; and determining the stability coefficient based on the target degree of fluctuation.
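The text does not specify how the tolerance attribute maps to a penalty adjustment coefficient, or how the target degree of fluctuation maps to a stability coefficient. The sketch below fills both gaps with assumptions chosen purely for illustration: a hypothetical lookup table of tolerance levels, multiplication as the weighting step, and a linear penalty clipped at zero.

```python
# Hypothetical mapping from a scenario's fluctuation tolerance to a penalty
# adjustment coefficient; the level names and values are assumptions.
PENALTY_BY_TOLERANCE = {"low": 2.0, "medium": 1.0, "high": 0.5}


def stability_coefficient(fluctuation: float, tolerance: str = "medium") -> float:
    """Turn a degree of fluctuation into a penalty factor in [0, 1].

    Assumptions: the target degree of fluctuation is penalty * fluctuation,
    and the stability coefficient is 1 minus that value, floored at 0.
    """
    penalty = PENALTY_BY_TOLERANCE[tolerance]
    target_fluctuation = penalty * fluctuation
    return max(0.0, 1.0 - target_fluctuation)


print(stability_coefficient(0.25, "low"))    # strict scenario
print(stability_coefficient(0.25, "high"))   # lenient scenario
```

With these assumed values, the same degree of fluctuation is penalized more heavily in a scenario with low tolerance than in one with high tolerance, which is the behavior the paragraph describes.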

[0013] According to the method for evaluating intermediate results of speech recognition provided by the present invention, determining the evaluation result based on the effective compliance rate includes: determining a third character count, the number of characters that should be written for the transcribed text, and a fourth character count, the number of characters actually written; taking the ratio of the fourth character count to the third character count as the basic completion rate, which reflects how completely the written output covers the text; and obtaining the evaluation result as a weighted sum of the basic completion rate and the effective compliance rate.
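The final combination in [0013] can be sketched as below. The equal weights and the handling of an empty transcribed text are assumptions, and the ratio is left uncapped because the text does not say how output longer than the transcribed text should be treated.

```python
def basic_completion_rate(third_count: int, fourth_count: int) -> float:
    """Ratio of characters actually written to characters that should be written."""
    if third_count == 0:
        return 0.0  # assumption: an empty transcribed text yields no coverage
    return fourth_count / third_count


def evaluation_result(
    completion: float,
    effective: float,
    w_completion: float = 0.5,   # hypothetical weight
    w_effective: float = 0.5,    # hypothetical weight
) -> float:
    """Weighted sum of the basic completion rate and the effective compliance rate."""
    return w_completion * completion + w_effective * effective


completion = basic_completion_rate(third_count=8, fourth_count=6)
print(evaluation_result(completion, effective=0.625))
```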

[0014] This invention also provides an apparatus for evaluating intermediate results of speech recognition, comprising the following modules: an acquisition module, used to acquire the transcribed text of the audio to be evaluated, as well as the intermediate speech recognition results output during recognition of the audio to be evaluated; a comparison module, used to compare the actual written characters in the intermediate speech recognition results with the transcribed text to determine the correctness marks of the actual written characters; a determination module, used to determine intermediate process characters from the actual written characters and obtain the correctness marks corresponding to the intermediate process characters; a calculation module, used to calculate the degree of fluctuation of the intermediate process characters during output based on the correctness marks corresponding to the intermediate process characters; and an evaluation module, used to determine the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation.

[0015] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement an evaluation method for intermediate results of speech recognition as described above.

[0016] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the evaluation method for intermediate results of speech recognition as described above.

[0017] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements an evaluation method for intermediate results of speech recognition as described above.

[0018] The present invention provides a method, apparatus, device, and medium for evaluating intermediate results of speech recognition. It acquires the transcribed text of the audio to be evaluated and the intermediate results of speech recognition, and compares them character by character to determine the correctness marks of the actual written characters. It then extracts the intermediate process characters and calculates their degree of fluctuation during output. Finally, it combines the correctness marks and the degree of fluctuation to determine the evaluation result, achieving objective and accurate quantification of the real-time decoding output quality of speech recognition. By introducing the degree of fluctuation of intermediate process characters and combining it with content correctness marks in the evaluation, it overcomes the limitation of traditional tests that focus only on smoothness while ignoring accuracy and stability. It avoids the indicator distortion caused by coexisting extra and missed writes or by flickering from repeated modifications, significantly improving how accurately the evaluation results reflect the real user experience.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0020] Figure 1 is a flowchart illustrating the evaluation method for intermediate results of speech recognition provided by the present invention.

[0021] Figure 2 is a schematic diagram of the overall process for evaluating the quality of intermediate results of speech recognition provided by the present invention.

[0022] Figure 3 is a schematic diagram of the stage accuracy calculation process provided by the present invention.

[0023] Figure 4 is a schematic diagram of the comprehensive calculation process for the evaluation results provided by the present invention.

[0024] Figure 5 is a schematic diagram of the structure of the speech recognition intermediate result evaluation device provided by the present invention.

[0025] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0027] During the real-time decoding process of speech recognition, intermediate results (i.e., text display results) are usually displayed to the user in real time, allowing the user to perceive that subtitles are constantly popping up during the conversation, thus making the real-time experience feel faster.

[0028] Currently, the main testing method for measuring the quality of the text-writing results presented to users is as follows: First, all text-writing results are recorded; then, three evaluation indicators are calculated: overall smoothness, overall weighted effect, and similarity between adjacent results. In this scheme, the closer the overall smoothness is to 1, and the higher the weighted effect and the similarity between adjacent results, the higher the overall quality of the intermediate results.

[0029] However, the current metrics and testing methods for measuring written-text results have obvious flaws. First, actual output often contains extra writes and missed writes, which inflate overall smoothness and other metrics so that they fail to reflect the actual on-screen text. Second, intermediate results are output mainly to improve the user's real-time experience and make it feel smoother, yet the current testing scheme includes no metric for the correctness of the written text.

[0030] For example, the user's conversation content is "What's the weather like today", and the intermediate results given by the recognizer are output in sequence as: "Jin Tian" → "Today" → "Today Tian" → "Today Tian Qi" → "Today Tian Qi Zen" → "Today Tian Qi Zen Me" → "Today Tian Qi Zen Me Yang". According to the current measurement method, since the number of characters increases steadily, the overall smoothness index is 100%, and the system concludes that the user's perception is good. However, what the user actually perceives is chaotic, including two consecutive modifications from "Jin Tian" to "Today" and the sudden error "Qi" in the intermediate results, so the actual experience is extremely poor.

[0031] In response, the present invention provides an evaluation method for intermediate results of speech recognition. The method compares the actual written characters in the intermediate results with the transcribed text to determine correctness marks, extracts the intermediate process characters from the actual written characters, and calculates their degree of fluctuation during output from their correctness marks. The evaluation result is then determined from the overall correctness marks and the degree of fluctuation. This addresses the lack of accuracy metrics for the written text that users see and the one-sidedness of current tests, quantifies erroneous flicker and fluctuation, and reflects the output quality of the intermediate results and the user experience more faithfully.

[0032] All acquisition of signals, information, or data in the present invention is carried out in compliance with the data protection regulations and policies of the country concerned and with authorization from the owner of the corresponding device.

[0033] Figure 1 is a schematic flowchart of the evaluation method for intermediate results of speech recognition provided by the present invention. As shown in Figure 1, the method includes step 110, step 120, step 130, step 140 and step 150.

[0034] Step 110: Obtain the transcribed text of the audio to be evaluated, and the intermediate results of speech recognition output during the recognition process of the audio to be evaluated.

[0035] Here, the audio to be evaluated can be understood as the voice conversation data whose quality is to be evaluated. The transcribed text refers to the final accurate standard text data obtained after speech recognition of the audio to be evaluated. The intermediate results of speech recognition can be understood as the sequence of recognition content that the speech recognition system progressively outputs to the user for real-time display on the screen during real-time decoding. The intermediate results of speech recognition usually include output texts that are dynamically updated multiple times over the time series.

[0036] As an alternative embodiment, the basic information related to the audio to be evaluated can be collected by the data acquisition module. For example, the transcribed text after the recognition of the audio to be evaluated is collected, as well as all the intermediate results of speech recognition during the entire session. Since the transcribed text and the intermediate results of speech recognition of the audio to be evaluated are obtained, content-level comparison can be performed based on these two.

[0037] Step 120: Compare the actual written characters in the intermediate results of speech recognition with the transcribed text to determine the correctness mark of the actual written characters.

[0038] The traditional solution focuses only on overall smoothness, so when extra writes and missed writes occur together, smoothness and similar indicators can be falsely high and fail to reflect the writing quality. To measure the accuracy of the written content, this embodiment compares the actual written characters with the transcribed text character by character to determine the correctness marks. Abnormal data such as extra writes or wrong characters can thus be identified, avoiding the evaluation distortion caused by relying on smoothness alone.

[0039] Specifically, the actual written characters refer to each single character or character sequence that has actually completed the output operation in the intermediate results of speech recognition. The correctness mark is used to represent whether each actual written character is completely correct in terms of semantics and content.

[0040] As an alternative embodiment, the actual written characters in the intermediate results of speech recognition can be compared character by character with the corresponding standard characters in the transcribed text to determine the correctness mark of the actual written characters. For example, for each actual written character, if it is consistent with the corresponding character in the transcribed text, its correctness mark is determined to be in the correct state; if not, its correctness mark is determined to be in the wrong state.

[0041] For example, the transcribed text corresponding to the user's conversation content is "What's the weather like today", and the actual written characters in an intermediate result of speech recognition given by the recognizer are "Today's Tianqi". By comparing this actual written character with the transcribed text character by character, it can be determined that the first three characters "Jin", "Tian", "Tian" are all recognized correctly, and their correctness marks are in the correct state, while the fourth character "Qi" is recognized incorrectly, and its correctness mark is in the wrong state.
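The character-by-character comparison of step 120 can be sketched as follows. The sketch assumes a plain index-by-index comparison, encodes the correct state as 1 and the wrong state as 0, and marks any character written beyond the end of the transcribed text as wrong; the text does not prescribe an alignment strategy, and the sample strings are hypothetical.

```python
from typing import List


def correctness_marks(actual: str, reference: str) -> List[int]:
    """Return one mark per written character: 1 if it equals the reference
    character at the same index, otherwise 0.

    Assumption: characters written past the end of the reference are wrong.
    """
    marks = []
    for i, ch in enumerate(actual):
        expected = reference[i] if i < len(reference) else None
        marks.append(1 if ch == expected else 0)
    return marks


# Hypothetical data: the first three characters match, the fourth does not.
print(correctness_marks("abcx", "abcdefg"))  # [1, 1, 1, 0]
```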

[0042] Step 130: Determine the intermediate process characters from the actual written characters and obtain the correctness marks corresponding to the intermediate process characters.

[0043] Specifically, intermediate process characters can be understood as the internal characters that remain after specific edge characters are removed from the actual written characters. In actual speech, the beginning and end are usually used to set the tone and close the semantic loop, while the middle part is often the main carrier of core information. The middle part contains most of the characters and is the part most easily modified dynamically during real-time decoding.

[0044] As an alternative implementation, characters can be determined as intermediate process characters based on their position within the actual character sequence being written. For example, the first and last characters of the actual writing sequence can be extracted, and all remaining characters in between can be identified as intermediate process characters.

[0045] After the intermediate process characters are extracted, the correctness marks corresponding to each of them are obtained at the same time. The initial character usually sets the tone of a statement and has a relatively simple state, and the final character closes the statement, whereas the intermediate process characters undergo the most frequent alternation and modification during continuous streaming output. For this reason, this embodiment extracts the intermediate process characters from the actual written characters and analyzes them separately, which locates flickering in the recognition process more accurately.
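The position-based split of step 130 can be illustrated with a short sketch that operates directly on the per-character correctness marks. Treating a single-character sequence as head only is an assumption, since the text does not cover that edge case, and the function name is hypothetical.

```python
from typing import List, Tuple


def split_stages(marks: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """Split per-character marks into (head, middle, tail).

    Head is the first character, tail is the last character, and the middle
    holds the intermediate process characters in between.
    Assumption: a single character counts as head only.
    """
    if not marks:
        return [], [], []
    if len(marks) == 1:
        return marks[:1], [], []
    return marks[:1], marks[1:-1], marks[-1:]


head, middle, tail = split_stages([1, 0, 1, 1])
print(head, middle, tail)  # [1] [0, 1] [1]
```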

[0046] Step 140: Calculate the degree of fluctuation of the intermediate process characters in the output process based on the correctness markers corresponding to the intermediate process characters.

[0047] Specifically, the degree of fluctuation refers to how severely the actual written characters alternate between correct and incorrect, or are repeatedly modified, during dynamic display. The degree of fluctuation reflects whether the writing quality remains consistently stable. Quantifying it keeps such swings in quality from going unnoticed in the evaluation. For example, if some intermediate process characters frequently jump between correct and incorrect states across different intermediate results of speech recognition, this indicates a high degree of fluctuation.

[0048] As an optional implementation, the degree of fluctuation of intermediate process characters during the output process can be calculated based on the statistical differences in the correctness markers corresponding to the intermediate process characters. For example, the overall accuracy of all intermediate process characters can be calculated first, and then the difference between the correctness marker of each intermediate process character and the overall accuracy can be compared. The degree of fluctuation can be quantified by summing the squares of the differences. The larger the value of the degree of fluctuation, the more violently the intermediate process characters flicker or are modified on the screen, and the more unstable the writing quality.

[0049] For example, assume that the user's conversation content is "What's the weather like today". The actual written characters output in sequence as intermediate results of speech recognition are "Jin Tian", "Today", "Today Tian", "Today Tian Qi", "Today Weather How", "Today Weather What", "Today Weather What's the weather like". During this process, the character "Qi" at an intermediate position was briefly written incorrectly and corrected later, and "Jin Tian" was modified to "Today" at the start. Based on smoothness alone, the number of characters still increases steadily and the smoothness index can reach full marks, yet the visual experience perceived by the user is poor. In contrast, the degree of fluctuation is sensitive to this flickering. Accordingly, in this embodiment the degree of fluctuation is calculated from the correctness marks of the intermediate process characters, which quantifies the repeated-output shortcoming described above.

[0050] Step 150: Determine the evaluation result of the intermediate results of speech recognition based on the correctness marks of the actual written characters and the degree of fluctuation.

[0051] Specifically, the evaluation result is the comprehensive score or rating data finally used to quantify and feedback the quality of the intermediate result of speech recognition on the screen.

[0052] As an optional embodiment, a recognition accuracy score can be calculated from the correctness marks of the actual written characters, and a corresponding penalty coefficient can be calculated from the degree of fluctuation. The penalty coefficient is then applied to the recognition accuracy score as a deduction, and the two are fused to obtain the evaluation result of the intermediate results of speech recognition.

[0053] Evaluating only whether the final result is correct may ignore dynamic jumps during the output of the intermediate results, so the consistency of the user's real-time experience cannot be guaranteed. Focusing only on how smoothly the number of output characters increases may mask the serious defect of entirely wrong content, so the evaluation deviates from the real experience.

[0054] Based on this, to obtain a measurement that is truthful, efficient, and close to the user's real experience, this embodiment determines the evaluation result of the intermediate results of speech recognition by combining the correctness marks of the actual written characters with the degree of fluctuation. This controls output quality on two levels and avoids inflated and distorted evaluation indicators.

[0055] As an optional embodiment, a stability coefficient for penalizing the evaluation score can be generated from the degree of fluctuation: the greater the degree of fluctuation, the stronger the penalty applied by the stability coefficient. The recognition accuracy score determined from the correctness marks is then fused with the stability coefficient to determine the final evaluation result.

[0056] Figure 2 is a schematic diagram of the overall process for evaluating the quality of intermediate results of speech recognition provided by the present invention. As shown in Figure 2, the overall process is divided into two main stages: basic data acquisition and core parameter calculation. Because the evaluation system needs a comprehensive view of recognition performance on the audio, the system first acquires the transcribed text of the audio to be evaluated, as well as the intermediate speech recognition results output during the recognition process. It then determines the third character count, the number of characters that should be written for the transcribed text, and the fourth character count, the number of characters actually written, and compares the actual written characters in the intermediate speech recognition results with the transcribed text to determine their correctness marks. Subsequently, the system enters the core parameter calculation stage, calculating the basic completion rate, which reflects coverage integrity, and the accuracy at each stage. It also extracts the intermediate process characters, calculates their degree of fluctuation during output, and converts it into a stability coefficient. Finally, the evaluation result of the intermediate speech recognition results is determined based on the correctness marks of the actual written characters and the degree of fluctuation.

[0057] The speech recognition intermediate result evaluation method provided in this embodiment obtains the transcribed text of the audio to be evaluated and the speech recognition intermediate result, and compares them character by character to determine the correctness mark of the actual transcribed characters. Then, it specifically extracts the intermediate process characters to calculate their fluctuation degree in the output process. Finally, it combines the correctness mark and the fluctuation degree to determine the evaluation result, achieving objective and accurate quantification of the real-time decoding output quality of speech recognition. By innovatively introducing the calculation of the fluctuation degree of intermediate process characters and combining the content correctness mark in the evaluation, it overcomes the limitation of traditional tests that only focus on smoothness and ignore accuracy and stability. It avoids the problem of indicator distortion caused by the coexistence of multiple missing characters or repeated modification and flickering, and significantly improves the accuracy of the evaluation results in reflecting the real user experience.

[0058] Based on the above embodiments, the fluctuation degree of the intermediate process characters during the output process is calculated according to the correctness markers corresponding to the intermediate process characters, including: The process accuracy is determined based on the correctness flags corresponding to all intermediate process characters. Calculate the difference between the correctness marker corresponding to each intermediate process character and the process accuracy. The degree of fluctuation is obtained by summing up the differences between all intermediate process characters.

[0059] After obtaining the intermediate process characters and their corresponding correctness markers, it's important to consider that judging accuracy solely by a single calculation might lead to frequent repetitive shifts between correct and incorrect intermediate process characters. This "sometimes right, sometimes wrong" state results in severe visual flickering, and simply looking at the overall accuracy rate cannot reflect this poor user experience. For example, an extremely unstable process and a consistently accurate process might have the same accuracy rate, but the user experience would be vastly different.

[0060] Based on this, this embodiment calculates the fluctuation of intermediate process characters in the output process according to the correctness flags corresponding to the intermediate process characters, and uses this as a basis for quantitatively evaluating its stability, thereby avoiding evaluating such poor experience caused by frequent modifications as high quality.

[0061] As an optional implementation, the process accuracy is first determined based on the correctness markers corresponding to all intermediate process characters. Here, the correctness markers corresponding to all intermediate process characters refer to the set of character judgment results for that stage; the process accuracy is a numerical value reflecting the overall recognition accuracy level of that stage. For example, the process accuracy for that stage can be obtained by statistically analyzing the percentage of characters marked "yes" out of the total number of intermediate process characters.

[0062] After obtaining the process accuracy, considering the need to quantify the degree to which the state of each character deviates from this overall benchmark, the degree of difference between the correctness mark corresponding to each intermediate process character and the process accuracy is further calculated.

[0063] The degree of difference here refers to the gap between the specific performance of each individual character and the average level of its group (i.e., process accuracy). For example, if the calculated process accuracy is 66.7%, the degree of difference for a character marked "yes" (corresponding to 100%) is 33.3%; and the degree of difference for a character marked "no" (corresponding to 0%) is -66.7%.

[0064] After all the degrees of difference have been calculated, the degrees of difference corresponding to all intermediate process characters are combined to obtain the fluctuation degree. As an optional embodiment, the degree of difference corresponding to each intermediate process character can be squared to obtain the corresponding squared difference value. Then, all the squared difference values are summed and divided by the total number of intermediate process characters to obtain the fluctuation degree of that group of intermediate process characters. The larger the fluctuation degree, the more severe the repeated errors in the intermediate output process and the worse the user experience; conversely, a smaller value represents a more stable output.

[0065] When quantifying the dynamic changes in the output state of intermediate process characters, simple linear addition may cause positive and negative differences to cancel each other out, and so fail to reflect the drastic jumps that occur during actual writing. To accurately quantify the fluctuation between correct and incorrect states, this embodiment preferably uses the variance calculation from statistics to amplify and normalize the degrees of difference, and then combines the degrees of difference corresponding to all intermediate process characters to obtain the fluctuation degree.

[0066] Specifically, combining the degrees of difference corresponding to all intermediate process characters to obtain the fluctuation degree includes: squaring the degree of difference corresponding to each intermediate process character to obtain the squared difference value for each intermediate process character; and summing the squared difference values corresponding to all intermediate process characters and combining this sum with the first character count of the intermediate process characters to obtain the fluctuation degree.

[0067] To prevent positive differences (above the process accuracy) from canceling out negative differences (below it) during subsequent merging, and to amplify the impact of extreme fluctuations on overall stability, this embodiment squares the degree of difference for each intermediate process character, obtaining a squared difference value for each intermediate process character. Squaring makes every term non-negative and increases the sensitivity of the metric to anomalous jumps.

[0068] Here, the degree of difference is used to characterize the distance between the current correctness state of a single intermediate character and its overall accuracy distribution. The squared difference value is a numerical representation of this distance after being squared and amplified.

[0069] As an optional embodiment, the numerical value corresponding to the correctness marker of each intermediate process character can be obtained, and the overall process accuracy subtracted from it to obtain the difference. The difference is then squared to obtain the squared difference value corresponding to each intermediate process character.

[0070] After the squared difference value corresponding to each intermediate process character is obtained, two issues remain. An individual squared value cannot reflect the overall stability of the intermediate process characters across the whole audio segment, and audio of different lengths contains different total numbers of characters, so a direct summation would make cross-audio comparison unfair.

[0071] To eliminate the scaling effect of different audio text lengths on the evaluation metric and to make the fluctuation degree a uniform measure, this embodiment sums the squared difference values of all intermediate process characters and combines this sum with the first character count of the intermediate process characters to obtain the fluctuation degree. Using the first character count to compute a mean yields a fluctuation index that reflects the stability of the entire set of intermediate process characters objectively and fairly.

[0072] Here, the first character count refers to the total number of intermediate process characters included in the actual written characters corresponding to the audio being evaluated. The fluctuation degree is the mean-normalized variance value; the larger the value, the more drastic the output fluctuation and the less stable the writing quality.

[0073] As an alternative embodiment, all the calculated squared difference values can be summed, and the sum divided by the first character count to obtain the fluctuation degree.

[0074] For example, suppose the first character count of the intermediate process characters extracted from a speech recognition intermediate result is 3, with corresponding correctness states of correct, correct, and incorrect. The overall process accuracy of these intermediate process characters is approximately 66.7%. The difference between each character's correctness state and this accuracy is then calculated. The correct state corresponds to a value of 100%, and its difference from 66.7% is 33.3%; the incorrect state corresponds to a value of 0, and its difference from 66.7% is -66.7%. The three differences (33.3%, 33.3%, and -66.7%) are squared to obtain three squared difference values of approximately 0.111, 0.111, and 0.444. Finally, these three squared difference values are summed, and the sum is divided by the first character count (3), giving a fluctuation degree of approximately 0.222.
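The variance-style calculation described in [0064] to [0074] can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function name `fluctuation_degree` is hypothetical, correctness markers are assumed to be booleans mapped to 1.0 (correct) and 0.0 (incorrect), and returning 0.0 for an empty input is an assumption that the text does not state.

```python
def fluctuation_degree(marks: list[bool]) -> float:
    """Mean squared deviation of 0/1 correctness markers from the process accuracy."""
    first_count = len(marks)  # the "first character count"
    if first_count == 0:
        # Assumption: with no intermediate process characters there is nothing to fluctuate.
        return 0.0
    values = [1.0 if m else 0.0 for m in marks]
    process_accuracy = sum(values) / first_count
    # Square each degree of difference so positive and negative terms cannot cancel.
    squared_diffs = [(v - process_accuracy) ** 2 for v in values]
    return sum(squared_diffs) / first_count


# Three intermediate characters: correct, correct, incorrect.
print(round(fluctuation_degree([True, True, False]), 3))  # 0.222
```

For 0/1 markers this quantity equals p(1 - p), where p is the process accuracy, so it depends on how many characters are wrong rather than on the order in which they appear.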

[0075] In actual real-time speech recognition decoding, the number of extracted intermediate process characters is often not fixed, because utterance length and conversation content vary, and some extremely short conversations may contain no intermediate process characters at all. To avoid a division-by-zero error when calculating the ratio, and to accurately measure the correctness of the intermediate process characters themselves, in this embodiment determining the process accuracy based on the correctness markers corresponding to all intermediate process characters includes: counting the first character count of the intermediate process characters; if the first character count is zero, using a preset value as the process accuracy; and if the first character count is greater than zero, counting the second character count, that is, the number of intermediate process characters whose correctness marker is correct, and using the ratio of the second character count to the first character count as the process accuracy.

[0076] Specifically, the first character count is used to represent the total number of intermediate process characters remaining after stripping the first and last characters. As an optional embodiment, the intermediate process character sequence obtained can be traversed using character counting rules, and the characters in the sequence can be counted and accumulated to obtain the first character count of the intermediate process characters.

[0077] After obtaining the first character count of the intermediate process characters, considering the extreme cases of the session text length, it is necessary to select the corresponding processing path based on whether the first character count is zero.

[0078] To handle cases where the audio to be evaluated is extremely short, if the first character count is zero, a preset value is used as the process accuracy. This avoids program interruptions or missing scores caused by the absence of intermediate process characters, ensuring that the entire speech recognition intermediate result evaluation process can complete. Here, the preset value refers to a default reference value pre-configured by the evaluation system to represent a perfect score or complete accuracy.

[0079] As an optional embodiment, if the first character count is determined to be zero, the actual written text contains only the first and last characters, and intermediate characters are completely absent. In this case, the evaluation system directly triggers the default assignment logic and uses the preset value (such as 100%) as the process accuracy of the speech recognition intermediate result.

[0080] Meanwhile, to accurately quantify the correctness of the core information carrier in the speech recognition process, if the first character count is greater than zero, the second character count (the number of intermediate process characters marked as correct) is counted, and the ratio of the second character count to the first character count is used as the process accuracy, so as to directly reflect how correctly the intermediate process characters were recognized.

[0081] Here, "correct" means that the intermediate process characters output completely match the standard transcribed text. The second character count represents the number of characters correctly identified across all intermediate process characters. Process accuracy is the core evaluation parameter used to quantify the overall correctness of the intermediate process characters.

[0082] As an optional implementation, if the first character count is determined to be greater than zero, the evaluation system extracts the correctness markers corresponding to all intermediate process characters one by one and counts the second character count, that is, the number of characters whose correctness marker is in the correct state. The second character count is then divided by the first character count, and the ratio is converted into a percentage as the final process accuracy.

[0083] For example, suppose that in one conversation in the audio to be evaluated, the first character count of the extracted intermediate process characters is 13. After character-by-character correctness marking and counting, 12 of the intermediate process characters are marked as correct, meaning the second character count is 12. Dividing the second character count (12) by the first character count (13) yields approximately 92.3%. The evaluation system can then use 92.3% as the process accuracy of these intermediate process characters.
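The branching in [0075] to [0083] can be illustrated with a short Python sketch. The function name `process_accuracy` and the `preset` parameter are hypothetical, and the default of 1.0 follows the "such as 100%" example in [0079].

```python
def process_accuracy(marks: list[bool], preset: float = 1.0) -> float:
    """Ratio of correctly marked intermediate characters, with a zero-count fallback."""
    first_count = len(marks)  # total intermediate process characters
    if first_count == 0:
        # No intermediate characters: use the preset value instead of dividing by zero.
        return preset
    second_count = sum(1 for m in marks if m)  # characters marked correct
    return second_count / first_count


# 13 intermediate characters, 12 of them marked correct.
marks = [True] * 12 + [False]
print(round(process_accuracy(marks), 3))  # 0.923
print(process_accuracy([]))               # 1.0
```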

[0084] Traditional evaluation schemes focus solely on output fluency while neglecting dynamic fluctuations in real-time accuracy, leading to significant discrepancies between evaluation results and actual user experience. To reflect the impact of stability on writing quality and to accurately quantify the accuracy at each stage, in this embodiment determining the evaluation result of the speech recognition intermediate result based on the correctness markers of the actual written characters and the fluctuation degree includes: determining, based on the fluctuation degree, a stability coefficient used to apply a score penalty to the intermediate process characters; determining the stage accuracy based on the correctness markers of the actual written characters; adjusting the stage accuracy numerically using the stability coefficient, and calculating the effective compliance rate in combination with the weight allocation value corresponding to the stage accuracy; and determining the evaluation result based on the effective compliance rate.

[0085] Specifically, the degree of fluctuation refers to the intensity of the alternation between correct and incorrect characters during the dynamic display of intermediate process characters. The stability coefficient is an adjustment factor derived from the degree of fluctuation, used to discount or reduce the accuracy score of intermediate process characters.

[0086] The greater the fluctuation degree, the more violently the intermediate characters flicker and change during output, the smaller the stability coefficient, and the stronger the penalty; conversely, the smaller the fluctuation degree, the more smoothly the intermediate characters are output, the larger the stability coefficient, and the weaker the penalty. The stability coefficient can be calculated using the following formula: Stability coefficient = 1 / (1 + fluctuation degree). According to this formula, when the fluctuation degree is 0, the output of the intermediate process characters is completely stable, the stability coefficient is 1, and there is no penalty; the greater the fluctuation degree, the larger the denominator, the closer the stability coefficient is to 0, and the stronger the penalty.
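The formula in [0086] translates directly into code. In the sketch below, the function name `stability_coefficient` is hypothetical and rejecting negative inputs is an added assumption; the input 2/9 (about 0.222) is the fluctuation degree that results from the three-character example in [0074].

```python
def stability_coefficient(fluctuation: float) -> float:
    """Stability coefficient = 1 / (1 + fluctuation degree)."""
    if fluctuation < 0:
        # Assumption: a mean of squared differences can never be negative.
        raise ValueError("fluctuation degree must be non-negative")
    return 1.0 / (1.0 + fluctuation)


print(stability_coefficient(0.0))              # 1.0 (fully stable, no penalty)
print(round(stability_coefficient(2 / 9), 3))  # 0.818
```

Because 0/1 correctness markers have a variance of at most 0.25, this formula keeps the coefficient between 0.8 and 1 when the fluctuation degree is used unscaled.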

[0087] After obtaining the stable coefficients used to penalize intermediate process characters, and considering the need for quantitative evaluation of the writing results from the perspective of recognition correctness, this embodiment determines the stage accuracy rate based on the correctness markers of the actual written characters to differentiate the performance of characters at different positions. Here, stage accuracy rate refers to the proportion of correctly recognized character content at each different position after segmenting the actual written characters according to their sentence position.

[0088] As an optional implementation, the actual characters written can be divided into the first character at the beginning position, the last character at the end position, and the intermediate characters in the middle, and the percentage of each character marked as correct can be counted. For example, the accuracy rate of the first character, the accuracy rate of the intermediate characters, and the accuracy rate of the last character can be counted and calculated separately, and the set of these three can be used as the stage accuracy rate.

[0089] After obtaining the stage accuracy, considering that the characters at different stages have different importance for semantic closure and core information carrying, and that the characters in the intermediate process have dynamic flickering problems, this embodiment uses a stability coefficient to adjust the stage accuracy numerically and calculates the effective compliance rate by combining the weight allocation value corresponding to the stage accuracy.

[0090] Here, the weight allocation value refers to a pre-set proportional value used to measure the importance of characters at different stages. The effective achievement rate refers to a comprehensive quality indicator that integrates accuracy, stability, and the importance of different stages. This weight allocation value can be pre-assigned as a fixed percentage to the initial character, intermediate characters, and final character by obtaining the role of characters at each stage in the overall semantic loop and core information delivery of the sentence, or by conducting statistical analysis based on a large amount of historical evaluation data.

[0091] As an optional implementation, the accuracy of each stage can be multiplied by the corresponding stability coefficient for numerical penalty adjustment, and then multiplied by the corresponding weight allocation value of that stage to obtain the weighted score of each stage. Finally, the weighted scores of each stage are added together to obtain the effective compliance rate.

[0092] For example, based on text features, the weight allocation value is preset to 25% for the first character, 45% for the intermediate characters, and 30% for the last character. Assuming the process accuracy for a certain audio segment is 66.7% and the corresponding stability coefficient is calculated to be 0.818, the accuracy is adjusted using the stability coefficient and multiplied by the weight allocation value of the intermediate characters (66.7% × 0.818 × 45%) to obtain a weighted score for the intermediate characters of approximately 24.6%. Similarly, the weighted scores for the first and last characters are calculated. Finally, the weighted scores for each stage are summed to obtain the effective compliance rate.

[0093] After obtaining the effective compliance rate, considering that the effective compliance rate only reflects the correctness and stability of the content, and it is also necessary to examine whether the system has completely completed the content writing operation, this embodiment determines the evaluation result based on the effective compliance rate. Here, the evaluation result is the final comprehensive score value used to measure the audio writing quality.

[0094] As an optional implementation, the basic completion rate can be obtained by first calculating the ratio of the actual number of characters written to the number of characters to be written, and then the basic completion rate and the effective compliance rate can be weighted and summed to determine the final evaluation result.

[0095] For example, set the weight of the basic completion rate to 40% and the weight of the effective compliance rate to 60%. If the basic completion rate in a certain audio test is 100% and the effective compliance rate calculated from the previous steps is 79.7%, then the evaluation result is 100%×40% + 79.7%×60% = 40% + 47.82%, which is approximately 87.8%.
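The weighted sum in [0094] and [0095] can be sketched as follows. The function name `evaluation_result` and its parameter names are hypothetical, the 40%/60% defaults are the example weights from [0095], and raising an error when the expected character count is zero is an assumption the text does not address.

```python
def evaluation_result(
    written_count: int,
    expected_count: int,
    effective_rate: float,
    w_completion: float = 0.4,
    w_effective: float = 0.6,
) -> float:
    """Weighted sum of the basic completion rate and the effective compliance rate."""
    if expected_count <= 0:
        # Assumption: an empty reference transcript cannot be scored.
        raise ValueError("expected_count must be positive")
    completion_rate = written_count / expected_count
    return completion_rate * w_completion + effective_rate * w_effective


# Completion rate 100%, effective compliance rate 79.7%.
print(round(evaluation_result(7, 7, 0.797), 3))  # 0.878
```

With these inputs the two terms are 0.4 and 0.4782, so the result is about 87.8%.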

[0096] The start and end of a sentence usually determine its overall theme and semantic closure, and during output they usually have only two fixed states, correct or incorrect, with little opportunity for continuous repeated modification. The middle part, by contrast, carries the core information and is also where dynamic flickering occurs most. To accurately identify the characters that should be penalized for fluctuation, in this embodiment determining the intermediate process characters from the actual written characters includes: determining the first character at the starting position and the last character at the ending position in the actual written characters; and determining the remaining characters in the actual written characters, other than the first character and the last character, as the intermediate process characters.

[0097] Specifically, the first character refers to a starting character that ranks first in the complete sequence of actual written characters. The last character refers to a finalized character that ranks last in the sequence of actual written characters.

[0098] As an optional embodiment, the directional extraction operation can be performed by parsing the position indexes of the character sequence. For example, obtain the total length of the actual written characters, directly extract the single character at the starting position index as the first character, and extract the single character at the ending position index as the last character. In the audio writing scenario described above, assuming that the final output sequence of actual written characters is "今天天气怎么样" ("What's the weather like today"), position determination yields "今" as the first character and "样" as the last character.

[0099] After determining the first character at the starting position and the last character at the ending position in the actual written characters, considering that the number of characters in the middle part is relatively large, which is the core information segment that users perceive in real time during the conversation and is also the area where the error and correction alternation phenomenon is most likely to occur during real-time decoding. In order to obtain the purest dynamic monitoring samples, in this embodiment, the remaining characters in the actual written characters except the first character and the last character are determined as the intermediate process characters, so as to accurately strip out the character set that will actually flicker and be repeatedly replaced through the filtering logic of removing the head and tail.

[0100] Here, the remaining characters refer to all the internal character sets remaining after explicitly removing the first character and the last character in the complete actual writing character sequence. The intermediate process characters are the core text segments extracted as described above and specifically used to measure the dynamic output stability.

[0101] As an optional embodiment, this can be achieved by performing position-based truncation or sequence difference-set extraction on the actual written character sequence. For example, for the entire text string composed of the actual written characters, directly cut off the identified leading character and trailing character, and take all the remaining consecutive characters in the middle as the intermediate process characters. Continuing with the previous conversation example, the actual written characters are "今天天气怎么样" ("What's the weather like today"). After removing the leading character "今" and the trailing character "样", the remaining character segment "天天气怎么" is determined as the intermediate process characters.
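The head/tail stripping described in [0096] to [0101] amounts to simple slicing. In this sketch the function name `split_written` is hypothetical, and the handling of empty and single-character input is an assumption, since the text only covers sequences that have both a first and a last character.

```python
def split_written(chars: str) -> tuple[str, str, str]:
    """Return (first character, intermediate process characters, last character)."""
    if not chars:
        return "", "", ""  # assumption: nothing was written
    if len(chars) == 1:
        # Assumption: a lone character counts as the first character only.
        return chars, "", ""
    return chars[0], chars[1:-1], chars[-1]


first, middle, last = split_written("今天天气怎么样")
print(first, middle, last)  # 今 天天气怎么 样
print(len(middle))          # 5, the first character count
```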

[0102] Figure 3 is a schematic diagram of the stage accuracy calculation process provided by the present invention. As shown in Figure 3, when calculating the stage accuracy, the leading accuracy is obtained by directly judging the correctness of the leading character: if it is correct, the accuracy is 100%, and if it is incorrect, the accuracy is 0. For the process accuracy, the number of intermediate process characters marked as correct is counted and divided by the number of intermediate process characters. For the trailing accuracy, the correctness of the trailing character is judged directly: if it is correct, the accuracy is 100%, and if it is incorrect, the accuracy is 0.
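The three stage accuracies described for Figure 3 can be sketched as one function over the correctness markers of all actual written characters. The name `stage_accuracies` is hypothetical; the sketch assumes the markers are ordered by character position, that at least one character was written, and that the preset value of 100% applies when there are no intermediate characters.

```python
def stage_accuracies(marks: list[bool], preset: float = 1.0) -> tuple[float, float, float]:
    """Return (leading accuracy, process accuracy, trailing accuracy)."""
    if not marks:
        # Assumption: an empty output has no stages to score.
        raise ValueError("at least one written character is required")
    leading = 1.0 if marks[0] else 0.0
    trailing = 1.0 if marks[-1] else 0.0
    middle = marks[1:-1]  # intermediate process characters
    # True counts as 1 in sum(), so this is correct count / total count.
    process = sum(middle) / len(middle) if middle else preset
    return leading, process, trailing


leading, process, trailing = stage_accuracies([True, True, True, False, True])
print(leading, round(process, 3), trailing)  # 1.0 0.667 1.0
```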

[0103] Considering that in the text sequence output by speech recognition, the impacts of characters at different positions on the user experience are significantly different. The leading and trailing parts usually determine the theme setting and semantic closed-loop, and they usually only have the right or wrong status of one recognition and do not have the physical conditions for continuous fluctuations; while the characters in the middle part are the core information carriers and are also the areas where dynamic flickering and repeated modification phenomena occur. In order to accurately distinguish this differential performance and specifically implement targeted penalties for the truly fluctuating middle part, the stage accuracy in this embodiment includes the leading accuracy, the process accuracy, and the trailing accuracy.

[0104] Specifically, use the stability coefficient to numerically adjust the stage accuracy, and calculate the effective compliance rate in combination with the weight assignment value corresponding to the stage accuracy, including: Multiply the process accuracy by the stability coefficient and combine it with the process weight to obtain the process weighted score; Calculate the leading weighted score according to the leading accuracy and the leading weight; Calculate the trailing weighted score according to the trailing accuracy and the trailing weight; The effective pass rate is obtained by summing the process-weighted score, the head-weighted score, and the tail-weighted score.

[0105] Considering that intermediate characters are most prone to repeated jumps between real-time accuracy and error during real-time decoding, the stability coefficient, which measures such jumps, is directly applied as a penalty factor to the accuracy. This accurately quantifies the negative experience caused by fluctuations. Combined with its own weighting ratio, it can truly reflect the actual effective quality of the core intermediate content. Based on this, this embodiment multiplies the process accuracy by the stability coefficient and combines it with the process weight to obtain a process-weighted score.

[0106] Here, process accuracy refers to the proportion of correctly identified intermediate process characters out of their total number; stability coefficient refers to the value calculated based on the degree of fluctuation in the aforementioned steps, used to penalize the score by reducing it; process weight refers to the pre-set proportion used to measure the importance of intermediate process characters in the whole sentence; process weighted score refers to the actual quality score contributed by intermediate process characters after fluctuation penalty and weight adjustment.

[0107] As an optional implementation, the process accuracy, stability coefficient, and process weight can be directly multiplied to obtain the process-weighted score. For example, for a certain evaluation audio, the process accuracy is approximately 66.7%, the stability coefficient calculated from its fluctuation degree is approximately 0.818, and the system's preset process weight is 45%. Multiplying these three values together gives a process-weighted score of approximately 24.6%.

[0108] After the process-weighted score is obtained, the first character is scored separately. Because the first character usually has only two states, correct or incorrect, and does not fluctuate continuously, calculating its score directly from its correctness and its corresponding weight can effectively avoid interference from initial thematic deviations in the audio transcription content with the overall evaluation. Based on this, this embodiment calculates the first-character weighted score according to the first-character accuracy and the first-character weight. Here, the first-character accuracy refers to the numerical value of the state in which the first character is correctly identified, usually 100% or 0; the first-character weight refers to the proportion of importance pre-assigned to the first character; and the first-character weighted score refers to the actual quality score contributed by the first character.

[0109] As an optional embodiment, the first-character accuracy can be directly multiplied by the first-character weight to obtain the first-character weighted score. For example, after character-by-character verification, if the first character of the audio is determined to be completely correct, i.e., the first-character accuracy is 100%, and the system's preset first-character weight is 25%, then multiplying the two yields a first-character weighted score of 25%.

[0110] Furthermore, considering that the final character is crucial to the completeness of the semantic loop of a sentence, assigning a weight to the final character and calculating its score independently can effectively avoid the problem of missing semantic loops in the audio transcription content and ensure the completeness of the final recognition result. Based on this, this embodiment calculates the tail-weighted score according to the tail accuracy and tail weight.

[0111] Here, tail accuracy refers to the numerical value indicating that the tail character is correctly identified; tail weight refers to the proportion of importance pre-assigned to the tail character; and tail weighted score refers to the actual quality score contributed by the tail character.

[0112] As an optional embodiment, the tail-weighted score is also obtained by multiplying the tail accuracy by the tail weight. For example, if the tail character recognition of the audio is correct and the tail accuracy is 100%, and the system's preset tail weight is 30%, multiplying these two results in a tail-weighted score of 30%.

[0113] After the process-weighted score, the head-weighted (first-character) score, and the tail-weighted score have been calculated, a single global indicator is needed to reflect the combined performance of the three stages. This embodiment therefore sums the process-weighted score, the head-weighted score, and the tail-weighted score to obtain the effective compliance rate. The sum integrates the scores of three independent stages (anti-flicker penalty processing, initial anti-deviation processing, and semantic loop closure processing) into one comprehensive quality indicator suited to optimizing the real-time experience.

[0114] Here, the effective compliance rate refers to the sum of the scores that represent how effectively the intermediate speech recognition results were recognized across the beginning, middle, and end stages.

[0115] As an optional implementation, the effective compliance rate can be calculated using the following formula: Effective compliance rate = process-weighted score + head-weighted score + tail-weighted score. In this formula, the effective compliance rate represents the summed performance across the three stages; the process-weighted score represents the score of the intermediate process characters after penalty and weighting; the head-weighted score represents the score of the head character after weighting; and the tail-weighted score represents the score of the tail character after weighting.
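The summation in [0115] can be sketched in Python as follows. This is a minimal sketch rather than the implementation of the present invention: the function names are hypothetical, the head and tail scores follow the plain accuracy-times-weight products of [0109] and [0112], and the process-weighted score of 0.40 is an assumed input that does not come from the text.

```python
def weighted_score(accuracy: float, weight: float) -> float:
    """Score contributed by one stage: its accuracy multiplied by its preset weight."""
    return accuracy * weight


def effective_compliance_rate(process_score: float, head_score: float, tail_score: float) -> float:
    """Sum of the three stage scores, following the formula in [0115]."""
    return process_score + head_score + tail_score


head = weighted_score(1.0, 0.25)  # first character correct, weight 25% ([0109])
tail = weighted_score(1.0, 0.30)  # tail character correct, weight 30% ([0112])
process = 0.40                    # hypothetical process-weighted score (assumption)

rate = effective_compliance_rate(process, head, tail)
print(f"effective compliance rate: {rate:.2%}")
```

Keeping the per-stage product separate from the sum makes it easy to swap in a penalized process score without touching the head and tail terms.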

[0116] Different types of voice interaction scenarios tolerate flickering or repeated modification of on-screen text to very different degrees: formal business scenarios have extremely low tolerance for text jumps, whereas casual conversation scenarios have relatively high tolerance. A uniform, fixed penalty standard could therefore yield evaluation results that deviate from the real experience in a specific business scenario. To give the evaluation framework scenario-adaptive capability and to quantify the penalty intensity accurately, in this embodiment, determining the stability coefficient used to penalize the intermediate process characters based on the degree of fluctuation includes: obtaining the fluctuation tolerance attribute of the application scenario corresponding to the audio to be evaluated; determining the penalty adjustment coefficient based on the fluctuation tolerance attribute; weighting the degree of fluctuation by the penalty adjustment coefficient to obtain the target fluctuation level; and determining the stability coefficient based on the target fluctuation level.

[0117] Specifically, the application scenario refers to the specific environment or business category in which the audio to be evaluated is generated. The fluctuation tolerance attribute refers to the degree to which users can accept the alternating flickering of correct and incorrect characters on the screen under this specific interactive environment.

[0118] As an optional implementation, the application scenario of the audio to be evaluated can be obtained by parsing the additional metadata tags, and then the corresponding fluctuation tolerance attribute can be matched from a preset scenario configuration library. For example, if the audio is identified as belonging to a formal meeting notification scenario, its fluctuation tolerance attribute is obtained as a low tolerance level; if it is identified as belonging to a daily input method chat scenario, its fluctuation tolerance attribute is obtained as a normal default level.

[0119] After obtaining the fluctuation tolerance attribute of the application scenario corresponding to the audio to be evaluated, considering that the qualitative tolerance level cannot be directly involved in the subsequent mathematical calculation mechanism, this embodiment determines the penalty adjustment coefficient based on the fluctuation tolerance attribute, transforming the abstract business tolerance level into a specific mathematical operator, thereby achieving precise quantitative configuration of the penalty intensity. Here, the penalty adjustment coefficient refers to a numerical multiplier used to proportionally amplify or reduce the original fluctuation index.

[0120] As an optional implementation, a numerical mapping relationship between tolerance attribute levels and adjustment coefficients can be pre-built. For example, when the fluctuation tolerance attribute of a certain type of audio is determined to be low tolerance, a value greater than 1 is determined from the mapping relationship as the penalty adjustment coefficient to amplify the penalty effect; when the fluctuation tolerance attribute is the normal default tolerance, a value equal to 1 is determined as the penalty adjustment coefficient.

[0121] After the penalty adjustment coefficient is determined, the originally calculated degree of fluctuation reflects only the statistical magnitude of the jumps and not yet the strictness of the business scenario. This embodiment therefore weights the degree of fluctuation by the penalty adjustment coefficient to obtain the target fluctuation level. This injects the scenario constraint directly into the fluctuation index and yields fluctuation data matched to the experience expected in the current business scenario. Here, the target fluctuation level refers to the final quantified fluctuation index after being amplified or reduced by the scenario coefficient.

[0122] As an optional embodiment, the penalty adjustment coefficient determined in the aforementioned steps can be directly multiplied by the basic fluctuation level calculated based on the correct character recognition status, and the product of the two can be determined as the target fluctuation level.

[0123] After the target fluctuation level is obtained, it must be converted into a normalized adjustment factor that can directly reduce the accuracy score. This embodiment therefore determines the stability coefficient based on the target fluctuation level. Here, the stability coefficient is the discount ratio actually used for the numerical adjustment.
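The chain from tolerance attribute to stability coefficient can be sketched as below. This is a hedged illustration only: the dictionary `PENALTY_COEFFICIENTS` and the function name are hypothetical, the value 1.3 for low tolerance is borrowed from the worked example in [0141], and the mapping 1 / (1 + target fluctuation) is the formula stated in [0133].

```python
# Hypothetical mapping from tolerance level to penalty adjustment coefficient.
# 1.3 for low tolerance mirrors the example in [0141]; 1.0 is the normal default.
PENALTY_COEFFICIENTS = {"low": 1.3, "default": 1.0}


def stability_coefficient(fluctuation: float, tolerance: str = "default") -> float:
    """Scale the fluctuation by the scenario coefficient, then map it into (0, 1]."""
    if fluctuation < 0:
        raise ValueError("fluctuation must be non-negative")
    coefficient = PENALTY_COEFFICIENTS[tolerance]
    target_fluctuation = coefficient * fluctuation  # weighting step of [0122]
    return 1.0 / (1.0 + target_fluctuation)         # formula of [0133]


print(round(stability_coefficient(0.0709, "low"), 4))
print(round(stability_coefficient(0.0709), 4))
```

With the same fluctuation, the low-tolerance scenario yields a smaller coefficient and therefore a heavier penalty than the default scenario.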

[0124] The effective compliance rate focuses on the accuracy and stability of the output characters and does not account for whether the speech recognition system misses or discards part of the audio content during output. To build a dual-core evaluation framework that covers both coverage integrity and quality compliance, and to avoid metric distortion caused by missed or abnormal output, in this embodiment, determining the evaluation result based on the effective compliance rate includes: determining the third character count, i.e., the number of characters to be written for the transcribed text, and the fourth character count, i.e., the number of actual written characters; taking the ratio of the fourth character count to the third character count as the basic completion rate, which reflects the integrity of the write coverage; and obtaining the evaluation result as the weighted sum of the basic completion rate and the effective compliance rate.

[0125] Specifically, the third character count refers to the total number of standard characters to be written in the transcribed text; the fourth character count refers to the total number of characters actually written and output to the screen during the real-time dynamic screen display process.

[0126] As an optional implementation, the text statistics module in the background can be used to count the number of characters in both the standard transcribed text and the actual output log. For example, the system parses the transcribed text to obtain the total number of all characters it contains, and determines this total as the third character count; at the same time, it counts the total number of characters displayed in the actual output recorded in the log, and determines this as the fourth character count.

[0127] After determining the number of third characters to be transcribed and the number of fourth characters actually transcribed in the transcribed text, considering the need for an intuitive and quantifiable indicator to directly reflect whether the system has completely performed the transcription operation and whether any omissions have occurred, this embodiment uses the ratio between the number of fourth characters and the number of third characters as the basic completion rate to reflect the integrity of the transcription coverage. This directly exposes defects such as concealed output or missing characters in the system, assesses the system's basic ability to complete the transcription operation, and prevents the system from obtaining high scores even when there are serious truncation or omissions. Here, the basic completion rate refers to the percentage indicator used to characterize the breadth and completeness of the coverage of the actual transcription result relative to the standard content to be transcribed.

[0128] As an optional implementation, mathematical division can be used directly to divide the number of fourth characters by the number of third characters, and the calculated percentage value can be directly determined as the basic completion rate. For example, if the standard transcribed text corresponding to a certain audio to be evaluated should have 5 characters transcribed, that is, the number of third characters is 5, and the actual number of characters transcribed recorded by the system is also 5, that is, the number of fourth characters is 5, then the calculated basic completion rate is 100%. If the system encounters an anomaly in the output and actually only transcribes 4 characters, resulting in an error of missing 1 intermediate process character, that is, the number of fourth characters is 4, then the calculated basic completion rate is 80%.
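The division in [0128] can be illustrated with a short Python sketch. The function name is hypothetical, and rejecting a zero third character count is an assumption, since the text does not say how an empty transcribed text should be handled.

```python
def basic_completion_rate(third_count: int, fourth_count: int) -> float:
    """Ratio of characters actually written to characters that should be written."""
    if third_count <= 0:
        # Assumption: an empty reference text is treated as invalid input.
        raise ValueError("third character count must be positive")
    return fourth_count / third_count


print(f"{basic_completion_rate(5, 5):.0%}")  # all 5 characters written
print(f"{basic_completion_rate(5, 4):.0%}")  # one character missing
```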

[0129] After calculating the basic completion rate to reflect the integrity of the write coverage, considering that either the basic completion rate or the effective compliance rate is one-sided and cannot comprehensively reflect the user's overall experience in the real-time perception process, this embodiment calculates the weighted sum of the basic completion rate and the effective compliance rate to obtain the evaluation result. This organically integrates the basic indicators reflecting the integrity of coverage with the core indicators reflecting accuracy and stability, and finally obtains a true, objective and user-friendly global evaluation.

[0130] Here, the evaluation result refers to the final score that is determined and used to comprehensively reflect the performance of various indicators of the intermediate results of this speech recognition.

[0131] As an optional embodiment, the corresponding weights can be preset, and the evaluation result can be calculated using the following formula: Evaluation result = basic completion rate × basic completion rate weight + effective compliance rate × effective compliance rate weight. In this formula, the evaluation result represents the final score used to measure the overall on-screen quality of the intermediate results; the basic completion rate represents the proportion reflecting the integrity of the write coverage; the basic completion rate weight represents the percentage pre-allocated to the integrity indicator; the effective compliance rate represents the overall score obtained after the fluctuation penalty and stage weighting in the previous steps; and the effective compliance rate weight represents the percentage pre-allocated to the accuracy and stability indicator.

[0132] Figure 4 is a schematic diagram of the comprehensive calculation process for the evaluation result provided by the present invention. As shown in Figure 4, the comprehensive calculation process consists of four steps.

[0133] The first step is to calculate the stability coefficient. The stability coefficient used to penalize intermediate process characters is determined based on the degree of fluctuation in the intermediate process characters. The calculation formula is: Stability coefficient = 1 / (1 + degree of fluctuation).

[0134] The second step is to calculate the weighted scores for each stage. The process accuracy is multiplied by the stability coefficient and combined with the preset process weights to obtain the process weighted score. At the same time, the head weighted score is calculated directly based on the head accuracy, head weights and stability coefficients, and the tail weighted score is calculated based on the tail accuracy, tail weights and stability coefficients.

[0135] The third step is to calculate the effective compliance rate. The process-weighted score, the head-weighted score, and the tail-weighted score obtained in the second step are summed to obtain the effective compliance rate.

[0136] The fourth step is to calculate the evaluation result. The preset weights for the basic completion rate and the effective compliance rate are obtained. Using the formula Evaluation result = basic completion rate × basic completion rate weight + effective compliance rate × effective compliance rate weight, a comprehensive score that objectively reflects the quality of the on-screen output for the audio is obtained.
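The four steps of Figure 4 can be sketched together in Python. This is a minimal sketch under stated assumptions, not the implementation of the present invention: the `StageWeights` container, the `evaluate` function, and every input value in the usage lines are hypothetical, and the stability coefficient is applied to all three stage scores as described in the second step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageWeights:
    """Preset weights for the head, process, and tail stages (hypothetical container)."""
    head: float
    process: float
    tail: float


def evaluate(head_acc: float, process_acc: float, tail_acc: float,
             fluctuation: float, completion_rate: float,
             weights: StageWeights,
             completion_weight: float, compliance_weight: float) -> float:
    # Step 1: stability coefficient from the degree of fluctuation.
    stability = 1.0 / (1.0 + fluctuation)
    # Step 2: weighted score of each stage, each discounted by the stability coefficient.
    head_score = head_acc * stability * weights.head
    process_score = process_acc * stability * weights.process
    tail_score = tail_acc * stability * weights.tail
    # Step 3: effective compliance rate as the sum of the three stage scores.
    compliance = head_score + process_score + tail_score
    # Step 4: weighted sum of basic completion rate and effective compliance rate.
    return completion_rate * completion_weight + compliance * compliance_weight


# Illustrative inputs only; none of these values come from the source.
result = evaluate(1.0, 0.9, 1.0, fluctuation=0.25, completion_rate=1.0,
                  weights=StageWeights(head=0.3, process=0.4, tail=0.3),
                  completion_weight=0.4, compliance_weight=0.6)
print(f"evaluation result: {result:.2%}")
```

Passing the two top-level weights explicitly, rather than hard-coding them, reflects that the text only says they are preset.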

[0137] To more clearly and intuitively demonstrate the complete calculation process and practical value of the speech recognition intermediate result evaluation method provided by this invention, detailed examples are given below using two specific audio output scenarios to be evaluated.

[0138] As the first specific application scenario, the transcribed text of the audio to be evaluated is "Notice on Convening This Year's Annual Work Conference".

[0139] First, the third character count, i.e., the number of characters to be written for the transcribed text, is determined to be 15. The system also counts the fourth character count of the actual, complete output, which is also 15. Dividing the fourth character count by the third character count gives a basic completion rate of 100%. This metric confirms that no content was missed and that coverage is complete.

[0140] Secondly, the actual written characters are compared with the transcribed text word by word. After verification, the correctness marks of the leading character "关" at the starting position and the trailing character "知" at the ending position are both in the correct state, thus determining that both the leading accuracy rate and the trailing accuracy rate are 100%, avoiding the initial deviation of the theme and the lack of semantic closed-loop in the transcribed content. At the same time, it is determined that the number of remaining intermediate process characters is 13, and the number of characters with the correct state of the correctness mark is 12, so the calculated process accuracy rate is 12÷13×100%≈92.3%.

[0141] Subsequently, the system statistically quantifies the correctness marks of the intermediate process characters and calculates that the fluctuation degree of the intermediate process characters during the output process is approximately 0.0709. The smaller the fluctuation degree value, the smoother the fluctuation. Since the fluctuation tolerance attribute of the application scenario corresponding to this type of meeting notice audio is extremely low, the system determines to adopt the scenario-based penalty rule and determines the corresponding penalty adjustment coefficient to be 1.3. After using this penalty adjustment coefficient to weighted-adjust the fluctuation degree to obtain the target fluctuation degree, the stability coefficient is calculated as 1÷(1 + 1.3×0.0709)≈0.915, with only a slight penalty.
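One way to arrive at a fluctuation degree close to the stated 0.0709 is sketched below. It assumes the fluctuation degree is the population variance of the binary correctness marks (each mark minus the process accuracy, squared, summed, and divided by the number of intermediate process characters). The function name is hypothetical, and the position of the single incorrect character is an assumption that does not affect the result.

```python
from typing import Sequence


def fluctuation_degree(marks: Sequence[int]) -> float:
    """Population variance of 0/1 correctness marks around the process accuracy."""
    if not marks:
        return 0.0  # assumption: no intermediate characters means no fluctuation
    accuracy = sum(marks) / len(marks)
    return sum((mark - accuracy) ** 2 for mark in marks) / len(marks)


# 13 intermediate process characters, 12 of them correct ([0140]).
marks = [1] * 12 + [0]
print(round(fluctuation_degree(marks), 4))
```

Under this assumed definition the value is 12/169, about 0.0710, which agrees with the figure in the text to within rounding.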

[0142] Finally, the stage accuracy rate is numerically adjusted using the stability coefficient. Combining the preset weight distribution values, in this scenario, the leading weight is set to 30%, the process weight is 40%, and the trailing weight is 30%. The calculated leading weighted score is 100%×0.915×30%≈27.45%, the process weighted score is 92.3%×0.915×40%≈33.7%, and the trailing weighted score is 100%×0.915×30%≈27.45%. Summing up the above process weighted score, leading weighted score, and trailing weighted score, the effective compliance rate is approximately 88.6%. Setting the weight of the basic completion rate to 40% and the weight of the effective compliance rate to 60%, the basic completion rate and the effective compliance rate are weighted and summed, and finally the evaluation result of the intermediate result of speech recognition is determined to be 100%×40% + 88.6%×60%≈93.16%.

[0143] As the second specific application scenario, the transcribed text of the audio to be evaluated is "What's the weather like today?".

[0144] During recognition of this audio, the actual written characters of the intermediate speech recognition result are output incrementally, and the character sequences displayed on screen in turn are: 金天 → 今天 → 今天天 → 今天天启 → 今天天气怎 → 今天天气怎么 → 今天天气怎么样.

[0145] First, the third character count corresponding to the transcribed text is determined to be 7. The system records that the fourth character counts of the characters actually written at each stage are 2, 2, 3, 4, 5, 6, and 7, respectively. To reflect the incremental rhythm of the writing process, the overall basic completion rate is calculated as the average of the per-stage rates: (28.57% + 28.57% + 42.86% + 57.14% + 71.43% + 85.71% + 100%) ÷ 7 ≈ 59.18%.
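The per-stage averaging can be reproduced with a short Python sketch. The function name is hypothetical, and the reference string is taken to be the final on-screen sequence from [0144], which has the stated seven characters.

```python
from typing import Sequence


def mean_completion_rate(reference: str, snapshots: Sequence[str]) -> float:
    """Average, over all output stages, of written length divided by reference length."""
    rates = [len(snapshot) / len(reference) for snapshot in snapshots]
    return sum(rates) / len(rates)


reference = "今天天气怎么样"  # assumed reference: the final sequence in [0144]
snapshots = ["金天", "今天", "今天天", "今天天启", "今天天气怎", "今天天气怎么", "今天天气怎么样"]

completion = mean_completion_rate(reference, snapshots)
print(f"{completion:.2%}")
```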

[0146] Secondly, through the word-by-word verification of the correctness marks, it is found that the first character was misidentified as '金' in the initial stage and corrected to '今' in the subsequent stage. The average value of its first-character accuracy rate is determined to be approximately 85.71%; the correctness mark of the last character '样' is always in the correct state, and the last-character accuracy rate is 100%; for the intermediate process characters, although the character '启' was briefly miswritten and quickly corrected in the intermediate stage, all five intermediate process characters are finally correct, and the process accuracy rate is 100%.

[0147] In addition, the accuracy rates at the seven stages are 1/2 = 50%, 2/2 = 100%, 3/3 = 100%, 3/4 = 75%, 5/5 = 100%, 6/6 = 100%, and 7/7 = 100%, with an average of approximately 89.29%. The fluctuation degree of the intermediate process characters during the output process is calculated as [(50% − 89.29%)² + (100% − 89.29%)² + (100% − 89.29%)² + (75% − 89.29%)² + (100% − 89.29%)² + (100% − 89.29%)² + (100% − 89.29%)²] ÷ 7 ≈ 0.0332. Since the default penalty rule applies to this audio, the stability coefficient is calculated directly from this fluctuation degree as 1 ÷ (1 + 0.0332) ≈ 0.9679, which is only a slight penalty.
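The stage accuracies, the fluctuation degree, and the stability coefficient above can be recomputed with the following sketch. It assumes that the character-by-character comparison is a position-wise equality check against the reference and that the fluctuation degree is the population variance of the stage accuracies. The function name is hypothetical, and the reference string is taken from the final sequence in [0144].

```python
from statistics import pvariance


def stage_accuracy(reference: str, snapshot: str) -> float:
    """Share of displayed characters that match the reference at the same position."""
    correct = sum(1 for ref_ch, out_ch in zip(reference, snapshot) if ref_ch == out_ch)
    return correct / len(snapshot)  # snapshots are assumed to be non-empty


reference = "今天天气怎么样"
snapshots = ["金天", "今天", "今天天", "今天天启", "今天天气怎", "今天天气怎么", "今天天气怎么样"]

accuracies = [stage_accuracy(reference, s) for s in snapshots]
fluctuation = pvariance(accuracies)      # mean squared deviation from the average
stability = 1.0 / (1.0 + fluctuation)    # default penalty rule, coefficient 1

print([f"{a:.0%}" for a in accuracies])
print(round(fluctuation, 4), round(stability, 4))
```

`statistics.pvariance` divides by the number of stages (7), which matches the division in the formula above. `statistics.variance` would divide by 6 and give a different value.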

[0148] Finally, the preset weights in this scenario are 25% for the first character, 45% for the process, and 30% for the last character. The first-character weighted score is 85.71% × 0.9679 × 25% ≈ 20.74%, the process-weighted score is 100% × 0.9679 × 45% ≈ 43.56%, and the last-character weighted score is 100% × 0.9679 × 30% ≈ 29.04%. Summing these gives an effective compliance rate of approximately 93.34%. Weighting the basic completion rate by 40% and the effective compliance rate by 60% and summing, the final evaluation result is 59.18% × 40% + 93.34% × 60% ≈ 79.68%.

[0149] The evaluation device for intermediate speech recognition results provided by the present invention will be described below. The evaluation device for intermediate speech recognition results described below can be referred to in correspondence with the evaluation method for intermediate speech recognition results described above.

[0150] Based on any of the above embodiments, Figure 5 is a schematic diagram of the structure of the speech recognition intermediate result evaluation device provided by the present invention. As shown in Figure 5, the device includes: the acquisition module 510, used to acquire the transcribed text of the audio to be evaluated and the intermediate speech recognition results output during recognition of the audio to be evaluated; the comparison module 520, used to compare the actual written characters in the intermediate speech recognition results with the transcribed text to determine the correctness marks of the actual written characters; the determination module 530, used to determine intermediate process characters from the actual written characters and obtain the correctness marks corresponding to the intermediate process characters; the calculation module 540, used to calculate the degree of fluctuation of the intermediate process characters during the output process based on the correctness marks corresponding to the intermediate process characters; and the evaluation module 550, used to determine the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation.

[0151] Figure 6 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640. The processor 610, communications interface 620, and memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions from the memory 630 to execute the evaluation method for intermediate speech recognition results.

[0152] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0153] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer is able to perform the evaluation method for intermediate results of speech recognition provided by the above methods.

[0154] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements an evaluation method for intermediate results of speech recognition provided by the methods described above.

[0155] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0156] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0157] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for evaluating intermediate results of speech recognition, characterized in that the method includes: obtaining the transcribed text of the audio to be evaluated, and the intermediate speech recognition results output during recognition of the audio to be evaluated; comparing the actual written characters in the intermediate speech recognition results with the transcribed text to determine the correctness marks of the actual written characters; determining intermediate process characters from the actual written characters, and obtaining the correctness marks corresponding to the intermediate process characters; calculating, based on the correctness marks corresponding to the intermediate process characters, the degree of fluctuation of the intermediate process characters during the output process; and determining the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation.

2. The method for evaluating intermediate results of speech recognition according to claim 1, characterized in that, The step of calculating the fluctuation degree of the intermediate process character during the output process based on the correctness marker corresponding to the intermediate process character includes: The process accuracy is determined based on the correctness flags corresponding to all intermediate process characters. Calculate the degree of difference between the correctness marker corresponding to each intermediate process character and the process accuracy; The degree of fluctuation is obtained by summing up the degree of difference corresponding to all intermediate process characters.

3. The method for evaluating intermediate results of speech recognition according to claim 2, characterized in that obtaining the degree of fluctuation by combining the degrees of difference corresponding to all intermediate process characters includes: squaring the degree of difference corresponding to each intermediate process character to obtain the squared difference for each intermediate process character; and summing the squared differences corresponding to all intermediate process characters and combining the sum with the first character count of the intermediate process characters to obtain the degree of fluctuation.

4. The method for evaluating intermediate results of speech recognition according to claim 2, characterized in that determining the process accuracy based on the correctness marks corresponding to all intermediate process characters includes: counting the first character count, i.e., the number of intermediate process characters; if the first character count is zero, taking a preset value as the process accuracy; and if the first character count is greater than zero, counting, among the intermediate process characters, the second character count, i.e., the number of characters whose correctness mark indicates a correct state, and taking the ratio of the second character count to the first character count as the process accuracy.

5. The method for evaluating intermediate results of speech recognition according to any one of claims 1 to 4, characterized in that, The evaluation result of the intermediate speech recognition result, based on the correctness marker of the actual written characters and the degree of fluctuation, includes: Based on the degree of fluctuation, a stability coefficient is determined for applying a score penalty to the intermediate process characters; The stage accuracy is determined based on the correctness markers of the actual characters written. The stability coefficient is used to numerically adjust the stage accuracy, and the effective compliance rate is calculated by combining the weight allocation value corresponding to the stage accuracy. The evaluation result is determined based on the effective compliance rate.

6. The method for evaluating intermediate results of speech recognition according to claim 5, characterized in that, Determining intermediate process characters from the actual written characters includes: Determine the first character at the beginning position and the last character at the end position in the actual written characters; The remaining characters in the actual written characters, excluding the first and last characters, are determined as the intermediate process characters.

7. The method for evaluating intermediate results of speech recognition according to claim 6, characterized in that, The stage accuracy includes head accuracy, process accuracy, and tail accuracy; The step of adjusting the stage accuracy using the stability coefficient and calculating the effective achievement rate by combining the weight allocation value corresponding to the stage accuracy includes: The process accuracy is multiplied by the stability coefficient and combined with the process weight to obtain the process weighted score; The head-weighted score is calculated based on the head-end accuracy and head-end weight. The tail-weighted score is calculated based on the tail accuracy and tail weight. The effective compliance rate is obtained by summing the process weighted score, the head weighted score, and the tail weighted score.

8. The method for evaluating intermediate results of speech recognition according to claim 5, characterized in that, The determination of a stability coefficient for applying a score penalty to the intermediate process characters based on the degree of fluctuation includes: Obtain the fluctuation tolerance attribute of the application scenario corresponding to the audio to be evaluated; Based on the fluctuation tolerance attribute, determine the penalty adjustment coefficient; The target volatility level is obtained by weighting and adjusting the volatility level using the penalty adjustment coefficient. The stability coefficient is determined based on the target fluctuation level.

9. The method for evaluating intermediate results of speech recognition according to claim 5, characterized in that determining the evaluation result based on the effective achievement rate includes: determining a third character count, which is the number of characters to be written for the transcribed text, and a fourth character count, which is the number of actual written characters; taking the ratio of the fourth character count to the third character count as a basic completion rate that reflects the completeness of the written coverage; and performing a weighted summation of the basic completion rate and the effective achievement rate to obtain the evaluation result.
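The final combination in claim 9 can be illustrated with a short sketch. The parameter `completion_weight` and its default of 0.3 are hypothetical, and the sketch assumes the two weights sum to one; the claim only requires a weighted summation.

```python
def evaluation_result(expected_count, written_count, effective_rate,
                      completion_weight=0.3):
    """Weighted sum of basic completion rate and effective achievement rate.

    expected_count: characters to be written for the transcribed text.
    written_count:  characters actually written.
    """
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    basic_completion = written_count / expected_count
    return (completion_weight * basic_completion
            + (1.0 - completion_weight) * effective_rate)


print(round(evaluation_result(10, 8, 0.64), 4))
```

For 8 of 10 characters written and an effective achievement rate of 0.64, the result under these assumed weights is 0.3 × 0.8 + 0.7 × 0.64 = 0.688.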

10. An evaluation device for intermediate results of speech recognition, characterized in that it comprises: an acquisition module, used to acquire the transcribed text of the audio to be evaluated and the intermediate speech recognition results output during the recognition process of the audio to be evaluated; a comparison module, used to compare the actual written characters in the intermediate speech recognition results with the transcribed text to determine the correctness marks of the actual written characters; a determination module, used to determine intermediate process characters from the actual written characters and obtain the correctness marks corresponding to the intermediate process characters; a calculation module, used to calculate the degree of fluctuation of the intermediate process characters during the output process based on the correctness marks corresponding to the intermediate process characters; and an evaluation module, used to determine the evaluation result of the intermediate speech recognition results based on the correctness marks of the actual written characters and the degree of fluctuation.
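For orientation, the comparison and calculation modules could look roughly like the sketch below. Two assumptions are made that the claim does not state: characters are compared position by position, and the degree of fluctuation is taken as the fraction of adjacent correctness marks that differ. Both function names are hypothetical.

```python
def correctness_marks(written, reference):
    """Mark each written character True if it matches the reference
    character at the same position (assumed position-wise comparison)."""
    return [i < len(reference) and ch == reference[i]
            for i, ch in enumerate(written)]


def fluctuation_degree(marks):
    """Assumed measure: share of adjacent mark pairs that flip between
    correct and incorrect; 0.0 when fewer than two marks exist."""
    if len(marks) < 2:
        return 0.0
    flips = sum(a != b for a, b in zip(marks, marks[1:]))
    return flips / (len(marks) - 1)


marks = correctness_marks("abxd", "abcd")
intermediate_marks = marks[1:-1]  # drop first and last characters
print(marks, fluctuation_degree(intermediate_marks))
```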

11. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method for evaluating intermediate results of speech recognition according to any one of claims 1 to 9.

12. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method for evaluating intermediate results of speech recognition according to any one of claims 1 to 9.