Oral English test system and method based on multi-dimensional factors

A multi-dimensional oral assessment system that combines Transformer-CTC and RoBERTa models addresses the low efficiency and high subjectivity of traditional oral assessment. It provides a comprehensive, fair, and accurate evaluation of oral proficiency and is suitable for educational and examination scenarios.

CN120808759B · Active · Publication Date: 2026-06-23 · 读书郎教育科技有限公司

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: 读书郎教育科技有限公司
Filing Date: 2025-07-08
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing oral assessment technologies rely mainly on manual scoring, which is inefficient and highly subjective. They struggle to account comprehensively for multiple dimensions such as fluency, semantics, grammar, and speech duration, and are particularly ineffective for open-ended questions.

Method used

A multi-dimensional spoken language assessment system is adopted, including modules for speech acquisition, speech recognition, fluency analysis, semantic analysis, grammatical analysis, text matching, and pronunciation analysis. The weights are dynamically adjusted through a comprehensive evaluation module to calculate the final spoken language assessment score. The Transformer-CTC architecture and RoBERTa model are combined for phoneme recognition and semantic and grammatical evaluation.

Benefits of technology

It enables a comprehensive, fair, and accurate assessment of oral proficiency, adapts to different language levels and scenarios, enhances the scientific rigor and credibility of the assessment, and is applicable to educational and examination settings.



Abstract

The application discloses a spoken language evaluation system and method based on multiple dimensions, and relates to the technical field of data processing. The method comprises the following steps: obtaining the audio text, a phoneme list, and a phoneme time boundary list corresponding to speech data using a pre-trained text phoneme recognition model; calculating a fluency score from the phoneme list and the phoneme time boundary list; obtaining a semantic score of the audio text through a pre-trained semantic model; obtaining a grammar score of the audio text through a pre-trained grammar model; matching the audio text against an answer text using an edit distance-based text matching method to obtain a text matching score; obtaining the final pronunciation score of the audio data from the phoneme confidences in the phoneme list; and dynamically adjusting the weight of each dimension score according to the length of the audio text to calculate a final spoken language evaluation score. Through multi-dimensional evaluation and scenario-dependent dynamic weight adjustment, the application evaluates spoken language ability comprehensively, fairly, accurately, and efficiently.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a spoken language assessment system and method based on multi-dimensional factors.

Background Technology

[0002] With the development of educational informatization, oral assessment is playing an increasingly important role in English learning, teaching, and examinations. In particular, the promotion of English listening and speaking tests has accelerated the development of oral assessment technology.

[0003] Traditional oral assessment relies mainly on human scoring, which is not only inefficient but also highly subjective. Existing oral assessment technologies also largely depend on the analysis of speech features and text matching for scoring. They do not adequately consider fluency, semantics, grammar, or different test scenarios, especially open-ended questions, and are therefore inadequate to meet actual assessment needs.

[0004] Chinese patent application publication number CN108154735A discloses a method that, after outputting a test question, receives a voice signal input by a tester; performs speech recognition on the voice signal to obtain corresponding text information; performs text matching between the text information and answers in a set of answers corresponding to the test question, wherein the set of answers includes the standard answer to the test question and derivative answers extended from the standard answer; obtains an evaluation result based at least on the text matching result; and outputs the evaluation result. This method can effectively evaluate the semantics and logic of spoken text, improving the comprehensiveness and accuracy of the evaluation, and providing fast feedback. However, it does not consider the synergistic effect of speech rate and pauses on fluency, nor does it consider the different evaluation focuses of answers under different spoken pronunciation durations.

[0005] Therefore, there is an urgent need for an evaluation method based on multiple dimensions such as fluency, semantics, grammar, text matching, and speaking duration, which can provide more comprehensive and accurate guidance for students' oral communication skills.

Summary of the Invention

[0006] To overcome the shortcomings of the prior art, the present invention provides a spoken language assessment system and method based on multi-dimensional factors.

[0007] To achieve the above-mentioned objectives, the present invention adopts the following technical solution:

[0008] In a first aspect, the present invention provides a spoken language assessment system based on multi-dimensional factors, comprising the following modules:

[0009] The voice acquisition module is used to collect the voice data of the user's answers to questions;

[0010] The scenarios for the questions include oral composition and situational spoken language;

[0011] The speech recognition module is used to extract Fbank features from speech data and input the Fbank features into a pre-trained text phoneme recognition model to obtain the audio text, phoneme list, and phoneme time boundary list corresponding to the speech data.

[0012] The fluency analysis module is used to calculate the number of phonemes pronounced per minute, the pause duration, and the number of pauses based on the phoneme list and phoneme time boundary list, and to calculate the fluency score SP from these three quantities.

[0013] The semantic analysis module is used to obtain the semantic score SP2 of the audio text through a pre-trained semantic model; the pre-trained semantic model is fine-tuned based on the pre-trained SentenceTransformer model.

[0014] The grammar analysis module is used to obtain the grammar score SP3 of the audio text through a pre-trained grammar model; the pre-trained grammar model is fine-tuned based on the RoBERTa model.

[0015] The text matching module is used to match the audio text and the answer text using an edit distance-based text matching method to obtain a text matching score SP4.

[0016] The pronunciation analysis module is used to calculate the average phoneme confidence of each word as the word pronunciation score by using the phoneme confidence of the phoneme list, and then calculate the average of all word pronunciation scores and multiply it by 100 to obtain the final pronunciation score SP5 of the audio data.

[0017] The comprehensive evaluation module dynamically adjusts the weights of each score based on the length of the audio text using the fluency score SP, grammar score SP3, pronunciation score SP5, and completeness score SP6, and calculates the final oral assessment score. The completeness score SP6 is the larger value between the semantic score SP2 and the text matching score SP4.
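The pronunciation score SP5 of the pronunciation analysis module in [0016] can be illustrated with a minimal sketch. It assumes the phoneme confidences have already been grouped by word; how phonemes are aligned to words is not specified here, and the function name `pronunciation_score` and the sample values are illustrative only.

```python
def pronunciation_score(word_confidences):
    """Return SP5 from per-word lists of phoneme confidences in [0, 1]."""
    # Word pronunciation score = mean phoneme confidence within the word.
    word_scores = [sum(c) / len(c) for c in word_confidences if c]
    if not word_scores:
        return 0.0  # assumption: no recognized words yields a score of 0
    # SP5 = mean of the word scores, scaled to a 0-100 range.
    return sum(word_scores) / len(word_scores) * 100


# Illustrative confidences for three words of 4, 3, and 2 phonemes.
words = [[0.87, 0.91, 0.85, 0.93], [0.88, 0.90, 0.92], [0.89, 0.94]]
sp5 = pronunciation_score(words)
print(round(sp5, 2))
```

Averaging per word first, rather than over all phonemes at once, keeps long words from dominating the score.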

[0018] Specifically, the text phoneme recognition model adopts a Transformer-CTC architecture with the following structure. The input is the FBank features of the audio. Two convolutional layers downsample the FBank features and extract local features. A linear layer then projects the local features to the input dimension of the Transformer. The projected features pass through multiple Transformer encoder layers, each containing a multi-head self-attention mechanism and a feedforward neural network, to capture high-level acoustic features with long-range dependencies. A final linear layer maps the high-level acoustic features to the phoneme label space, outputting the probability of each phoneme label for each time frame. CTC decoding then outputs the audio text, a phoneme list, and a phoneme time boundary list consisting of the time boundaries of each phoneme. Each element of the phoneme list includes the specific phoneme and its corresponding confidence score.

[0019] Specifically, the fluency score SP is shown in the following formula:

[0020] SP = S × 0.85 - P × 0.15;

[0021] Where SP is the final fluency score; S is the basic fluency score; and P is the pause penalty score.

[0022] The formula for calculating the basic fluency score S is as follows:

[0023] S=100×[1-|(PPM-PPM') / PPM'|^α]

[0024] Where PPM is the number of phonemes pronounced per minute; PPM' is the optimal number of phonemes per minute; α is the speech rate sensitivity adjustment coefficient, whose value varies with the question scenario; S takes values in [0, 100] and is expressed in points.

[0025] The formula for calculating the pause penalty score P is as follows:

[0026] P = β1 × N2 + β2 × N3 + β3 × T_p

[0027] Where N2 is the number of 2-second pauses; N3 is the number of 3-second pauses; T_p is the total pause duration; β1 is the 2-second pause penalty coefficient; β2 is the 3-second pause penalty coefficient; and β3 is the total pause duration penalty coefficient.

[0028] Specifically, in the oral composition scenario, α takes values in [1.1, 1.3], while in the situational oral communication scenario, α takes values in [1.5, 1.7].

[0029] Specifically, the input to the pre-trained semantic model is the audio text and the answer text. The semantic model performs word segmentation and embedding on the two texts respectively, and finally obtains two sentence vector representations. Then, the cosine similarity between the two sentence vectors is calculated, and the cosine similarity is multiplied by 100 to obtain the semantic score SP2 of the audio text. The answer text is obtained in advance based on the question.
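The final step of [0029], turning two sentence vectors into the semantic score SP2, can be sketched as follows. The sketch assumes the two sentence vectors have already been produced by the semantic model; the function name `semantic_score` and the zero-vector fallback are illustrative assumptions, not part of the described system.

```python
import math


def semantic_score(vec_a, vec_b):
    """Return SP2 = cosine similarity of two sentence vectors, times 100."""
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is scored as no similarity
    return dot / (norm_a * norm_b) * 100


# Parallel vectors give the maximum score, orthogonal vectors give zero.
print(semantic_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
print(semantic_score([1.0, 0.0], [0.0, 1.0]))
```

Cosine similarity can be negative for opposed vectors; whether such values are clipped to 0 is not stated in the description, so the sketch leaves them unchanged.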

[0030] Specifically, the input of the pre-trained grammar model is the word sequence [wd1, wd2, ..., wdi] corresponding to the audio text, where i is the number of words in the audio text. The output of the pre-trained grammar model is a predicted label sequence [l1, l2, ..., li] of the same length as the word sequence. Each label li belongs to a predefined set of operation labels, including CORRECT, DELETE, REPLACE, and INSERT_AFTER, which represent the model's grammatical judgment and suggested operation for the word wdi. The predicted label sequence is traversed, and the number of words marked as CORRECT, N2, is counted. N2 is divided by the length of the word sequence corresponding to the audio text and multiplied by 100 to obtain the grammar score SP3 of the audio text.
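The counting step of the grammar score can be sketched as below, assuming the label sequence has already been predicted by the grammar model. The function name `grammar_score` and the empty-input fallback are illustrative assumptions.

```python
def grammar_score(labels):
    """Return SP3 = share of words labeled CORRECT, times 100."""
    if not labels:
        return 0.0  # assumption: empty audio text yields a score of 0
    correct = sum(1 for label in labels if label == "CORRECT")
    return correct / len(labels) * 100


# One of four words is flagged for replacement.
print(grammar_score(["CORRECT", "REPLACE", "CORRECT", "CORRECT"]))
```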

[0031] Specifically, the text matching method based on edit distance is shown in the following formula:

[0032]

[0033] Where SP4 is the text matching score; wd1_i is the i-th word in the audio text; wd2_i is the i-th word in the answer text; N is the total number of matched word pairs; ES(wd1_i, wd2_i) is the edit distance similarity function, giving the edit distance similarity between wd1_i and wd2_i, with values in [0, 1], where 0 indicates completely different and 1 indicates completely identical; and w_i is the high-frequency word weight of wd2_i.

[0034] Specifically, the edit distance similarity function is as follows:

[0035]

[0036] Among them, EditDistance() is the edit distance algorithm; len() is the length function; and max() is the maximum value function.
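A common way to build a similarity in [0, 1] from EditDistance(), len(), and max() is one minus the edit distance divided by the longer string's length. The sketch below assumes that form; it is consistent with the definitions above (1 for identical strings, 0 for completely different ones) but is not confirmed as the exact formula of this application. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Assumed form: ES(a, b) = 1 - EditDistance(a, b) / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_similarity("what", "that"))
```

Normalizing by the longer length keeps the result in [0, 1], since the edit distance can never exceed that length.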

[0037] Specifically, the formula for calculating the final oral assessment score is as follows:

[0038] Score=a1(L)×SP+a2(L)×SP3+a3(L)×SP5+a4(L)×SP6

[0039] Where a1(L) is the dynamic weight of fluency score; a2(L) is the dynamic weight of grammar score; a3(L) is the dynamic weight of pronunciation score; a4(L) is the dynamic weight of completeness score; and L is the length of the audio text, i.e. the number of words in the audio text.
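The final score combination can be sketched as follows. The weights a1(L) to a4(L) are passed in explicitly, because the sketch makes no assumption about how they depend on L; the equal weights in the usage example are placeholders, and the function name `final_score` is illustrative.

```python
def final_score(sp, sp2, sp3, sp4, sp5, weights):
    """Score = a1*SP + a2*SP3 + a3*SP5 + a4*SP6, with SP6 = max(SP2, SP4)."""
    a1, a2, a3, a4 = weights
    sp6 = max(sp2, sp4)  # completeness score: larger of semantic and text matching
    return a1 * sp + a2 * sp3 + a3 * sp5 + a4 * sp6


# Placeholder equal weights; the real weights depend on the audio text length L.
print(final_score(sp=80, sp2=60, sp3=70, sp4=75, sp5=90,
                  weights=(0.25, 0.25, 0.25, 0.25)))
```

Taking the maximum of SP2 and SP4 means a correct answer phrased differently from the reference is not penalized for a low word-level match.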

[0040] Specifically, the fluency score (SP), grammar score (SP3), pronunciation score (SP5), and completeness score (SP6) are dynamically weighted based on the audio text length using a dynamic weighting formula. The dynamic weighting formula is shown below:

[0041]

[0042] In a second aspect, the present invention provides a spoken language assessment method based on multi-dimensional factors. The method is based on the spoken language assessment system described in the first aspect above and includes the following steps:

[0043] S1. Collect voice data of users answering questions; the scenarios of the questions include oral composition and situational spoken language;

[0044] S2. Extract the Fbank features from the speech data and input the Fbank features into the pre-trained text phoneme recognition model to obtain the audio text, phoneme list and phoneme time boundary list corresponding to the speech data.

[0045] S3. Calculate the number of phonemes pronounced per minute, the pause duration, and the number of pauses based on the phoneme list and phoneme time boundary list, and calculate the fluency score SP from these three quantities.

[0046] S4. Obtain the semantic score SP2 of the audio text through a pre-trained semantic model; the pre-trained semantic model is fine-tuned based on the pre-trained SentenceTransformer model.

[0047] S5. Obtain the grammar score SP3 of the audio text through a pre-trained grammar model; the pre-trained grammar model is fine-tuned based on the RoBERTa model;

[0048] S6. Match the audio text and the answer text using the text matching method based on edit distance to obtain the text matching score SP4;

[0049] S7. Calculate the average phoneme confidence of each word using the phoneme confidence of the phoneme list as the word pronunciation score, and then calculate the average of all word pronunciation scores and multiply it by 100 to obtain the final audio data pronunciation score SP5.

[0050] S8. The weights of each score are dynamically adjusted based on the length of the audio text using the fluency score SP, grammar score SP3, pronunciation score SP5, and completeness score SP6, and the final oral assessment score Score is calculated; the completeness score SP6 is the larger value between the semantic score SP2 and the text matching score SP4.

[0051] Specifically, the method also includes:

[0052] S9. Display the answer text and scores for each dimension on the application terminal; the scores for each dimension include fluency score SP, semantic score SP2, grammar score SP3, pronunciation score SP5, completeness score SP6, and final oral assessment score Score.

[0053] In a third aspect, the present invention also provides an electronic device including a processor, a memory, a communication interface, and one or more programs, said one or more programs being stored in the memory and configured to be executed by the processor, said programs including instructions for implementing the system described in the first aspect.

[0054] This invention provides a spoken language assessment system and method based on multiple dimensions, including: collecting voice data of users answering questions; obtaining audio text, a phoneme list, and a phoneme time boundary list corresponding to the voice data based on a pre-trained text phoneme recognition model; calculating the number of phonemes pronounced per minute, pause duration, and number of pauses based on the phoneme list and phoneme time boundary list, and calculating a fluency score; obtaining a semantic score of the audio text through a pre-trained semantic model; obtaining a grammatical score of the audio text through a pre-trained grammar model; matching the audio text and answer text using a text matching method based on edit distance to obtain a text matching score; obtaining a pronunciation score of the final audio data based on the phoneme confidence of the phoneme list; and a comprehensive evaluation module that dynamically adjusts the weights of each dimension score based on the length of the audio text and calculates the final spoken language assessment score. The method in this embodiment comprehensively evaluates spoken language ability through five dimensions: fluency, semantics, grammar, text matching, and pronunciation, avoiding the limitations of a single indicator.

[0055] Furthermore, the fluency analysis module of this invention, through phoneme-level analysis, utilizes multi-dimensional quantitative evaluation, dynamic parameter configuration, reasonable weight allocation, and precise penalty mechanisms to avoid the limitations of a single indicator and ensure the coordinated consideration of speech rate and pauses; it enhances the fairness and practicality of the evaluation; it conforms to authoritative language proficiency standards, enhancing the authority and credibility of the evaluation results, and realizing a scientific, scenario-based, and personalized evaluation of spoken fluency, adapting to different language levels and scenarios;

[0056] Furthermore, the comprehensive evaluation module of this invention adopts different scoring weights for different text lengths of audio responses, ensuring that the evaluation results are closer to actual oral proficiency.

[0057] Furthermore, this invention captures the deeper intent of the answer through a pre-trained semantic model, avoiding misjudgment caused by different phrasings, and quickly identifies completely identical answers using the edit distance matching method. At the same time, combining semantic and text matching to judge completeness improves the accuracy and robustness of the evaluation.

[0058] This invention achieves comprehensiveness, fairness, accuracy, and efficiency in oral assessment through multi-dimensional evaluation, dynamic weight adjustment, semantic and text matching, and phoneme-level analysis. It is particularly suitable for educational, examination, and language learning scenarios and can more scientifically reflect students' oral abilities.

Attached Figure Description

[0059] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0060] Figure 1 is a schematic diagram of a spoken language assessment system based on multi-dimensional factors according to an embodiment of the present invention;

[0061] Figure 2 is a schematic diagram of a spoken language assessment method based on multi-dimensional factors provided by an embodiment of the present invention;

[0062] Figure 3 is a schematic diagram of an oral assessment device based on multi-dimensional factors according to an embodiment of the present invention.

Detailed Implementation

[0063] The present invention will be explained in detail through the following embodiments. The purpose of this invention is to protect all technical improvements within the scope of this invention. In the description of this invention, it should be understood that if there are terms such as "upper", "lower", "front", "rear", "left", "right" indicating the orientation or positional relationship, they are only corresponding to the drawings of this application for the convenience of describing the present invention, and are not intended to indicate or imply that the device or element referred to must have a specific orientation.

[0064] Example 1

[0065] Referring to Figure 1, this embodiment provides a spoken language assessment system based on multi-dimensional factors, including the following modules:

[0066] The voice acquisition module is used to collect voice data of users answering questions; the scenarios of the questions include oral composition and situational spoken language.

[0067] The user's spoken audio information is collected through the recording function of the application. The application can be a mobile phone or tablet App, a computer desktop application, or a web browser. The mobile phone, tablet, and computer all have microphone recording functions. The application calls the microphone to collect audio. This is existing technology and will not be described in detail here.

[0068] The voice data is mono audio sampled at 16 kHz, in WAV or MP3 format, and its duration is measured in seconds.

[0069] The speech recognition module is used to extract Fbank features from speech data and input the Fbank features into a pre-trained text phoneme recognition model to obtain the audio text, phoneme list, and phoneme time boundary list corresponding to the speech data.

[0070] Extracting Fbank features from speech data includes the following steps: pre-emphasizing, framing, and windowing the speech data, performing Fourier transform, calculating spectral line energy, performing Mel filtering, and taking the logarithm to obtain Fbank features;

[0071] Fbank (Filter Bank) is the logarithmic feature after Mel filtering. It is often used in speech signal processing. Its shape is [T,D], where T is the number of time frames and D is the feature dimension.

[0072] Specifically, this embodiment extracts FBank features from speech data using the librosa library;

[0073] Furthermore, the audio data is loaded using the librosa.load() function, the Mel spectrogram energy on a linear (non-logarithmic) scale is obtained using the librosa.feature.melspectrogram() function, and this energy is then converted to a logarithmic scale using the librosa.power_to_db() function to obtain the Fbank features.

[0074] librosa is a Python library for audio and music analysis, focusing on Music Information Retrieval (MIR) tasks. It provides a powerful set of tools for processing audio signals, extracting features, visualizing, and analyzing music content.

[0075] The text phoneme recognition model adopts the Transformer-CTC architecture with the following structure. The input is the FBank features of the audio (shape [number of time frames, feature dimension]). Two convolutional layers downsample the FBank features and extract local features. A linear layer then projects the local features to the input dimension of the Transformer. The projected features pass through multiple Transformer encoder layers, each containing a multi-head self-attention mechanism and a feedforward neural network, to capture high-level acoustic features with long-range dependencies. A final linear layer maps the high-level acoustic features to the phoneme label space, outputting the probability of each phoneme label for each time frame. CTC decoding then outputs the audio text, a phoneme list, and a phoneme time boundary list consisting of the time boundaries of each phoneme. Each element of the phoneme list includes the specific phoneme and its corresponding confidence score.
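The last stage, going from per-frame phoneme probabilities to a phoneme list with time boundaries, can be illustrated with a greedy (best-path) CTC decoding sketch. This is a simplified illustration and not necessarily the decoder used in this embodiment. The function name, the 40 ms frame duration, the blank index, and the rule that blank frames are attached to the neighboring phoneme (so that boundaries stay contiguous) are all assumptions. The mapping from phonemes to the audio text is not covered.

```python
def ctc_greedy_decode(frame_probs, labels, blank=0, frame_ms=40):
    """Greedy CTC decoding: returns (phoneme_list, phoneme_ts) from frame posteriors."""
    segments = []  # each entry: [label index, start frame, end frame, confidences]
    prev = blank
    for t, probs in enumerate(frame_probs):
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != blank and best != prev:
            # A new non-blank label starts a new phoneme segment.
            # Assumption: leading blank frames are attached to the first phoneme.
            start = t if segments else 0
            segments.append([best, start, t + 1, [probs[best]]])
        elif best != blank:
            # Repeated label without an intervening blank: same phoneme continues.
            segments[-1][2] = t + 1
            segments[-1][3].append(probs[best])
        elif segments:
            # Assumption: a blank frame extends the preceding phoneme.
            segments[-1][2] = t + 1
        prev = best
    phoneme_list = [{"phoneme": labels[s[0]], "confidence": sum(s[3]) / len(s[3])}
                    for s in segments]
    phoneme_ts = [[s[1] * frame_ms, s[2] * frame_ms] for s in segments]
    return phoneme_list, phoneme_ts


# Toy posteriors over the labels [blank, sil, w] for six frames.
labels = ["<blank>", "sil", "w"]
frames = [[0.1, 0.8, 0.1], [0.1, 0.9, 0.0], [0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8], [0.6, 0.1, 0.3], [0.2, 0.0, 0.8]]
phoneme_list, phoneme_ts = ctc_greedy_decode(frames, labels)
print(phoneme_ts)
```

The confidence of each phoneme is taken here as the mean posterior over its non-blank frames, which is one simple choice among several.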

[0076] Text phoneme recognition models can be trained in advance using pre-collected audio data. The specific training methods are existing technologies and will not be elaborated here.

[0077] A phoneme is the smallest unit of speech determined by the natural properties of sound. It is analyzed based on the articulation movements within a syllable; one movement constitutes one phoneme. Phonemes are divided into two main categories: vowels and consonants. Phonemes are generally represented using the International Phonetic Alphabet (IPA). The English IPA has 48 phonemes, including 20 vowel phonemes and 28 consonant phonemes.

[0078] It is worth noting that the phoneme tag in this embodiment also includes the silence symbol sil;

[0079] For example, this embodiment provides an example to illustrate the above solution:

[0080] Audio data of students' answers to the question "what's your age?" was recorded using a mobile app, assuming the audio duration was 4.2 seconds. The Fbank feature fb of the audio data was extracted using librosa. The specific code is shown below:

[0081]

[0082]
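A minimal sketch of such an extraction, using the three librosa calls named in [0073], might look as follows. The function name `extract_fbank`, the file path argument, and the parameter values (25 ms window, 10 ms hop, 80 Mel bands) are assumptions for illustration, not values taken from this embodiment.

```python
def extract_fbank(path, sr=16000, n_mels=80):
    """Return log-Mel (Fbank) features fb with shape [n_mels, T]."""
    import librosa  # third-party dependency, imported lazily

    # Load the recording as mono audio resampled to 16 kHz.
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Mel spectrogram energy on a linear scale (assumed 25 ms window, 10 ms hop).
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    # Convert the energy to a logarithmic (dB) scale to obtain the Fbank features.
    fb = librosa.power_to_db(mel)
    return fb
```

Because librosa returns features with shape [feature dimension, number of time frames], the transpose `fb.T` has the [T, D] shape expected by the model.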

[0083] The output obtained by inputting fb.T into the text phoneme recognition model is shown below:

[0084] {"text":"what's your age",

[0085] "phonemeList":[{"phoneme":"sil","confidence":0.9},{"phoneme":"w","confidence":0.87},{"phoneme":"ah","confidence":0.91},{"phoneme":"t","confidence":0.85},{"phoneme":"s","confidence":0.93},{"phoneme":"sil","confidence":0.9},{"phoneme":"y","confidence":0.88},{"phoneme":"ao","confidence":0.90},{"phoneme":"r","confidence":0.92},{"phoneme":"sil","confidence":0.9},{"phoneme":"ey","confidence":0.89},{"phoneme":"jh","confidence":0.94},{"phoneme":"sil","confidence":0.9}],

[0086] "phonemeTs":[[0,180],[180,300],[300,450],[450,600],[600,720],[720,2820],[2820, 2970],[2970,3120],[3120,3320],[3320,3620],[3620,3820],[3820,4070],[4070,4200]]}

[0087] Wherein, text represents the audio text; phonemeList represents the phoneme list, each element of which includes a phoneme and its confidence score, a floating-point value in [0, 1]; sil is the silence marker, indicating a pause; phonemeTs represents the phoneme time boundary list, whose elements correspond one-to-one with those of phonemeList. Each element [start, end] gives the time range of the phoneme with the same index, where start is the start time and end is the end time, both in ms;

[0088] The phoneme time boundary list is designed so that the boundaries of the phoneme time range are the same as those of the adjacent phonemes (i.e., the end time of the previous phoneme is equal to the start time of the next phoneme). This indicates a seamless transition between phonemes, which conforms to the smooth transition of acoustic characteristics. It also simplifies the alignment algorithm, balances acoustic realism and computational feasibility, and provides key information such as the number of phonemes per minute, pause duration, and number of pauses for subsequent fluency calculations.

[0089] The fluency analysis module is used to calculate the number of phonemes pronounced per minute, the pause duration, and the number of pauses based on the phoneme list and phoneme time boundary list, and to calculate the fluency score SP from these three quantities.

[0090] Fluency is calculated primarily from pronunciation speed, specifically the number of phonemes pronounced per minute, combined with pause duration and number of pauses. The total number of phonemes in the final recognized text (excluding silence markers) is divided by the total audio duration to obtain the pronunciation speed. The base score is then penalized according to speech speed to obtain the basic fluency score: a low speech speed indicates lower proficiency and lowers the score, while an excessively fast speech speed leads to unclear pronunciation and also lowers the score.

[0091] First, calculate the number of phonemes pronounced per minute: PPM = PN_num / (T_total / 60), where T_total is the audio duration in seconds and PN_num is the number of phonemes contained in the audio, obtained by removing the silence markers from the phoneme list (phonemeList) produced by the speech recognition module.

[0092] The formula for calculating the basic fluency score S is as follows:

[0093] S=100×[1-|(PPM-PPM') / PPM'|^α]

[0094] Wherein, PPM' is the optimal number of phonemes per minute; α is the speech rate sensitivity adjustment coefficient, which controls how the base score is penalized according to speech rate. The larger the value, the more pronounced the penalty for large deviations from the standard speech rate relative to small ones. The value of α varies depending on the scenario of the question; S takes values in [0, 100] and is expressed in points.

[0095] The value of the speech rate sensitivity adjustment coefficient α varies with the question scenario. In this embodiment, the scenarios include oral composition and situational speaking. In the oral composition scenario, a lower α value helps maintain a stable speech rate and ensures the effective transmission of information; α takes values in [1.1, 1.3]. In the situational speaking scenario, a higher α value encourages the speaker to adjust the speech rate to the situation; increasing α may reduce the penalty for speaking slightly too fast, accommodating different communication needs, and α takes values in [1.5, 1.7].
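The speech rate and basic fluency score can be sketched as follows, using the phoneme sequence from the example output above and its 4.2-second duration. The function names are illustrative, and the explicit clamping to [0, 100] is an assumption made so that S stays within its stated range when the speech rate exceeds twice the optimal value.

```python
def phonemes_per_minute(phonemes, total_seconds):
    """PPM = PN_num / (T_total / 60), ignoring the silence marker 'sil'."""
    pn_num = sum(1 for p in phonemes if p != "sil")
    return pn_num / (total_seconds / 60)


def basic_fluency(ppm, ppm_opt=650.0, alpha=1.6):
    """S = 100 * (1 - |(PPM - PPM') / PPM'| ** alpha), clamped to [0, 100]."""
    deviation = abs((ppm - ppm_opt) / ppm_opt)
    return max(0.0, min(100.0, 100.0 * (1.0 - deviation ** alpha)))


# Phoneme sequence of the "what's your age" example, lasting 4.2 seconds.
phonemes = ["sil", "w", "ah", "t", "s", "sil", "y", "ao", "r",
            "sil", "ey", "jh", "sil"]
ppm = phonemes_per_minute(phonemes, 4.2)
s = basic_fluency(ppm)  # situational speaking, alpha within [1.5, 1.7]
print(round(ppm, 1), round(s, 1))
```

For relative deviations below 1, raising α shrinks the penalty: at 585 phonemes per minute (10% below the optimum), α = 1.6 yields a higher S than α = 1.2.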

[0096] It is understood that the scenario of the problem described in the article is not intended to limit this application. The scenario of the problem can be expanded according to the actual situation, and the value of α can be adjusted according to the different expanded scenarios.

[0097] According to the Chinese Standards of English Language Ability, speaking speed is defined as follows:

[0098] Speech rate is relatively fast: approximately 140-180 words per minute;

[0099] Normal speaking speed: approximately 100-140 words per minute;

[0100] Speaks at a relatively slow pace: approximately 80-100 words per minute;

[0101] Speaks slowly: about 60-80 words per minute.

[0102] Each English word typically contains 4-5 phonemes. Assuming a speech rate of 100-180 words per minute, this translates to 400-900 phonemes per minute. Therefore, a speaking speed of 400-900 phonemes per minute is generally considered fluent. This embodiment therefore uses [600-700] phonemes per minute as the optimal range for the number of phonemes spoken per minute.

[0103] Preferably, in this embodiment, 650 phonemes / minute is taken as the optimal number of phonemes to be pronounced per minute;

[0104] It is worth noting that this embodiment does not specifically limit the range of the optimal number of phonemes per minute or the optimal number of phonemes per minute, and these can be adjusted according to different exam levels.

[0105] For example, the Graded Examination in Spoken English (GESE), jointly launched by the Beijing Education Examinations Authority and Trinity College London, is divided into 4 levels and 12 grades: Preparatory Level 1-3, Elementary Level 4-6, Intermediate Level 7-9, and Advanced Level 10-12.

[0106] For example, in the preparatory level, considering that learners may speak slowly due to being beginners, the range of the optimal number of phonemes per minute can be appropriately lowered, taking [550-650] phonemes / minute as the optimal range of the number of phonemes per minute; the optimal number of phonemes per minute is taken as 600 phonemes / minute.

[0107] At the elementary or intermediate level, where candidates are expected to answer questions fluently and proficiently, the optimal range for the number of phonemes per minute is [600-700] phonemes / minute; the optimal number of phonemes per minute is 650 phonemes / minute.

[0108] At the advanced level, candidates are expected to use various sentence structures and vocabulary fluently and skillfully and to answer questions at a near-rapid speaking speed, so the optimal range must be raised appropriately. Therefore, [650-800] phonemes / minute is taken as the optimal range for the number of phonemes per minute; the optimal number of phonemes per minute is 750 phonemes / minute.
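The level-dependent settings above can be collected in a small lookup, sketched here under the assumption that the exam level is identified by its GESE grade (1-12). The names `OPTIMAL_PPM` and `optimal_ppm` are illustrative.

```python
# Optimal phoneme rates per level, as listed in this embodiment (phonemes / minute).
OPTIMAL_PPM = {
    "preparatory": {"range": (550, 650), "optimal": 600},   # grades 1-3
    "elementary": {"range": (600, 700), "optimal": 650},    # grades 4-6
    "intermediate": {"range": (600, 700), "optimal": 650},  # grades 7-9
    "advanced": {"range": (650, 800), "optimal": 750},      # grades 10-12
}


def optimal_ppm(grade):
    """Return the optimal phonemes-per-minute value PPM' for a GESE grade (1-12)."""
    if not 1 <= grade <= 12:
        raise ValueError("GESE grade must be between 1 and 12")
    levels = ["preparatory", "elementary", "intermediate", "advanced"]
    return OPTIMAL_PPM[levels[(grade - 1) // 3]]["optimal"]


print(optimal_ppm(2), optimal_ppm(5), optimal_ppm(11))
```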

[0109] The pause duration between words is calculated using a phoneme list and a phoneme time boundary list. Then, the number of pauses exceeding 2 seconds, 3 seconds, or more is counted. The more and longer the pauses, the more points are deducted. The final fluency score is obtained by applying pause penalties to the base score.

[0110] The pause duration T_p is determined from the time boundaries in the phoneme time boundary list that correspond to the silence marker sil in the phoneme list (phonemeList).

[0111] The final fluency score formula is shown below:

[0112] SP = S × 0.85 - P × 0.15;

[0113] Where SP is the final fluency score; S is the basic fluency score; P is the pause penalty score. In the final fluency score formula, 0.85 is the speech rate weight, which reflects the dominant influence of speech rate on fluency, and 0.15 is the pause weight, which serves as an important correction factor that keeps extreme pauses from distorting the evaluation. Speech rate weight + pause weight = 1, and both weights can be adjusted according to the actual situation.

[0114] The formula for calculating the pause penalty score P is as follows:

[0115] P = β1 × N2 + β2 × N3 + β3 × T_p

[0116] Where N2 is the number of 2-second pauses, with an initial value of 0; N2 is incremented by 1 whenever the duration of a silence marker sil is greater than or equal to 2 seconds but less than 3 seconds. N3 is the number of 3-second pauses, with an initial value of 0; N3 is incremented by 1 whenever the duration of a silence marker sil is greater than or equal to 3 seconds. T_p is the total pause duration, determined from the time boundaries in the phoneme time boundary list that correspond to the silence markers in the phoneme list. β1 is the 2-second pause penalty coefficient, which is 5 points per pause in this embodiment; β2 is the 3-second pause penalty coefficient, which is 9 points per pause in this embodiment; β3 is the total pause duration penalty coefficient, which is 0.5 points per second in this embodiment.

[0117] The final fluency score formula achieves a multi-dimensional quantitative assessment of spoken fluency through non-linear speech rate scoring and a compound pause penalty mechanism. This preserves the dominance of the core speech rate indicator while avoiding the bias of a single indicator. The pause penalty score employs a dual penalty mechanism, penalizing both the number of long pauses and the total pause duration. Human raters assessing fluency consider both the frequency and the duration of pauses. For example, frequent short pauses (e.g., a 1-second pause after each sentence) disrupt coherence and sound abrupt, while infrequent long pauses (e.g., a 3-second pause) may distract the listener or suggest that the speaker has forgotten the content. By constraining N2 and N3 (the numbers of long pauses) together with T_p (the total pause duration), the dual penalty mechanism covers both typical patterns that disrupt fluency.

[0118] For example, if the time boundaries corresponding to "sil" in phonemeList are [0, 180], [720, 2820], [3320, 3620], and [4070, 4200] (in milliseconds), then the pause duration T_p = 180 + 2100 + 300 + 130 = 2710 ms = 2.71 s. The boundary [720, 2820] corresponds to a silence of 2.1 seconds, a longer pause that most likely reflects the candidate's hesitation. Therefore, a penalty should be imposed for excessively long silences.
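The pause penalty and final fluency score can be sketched as follows, using the silence boundaries from the example above and the coefficients of this embodiment (β1 = 5, β2 = 9, β3 = 0.5). The basic fluency score uses the formula S = 100 × [1 - |(PPM - PPM') / PPM'|^α] stated in claim 1. The function names, the clamping of S at 0, and the sample values PPM = 650 and α = 1.2 are assumptions made for illustration only.

```python
def pause_penalty(sil_boundaries_ms, beta1=5.0, beta2=9.0, beta3=0.5):
    """Return (P, N2, N3, T_p in seconds) for [start, end] silences in ms."""
    n2 = n3 = 0
    total_ms = 0
    for start, end in sil_boundaries_ms:
        duration = end - start
        total_ms += duration
        if duration >= 3000:      # pause of 3 seconds or more
            n3 += 1
        elif duration >= 2000:    # pause of at least 2 but under 3 seconds
            n2 += 1
    t_p = total_ms / 1000.0
    return beta1 * n2 + beta2 * n3 + beta3 * t_p, n2, n3, t_p


def base_fluency(ppm, ppm_opt, alpha):
    """Basic fluency score S; clamping at 0 is an assumption of this sketch."""
    deviation = abs((ppm - ppm_opt) / ppm_opt)
    return max(0.0, 100.0 * (1.0 - deviation ** alpha))


def fluency_score(ppm, ppm_opt, alpha, sil_boundaries_ms):
    """Final fluency score SP = S x 0.85 - P x 0.15."""
    s = base_fluency(ppm, ppm_opt, alpha)
    p, _, _, _ = pause_penalty(sil_boundaries_ms)
    return 0.85 * s - 0.15 * p


sil = [[0, 180], [720, 2820], [3320, 3620], [4070, 4200]]
penalty, n2, n3, t_p = pause_penalty(sil)
print(n2, n3, t_p, penalty)
print(fluency_score(ppm=650, ppm_opt=650, alpha=1.2, sil_boundaries_ms=sil))
```

Durations are compared in integer milliseconds so that the 2-second and 3-second thresholds are not affected by floating-point rounding.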

[0119] The semantic analysis module is used to obtain the semantic score SP2 of the audio text through a pre-trained semantic model. The input of the pre-trained semantic model is the audio text and the answer text. The semantic model performs word segmentation and embedding on the two texts respectively, and finally obtains two sentence vector representations. Then, the cosine similarity between the two sentence vectors is calculated, and the cosine similarity is multiplied by 100 to obtain the semantic score SP2 of the audio text. The answer text is obtained in advance based on the question.

[0120] The pre-trained semantic model is obtained by fine-tuning a pre-trained SentenceTransformer model on a large collection of standard similarity sentence pairs. Its principle is to segment and embed each text to obtain a vector representation, and then calculate the cosine similarity between the two sentence vectors. The cosine similarity value lies in [0, 1]; the closer the value is to 1, the more semantically similar the two sentences are.

[0121] The Sentence Transformer model is a deep learning model based on the Transformer architecture, specifically designed to convert sentences or text paragraphs into high-dimensional vector representations. It is widely used in semantic search, text similarity calculation, information retrieval, and other scenarios. Its core value lies in capturing the deep semantic features of text through vectorized representation, enabling computers to understand natural language more accurately.
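The scoring step of the semantic analysis module can be sketched as below. The sketch assumes the two sentence vectors have already been produced by the semantic model; the vectors shown are made-up placeholders, not real model output. Clamping negative similarities to 0, so that SP2 stays within [0, 100], is also an assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


def semantic_score(audio_vec, answer_vec):
    """Semantic score SP2 = cosine similarity x 100, clamped at 0."""
    return 100.0 * max(0.0, cosine_similarity(audio_vec, answer_vec))


# Placeholder vectors standing in for sentence embeddings.
audio_vec = [0.2, 0.7, 0.1]
answer_vec = [0.25, 0.65, 0.05]
print(semantic_score(audio_vec, answer_vec))
```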

[0122] The grammar analysis module is used to obtain the grammar score SP3 of the audio text through a pre-trained grammar model.

[0123] The RoBERTa model (Robustly Optimized BERT Pretraining Approach), released in July 2019 by Facebook AI (now Meta AI), is an optimized version of the BERT model. It keeps BERT's masked language modeling strategy, in which the model learns to predict intentionally masked portions of text, and is trained with larger mini-batches and learning rates. Compared to BERT, this improves the masked language modeling objective and yields better performance on downstream tasks. RoBERTa is a pre-trained model that performs well on the GLUE (General Language Understanding Evaluation) benchmark.

[0124] The pre-trained grammar model is obtained by fine-tuning the RoBERTa model on a large collection of grammar correction datasets. For each word, the model outputs whether the word is correct or requires an operation (such as insertion, replacement, or deletion), and the proportion of correct words is then used as the grammar score.

[0125] Specifically, the input to the pre-trained grammar model is the word sequence [wd1, wd2, ..., wdi] corresponding to the audio text, where i is the number of words in the audio text. The output of the pre-trained grammar model is a predicted label sequence [l1, l2, ..., li] of the same length as the word sequence. Each label li belongs to a predefined set of operation labels, such as {CORRECT, DELETE, REPLACE, INSERT_AFTER, ...}, which represents the model's grammatical judgment and suggested operation for the word wdi.

[0126] Traverse the predicted label sequence and count the number of words marked as CORRECT, N2. Divide N2 by the length of the word sequence corresponding to the audio text and multiply by 100 to obtain the grammar score of the audio text.
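The counting step in paragraph [0126] can be sketched as follows. The label sequence is a hand-written stand-in for the grammar model's output, and returning 0 for an empty sequence is an assumption.

```python
def grammar_score(labels):
    """Grammar score SP3: share of words labelled CORRECT, times 100."""
    if not labels:
        return 0.0  # assumption: empty audio text scores 0
    correct = sum(1 for label in labels if label == "CORRECT")
    return 100.0 * correct / len(labels)


# Hypothetical model output for the word sequence below.
words = ["she", "go", "to", "school"]
labels = ["CORRECT", "REPLACE", "CORRECT", "CORRECT"]
print(grammar_score(labels))  # 3 of 4 words are CORRECT -> 75.0
```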

[0127] The text matching module is used to match the audio text and the answer text using an edit distance-based text matching method to obtain a text matching score SP4; the edit distance-based text matching method is shown in the following formula:

[0128]

[0129] SP4 represents the text matching score; wd1_i represents the i-th word in the audio text; wd2_i represents the i-th word in the answer text; N is the total number of matched word pairs; ES(wd1_i, wd2_i) is the edit distance similarity function, representing the edit distance similarity between wd1_i and wd2_i, with values in [0, 1], where 0 indicates completely different and 1 indicates completely identical; w_i is the high-frequency word weight of wd2_i;

[0130] The edit distance-based text matching method includes the following steps:

[0131] 1) Segment the audio text and answer text to obtain the corresponding word segmentation lists, and use the Needleman-Wunsch algorithm to achieve global alignment of the two word segmentation lists;

[0132] The Needleman-Wunsch algorithm originated in bioinformatics, where it is used to align protein or DNA sequences. Published in 1970 by Saul B. Needleman and Christian D. Wunsch, it is one of the earliest applications of dynamic programming to the comparison of biological sequences: it decomposes a large problem into a series of smaller subproblems and reconstructs the solution to the larger problem from the solutions to the smaller ones. The algorithm is also known as the optimal matching algorithm or the global alignment technique, and it is still widely used for optimal global sequence alignment.

[0133] For example, the audio text segmentation list wd1: ["the","cat","sits"];

[0134] Answer text segmentation list wd2: ["a","cat","sat"];

[0135] Aligned word segmentation list: ["the / a","cat / cat","sits / sat"]

[0136] 2) Calculate the edit distance similarity of each pair of words using the EditDistance algorithm. The calculation formula is as follows:

[0137] ES(wd1_i, wd2_i) = 1 - EditDistance(wd1_i, wd2_i) / max(len(wd1_i), len(wd2_i))

[0138] Where EditDistance() is the edit distance algorithm; len() is the length function; max() is the maximum value function; ES(wd1_i, wd2_i) is the edit distance similarity function, representing the edit distance similarity between the i-th word wd1_i in the audio text segmentation list wd1 and the i-th word wd2_i in the answer text segmentation list wd2, with a value range of [0, 1], where 0 indicates completely different and 1 indicates completely the same; i = 1, ..., M, where M is the length of the aligned word segmentation list;

[0139] Edit distance, also known as Levenshtein distance, is a method for measuring the similarity between two strings. It was proposed by the Russian scientist Vladimir Levenshtein in 1965. It is defined as the minimum number of single-character editing operations required to transform one string into another, including inserting, deleting, or replacing a character. This is existing technology, and the specific implementation process of the EditDistance() algorithm will not be elaborated here.

[0140] 3) w_i is the high-frequency word weight of wd2_i; whether wd2_i is a high-frequency word determines the value of w_i. In this embodiment, binary weights are used: if wd2_i is a high-frequency word, then w_i = 0.5; if wd2_i is not a high-frequency word, then w_i = 1;

[0141] The identification of high-frequency words can be done in advance by determining a list of high-frequency words through a large amount of text statistics. This is an existing technology and will not be elaborated here.

[0142] 4) Aggregate the edit distance similarity and weight of all words in the word list, then take the average and multiply by 100 to obtain the final text matching score SP4.
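Steps 1) to 4) can be sketched end to end as below. Several details are assumptions rather than part of the disclosure: the alignment scoring values (match = 1, mismatch = -1, gap = -1), a similarity of 0 and a weight of 1 for a word aligned to a gap, the small HIGH_FREQ set, and the reading of step 4) as the sum of w_i × ES over all aligned pairs, divided by the number of pairs and multiplied by 100. The similarity is computed as 1 minus the edit distance divided by the longer word's length.

```python
HIGH_FREQ = {"a", "an", "the", "is", "to", "of"}  # hypothetical word list


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two word lists; None marks a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # trace back from the bottom-right cell
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def edit_distance(s, t):
    """Levenshtein distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def similarity(w1, w2):
    """Edit distance similarity ES in [0, 1]; 0 for a gap (assumption)."""
    if w1 is None or w2 is None:
        return 0.0
    longest = max(len(w1), len(w2))
    return 1.0 if longest == 0 else 1.0 - edit_distance(w1, w2) / longest


def text_matching_score(audio_words, answer_words):
    """Text matching score SP4 under the averaging assumption above."""
    pairs = needleman_wunsch(audio_words, answer_words)
    if not pairs:
        return 0.0
    total = 0.0
    for w1, w2 in pairs:
        weight = 0.5 if w2 in HIGH_FREQ else 1.0
        total += weight * similarity(w1, w2)
    return 100.0 * total / len(pairs)


wd1 = ["the", "cat", "sits"]
wd2 = ["a", "cat", "sat"]
print(needleman_wunsch(wd1, wd2))  # the/a, cat/cat, sits/sat as in [0135]
print(text_matching_score(wd1, wd2))
```

With these assumptions, a perfect match on a high-frequency word contributes only 0.5, so the score rewards correct content words more than function words.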

[0143] The pronunciation analysis module is used to calculate the average phoneme confidence of each word as the word pronunciation score by using the phoneme confidence of the phoneme list, and then calculate the average of all word pronunciation scores and multiply it by 100 to obtain the final pronunciation score SP5 of the audio data.

[0144] Specifically, the audio text is segmented to obtain a word list. For each word, the confidence scores of its corresponding phonemes in the phoneme list are averaged to give the word pronunciation score. The pronunciation scores of all words are then summed, divided by the length of the word list, and multiplied by 100 to obtain the final pronunciation score SP5 of the audio data.
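A minimal sketch of this averaging is shown below. It assumes the phoneme confidences have already been grouped by word (how phonemes are mapped to words is not shown here), that confidences lie in [0, 1], and that a word with no phonemes scores 0. The sample values are invented.

```python
def pronunciation_score(word_confidences):
    """Pronunciation score SP5 from per-word lists of phoneme confidences."""
    if not word_confidences:
        return 0.0  # assumption: empty audio text scores 0
    word_scores = []
    for confidences in word_confidences:
        if confidences:
            word_scores.append(sum(confidences) / len(confidences))
        else:
            word_scores.append(0.0)  # assumption: no phonemes, no credit
    return 100.0 * sum(word_scores) / len(word_scores)


# Hypothetical confidences for a two-word answer.
example = [[0.9, 0.8], [1.0, 0.6, 0.8]]
print(pronunciation_score(example))  # word averages 0.85 and 0.8, about 82.5
```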

[0145] The comprehensive evaluation module is used to dynamically adjust the weight of each score based on the length of the audio text using the fluency score SP, grammar score SP3, pronunciation score SP5, and completeness score SP6, and to calculate the final oral assessment score Score; the completeness score SP6 is the larger value between the semantic score SP2 and the text matching score SP4.

[0146] Score=a1(L)×SP+a2(L)×SP3+a3(L)×SP5+a4(L)×SP6

[0147] Where a1(L) is the dynamic weight of fluency score; a2(L) is the dynamic weight of grammar score; a3(L) is the dynamic weight of pronunciation score; a4(L) is the dynamic weight of completeness score; and L is the length of the audio text, i.e., the number of words in the audio text.

[0148] The fluency score (SP), grammar score (SP3), pronunciation score (SP5), and completeness score (SP6) are dynamically weighted based on the length of the audio text. Specifically, the weighting of each score is adjusted:

[0149] When the number of words L is greater than or equal to 1 and less than or equal to 5, it is judged as a short sentence. In this case, the pronunciation score (SP5) and completeness score (SP6) have higher weights because short sentences need to quickly convey the core information, and clear pronunciation and semantic accuracy are key. Grammar (score SP3) has a lower weight because the grammatical structure of short sentences is simple (such as "I'm eight"), and grammatical errors have less impact on comprehension.

[0150] When the number of words L is greater than 5 and less than or equal to 10, it is judged as a normal sentence.

[0151] In this case, the weights of the grammar score (SP3) and fluency score (SP) are increased because longer sentences require more complex grammatical structures and coherent expression; the weight of the pronunciation score (SP5) is moderately reduced, but comprehensibility still needs to be ensured.

[0152] A sentence is considered a long sentence when the number of words L is greater than 10.

[0153] Fluency score (SP) and grammar score (SP3) have the highest weights because long sentences need to avoid pauses and grammatical errors to ensure logical clarity; completeness score (SP6) has a lower weight because long sentences already contain more information, so the impact of semantic matching is weakened; the following dynamic weight allocation formula is finally obtained:

[0154]

[0155] The completeness score SP6 measures how well a student's answer matches the requirements of the question. Its core objective is to assess whether the answer completely and accurately conveys the core intent of the question. To avoid the limitations of relying solely on text matching (such as matching failures due to paraphrasing of the question), the completeness score uses the following rules:

[0156] SP6 = max(SP4, SP2)

[0157] SP4 is the text matching score, and SP2 is the semantic score.
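The length-dependent weighting and the completeness rule can be sketched as follows. The weight values in this sketch are hypothetical placeholders chosen only to follow the qualitative description in paragraphs [0149] to [0153]; they are not the values of the dynamic weight allocation formula. The function names are likewise illustrative.

```python
def dynamic_weights(num_words):
    """Return (a1, a2, a3, a4) for fluency, grammar, pronunciation, completeness.

    The numeric weights are placeholders, not the patent's values.
    """
    if num_words < 1:
        raise ValueError("the audio text must contain at least one word")
    if num_words <= 5:       # short sentence
        return (0.20, 0.10, 0.35, 0.35)
    if num_words <= 10:      # normal sentence
        return (0.25, 0.25, 0.25, 0.25)
    return (0.30, 0.30, 0.25, 0.15)  # long sentence


def final_score(num_words, sp, sp2, sp3, sp4, sp5):
    """Score = a1(L) x SP + a2(L) x SP3 + a3(L) x SP5 + a4(L) x SP6."""
    sp6 = max(sp4, sp2)  # completeness score
    a1, a2, a3, a4 = dynamic_weights(num_words)
    return a1 * sp + a2 * sp3 + a3 * sp5 + a4 * sp6


print(final_score(num_words=3, sp=80, sp2=60, sp3=90, sp4=70, sp5=85))
```

Taking the maximum of SP4 and SP2 means a paraphrased answer with a low text matching score can still earn a high completeness score through the semantic score.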

[0158] The fluency analysis module uses phoneme-level analysis and employs multi-dimensional quantitative assessment, dynamic parameter configuration, reasonable weight allocation, and precise penalty mechanisms to avoid the limitations of a single indicator and ensure the coordinated consideration of speech rate and pauses. This enhances the fairness and practicality of the assessment, conforms to authoritative language proficiency standards, and strengthens the authority and credibility of the assessment results. It achieves a scientific, scenario-based, and personalized assessment of spoken fluency, adapting to different language levels and scenarios.

[0159] The comprehensive evaluation module applies different scoring weights to responses of different text lengths, so that the evaluation results are closer to actual oral proficiency.

[0160] The pre-trained semantic model captures the deeper intent of an answer, avoiding misjudgments caused by differences in wording. The edit distance-based matching method quickly identifies completely identical answers. Combining semantic and text matching to judge completeness improves the accuracy and robustness of the evaluation.

[0161] Example 2

[0162] Referring to Figure 2, this embodiment provides a spoken language assessment method based on multi-dimensional factors. Building on the aforementioned spoken language assessment system based on multi-dimensional factors, the method includes the following steps:

[0163] S1. Collect voice data of users answering questions; the scenarios of the questions include oral composition and situational spoken language;

[0164] S2. Extract the Fbank features from the speech data and input the Fbank features into the pre-trained text phoneme recognition model to obtain the audio text, phoneme list and phoneme time boundary list corresponding to the speech data.

[0165] S3. Calculate the number of phonemes pronounced per minute, the pause duration, and the number of pauses based on the phoneme list and phoneme time boundary list. Then, calculate the fluency score SP from these parameters. The final fluency score formula is shown below:

[0166] SP = S × 0.85 - P × 0.15;

[0167] Where SP is the final fluency score; S is the basic fluency score; P is the pause penalty score; in the final fluency score formula, 0.85 is the speech rate weight and 0.15 is the pause weight; speech rate weight + pause weight = 1.

[0168] S4. Obtain the semantic score of the audio text through a pre-trained semantic model. The input of the pre-trained semantic model is the audio text and the answer text. The semantic model performs word segmentation and embedding on the two texts respectively, and finally obtains two sentence vector representations. Then, the cosine similarity between the two sentence vectors is calculated, and the cosine similarity is multiplied by 100 to obtain the semantic score SP2 of the audio text.

[0169] S5. Obtain the grammar score SP3 of the audio text through a pre-trained grammar model; the pre-trained grammar model is obtained by fine-tuning the RoBERTa model on a large number of grammar correction datasets.

[0170] S6. Match the audio text and the answer text using the text matching method based on edit distance to obtain the text matching score SP4;

[0171] S7. Calculate the average phoneme confidence of each word using the phoneme confidence of the phoneme list as the word pronunciation score, and then calculate the average of all word pronunciation scores and multiply it by 100 to obtain the final audio data pronunciation score SP5.

[0172] S8. The larger value between the semantic score SP2 and the text matching score SP4 is taken as the completeness score SP6. The weights of each score are dynamically adjusted according to the length of the audio text based on the fluency score SP, grammar score SP3, pronunciation score SP5, and completeness score SP6, and the final oral assessment score Score is calculated.

[0173] Score=a1(L)×SP+a2(L)×SP3+a3(L)×SP5+a4(L)×SP6

[0174] Where a1(L) is the dynamic weight of fluency score; a2(L) is the dynamic weight of grammar score; a3(L) is the dynamic weight of pronunciation score; a4(L) is the dynamic weight of completeness score; and L is the length of the audio text, i.e. the number of words in the audio text.

[0175] S9. Display the answer text and scores for each dimension on the application terminal; the scores for each dimension include fluency score SP, semantic score, grammar score SP3, pronunciation score SP5, completeness score SP6, and final oral assessment score Score.

[0176] The evaluation results are displayed on the interactive interface of the application, showing the answer text and scores for each dimension. Users can clearly see their evaluation scores for each dimension and use this information to improve their oral expression and enhance their oral expression skills.

[0177] This embodiment provides a spoken language assessment method based on multiple dimensions, including: collecting voice data of users answering questions; obtaining the audio text, phoneme list, and phoneme time boundary list corresponding to the voice data based on a pre-trained text phoneme recognition model; calculating the number of phonemes pronounced per minute, pause duration, and number of pauses based on the phoneme list and phoneme time boundary list, and calculating a fluency score; obtaining a semantic score of the audio text through a pre-trained semantic model; obtaining a grammatical score of the audio text through a pre-trained grammar model; matching the audio text and answer text using a text matching method based on edit distance to obtain a text matching score; obtaining a pronunciation score of the final audio data based on the phoneme confidence of the phoneme list; and a comprehensive evaluation module that dynamically adjusts the weights of each dimension score based on the length of the audio text and calculates the final spoken language assessment score. This embodiment's method comprehensively evaluates spoken language ability through five dimensions: fluency, semantics, grammar, text matching, and pronunciation, avoiding the limitations of a single indicator.

[0178] The method in this embodiment achieves comprehensiveness, fairness, accuracy, and efficiency in oral assessment through multi-dimensional evaluation, dynamic weight adjustment, semantic and text matching, and phoneme-level analysis. It is particularly suitable for scenarios such as education, examinations, and language learning, and can more scientifically reflect students' oral abilities.

[0179] Example 3

[0180] Referring to Figure 3, Figure 3 is a structural schematic diagram of a multi-dimensional factor-based oral assessment device according to this embodiment. The multi-dimensional factor-based oral assessment device 20 of this embodiment includes a processor 21, a memory 22, and a computer program stored in the memory 22 and executable on the processor 21. When the processor 21 executes the computer program, it implements the steps described in the above method embodiments. Alternatively, when the processor 21 executes the computer program, it implements the functions of each module / unit in the above-described device embodiments.

[0181] For example, the computer program can be divided into one or more modules / units, which are stored in the memory 22 and executed by the processor 21 to complete the present invention. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program in the device 20 for multi-dimensional factor-based oral assessment. For example, the computer program can be divided into the modules shown in Embodiment 2. The specific functions of each module are described in the working process of the device described in the above embodiments, and will not be repeated here.

[0182] The device 20 for multi-dimensional factor-based oral assessment may include, but is not limited to, a processor 21 and a memory 22. Those skilled in the art will understand that the schematic diagram is merely an example of the device 20 for multi-dimensional factor-based oral assessment and does not constitute a limitation on the device 20. It may include more or fewer components than illustrated, or combine certain components, or use different components. For example, the device 20 for multi-dimensional factor-based oral assessment may also include input / output devices, network access devices, buses, etc.

[0183] The processor 21 can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor. The processor 21 is the control center of the multi-dimensional factor-based oral assessment device 20, connecting all parts of the device 20 via various interfaces and lines.

[0184] The memory 22 can be used to store the computer programs and / or modules. The processor 21 implements various functions of the multi-dimensional factor-based oral assessment device 20 by running or executing the computer programs and / or modules stored in the memory 22 and calling the data stored in the memory 22. The memory 22 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created based on the use of the mobile phone (such as audio data, phonebook, etc.). In addition, the memory 22 may include high-speed random access memory, and may also include non-volatile memory, such as hard disk, internal memory, plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.

[0185] The modules / units integrated into the multi-dimensional oral assessment device 20, if implemented as software functional units and sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by the processor 21, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc.

[0186] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.

[0187] In this embodiment, the device can be a mobile phone, tablet, or computer.

[0188] The parts of this invention not described in detail are prior art. It will be apparent to those skilled in the art that this invention is not limited to the details of the above exemplary embodiments, and that the invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. Therefore, the embodiments should be regarded as exemplary and non-limiting in all respects, and are intended to encompass all changes falling within the meaning and scope of equivalents within this invention.

Claims

1. A spoken language assessment system based on multi-dimensional factors, characterized in that it includes the following modules:

a voice acquisition module, used to collect voice data of users answering questions, where the scenarios of the questions include oral composition and situational spoken language;

a speech recognition module, used to extract Fbank features from the voice data and input the Fbank features into a pre-trained text phoneme recognition model to obtain the audio text, phoneme list, and phoneme time boundary list corresponding to the voice data;

a fluency analysis module, used to calculate the number of phonemes pronounced per minute, the pause duration, and the number of pauses based on the phoneme list and phoneme time boundary list, and to calculate the fluency score SP from them; the fluency score SP is given by the following formula: SP = S × 0.85 - P × 0.15; where SP is the final fluency score, S is the basic fluency score, and P is the pause penalty score; the basic fluency score S is calculated as follows: S = 100 × [1 - |(PPM - PPM') / PPM'|^α]; where PPM is the number of phonemes pronounced per minute; PPM' is the optimal number of phonemes pronounced per minute; α is the speech rate sensitivity adjustment coefficient, whose value depends on the scenario of the question: in the oral composition scenario, α takes a value in [1.1, 1.3], while in the situational spoken language scenario, α takes a value in [1.5, 1.7]; the pause penalty score P is calculated as follows: P = β1 × N2 + β2 × N3 + β3 × T_p; where N2 is the number of 2-second pauses; N3 is the number of 3-second pauses; T_p is the total pause duration; β1 is the 2-second pause penalty coefficient; β2 is the 3-second pause penalty coefficient; β3 is the total pause duration penalty coefficient;

a semantic analysis module, used to obtain the semantic score SP2 of the audio text through a pre-trained semantic model, where the pre-trained semantic model is fine-tuned based on the pre-trained SentenceTransformer model;

a grammar analysis module, used to obtain the grammar score SP3 of the audio text through a pre-trained grammar model, where the pre-trained grammar model is obtained by fine-tuning the RoBERTa model;

a text matching module, used to match the audio text and the answer text using an edit distance-based text matching method to obtain a text matching score SP4;

a pronunciation analysis module, used to calculate the average phoneme confidence of each word as the word pronunciation score based on the phoneme confidence of the phoneme list, and then to calculate the average of all word pronunciation scores and multiply it by 100 to obtain the final pronunciation score SP5 of the audio data;

a comprehensive evaluation module, used to dynamically adjust the weight of each score according to the length of the audio text using a dynamic weight allocation formula, based on the fluency score SP, grammar score SP3, pronunciation score SP5, and completeness score SP6, and to calculate the final oral assessment score Score, where the completeness score SP6 is the larger of the semantic score SP2 and the text matching score SP4; the dynamic weight allocation formula is as follows: [formula not reproduced]; where a1(L) is the dynamic weight of the fluency score; a2(L) is the dynamic weight of the grammar score; a3(L) is the dynamic weight of the pronunciation score; a4(L) is the dynamic weight of the completeness score; and L is the length of the audio text, i.e., the number of words in the audio text.

2. The system according to claim 1, characterized in that the text phoneme recognition model adopts a Transformer-CTC architecture with the following structure: the input is the FBank features of the audio; the FBank features are downsampled and local features are extracted through two convolutional layers; a linear layer then projects the local features onto the input dimension of the Transformer; the result passes through multiple Transformer encoder layers, each containing a multi-head self-attention mechanism and a feedforward neural network, to capture high-level acoustic features with long-range dependencies; a linear layer then maps the high-level acoustic features to the phoneme label space, outputting the probability of each phoneme label for each time frame; finally, CTC decoding outputs the audio text, a phoneme list, and a phoneme time boundary list consisting of the time boundaries of each phoneme, where each element of the phoneme list includes the specific phoneme and its corresponding confidence score.

3. The system according to claim 1, characterized in that the basic fluency score S lies in the range [0, 100] and is expressed in points.

4. The system according to claim 1, characterized in that the input to the pre-trained semantic model is the audio text and the answer text; the semantic model performs word segmentation and embedding on each of the two texts to obtain two sentence vector representations; the cosine similarity between the two sentence vectors is then calculated and multiplied by 100 to obtain the semantic score SP2 of the audio text; the answer text is obtained in advance based on the question.
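The scoring step of claim 4 can be illustrated once the two sentence vectors are available. The sketch below assumes the vectors have already been produced by the semantic model and are passed in as plain lists of floats; the function names and the zero-vector handling are assumptions for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: a zero vector has no similarity
    return dot / (norm_u * norm_v)


def semantic_score(audio_vec, answer_vec):
    """Cosine similarity of the two sentence vectors, multiplied by 100."""
    return cosine_similarity(audio_vec, answer_vec) * 100
```

Cosine similarity can in principle be negative, so a score computed this way is not guaranteed to stay within [0, 100]; the claim does not state whether negative values are clipped.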

5. The system according to claim 1, characterized in that the input to the pre-trained grammar model is the word sequence [wd1, wd2, ..., wdi] corresponding to the audio text, where i is the number of words in the audio text; the output of the pre-trained grammar model is a predicted label sequence [l1, l2, ..., li] of the same length as the word sequence; each label belongs to a predefined set of operation labels, including CORRECT, DELETE, REPLACE, and INSERT_AFTER, and represents the model's grammatical judgment and suggested operation for the corresponding word; the predicted label sequence is traversed and the number N2 of words labeled CORRECT is counted; N2 is divided by the length of the word sequence corresponding to the audio text and multiplied by 100 to obtain the grammar score of the audio text.
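The counting step of claim 5 reduces to a few lines once the label sequence has been predicted. The sketch below takes the predicted labels as a list of strings; the function name and the choice to score an empty sequence as 0 are assumptions for this example.

```python
def grammar_score(predicted_labels):
    """Share of words labeled CORRECT, multiplied by 100.

    predicted_labels: one label per word, each drawn from
    CORRECT, DELETE, REPLACE, INSERT_AFTER.
    """
    if not predicted_labels:
        return 0.0  # assumption: no words means no grammar credit
    n_correct = sum(1 for label in predicted_labels if label == "CORRECT")
    return n_correct / len(predicted_labels) * 100
```

For example, a four-word response with one word labeled REPLACE and three labeled CORRECT gives 3 / 4 × 100 = 75.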

6. The system according to claim 1, characterized in that the edit distance-based text matching method is given by the following formula: [formula missing from the source text], where SP4 is the text matching score; wd1_i represents the i-th word in the audio text; wd2_i represents the i-th word in the answer text; N is the total number of matched word pairs; ES(wd1_i, wd2_i) is the edit distance similarity function, representing the edit distance similarity between wd1_i and wd2_i, with a value range of [0, 1], where 0 means completely different and 1 means completely identical; w_i is the high-frequency word weight of wd2_i.

7. The system according to claim 6, characterized in that the edit distance similarity function is as follows: [formula missing from the source text], where the formula uses the edit distance algorithm (its symbol is likewise missing from the source text); len() is the length function; max() is the maximum value function.

8. The system according to claim 1, characterized in that the final oral assessment score is calculated using the following formula: [formula missing from the source text].

9. A spoken language assessment method based on multi-dimensional factors, based on the spoken language assessment system based on multi-dimensional factors according to any one of claims 1-8, characterized in that it includes the following steps:
S1. Collect voice data of users answering questions; the scenarios of the questions include oral composition and situational spoken language.
S2. Extract the FBank features from the speech data and input the FBank features into the pre-trained text phoneme recognition model to obtain the audio text, phoneme list and phoneme time boundary list corresponding to the speech data.
S3. Calculate the number of phonemes, pause duration, and number of pauses per minute based on the phoneme list and phoneme time boundary list, and calculate the fluency score SP from these quantities.
S4. Obtain the semantic score SP2 of the audio text through a pre-trained semantic model; the pre-trained semantic model is obtained by fine-tuning the pre-trained SentenceTransformer model.
S5. Obtain the grammar score SP3 of the audio text through a pre-trained grammar model; the pre-trained grammar model is obtained by fine-tuning the RoBERTa model.
S6. Match the audio text against the answer text using an edit distance-based text matching method to obtain the text matching score SP4.
S7. Calculate the average phoneme confidence of each word from the phoneme confidences in the phoneme list as that word's pronunciation score; then average the pronunciation scores of all words and multiply the result by 100 to obtain the final pronunciation score SP5 of the audio data.
S8. Dynamically adjust the weights of the fluency score SP, the grammar score SP3, the pronunciation score SP5 and the completeness score SP6 according to the length of the audio text, and calculate the final oral assessment score Score; the completeness score SP6 is the larger of the semantic score SP2 and the text matching score SP4.
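Step S8 can be sketched as a length-dependent weighted combination of the four scores. Neither the final score formula nor the dynamic weight formula is reproduced in this text, so the weighted-sum form below is an assumption, and the weights are supplied by the caller through a hypothetical `weight_fn` rather than computed from the patent's formula.

```python
def final_score(fluency, semantic, grammar, text_match, pronunciation,
                num_words, weight_fn):
    """Combine the dimension scores into one overall score.

    weight_fn: hypothetical callable mapping the word count L to the four
    weights (a1, a2, a3, a4) for fluency, grammar, pronunciation and
    completeness. The patent's own weight formula is not shown here.
    """
    # Completeness is the larger of the semantic and text matching scores.
    completeness = max(semantic, text_match)
    a1, a2, a3, a4 = weight_fn(num_words)
    # Assumed form: a plain weighted sum of the four dimension scores.
    return (a1 * fluency + a2 * grammar
            + a3 * pronunciation + a4 * completeness)


def equal_weights(num_words):
    """Placeholder weights that ignore the text length (illustration only)."""
    return (0.25, 0.25, 0.25, 0.25)
```

Passing the weights in as a function keeps the combination step separate from whatever length-dependent rule is actually used, so the placeholder can be swapped for the real formula without touching the scoring code.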