Speech recognition-based oral English test system for higher vocational colleges
By constructing a corpus organized by professional domain and applying multi-dimensional assessment, the system addresses inconsistent scoring and the lack of data-driven teaching support in vocational college English oral assessment. It provides efficient, accurate oral assessment with personalized feedback and supports data-driven teaching in vocational education.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- JIAN COLLEGE
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-12
AI Technical Summary
Existing English oral assessment technologies are insufficient to meet the needs of large-scale class assessments in higher vocational education. They suffer from delayed feedback, inconsistent scoring standards, a lack of effective modeling of accents and professional terminology among higher vocational students, difficulty in ensuring the fairness and reliability of assessment results, and a lack of data-driven teaching support.
We construct a corpus categorized by professional field, use an end-to-end deep learning model with Conformer architecture for speech recognition, and automatically score speech based on five dimensions: pronunciation accuracy, fluency, grammatical correctness, vocabulary professionalism, and content relevance through a multi-dimensional evaluation unit. This generates personalized feedback, supports data storage and analysis, and produces teaching reports.
It improves the accuracy and applicability of assessments, reduces teachers' workload, provides precise and personalized learning suggestions, realizes data-driven teaching intervention, and adapts to the professional scenario needs of higher vocational education.
Figure CN122201095A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence speech recognition, and more specifically, to a speech recognition-based English oral assessment system for higher vocational colleges.
Background Technology
[0002] Vocational education aims to cultivate highly skilled technical personnel, with a particular emphasis on industry-education integration and the development of students' workplace English communication abilities. English oral proficiency is a crucial indicator of students' professional competence and employment competitiveness; therefore, English oral assessment is a core component of vocational English teaching. However, existing English oral assessment technologies have significant shortcomings in their application within vocational education.
[0003] Traditional methods of manual interviews or teacher grading are ill-suited to the assessment needs of large-scale vocational college classes. Feedback is often delayed, and grading standards are difficult to standardize, leading to significant subjectivity. On average, a teacher spends over 8 hours manually assessing 50 students, and the consistency coefficient of the grading is usually below 0.7, making it difficult to guarantee the fairness and reliability of the assessment results.
[0004] In the field of automated assessment technology, several related patents have been published. For example, patent CN106847260A discloses "an automatic scoring method for spoken English based on feature fusion." This method mainly extracts and fuses multiple speech features to achieve automatic scoring of spoken English. However, the corpus of this method is usually built on standard American English or general English reading data, and has not been specifically optimized for the accent characteristics of Chinese vocational college students. In particular, it lacks effective modeling of common accent errors of vocational college students (such as alveolar-consonant confusion involving /θ/ and /ð/, and loss of word-final consonants) and of stress deviation in professional terms. Its assessment questions are mainly limited to fixed text reading, making it difficult to simulate real workplace interaction scenarios such as tourist attraction explanations, troubleshooting of electromechanical equipment, agricultural and forestry technical guidance, intelligent construction communication, postoperative nursing consultation, explanation of artificial intelligence solutions, and live-streaming e-commerce in the digital economy.
[0005] For example, patent CN107818795B discloses "a method and device for assessing spoken English." This technology collects users' voice data, converts it into text, and annotates the spoken language habit text data. It combines text features and acoustic features to obtain information about users' spoken language habits, and then assesses their spoken English proficiency. This method can reflect fluency and sentence organization ability to some extent, but its scoring dimensions are still mainly focused on pronunciation accuracy and fluency, failing to effectively assess the accuracy and relevance of students' use of professional vocabulary in specific workplace scenarios. At the same time, its feedback is relatively simple, mostly consisting of a total score or a simple grade assessment, and cannot provide students with targeted personalized improvement suggestions such as phoneme-level error correction and professional terminology recommendations.
[0006] Furthermore, existing technologies (including the aforementioned patents) are typically only used as one-time assessment tools, lacking structured storage and multi-dimensional statistical analysis capabilities for assessment data. The systems cannot aggregate and analyze individual student assessment data to create competency profiles for classes or grades, making it difficult for teachers to conduct data-driven, precise teaching interventions, and the teaching workload remains heavy. Statistics show that teachers spend an average of over 40 hours per semester on oral assessment and feedback.
[0007] Therefore, we propose an improved speech recognition-based English oral assessment system for higher vocational colleges.
Summary of the Invention
[0008] The purpose of this invention is to provide a speech recognition-based English oral assessment system for higher vocational colleges, which realizes professional scenario-based, interactive, multi-dimensional automatic assessment and teaching data closed loop, and supports the cultivation of talents through industry-education integration in higher vocational colleges.
[0009] To achieve the above objective and address the aforementioned problems, this invention provides the following speech recognition-based English oral assessment system for higher vocational colleges.
[0010] The application is as follows:
[0011] A speech recognition-based vocational college English oral assessment system includes:
[0012] A voice acquisition unit is configured to acquire spoken English voice signals from vocational college students. The voice acquisition unit is deployed on a client device, which may include a mobile terminal, a personal computer, or a dedicated intelligent voice acquisition device. The voice acquisition unit has built-in noise reduction and voice activity detection algorithms to extract effective voice segments in noisy environments.
[0013] A speech recognition unit, connected to the speech acquisition unit, is configured to preprocess and recognize the speech signal, and output the corresponding text sequence and acoustic features;
[0014] The vocational college English major corpus unit is configured to store standard spoken language data categorized by vocational college major fields. The standard spoken language data includes standard pronunciation audio, reference text, professional vocabulary, and scenario templates. The vocational college major fields include seven major fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy.
[0015] The multi-dimensional assessment unit is connected to the speech recognition unit and the vocational college English major corpus unit, respectively. It is configured to automatically score the oral English of vocational college students based on the text sequence, acoustic features and the standard spoken language corpus, from five dimensions: pronunciation accuracy, fluency, grammatical correctness, vocabulary professionalism and content relevance.
[0016] The feedback generation unit is connected to the multi-dimensional evaluation unit and is configured to generate personalized improvement suggestions and learning paths based on the scoring results.
[0017] The data storage and analysis unit is connected to the multi-dimensional assessment unit and the feedback generation unit, respectively, and is configured to store all assessment records, perform statistical analysis on the data, and generate teaching analysis reports for teachers to use.
[0018] As a preferred implementation, the speech recognition unit employs an end-to-end deep learning model with a Conformer architecture. This model is first pre-trained on a large-scale general English corpus, and then fine-tuned using a corpus of spoken English from vocational college students to accommodate non-native accents. The corpus covers seven major vocational college professional fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy, and includes annotations for errors in professional terminology stress and typical accent errors involving alveolar consonants. The acoustic features include Mel-frequency cepstral coefficients, fundamental frequency, formants, and posterior phoneme probabilities.
[0019] As a preferred implementation, the vocational college English major corpus unit is further configured to support dynamic updates, enabling the addition or modification of the standard spoken language corpus through a web interface based on industry development and teacher needs, and triggering incremental updates to the relevant models in the speech recognition unit or the multi-dimensional assessment unit.
[0020] As a preferred embodiment, the multi-dimensional evaluation unit includes:
[0021] The pronunciation accuracy submodule is configured to calculate phoneme-level pronunciation scores using the GOP (Goodness of Pronunciation) algorithm based on the aforementioned acoustic features. The calculation formula is as follows:
[0022] $\mathrm{GOP}(p) = \frac{1}{T} \log \frac{P(p \mid O)}{P(p)}$;
[0023] where $p$ is the phoneme, $O$ is the observed acoustic feature sequence, $T$ is the number of frames, $P(p \mid O)$ is the posterior probability of the phoneme, and $P(p)$ is the prior probability of the phoneme;
[0024] The fluency submodule is configured to calculate speech rate, average pause duration, and number of pauses based on the timestamp information of the text sequence, and generate a fluency score.
[0025] The grammar correctness submodule is configured to perform syntactic analysis on the text sequence, detect grammatical errors, and generate a grammar correctness score.
[0026] The vocabulary specialization submodule is configured to match the text sequence with the professional vocabulary list, calculate the coverage and accuracy of professional vocabulary, and generate a vocabulary specialization score.
[0027] The content relevance submodule is configured to perform semantic similarity calculation between the text sequence and the scene template to generate a content relevance score.
[0028] In a preferred embodiment, the content relevance submodule uses the Sentence-BERT model to encode the text sequence and the scene template into semantic vectors, respectively, and obtains the content relevance score by calculating the cosine similarity between the two vectors. The calculation formula is as follows: $\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$;
[0029] where A is the semantic vector of the vocational college student's text sequence, and B is the semantic vector of the scene template.
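The cosine similarity between the two semantic vectors can be sketched in a few lines of Python. This is a minimal illustration only: the short toy vectors stand in for Sentence-BERT embeddings, and the function name and the handling of all-zero vectors are assumptions not taken from the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero vector is treated as unrelated to anything.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real sentence embeddings have hundreds of dimensions.
answer_vec = [1.0, 2.0, 2.0]      # hypothetical embedding of the student's answer
template_vec = [2.0, 4.0, 4.0]    # hypothetical embedding of the scene template
print(cosine_similarity(answer_vec, template_vec))
```

Vectors that point in the same direction score 1 regardless of their length, which is why the measure suits comparing a short answer with a longer template.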
[0030] In one preferred embodiment, the vocabulary specialization submodule calculates the coverage and accuracy of specialized vocabulary according to the formula:
[0031] ;
[0032] which yields the vocabulary professionalism score; the two variables in the formula denote the coverage of professional vocabulary and the accuracy of professional terminology, respectively.
[0033] As a preferred implementation, the feedback generation unit supports multiple feedback formats, including text feedback, voice feedback generated through speech synthesis, and visual charts. Using a collaborative filtering algorithm, the feedback generation unit recommends question types and micro-lesson resources for the next learning stage, based on the current student's weak dimensions (as indicated by the per-dimension scores) and on the historical resource selections of students with similar learning trajectories.
[0034] As a preferred implementation, the data storage and analysis unit can generate oral communication ability statistical reports for classes or individuals, heat maps of weaknesses in various dimensions, and progress trend curves, and supports teachers to query, export, and perform teaching intervention operations.
[0035] As a preferred implementation, the system supports various oral assessment question types, including reading aloud, retelling, role-playing, free expression, and guided dialogue, to simulate interactive workplace scenarios.
[0036] A method for assessing spoken English in vocational colleges based on speech recognition, applied to any of the systems described above, includes the following steps:
[0037] Step S1: Collect the spoken English voice signals of vocational college students through the voice acquisition unit;
[0038] Step S2: The speech recognition unit preprocesses and performs speech recognition on the speech signal, and outputs the corresponding text sequence and acoustic features;
[0039] Step S3: The multi-dimensional assessment unit retrieves standard spoken language data corresponding to the current assessment scenario from the vocational college English major corpus unit. Combining the text sequence and acoustic features, it automatically scores the spoken language of vocational college students from five dimensions: pronunciation accuracy, fluency, grammatical correctness, vocabulary professionalism, and content relevance, generating scores for each dimension and a total score.
[0040] Step S4: The feedback generation unit matches the preset feedback template with the scores of each dimension and the total score to generate personalized improvement suggestions and learning paths in the form of text, voice or visual graphs.
[0041] Step S5: The data storage and analysis unit will store the voice signals, text sequences, scores for each dimension, and feedback suggestions from this evaluation in a structured manner, and perform statistical analysis on historical evaluation data;
[0042] Step S6: The teacher calls the statistical analysis results generated by the data storage and analysis unit to obtain the teaching analysis report for the class or individual, thus completing the closed-loop process of oral assessment and teaching feedback.
[0043] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0044] In the scheme of this application:
[0045] 1. By constructing a scenario corpus categorized by professional field and fine-tuning the speech recognition model using a massive dataset of vocational college students, the system effectively addresses the issues of general systems being out of touch with workplace scenarios and having low recognition rates for Chinglish. Real-world testing data shows that in complex scenarios containing professional terminology and interference from non-native accents, the system's accuracy in recognizing core content is more than 20% higher than that of general models (such as the iFlytek Suntone general version). The corpus covers seven popular vocational college professional fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy, aligning with current trends in vocational education and thus having a wider range of applications.
[0046] 2. For the first time, the assessment dimensions have been expanded from the traditional two dimensions (pronunciation and fluency) to five dimensions (pronunciation, fluency, grammar, vocabulary, and content). In particular, the introduction of "vocabulary specialization" and "content relevance" makes the assessment results more accurately reflect students' workplace English application ability. Furthermore, in-depth diagnosis can pinpoint specific phoneme errors and missing professional terms, providing a solid foundation for personalized learning.
[0047] 3. The system is not only an assessment tool but also a teaching data analysis platform. Through data storage and analysis units, individual assessment data is aggregated into large-scale teaching data for classes and grades, providing teachers with visualized and actionable learning reports. This frees teachers from tedious manual scoring, allowing them to focus their energy on data-driven, precise teaching interventions. Practice shows that this system can reduce teachers' workload in oral assessment and feedback by more than 85%.
[0048] 4. The dynamic update mechanism of the corpus allows schools and enterprises to import new corpora at any time according to the latest industry needs, ensuring that the assessment content remains synchronized with actual workplace situations. The system supports teachers to upload new scenario corpora through a web interface. After automatic format checking and manual review, the corpus can be added to the database and the incremental updates of related models can be completed within 24 hours, truly serving the talent training goals of industry-education integration.
[0049] 5. The system supports various oral assessment question types, including reading aloud, retelling, role-playing, free expression, and guided dialogue. It can simulate interactive workplace scenarios in seven major fields, such as tourism services, electromechanical equipment operation and maintenance, agricultural and forestry technology communication, intelligent construction scheme explanation, nursing consultation, artificial intelligence technology exchange, and digital economy business negotiation. This allows students to experience real workplace communication during the assessment and improve their learning outcomes.
Attached Figure Description
[0050] Figure 1 is a schematic diagram of the overall structure of the present invention.
[0051] Figure 2 is an internal flowchart of the multi-dimensional evaluation unit of the present invention.
[0052] Figure 3 is a schematic diagram of the nursing professional consultation and assessment interface of the present invention.
[0053] Figure 4 is a flowchart illustrating the dynamic corpus update process of the present invention.
Detailed Implementation
[0054] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0055] Example 1
[0056] As shown in Figure 1, this embodiment provides a speech recognition-based English oral assessment system for vocational colleges, which mainly comprises two parts: a client and a cloud server.
[0057] The voice acquisition unit is located on the client side and collects the user's spoken English voice signal in real time through the device's microphone. The unit has built-in noise reduction algorithm and voice activity detection algorithm, which can effectively extract the user's voice in noisy environments such as classrooms and training rooms, and automatically detect the start and end points of the voice. After packaging the effective voice segments, it uploads them to the cloud server. In this embodiment, the noise reduction algorithm can be, for example, spectral subtraction, and the voice activity detection algorithm can be, for example, a dual threshold detection method based on energy and zero-crossing rate.
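A dual-threshold detector of the kind mentioned above can be sketched as follows. This is a simplified illustration, not the embodiment's implementation: the function names, the threshold values, and the rule for combining energy and zero-crossing rate are all assumptions. A frame whose energy exceeds the high threshold seeds a speech segment, and the segment is then extended in both directions while neighbouring frames pass the looser test (energy above the low threshold, or a high zero-crossing rate, which is typical of weak unvoiced consonants).

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_speech(frames, energy_high, energy_low, zcr_threshold):
    """Return a list of (start, end) frame indices, end exclusive."""
    energies = [frame_energy(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]

    def passes_loose_test(i):
        return energies[i] >= energy_low or zcrs[i] >= zcr_threshold

    segments = []
    i, n = 0, len(frames)
    while i < n:
        if energies[i] >= energy_high:
            start = i
            while start > 0 and passes_loose_test(start - 1):
                start -= 1          # extend backwards to the speech onset
            end = i + 1
            while end < n and passes_loose_test(end):
                end += 1            # extend forwards to the speech offset
            segments.append((start, end))
            i = end
        else:
            i += 1
    return segments


# Hypothetical toy input: silence, a quiet onset, loud speech, a quiet tail, silence.
silence = [0.0, 0.0, 0.0, 0.0]
quiet = [0.2, 0.2, 0.2, 0.2]
loud = [1.0, -1.0, 1.0, -1.0]
frames = [silence, silence, quiet, loud, loud, quiet, silence, silence]
print(detect_speech(frames, energy_high=0.5, energy_low=0.01, zcr_threshold=0.9))
```

In practice the thresholds would be calibrated on background noise measured at the start of a recording rather than fixed by hand.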
[0058] The speech recognition unit is deployed on a cloud server and receives speech signals from the speech acquisition unit. As shown in Figure 2, the speech recognition unit includes a preprocessing module, a feature extraction module, and a deep learning model module.
[0059] The preprocessing module performs pre-emphasis, framing, and windowing on the speech signal. For example, the pre-emphasis coefficient can be set to 0.97, the frame length can be set to 25ms, the frame shift can be set to 10ms, and the Hamming window function can be selected.
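The preprocessing chain described above (pre-emphasis, framing, Hamming windowing) can be sketched in plain Python using the parameter values given in this paragraph. The 16 kHz sample rate, the function names, and the choice to drop a trailing partial frame are assumptions made for the example.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def hamming_window(length):
    """Symmetric Hamming window of the given length."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]


def frame_signal(signal, sample_rate, frame_ms=25, shift_ms=10):
    """Split the signal into overlapping, Hamming-windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = hamming_window(frame_len)
    frames = []
    # Assumption: a trailing partial frame is dropped rather than zero-padded.
    for start in range(0, len(signal) - frame_len + 1, shift):
        frames.append([signal[start + i] * window[i] for i in range(frame_len)])
    return frames


sample_rate = 16000                      # assumed sample rate
one_second = [1.0] * sample_rate         # placeholder signal
frames = frame_signal(pre_emphasis(one_second), sample_rate)
print(len(frames), len(frames[0]))       # frame count and samples per frame
```

At 16 kHz, a 25 ms frame holds 400 samples and a 10 ms shift advances 160 samples, so consecutive frames overlap by 240 samples.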
[0060] The feature extraction module extracts 80-dimensional FilterBank features and simultaneously calculates auxiliary features such as fundamental frequency and formants.
[0061] The deep learning model module adopts an end-to-end model with the Conformer architecture. It is first pre-trained on the LibriSpeech corpus, and then fine-tuned using domain data, such as 156,000 hours of spoken English data of Chinese vocational college students collected from 32 vocational colleges across the country. This fine-tuned corpus covers seven major vocational college professional fields, including tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy. The model finally outputs text sequences, word-level timestamps, phoneme-level alignment, and phoneme posterior probabilities.
[0062] The vocational college English major corpus is deployed in the cloud and stored using a distributed database. The core of the corpus is a standard corpus, which is classified into seven major fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy. Each major field is further subdivided into specific workplace scenarios. For example, the tourism category includes scenic spot explanations, hotel services, and tour guide communication; the electromechanical category includes equipment operation instructions and troubleshooting dialogues; the agriculture and forestry category includes crop planting technology exchanges and agricultural product sales negotiations; the intelligent construction category includes construction plan communication and project acceptance dialogues; the nursing category includes consultation communication and nursing guidance; the artificial intelligence category includes technical solution explanations and algorithm exchanges; and the digital economy category includes e-commerce operation communication and data analysis reporting. Each scenario includes a standard scenario template, reference text, standard pronunciation audio, professional vocabulary, and evaluation points.
[0063] The corpus supports dynamic updates. For example, teachers or industry experts can upload new scenario corpora through a secure web interface. After automatic format checks and manual review, they can be automatically added to the assessment question bank and trigger incremental updates to the relevant models in the speech recognition unit or multi-dimensional assessment unit, ensuring that the teaching content keeps pace with industry development.
[0064] As shown in Figure 2, the multi-dimensional assessment unit contains five submodules that work together to complete multi-dimensional scoring.
[0065] The pronunciation accuracy submodule uses the GOP algorithm for phoneme-level scoring. The specific steps are as follows: use the phoneme-level timestamps and posterior phoneme probabilities output by the speech recognition unit; obtain the prior probability of each phoneme for standard pronunciation from the corpus; calculate the GOP score of each phoneme according to the GOP formula; and take a weighted average of the GOP scores of all phonemes to obtain the final pronunciation accuracy score.
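As an illustration of these steps, the following Python sketch computes a posterior-based GOP value per phoneme and a weighted average over an utterance. It assumes the form GOP(p) = (1/T) * log(P(p|O) / P(p)), with the segment posterior approximated as the product of the per-frame posteriors. The function names, the epsilon floor, and the toy probabilities are hypothetical, and the mapping from GOP values to the 0-100 scale used in the examples is not specified in the text and is left out.

```python
import math

EPS = 1e-10  # hypothetical floor that keeps log() finite for zero posteriors


def gop_score(frame_posteriors, prior):
    """GOP for one phoneme from its per-frame posteriors and its prior."""
    if not frame_posteriors:
        raise ValueError("a phoneme must span at least one frame")
    if not 0.0 < prior <= 1.0:
        raise ValueError("prior must be in (0, 1]")
    num_frames = len(frame_posteriors)
    # Assumption: log P(p|O) is the sum of the per-frame log posteriors.
    log_posterior = sum(math.log(max(p, EPS)) for p in frame_posteriors)
    return (log_posterior - math.log(prior)) / num_frames


def utterance_gop(phonemes, weights=None):
    """Weighted average of per-phoneme GOP values.

    phonemes: list of (frame_posteriors, prior) pairs.
    weights:  optional per-phoneme weights; equal weights if omitted.
    """
    scores = [gop_score(posteriors, prior) for posteriors, prior in phonemes]
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Toy example with made-up probabilities for two phonemes.
phonemes = [([1.0], 0.5), ([0.5, 0.5], 0.25)]
print(utterance_gop(phonemes))
```

Under this form, a confidently recognized phoneme scores above zero, while weak frame posteriors pull the value down, more strongly so for longer segments.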
[0066] The fluency submodule calculates metrics such as speech rate, average pause duration, and number of pauses based on word-level timestamps, and obtains a fluency score through weighted fusion.
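The three fluency metrics can be derived from word-level timestamps as in the sketch below. The 0.3-second minimum gap that counts as a pause, the function name, and the toy timestamps are assumptions. The weighted fusion into a single fluency score is omitted because the text does not give the weights.

```python
def fluency_metrics(words, min_pause=0.3):
    """Compute speech rate and pause statistics.

    words: list of (word, start_seconds, end_seconds), sorted by start time.
    min_pause: hypothetical minimum silent gap (seconds) counted as a pause.
    """
    if not words:
        raise ValueError("at least one word is required")
    duration = words[-1][2] - words[0][1]
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    return {
        "words_per_minute": len(words) / duration * 60 if duration > 0 else 0.0,
        "pause_count": len(pauses),
        "average_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }


# Made-up timestamps for a three-word utterance.
words = [("hello", 0.0, 0.5), ("mister", 0.5, 0.75), ("smith", 1.25, 2.0)]
print(fluency_metrics(words))
```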
[0067] The grammatical correctness submodule uses a BERT-based syntactic analysis model to perform dependency parsing on user text, identify common grammatical errors such as subject-verb disagreement, tense errors, and missing articles, and deduct points based on the number and severity of errors to obtain a grammatical correctness score.
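The deduction step can be illustrated with a small sketch that takes the errors already found by a parser and subtracts points by severity. The severity labels, the penalty values, and the base score of 100 are hypothetical, since the text does not specify them, and the error detection itself (the BERT-based parser) is outside the scope of this example.

```python
# Hypothetical penalty per severity level; the text does not give actual values.
SEVERITY_PENALTY = {"minor": 2, "moderate": 5, "major": 10}


def grammar_score(errors, base=100):
    """Deduct points for each detected error, never going below zero.

    errors: list of (error_type, severity) pairs produced by an upstream parser.
    """
    penalty = sum(SEVERITY_PENALTY[severity] for _error_type, severity in errors)
    return max(0, base - penalty)


detected = [("subject-verb disagreement", "major"), ("missing article", "minor")]
print(grammar_score(detected))
```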
[0068] The vocabulary specialization submodule extracts the professional vocabulary list corresponding to the current assessment task from the vocational college English major corpus unit, matches the user text sequence with the professional vocabulary list, calculates the coverage and accuracy of professional vocabulary, and obtains a comprehensive vocabulary specialization score.
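A minimal sketch of the matching step is shown below. It makes several assumptions: coverage is the share of listed terms that appear in the answer, accuracy is the share of used terms not flagged as misused, matching is a naive case-insensitive substring test, and the set of misused terms is supplied by some upstream check. The formula that combines the two values into one score is not reproduced in the text, so the sketch returns them separately.

```python
def vocabulary_metrics(text, term_list, misused_terms=()):
    """Return (coverage, accuracy) of professional vocabulary in the text.

    text:          recognized text of the student's answer
    term_list:     professional terms expected for the current task
    misused_terms: terms that were used but judged incorrect (hypothetical input)
    """
    normalized = " ".join(text.lower().split())
    used = [term for term in term_list if term.lower() in normalized]
    coverage = len(used) / len(term_list) if term_list else 0.0
    misused = set(misused_terms)
    correct = [term for term in used if term not in misused]
    accuracy = len(correct) / len(used) if used else 0.0
    return coverage, accuracy


answer = "Please press the pain relief button and keep your leg elevated to reduce swelling."
terms = ["pain relief button", "elevated", "swelling", "pain level"]
print(vocabulary_metrics(answer, terms))
```

Substring matching is only a starting point; a production matcher would likely need tokenization and lemmatization so that inflected forms of a term are also recognized.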
[0069] The content relevance submodule uses the Sentence-BERT model to encode user text sequences and standard scene template texts in the corpus into semantic vectors, calculates cosine similarity, and combines keyword matching technology to ensure that the core points are covered, and finally obtains the content relevance score.
[0070] Finally, the system weights and sums the scores of the five dimensions according to the default weights to obtain the total score of this oral assessment.
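The weighted sum can be sketched as follows. The default weights are not given in the text, so equal weights are assumed here, and the dimension keys are illustrative names. The sample scores are the five dimension scores reported in Example 2; the resulting total is only what this equal-weight assumption produces, not a figure stated in the source.

```python
DIMENSIONS = ("pronunciation", "fluency", "grammar", "vocabulary", "relevance")


def total_score(scores, weights=None):
    """Weighted average of the five dimension scores.

    weights: mapping from dimension to weight; equal weights are assumed if omitted.
    """
    if weights is None:
        weights = {dimension: 1.0 for dimension in DIMENSIONS}
    weight_sum = sum(weights[d] for d in DIMENSIONS)
    return sum(weights[d] * scores[d] for d in DIMENSIONS) / weight_sum


example_2 = {"pronunciation": 93, "fluency": 87, "grammar": 91,
             "vocabulary": 94, "relevance": 96}
print(round(total_score(example_2), 1))
```

Normalizing by the sum of the weights keeps the total on the same 0-100 scale as the individual dimensions, even if a teacher supplies weights that do not add up to 1.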
[0071] The feedback generation unit generates personalized feedback based on the output of the multi-dimensional assessment unit. The feedback can take the form of text feedback, voice feedback generated by TTS technology, and visualization charts such as radar charts and bar charts. The system triggers preset feedback templates based on the score threshold and error type, and recommends suitable micro-lesson resources and exercises based on the collaborative filtering algorithm.
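One simple form of collaborative filtering that fits this description is a user-based nearest-neighbour scheme, sketched below. The similarity measure (derived from Euclidean distance between five-dimension score profiles), the neighbourhood size, the resource names, and all data are hypothetical; the text does not specify which collaborative filtering variant is used.

```python
import math


def similarity(a, b):
    """Similarity in (0, 1] derived from the Euclidean distance between profiles."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


def recommend(target_scores, target_history, peers, top_k=2, top_n=3):
    """Recommend resources chosen by the most similar peers.

    target_scores:  the current student's five dimension scores
    target_history: set of resources the student has already used
    peers:          list of (score_profile, set_of_resources) for other students
    """
    neighbours = sorted(peers, key=lambda peer: similarity(target_scores, peer[0]),
                        reverse=True)[:top_k]
    votes = {}
    for peer_scores, peer_resources in neighbours:
        weight = similarity(target_scores, peer_scores)
        for resource in peer_resources:
            if resource not in target_history:
                votes[resource] = votes.get(resource, 0.0) + weight
    # Highest vote first; ties are broken alphabetically for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [resource for resource, _ in ranked[:top_n]]


peers = [
    ([91, 84, 87, 72, 81], {"vocab_microlesson", "troubleshooting_drill"}),
    ([60, 95, 70, 95, 90], {"pronunciation_drill"}),
    ([89, 86, 90, 69, 79], {"vocab_microlesson", "term_flashcards"}),
]
print(recommend([90, 85, 88, 70, 80], {"term_flashcards"}, peers))
```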
[0072] The data storage and analysis unit stores the raw audio, recognized text, scores for each dimension, total score, and feedback information of all users' assessments in a structured manner in a cloud database. This unit regularly runs data analysis tasks to generate various statistical reports, such as an overview of class oral communication skills, a heatmap of the distribution of weaknesses in each dimension, and individual progress trend curves. These reports are pushed to the teacher's web interface via API, allowing teachers to view, filter, and export them at any time, and to conduct targeted teaching interventions based on the report results.
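The aggregation behind a class overview can be illustrated with a short sketch that averages each dimension across stored records and flags the weakest one. The record layout, the dimension names, and the sample values are hypothetical.

```python
from statistics import mean


def class_report(records):
    """Return per-dimension class averages and the weakest dimension.

    records: list of dicts, one per student, mapping dimension name to score.
    """
    if not records:
        raise ValueError("at least one assessment record is required")
    dimensions = list(records[0].keys())
    averages = {d: mean(record[d] for record in records) for d in dimensions}
    weakest = min(averages, key=averages.get)
    return averages, weakest


# Made-up records for two students.
records = [
    {"pronunciation": 80, "fluency": 90, "vocabulary": 70},
    {"pronunciation": 90, "fluency": 80, "vocabulary": 60},
]
averages, weakest = class_report(records)
print(averages, weakest)
```

The same per-dimension averages, computed per class or per scenario, are the kind of values a weakness heatmap would display.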
[0073] Example 2
[0074] This embodiment demonstrates the application of the system in a specific teaching scenario for nursing professionals.
[0075] Students log in to the system via a mobile app and select the "Simulated Postoperative Care Consultation for Elderly Patients" scenario within the nursing field. As shown in Figure 3, the system plays a voice and text prompt: "You are a nurse in a general hospital. An elderly patient just had knee replacement surgery and complains about leg pain and poor sleep. Please ask about the pain level, sleep situation, and provide appropriate care guidance. You have 90 seconds."
[0076] The student began to speak freely: "Hello, Mr. Smith. I heard you're having leg pain after the surgery. On a scale of 1 to 10, how would you rate your pain? And did you have trouble falling asleep last night? It's normal to feel pain after surgery. You can press the pain relief button if it's unbearable. Also, try to keep your leg elevated when resting, which helps reduce swelling and improve sleep quality."
[0077] After voice acquisition and recognition, the multi-dimensional evaluation unit performs analysis:
[0078] The pronunciation accuracy score was 93, and the professional terms such as knee replacement surgery, pain relief button, elevated, and swelling were clearly recognized. Phoneme-level GOP analysis showed that all phonemes scored above 82.
[0079] The fluency score was 87, the speaking speed was 118 words per minute, there were 2 pauses, and the pause duration was 0.5 seconds.
[0080] The grammatical correctness score is 91. The sentence structure is complete, the tense and prepositions are correct, and there are no grammatical errors.
[0081] The vocabulary professionalism score was 94, with correct use of core nursing terms such as knee replacement surgery, pain level, pain relief button, elevated, and swelling.
[0082] The content relevance score was 96. According to Sentence-BERT calculation, the core semantics of the user's answer highly matched the standard template "difficult to explain, propose a compromise solution", with a cosine similarity of 0.94 and a keyword coverage of 100%.
[0083] The feedback generation unit then generates feedback: a voice broadcast of the total score and scores for each dimension; a radar chart showing the distribution of the five-dimensional abilities, with vocabulary professionalism scoring relatively low; and text feedback adding: "Your nursing consultation was professional and fluent, covering all the core communication points! We suggest adding the high-frequency professional term 'postoperative' to your expressions, for example, optimizing 'after the surgery' to 'after the postoperative period' to make the expression more relevant to the medical and nursing workplace. We recommend that you complete the 'Postoperative Nursing Communication' unit exercise in 'Core Dialogues for Nursing Professional English'."
[0084] Compared with existing technologies, this embodiment not only scores, but also simulates real workplace interactions and provides precise and actionable vocabulary improvement suggestions.
[0085] Example 3
[0086] At the end of the semester, the data storage and analysis unit generated a teaching analysis report for the entire grade. The report showed that:
[0087] The overall oral communication skills of Class 1 of the Mechatronics major showed an average score of 82.3, with pronunciation accuracy at 85.1, fluency at 83.7, grammatical correctness at 84.2, vocabulary professionalism at 76.8, and content relevance at 81.5.
[0088] The heatmap of weaknesses across all dimensions shows that the average score for vocabulary proficiency in all classes of the Mechatronics major was 78, significantly lower than other dimensions. In particular, the mastery of professional vocabulary such as hydraulic, pneumatic, and valve in the "troubleshooting" scenario was generally poor, with a correct usage rate of only 62%.
[0089] The individual progress trend curve shows that student Zhang San's fluency score improved significantly from 72 points at the beginning of the term to 85 points at the end of the term.
[0090] After seeing the report, the teachers decided to conduct a targeted professional vocabulary reinforcement week. At the same time, the teachers uploaded the latest industry corpus through the corpus management interface. The content was a fault diagnosis dialogue of a new type of CNC machine tool, which contained more cutting-edge professional terms. After receiving the new corpus, the system backend automatically triggered the incremental training process of the model, using the new corpus to make a light-weight fine-tuning of the original vocabulary professional sub-module, so that it could better understand and evaluate the spoken expressions containing these new terms.
[0091] This forms a complete closed loop from "identifying problems through assessment" to "intervention in teaching" and then to "updating the question bank and model," ensuring that the system's assessment capabilities keep pace with the times.
[0092] Example 4
[0093] The system is deployed in the cloud and adopts a microservice architecture. The units communicate with each other through RESTful APIs. The client supports iOS and Android apps as well as H5 web pages, ensuring a good access experience on different terminals.
[0094] Students can complete various question types on their mobile devices:
[0095] In the reading-aloud question type, a digital economy student reads the standard text provided by the system: "Please read the following sentence: Digital economy has become an important driving force for economic growth, relying on data resources and digital technology." The system evaluates pronunciation accuracy and fluency and detects that the student mispronounced "driving force." The GOP algorithm gives this phrase a score of only 68 points, and the feedback generation unit prompts: "Your pronunciation of 'driving force' is incorrect. The correct pronunciation is... Please repeat after me."
[0096] In the retelling question type, an agriculture and forestry student logs in via iPad. The system plays an English audio explanation of "greenhouse vegetable planting technology." The student retells it in their own words as, "Greenhouse vegetables need enough sunlight and water, and we should control temperature and humidity." The content relevance score is 80 points, but the grammatical correctness score is 65 points. The system feedback is: "Your retelling grasped the core points, but the grammar needs improvement: 'Greenhouse vegetable' should be 'Greenhouse vegetables', and 'need' should be 'needs'."
[0097] As shown in Example 2, the role-playing question type involves the system playing one role and the student playing another role, engaging in multiple rounds of dialogue. For example, tourism students can simulate a "hotel check-in service" dialogue. The system evaluates all five dimensions and records diagnostic information for each round of dialogue.
[0098] In the free-expression question type, digital economy students were given the open-ended topic "Describe the impact of e-commerce live streaming on the development of small and medium-sized enterprises." Students were free to express their opinions, and the system focused on evaluating the relevance of the content, the professionalism of the vocabulary, and the correctness of the grammar. If the students' answers showed insufficient use of professional vocabulary, the system would generate comprehensive feedback and recommend corresponding professional vocabulary learning resources.
[0099] In the guided dialogue type, nursing students complete the "simulated elderly patient care consultation" assessment in the training room using intelligent voice acquisition equipment. The system acts as the patient to ask questions dynamically, and the students act as nurses to conduct multiple rounds of dialogue. The system evaluates the students' performance in real time after each round of dialogue and generates a comprehensive report after the dialogue ends.
[0100] Regardless of the question type or professional field, the system will call its core speech recognition, corpus matching and multi-dimensional evaluation functions to ensure the consistency and accuracy of the evaluation. All processing logic is completed in the cloud, and the local client is only responsible for data collection and result display, which greatly reduces the performance requirements of mobile devices and ensures unified management and rapid iteration of models and corpora.
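The phrase-level GOP score mentioned in the reading-aloud example can be illustrated with one common formulation of Goodness of Pronunciation: the mean log posterior of the expected phoneme over its aligned frames. This is a hedged sketch, not the formula used by the system; the input layout, the `floor` parameter, and the linear mapping to a 0-100 scale are all assumptions made for illustration.

```python
import math


def gop(frame_posteriors, phone):
    """Mean log posterior of the expected phone over its aligned frames.

    frame_posteriors: one dict per frame, mapping phone label -> posterior
    probability (assumed to come from the acoustic model).
    """
    if not frame_posteriors:
        raise ValueError("at least one frame is required")
    # Clamp to avoid log(0) when the model assigns zero probability.
    logs = [math.log(max(frame.get(phone, 0.0), 1e-10)) for frame in frame_posteriors]
    return sum(logs) / len(logs)


def gop_to_points(gop_value, floor=-5.0):
    """Map a GOP value in [floor, 0] linearly onto 0..100 (hypothetical scaling)."""
    clipped = max(floor, min(0.0, gop_value))
    return 100.0 * (1.0 - clipped / floor)


if __name__ == "__main__":
    frames = [{"d": 0.9, "t": 0.1}, {"d": 0.8, "t": 0.2}]
    value = gop(frames, "d")
    print(round(value, 4), round(gop_to_points(value), 1))
```

A phrase score such as the one for "driving force" could then be taken as an average over the phonemes of the phrase, though the source does not say how phoneme scores are combined.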
[0101] Example 5
[0102] This example details how teachers can conduct instructional interventions through a teacher terminal.
[0103] After logging into the web-based teacher platform, teachers can perform the following operations:
[0104] When viewing class reports, teachers select the class and date range, and the system generates an oral communication skills analysis report for that class, including an average score trend chart, a radar chart of the score distribution across dimensions, a heat map of weaknesses, and a list of the students who have made the most progress. Teachers can view the performance of classes across the seven major professional fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy.
[0105] When selecting students, teachers can filter by several criteria: by score range (for example, students with a total score below 70), by weakness dimension (students with a vocabulary professionalism score below 60), by error type (students with frequent subject-verb agreement errors), and by progress trend (students whose scores have declined over three consecutive tests). In this example, the teacher selected 15 students with a vocabulary professionalism score below 70.
[0106] When teaching in groups, teachers can group selected students into a group named "AI Vocabulary Enhancement Group" and push customized practice tasks to them. The system supports creating new groups, adding students to groups, setting learning goals for groups, and tracking the learning progress of groups.
[0107] When pushing learning resources, teachers can select suitable micro-lesson videos, exercises, professional vocabulary lists, etc. from the resource library and push them to the designated student groups. The resource library includes pronunciation enhancement series of micro-lesson videos, role-playing exercises, professional vocabulary lists categorized by scenario, etc. Teachers can also create custom assessment tasks, specifying question types, scenarios, time, etc. The system will automatically record students' completion status and display statistics such as completion rate and average score on the teacher's end.
[0108] When viewing individual details, teachers can see all of a single student's assessment records, including score trend charts for each dimension, a list of specific errors, a history of system-generated feedback suggestions, and learning trajectory.
[0109] When exporting reports, teachers can export class reports and individual reports in PDF or Excel format for teaching archives or communication with parents. The exported reports include a cover, a table of contents, statistical analysis charts and data tables, and an appendix of the original assessment records.
[0110] Through the above operations, teachers have shifted from teaching based on experience to data-driven teaching, greatly improving the relevance and efficiency of their teaching.
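The student selection criteria described in this example can be sketched as simple filters over stored assessment records. This is an illustrative sketch under stated assumptions: the `StudentRecord` layout, the function names, and the `min_count` cutoff for "frequent" errors are hypothetical, and "declined over three consecutive tests" is read here as the last three totals being strictly decreasing.

```python
from dataclasses import dataclass, field


@dataclass
class StudentRecord:
    name: str
    totals: list          # total scores in chronological order
    dimensions: dict      # latest score per dimension
    error_counts: dict = field(default_factory=dict)  # error type -> occurrences


def below_total(records, threshold=70):
    """Score-range filter: latest total score below the threshold."""
    return [r for r in records if r.totals and r.totals[-1] < threshold]


def weak_in(records, dimension, threshold=60):
    """Weakness-dimension filter: latest dimension score below the threshold."""
    return [r for r in records if r.dimensions.get(dimension, 100) < threshold]


def frequent_error(records, error_type, min_count=3):
    """Error-type filter; what counts as 'frequent' is an assumed cutoff."""
    return [r for r in records if r.error_counts.get(error_type, 0) >= min_count]


def declining(records, tests=3):
    """Progress-trend filter: the last `tests` totals are strictly decreasing."""
    selected = []
    for r in records:
        last = r.totals[-tests:]
        if len(last) == tests and all(a > b for a, b in zip(last, last[1:])):
            selected.append(r)
    return selected


if __name__ == "__main__":
    records = [
        StudentRecord("A", [80, 75, 68], {"vocabulary professionalism": 55},
                      {"subject-verb agreement": 4}),
        StudentRecord("B", [70, 78, 85], {"vocabulary professionalism": 82}),
    ]
    print([r.name for r in weak_in(records, "vocabulary professionalism")])
```

The selected records could then be handed to the grouping and resource-pushing operations described above.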
[0111] Example 6
[0112] This embodiment details the complete process of dynamic corpus updating, as shown in Figure 4.
[0113] Teachers or industry experts can log in to the corpus management web interface and upload new scenario corpora, including the scenario name, the professional field (one of the seven major fields of tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence and digital economy), scenario description, standard scenario template text, reference text, standard pronunciation audio file, professional vocabulary and evaluation points.
[0114] The system automatically performs format checks, including audio file format checks, text encoding checks, and vocabulary format checks. If the format does not meet the requirements, the system returns an error message, requiring modification and re-uploading.
[0115] After the format check is passed, the text enters the manual review process, where reviewers check the accuracy, professionalism, and suitability of the text content.
[0116] Once approved, the corpus is automatically added to the database, and the corpus index is updated.
[0117] The system then triggers an incremental model update: it uses the new corpus to perform lightweight fine-tuning of the relevant models in the original speech recognition unit or multi-dimensional evaluation unit, training for 3 epochs with a low learning rate of 1e-6, so that the models can recognize and evaluate the professional terms in the new corpus.
[0118] The update is complete. The new corpus can be selected for use in assessment tasks. The entire process can be completed within 24 hours, ensuring that the teaching content is in sync with industry development.
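The update flow above (format check, manual review, indexing, incremental fine-tuning) can be sketched as a small gatekeeping function. This is a minimal sketch under assumptions: the class and function names, the accepted audio extensions, and the UTF-8 requirement are hypothetical. Only the three kinds of format check, the review step, and the fine-tuning settings (3 epochs, learning rate 1e-6) come from the description.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    REJECTED_FORMAT = "rejected_format"   # uploader must fix and re-upload
    REJECTED_REVIEW = "rejected_review"   # manual review did not approve
    READY = "ready"                       # indexed and selectable for assessment


ALLOWED_AUDIO = (".wav", ".mp3")          # assumed accepted audio formats
FINE_TUNE = {"epochs": 3, "learning_rate": 1e-6}  # settings stated in [0117]


@dataclass
class CorpusUpload:
    scenario_name: str
    professional_field: str
    reference_text: bytes      # raw uploaded text, encoding not yet verified
    audio_filename: str
    vocabulary: list


def format_errors(upload):
    """Audio format, text encoding, and vocabulary format checks."""
    errors = []
    if not upload.audio_filename.lower().endswith(ALLOWED_AUDIO):
        errors.append("unsupported audio file format")
    try:
        upload.reference_text.decode("utf-8")
    except UnicodeDecodeError:
        errors.append("reference text is not valid UTF-8")
    if not all(isinstance(w, str) and w.strip() for w in upload.vocabulary):
        errors.append("vocabulary entries must be non-empty strings")
    return errors


def process(upload, reviewer_approves):
    """Run an upload through the gates; `reviewer_approves` stands in for manual review."""
    errors = format_errors(upload)
    if errors:
        return Status.REJECTED_FORMAT, errors
    if not reviewer_approves(upload):
        return Status.REJECTED_REVIEW, []
    # A real system would update the corpus index here and start a
    # fine-tuning job configured with FINE_TUNE.
    return Status.READY, []
```

Keeping the format check ahead of manual review means reviewers only see uploads that are already machine-readable.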
[0119] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Those skilled in the art can make various improvements and modifications without departing from the spirit and principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
[0120] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection, an electrical connection, or a connection that allows communication between them; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components, unless otherwise explicitly limited. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.
[0121] Obviously, the embodiments described above are merely some embodiments of the present invention, not all embodiments. The accompanying drawings show preferred embodiments of the present invention, but do not limit the patent scope of the present invention. The present invention can be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided so that the disclosure of the present invention will be more thorough and complete. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent substitutions for some of the technical features. Any equivalent structures made using the content of this specification and drawings, directly or indirectly applied to other related technical fields, are similarly within the patent protection scope of this invention.
Claims
1. A vocational college English oral proficiency assessment system based on speech recognition, characterized in that it comprises:
a voice acquisition unit, configured to acquire spoken English voice signals from vocational college students, wherein the voice acquisition unit is deployed on a client device, which may include a mobile terminal, a personal computer, or a dedicated intelligent voice acquisition device, and has built-in noise reduction and voice activity detection algorithms to extract effective voice segments in noisy environments;
a speech recognition unit, connected to the voice acquisition unit and configured to preprocess and recognize the voice signal, and to output the corresponding text sequence and acoustic features;
a vocational college English major corpus unit, configured to store standard spoken language data categorized by vocational college major fields, wherein the standard spoken language data includes standard pronunciation audio, reference text, professional vocabulary, and scenario templates, and the vocational college major fields include seven major fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy;
a multi-dimensional assessment unit, connected to the speech recognition unit and to the vocational college English major corpus unit, and configured to automatically score the oral English of vocational college students, based on the text sequence, the acoustic features and the standard spoken language data, in five dimensions: pronunciation accuracy, fluency, grammatical correctness, vocabulary professionalism and content relevance;
a feedback generation unit, connected to the multi-dimensional assessment unit and configured to generate personalized improvement suggestions and learning paths based on the scoring results; and
a data storage and analysis unit, connected to the multi-dimensional assessment unit and to the feedback generation unit, and configured to store all assessment records, perform statistical analysis on the data, and generate teaching analysis reports for teachers to use.
2. The vocational college English oral assessment system based on speech recognition according to claim 1, characterized in that the speech recognition unit employs an end-to-end deep learning model based on the Conformer architecture. This model is first pre-trained on a large-scale general English corpus, and then fine-tuned using a corpus of spoken English from vocational college students to accommodate non-native accents. The corpus covers seven major vocational college fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy. It includes annotations for errors in professional terminology stress and typical accent errors involving alveolar consonants. The acoustic features include Mel-frequency cepstral coefficients, fundamental frequency, formants, and posterior phoneme probabilities.
3. The vocational college English oral assessment system based on speech recognition according to claim 1, characterized in that the vocational college English major corpus unit is further configured to support dynamic updates, enabling the standard spoken language data to be added or modified through a web interface according to industry development and teacher needs, and triggering incremental model updates.
4. The vocational college English oral assessment system based on speech recognition according to claim 1, characterized in that the multi-dimensional evaluation unit includes:
a pronunciation accuracy submodule, configured to calculate phoneme-level pronunciation scores using the GOP algorithm based on the aforementioned acoustic features, according to the formula: [formula missing in source]; where the terms of the formula denote the phoneme p, the observed acoustic feature sequence, the number of frames, the posterior probability of the phoneme, and the prior probability P(p) of the phoneme;
a fluency submodule, configured to calculate speech rate, average pause duration, and number of pauses based on the timestamp information of the text sequence, and to generate a fluency score;
a grammar correctness submodule, configured to perform syntactic analysis on the text sequence, detect grammatical errors, and generate a grammar correctness score;
a vocabulary professionalism submodule, configured to match the text sequence against the professional vocabulary list, calculate the coverage and accuracy of professional vocabulary, and generate a vocabulary professionalism score; and
a content relevance submodule, configured to perform semantic similarity calculation between the text sequence and the scenario template to generate a content relevance score.
5. The vocational college English oral assessment system based on speech recognition according to claim 4, characterized in that the content relevance submodule uses the Sentence-BERT model to encode the text sequence and the scenario template into semantic vectors, and obtains the content relevance score by calculating the cosine similarity between the two vectors. The calculation formula is as follows: $\cos(\mathbf{v}_s, \mathbf{v}_t) = \dfrac{\mathbf{v}_s \cdot \mathbf{v}_t}{\lVert \mathbf{v}_s \rVert \, \lVert \mathbf{v}_t \rVert}$; where $\mathbf{v}_s$ is the semantic vector of the vocational college student's text sequence and $\mathbf{v}_t$ is the semantic vector of the scenario template.
6. The vocational college English oral assessment system based on speech recognition according to claim 4, characterized in that the vocabulary professionalism submodule calculates the coverage and accuracy of professional vocabulary, and calculates the vocabulary professionalism score according to the formula: [formula missing in source]; where the two terms of the formula are the coverage of professional vocabulary and the accuracy of professional terminology.
7. The vocational college English oral assessment system based on speech recognition according to claim 1, characterized in that the feedback generation unit supports multiple feedback formats, including text feedback, voice feedback generated through speech synthesis technology, and feedback in the form of a visual graph. Using a collaborative filtering algorithm, the feedback generation unit recommends question types and micro-lesson resources for the next stage of learning, based on the current student's scores in their weak dimensions and on the historical resource selection data of student groups with similar learning trajectories.
8. The vocational college English oral assessment system based on speech recognition according to claim 1, characterized in that the data storage and analysis unit can generate statistical reports on oral communication skills for classes or individuals, heat maps of weaknesses across the dimensions, and progress trend curves, and supports teachers in querying, exporting, and intervening in teaching.
9. The vocational college English oral assessment system based on speech recognition according to any one of claims 1-8, characterized in that the system supports various oral assessment question types, including reading aloud, retelling, role-playing, free expression, and guided dialogue, to simulate interactive workplace scenarios in seven major fields: tourism, electromechanical engineering, agriculture and forestry, intelligent construction, nursing, artificial intelligence, and digital economy.
10. A method for assessing spoken English in higher vocational colleges based on speech recognition, applied to the system described in any one of claims 1-9, characterized in that it includes the following steps:
Step S1: The voice acquisition unit collects the spoken English voice signals of vocational college students;
Step S2: The speech recognition unit preprocesses the voice signal and performs speech recognition on it, and outputs the corresponding text sequence and acoustic features;
Step S3: The multi-dimensional assessment unit retrieves the standard spoken language data corresponding to the current assessment scenario from the vocational college English major corpus unit and, combining the text sequence and acoustic features, automatically scores the spoken language of vocational college students in five dimensions: pronunciation accuracy, fluency, grammatical correctness, vocabulary professionalism, and content relevance, generating a score for each dimension and a total score;
Step S4: The feedback generation unit matches preset feedback templates against the scores of each dimension and the total score to generate personalized improvement suggestions and learning paths in the form of text, voice or visual graphs;
Step S5: The data storage and analysis unit stores the voice signals, text sequences, scores for each dimension, and feedback suggestions from this assessment in a structured manner, and performs statistical analysis on historical assessment data;
Step S6: The teacher retrieves the statistical analysis results generated by the data storage and analysis unit to obtain the teaching analysis report for the class or individual, thus completing the closed loop of oral assessment and teaching feedback.
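As an illustration of the fluency submodule recited in claim 4, the three fluency features (speech rate, number of pauses, average pause duration) can be derived from word-level timestamps. This is a hedged sketch only: the input layout, the 0.3-second pause threshold, and the function name are assumptions, and the claims do not specify how these features are combined into a fluency score.

```python
def fluency_features(words, pause_threshold=0.3):
    """Compute fluency features from word timestamps.

    words: list of (token, start_sec, end_sec) tuples in spoken order.
    pause_threshold: minimum silent gap, in seconds, counted as a pause
    (an assumed value, not taken from the source).
    """
    if not words:
        raise ValueError("at least one word is required")
    duration = words[-1][2] - words[0][1]
    if duration <= 0:
        raise ValueError("timestamps must span a positive duration")
    # Silent gap between the end of each word and the start of the next.
    gaps = [nxt[1] - cur[2] for cur, nxt in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= pause_threshold]
    return {
        "speech_rate_wpm": 60.0 * len(words) / duration,
        "pause_count": len(pauses),
        "mean_pause_sec": sum(pauses) / len(pauses) if pauses else 0.0,
    }


if __name__ == "__main__":
    sample = [("digital", 0.0, 0.5), ("economy", 0.6, 1.2),
              ("has", 2.0, 2.2), ("become", 2.3, 3.0)]
    print(fluency_features(sample))
```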
Citation Information
Patent Citations
Automatic oral English marking method based on feature fusion
CN106847260A
A method and device for assessing spoken English
CN107818795B