Intelligent evaluation and enhancement system for presenters' oral expression
By combining a dual-branch deep learning evaluation module and a track adaptation module, the professional competence and personalized style of the presenter's oral expression are quantitatively evaluated and trained, solving the problem of the disconnect between evaluation and training in existing technologies and improving the accuracy of evaluation and the relevance of training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU UNIVERSITY
- Filing Date
- 2026-04-03
- Publication Date
- 2026-06-19
Figure CN122245290A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence speech and semantic processing and intelligent evaluation of spoken expression, and in particular to an intelligent evaluation and enhancement system for presenters' spoken expression.
Background Technology
[0002] With the rapid development of the converged media industry, the hosting profession is showing a trend of vertical segmentation. Different fields such as news, variety shows, interviews, and galas impose differentiated professional requirements on the oral expression ability of hosts. At the same time, audience demand for a host's personalized style and distinctive expressive charm continues to increase. Professional ability and personal style have become the two core dimensions of a host's professional competitiveness.
[0003] Currently, the assessment and training of presenters' oral expression still relies primarily on traditional manual methods. Existing intelligent assessment technologies are mostly concentrated in the areas of general Mandarin proficiency assessment and general oral expression assessment. Mature technical solutions specifically for the vertical scenarios of presenting are extremely scarce, and there are core technical defects: First, the assessment dimensions of existing general assessment technologies are out of touch with the professional needs of presenting scenarios, only covering the basic voice level and failing to establish differentiated assessment systems and industry benchmarks for specific presenting tracks, thus failing to meet the professional assessment needs of the industry; Second, existing technologies suffer from a technical bias of prioritizing standardization, adopting a single-branch standardized assessment architecture that cannot take into account both professional ability assessment and the quantification of personalized style value, easily misjudging unique styles as defects, leading to homogenization of presenters' expressions; Third, the manual mode has inherent defects such as strong subjectivity, low efficiency, and poor replicability, resulting in poor consistency of assessment results and an inability to accurately identify weaknesses and strengths; Fourth, existing technologies generally suffer from the problem of separation between assessment and training, failing to form a closed-loop capability enhancement system and unable to generate personalized training programs and dynamically iteratively optimize based on assessment results.
Summary of the Invention
[0004] This invention overcomes the shortcomings of existing technologies and provides an intelligent evaluation and enhancement system for presenters' oral expression.
[0005] To achieve the above objectives, the technical solution adopted by the present invention is: an intelligent evaluation and enhancement system for presenter's oral expression, comprising: a data acquisition module, a dual-branch deep learning evaluation module, a training target generation module, and a personalized optimization training module;
[0006] The data acquisition module is used to acquire the target host's historical spoken expression data, spoken expression text data, personal trait label data, and track positioning data. At the same time, it acquires big data on the expression patterns of outstanding hosts in the same track, and constructs a dual-dimensional dataset of personal style and professional ability. The dual-dimensional dataset is then output to the dual-branch deep learning evaluation module.
[0007] The dual-branch deep learning evaluation module receives the dual-dimensional dataset and processes it in parallel through a standardized evaluation branch and a personalized style recognition branch to generate standardized quantitative evaluation results of the target presenter's professional oral expression ability, personal style characteristics, and corresponding style value quantitative data.
[0008] The training objective generation module receives the standardized quantitative evaluation results and style value quantitative data, compares them with the industry benchmark database and the excellent style library in the same track, identifies professional ability shortcomings and high-value strengthenable style features, generates dual-dimensional training objectives for shortcomings and strengths, and outputs the dual-dimensional training objectives to the personalized optimization training module.
[0009] The personalized optimization training module receives the dual-dimensional training objectives, generates a personalized phased optimization training scheme and outputs the corresponding training content, while dynamically collecting training process data and iteratively optimizing the personalized phased optimization training scheme.
[0010] In a preferred embodiment of the present invention, the data acquisition module preprocesses and standardizes the historical spoken expression audio data, spoken expression text data, personal trait label data, track positioning data, and big data of expression patterns of outstanding hosts in the same track in terms of style type in the dual-dimensional dataset, and outputs a standardized temporal speech feature sequence, a text semantic feature matrix, a labeled feature vector, and a benchmark feature library of the same track style.
[0011] The standardized temporal speech feature sequence and text semantic feature matrix are simultaneously input into the standardized evaluation branch, and the standardized temporal speech feature sequence, text semantic feature matrix, labeled feature vector, and same track style benchmark feature library are simultaneously input into the personalized style recognition branch, triggering parallel independent computation of the two branches.
[0012] In a preferred embodiment of the present invention, the dual-branch deep learning evaluation module includes a standardized evaluation branch and a personalized style recognition branch. The standardized evaluation branch has a built-in language rhythm analysis unit and a logical structure evaluation engine unit, and the personalized style recognition branch has a built-in expressive charm calculation unit and a style value quantification unit.
[0013] The language rhythm analysis unit receives the standardized temporal speech feature sequence and performs temporal segmentation and feature quantification of inter-sentence pauses, intra-sentence pauses, stress placement, and intonation fluctuations. It calculates four core rhythm indicators: speech rate fluctuation rate, pause duration ratio, stress position matching degree, and intonation fluctuation smoothness. The four core rhythm indicators are matched with the standardized rhythm templates of the same track in the industry benchmark database for similarity matching. A language rhythm standardized scoring matrix is generated by calculating the weighted scoring formula and output to the logical structure evaluation engine unit.
[0014] The logical structure evaluation engine unit performs text-level logical link analysis on the language rhythm standardized scoring matrix, identifies the logical structure levels of the spoken text, including the introduction, core viewpoint explanation, case evidence, viewpoint progression, and conclusion, and calculates four core logical indicators: logical link completeness, core viewpoint support, contextual semantic coherence, and information density fit, generating a logical structure standardized scoring matrix. The logical structure standardized scoring matrix and the language rhythm standardized scoring matrix are then input into a multimodal coupled evaluation model, outputting standardized individual scores and comprehensive scores for each dimension of spoken expression professional ability, forming a standardized quantitative evaluation result of the target presenter's spoken expression professional ability.
[0015] The expressive charm calculation unit receives the standardized temporal speech feature sequence, text semantic feature matrix, and labeled feature vector. Through the Transformer model for style feature extraction based on contrastive learning, it performs cross-modal fusion encoding of speech temporal features and text semantic features, extracts five core style dimensions: emotional intensity, language persuasiveness, personal identification, audience empathy, and expressive relaxation. It then combines the labeled feature vector to perform feature alignment and dimension correction, generates the target host's personal style feature vector, and outputs it to the style value quantification unit.
[0016] The style value quantification unit receives the style benchmark feature library of the same track and the personal style feature vector. It uses a density clustering algorithm to cluster and match the personal style feature vector with the features in the style benchmark feature library of the same track to locate the style cluster to which the target host belongs. Then, it calculates three core value indicators of the personal style feature vector in the corresponding track: style scarcity, audience preference matching degree, and track competitiveness. The style value quantification data is generated by calculating the value weighted fusion formula.
[0017] In a preferred embodiment of the present invention, when the training target generation module performs processing, it receives the standardized quantitative evaluation result, compares the evaluation data of each sub-item in the standardized quantitative evaluation result with the benchmark threshold of the corresponding track in the industry benchmark database, and marks the sub-items below the benchmark threshold as professional capability shortcomings.
[0018] The system receives the individual style features and style value quantification data, performs similarity matching with style features in the excellent style library of the same track, marks features with similarity below the threshold and scores above the passing line as high-value style features that can be strengthened, and marks features with similarity above the threshold as homogeneous features that need to be avoided.
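The threshold comparisons described above can be illustrated with a minimal sketch. The data layout, the field names, and the default values `sim_threshold=0.8` and `passing_line=60.0` are assumptions made for illustration only; the source does not state concrete thresholds.

```python
from dataclasses import dataclass

@dataclass
class StyleFeature:
    name: str
    similarity: float   # similarity to the same-track excellent style library (assumed 0..1)
    value_score: float  # style value quantification score (assumed 0..100)

def find_shortcomings(scores, benchmarks):
    """Return the sub-items whose score is below the track benchmark threshold."""
    return sorted(name for name, score in scores.items() if score < benchmarks[name])

def classify_style(features, sim_threshold=0.8, passing_line=60.0):
    """Split style features into strengthenable and homogeneous ones.

    Low similarity plus a score above the passing line means a high-value
    feature worth strengthening; high similarity means a homogeneous feature.
    """
    strengthen, avoid = [], []
    for feature in features:
        if feature.similarity > sim_threshold:
            avoid.append(feature.name)
        elif feature.similarity < sim_threshold and feature.value_score > passing_line:
            strengthen.append(feature.name)
    return strengthen, avoid
```

Features that are dissimilar but score below the passing line fall into neither list in this sketch, since the text does not say how they are handled.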
[0019] In a preferred embodiment of the present invention, when the personalized optimization training module generates a personalized phased optimization training scheme, it matches the corresponding rhythm control training unit, logical expression training unit, and expressive charm enhancement training unit based on the professional ability shortcomings in the dual-dimensional training objectives.
[0020] Based on the intensifiable style features in the dual-dimensional training objectives, corresponding style-oriented reinforcement training units are matched. The training stages are divided in the order of basic ability completion, style-oriented reinforcement, and comprehensive ability integration, and training objectives and content for each stage are set to generate the personalized phased optimization training scheme.
[0021] In a preferred embodiment of the present invention, when the personalized optimization training module performs iterative optimization, during the execution of each training stage, the training audio data, training text data and training completion data of the target host are collected in real time, and real-time standardized evaluation data and real-time style feature retention data are generated through the dual-branch deep learning evaluation module.
[0022] The real-time standardized evaluation data is compared with the training objectives of the corresponding stage to determine the professional competence level. Based on the judgment results, the difficulty and training pace of subsequent training content are adjusted. At the same time, based on the real-time style feature retention data, the training weight of the style enhancement training unit is adjusted, and the personalized phased optimization training scheme is iteratively optimized.
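One possible shape of this adjustment step is sketched below. The plan fields (`difficulty`, `style_weight`), the 1 to 5 difficulty scale, the `margin` around the stage target, and the `retention_floor` are all hypothetical; the source only states that difficulty, pace, and style training weight are adjusted from the real-time data.

```python
def adjust_plan(plan, stage_score, stage_target, style_retention,
                margin=5.0, retention_floor=0.8):
    """Return an updated copy of the training plan for the next iteration.

    plan            -- dict with 'difficulty' (1..5) and 'style_weight' (0..1)
    stage_score     -- real-time standardized evaluation score for this stage
    stage_target    -- training objective of the corresponding stage
    style_retention -- real-time style feature retention (assumed 0..1)
    """
    updated = dict(plan)
    if stage_score >= stage_target + margin:
        # Target clearly met: raise the difficulty of subsequent content.
        updated["difficulty"] = min(plan["difficulty"] + 1, 5)
    elif stage_score < stage_target - margin:
        # Target clearly missed: ease off.
        updated["difficulty"] = max(plan["difficulty"] - 1, 1)
    if style_retention < retention_floor:
        # Personal style is eroding: give style reinforcement more weight.
        updated["style_weight"] = min(round(plan["style_weight"] + 0.1, 2), 1.0)
    return updated
```

Returning a copy rather than mutating the plan keeps each iteration's scheme available for comparison.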
[0023] In a preferred embodiment of the present invention, the intelligent evaluation and enhancement system for the host's spoken expression also includes a track adaptation module. The track adaptation module receives the track positioning data of the target host and matches the corresponding track type. The track type includes news track, variety show track, interview track, and gala track.
[0024] Based on the matched track type, the weight ratio of each sub-evaluation data in the standardized evaluation branch is adjusted, the calculation dimension of the style value quantification model in the personalized style recognition branch is adjusted, and the industry benchmark database and the excellent style library of the same track are retrieved to adapt the evaluation rules and training content generation rules of the corresponding track.
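As a rough illustration of track adaptation, the sketch below looks up a per-track profile and applies its weights to the sub-scores. The profile contents, the weight values, and the library names are placeholders invented for this example, not values from the invention.

```python
# Hypothetical per-track profiles; every weight and library name is illustrative.
TRACK_PROFILES = {
    "news":      {"weights": {"rhythm": 0.4, "logic": 0.6}, "style_library": "news_styles"},
    "variety":   {"weights": {"rhythm": 0.6, "logic": 0.4}, "style_library": "variety_styles"},
    "interview": {"weights": {"rhythm": 0.5, "logic": 0.5}, "style_library": "interview_styles"},
    "gala":      {"weights": {"rhythm": 0.7, "logic": 0.3}, "style_library": "gala_styles"},
}

def adapt_to_track(track):
    """Return the evaluation profile for a matched track type."""
    if track not in TRACK_PROFILES:
        raise ValueError(f"unknown track type: {track!r}")
    return TRACK_PROFILES[track]

def composite_score(track, sub_scores):
    """Combine sub-scores using the track's weight ratio (normalized to sum to 1)."""
    weights = adapt_to_track(track)["weights"]
    total = sum(weights.values())
    return sum(sub_scores[name] * w for name, w in weights.items()) / total
```

With these placeholder weights, the same pair of sub-scores yields a different composite for the news track than for the gala track, which is the intended effect of adjusting the weight ratio per track.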
[0025] In a preferred embodiment of the present invention, the logic structure evaluation engine unit further includes a built-in spoken logic feature graph, the generation process of which includes:
[0026] S10. Collect labeled spoken language text data of hosts from different tracks, extract the logical structure units and logical connection relationships of spoken language expressions in the data, and construct a spoken language logical structure ontology.
[0027] S20. Based on the spoken logic structure ontology, label the logical node attributes corresponding to topic introduction, logical progression, viewpoint argumentation, and concluding sublimation of each logical structure unit, and construct a hierarchical spoken logic feature map for the identification and matching of spoken logic chains.
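A very small sketch of how such a feature graph might be used to match a logic chain follows. The graph layout, the node identifiers, and the completeness measure (share of expected nodes found in order) are assumptions for illustration; the source does not define the matching rule at this point.

```python
# Hypothetical linear logic chain built from the four node attributes named in S20.
EXPECTED_CHAIN = [
    "topic_introduction",
    "logical_progression",
    "viewpoint_argumentation",
    "concluding_sublimation",
]

def build_chain_graph(chain=EXPECTED_CHAIN):
    """Represent the chain as nodes with a level attribute plus directed edges."""
    nodes = {name: {"level": level} for level, name in enumerate(chain)}
    edges = list(zip(chain, chain[1:]))
    return {"nodes": nodes, "edges": edges}

def link_completeness(graph, labels):
    """Share of graph nodes that appear in `labels` in the expected order."""
    ordered = sorted(graph["nodes"], key=lambda n: graph["nodes"][n]["level"])
    matched, position = 0, 0
    for node in ordered:
        try:
            found = labels.index(node, position)
        except ValueError:
            continue  # this logical unit is missing from the spoken text
        matched += 1
        position = found + 1
    return matched / len(ordered)
```

Here `labels` stands for the sequence of logical unit types recognized in one spoken text; a text that skips the progression step would match three of the four nodes.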
[0028] In a preferred embodiment of the present invention, the data acquisition module further performs the following steps:
[0029] S101. Perform timestamp alignment processing on the acquired historical spoken expression data and spoken expression text data to generate audio-text synchronization annotation data;
[0030] S102. Perform track classification, style tag labeling and standardization preprocessing on the big data of the style-type expression patterns of the outstanding hosts in the same track, and build an outstanding style library and industry benchmark database in the same track.
[0031] S103. Integrate the audio-text synchronization annotation data, personal trait label data, and track positioning data into the dual-dimensional dataset of personal style and professional ability, and output the excellent style library of the same track and the industry benchmark database to the training target generation module respectively.
[0032] This invention addresses the shortcomings of the prior art and has the following beneficial effects:
[0033] (1) Since the present invention adopts a dual-branch deep learning evaluation architecture with a standardized evaluation branch and a personalized style recognition branch running in parallel, it effectively solves the technical bias of prioritizing standardization and homogenizing evaluation that is common in existing technologies, as well as the technical problems that a single-branch standardized evaluation architecture cannot take into account both professional ability evaluation and personalized style value recognition, and is prone to misjudging the unique style advantages of the host as expression defects. Thus, it realizes the parallel consideration of the two dimensions of standardized quantification of the host's oral expression professional ability and quantification of personal style value, fundamentally solving the core contradiction between standardization and personalization in the industry, avoiding the homogenization of the host's expression mode, and providing core technical support for the cultivation of differentiated host talents.
[0034] (2) Since the present invention adopts a track adaptation module that matches the sub-track of the host, and builds an industry benchmark database and an excellent style library of the same track, it effectively solves the technical problems of the existing general oral intelligent assessment technology that the assessment dimensions are seriously out of sync with the professional needs of the host scenario, that it has not established a differentiated assessment system and industry benchmark for the sub-track of the host, and that it cannot adapt to the professional requirements of different tracks. Thus, it realizes the full track adaptation and accurate assessment of the host vertical scenario. The assessment dimensions cover the core professional capabilities of the host, such as language rhythm, logical structure, and information transmission efficiency, which greatly improves the industry reference value and scenario adaptability of the assessment results.
[0035] (3) Since the present invention adopts a multi-dimensional index quantification calculation method based on reproducible mathematical formulas and completes multi-dimensional feature fusion evaluation with a multi-modal coupling evaluation model, it effectively solves the technical problems of strong subjectivity, large scoring deviation, and lack of consistency and reproducibility of evaluation results in the existing traditional manual evaluation mode, thereby realizing the objective and accurate quantitative evaluation of the host's oral expression ability.
[0036] (4) Since the present invention adopts a two-dimensional training objective generation system based on the dual-branch evaluation results to supplement the shortcomings and strengthen the strengths, it effectively solves the technical problems of the existing technology that the evaluation and training links are completely separated, that it is impossible to generate targeted training objectives based on the evaluation results, and that the training program is generalized. Thus, it realizes the accurate transformation of the evaluation results into the capability improvement path, forming a precise training orientation that supplements what is lacking and strengthens what is advantageous, avoiding redundant ineffective training content, and greatly improving the pertinence and effectiveness of training.
[0037] (5) Since the present invention adopts the Transformer model for style feature extraction based on contrastive learning, and is equipped with a three-dimensional value quantification calculation method for style scarcity, audience preference matching degree and track competitiveness, it effectively solves the technical problems in the prior art that the host's oral expression style can only be evaluated based on subjective feelings, cannot be quantified, and cannot accurately distinguish between high-value differentiated styles and low-value homogeneous styles. Thus, it realizes for the first time the quantifiable value assessment of the host's oral expression style, providing objective and practical data guidance for the creation of the host's differentiated core competitiveness, the cultivation of personal style and career development planning, and adapting to the core needs of the diversified and personalized development of hosts in the era of converged media.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0039] Figure 1 is the overall architecture diagram of the presenter's oral expression intelligent assessment and enhancement system of the present invention;
[0040] Figure 2 is an internal architecture diagram of the dual-branch deep learning evaluation module of the present invention;
[0041] Figure 3 is a flowchart of the closed-loop iterative process of the personalized optimization training module of the present invention.
Detailed Implementation
[0042] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0043] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein. Therefore, the scope of protection of the invention is not limited to the specific embodiments disclosed below.
[0044] As shown in Figure 1, the intelligent evaluation and enhancement system for presenters' oral expression includes: a data acquisition module, a dual-branch deep learning evaluation module, a training objective generation module, and a personalized optimization training module.
[0045] The data acquisition module is used to obtain the target host's historical spoken expression data, spoken expression text data, personal trait label data, and track positioning data. At the same time, it obtains big data on the expression patterns of outstanding hosts in the same track by style type, constructs a dual-dimensional dataset of personal style and professional ability, and outputs the dual-dimensional dataset to the dual-branch deep learning evaluation module.
[0046] The historical spoken expression data consists of high-definition audio data from the target presenter's past hosting scenarios. After collection, it is standardized to a 16 kHz sampling rate, mono, 16-bit audio format. First, background noise is reduced using spectral subtraction. Then, silent segments are removed using a voice activity detection (VAD) algorithm, and the WebRTC algorithm is used to separate the human voice from background noise, yielding clean voice audio data. Simultaneously, the corresponding spoken expression text data is collected. The audio and text data are aligned at the frame level using the Montreal Forced Aligner (MFA), with the alignment error controlled within 10 ms. Each text segmentation unit is matched with the start and end timestamps of the corresponding audio to generate audio-text synchronization annotation data, ensuring accurate matching of subsequent speech features and text semantic features.
[0047] Personal trait tag data includes the target host's gender, age, voice trait tags, years of hosting experience, and past hosting scenarios. This tag data is vectorized using a combination of one-hot encoding and word embedding encoding. The encoding formula is as follows:
[0048] $$e_i = \mathrm{OneHot}(l_i) \oplus \left(W_e \cdot \mathrm{OneHot}(l_i)\right)$$;
[0049] In the formula, $e_i$ is the encoding vector of the i-th label $l_i$, $\oplus$ is the vector concatenation operation, and $W_e$ is the pre-trained embedding layer weight matrix. All the label encoding vectors are concatenated to generate a 128-dimensional labeled feature vector.
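The label vectorization described above can be sketched as follows, assuming each label is encoded as its one-hot vector concatenated with the matching row of an embedding table. The vocabularies, the embedding width, and the random table (a stand-in for pre-trained weights) are hypothetical, so the output length here is not the 128 dimensions stated in the text.

```python
import random

def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a pre-trained embedding weight matrix."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def encode_label(value, vocab, table):
    """Concatenate the one-hot code of a label with its embedding row."""
    index = vocab.index(value)
    return one_hot(index, len(vocab)) + table[index]

def encode_traits(traits, vocabs, tables):
    """Concatenate all label encodings in a fixed (sorted) field order."""
    vector = []
    for field in sorted(traits):
        vector += encode_label(traits[field], vocabs[field], tables[field])
    return vector
```

Sorting the field names keeps the layout of the concatenated vector stable across hosts, which matters once the vector is fed to a trained model.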
[0050] The track positioning data consists of the core hosting track classification tags for the target host, including four main tracks: news track, variety show track, interview track, and gala track, as well as subdivided vertical tracks. These are converted into track feature vectors through one-hot encoding for subsequent track adaptation and benchmark matching.
[0051] Furthermore, the data acquisition module preprocesses and standardizes the features of historical spoken audio data, spoken text data, personal trait label data, track positioning data, and big data on the style-type expression patterns of outstanding hosts in the same track from the dual-dimensional dataset, and outputs standardized temporal speech feature sequences, text semantic feature matrices, labeled feature vectors, and a style benchmark feature library for the same track.
[0052] The standardized temporal speech feature sequence and text semantic feature matrix are simultaneously input into the standardized evaluation branch, and the standardized temporal speech feature sequence, text semantic feature matrix, labeled feature vector, and benchmark feature library of the same track style are simultaneously input into the personalized style recognition branch, triggering parallel independent computation of the two branches.
[0053] The industry benchmark data consists of big data on the expression patterns of outstanding hosts in the same track, categorized by style type. It collects publicly available audio and text data of top hosts covering the entire track and different style types, with a minimum of 1,000 hours of effective data samples for each track. First, the track is classified and style tags are labeled. The style tags include four core categories: calm and composed, lively and dynamic, wise and thoughtful, and approachable and empathetic. Then, an audio standardization and audio-text alignment preprocessing process that is completely consistent with the individual dimension data is executed. Finally, an excellent style library and industry benchmark database are constructed in the same track.
[0054] Among them, the industry benchmark database stores the benchmark thresholds and standardized scoring templates for various indicators of the professional oral expression ability of the hosts in the corresponding track, while the excellent style database in the same track stores the style feature vectors and style value scoring benchmark data for different style types in the corresponding track.
[0055] Furthermore, the data acquisition module also performs the following steps:
[0056] S101. Perform timestamp alignment processing on the acquired historical spoken language expression data and spoken language expression text data to generate audio-text synchronization annotation data;
[0057] S102. Perform track classification, style tagging and standardization preprocessing on the big data of the expression patterns of outstanding hosts in the same track, and build an excellent style library and industry benchmark database in the same track.
[0058] S103. Integrate audio-text synchronization annotation data, personal trait label data, and track positioning data into a dual-dimensional dataset of personal style and professional ability, and output the excellent style library and industry benchmark database of the same track to the training target generation module respectively.
[0059] In practical implementation, after multi-source data preprocessing, the data acquisition module executes a standardized feature extraction process to generate the core feature data required for subsequent system calculations. For audio data, a 25ms frame length and a 10ms frame shift are used for frame segmentation processing to extract 39 acoustic features, including Mel-frequency cepstral coefficients (MFCC), fundamental frequency (F0), energy entropy, and speech rate-related features. The extracted acoustic features are then subjected to z-score normalization, using the following formula:
[0060] $$x' = \frac{x - \mu}{\sigma}$$;
[0061] In the formula, $x$ is the original acoustic feature value, $\mu$ is the mean of the feature in the training set, and $\sigma$ is the standard deviation of the feature in the training set. The final output is a standardized temporal speech feature sequence of dimension $(T \times 39)$, where $T$ is the audio temporal length.
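The framing and z-score steps can be sketched in a few lines. This is a simplified illustration on a single feature column using the stated 25 ms frame length and 10 ms frame shift at 16 kHz; the function names and the small `eps` guard against zero variance are additions for this example, and extraction of the acoustic features themselves is not shown.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Cut a sample sequence into overlapping frames (400 samples, hop 160 at 16 kHz)."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, shift)]

def fit_stats(values):
    """Mean and population standard deviation of one feature over the training set."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return mu, sigma

def zscore(values, mu, sigma, eps=1e-8):
    """Apply (x - mu) / sigma; eps avoids division by zero for constant features."""
    return [(v - mu) / (sigma + eps) for v in values]
```

The statistics are fitted once on training data and then reused for every new recording, so that scores from different sessions stay comparable.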
[0062] For text data, preprocessing is first performed, including word segmentation, stop word removal, and part-of-speech tagging. Semantic encoding is then completed using a pre-trained Chinese RoBERTa model, extracting a 768-dimensional semantic feature vector for each segmented unit. Simultaneously, considering the characteristics of colloquial expressions, the feature weights of logical connectors and modifiers are adjusted, ultimately generating a text semantic feature matrix of dimension $(N \times 768)$, where $N$ is the number of text segmentation units.
[0063] Furthermore, based on the audio-text synchronization timestamp, the text semantic feature matrix and the standardized temporal speech feature sequence are aligned in time. For the expression pattern data of outstanding presenters in the same track, a completely consistent feature extraction process is performed to generate a style feature vector for each sample. These vectors are then clustered and stored according to track and style type to form a benchmark feature library for the same track.
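One way to picture the temporal alignment is to map each text unit's timestamps to a range of 10 ms frames and pool the speech features over that range. The sketch below assumes frame i starts at i times the frame shift and uses mean pooling; both choices, and the function name, are illustrative assumptions rather than details from the source.

```python
def pool_token_frames(frames, start_ms, end_ms, shift_ms=10):
    """Average the frame-level feature vectors that fall inside one text unit.

    frames   -- list of per-frame feature vectors (one per frame shift)
    start_ms -- start timestamp of the text unit, in integer milliseconds
    end_ms   -- end timestamp of the text unit, in integer milliseconds
    """
    first = start_ms // shift_ms
    # Ceiling division for the end index; always cover at least one frame.
    last = max(first + 1, -(-end_ms // shift_ms))
    span = frames[first:last]
    if not span:
        raise ValueError("timestamps fall outside the feature sequence")
    dims = len(span[0])
    return [sum(frame[d] for frame in span) / len(span) for d in range(dims)]
```

Pooling one speech vector per text unit yields N rows, which can then sit alongside the N rows of the text semantic feature matrix.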
[0064] The data acquisition module integrates audio-text synchronization annotation data, personal trait label data, and track positioning data to construct a dual-dimensional dataset of personal style and professional ability, which is then output to the dual-branch deep learning evaluation module. The standardized temporal speech feature sequence and text semantic feature matrix are simultaneously output to the standardized evaluation branch, while the standardized temporal speech feature sequence, text semantic feature matrix, labeled feature vector, and track-specific style benchmark feature library are simultaneously output to the personalized style recognition branch. At the same time, the track-specific excellent style library and industry benchmark database are output to the training target generation module respectively.
[0065] As shown in Figure 2, the dual-branch deep learning evaluation module receives a dual-dimensional dataset and processes it in parallel through a standardized evaluation branch and a personalized style recognition branch, generating standardized quantitative evaluation results of the target presenter's professional oral expression ability, personal style characteristics, and corresponding style value quantitative data, respectively.
[0066] In practical implementation, the dual-branch deep learning evaluation module is the core computing unit of the system, including a standardized evaluation branch and a personalized style recognition branch. The two branches are based on a GPU parallel computing architecture to achieve synchronous and independent computing, avoiding computational interference between branches and ensuring evaluation efficiency. Finally, they generate standardized quantitative evaluation results of the target presenter's oral expression professional ability, personal style characteristics, and corresponding style value quantitative data.
[0067] The standardized evaluation branch includes a built-in language rhythm analysis unit and a logical structure evaluation engine unit; the personalized style recognition branch includes a built-in expressive charm calculation unit and a style value quantification unit; and the module end includes a built-in result verification unit to complete the final data verification.
[0068] Furthermore, the dual-branch deep learning evaluation module includes a standardized evaluation branch and a personalized style recognition branch. The standardized evaluation branch has a built-in language rhythm analysis unit and a logical structure evaluation engine unit, while the personalized style recognition branch has a built-in expressive charm calculation unit and a style value quantification unit.
[0069] The language rhythm analysis unit receives standardized temporal speech feature sequences and performs temporal segmentation and feature quantification of inter-sentence pauses, intra-sentence pauses, stress placement, and intonation fluctuations. It calculates four core rhythm indicators: speech rate fluctuation rate, pause duration ratio, stress position matching degree, and intonation fluctuation smoothness. The four core rhythm indicators are then matched with standardized rhythm templates in the same track in the industry benchmark database for similarity. A language rhythm standardized scoring matrix is generated by calculating a weighted scoring formula and output to the logical structure evaluation engine unit.
[0070] The logical structure assessment engine unit performs text-level logical link analysis on the language rhythm standardized scoring matrix, identifying the logical structure levels of the spoken text, including the introduction, core viewpoint statement, case evidence, viewpoint progression, and concluding elevation. It calculates four core logical indicators: logical link completeness, core viewpoint support, contextual semantic coherence, and information density fit, generating a standardized logical structure scoring matrix. The standardized logical structure scoring matrix and the language rhythm standardized scoring matrix are then input into a multimodal coupled assessment model, outputting standardized individual scores and comprehensive scores for each dimension of spoken expression professional ability, forming a standardized quantitative assessment result of the target presenter's spoken expression professional ability.
[0071] In practical implementation, the core objective of the standardized assessment branch is to achieve an objective, standardized, and quantitative assessment of the presenter's professional oral expression ability, eliminating the interference of subjective factors in manual assessment. Specifically, the language rhythm analysis unit receives standardized temporal speech feature sequences, completes temporal segmentation based on audio-text synchronization timestamps, and accurately identifies temporal segments corresponding to inter-sentence pauses, intra-sentence pauses, stress points, and intonation fluctuations. Inter-sentence pauses are defined as the duration of silence between two adjacent sentences; intra-sentence pauses are defined as the duration of silence between adjacent word segments within a sentence; stress points are defined as the temporal segments corresponding to word segments where both the audio energy entropy and fundamental frequency F0 are above a preset threshold; and intonation fluctuations are defined as the temporal variation curve of the fundamental frequency F0.
[0072] After segmentation, the unit performs feature quantification on the temporal segments and calculates four core rhythm indicators: speech rate fluctuation rate, pause duration ratio, stress position matching degree, and intonation smoothness.
[0073] Speech rate volatility is calculated based on the sliding window variance, using the following formula:
[0074] ;
[0075] In the formula, the terms are the number of sliding windows, the speech rate within the k-th window, and the average speech rate of the entire segment. The smaller the fluctuation rate value, the more stable the speech rate control.
[0076] The pause duration ratio is calculated as $R_{pause} = (T_{inter} + T_{intra}) / T_{total}$, where $T_{inter}$ is the sum of the durations of the inter-sentence pauses, $T_{intra}$ is the total duration of the intra-sentence pauses, and $T_{total}$ is the total effective duration of the audio.
[0077] The stress position matching degree is calculated from the matching result between the semantic baseline stress positions and the actual stress points as $M_{stress} = N_{match} / N_{key}$, where $N_{match}$ is the number of keywords correctly matched to stress points and $N_{key}$ is the total number of keywords in the text. The smoothness of intonation fluctuations is calculated based on the variance of the fundamental frequency change rate, using the following formula:
[0078] ;
[0079] In the formula, the terms are the fundamental frequency difference between frame t and frame t+1 and the mean of the fundamental frequency difference across the entire segment. The smaller the variance, the smoother the intonation fluctuations.
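The four rhythm indicators described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, the speech rate fluctuation is assumed to be the plain population variance of per-window speech rates (the source's exact normalization is not reproduced in this text), and keywords are assumed to be counted as a set of distinct words.

```python
from statistics import pvariance


def speech_rate_variance(window_rates):
    """Population variance of per-window speech rates (assumed form of the fluctuation rate)."""
    return pvariance(window_rates)


def pause_duration_ratio(inter_pauses, intra_pauses, total_duration):
    """Share of the effective audio duration spent in inter- and intra-sentence pauses."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    return (sum(inter_pauses) + sum(intra_pauses)) / total_duration


def stress_match_degree(stressed_words, keywords):
    """Fraction of distinct text keywords that coincide with a detected stress point."""
    keyword_set = set(keywords)
    if not keyword_set:
        raise ValueError("keywords must not be empty")
    return len(keyword_set & set(stressed_words)) / len(keyword_set)


def intonation_variance(f0_frames):
    """Variance of frame-to-frame F0 differences; a smaller value means smoother intonation."""
    diffs = [b - a for a, b in zip(f0_frames, f0_frames[1:])]
    return pvariance(diffs)
```

For example, `pause_duration_ratio([0.5, 0.5], [0.2, 0.3], 10.0)` treats 1.5 seconds of pauses in 10 seconds of audio as a ratio of about 0.15.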
[0080] After obtaining the four core rhythm indicators, the language rhythm analysis unit performs cosine similarity matching between the indicators and standardized rhythm templates in the same track from the industry benchmark database. It then calculates and generates a standardized language rhythm scoring matrix using a weighted scoring formula:
[0081] ;
[0082] In the formula, the result is the standardized language rhythm score, ranging from 0 to 100 points. The four preset weighting coefficients satisfy a preset constraint and can be dynamically adjusted based on the track type through the track adaptation module. The four similarity terms are the cosine similarities between the baseline template and, respectively, the speech rate fluctuation rate, the pause duration ratio, the stress position matching degree, and the intonation smoothness; all four similarity values range from 0 to 1. The final generated language rhythm standardized scoring matrix includes the individual scores of the four indicators and a comprehensive score, and is output to the logical structure evaluation engine unit.
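A weighted score of this kind might be sketched as follows. Two points are assumptions rather than statements from the source: that the four weights sum to 1, and that the weighted similarity is scaled by 100 to reach the stated 0 to 100 range. The function name `rhythm_score` is hypothetical.

```python
def rhythm_score(similarities, weights):
    """Weighted rhythm score on a 0-100 scale.

    Assumes weights sum to 1 and similarities lie in [0, 1];
    both are illustrative assumptions, not taken from the source.
    """
    if len(similarities) != len(weights):
        raise ValueError("similarities and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return 100.0 * sum(w * s for w, s in zip(weights, similarities))
```

A track adaptation step could then pass a different `weights` tuple per track type without changing the scoring function.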
[0083] The logical structure evaluation engine unit synchronously receives the language rhythm standardized scoring matrix and the corresponding text semantic feature matrix, and completes the chapter-level logical link parsing based on the built-in spoken language logical feature map.
[0084] The spoken language logic feature graph is a pre-constructed hierarchical knowledge graph. The construction process is as follows: Collect spoken language expression text corpora labeled with different tracks, with a corpus size of no less than 5 million sentences. Extract the logical structure units and logical connections of spoken language expressions from the corpus. The logical structure units include introduction, core viewpoint statement, case evidence, viewpoint progression, transition and supplement, and conclusion and sublimation. The logical connections include causality, progression, parallelism, transition, and example. Construct a spoken language logic structure ontology. Based on the ontology, label the logical node attributes and connection rules between each logical structure unit. Finally, a hierarchical spoken language logic feature graph is formed. Each logical node in the graph corresponds to a preset semantic feature vector and matching rules, which are used for the identification and matching of spoken language logic chains.
[0085] The logical structure evaluation engine unit is based on the spoken logic feature graph. It uses a semantic matching algorithm to match each sentence in the text with the logical nodes in the graph, identifies the logical structure unit corresponding to each sentence, constructs the logical link topology of the target host's spoken expression based on logical connection relationships, and calculates four core logical indicators: logical link completeness, core viewpoint support, contextual semantic coherence, and information density fit.
[0086] The logical link completeness is calculated as $C_{link} = N_{cover} / N_{std}$, where $N_{cover}$ is the number of standard logical nodes covered by the target text and $N_{std}$ is the total number of standard logical nodes for the corresponding track. The formula for calculating the core viewpoint support is:
[0087] ;
[0088] In the formula, K represents the total number of core viewpoint nodes; the other terms are the number of supporting child nodes corresponding to the k-th core viewpoint and the preset threshold for the number of qualified supporting child nodes. The formula for calculating contextual semantic coherence is:
[0089] ;
[0090] In the formula, the terms are the semantic feature vector of the n-th sentence and the cosine similarity of the semantic vectors of adjacent sentences.
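Two of the logical indicators can be illustrated with a short sketch. Logical link completeness follows the covered-to-standard node ratio described above. For contextual semantic coherence, the sketch assumes the mean cosine similarity over adjacent sentence vectors, which is one plausible reading and not confirmed by the source. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_completeness(found_units, standard_units):
    """Fraction of the track's standard logical nodes that the text covers."""
    standard = set(standard_units)
    if not standard:
        raise ValueError("standard_units must not be empty")
    return len(standard & set(found_units)) / len(standard)


def contextual_coherence(sentence_vectors):
    """Mean cosine similarity of adjacent sentence vectors (assumed aggregation)."""
    pairs = list(zip(sentence_vectors, sentence_vectors[1:]))
    if not pairs:
        raise ValueError("at least two sentences are required")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```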
[0091] Information density fit is calculated based on text information entropy. The text information entropy is calculated first, where $p_i$ is the probability of the i-th word segmentation unit appearing in the text and L is the total number of words in the text. The fit is then calculated from this entropy and the mean of the baseline information entropy for the corresponding track.
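The entropy step can be sketched as below, assuming the standard Shannon entropy over relative token frequencies with a base-2 logarithm; the logarithm base and the subsequent fit formula are not recoverable from this text, so only the entropy itself is shown. The name `text_entropy` is hypothetical.

```python
import math
from collections import Counter


def text_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in a segmented text.

    Probabilities are estimated as relative frequencies; base 2 is an assumption.
    """
    if not tokens:
        raise ValueError("tokens must not be empty")
    total = len(tokens)
    counts = Counter(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A text that repeats a single word has zero entropy, while a text whose four words are all distinct has an entropy of 2 bits under these assumptions.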
[0092] After obtaining the four core logical indicators, the unit generates a standardized logical structure scoring matrix through weighted scoring. Then, the standardized logical structure scoring matrix and the standardized language rhythm scoring matrix are input into a pre-trained multimodal coupled evaluation model. This model is a multimodal fusion neural network based on the attention mechanism. It completes feature fusion through fully connected layers and activation functions, and finally outputs standardized individual scores and comprehensive scores for each dimension of oral expression professional ability. The individual scores include language rhythm score, logical expression score, articulation clarity score, and information transmission efficiency score. The comprehensive score is the weighted fusion result of each individual score, forming a standardized quantitative evaluation result of the target presenter's oral expression professional ability, which is output to the training target generation module and the result verification unit.
[0093] The core objective of the personalized style recognition branch is to identify the individual style characteristics of the target presenter and quantify their track value based on standardized evaluation, avoiding the obliteration of personalized expression characteristics by standardized evaluation. Specifically, the expression charm calculation unit receives standardized temporal speech feature sequences, text semantic feature matrices, and labeled feature vectors. It completes temporal alignment and feature fusion encoding through a cross-modal feature fusion layer. The encoding employs a Transformer model for style feature extraction based on contrastive learning. This model is pre-trained using large-scale expression data from presenters in the same track. The pre-training task is style classification and contrastive learning, enabling the model to learn the feature differences between different styles and the feature similarities within the same style. The model encoding process is as follows: the standardized temporal speech feature sequence is input into the speech Transformer encoder to generate a speech hidden layer feature sequence; the text semantic feature matrix is input into the text Transformer encoder to generate a text hidden layer feature sequence; the two hidden layer feature sequences are fused through a cross-modal attention layer to generate a cross-modal fused feature sequence; then, it is converted into a fixed-dimensional fused feature vector through an average pooling layer; finally, feature alignment and dimensionality correction are performed by combining the labeled feature vectors. The correction formula is as follows:
[0094] ;
[0095] In the formula, the terms are the cross-modal fusion feature vector, the feature mapping weight matrix, the labeled feature vector, and the vector concatenation operation. The correction ultimately generates a 256-dimensional personal style feature vector for the target host. At the same time, five core style dimensions are extracted: emotional intensity, linguistic appeal, personal identity, audience empathy, and expressive relaxation. The feature value of each dimension is calculated as the cosine similarity between the style feature vector and the corresponding dimension's baseline feature vector, with a value range of 0 to 1. The personal style feature vector and the five style dimension features are then output to the style value quantification unit.
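The per-dimension scoring can be sketched as a cosine similarity against one baseline vector per style dimension. Because raw cosine similarity can be negative while the text states a 0 to 1 range, the sketch clips negative values to 0; that clipping, the dimension keys, and the function names are assumptions for illustration only.

```python
import math

# Hypothetical keys for the five style dimensions named in the text.
STYLE_DIMENSIONS = (
    "emotional_intensity",
    "linguistic_appeal",
    "personal_identity",
    "audience_empathy",
    "expressive_relaxation",
)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dimension_scores(style_vector, baselines):
    """Score each style dimension as cosine similarity to its baseline, clipped to [0, 1]."""
    return {
        name: max(0.0, cosine(style_vector, baselines[name]))
        for name in STYLE_DIMENSIONS
    }
```

In the described system the vectors would be 256-dimensional; the function works for any common dimensionality.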
[0096] The style value quantification unit receives the style benchmark feature library of the same track and the personal style feature vector. It first uses the DBSCAN density clustering algorithm to cluster and match the personal style feature vector with the features in the benchmark feature library. The algorithm's neighborhood radius and minimum sample count (MinPts) are preset according to the data scale of the track. Clustering matching locates the style cluster to which the target host belongs and identifies its style type and unique distinguishing features; three core value indicators are then calculated: style scarcity, audience preference matching degree, and track competitiveness. The formula for calculating style scarcity is:
[0097] ;
[0098] In the formula, the terms are the cluster center vector of the cluster to which the target host belongs, the Euclidean distance between the personal style feature vector and that cluster center, the maximum Euclidean distance within the same-track style feature library, the number of samples in the corresponding cluster, and the total number of samples in the same track. Style scarcity ranges from 0 to 1; a higher value indicates a scarcer style.
[0099] The audience preference matching degree is calculated using a pre-built audience preference rating model. This model is trained on online public opinion data and user survey data. Given a personal style feature vector as input, it outputs a matching degree value in the range of 0 to 1. The higher the value, the better the style matches the audience's aesthetic preferences.
[0100] The formula for calculating track competitiveness takes as its terms the percentile ranking of the target host's style value score within the same track and the total number of samples in the same track. Track competitiveness ranges from 0 to 1; a higher value indicates stronger competitiveness within the track.
[0101] After obtaining the three core value indicators, the unit calculates and generates quantitative style value data using a value-weighted fusion formula, which is as follows:
[0102] ;
[0103] In the formula, the result is the quantitative score for the overall value of the style, ranging from 0 to 100 points. The three preset dynamic weighting coefficients satisfy a preset constraint and can be dynamically adjusted based on track type and market demand. The final generated style value quantification data includes the individual scores of the three indicators and the comprehensive score, and is simultaneously output to the training target generation module and the result verification unit.
[0104] The result verification unit at the end of the module receives the standardized quantitative evaluation results, personal style feature vectors, and style value quantitative data. It applies a preset criterion to complete outlier detection, removes abnormal results that deviate from the preset confidence interval, and corrects them using linear interpolation. At the same time, it verifies the consistency between the standardized evaluation results and the style feature data to avoid logically contradictory results. Finally, the verified standardized quantitative evaluation results, personal style features, and style value quantitative data are packaged into a structured evaluation result set and output to the training target generation module.
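The outlier handling can be illustrated with a small sketch. The detection criterion is not legible in this text, so the sketch assumes a mean plus or minus k standard deviations rule with a hypothetical parameter `k`; flagged values are replaced by linear interpolation between the nearest accepted neighbours, as the paragraph describes.

```python
from statistics import fmean, pstdev


def repair_outliers(values, k=3.0):
    """Replace values outside mean +/- k*std by linear interpolation.

    The mean/std rule and the default k are assumptions for illustration.
    """
    mean, std = fmean(values), pstdev(values)
    ok = [abs(v - mean) <= k * std for v in values]
    repaired = list(values)
    for i, good in enumerate(ok):
        if good:
            continue
        # Nearest accepted neighbours on each side of the flagged position.
        left = next((j for j in range(i - 1, -1, -1) if ok[j]), None)
        right = next((j for j in range(i + 1, len(values)) if ok[j]), None)
        if left is not None and right is not None:
            t = (i - left) / (right - left)
            repaired[i] = values[left] + t * (values[right] - values[left])
        elif left is not None:
            repaired[i] = values[left]
        elif right is not None:
            repaired[i] = values[right]
    return repaired
```

With small samples a strict threshold rarely triggers, so the usage below passes `k=2.0` purely to demonstrate the interpolation path.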
[0105] Furthermore, the logical structure evaluation engine unit also has a built-in spoken logic feature graph. The generation process of the spoken logic feature graph includes:
[0106] S10. Collect labeled spoken language texts of hosts from different tracks, extract the logical structure units and logical connections of spoken language expressions in the texts, and construct a spoken language logical structure ontology.
[0107] S20. Based on the oral logic structure ontology, the logical node attributes corresponding to each logical structure unit, namely topic introduction, logical progression, viewpoint argumentation, and concluding sublimation, are labeled to construct a hierarchical oral logic feature map for the identification and matching of oral logic chains.
[0108] The expressive charm calculation unit receives standardized temporal speech feature sequences, text semantic feature matrices, and labeled feature vectors. Through a Transformer model based on contrastive learning, it performs cross-modal fusion encoding of speech temporal features and text semantic features, extracting five core style dimensions: emotional intensity, language persuasiveness, personal identification, audience empathy, and expressive relaxation. It then combines the labeled feature vectors to perform feature alignment and dimension correction, generating the target presenter's personal style feature vector, which is then output to the style value quantification unit.
[0109] The style value quantification unit receives the style benchmark feature library of the same track and the personal style feature vector. It uses a density clustering algorithm to cluster and match the personal style feature vector with the features in the style benchmark feature library of the same track to locate the style cluster to which the target host belongs. Then, it calculates three core value indicators of the personal style feature vector in the corresponding track: style scarcity, audience preference matching degree, and track competitiveness. The style value quantification data is generated by calculating the value weighted fusion formula.
[0110] The training objective generation module receives standardized quantitative evaluation results and style value quantitative data. It compares these with industry benchmark databases and excellent style libraries in the same track to identify professional capability shortcomings and high-value style features that can be strengthened. It then generates two-dimensional training objectives to address shortcomings and enhance strengths, and outputs these two-dimensional training objectives to the personalized optimization training module.
[0111] Furthermore, when the training objective generation module performs processing, it receives the standardized quantitative evaluation results, compares the evaluation data of each sub-item in the standardized quantitative evaluation results with the benchmark threshold of the corresponding track in the industry benchmark database, and marks the sub-items that are lower than the benchmark threshold as professional capability shortcomings.
[0112] It receives individual style characteristics and style value quantification data, performs similarity matching with style characteristics in the excellent style library of the same track, marks features with similarity below the threshold and scores above the passing line as high-value style characteristics that can be strengthened, and marks features with similarity above the threshold as homogeneous features that need to be avoided.
[0113] In practice, the training objective generation module receives a structured evaluation result set and simultaneously retrieves industry benchmark databases and excellent style libraries from the same track. The core is to generate dual-dimensional training objectives that address weaknesses and strengthen strengths, thereby achieving personalized training guidance that addresses weaknesses and enhances strengths, and avoiding homogeneous training.
[0114] The module first executes the professional competency gap identification process, comparing the evaluation data of each sub-item in the standardized quantitative evaluation results with the benchmark threshold of the corresponding track in the industry benchmark database. The benchmark threshold is set to the 80th percentile of the industry average in the corresponding track. Sub-items below the benchmark threshold are marked as professional competency gaps. At the same time, the score difference of the gap sub-items is calculated. The larger the difference, the higher the training priority, and a gap filling priority list is generated.
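The gap identification step can be sketched as a comparison of each sub-item score against its track benchmark, followed by sorting on the size of the gap. The sub-item names and the function name are hypothetical; how the benchmark thresholds themselves are derived is outside this sketch.

```python
def gap_priority_list(scores, benchmarks):
    """Return (sub_item, gap) pairs for sub-items below benchmark, largest gap first."""
    gaps = {
        item: benchmarks[item] - score
        for item, score in scores.items()
        if score < benchmarks[item]
    }
    return sorted(gaps.items(), key=lambda pair: pair[1], reverse=True)
```

For instance, with a benchmark of 80 on every sub-item, a presenter scoring 60 on articulation clarity and 72 on language rhythm would have articulation clarity ranked first for training.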
[0115] The module then executes the high-value style feature identification process, matching the personal style feature vector against the style features in the excellent style library of the same track using cosine similarity. Features with a similarity below 0.6 and a style value score above 60 are marked as high-value strengthenable style features. These features represent the unique style advantages of the target presenter, carry high market value and competitiveness within the track, and require targeted reinforcement. Features with a similarity above 0.8 are marked as homogeneous features to be avoided: they overlap with the styles of most presenters in the same track, lack differentiated competitiveness, and should not be over-reinforced during training.
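The threshold rule in this paragraph can be sketched directly. The thresholds 0.6, 0.8, and 60 come from the text; the label strings, the function name, and the "neutral" fallback for features that meet neither condition are assumptions.

```python
def classify_style_feature(similarity, value_score,
                           low_sim=0.6, high_sim=0.8, value_line=60.0):
    """Label a style feature from its library similarity and style value score."""
    if similarity > high_sim:
        return "homogeneous_avoid"
    if similarity < low_sim and value_score > value_line:
        return "high_value_strengthen"
    return "neutral"  # assumed fallback; the text does not name this case
```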
[0116] The training objective generation module generates two-dimensional training objectives based on a priority list for addressing weaknesses and a list of high-value, reinforceable style features. Each training objective has clearly defined quantitative indicators, completion thresholds, and training priorities, which are then output to the personalized optimization training module.
[0117] The personalized optimization training module receives two-dimensional training objectives, generates a personalized phased optimization training plan and outputs the corresponding training content, while dynamically collecting training process data and iteratively optimizing the personalized phased optimization training plan.
[0118] Furthermore, when the personalized optimization training module generates a personalized phased optimization training plan, it matches the corresponding rhythm control training unit, logical expression training unit, and expressive charm enhancement training unit based on the professional ability shortcomings in the dual-dimensional training objectives.
[0119] Based on the strengthenable style features in the dual-dimensional training objectives, corresponding style-oriented reinforcement training units are matched. The training stages are divided in the order of basic ability completion, style-oriented reinforcement, and comprehensive ability integration, and training objectives and content for each stage are set to generate a personalized, phased optimization training plan.
[0120] Furthermore, as shown in Figure 3, when the personalized optimization training module performs iterative optimization, it collects the target host's training audio data, training text data, and training completion data in real time during the execution of each training stage. The dual-branch deep learning evaluation module generates real-time standardized evaluation data and real-time style feature retention data.
[0121] The real-time standardized assessment data is compared with the training objectives of the corresponding stage to judge whether professional competence meets the standard, and the difficulty and pace of subsequent training content are adjusted based on the result. At the same time, based on the real-time style feature retention data, the training weight of the style enhancement training units is adjusted, and the personalized phased optimization training scheme is iteratively optimized.
[0122] In practice, the personalized optimization training module receives two-dimensional training objectives. Its core function is to generate personalized, phased optimization training plans and simultaneously achieve dynamic iterative optimization of the training process, forming a closed-loop training system.
[0123] When generating training plans, the personalized optimization training module matches corresponding basic training units based on the professional skill gaps in the dual-dimensional training objectives. These units include rhythm control training units, logical expression training units, articulation clarity training units, and information transmission efficiency training units. Each training unit contains training content, training methods, and assessment standards at different difficulty levels.
[0124] Based on the high-value strengthenable style features, corresponding style-oriented reinforcement training units are matched, including emotional expression reinforcement units, personal identity reinforcement units, and empathy reinforcement units. Each training unit contains excellent examples, imitation training content, and innovative expression training tasks for the corresponding style.
[0125] The personalized optimization training module is divided into three progressive training stages, following the order of basic ability improvement, style-oriented reinforcement, and comprehensive ability integration. Each stage has a corresponding training cycle, training objectives, training content, and assessment standards.
[0126] The training is divided into three phases: Phase 1 is the basic skills enhancement phase, with core training content consisting of basic training units corresponding to weaknesses in professional skills, aiming to ensure that the scores of all weak areas reach or exceed the industry benchmark threshold; Phase 2 is the style-oriented enhancement phase, with core training content consisting of style-oriented enhancement training units corresponding to high-value strengthenable style features, supplemented by basic skills consolidation training, aiming to strengthen unique personal style advantages and improve the quantitative score of style value; Phase 3 is the comprehensive skills integration phase, with core training content consisting of full-scenario simulated hosting training, deeply integrating basic professional skills with personal style, aiming to achieve a comprehensive expression effect that meets professional skill standards and highlights personal style, ultimately generating a complete personalized phased optimization training plan and delivering the corresponding training content to the target host.
[0127] During training execution, the personalized optimization training module synchronously performs dynamic iterative optimization of the training plan. During each training phase, it collects training audio data, training text data, and training completion data of the target presenter in real time. The collected training data undergoes the same preprocessing and feature extraction process as the data acquisition module. A dual-branch deep learning evaluation module generates real-time standardized evaluation data and real-time style feature retention data. The personalized optimization training module compares the real-time standardized evaluation data with the training objectives for the corresponding phase to calculate the objective achievement rate, using the formula:
[0128] $\eta = \dfrac{S_{rt} - S_{0}}{S_{target} - S_{0}} \times 100\%$;
[0129] In the formula, $\eta$ is the objective achievement rate, $S_{rt}$ is the real-time evaluation score, $S_{0}$ is the initial evaluation score, and $S_{target}$ is the score set by the training objective for the corresponding stage. When the objective achievement rate is below 60%, professional ability is judged not to have met the standard: the difficulty of subsequent training content is adjusted, the training pace is reduced, and the training frequency and duration of the corresponding sub-items are increased. When the objective achievement rate is above 90%, professional ability is judged to have met the standard ahead of schedule: the difficulty of training content is increased and the training pace is accelerated.
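The pacing rule can be sketched as follows, assuming the achievement rate is the achieved score gain divided by the targeted score gain, expressed as a percentage. The 60% and 90% thresholds come from the text; the function names and decision labels are hypothetical.

```python
def achievement_rate(realtime_score, initial_score, target_score):
    """Percentage of the targeted score gain achieved so far (assumed definition)."""
    if target_score == initial_score:
        raise ValueError("target_score must differ from initial_score")
    return (realtime_score - initial_score) / (target_score - initial_score) * 100.0


def pacing_decision(rate):
    """Map an achievement rate (percent) to a training adjustment label."""
    if rate < 60.0:
        return "slow_down"   # lower pace, add practice on the weak sub-items
    if rate > 90.0:
        return "speed_up"    # raise difficulty, accelerate pace
    return "keep"            # assumed behaviour between the two thresholds
```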
[0130] Meanwhile, the personalized optimization training module adjusts the training weights of the style enhancement training unit based on real-time style feature retention data. Style feature retention is the cosine similarity between the trained style feature vector and the initial personal style feature vector. When the retention is below 0.7, it indicates that personal style characteristics have been lost during the basic ability training process. The weight of the basic training unit is reduced, and the weight of the style enhancement training unit is increased to avoid standardized training from erasing personalized style. When the retention is above 0.9 and the style value score improvement meets expectations, the weight of the style enhancement training unit is further increased to strengthen style advantages.
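The retention-based weight adjustment can be sketched as below. The 0.7 and 0.9 thresholds come from the text; the fixed adjustment `step`, the function name, and the decision to leave weights unchanged between the thresholds are illustrative assumptions.

```python
def adjust_training_weights(basic_weight, style_weight, retention,
                            value_improved, step=0.1):
    """Shift weight between basic and style units based on style retention.

    retention is the cosine similarity between the trained and initial
    style feature vectors; step is a hypothetical adjustment size.
    """
    if retention < 0.7:
        # Style is being lost: ease basic training, reinforce style training.
        return max(0.0, basic_weight - step), style_weight + step
    if retention > 0.9 and value_improved:
        # Style is intact and its value is improving: reinforce it further.
        return basic_weight, style_weight + step
    return basic_weight, style_weight
```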
[0131] Based on the above adjustments, the personalized optimization training module iteratively optimizes the personalized phased optimization training scheme in real time. At the same time, after the end of each training phase, a full-dimensional re-evaluation is performed. Based on the re-evaluation results, the training objectives and training content of the next phase are adjusted to achieve a closed-loop iteration of training-evaluation-optimization.
[0132] The intelligent evaluation and enhancement system for presenters' oral expression also has a built-in track adaptation module. The track adaptation module receives the track positioning data of the target presenter and matches the corresponding track type, which includes news track, variety show track, interview track, and gala track.
[0133] Based on the matched track type, the weight ratio of each sub-evaluation data in the standardized evaluation branch is adjusted, the calculation dimension of the style value quantification model in the personalized style recognition branch is adjusted, and the industry benchmark database and the excellent style library of the same track are retrieved to adapt the evaluation rules and training content generation rules of the corresponding track.
[0134] In practice, the system's built-in track adaptation module receives the track positioning data of the target host, matches the corresponding track type, and completes the track adaptation adjustment of the entire process based on the matching results, ensuring that the system's evaluation rules, weight configuration, benchmark data, and training content are fully in line with the industry requirements and scenario characteristics of the corresponding track.
[0135] The track adaptation module first adjusts the weighting of each sub-item in the standardized evaluation branch. For example, in the news track, logical expression, speech rate control, and information transmission efficiency have higher weights, while in the variety show track, language rhythm, emotional expression, and interactivity have higher weights.
[0136] The module then adjusts the weighting of the calculation dimensions of the style value quantification model in the personalized style recognition branch. For example, the weighting of audience empathy is higher in the interview track, and the weighting of aura and personal identity is higher in the gala track. At the same time, it retrieves the industry benchmark database and the excellent style library of the same track for the corresponding track and uses them in place of the general benchmark data, ensuring that the evaluation results meet the industry standards of the corresponding track.
[0137] The training content generation rules are adapted to the corresponding track, ensuring that the scenarios, text, and assessment standards of the training content are consistent with the real hosting scenarios of the corresponding track, thus guaranteeing the relevance and practicality of the training content.
[0138] This embodiment uses a dual-branch parallel evaluation architecture to simultaneously achieve standardized quantitative evaluation of the presenter's professional oral expression ability and value quantification of personal style, solving the technical problem of the separation between standardization and personalization in existing technologies; through the generation of dual-dimensional training objectives and iterative optimization of phased training schemes, it achieves personalized, closed-loop enhancement of oral expression ability.
[0139] Based on the preferred embodiments of the present invention described above, those skilled in the art can make various changes and modifications without departing from the inventive concept. The technical scope of this invention is not limited to the contents of the specification, but must be determined according to the scope of the claims.
Claims
1. A system for intelligent evaluation and enhancement of a presenter's oral expression, characterized in that it comprises a data acquisition module, a dual-branch deep learning evaluation module, a training target generation module, and a personalized optimization training module. The data acquisition module is used to acquire the target host's historical spoken expression data, spoken expression text data, personal trait label data, and track positioning data. At the same time, it acquires big data on the expression patterns of outstanding hosts in the same track, constructs a dual-dimensional dataset of personal style and professional ability, and outputs the dual-dimensional dataset to the dual-branch deep learning evaluation module. The dual-branch deep learning evaluation module receives the dual-dimensional dataset and processes it in parallel through a standardized evaluation branch and a personalized style recognition branch to generate standardized quantitative evaluation results of the target presenter's professional oral expression ability, personal style characteristics, and corresponding style value quantification data. The training target generation module receives the standardized quantitative evaluation results and style value quantification data, compares them with the industry benchmark database and the excellent style library in the same track, identifies professional ability shortcomings and high-value strengthenable style features, generates dual-dimensional training objectives covering shortcomings and strengths, and outputs the dual-dimensional training objectives to the personalized optimization training module.
The personalized optimization training module receives the dual-dimensional training objectives, generates a personalized phased optimization training scheme and outputs the corresponding training content, while dynamically collecting training process data and iteratively optimizing the personalized phased optimization training scheme.
2. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 1, characterized in that: The data acquisition module preprocesses and standardizes the features of the historical spoken expression data, spoken expression text data, personal trait label data, track positioning data, and big data on the expression patterns of outstanding hosts in the same track contained in the dual-dimensional dataset, and outputs standardized temporal speech feature sequences, text semantic feature matrices, labeled feature vectors, and a style benchmark feature library of the same track. The standardized temporal speech feature sequence and text semantic feature matrix are input into the standardized evaluation branch, and at the same time the standardized temporal speech feature sequence, text semantic feature matrix, labeled feature vector, and style benchmark feature library of the same track are input into the personalized style recognition branch, triggering parallel, independent computation in the two branches.
3. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 2, characterized in that: The dual-branch deep learning evaluation module includes a standardized evaluation branch and a personalized style recognition branch. The standardized evaluation branch has a built-in language rhythm analysis unit and a logical structure evaluation engine unit. The personalized style recognition branch has a built-in expressive charm calculation unit and a style value quantification unit. The language rhythm analysis unit receives the standardized temporal speech feature sequence and performs temporal segmentation and feature quantification of inter-sentence pauses, intra-sentence pauses, stress placement, and intonation fluctuations. It calculates four core rhythm indicators: speech rate fluctuation rate, pause duration ratio, stress position matching degree, and intonation fluctuation smoothness. The four core rhythm indicators are compared for similarity against the standardized rhythm templates of the same track in the industry benchmark database. A language rhythm standardized scoring matrix is then generated using a weighted scoring formula and output to the logical structure evaluation engine unit. The logical structure evaluation engine unit performs text-level logical link analysis on the language rhythm standardized scoring matrix, identifies the logical structure levels of the spoken text, including the introduction, core viewpoint explanation, case evidence, viewpoint progression, and conclusion, and calculates four core logical indicators: logical link completeness, core viewpoint support, contextual semantic coherence, and information density fit, generating a logical structure standardized scoring matrix.
The logical structure standardized scoring matrix and the language rhythm standardized scoring matrix are then input into a multimodal coupled evaluation model, which outputs standardized individual scores and comprehensive scores for each dimension of spoken expression professional ability, forming the standardized quantitative evaluation result of the target presenter's spoken expression professional ability. The expressive charm calculation unit receives the standardized temporal speech feature sequence, text semantic feature matrix, and labeled feature vector. Using a contrastive-learning-based Transformer model for style feature extraction, it performs cross-modal fusion encoding of speech temporal features and text semantic features and extracts five core style dimensions: emotional intensity, language persuasiveness, personal identification, audience empathy, and expressive relaxation. It then combines the labeled feature vector to perform feature alignment and dimension correction, generates the target host's personal style feature vector, and outputs it to the style value quantification unit. The style value quantification unit receives the style benchmark feature library of the same track and the personal style feature vector. It uses a density clustering algorithm to cluster and match the personal style feature vector with the features in the style benchmark feature library of the same track, locating the style cluster to which the target host belongs. It then calculates three core value indicators of the personal style feature vector in the corresponding track: style scarcity, audience preference matching degree, and track competitiveness. The style value quantification data is generated from these indicators using a value-weighted fusion formula.
4. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 1, characterized in that: When the training target generation module performs processing, it receives the standardized quantitative evaluation results, compares the evaluation data of each sub-item in the standardized quantitative evaluation results with the benchmark threshold of the corresponding track in the industry benchmark database, and marks the sub-items that fall below the benchmark threshold as professional ability shortcomings. It also receives the personal style characteristics and style value quantification data, performs similarity matching against the style features in the excellent style library of the same track, marks features with similarity below the threshold and scores above the passing line as high-value strengthenable style features, and marks features with similarity above the threshold as homogeneous features to be avoided.
5. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 1, characterized in that: When the personalized optimization training module generates a personalized phased optimization training scheme, it matches the corresponding rhythm control training unit, logical expression training unit, and expressive charm enhancement training unit based on the professional ability shortcomings in the dual-dimensional training objectives, and matches the corresponding style-oriented reinforcement training units based on the high-value strengthenable style features in the dual-dimensional training objectives. The training stages are divided in the order of basic ability completion, style-oriented reinforcement, and comprehensive ability integration, and training objectives and content are set for each stage to generate the personalized phased optimization training scheme.
6. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 1, characterized in that: When the personalized optimization training module performs iterative optimization, it collects the target host's training audio data, training text data, and training completion data in real time during the execution of each training stage, and generates real-time standardized evaluation data and real-time style feature retention data through the dual-branch deep learning evaluation module. The real-time standardized evaluation data is compared with the training objectives of the corresponding stage to determine the professional competence level. Based on the judgment results, the difficulty and training pace of subsequent training content are adjusted. At the same time, based on the real-time style feature retention data, the training weight of the style enhancement training unit is adjusted, and the personalized phased optimization training scheme is iteratively optimized.
7. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 1, characterized in that: The system further has a built-in track adaptation module. The track adaptation module receives the track positioning data of the target host and matches the corresponding track type. The track types include news track, variety show track, interview track, and gala track. Based on the matched track type, the weight of each sub-item's evaluation data in the standardized evaluation branch is adjusted, the calculation dimension of the style value quantification model in the personalized style recognition branch is adjusted, and the industry benchmark database and the excellent style library of the same track are retrieved, so as to adapt the evaluation rules and training content generation rules to the corresponding track.
8. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 3, characterized in that: The logical structure evaluation engine unit also has a built-in spoken logic feature graph, the generation process of which includes: S10. Collect labeled spoken language text data of hosts from different tracks, extract the logical structure units and logical connection relationships of spoken language expressions in the data, and construct a spoken logic structure ontology; S20. Based on the spoken logic structure ontology, label the logical node attributes corresponding to topic introduction, logical progression, viewpoint argumentation, and concluding sublimation for each logical structure unit, and construct a hierarchical spoken logic feature graph for the identification and matching of spoken logic chains.
9. The intelligent evaluation and enhancement system for presenter's oral expression according to claim 2, characterized in that: The data acquisition module also performs the following steps: S101. Perform timestamp alignment processing on the acquired historical spoken expression data and spoken expression text data to generate audio-text synchronization annotation data; S102. Perform track classification, style tag labeling, and standardization preprocessing on the big data on the expression patterns of outstanding hosts in the same track, and build the excellent style library of the same track and the industry benchmark database; S103. Integrate the audio-text synchronization annotation data, personal trait label data, and track positioning data into the dual-dimensional dataset of personal style and professional ability, and output the excellent style library of the same track and the industry benchmark database to the training target generation module.