A method and system for sentiment analysis of a Chinese teaching evaluation text

By combining a cross-head attention mechanism with knowledge distillation, a lightweight student model is constructed that addresses the large parameter counts and low inference efficiency of models used for educational evaluation text analysis. It enables efficient sentiment analysis in resource-constrained environments while improving classification accuracy and comprehension ability.

CN122240832A · Pending · Publication Date: 2026-06-19 · ZHEJIANG WANLI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG WANLI UNIV
Filing Date
2026-02-09
Publication Date
2026-06-19

Abstract

This invention discloses a method and system for sentiment analysis of Chinese teacher evaluation texts. The method is as follows: S1: Preprocess the Chinese teacher evaluation text data and map the processed text into a sequence of word vectors as input; S2: Construct a lightweight student model and extract and fuse the features of the word vectors from step S1 through a Transformer encoder layer; S3: Enhance the structure of the student model to obtain an enhanced text representation; S4: Based on the enhanced student model obtained in step S3, introduce a teacher model and train it with the enhanced student model using a knowledge distillation method. By jointly minimizing the soft label loss and hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized; S5: Use the trained student model to classify the sentiment of the teacher evaluation text to be tested. Calculate the probability distribution of the classification labels through a task-adapted output layer to obtain the sentiment analysis results.

Description

Technical Field

[0001] This invention belongs to the field of natural language processing (NLP) and involves text sentiment analysis, deep learning model optimization, and knowledge distillation. In particular, it relates to a method and system for sentiment analysis of Chinese teaching evaluation texts based on a cross-head attention mechanism and knowledge distillation, which can be applied to educational informatization scenarios such as educational evaluation systems, student evaluation text processing, and teaching quality assessment.

Background Technology

[0002] In recent years, pre-trained language models based on the Transformer architecture have been widely adopted in natural language processing. Among them, the BERT model uses a bidirectional self-attention mechanism to model semantic relationships between arbitrary positions in a sequence, achieving good results in tasks such as text classification and sentiment analysis. To reduce model size and inference cost, existing techniques have proposed various lightweight distillation methods based on BERT. For example, DistilBERT compresses the model by reducing the number of Transformer encoder layers and combining this with a distillation loss function, while TinyBERT further reduces the parameter count by compressing both network depth and hidden layer dimensions and by introducing a linear projection matrix and a hierarchical distillation strategy. Meanwhile, some research has attempted to improve the multi-head attention structure by introducing interaction or fusion mechanisms between attention heads, in order to alleviate the feature redundancy caused by the independence of the heads in traditional multi-head attention. However, in resource-constrained and semantically complex application scenarios such as educational evaluation text analysis, these existing technologies still have shortcomings. For example, the parameter size of lightweight models is still relatively large, which limits inference efficiency on low-computing-power servers or edge devices. Existing multi-head attention and its improved variants still cannot fully characterize the implicit transitional relationships and deep semantic connections in educational evaluation texts. The distillation process focuses mainly on the numerical alignment of outputs or intermediate representations, making it difficult to convey the attention distribution characteristics and decision-making logic of the teacher model. In addition, excessive compression of the network structure easily degrades semantic expression capability, and high-quality labeled data in the field of educational evaluation is relatively scarce. As a result, existing models perform poorly in domain adaptability and fine-grained sentiment recognition, which limits their use in practical educational evaluation scenarios.

Summary of the Invention

[0003] To address the aforementioned technical problems, this invention discloses a method and system for sentiment analysis of Chinese teaching evaluation texts based on cross-head attention mechanism and knowledge distillation.

[0004] The present invention adopts the following technical solution:

[0005] A sentiment analysis method for Chinese teaching evaluation texts includes the following steps:

[0006] S1: Preprocess the Chinese teaching evaluation text data and map the processed text into a word vector sequence as input;

[0007] S2: Construct a lightweight student model, which includes an embedding layer, a Transformer encoder layer, and a task-adaptive output layer. The Transformer encoder layer extracts and fuses features from the word vectors in step S1.

[0008] S3: Perform structural enhancement on the student model in step S2. Use the cross-head attention mechanism to perform cross-head feature recombination and enhancement on the output of multi-head attention, and use the multi-context feedforward fusion module to perform multi-scale modeling and adaptive fusion of context features to obtain enhanced text representation.

[0009] S4: Based on the enhanced student model obtained in step S3, introduce the teacher model and train it together with the enhanced student model using the knowledge distillation method. By jointly minimizing the soft label loss and hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized.

[0010] S5: Use the trained student model to classify the sentiment of the teaching text to be assessed, and calculate the probability distribution of the classification labels through the task adaptation output layer to obtain the sentiment analysis results.

[0011] This invention also discloses a Chinese teacher evaluation text sentiment analysis system for performing the above method, which includes the following modules:

[0012] Preprocessing module: preprocesses the Chinese teaching evaluation text data, mapping the processed text into a word vector sequence as input;

[0013] Student model building module: Constructs a lightweight student model, which includes an embedding layer, a Transformer encoder layer and a task adaptation output layer. The Transformer encoder layer extracts and fuses features from the word vectors in step S1.

[0014] Student Model Enhancement Module: This module enhances the structure of the constructed student model by using a cross-head attention mechanism to reorganize and enhance the output of the multi-head attention within the original Transformer encoder layer, and by using a multi-context feedforward fusion module to perform multi-scale modeling and adaptive fusion of context features to obtain an enhanced text representation.

[0015] Iterative optimization module: Based on the obtained enhanced student model, a teacher model is introduced and trained with the enhanced student model using the knowledge distillation method. By jointly minimizing the soft label loss and hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized.

[0016] Sentiment Analysis Module: This module uses a trained student model to classify the sentiment of the assessment text. The probability distribution of the classification labels is calculated through the task-adapted output layer to obtain the sentiment analysis results.

[0017] This invention aims to significantly reduce the number of model parameters and the computational cost by improving the multi-head attention mechanism and combining it with a precise knowledge distillation strategy, while enhancing the lightweight model's ability to understand long and complex sentences and multifaceted semantics in educational evaluation texts. By designing a more effective knowledge distillation method, the invention achieves efficient transmission of the teacher model's decision logic, thereby further improving the accuracy of sentiment classification while ensuring inference speed. Ultimately, it achieves a balance between parameter size, inference speed, and classification accuracy, providing an efficient and reliable lightweight solution for sentiment analysis of educational evaluation texts.

Attached Figure Description

[0018] Figure 1 is a flowchart of the cross-head attention mechanism according to a preferred embodiment of the present invention.

[0019] Figure 2 is a knowledge distillation flowchart according to a preferred embodiment of the present invention.

[0020] Figure 3 is a block diagram of a Chinese teacher evaluation text sentiment analysis system according to a preferred embodiment of the present invention.

Detailed Implementation

[0021] The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0022] As shown in Figures 1-2, this embodiment provides a sentiment analysis method for Chinese teaching evaluation texts based on a cross-head attention mechanism and knowledge distillation, including the following steps:

[0023] S1: Preprocess the Chinese teaching evaluation text data, including data cleaning, hierarchical sampling and text encoding, and map the processed text into a word vector sequence as network input;

[0024] S2: Construct a lightweight student model (architecture shown in Figure 1). The student model includes an embedding layer, a Transformer encoder layer, and a multi-context feedforward fusion module, and extracts and fuses features from the input vectors through the original Transformer encoder layer;

[0025] S3: For the student model in step S2, structural enhancement is performed. The cross-head attention mechanism (CMHA) is used to perform cross-head feature reorganization and enhancement on the output of the multi-head attention in the original Transformer encoder layer. The multi-context feedforward fusion module is used to perform multi-scale modeling and adaptive fusion of context features to obtain enhanced text representation.

[0026] S4: Based on the enhanced student model obtained in step S3, introduce the teacher model and simultaneously train it and the enhanced student model using the knowledge distillation method. By jointly minimizing the soft label loss and hard label loss, the decision logic of the teacher model is passed to the student model, and the model hyperparameters are iteratively optimized.

[0027] S5: Use the trained student model to classify the sentiment of the teaching text to be assessed, and calculate the probability distribution of the classification labels through the task adaptation output layer to obtain the final sentiment analysis results.

[0028] In step S1 of this embodiment, the original Chinese evaluation text dataset is first obtained. It contains multiple evaluation records, each consisting of the evaluation text and its corresponding rating label. The raw data then undergoes multi-level cleaning and normalization. Non-text content removal: HTML tags, URL links, specific identifiers, and numeric strings longer than a preset threshold are identified with regular expressions and removed. Text normalization: full-width characters are converted to half-width characters, traditional Chinese is converted to simplified Chinese, and the text encoding format is standardized. Evaluation-specific noise removal: templated evaluation content is identified and removed, irrelevant terms are filtered with an institution-specific stop word list, and duplicate text fragments are detected and merged. Hierarchical sampling: the format of the dataset is standardized, uniformly adopting the "sentiment content" format. Text encoding: the processed text is word-encoded and mapped to a word vector sequence that serves as the network input.
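
The cleaning steps of S1 can be illustrated with a minimal sketch using only the Python standard library. The patterns, the digit threshold `MAX_DIGITS`, and the function names are assumptions made for illustration; the source does not specify them. Traditional-to-simplified conversion and stop word filtering are omitted because they need external resources (for example a third-party converter such as OpenCC and an institution-specific word list), and exact-match de-duplication stands in for the duplicate detection and merging described above.

```python
import re
import unicodedata

# Hypothetical patterns and threshold; the source does not give concrete values.
HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+|www\.\S+")
MAX_DIGITS = 6  # numeric strings longer than this are treated as noise
LONG_NUMBER = re.compile(r"\d{%d,}" % (MAX_DIGITS + 1))
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Remove non-text content and normalize character width."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = LONG_NUMBER.sub(" ", text)
    # NFKC folds full-width letters, digits and punctuation to half-width.
    text = unicodedata.normalize("NFKC", text)
    return WHITESPACE.sub(" ", text).strip()


def deduplicate(records):
    """Drop records whose cleaned text was already seen, keeping order."""
    seen = set()
    unique = []
    for label, text in records:
        cleaned = clean_text(text)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append((label, cleaned))
    return unique
```

Each record is assumed to be a `(label, text)` pair, mirroring the "sentiment content" format mentioned above.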

[0029] In step S3 of this embodiment, the specific process of the cross-head attention mechanism (CMHA) is as follows:

[0030] S31: For the input features, multiple independent attention heads are constructed based on the multi-head attention mechanism of the original Transformer. Consider an input sentence $X = [x_1, x_2, \ldots, x_L]$ of sequence length $L$, where $x_L$ is the last input element of the sequence $X$ and $d$ is the feature dimension. Multi-head attention first applies linear transformations to obtain $Q_i$, $K_i$, and $V_i$, which are the query, key, and value matrices, respectively, and $i$ is the attention head index. The calculation formula is as follows:

[0031] $Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V}$

[0032] where $i = 1, \ldots, h$, $h$ is the number of attention heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$ are the trainable weight matrices of each head;

[0033] S32: Calculate weights and aggregate features for each attention head, using the following formula:

[0034] $\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$

[0035] where softmax is the normalization function and $\sqrt{d_k}$ is the scaling factor;

[0036] S33: The global feature is obtained by concatenating the outputs of all attention heads and applying a linear transformation. The calculation formula is as follows:

[0037] $O_{MHA} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$

[0038] where $W^{O}$ is the output projection matrix, Concat is the concatenation operation, and $O_{MHA}$ is the aggregated multi-head attention output;

[0039] S34: Construct a cross-head interaction layer. The aggregated output feature matrix is fed into a linear projection layer for feature space mapping. The linear projection layer is the specific implementation of the cross-head interaction layer and performs cross-head information mapping and feature recombination on the aggregated multi-head features. The calculation formula is as follows:

[0040] $O_{linear} = O_{MHA} \cdot W_{cross} + b_{cross}$

[0041] where $W_{cross}$ is a learnable weight matrix and $b_{cross}$ is the bias term;

[0042] S35: The final cross-head attention output is obtained through a residual connection and layer normalization. The calculation formula is as follows:

[0043] $O_{residual} = O_{linear} + O_{MHA}$

[0044] $O_{CMHA} = \mathrm{LayerNorm}(O_{residual})$

[0045] where $O_{CMHA}$ is the output of the Cross-head Multi-Head Attention mechanism, that is, the final attention-enhanced result after cross-head interaction, residual connection, and layer normalization; $O_{residual}$ is the result of the residual connection; and LayerNorm is the layer normalization operation.
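
Steps S31 to S35 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the weight shapes, the argument names, and the omission of the learnable gain and bias of layer normalization are choices made here for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalization over the feature axis, without learnable gain and bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def cross_head_attention(X, Wq, Wk, Wv, Wo, W_cross, b_cross):
    """X: (L, d); Wq/Wk/Wv: (h, d, d_k); Wo: (h*d_k, d); W_cross: (d, d)."""
    d_k = Wq.shape[-1]
    # S31: per-head projections, each of shape (h, L, d_k)
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # S32: scaled dot-product attention inside each head
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, L, L)
    heads = softmax(scores) @ V                       # (h, L, d_k)
    # S33: concatenate the heads along the feature axis, then project
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)  # (L, h*d_k)
    O_mha = concat @ Wo                               # (L, d)
    # S34: cross-head interaction as a linear projection
    O_linear = O_mha @ W_cross + b_cross
    # S35: residual connection followed by layer normalization
    return layer_norm(O_linear + O_mha)
```

Because `W_cross` acts on the concatenated and projected features, every output dimension can mix information that originated in different heads, which is the recombination that S34 describes.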

[0046] In step S3 of this embodiment, the specific process of the multi-context feedforward fusion module (see Figure 1) is as follows:

[0047] S36: Local semantic dependencies are modeled through the Local Context branch, and the calculation formula is as follows:

[0048] $X_{local} = \mathrm{Conv1D}_{k=3}(X)$

[0049] where Conv1D is a one-dimensional convolution with a kernel size of 3 and a stride of 1 that uses equal-length ("same") padding to keep the sequence length unchanged;

[0050] S37: Mid-range context dependencies are modeled through the Mid-range Context branch, and the calculation formula is as follows:

[0051] $X_{mid} = \mathrm{DilatedConv1D}_{k=3,\,\delta}(X)$

[0052] where DilatedConv1D is a dilated one-dimensional convolution and $\delta$ is the dilation rate;

[0053] S38: Global semantic information is modeled through the Global Context branch, and its calculation formula is as follows:

[0054] $g = \mathrm{Pool}(X)$

[0055] $\tilde{g} = \mathrm{MLP}(g)$

[0056] $X_{global} = \mathbf{1}_L \otimes \tilde{g}$

[0057] where $g$ is the global feature vector, $\tilde{g}$ is the transformed global feature vector, $X_{global}$ is the global context matrix, $\mathbf{1}_L$ is a vector of all 1s of length $L$, Pool(·) is the global pooling operation, MLP is a multilayer perceptron, and $\otimes$ denotes the broadcast operation;

[0058] S39: Through the task-adaptive output layer (see Figure 1), the features of each branch are weighted and fused, and the calculation formula is as follows:

[0059] $X_{out} = \alpha_{local} \odot X_{local} + \alpha_{mid} \odot X_{mid} + \alpha_{global} \odot X_{global}$

[0060] where $\alpha_{local}$, $\alpha_{mid}$, and $\alpha_{global}$ are the gating weights calculated by the Sigmoid function within the Local Context branch, the Mid-range Context branch, and the Global Context branch, respectively, and $\odot$ denotes element-wise multiplication.
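
A minimal NumPy sketch of steps S36 to S39 follows. Several details are assumptions made here because the source leaves them open: zero padding, average pooling for Pool(·), a two-layer ReLU MLP, a dilation rate of 2, and gates of the form sigmoid(branch · G). All parameter names in the dictionary `p` are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1d_same(X, W, dilation=1):
    """1-D convolution over the sequence axis with zero 'same' padding.

    X: (L, d_in); W: (k, d_in, d_out) with an odd kernel size k.
    """
    k = W.shape[0]
    pad = dilation * (k - 1) // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    L = X.shape[0]
    out = np.zeros((L, W.shape[2]))
    for j in range(k):
        # Tap j reads the padded input shifted by j * dilation positions.
        out += Xp[j * dilation : j * dilation + L] @ W[j]
    return out


def multi_context_fusion(X, p, dilation=2):
    """p: dict of weight arrays; every key is a hypothetical name."""
    X_local = conv1d_same(X, p["W_local"])                 # S36
    X_mid = conv1d_same(X, p["W_mid"], dilation=dilation)  # S37
    g = X.mean(axis=0)                                     # S38: average pooling (assumed)
    g_t = np.maximum(g @ p["W1"], 0.0) @ p["W2"]           # two-layer MLP with ReLU (assumed)
    X_global = np.ones((X.shape[0], 1)) * g_t              # broadcast to (L, d)
    # S39: one sigmoid gate per branch (assumed form), then the gated sum
    out = np.zeros_like(X_global)
    for name, branch in (("local", X_local), ("mid", X_mid), ("global", X_global)):
        gate = sigmoid(branch @ p["G_" + name])
        out += gate * branch
    return out
```

With a kernel size of 3, a dilation rate of δ widens the receptive field from 3 to 2δ + 1 positions without adding parameters, which is why the mid-range branch can reuse the same kernel shape as the local branch.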

[0061] The specific process of step S4 in this embodiment is as follows:

[0062] S41: Construct the distillation architecture. A 12-layer Transformer BERT-base model serves as the teacher model, and the enhanced student model with a 4-layer Transformer structure obtained in step S3 serves as the recipient. The teacher model parameters are frozen, and only the student model parameters are updated.

[0063] S42: Determine the optimal combination of the weighting coefficient α and the temperature parameter T. A grid search algorithm jointly searches the space of weighting coefficients α and temperature parameters T, and the combination with the highest accuracy is selected as the optimal hyperparameters. The optimal weighting coefficient and the optimal temperature parameter are determined as follows:

[0064] $(T^{*}, \alpha^{*}) = \arg\max_{T,\,\alpha} \mathrm{Acc}(T, \alpha)$

[0065] where Acc(T, α) is the classification accuracy of the student model under the current combination of temperature parameter T and weighting coefficient α, and $T^{*}$ and $\alpha^{*}$ are the optimal temperature parameter and the optimal weighting coefficient, respectively, at which the accuracy reaches its maximum.
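
The joint search in S42 amounts to evaluating every (T, α) pair and keeping the best one. The sketch below assumes a caller-supplied `evaluate(T, alpha)` function that trains and validates a student model and returns its accuracy; the candidate grids and the toy accuracy surface are placeholders, not values from the source.

```python
from itertools import product


def grid_search(evaluate, temperatures, alphas):
    """Return (T*, alpha*, best accuracy) over the Cartesian grid."""
    best = None
    for T, alpha in product(temperatures, alphas):
        acc = evaluate(T, alpha)  # in practice: distill a student, measure accuracy
        if best is None or acc > best[2]:
            best = (T, alpha, acc)
    return best


# Toy stand-in for Acc(T, alpha), peaked at T = 4 and alpha = 0.7.
def toy_accuracy(T, alpha):
    return 0.8 - 0.01 * (T - 4) ** 2 - 0.1 * (alpha - 0.7) ** 2


T_star, alpha_star, best_acc = grid_search(
    toy_accuracy, temperatures=[1, 2, 4, 8], alphas=[0.3, 0.5, 0.7, 0.9]
)
```

The cost grows with the product of the two grid sizes, since each cell requires a full distillation run, so coarse grids are the usual starting point.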

[0066] S43: Calculate the soft-label loss $L_{soft}$. First, the optimal temperature parameter T obtained in step S42 is used to smooth the output logits of the teacher model and the student model. Then the KL divergence between the two probability distributions is calculated, as shown in the following formulas:

[0067] $p^{T}_{student} = \mathrm{softmax}(z_{student} / T)$

[0068] $p^{T}_{teacher} = \mathrm{softmax}(z_{teacher} / T)$

[0069] $L_{soft} = \mathrm{KL}(p^{T}_{student} \,\|\, p^{T}_{teacher})$

[0070] where T is the temperature parameter used to adjust the smoothness of the distribution, $p^{T}_{student}$ and $p^{T}_{teacher}$ are the soft probability distributions of the student model and the teacher model at temperature T, respectively, $z_{student}$ and $z_{teacher}$ are the output logits of the student model and the teacher model, respectively, softmax is the normalization function, and KL is the KL divergence loss function;

[0071] S44: Calculate the hard-label loss $L_{hard}$. The output logits of the student model are normalized with the standard softmax, and the cross-entropy loss $L_{hard}$ between the resulting distribution and the true label y is calculated. The formulas are as follows:

[0072] $p_{student} = \mathrm{softmax}(z_{student})$

[0073] $L_{hard} = \mathrm{CE}(p_{student}, y)$

[0074] where $p_{student}$ is the hard probability distribution of the student model, softmax is the normalization function, and CE is the cross-entropy loss function.

[0075] S45: Construct the joint training objective. The total loss function $L_{total}$ is obtained as the weighted sum of the soft-label loss and the hard-label loss, using the optimal weighting coefficient α obtained in step S42. Backpropagation is then performed on this total loss to optimize the model parameters, as shown in the following formula:

[0076] $L_{total} = \alpha \cdot L_{soft} + (1 - \alpha) \cdot L_{hard}$

[0077] where α is the optimal weighting coefficient of the soft-label loss obtained in step S42 (value range: 0 to 1).
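
Steps S43 to S45 can be traced for a single sample with the following standard-library sketch. It follows the formulas as written above, including the argument order of the KL term; the function names and the representation of the true label y as a class index are assumptions made for illustration.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(z_student, z_teacher, y, T, alpha):
    """Return (L_total, L_soft, L_hard) for one sample; y is a class index."""
    # S43: soft-label loss on the temperature-smoothed distributions
    p_s_T = softmax(z_student, T)
    p_t_T = softmax(z_teacher, T)
    l_soft = kl_divergence(p_s_T, p_t_T)
    # S44: hard-label loss, cross-entropy against the true class
    p_s = softmax(z_student)
    l_hard = -math.log(p_s[y])
    # S45: weighted sum of the two terms
    return alpha * l_soft + (1 - alpha) * l_hard, l_soft, l_hard
```

When the student's logits equal the teacher's, the soft term vanishes and only the hard-label term remains. Common implementations differ in the argument order of the KL term and often multiply it by T² to keep gradient magnitudes comparable across temperatures; neither choice is stated in the source, so this sketch does not apply them.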

[0078] The specific steps of the task adaptation output layer used in step S5 of this embodiment are as follows:

[0079] S51: Obtain the last hidden state vector corresponding to the special marker [CLS] in the student model and use it as the aggregate semantic representation of the text;

[0080] S52: Construct a task-adaptive output layer, perform linear transformation and Softmax normalization on the hidden state vector, and calculate the prediction probability distribution of the sentiment tri-classification task (positive, neutral, negative). The calculation formula is as follows:

[0081] $y_{pred} = \mathrm{softmax}(h_{[CLS]} W_{output} + b_{output})$

[0082] where $y_{pred}$ is the predicted sentiment classification probability distribution, $h_{[CLS]}$ is the last hidden state vector corresponding to the special marker [CLS], $W_{output}$ is the trainable weight matrix of the output layer, $b_{output}$ is the bias term of the output layer, and softmax is the normalized exponential function.
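
The output layer of S51 and S52 reduces to one linear map followed by a softmax over three classes. The sketch below uses plain Python lists; the label order in `LABELS` and the function name are assumptions, since the source does not fix how class indices map to sentiments.

```python
import math

LABELS = ("positive", "neutral", "negative")  # assumed index order


def classify(h_cls, W_output, b_output):
    """Linear layer plus softmax over the [CLS] vector.

    h_cls: list of d floats; W_output: d x 3 nested list; b_output: 3 floats.
    """
    logits = [
        sum(h * W_output[i][c] for i, h in enumerate(h_cls)) + b_output[c]
        for c in range(len(LABELS))
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return LABELS[probs.index(max(probs))], probs
```

The function returns both the most probable label and the full distribution, so a caller can apply its own confidence threshold if needed.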

[0083] The following comparative experiments are conducted to verify the technical advantages of this invention.

[0084] To verify the effectiveness, generalization ability, and practical value of the method of this invention in the task of sentiment classification of educational evaluation texts, experiments were conducted from the aspects of overall model performance comparison, multi-dataset validation, ablation experiment analysis, and comparison with existing distillation models, and the experimental results were systematically analyzed.

[0085] Table 1 Comparison on the EduTE-Putter dataset

[0086]

[0087] The method of this invention was compared with various mainstream deep learning models and pre-trained models on the EduTE-Putter educational evaluation dataset. The experimental results are shown in Table 1.

[0088] Table 1 shows that the method of this invention achieved a classification accuracy of 0.798 with 23.95M parameters, which is exactly the same as the parameter size of the BERT-mini model, but the accuracy is improved by 2.1%. Compared with the BERT-base model, the method of this invention has only 23.4% of the parameters, while the inference speed is increased to 278% of the original model, significantly reducing the model complexity while maintaining high classification accuracy. In addition, compared with traditional deep learning models such as LSTM, BiLSTM, CNN, and TextCNN, the method of this invention shows a significant advantage in accuracy, improving by 60.9%, 10.5%, and 10.1% respectively, indicating that it has stronger expressive power in semantic modeling of complex educational evaluation texts.

[0089] The above results show that the method of the present invention achieves a good balance between model lightweighting, inference efficiency and classification performance, and is suitable for deployment in application scenarios such as educational evaluation where real-time performance and resource constraints are highly demanding.

[0090] Table 2 Comparison on three different datasets

[0091]

| Dataset | Accuracy of this invention | BERT-mini accuracy | Increase | Performance stability |
|---|---|---|---|---|
| ChnSentiCorp | 0.899 | 0.876 | +2.3% | Excellent |
| Waimai_10k | 0.892 | 0.858 | +3.4% | Excellent |
| EduTE-Putter | 0.798 | 0.777 | +2.1% | Excellent |

[0092] To further verify the generalization ability and stability of the method of the present invention, the method of the present invention was compared with the BERT-mini model on three datasets with different domains and feature distributions. The results are shown in Table 2.

[0093] As shown in Table 2, the method of this invention achieved stable performance improvements on the ChnSentiCorp, Waimai_10k, and EduTE-Putter datasets, with accuracy gains over BERT-mini of 2.3, 3.4, and 2.1 percentage points, respectively, for an average gain of 2.6 percentage points. The method also showed a consistent improvement trend on both the general sentiment analysis dataset and the real business review dataset, indicating that it does not overfit to data from a specific domain and has good cross-domain generalization ability and stability.
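
The gains quoted from Table 2 are absolute differences in accuracy, expressed in percentage points. A short check, using only the accuracies listed in Table 2 (the variable names are illustrative):

```python
# (accuracy of this invention, BERT-mini accuracy) as listed in Table 2
results = {
    "ChnSentiCorp": (0.899, 0.876),
    "Waimai_10k": (0.892, 0.858),
    "EduTE-Putter": (0.798, 0.777),
}

# Absolute gain in percentage points, rounded to one decimal place
gains = {
    name: round((ours - base) * 100, 1)
    for name, (ours, base) in results.items()
}
average_gain = round(sum(gains.values()) / len(gains), 1)
```

Here `gains` reproduces the per-dataset figures of 2.3, 3.4, and 2.1, and `average_gain` reproduces the stated average of 2.6.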

[0094] To analyze the role and synergistic relationship between cross-head attention mechanism and knowledge distillation strategy in the method of this invention, ablation experiments were conducted on different datasets.

[0095] Table 3 ChnSentiCorp dataset ablation results

[0096]

[0097] The ablation experiments on the ChnSentiCorp dataset are shown in Table 3. As shown in Table 3, when the cross-head attention mechanism is introduced alone, the model accuracy improves by 0.7% compared to BERT-mini, while the inference speed improves by approximately 34%, indicating that the cross-head attention mechanism introduces some computational complexity while enhancing feature interaction capabilities. When the knowledge distillation strategy is introduced alone, the accuracy improves by 0.9%, indicating that the distillation mechanism directly contributes to the model performance improvement. When both are introduced simultaneously, the accuracy improves by 2.3%, significantly higher than the improvement of either module alone, indicating a significant synergistic enhancement effect between the cross-head attention mechanism and the knowledge distillation strategy.

[0098] Table 4 Waimai_10k dataset ablation results

[0099]

[0100] The ablation experiments on the Waimai_10k dataset are shown in Table 4. As can be seen from Table 4, the cross-head attention mechanism alone makes a more significant contribution on this dataset, improving accuracy by 2.9%, which indicates that this dataset contains relatively rich long-distance semantic dependencies. In contrast, the improvement brought by knowledge distillation is relatively small (0.7%), mainly because the semantic structure of this task is itself relatively simple. When the two methods are combined, the accuracy improvement reaches 3.4% and the inference speed reaches 458.72 sentences/second, showing a more prominent efficiency advantage in this scenario.

[0101] Table 5 EduTE-Putter educational dataset ablation results

[0102]

[0103] The ablation experiments on the EduTE-Putter educational evaluation dataset are shown in Table 5. As can be seen from Table 5, in the educational evaluation dataset, the accuracy improvement achieved by using the cross-head attention mechanism or the knowledge distillation strategy alone was relatively small, at 0.5% and 0.1%, respectively; however, when the two were combined, the accuracy improvement reached 2.1%, demonstrating a significant synergistic effect. Although the absolute accuracy on this dataset is relatively low, considering the complex features of educational evaluation texts, including professional terminology, student spoken language, and internet slang, the performance improvement achieved by the method of this invention is significant, further validating its applicability in complex semantic scenarios.

[0104] Based on the above comparative experiments and ablation analysis, the following conclusions can be drawn: The method of this invention shows stable and consistent performance improvement on multiple datasets, verifying the effectiveness of the method; there is a significant synergistic enhancement effect between the cross-head attention mechanism and the knowledge distillation strategy, and the overall effect of the combination is better than that of a single module; the stable performance of the method on data from different domains indicates that it has good generalization ability; at the same time, it achieves a reasonable trade-off between parameter quantity, inference speed and classification accuracy, has high engineering practical value, and can meet the needs of practical application scenarios such as educational evaluation.

[0105] Table 6 Comparison with DistilBERT

[0106]

| Comparison dimension | DistilBERT | This invention |
|---|---|---|
| Distillation loss function | Cosine similarity | KL divergence |
| Loss information richness | Considers only vector similarity | Considers differences in probability distributions, conveying decision logic |
| Attention mechanism | Standard multi-head attention | Improved cross-head attention mechanism |
| Complex semantic understanding | Limited | Enhanced |
| Parameter reduction | 40% | 76.6% |
| Performance on educational datasets | Untested | 0.798 (+2.1% compared to BERT-mini) |

[0107] The differences between the method of this invention and DistilBERT are shown in Table 6. Table 6 shows that DistilBERT mainly uses a distillation loss function based on vector similarity, while the method of this invention uses a distillation strategy based on KL divergence, which can more fully characterize the probability distribution information of the teacher model output, thus conveying a more complete decision-making logic. Simultaneously, this invention introduces an improved cross-head attention mechanism to enhance the ability to model complex semantics, achieving an accuracy of 0.798 on the educational evaluation dataset, a 2.1% improvement compared to BERT-mini, while related background techniques have not been validated for this type of data.

[0108] Table 7 Comparison with TinyBERT

[0109]

| Comparison dimension | TinyBERT | This invention |
|---|---|---|
| Number of layers after reduction | 6 layers | 4 layers |
| Distillation method | Two stages (pre-training + fine-tuning) | Single-stage fine-tuning distillation |
| Hyperparameter optimization | Fixed | Dynamic optimization by grid search |
| Attention mechanism | Standard multi-head attention | Cross-head attention mechanism |
| Number of parameters (relative to BERT-base) | Approximately 6.7% (parameters reduced by about 15 times) | 23.4% |
| Inference speedup | 9 times | 2.78 times |

[0110] The method of this invention is compared with TinyBERT, and the results are shown in Table 7. Table 7 shows that TinyBERT achieves extreme lightweighting through significant layer reduction and two-stage distillation, but suffers significant accuracy loss in complex semantic tasks. In contrast, although the method of this invention has a higher parameter count than TinyBERT, it significantly reduces accuracy loss while maintaining improved inference efficiency through cross-head attention mechanism and dynamic distillation hyperparameter optimization. For applications such as educational evaluation that require high semantic understanding accuracy, the method of this invention achieves a better trade-off between performance and efficiency.

[0111] In summary, this invention achieves significant results in model lightweighting, classification accuracy improvement, and performance optimization. The number of parameters is compressed to 23.95M, a 76.6% reduction compared to BERT-base, and the inference speed is improved to 348 sentences/second, an improvement of 177%. In terms of classification accuracy, the model achieves accuracies of 0.899, 0.892, and 0.798 on ChnSentiCorp, Waimai_10k, and the self-built educational evaluation dataset EduTE-Putter, respectively, representing a stable improvement over BERT-mini. Compared with traditional models such as RNN and CNN, this model achieves accuracy improvements of 59.9%, 10.2%, 9.8%, and 12.4% while maintaining the same number of parameters, demonstrating excellent overall performance.

[0112] As shown in Figure 3, this embodiment discloses a Chinese teacher evaluation text sentiment analysis system for performing the above method, which includes the following modules:

[0113] Preprocessing module: preprocesses the Chinese teaching evaluation text data, mapping the processed text into a word vector sequence as input;

[0114] Student model building module: constructs a lightweight student model, which includes an embedding layer, a Transformer encoder layer, and a task-adaptive output layer. The Transformer encoder layer extracts and fuses features from the word vectors in step S1.

[0115] Student Model Enhancement Module: This module enhances the structure of the constructed student model by using a cross-head attention mechanism to reorganize and enhance the output of the multi-head attention within the original Transformer encoder layer, and by using a multi-context feedforward fusion module to perform multi-scale modeling and adaptive fusion of context features to obtain an enhanced text representation.

[0116] Iterative optimization module: Based on the obtained enhanced student model, a teacher model is introduced and trained with the enhanced student model using the knowledge distillation method. By jointly minimizing the soft label loss and hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized.

[0117] Sentiment analysis module: uses the trained student model to classify the sentiment of the teaching evaluation text to be tested. The probability distribution over the classification labels is calculated through the task-adaptive output layer to obtain the sentiment analysis results.

[0118] Other aspects of this embodiment can be found in the above method embodiments.

[0119] This invention has strong application value: the reduced parameter count makes it suitable for the resource-constrained server environments of universities; the inference speed of up to 348 sentences/second can meet the real-time processing requirements of online teaching quality assessment; the ability to understand evaluation texts containing internet slang and complex sentence structures is significantly enhanced; and its stable performance on multiple public and self-built datasets demonstrates good task versatility.

[0120] The preferred embodiments and principles of the present invention have been described in detail above. For those skilled in the art, there may be changes in the specific implementation based on the ideas provided by the present invention, and these changes should also be considered within the scope of protection of the present invention.

Claims

1. A method for sentiment analysis of Chinese teaching evaluation texts, characterized in that it includes the following steps:
S1: Preprocess the Chinese teaching evaluation text data and map the processed text into a word vector sequence as input;
S2: Construct a lightweight student model, which includes an embedding layer, a Transformer encoder layer, and a task-adaptive output layer, wherein the Transformer encoder layer extracts and fuses features from the word vectors in step S1;
S3: Perform structural enhancement on the student model in step S2, wherein a cross-head attention mechanism is used to reorganize and enhance the cross-head features of the multi-head attention output in the Transformer encoder layer, and a multi-context feedforward fusion module is used to perform multi-scale modeling and adaptive fusion of context features to obtain an enhanced text representation;
S4: Based on the enhanced student model obtained in step S3, introduce a teacher model and train it together with the enhanced student model using the knowledge distillation method, wherein, by jointly minimizing the soft label loss and the hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized;
S5: Use the trained student model to classify the sentiment of the teaching evaluation text to be tested, and calculate the probability distribution over the classification labels through the task-adaptive output layer to obtain the sentiment analysis results.

2. The method for sentiment analysis of Chinese teaching evaluation texts according to claim 1, characterized in that, in step S1, the preprocessing includes data cleaning, stratified sampling, and text encoding.

3. The method for sentiment analysis of Chinese teaching evaluation texts according to claim 2, characterized in that, in step S1: HTML tags, URL links, specific identifiers, and numeric strings exceeding a preset threshold in the evaluation text are identified and removed using regular expressions; full-width characters are converted to half-width characters, traditional Chinese characters are converted to simplified Chinese characters, and the text encoding format is unified; specific noise in the evaluations is removed, templated content is identified and removed, irrelevant terms are removed using an institution-specific stop word list, and duplicate text fragments are detected and merged; the format of the dataset is unified, adopting the "sentiment content" format; and the processed text is encoded using BERT-Base-Chinese and mapped to a word vector sequence as the network input.
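The cleaning steps of claim 3 can be sketched with the Python standard library. This is a minimal illustration, not the patented pipeline: the patterns, the digit threshold `MAX_DIGITS`, and the function names are hypothetical, and traditional-to-simplified conversion is omitted because it needs an external mapping table (for example a library such as OpenCC).

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://[A-Za-z0-9./?=&_%#:~+-]+")
MAX_DIGITS = 6  # hypothetical "preset threshold" for numeric strings
LONG_NUMBER = re.compile(r"\d{%d,}" % (MAX_DIGITS + 1))


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:              # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:  # full-width punctuation, digits, letters
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def clean_text(text, stop_words=()):
    """Strip tags, URLs and long digit runs, normalize width, drop stop words."""
    text = HTML_TAG.sub("", text)
    text = URL.sub("", text)
    text = to_half_width(text)
    text = LONG_NUMBER.sub("", text)
    for word in stop_words:
        text = text.replace(word, "")
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(texts):
    """Keep the first occurrence of each cleaned text, preserving order."""
    seen, unique = set(), []
    for t in texts:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    return unique


sample = "<p>老师讲课很好！</p> http://example.com/a 学号20230101234"
print(clean_text(sample, stop_words=("学号",)))
```

The stop word list here is a single illustrative entry; an institution-specific list would be supplied by the deployment.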

4. The method for sentiment analysis of Chinese teaching evaluation texts according to any one of claims 1-3, characterized in that, in step S3, the specific process of using the cross-head attention mechanism is as follows:
S31: For the input features, multiple independent attention heads are constructed based on the multi-head attention mechanism of the original Transformer. For an input sentence $X = [x_1, x_2, \ldots, x_L]$ with sequence length $L$, where $x_L$ is the last element of the input sequence, $X \in \mathbb{R}^{L \times d}$, and $d$ is the feature dimension, multi-head attention first generates $Q$, $K$, and $V$ by linear transformation, where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, and $i$ is the attention head index. The calculation formula is as follows:
$$Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V}, \quad i = 1, \ldots, h$$
where $h$ is the number of attention heads, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the trainable weight matrices of each head;
S32: Calculate the weights and aggregate the features for each attention head. The calculation formula is as follows:
$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$$
where softmax is the normalization function and $\sqrt{d_k}$ is the scaling factor;
S33: The global features are obtained by concatenating the outputs of all attention heads and applying a linear transformation. The calculation formula is as follows:
$$O_{MHA} = \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
where $h$ is the number of attention heads, $W^{O}$ is the output projection matrix, and Concat is the concatenation operation;
S34: Construct a cross-head interaction layer and input the aggregated output feature matrix into a linear projection layer for feature space mapping. The calculation formula is as follows:
$$O_{linear} = O_{MHA} \cdot W_{cross} + b_{cross}$$
where $W_{cross}$ is a learnable weight matrix and $b_{cross}$ is the bias term;
S35: The final cross-head attention output is obtained through a residual connection and layer normalization. The calculation formula is as follows:
$$O_{residual} = O_{linear} + O_{MHA}$$
$$O_{CHMA} = \mathrm{LayerNorm}(O_{residual})$$
where $O_{CHMA}$ is the output feature of the cross-head attention mechanism, $O_{residual}$ is the result of the residual connection, and LayerNorm is the layer normalization operation.
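Steps S31 to S35 can be traced end to end on a toy input. The following pure-Python sketch is illustrative only: it treats the concatenated heads, after an output projection called `w_o` here, as the multi-head output that S34 projects and S35 adds back as the residual. That reading, the tiny sizes, the random weights, and the layer normalization without learnable scale and shift are all assumptions.

```python
import math
import random


def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(row, eps=1e-5):
    """Normalize one position's features (no learnable scale or shift)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def attention_head(x, w_q, w_k, w_v):
    """S31 and S32: project to Q, K, V, then scaled dot-product attention."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def cross_head_attention(x, heads, w_o, w_cross, b_cross):
    head_outputs = [attention_head(x, *w) for w in heads]
    # S33: concatenate heads along the feature axis, then project.
    concat = [[v for head in head_outputs for v in head[t]] for t in range(len(x))]
    o_mha = matmul(concat, w_o)
    # S34: cross-head interaction as a linear projection with bias.
    o_linear = [[v + b for v, b in zip(row, b_cross)] for row in matmul(o_mha, w_cross)]
    # S35: residual connection followed by layer normalization.
    o_residual = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(o_linear, o_mha)]
    return [layer_norm(row) for row in o_residual]


rng = random.Random(0)
seq_len, d, h = 3, 4, 2          # toy sizes, chosen only for the demo
d_k = d // h
x = rand_matrix(seq_len, d, rng)
heads = [tuple(rand_matrix(d, d_k, rng) for _ in range(3)) for _ in range(h)]
w_o = rand_matrix(h * d_k, d, rng)
w_cross = rand_matrix(d, d, rng)
b_cross = [0.0] * d
o_chma = cross_head_attention(x, heads, w_o, w_cross, b_cross)
print(len(o_chma), len(o_chma[0]))
```

The output keeps the input shape, so the enhanced layer can replace a standard multi-head attention sublayer without changing the surrounding dimensions.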

5. The method for sentiment analysis of Chinese teaching evaluation texts according to claim 4, characterized in that, in step S3, the specific process of using the multi-context feedforward fusion module is as follows:
S36: Model local semantic dependencies through the Local Context branch. The calculation formula is as follows:
$$X_{local} = \mathrm{Conv1D}_{k=3}(X)$$
where Conv1D is a one-dimensional convolution with a kernel size of 3 and a stride of 1, using equal-length padding to keep the sequence length unchanged;
S37: Model mid-range context dependencies through the Mid-range Context branch. The calculation formula is as follows:
$$X_{mid} = \mathrm{DilatedConv1D}_{k=3,\,\delta}(X)$$
where DilatedConv1D is a dilated one-dimensional convolution and $\delta$ is the dilation rate;
S38: Model global semantic information through the Global Context branch. The calculation formula is as follows:
$$g = \mathrm{Pool}(X), \quad \tilde{g} = \mathrm{MLP}(g), \quad X_{global} = \mathbf{1}_L \otimes \tilde{g}$$
where $g$ is the global feature vector, $\tilde{g}$ is the transformed global feature vector, $X_{global}$ is the global context matrix, $\mathbf{1}_L$ is a vector of all 1s, $\mathrm{Pool}(\cdot)$ is the global pooling operation, MLP is a multilayer perceptron, and $\otimes$ indicates the broadcast operation;
S39: Weighted fusion of the features from each branch is performed through the task-adaptive output layer. The calculation formula is as follows:
$$X_{out} = \alpha_{local} \odot X_{local} + \alpha_{mid} \odot X_{mid} + \alpha_{global} \odot X_{global}$$
where $\alpha_{local}$, $\alpha_{mid}$, and $\alpha_{global}$ are the gating weights calculated by the Sigmoid function within the Local Context branch, the Mid-range Context branch, and the Global Context branch, respectively, and $\odot$ indicates element-wise multiplication.
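The three branches and the gated fusion of steps S36 to S39 can be sketched in pure Python as below. Several details are not fixed by the claim and are assumed here for illustration: mean pooling for Pool, a one-hidden-layer ReLU network for MLP, gates computed as the sigmoid of a linear projection of each branch, and all sizes, names, and random weights.

```python
import math
import random


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv1d_same(x, kernel, dilation=1):
    """1-D convolution, stride 1, zero padding so the length is unchanged.

    x is L x d_in; kernel is k x d_in x d_out (k taps).
    """
    k, length, d_out = len(kernel), len(x), len(kernel[0][0])
    pad = dilation * (k - 1) // 2
    out = []
    for t in range(length):
        row = [0.0] * d_out
        for j in range(k):
            src = t + j * dilation - pad
            if 0 <= src < length:
                for c_in, value in enumerate(x[src]):
                    for c_out in range(d_out):
                        row[c_out] += value * kernel[j][c_in][c_out]
        out.append(row)
    return out


def multi_context_fusion(x, p):
    x_local = conv1d_same(x, p["k_local"])                      # S36
    x_mid = conv1d_same(x, p["k_mid"], dilation=p["dilation"])  # S37
    # S38: pool over positions (mean, assumed), transform, broadcast to L rows.
    g = [sum(col) / len(x) for col in zip(*x)]
    hidden = [max(0.0, v) for v in matmul([g], p["w1"])[0]]
    g_tilde = matmul([hidden], p["w2"])[0]
    x_global = [list(g_tilde) for _ in x]
    # S39: sigmoid gates (assumed: linear projection of each branch), weighted sum.
    out = [[0.0] * len(x[0]) for _ in x]
    for name, branch in (("local", x_local), ("mid", x_mid), ("global", x_global)):
        gates = [[sigmoid(v) for v in row] for row in matmul(branch, p["gate_" + name])]
        for t, (gate_row, feat_row) in enumerate(zip(gates, branch)):
            for c, feat in enumerate(feat_row):
                out[t][c] += gate_row[c] * feat
    return out


def rand(shape, rng):
    if len(shape) == 1:
        return [rng.uniform(-0.5, 0.5) for _ in range(shape[0])]
    return [rand(shape[1:], rng) for _ in range(shape[0])]


rng = random.Random(0)
seq_len, d = 5, 4  # toy sizes, chosen only for the demo
params = {
    "k_local": rand((3, d, d), rng), "k_mid": rand((3, d, d), rng), "dilation": 2,
    "w1": rand((d, 2 * d), rng), "w2": rand((2 * d, d), rng),
    "gate_local": rand((d, d), rng), "gate_mid": rand((d, d), rng),
    "gate_global": rand((d, d), rng),
}
x_out = multi_context_fusion(rand((seq_len, d), rng), params)
print(len(x_out), len(x_out[0]))
```

With a kernel size of 3, the dilated branch reads positions t - δ, t, and t + δ, which is how it widens the receptive field without adding parameters.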

6. The method for sentiment analysis of Chinese teaching evaluation texts according to claim 5, characterized in that step S4 is as follows:
S41: Construct a distillation architecture, in which a BERT-base model with 12 Transformer layers is used as the teacher model, and the enhanced student model with a 4-layer Transformer structure obtained in step S3 is introduced as the recipient. The teacher model parameters are frozen, and only the student model parameters are updated.
S42: Determine the optimal combination of the weighting coefficient α and the temperature parameter T. A grid search algorithm is used to jointly search the space of the weighting coefficient α and the temperature parameter T, and the combination with the highest accuracy is selected as the optimal hyperparameters. The optimal weighting coefficient α and optimal temperature parameter T are determined as follows:
$$(T^{*}, \alpha^{*}) = \arg\max_{T,\,\alpha} \mathrm{Acc}(T, \alpha)$$
where $\mathrm{Acc}(T, \alpha)$ is the classification accuracy of the student model under the current combination of temperature parameter T and weighting coefficient α, and $T^{*}$ and $\alpha^{*}$ are the optimal temperature parameter and the optimal weighting coefficient, respectively, at which the accuracy reaches its maximum.
S43: Calculate the soft label loss $L_{soft}$. First, the optimal temperature parameter T obtained in step S42 is used to smooth the output logits of the teacher model and the student model. Then, the KL divergence between their probability distributions is calculated, as shown in the following formulas:
$$p^{T}_{student} = \mathrm{softmax}(z_{student} / T)$$
$$p^{T}_{teacher} = \mathrm{softmax}(z_{teacher} / T)$$
$$L_{soft} = \mathrm{KL}\left(p^{T}_{student} \,\|\, p^{T}_{teacher}\right)$$
where T is the temperature parameter used to adjust the smoothness of the distribution; $p^{T}_{student}$ and $p^{T}_{teacher}$ are the soft probability distributions of the student model and the teacher model at temperature T, respectively; $z_{student}$ and $z_{teacher}$ are the output logits of the student model and the teacher model, respectively; softmax is the normalized exponential function; and KL is the KL divergence loss function.
S44: Calculate the hard label loss $L_{hard}$. The output logits of the student model are normalized using standard Softmax, and the cross-entropy loss $L_{hard}$ between the resulting distribution and the true label y is calculated. The formulas are as follows:
$$p_{student} = \mathrm{softmax}(z_{student})$$
$$L_{hard} = \mathrm{CE}(p_{student}, y)$$
where $p_{student}$ is the hard probability distribution of the student model, softmax is the normalized exponential function, and CE is the cross-entropy loss function.
S45: Construct a joint training objective. The total loss function $L_{total}$ is obtained as a weighted sum of the soft label loss and the hard label loss using the optimal weighting coefficient α obtained in step S42, and backpropagation is performed on this total loss function to optimize the model parameters, as shown in the following formula:
$$L_{total} = \alpha \cdot L_{soft} + (1 - \alpha) \cdot L_{hard}$$
where α is the optimal weighting coefficient of the soft label loss obtained in step S42.
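The loss of steps S43 to S45 and the grid search of step S42 can be written out for a single example as follows. This is a sketch under stated assumptions: the function names and the candidate grids are hypothetical, and `toy_accuracy` is a stand-in for Acc(T, α), which in practice would mean training the student with that pair and measuring validation accuracy.

```python
import math
from itertools import product


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; temperature 1.0 gives the standard form."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(probs, label):
    """Cross-entropy against a one-hot true label given as an index."""
    return -math.log(probs[label])


def distillation_loss(z_student, z_teacher, label, temperature, alpha):
    # S43: soft label loss, with the argument order written in the claim.
    p_student_t = softmax(z_student, temperature)
    p_teacher_t = softmax(z_teacher, temperature)
    l_soft = kl_divergence(p_student_t, p_teacher_t)
    # S44: hard label loss on the unsmoothed student distribution.
    l_hard = cross_entropy(softmax(z_student), label)
    # S45: weighted sum of the two losses.
    return alpha * l_soft + (1.0 - alpha) * l_hard


def grid_search(evaluate, temperatures, alphas):
    """S42: return the (T, alpha) pair with the highest evaluate(T, alpha)."""
    return max(product(temperatures, alphas), key=lambda pair: evaluate(*pair))


def toy_accuracy(temperature, alpha):
    """Hypothetical stand-in for Acc(T, alpha); peaks at T=4, alpha=0.7."""
    return 0.8 - 0.01 * (temperature - 4) ** 2 - 0.1 * (alpha - 0.7) ** 2


best_t, best_alpha = grid_search(toy_accuracy, [1, 2, 4, 8], [0.3, 0.5, 0.7, 0.9])
loss = distillation_loss([0.2, 0.1, 1.5], [0.0, 0.3, 2.0], label=2,
                         temperature=best_t, alpha=best_alpha)
print(best_t, best_alpha, round(loss, 4))
```

The code follows the claim as written. Many distillation implementations instead compute the divergence from the teacher to the student and multiply the soft term by T squared, so the two conventions should not be mixed when comparing loss values.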

7. The method for sentiment analysis of Chinese teaching evaluation texts according to claim 6, characterized in that, in step S5, the specific steps for using the task-adaptive output layer are as follows:
S51: Obtain the last hidden state vector corresponding to the special token [CLS] in the student model and use it as the aggregate semantic representation of the text;
S52: Construct the task-adaptive output layer, apply a linear transformation and Softmax normalization to the hidden state vector, and calculate the predicted probability distribution for the three-class sentiment classification task. The calculation formula is as follows:
$$y_{pred} = \mathrm{softmax}\left(h_{[CLS]} W_{output} + b_{output}\right)$$
where $y_{pred}$ is the predicted sentiment classification probability distribution, $h_{[CLS]}$ is the last hidden state vector corresponding to the special token [CLS], $W_{output}$ is the trainable weight matrix of the output layer, $b_{output}$ is the bias term of the output layer, and softmax is the normalized exponential function.
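The output layer of step S52 is a single linear map followed by softmax. The sketch below shows it on a toy two-dimensional hidden state; the label names and their order in `LABELS`, the function name, and the weights are assumptions made only for the example.

```python
import math

LABELS = ("negative", "neutral", "positive")  # assumed names and order


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_cls, w_output, b_output):
    """Return (probabilities, label) for one [CLS] hidden state vector.

    h_cls has length d; w_output is d x 3; b_output has length 3.
    """
    logits = [
        sum(h * w_output[i][j] for i, h in enumerate(h_cls)) + b_output[j]
        for j in range(len(b_output))
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs, LABELS[best]


# Toy hidden state and weights, chosen so the third class wins.
h_cls = [1.0, 0.0]
w_output = [[0.0, 0.0, 2.0],
            [1.0, 0.0, 0.0]]
b_output = [0.0, 0.0, 0.0]
y_pred, label = classify(h_cls, w_output, b_output)
print(label, [round(p, 3) for p in y_pred])
```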

8. A sentiment analysis system for Chinese teaching evaluation texts, used to perform the method according to any one of claims 1-7, characterized in that it includes the following modules:
Preprocessing module: preprocesses the Chinese teaching evaluation text data, mapping the processed text into a word vector sequence as input;
Student model building module: constructs a lightweight student model, which includes an embedding layer, a Transformer encoder layer, and a task-adaptive output layer, wherein the Transformer encoder layer extracts and fuses features from the word vectors in step S1;
Student model enhancement module: enhances the structure of the constructed student model by using a cross-head attention mechanism to reorganize and enhance the output of the multi-head attention within the original Transformer encoder layer, and by using a multi-context feedforward fusion module to perform multi-scale modeling and adaptive fusion of context features to obtain an enhanced text representation;
Iterative optimization module: based on the obtained enhanced student model, introduces a teacher model and trains it with the enhanced student model using the knowledge distillation method, wherein, by jointly minimizing the soft label loss and the hard label loss, the decision logic of the teacher model is passed to the student model, and the hyperparameters of the model are iteratively optimized;
Sentiment analysis module: uses the trained student model to classify the sentiment of the teaching evaluation text to be tested, and calculates the probability distribution over the classification labels through the task-adaptive output layer to obtain the sentiment analysis results.