A multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism
By constructing a multi-dictionary convolutional sparse representation model and an expert selection mechanism, the optimal dictionary combination is dynamically selected, which improves the accuracy and robustness of classroom behavior recognition and is suitable for smart classroom monitoring and teaching quality evaluation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HENAN UNIVERSITY
- Filing Date
- 2026-04-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing convolutional sparse coding methods struggle to fully express diverse feature patterns in complex classroom scenarios, resulting in insufficient accuracy and robustness in classroom behavior recognition.
A multi-dictionary convolutional sparse representation model is constructed, and the optimal dictionary combination is dynamically selected through a gating network using an expert selection mechanism. This leads to the MoDE-CSC-Net model, which enables high-precision recognition of classroom behavior.
It improves the accuracy and robustness of classroom behavior recognition, reduces computational overhead, enhances the interpretability of the model, and is suitable for smart classroom monitoring and teaching quality evaluation.
[Figure: abstract drawing, CN122244952A_ABST]
Abstract
Description
Technical Field
[0001] This invention belongs to the field of computer vision and educational informatization technology, and specifically relates to a multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism.
Background Technology
[0002] With the rapid development of educational informatization and artificial intelligence technologies, smart classrooms are gradually becoming an important direction for modern education. Smart classrooms integrate IoT, computer vision, and big data analytics to collect, analyze, and provide feedback on the classroom teaching process in real time, thereby achieving intelligent management of the teaching process and scientific evaluation of teaching quality. In a smart classroom environment, classroom activities are typically recorded using video capture devices, and intelligent technologies are used to automatically analyze the behavior of teachers and students in the classroom, providing objective evidence for teaching evaluation and improvement.
[0003] In the classroom, the behavior of teachers and students directly reflects the state of classroom teaching and the level of learning participation. For example, a teacher's lecturing, blackboard writing, and guiding actions reflect teaching methods and organization, while students' listening, raising hands, discussion, and looking down reflect their learning participation and classroom interaction. By automatically identifying and analyzing this behavioral information, a quantitative evaluation of the classroom teaching process can be achieved, providing data support for teaching quality assessment, teaching behavior analysis, and personalized teaching improvement. Therefore, utilizing intelligent technology to achieve automatic identification and analysis of teacher and student behavior in the classroom has significant research value and application prospects.
[0004] Currently, classroom behavior analysis primarily relies on computer vision and pattern recognition technologies. By processing and analyzing classroom video data, it achieves automatic recognition of classroom behavior. Traditional methods typically analyze images or videos by manually designing features, such as human posture features, motion trajectory features, and regional features, for behavior recognition. However, these methods often depend on human experience for feature design, making them difficult to adapt to complex and ever-changing classroom scenarios. In practical applications, their recognition accuracy and robustness are limited.
[0005] In recent years, with the development of deep learning technology, behavior recognition methods based on Convolutional Neural Networks (CNNs) have gradually become mainstream. CNNs can automatically learn high-level semantic features in images through multi-layered convolutional structures, thus significantly improving the performance of behavior recognition. In classroom behavior recognition tasks, by constructing deep CNN models to extract and classify features from classroom videos or images, automatic recognition of behaviors such as teacher lecturing, blackboard writing, student hand-raising, and student discussions can be achieved. However, traditional CNNs mainly rely on deep network structures for feature representation, and their ability to model local structural information in images still has certain limitations. Furthermore, in complex classroom scenarios, facing multiple behavioral patterns and background interference, their feature representation capabilities still have room for further improvement.
[0006] To further enhance the model's ability to represent image structural information, Convolutional Sparse Coding (CSC) has been increasingly applied in computer vision. CSC learns a set of convolutional dictionaries, representing the input image as a combination of convolutions of multiple dictionary atoms and sparse coefficients, thus more effectively characterizing local structural features in the image. Compared to traditional convolutional neural networks, CSC has advantages in the interpretability of feature representation. However, existing CSC methods typically employ a single dictionary structure, which struggles to fully represent diverse feature patterns in complex visual scenes, thus limiting their application in classroom behavior recognition tasks.
[0007] Therefore, how to introduce a more flexible feature representation mechanism into the convolutional sparse coding framework, improve the model's ability to express complex classroom behavior patterns, and achieve high-precision automatic recognition of classroom teacher and student behavior has become an important technical problem that urgently needs to be solved in the current research field of smart classrooms.
Summary of the Invention
[0008] To address the shortcomings of existing technologies, this invention aims to provide a multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism. This method constructs a multi-dictionary convolutional sparse representation model to represent sparse features in classroom video images. By introducing a multi-dictionary structure and an expert selection mechanism to select the optimal dictionary combination, the feature expression capability and detection accuracy of the classroom behavior detection model can be significantly improved.
[0009] To achieve the above objectives, the technical solution adopted by the present invention is as follows:
[0010] A multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism, comprising the following steps:
[0011] S1. Collect classroom teaching video data and perform frame-level processing on the video; extract image frames from the video sequence, preprocess them, and divide them into training and test sets;
[0012] S2. Construct a multi-dictionary convolutional sparse representation model;
[0013] S3. Introduce a dictionary selection mechanism based on a mixture-of-experts model, treating the multiple convolutional dictionaries as multiple experts, and achieving dynamic dictionary selection through a gating network;
[0014] S4. Fuse the multi-dictionary convolutional sparse representation model with the object detection model to construct the MoDE-CSC-Net model;
[0015] S5. Train the MoDE-CSC-Net model on the images in the training set, calculate the loss function value of the network model in each iteration until the loss function value converges to a minimum, and save the trained model;
[0016] S6. Adjust model parameters: Users can participate in parameter adjustment and test the test set; if the user agrees with the recognition results of teacher and student behavior, the parameter adjustment will stop and the data will be saved; if there are objections, the user can continuously adjust the model parameters as needed until the recognition results of the model that satisfy the user are achieved.
[0017] S7. Display and save detailed information about the detection and recognition.
[0018] Furthermore, step S1 specifically includes:
[0019] S1-1. Collect classroom teaching video data and perform frame-level processing on the video;
[0020] S1-2. Extract image frames from the video sequence;
[0021] S1-3. Subsequently, the image frames are preprocessed, and image normalization and image size unification are performed according to user requirements;
[0022] S1-4. Divide the images into a training set and a test set at a ratio of 8:2.
[0023] Furthermore, the preprocessing of image frames in step S1-3 includes: image quality screening, color space conversion, data type conversion, image enhancement processing, and annotation information synchronization processing.
[0024] The specific operations for image normalization and size unification include:
[0025] Image normalization: The pixel values of the input image are linearly mapped from the original [0,255] interval to the [0,1] interval, and further standardized according to the preset mean and standard deviation to reduce the differences in brightness and color distribution between different images;
[0026] Uniform size: The extracted image frames are scaled to a uniform resolution so that all input samples meet the fixed input requirements of the detection network.
[0027] Furthermore, step S2 specifically includes:
[0028] Constructing a multi-dictionary convolutional sparse representation model: through convolutional sparse coding, the input signal is represented in the following form:
[0029] $$x = \sum_{m=1}^{M} \sum_{n=1}^{N_m} d_{m,n} * z_{m,n};$$
[0030] In the formula, $x \in \mathbb{R}^{H \times W}$ is the input signal and $H \times W$ is the spatial dimension of the input signal; the model is assumed to contain $M$ convolutional dictionaries, denoted $\{D_1, \dots, D_M\}$, where the $m$-th dictionary is denoted $D_m = \{d_{m,n}\}_{n=1}^{N_m}$; $N_m$ is the number of convolutional atoms in the $m$-th dictionary; $d_{m,n}$ is the $n$-th convolutional atom in the $m$-th dictionary; $z_{m,n}$ are the corresponding sparse coding coefficients; $*$ denotes the convolution operation;
[0031] The optimization objective at this point is to minimize the following loss function:
[0032] $$\min_{z} \; \frac{1}{2} \Big\| x - \sum_{m=1}^{M} \sum_{n=1}^{N_m} d_{m,n} * z_{m,n} \Big\|_2^2 + \beta \, \| z \|_1;$$
[0033] In the formula, the first term is the reconstruction error and the second term is the sparsity constraint; $z$ denotes the sparse coding coefficients and $\beta$ is the sparsity parameter;
[0034] The optimization objective can be solved using the iterative shrinkage-thresholding algorithm (ISTA), yielding:
[0035] $$z^{(k+1)} = S_{\beta/L}\Big( z^{(k)} - \frac{1}{L} \nabla f\big(z^{(k)}\big) \Big);$$
[0036] In the formula, $f$ is the reconstruction error term and $L$ is the Lipschitz constant of its gradient $\nabla f$; $z^{(k+1)}$ and $S_{\beta/L}$ are, respectively, the sparse coding coefficients obtained in the $(k+1)$-th iteration and the soft thresholding operator; $\beta/L$ is the threshold, where $\beta$ is the sparsity penalty parameter; $\nabla$ is the gradient operator; $z_i$ denotes the coefficients of the $i$-th sample obtained after convolutional sparse coding; the soft thresholding operator is defined as:
[0037] $$S_{\beta/L}(z) = \operatorname{sign}(z) \, \max\big( |z| - \beta/L, \; 0 \big);$$
[0038] In the formula, $S_{\beta/L}(\cdot)$ is the soft thresholding operator; $\operatorname{sign}(\cdot)$ is the sign function; $z$ is the input variable of the soft thresholding operator; to integrate ISTA into standard CNN or Transformer modules, convolution operations are used instead of matrix multiplication; specifically, multiplication by the dictionary matrix $D$ is implemented as a convolution operation, and the corresponding transpose operation $D^{\top}$ is implemented as a deconvolution (transposed convolution).
[0039] Furthermore, step S3 specifically includes:
[0040] Since the system contains multiple convolutional dictionaries, each dictionary is regarded as an expert model. To achieve adaptive selection of dictionary experts, a gating network is constructed.
[0041] First, global feature compression is performed on the input features:
[0042] $$g = \operatorname{GAP}(F);$$
[0043] where $F$ denotes the input features and GAP is global average pooling; then the expert selection scores are calculated through a two-layer fully connected network:
[0044] $$s = W_2 \, \sigma(W_1 g + b_1) + b_2;$$
[0045] In the formula, $s$ is the expert selection score vector; $b_1$ and $b_2$ are the bias parameters of the two fully connected layers; $W_1$ and $W_2$ are network parameters; $\sigma$ is the activation function. The resulting expert score vector is:
[0046] $$s = [s_1, s_2, \dots, s_M];$$
[0047] In the formula, $s_1, \dots, s_M$ are the selection scores of the 1st to the $M$-th experts; to reduce computational complexity and enhance model sparsity, a Top-k sparse routing strategy is introduced; the expert selection probability is:
[0048] $$p_k = \frac{\exp(s_k)}{\sum_{j=1}^{M} \exp(s_j)};$$
[0049] In the formula, $p_k$ is the probability that the $k$-th expert is selected; $\exp(s_k)$ is the exponential of the expert score $s_k$; the denominator is the sum of the exponentials of all expert scores and is used for normalization; $j$ is the summation index; then the $K$ experts with the highest probabilities are selected:
[0050] $$\mathcal{T} = \operatorname{TopK}(p, K);$$
[0051] In the formula, $\mathcal{T}$ is the set of activated experts; $\operatorname{TopK}(p, K)$ selects the $K$ experts with the highest probability values from the expert probability vector $p$; the outputs of experts that are not selected are set to zero:
[0052] $$\tilde{p}_k = \begin{cases} p_k, & k \in \mathcal{T} \\ 0, & \text{otherwise} \end{cases};$$
[0053] In the formula, $\tilde{p}_k$ is the final weight of the $k$-th expert after Top-k sparse routing; to avoid all input samples activating only a small number of experts, a load balancing constraint is introduced; let:
[0054] $$P_k = \frac{1}{N} \sum_{i=1}^{N} p_{i,k};$$
[0055] In the formula, $P_k$ is the average probability that expert $k$ is selected, $N$ is the number of samples, and $p_{i,k}$ is the probability of the $i$-th sample choosing the $k$-th expert. The load balancing loss $\mathcal{L}_{balance}$ is defined as:
[0056] ;
[0057] The final optimization objective of the gating network is:
[0058] $$\mathcal{L} = \mathcal{L}_{task} + \gamma \, \mathcal{L}_{balance};$$
[0059] In the formula, $\mathcal{L}_{task}$ is the loss of the behavior detection task, $\mathcal{L}_{balance}$ is the load balancing loss, and $\gamma$ is the balancing coefficient; during model training, the convolutional dictionary parameters, the sparsity coefficient, and the gating network parameters $W_1$ and $W_2$ are updated simultaneously using the backpropagation algorithm. The parameter update formula is:
[0060] $$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} \mathcal{L};$$
[0061] In the formula, $\theta^{(t+1)}$ denotes the model parameters obtained in the $(t+1)$-th iteration after the current update; $\theta^{(t)}$ denotes the model parameters at the $t$-th iteration; $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss function $\mathcal{L}$ with respect to the parameters $\theta$; $\eta$ is the learning rate; adaptive learning of the dictionary experts is achieved through this optimization process.
[0062] Furthermore, step S4 specifically includes:
[0063] S4-1. Input the preprocessed classroom teaching image into the backbone of the target detection network as the initial input of the network;
[0064] S4-2. In the feature extraction process of the backbone, insert a multi-dictionary convolutional sparse coding module so that the module plays a similar role to convolution operation in the network for feature extraction.
[0065] S4-3. In the CSC module, multiple convolutional dictionaries are pre-set, each corresponding to an expert, to learn different types of classroom behavior features;
[0066] S4-4. Perform global average pooling on the features output by the current layer of the backbone and input them into the gating network to calculate the selection weights of each dictionary expert; then select the most relevant dictionary experts for the current input to participate in the calculation through the Top-k mechanism.
[0067] S4-5. The outputs of the activated dictionary experts are weighted and summed according to their corresponding weights to obtain the fused feature map, which is then passed to subsequent layers of the backbone.
[0068] S4-6. The fused features extracted by the backbone are then input into the neck and detection head of the target detection network to output the detection box position, category label and confidence score corresponding to the teacher and student behavior;
[0069] S4-7. During training, the multi-dictionary CSC module embedded in the backbone, the gating network, and the rest of the detection network are jointly trained to obtain the final MoDE-CSC-Net model.
[0070] Furthermore, step S5 specifically includes:
[0071] S5-1. Read the training set from step S1;
[0072] S5-2. Set the number of iterations required for the training set to p, and the number of images read in at one time to q, where p≥1 and q≥1;
[0073] S5-3. Use the MoDE-CSC-Net model to recognize the images: the detection head receives the features extracted by MoDE-CSC-Net and outputs the predicted detection box positions and label probability values for the targets in the input image; calculate the loss function value of the network model in each iteration;
[0074] S5-4. Repeat step S5-3 until the loss function value is minimized, and save the trained model.
[0075] Furthermore, step S5-3 specifically includes:
[0076] S5-3-1. Feature extraction: input the image into the MoDE-CSC-Net model. In the backbone, feature extraction is performed through the embedded multi-dictionary CSC module and the gating network to obtain the fused feature map $F_{fuse}$;
[0077] S5-3-2. Detection result prediction: the fused feature map $F_{fuse}$ is input into the detection head, which outputs the target bounding box predictions, the category predictions, and the corresponding confidence levels, denoted as:
[0078] $$\hat{Y} = \big\{ (\hat{b}_i, \hat{c}_i, \hat{o}_i) \big\}_{i=1}^{N_b};$$
[0079] where $\hat{b}_i$ denotes the position parameters of the $i$-th prediction box; $\hat{c}_i$ denotes the $i$-th predicted category; $\hat{o}_i$ denotes the confidence level of the $i$-th prediction box; $N_b$ denotes the number of prediction boxes;
[0080] S5-3-3. Calculate the detection loss: compare the prediction results with the ground truth annotations to obtain the behavior detection task loss. The task loss consists of the bounding box regression loss, the category loss, and the confidence loss, namely:
[0081] $$\mathcal{L}_{task} = \mathcal{L}_{box} + \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{obj};$$
[0082] where $\mathcal{L}_{box}$ is the position regression loss between the predicted bounding box and the ground truth bounding box; $\mathcal{L}_{cls}$ is the behavior category prediction loss; $\mathcal{L}_{obj}$ is the target confidence loss; $\lambda_1$ and $\lambda_2$ are balance coefficients;
[0083] S5-3-4. Calculate the gating constraint loss: to avoid frequent activation of only a small number of experts, the load balancing loss $\mathcal{L}_{balance}$ is introduced. Together with the behavior detection task loss, it constitutes the total loss function:
[0084] $$\mathcal{L}_{total} = \mathcal{L}_{task} + \gamma \, \mathcal{L}_{balance};$$
[0085] where $\gamma$ is the balance coefficient;
[0086] S5-3-5. Parameter update: based on the total loss function $\mathcal{L}_{total}$, the backpropagation algorithm is used to update the model parameters until the loss function converges, and the trained model is saved.
[0087] Furthermore, the content displayed and saved in step S6 includes: the classification information of the identified teacher and student behaviors, the classification confidence level, and the position of the detection box.
[0088] The beneficial effects of this invention are as follows:
[0089] 1. This invention constructs multiple convolutional dictionaries to characterize different types of behavioral features, and uses a gating network to adaptively select multiple dictionaries, enabling the model to dynamically select the most suitable dictionary expert for feature representation learning for different classroom behavior patterns, thereby improving the model's ability to express the behavioral features of teachers and students in complex classroom scenarios.
[0090] 2. While improving the accuracy and detection effect of classroom teacher and student behavior recognition, this invention reduces unnecessary computational overhead through a sparse routing mechanism, improves the inference efficiency of the model, and enhances the interpretability of the model's feature expression process. As a result, it can be more effectively applied to educational informatization scenarios such as smart classroom monitoring, teaching behavior analysis, and teaching quality evaluation.
Attached Figure Description
[0091] Figure 1 is a flowchart of the method of the present invention;
[0092] Figure 2 is a structural diagram of the MoDE-CSC-Net model of the present invention;
[0093] Figure 3 is a system interface diagram of the recognition method of the present invention;
[0094] Figure 4 is a teacher and student behavior recognition diagram of the MoDE-CSC-Net model of the present invention;
[0095] Figure 5 is a bar chart comparing the teacher and student behavior recognition accuracy of the MoDE-CSC-Net model of the present invention with that of the comparison model;
[0096] Figure 6 is a line graph comparing the teacher and student behavior recognition accuracy of the MoDE-CSC-Net model of the present invention with that of the comparison model.
Detailed Implementation
[0097] The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given are for illustrative purposes only and are not intended to limit the scope of the invention.
[0098] As shown in Figures 1-3, this invention proposes a multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism. The method includes the following steps:
[0099] Step S1. Collect classroom teaching video data and perform frame-level processing on the video; extract image frames from the video sequence, preprocess them, and divide them into training and test sets. Specifically, this includes:
[0100] S1-1. Collect classroom teaching video data and perform frame-level processing on the video;
[0101] S1-2. Extract image frames from the video sequence;
[0102] S1-3. Subsequently, the image frames are preprocessed, including image normalization and image size unification according to user requirements. This preprocessing includes: image quality screening, color space conversion, data type conversion, image enhancement, and annotation information synchronization.
[0103] The specific operations for image normalization and size unification include:
[0104] Image normalization: The pixel values of the input image are linearly mapped from the original [0,255] interval to the [0,1] interval, and further standardized according to the preset mean and standard deviation to reduce the differences in brightness and color distribution between different images and improve the stability of model training.
[0105] Uniform size: The extracted image frames are scaled to a uniform resolution, 640×640 or other preset input sizes, so that all input samples meet the fixed input requirements of the detection network.
[0106] S1-4. Divide the images into a training set and a test set at a ratio of 8:2.
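The normalization and 8:2 split of step S1 can be illustrated with a minimal pure-Python sketch. The function names, the frame file names, and the mean and standard deviation values are assumptions made for illustration; the source only states that preset statistics are used.

```python
import random


def normalize_pixels(pixels, mean, std):
    """Map 8-bit values from [0, 255] to [0, 1], then standardize them."""
    return [((p / 255.0) - mean) / std for p in pixels]


def split_dataset(items, train_ratio=0.8, seed=0):
    """Shuffle the items reproducibly and split them into train and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical frame names; real frames would come from the classroom videos.
frames = [f"frame_{i:04d}.jpg" for i in range(10)]
train_set, test_set = split_dataset(frames)

# mean=0.5 and std=0.5 are placeholder statistics, not values from the source.
normalized = normalize_pixels([0, 255], mean=0.5, std=0.5)
```

With ten frames, the split yields eight training items and two test items, matching the 8:2 ratio.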
[0107] Step S2. Construct a multi-dictionary convolutional sparse representation model. Specifically, this includes:
[0108] Constructing a multi-dictionary convolutional sparse representation model: through convolutional sparse coding, the input signal is represented in the following form:
[0109] $$x = \sum_{m=1}^{M} \sum_{n=1}^{N_m} d_{m,n} * z_{m,n};$$
[0110] In the formula, $x \in \mathbb{R}^{H \times W}$ is the input signal and $H \times W$ is the spatial dimension of the input signal; the model is assumed to contain $M$ convolutional dictionaries, denoted $\{D_1, \dots, D_M\}$, where the $m$-th dictionary is denoted $D_m = \{d_{m,n}\}_{n=1}^{N_m}$; $N_m$ is the number of convolutional atoms in the $m$-th dictionary; $d_{m,n}$ is the $n$-th convolutional atom in the $m$-th dictionary; $z_{m,n}$ are the corresponding sparse coding coefficients; $*$ denotes the convolution operation;
[0111] The optimization objective at this point is to minimize the following loss function:
[0112] $$\min_{z} \; \frac{1}{2} \Big\| x - \sum_{m=1}^{M} \sum_{n=1}^{N_m} d_{m,n} * z_{m,n} \Big\|_2^2 + \beta \, \| z \|_1;$$
[0113] In the formula, the first term is the reconstruction error and the second term is the sparsity constraint; $z$ denotes the sparse coding coefficients and $\beta$ is the sparsity parameter;
[0114] The optimization objective can be solved using the iterative shrinkage-thresholding algorithm (ISTA), yielding:
[0115] $$z^{(k+1)} = S_{\beta/L}\Big( z^{(k)} - \frac{1}{L} \nabla f\big(z^{(k)}\big) \Big);$$
[0116] In the formula, $f$ is the reconstruction error term and $L$ is the Lipschitz constant of its gradient $\nabla f$; $z^{(k+1)}$ and $S_{\beta/L}$ are, respectively, the sparse coding coefficients obtained in the $(k+1)$-th iteration and the soft thresholding operator; $\beta/L$ is the threshold, where $\beta$ is the sparsity penalty parameter; $\nabla$ is the gradient operator; $z_i$ denotes the coefficients of the $i$-th sample obtained after convolutional sparse coding; the soft thresholding operator is defined as:
[0117] $$S_{\beta/L}(z) = \operatorname{sign}(z) \, \max\big( |z| - \beta/L, \; 0 \big);$$
[0118] In the formula, $S_{\beta/L}(\cdot)$ is the soft thresholding operator; $\operatorname{sign}(\cdot)$ is the sign function; $z$ is the input variable of the soft thresholding operator; to integrate ISTA into standard CNN or Transformer modules, convolution operations are used instead of matrix multiplication; specifically, multiplication by the dictionary matrix $D$ is implemented as a convolution operation, and the corresponding transpose operation $D^{\top}$ is implemented as a deconvolution (transposed convolution).
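As an illustration of step S2, the following is a minimal 1-D sketch of ISTA for convolutional sparse coding in pure Python. It is a sketch under stated assumptions, not the patented implementation: a single dictionary with two toy atoms, full convolution for synthesis, valid correlation as its adjoint (the role the text assigns to deconvolution), and the sum of squared atom L1 norms as an upper bound on the Lipschitz constant. All names and numeric values are illustrative.

```python
def conv_full(z, d):
    """Full 1-D convolution of a coefficient map z with an atom d."""
    out = [0.0] * (len(z) + len(d) - 1)
    for i, zi in enumerate(z):
        for j, dj in enumerate(d):
            out[i + j] += zi * dj
    return out


def corr_valid(r, d):
    """Adjoint of conv_full: valid correlation of residual r with atom d."""
    n = len(r) - len(d) + 1
    return [sum(d[j] * r[i + j] for j in range(len(d))) for i in range(n)]


def soft_threshold(v, tau):
    """Shrink each entry toward zero by tau; entries within [-tau, tau] become 0."""
    return [(abs(a) - tau) * (1.0 if a > 0 else -1.0) if abs(a) > tau else 0.0
            for a in v]


def synthesize(atoms, codes):
    """Reconstruct the signal as the sum of atom-coefficient convolutions."""
    total = [0.0] * (len(codes[0]) + len(atoms[0]) - 1)
    for d, z in zip(atoms, codes):
        for i, y in enumerate(conv_full(z, d)):
            total[i] += y
    return total


def objective(x, atoms, codes, beta):
    """0.5 * squared reconstruction error + beta * L1 norm of the codes."""
    recon = synthesize(atoms, codes)
    err = sum((a - b) ** 2 for a, b in zip(x, recon))
    return 0.5 * err + beta * sum(abs(c) for z in codes for c in z)


def ista(x, atoms, beta, iters=200):
    """Run ISTA from an all-zero start; all atoms must share one length."""
    # Upper bound on the Lipschitz constant: sum of squared atom L1 norms.
    lipschitz = sum(sum(abs(c) for c in d) ** 2 for d in atoms)
    n = len(x) - len(atoms[0]) + 1
    codes = [[0.0] * n for _ in atoms]
    for _ in range(iters):
        recon = synthesize(atoms, codes)
        residual = [a - b for a, b in zip(x, recon)]
        # Gradient step (via the adjoint) followed by soft thresholding.
        codes = [
            soft_threshold(
                [zi + g / lipschitz for zi, g in zip(z, corr_valid(residual, d))],
                beta / lipschitz,
            )
            for d, z in zip(atoms, codes)
        ]
    return codes


atoms = [[1.0, -1.0], [0.5, 0.5]]  # two toy atoms (assumed values)
true_codes = [[0.0, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0, 0.0]]
x = synthesize(atoms, true_codes)
codes = ista(x, atoms, beta=0.05)
```

Because the step size is 1/L with L no smaller than the true Lipschitz constant, the objective never increases from one iteration to the next. In a network, the same loop is typically unrolled for a fixed number of iterations with learnable atoms.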
[0119] Step S3. Introduce a dictionary selection mechanism based on a mixture-of-experts model, treating the multiple convolutional dictionaries as multiple experts, and achieving dynamic dictionary selection through a gating network. Specifically, this includes:
[0120] Since the system contains multiple convolutional dictionaries, each dictionary is regarded as an expert model. To achieve adaptive selection of dictionary experts, a gating network is constructed.
[0121] First, global feature compression is performed on the input features:
[0122] $$g = \operatorname{GAP}(F);$$
[0123] where $F$ denotes the input features and GAP is global average pooling; then the expert selection scores are calculated through a two-layer fully connected network:
[0124] $$s = W_2 \, \sigma(W_1 g + b_1) + b_2;$$
[0125] In the formula, $s$ is the expert selection score vector; $b_1$ and $b_2$ are the bias parameters of the two fully connected layers; $W_1$ and $W_2$ are network parameters; $\sigma$ is the activation function. The resulting expert score vector is:
[0126] $$s = [s_1, s_2, \dots, s_M];$$
[0127] In the formula, $s_1, \dots, s_M$ are the selection scores of the 1st to the $M$-th experts; to reduce computational complexity and enhance model sparsity, a Top-k sparse routing strategy is introduced; the expert selection probability is:
[0128] $$p_k = \frac{\exp(s_k)}{\sum_{j=1}^{M} \exp(s_j)};$$
[0129] In the formula, $p_k$ is the probability that the $k$-th expert is selected; $\exp(s_k)$ is the exponential of the expert score $s_k$; the denominator is the sum of the exponentials of all expert scores and is used for normalization; $j$ is the summation index; then the $K$ experts with the highest probabilities are selected:
[0130] $$\mathcal{T} = \operatorname{TopK}(p, K);$$
[0131] In the formula, $\mathcal{T}$ is the set of activated experts; $\operatorname{TopK}(p, K)$ selects the $K$ experts with the highest probability values from the expert probability vector $p$; the outputs of experts that are not selected are set to zero:
[0132] $$\tilde{p}_k = \begin{cases} p_k, & k \in \mathcal{T} \\ 0, & \text{otherwise} \end{cases};$$
[0133] In the formula, $\tilde{p}_k$ is the final weight of the $k$-th expert after Top-k sparse routing; to avoid all input samples activating only a small number of experts, a load balancing constraint is introduced; let:
[0134] $$P_k = \frac{1}{N} \sum_{i=1}^{N} p_{i,k};$$
[0135] In the formula, $P_k$ is the average probability that expert $k$ is selected, $N$ is the number of samples, and $p_{i,k}$ is the probability of the $i$-th sample choosing the $k$-th expert. The load balancing loss $\mathcal{L}_{balance}$ is defined as:
[0136] ;
[0137] The final optimization objective of the gating network is:
[0138] $$\mathcal{L} = \mathcal{L}_{task} + \gamma \, \mathcal{L}_{balance};$$
[0139] In the formula, $\mathcal{L}_{task}$ is the loss of the behavior detection task, $\mathcal{L}_{balance}$ is the load balancing loss, and $\gamma$ is the balancing coefficient; during model training, the convolutional dictionary parameters, the sparsity coefficient, and the gating network parameters $W_1$ and $W_2$ are updated simultaneously using the backpropagation algorithm. The parameter update formula is:
[0140] $$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} \mathcal{L};$$
[0141] In the formula, $\theta^{(t+1)}$ denotes the model parameters obtained in the $(t+1)$-th iteration after the current update; $\theta^{(t)}$ denotes the model parameters at the $t$-th iteration; $\nabla_{\theta} \mathcal{L}$ is the gradient of the loss function $\mathcal{L}$ with respect to the parameters $\theta$; $\eta$ is the learning rate; adaptive learning of the dictionary experts is achieved through this optimization process.
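The gating path of step S3 (global average pooling, a two-layer fully connected network, softmax, and Top-k routing) can be sketched as follows. This is an illustrative pure-Python sketch: the ReLU activation, the toy weights, the feature map size, and the choice of two active experts out of three are assumptions, since the source does not fix them.

```python
import math


def global_average_pool(feature_map):
    """Average each channel (a 2-D list) down to a single value."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]


def linear(weights, bias, v):
    """Fully connected layer: weights @ v + bias."""
    return [sum(w * a for w, a in zip(row, v)) + b for row, b in zip(weights, bias)]


def relu(v):
    """Assumed activation; the source only says 'activation function'."""
    return [max(0.0, a) for a in v]


def softmax(scores):
    """Numerically stable softmax over the expert scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(probs, k):
    """Keep the k largest probabilities and zero out the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    active = sorted(order[:k])
    weights = [p if i in active else 0.0 for i, p in enumerate(probs)]
    return active, weights


def gate(feature_map, w1, b1, w2, b2, k):
    """Return the active expert indices and the sparse routing weights."""
    pooled = global_average_pool(feature_map)
    scores = linear(w2, b2, relu(linear(w1, b1, pooled)))
    return top_k_route(softmax(scores), k)


# Toy input: 2 channels of size 2x2, 3 dictionary experts, k = 2 (all assumed).
feature_map = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 0.0]]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]
active, weights = gate(feature_map, w1, b1, w2, b2, k=2)
```

Here the pooled vector is [4.0, 1.0], so the expert scores are [4.0, 1.0, -5.0], and the first two experts are kept while the third receives a weight of zero.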
[0142] Step S4. Fuse the multi-dictionary convolutional sparse representation model with the object detection model to construct the MoDE-CSC-Net model. Specifically, this includes:
[0143] S4-1. Input the preprocessed classroom teaching image into the backbone of the target detection network as the initial input of the network;
[0144] S4-2. In the feature extraction process of the backbone, a multi-dictionary convolutional sparse coding module is inserted, so that the module plays a similar role to convolution operation in the network for feature extraction. The module performs sparse representation of the input features through multiple convolutional dictionaries to extract richer local structural information.
[0145] S4-3. In the CSC module, multiple convolutional dictionaries are pre-set, each corresponding to an expert, to learn different types of classroom behavior features;
[0146] S4-4. Perform global average pooling on the features output by the current layer of the backbone and input them into the gating network to calculate the selection weights of each dictionary expert; then select the most relevant dictionary experts for the current input to participate in the calculation through the Top-k mechanism.
[0147] S4-5. The outputs of the activated dictionary experts are weighted and summed according to their corresponding weights to obtain the fused feature map, which is then passed to subsequent layers of the backbone; this process is equivalent to performing an adaptive feature extraction within the backbone.
[0148] S4-6. The fused features extracted by the backbone are then input into the neck and detection head of the target detection network to output the detection box position, category label and confidence score corresponding to the teacher and student behavior;
[0149] S4-7. During training, the multi-dictionary CSC module embedded in the backbone, the gating network, and the rest of the detection network are jointly trained to obtain the final MoDE-CSC-Net model.
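The weighted fusion of steps S4-4 and S4-5 can be sketched as below. The experts here are stand-in element-wise functions rather than real convolutional sparse coding modules, and the function name and routing weights are hypothetical. The sketch only shows that experts with zero routing weight are skipped, which is where the computational saving of sparse routing comes from.

```python
def fuse_expert_outputs(features, experts, weights):
    """Weighted sum of the outputs of experts whose routing weight is nonzero."""
    fused = [0.0] * len(features)
    evaluated = []
    for index, (expert, weight) in enumerate(zip(experts, weights)):
        if weight == 0.0:
            continue  # inactive experts are never evaluated
        evaluated.append(index)
        for i, value in enumerate(expert(features)):
            fused[i] += weight * value
    return fused, evaluated


# Stand-in experts: simple element-wise maps instead of real CSC dictionaries.
experts = [
    lambda f: [2.0 * v for v in f],
    lambda f: [v + 1.0 for v in f],
    lambda f: [-v for v in f],
]
fused, evaluated = fuse_expert_outputs([1.0, 2.0], experts, [0.0, 0.5, 0.5])
```

In this toy call only the second and third experts run, and their outputs are combined with equal weight.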
[0150] Step S5. Train the network model on the images in the training set using the MoDE-CSC-Net model, calculate the loss function value of the network model in each iteration, until the loss function value converges to a minimum, and save the trained model. Specifically, this includes:
[0151] S5-1. Read the training set from step S1;
[0152] S5-2. Set the number of iterations required for the training set to p, and the number of images read in at one time to q, where p≥1 and q≥1;
[0153] S5-3. Use the MoDE-CSC-Net model to recognize the images: the detection head receives the features extracted by MoDE-CSC-Net and outputs the predicted detection box positions and label probability values for the targets in the input image, and the loss function value of the network model is calculated in each iteration. Specifically, this includes:
[0154] S5-3-1. Feature Extraction: Input the input image into the MoDE-CSC-Net model. In the backbone, feature extraction is performed through the embedded multi-dictionary CSC module and the gated network to obtain the fused feature map. ;
[0155] S5-3-2. Detection Result Prediction: Merging Feature Maps Input the detection head and output the target bounding box prediction results respectively. Category prediction results and corresponding confidence level , denoted as:
[0156]
[0157] in, Indicates the first Position parameters of each prediction box Indicates the first Each prediction category Indicates the first The confidence level of each predicted bounding box. Indicates the number of prediction boxes;
[0158] S5-3-3. Calculate the detection loss: Compare the prediction results with the ground truth annotations to obtain the action detection task loss. The task loss consists of bounding box regression loss, category loss, and confidence loss, namely:
[0159]
[0160] in, This represents the positional regression loss between the predicted bounding box and the ground truth bounding box; Indicate the predicted loss based on the behavior category; Indicates the target confidence loss; and Indicates the balance coefficient;
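[0161] S5-3-4. Calculate the gating constraint loss: To avoid frequent activation of only a small number of experts, a load balancing loss is introduced. Together with the behavior detection task loss, it constitutes the total loss function:
[0162] $L_{total} = L_{task} + \alpha L_{balance}$
[0163] where $L_{task}$ denotes the behavior detection task loss, $L_{balance}$ denotes the load balancing loss, and $\alpha$ denotes the balance coefficient;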
[0164] S5-3-5. Parameter update: Based on the total loss function, the backpropagation algorithm is used to update the model parameters until the loss function converges, and the trained model is saved.
[0165] S5-4. Repeat step S5-3 until the loss function value converges to its minimum, and save the trained model.
[0166] Step S6. Adjust model parameters: Users can participate in parameter adjustment and evaluate the model on the test set; if users accept the recognition results for teacher and student behavior, they can stop adjusting parameters and save the data; if they have objections, they can continue to adjust the model parameters as needed until the recognition results are satisfactory.
[0167] The content displayed and saved includes: the classification information of the identified teacher and student behaviors, the classification confidence level, and the position of the detection box.
[0168] S7. Display and save detailed information about the detection and recognition.
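As a non-limiting illustration of steps S5-3-3 to S5-4, the following Python sketch combines the three detection loss terms into the task loss, adds a weighted load balancing term, and checks a simple convergence criterion. The function names, the default coefficient values, the placement of the balance coefficients on the category and confidence terms, and the tolerance-based stopping rule are assumptions made for illustration; they are not specified in this application.

```python
def detection_loss(l_box, l_cls, l_conf, w_cls=1.0, w_conf=1.0):
    """Task loss: box regression + weighted category + weighted confidence.

    Where the balance coefficients are applied is an assumption.
    """
    return l_box + w_cls * l_cls + w_conf * l_conf


def total_loss(l_task, l_balance, w_balance=0.01):
    """Total loss: task loss plus the weighted load balancing loss."""
    return l_task + w_balance * l_balance


def has_converged(history, tol=1e-4, patience=3):
    """Hypothetical stopping rule: the last `patience` changes in the
    recorded loss values are all smaller than `tol`."""
    if len(history) <= patience:
        return False
    recent = history[-(patience + 1):]
    return all(abs(a - b) < tol for a, b in zip(recent, recent[1:]))
```

In a training loop, the value returned by `total_loss` would be appended to `history` after each iteration, and step S5-3 would be repeated until `has_converged(history)` is true or the preset number of iterations is reached.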
[0169] In summary, this invention proposes a multi-dictionary convolutional sparse coding method for teacher and student behavior recognition based on an expert selection mechanism. It combines the strengths of deep learning models in feature representation learning with the capabilities of convolutional sparse coding models in structural feature expression and sparse representation. It further introduces the expert selection mechanism of the mixture-of-experts (hybrid expert) model to construct a multi-dictionary expert convolutional sparse coding network model. This model constructs multiple convolutional dictionaries to characterize different types of behavioral features and uses a gating network to select among these dictionaries adaptively. The model can therefore dynamically select the most suitable dictionary experts for feature representation learning according to the classroom behavior pattern, which improves its ability to represent teacher and student behavioral features in complex classroom scenarios.
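As a non-limiting illustration of the gating network summarized above, the following Python sketch performs global average pooling, computes expert scores with a two-layer fully connected network, converts them to probabilities with softmax, and keeps only the Top-k experts. All function names are hypothetical, ReLU is assumed as the activation function, and the load balancing loss uses an assumed squared-sum form on the average selection probabilities; this application does not fix these choices here.

```python
import math


def global_average_pool(feature_map):
    """Reduce a C x H x W nested list to a list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]


def linear(weights, bias, vec):
    """Fully connected layer: weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(probs, k):
    """Keep the k largest probabilities; set the others to zero."""
    active = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [probs[i] if i in active else 0.0 for i in range(len(probs))]


def gate(feature_map, w1, b1, w2, b2, k):
    """Expert weights for one input: GAP -> FC -> ReLU -> FC -> softmax -> Top-k."""
    g = global_average_pool(feature_map)
    scores = linear(w2, b2, relu(linear(w1, b1, g)))
    return top_k_route(softmax(scores), k)


def load_balance_loss(prob_rows):
    """Assumed form: (number of experts) * sum of squared average probabilities.

    prob_rows[i][j] is the probability that sample i selects expert j.
    The value is 1.0 when all experts are used equally and grows as the
    selection concentrates on fewer experts.
    """
    n = len(prob_rows)
    avg = [sum(row[j] for row in prob_rows) / n for j in range(len(prob_rows[0]))]
    return len(avg) * sum(p * p for p in avg)
```

With three experts and k = 2, `gate` returns a weight vector in which exactly one entry is zero, so only two dictionary experts take part in the subsequent computation.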
[0170] This invention improves the accuracy and detection performance of classroom teacher and student behavior recognition. At the same time, the sparse routing mechanism avoids computing experts that are not selected, which reduces unnecessary computational overhead, improves the inference efficiency of the model, and makes the model's feature expression process easier to interpret. As a result, the method can be applied more effectively to educational informatization scenarios such as smart classroom monitoring, teaching behavior analysis, and teaching quality evaluation.
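As a non-limiting illustration of how sparse routing saves computation, the following Python sketch fuses expert outputs by a weighted sum and skips every expert whose routing weight is zero. The function name `fuse_experts` and the representation of experts as Python callables acting on flat lists are assumptions made for illustration.

```python
def fuse_experts(x, experts, weights):
    """Weighted sum of the outputs of the activated experts.

    experts: list of callables, each mapping the input to a feature list.
    weights: routing weights after Top-k selection (zero = not activated).
    Experts with zero weight are never evaluated.
    Returns None if no expert is activated.
    """
    fused = None
    for expert, w in zip(experts, weights):
        if w == 0.0:
            continue  # inactive expert: no computation is spent on it
        out = expert(x)
        if fused is None:
            fused = [w * v for v in out]
        else:
            fused = [f + w * v for f, v in zip(fused, out)]
    return fused
```

Because inactive experts are skipped before they are called, the cost per input scales with the number of selected experts rather than with the total number of dictionaries.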
[0171] Simulation experiment:
[0172] To verify the method proposed in this application, simulation experiments were carried out, and their results are analyzed below. The experiments compare the proposed technical solution with existing techniques, provide experimental support for patent examination, and lay a foundation for subsequent physical experiments and industrialization of the technology.
[0173] In terms of recognition performance, the MoDE-CSC-Net model proposed in this application achieves higher detection accuracy in classroom teaching behavior recognition tasks. As shown in Figure 5, the comparison of the three models at mAP@0.50 indicates that the detection accuracy of the MoDE-CSC-Net model is higher than that of the comparison models YOLOv11 and WE-DETR. This indicates that, by introducing multi-dictionary convolutional sparse representation and a dictionary selection mechanism based on a mixture-of-experts model, this application extracts discriminative features of teacher and student behavioral targets in classroom scenarios more effectively, thereby improving overall detection performance.
[0174] Secondly, regarding the model training process, Figure 6 shows the mAP@0.50 variation curves for the three models during training. As can be seen from the figure, the MoDE-CSC-Net model converges faster during training and achieves higher final stable accuracy. This indicates that the multi-dictionary expert selection mechanism proposed in this application not only enhances feature representation capabilities but also improves the stability and optimization effect of model training.
[0175] Furthermore, the visual recognition results in Figure 4 show that, compared with the comparison models, the method of this application locates the behavioral targets of teachers and students in complex classroom environments more accurately, with a closer match between the detection box and the target area. It retains good recognition ability for densely distributed targets, small targets, and targets affected by background interference, indicating that the method adapts well to complex classroom scenarios and is robust in them.
[0176] Therefore, this application introduces multi-dictionary convolutional sparse representation and expert dynamic selection mechanism into the object detection model, enabling the model to adaptively select more suitable dictionary experts for feature extraction based on different classroom behavior patterns. This improves the accuracy of classroom teacher and student behavior recognition, enhances model training stability, adaptability to complex scenarios, and effectiveness of feature representation, thus enabling better application in scenarios such as smart classroom monitoring, teaching behavior analysis, and teaching quality evaluation.
[0177] Obviously, the embodiments described above are only some of the embodiments of this application, not all of them. The accompanying drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms and is not limited to the embodiments described herein; rather, these embodiments are provided to give a more thorough and comprehensive understanding of the disclosure of this application. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent substitutions for some of the technical features. Any equivalent structure made using the content of the specification and drawings of this application, and applied directly or indirectly to other related technical fields, likewise falls within the scope of patent protection of this application.
Claims
1. A multi-dictionary CSC classroom teaching behavior recognition method based on an expert selection mechanism, characterized in that: The method includes the following steps: S1. Collect classroom teaching video data and perform frame-level processing on the video; extract image frames from the video sequence, preprocess them, and divide them into a training set and a test set; S2. Construct a multi-dictionary convolutional sparse representation model; S3. Introduce a dictionary selection mechanism based on a mixture-of-experts model, treating multiple convolutional dictionaries as multiple experts and achieving dynamic dictionary selection through a gating network; S4. Fuse the multi-dictionary convolutional sparse representation model with the object detection model to construct the MoDE-CSC-Net model; S5. Train the MoDE-CSC-Net model on the images in the training set, calculate the loss function value of the network model in each iteration until the loss function value converges to its minimum, and save the trained model; S6. Adjust model parameters: Users can participate in parameter adjustment and evaluate the model on the test set; if the user accepts the recognition results for teacher and student behavior, parameter adjustment stops and the data is saved; if the user has objections, the user can continue to adjust the model parameters as needed until the recognition results are satisfactory; S7. Display and save detailed information about the detection and recognition.
2. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 1, characterized in that: Step S1 specifically includes: S1-1. Collect classroom teaching video data and perform frame-level processing on the video; S1-2. Extract image frames from the video sequence; S1-3. Preprocess the image frames, and perform image normalization and image size unification according to user requirements; S1-4. Divide the image frames into a training set and a test set at a ratio of 8:2.
3. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 2, characterized in that: The preprocessing of image frames in steps S1-3 includes: image quality screening, color space conversion, data type conversion, image enhancement processing, and annotation information synchronization processing. The specific operations for image normalization and size unification include: Image normalization: The pixel values of the input image are linearly mapped from the original [0,255] interval to the [0,1] interval, and further standardized according to the preset mean and standard deviation to reduce the differences in brightness and color distribution between different images; Uniform size: The extracted image frames are scaled to a uniform resolution so that all input samples meet the fixed input requirements of the detection network.
4. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 1, characterized in that: Step S2 specifically includes: Constructing a multi-dictionary convolutional sparse representation model: The input signal is represented in the following form through convolutional sparse coding: $x = \sum_{m=1}^{M} \sum_{j=1}^{K_m} d_{m,j} * z_{m,j}$; In the formula, $x \in \mathbb{R}^{H \times W}$ is the input signal, and $H \times W$ represents the spatial dimension of the input signal; suppose the model contains $M$ convolutional dictionaries, denoted as $\{D_1, \dots, D_M\}$, among which the $m$-th dictionary is denoted as $D_m = \{d_{m,1}, \dots, d_{m,K_m}\}$; $K_m$ indicates the number of convolution atoms in the $m$-th dictionary; $d_{m,j}$ indicates the $j$-th convolutional atom in the $m$-th dictionary; $z_{m,j}$ are the corresponding sparse coding coefficients; $*$ indicates the convolution operation; The optimization objective is to minimize the following loss function: $\min_{z} \frac{1}{2} \left\| x - \sum_{m=1}^{M} \sum_{j=1}^{K_m} d_{m,j} * z_{m,j} \right\|_2^2 + \beta \| z \|_1$; In the formula, the first term represents the reconstruction error, and the second term represents the sparsity constraint; $z$ denotes the sparse coding coefficients, and $\beta$ is the sparsity parameter; The optimization objective is solved using ISTA, yielding the following update: $z^{(k+1)} = \mathcal{S}_{\beta/L}\left( z^{(k)} - \frac{1}{L} \nabla f(z^{(k)}) \right)$; In the formula, $L$ is the Lipschitz constant of the gradient $\nabla f$; $z^{(k+1)}$ and $\mathcal{S}_{\beta/L}$ represent the sparse coding coefficient update result obtained in the (k+1)th iteration and the soft threshold operator, respectively; $\beta / L$ represents the threshold size, and $\beta$ is a parameter related to the sparsity penalty; $\nabla$ represents the gradient operator; $z_i$ represents the coefficients of the i-th sample obtained after convolutional sparse coding; the soft threshold operator is defined as: $\mathcal{S}_{\beta/L}(z) = \operatorname{sign}(z) \max\left( |z| - \beta/L,\ 0 \right)$; In the formula, $\mathcal{S}_{\beta/L}(\cdot)$ represents the soft threshold operator; $\operatorname{sign}(z)$ is the sign function; $z$ is the input variable of the soft threshold operator. To integrate ISTA into standard CNN or Transformer modules, matrix multiplication is replaced with convolution operations; specifically, the matrix multiplication is designed as a convolution operation, and the corresponding transpose operation corresponds to deconvolution.
5. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 4, characterized in that: Step S3 specifically includes: Since the system contains multiple convolutional dictionaries, each dictionary is regarded as an expert model. To achieve adaptive selection of dictionary experts, a gating network is constructed. First, global feature compression is performed on the input features $X$: $g = \mathrm{GAP}(X)$; where GAP is global average pooling; then, the expert selection score is calculated through a two-layer fully connected network: $s = W_2\, \sigma(W_1 g + b_1) + b_2$; In the formula, $s$ indicates the expert selection score vector; $b_1$ and $b_2$ represent the bias parameters corresponding to the two fully connected layers; $W_1$ and $W_2$ are network parameters; $\sigma$ is the activation function. The expert score vector is obtained as: $s = [s_1, s_2, \dots, s_M]$; In the formula, $s_1$ to $s_M$ represent the selection scores corresponding to the 1st to the $M$-th expert; to reduce computational complexity and enhance model sparsity, a Top-k sparse routing strategy is introduced; the expert selection probability is: $p_k = \frac{\exp(s_k)}{\sum_{j=1}^{M} \exp(s_j)}$; In the formula, $p_k$ represents the probability that the k-th expert is selected; $\exp(s_k)$ indicates the result of applying the exponential operation to the expert score $s_k$; the denominator represents the sum of all expert scores after the exponential operation and is used for normalization; $j$ is the summation index; then the k experts with the highest probabilities are selected: $\mathcal{T} = \operatorname{TopK}(p, k)$; In the formula, $\mathcal{T}$ indicates the set of experts that are activated; $\operatorname{TopK}(p, k)$ indicates that the top k experts with the highest probability values are selected from the expert probability vector $p$; the output of experts not selected is set to zero: $\tilde{p}_k = p_k$ if $k \in \mathcal{T}$, and $\tilde{p}_k = 0$ otherwise; In the formula, $\tilde{p}_k$ represents the final weight of the k-th expert after Top-k sparse routing; to avoid all input samples activating only a small number of experts, a load balancing constraint is introduced; let: $P_k = \frac{1}{n} \sum_{i=1}^{n} p_{i,k}$; In the formula, $P_k$ represents the average probability that expert k is selected, $p_{i,k}$ represents the probability of the i-th sample choosing the k-th expert, and $n$ is the number of samples. The load balancing loss $L_{balance}$ is defined on the average selection probabilities $P_k$; The final gating network optimization objective is: $L_{total} = L_{task} + \alpha L_{balance}$; In the formula, $L_{task}$ is the loss of the behavior detection task, and $\alpha$ is the balancing coefficient; during model training, the convolutional dictionary parameters $D$, the sparse coefficients $z$, and the gating network parameters $W_1$ and $W_2$ are updated simultaneously using the backpropagation algorithm. The parameter update formula is: $\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_{\theta} L_{total}$; In the formula, $\theta^{(t+1)}$ represents the model parameters obtained in the (t+1)th iteration after the current update; $\theta^{(t)}$ represents the model parameters at the t-th iteration; $\nabla_{\theta} L_{total}$ represents the gradient of the loss function with respect to the parameter $\theta$; $\eta$ is the learning rate. Adaptive learning of the dictionary experts is achieved through this optimization process.
6. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 1, characterized in that: Step S4 specifically includes: S4-1. Input the preprocessed classroom teaching image into the backbone of the target detection network as the initial input of the network; S4-2. In the feature extraction process of the backbone, insert a multi-dictionary convolutional sparse coding module so that the module plays a similar role to convolution operation in the network for feature extraction. S4-3. In the CSC module, multiple convolutional dictionaries are pre-set, each corresponding to an expert, to learn different types of classroom behavior features; S4-4. Perform global average pooling on the features output by the current layer of the backbone and input them into the gating network to calculate the selection weights of each dictionary expert; then select the most relevant dictionary experts for the current input to participate in the calculation through the Top-k mechanism. S4-5. The outputs of the activated dictionary experts are weighted and summed according to their corresponding weights to obtain the fused feature map, which is then passed to subsequent layers of the backbone. S4-6. The fused features extracted by the backbone are then input into the neck and detection head of the target detection network to output the detection box position, category label and confidence score corresponding to the teacher and student behavior; S4-7. During training, the multi-dictionary CSC module embedded in the backbone, the gated network, and the rest of the detection network are jointly trained to obtain the final MoDE-CSC-Net model.
7. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 1, characterized in that: Step S5 specifically includes: S5-1. Read the training set from step S1; S5-2. Set the number of training iterations and the number of images q read in at one time, where both the number of iterations and q are ≥ 1; S5-3. Use the MoDE-CSC-Net model to recognize the images: the features output by MoDE-CSC-Net are mapped to the detection box position and label probability values of each input target to give the prediction result, and the loss function value of the network model is calculated in each iteration; S5-4. Repeat step S5-3 until the loss function value converges to its minimum, and save the trained model.
8. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 7, characterized in that: Step S5-3 specifically includes: S5-3-1. Feature extraction: Input the image into the MoDE-CSC-Net model. In the backbone, feature extraction is performed by the embedded multi-dictionary CSC module and the gating network to obtain the fused feature map; S5-3-2. Detection result prediction: Input the fused feature map into the detection head, which outputs the target bounding box prediction results, the category prediction results, and the corresponding confidence levels, denoted as: $\hat{Y} = \{(\hat{b}_i, \hat{c}_i, \hat{o}_i)\}_{i=1}^{N}$; where $\hat{b}_i$ denotes the position parameters of the $i$-th prediction box, $\hat{c}_i$ denotes the $i$-th predicted category, $\hat{o}_i$ denotes the confidence level of the $i$-th prediction box, and $N$ denotes the number of prediction boxes; S5-3-3. Calculate the detection loss: Compare the prediction results with the ground truth annotations to obtain the behavior detection task loss. The task loss consists of the bounding box regression loss, the category loss, and the confidence loss, namely: $L_{task} = L_{box} + \lambda_{1} L_{cls} + \lambda_{2} L_{conf}$; where $L_{box}$ denotes the position regression loss between the predicted bounding box and the ground truth bounding box; $L_{cls}$ denotes the prediction loss for the behavior category; $L_{conf}$ denotes the target confidence loss; and $\lambda_{1}$ and $\lambda_{2}$ denote the balance coefficients; S5-3-4. Calculate the gating constraint loss: To avoid frequent activation of only a small number of experts, a load balancing loss is introduced. Together with the behavior detection task loss, it constitutes the total loss function: $L_{total} = L_{task} + \alpha L_{balance}$; where $L_{balance}$ denotes the load balancing loss and $\alpha$ denotes the balance coefficient; S5-3-5. Parameter update: Based on the total loss function $L_{total}$, the backpropagation algorithm is used to update the model parameters until the loss function converges, and the trained model is saved.
9. The multi-dictionary CSC classroom teaching behavior recognition method according to claim 1, characterized in that: The content displayed and saved in step S6 includes: the classification information of the identified teacher and student behaviors, the classification confidence level, and the position of the detection box.
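As a non-limiting illustration of the ISTA iteration recited in claim 4, the following Python sketch solves a small convolutional sparse coding problem in which the matrix multiplication is implemented as a convolution and its transpose as a correlation. The use of 1-D signals with circular boundary handling, the upper bound used for the Lipschitz constant, and all function names are assumptions made to keep the example short and self-contained; the atoms of all dictionaries are passed as one flat list.

```python
def soft_threshold(v, tau):
    """sign(v) * max(|v| - tau, 0)."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def conv(atom, coeff):
    """Circular convolution of one atom with one coefficient map."""
    n = len(coeff)
    return [sum(atom[t] * coeff[(i - t) % n] for t in range(len(atom)))
            for i in range(n)]


def corr(atom, resid):
    """Adjoint of conv (circular correlation); plays the role of the transpose."""
    n = len(resid)
    return [sum(atom[t] * resid[(i + t) % n] for t in range(len(atom)))
            for i in range(n)]


def reconstruct(atoms, coeffs):
    """Sum over all atoms of atom * coefficient map."""
    out = [0.0] * len(coeffs[0])
    for atom, z in zip(atoms, coeffs):
        for i, v in enumerate(conv(atom, z)):
            out[i] += v
    return out


def objective(x, atoms, coeffs, beta):
    """0.5 * reconstruction error + beta * L1 norm of the coefficients."""
    resid = [xi - yi for xi, yi in zip(x, reconstruct(atoms, coeffs))]
    l1 = sum(abs(v) for z in coeffs for v in z)
    return 0.5 * sum(r * r for r in resid) + beta * l1


def ista(x, atoms, beta, n_iter=50):
    """ISTA: z <- S_{beta/L}(z - grad f(z) / L), with grad f(z) = -D^T (x - D z)."""
    # Upper bound on the Lipschitz constant: sum of squared L1 norms of the atoms.
    lipschitz = sum(sum(abs(t) for t in atom) ** 2 for atom in atoms)
    coeffs = [[0.0] * len(x) for _ in atoms]
    for _ in range(n_iter):
        resid = [xi - yi for xi, yi in zip(x, reconstruct(atoms, coeffs))]
        coeffs = [
            [soft_threshold(zi + gi / lipschitz, beta / lipschitz)
             for zi, gi in zip(z, corr(atom, resid))]
            for atom, z in zip(atoms, coeffs)
        ]
    return coeffs
```

Each iteration needs one pass of `conv` (the synthesis step) and one pass of `corr` (the transpose step), which is the pattern that allows the iteration to be unrolled into convolution and deconvolution layers of a network.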