Face expression recognition method and device based on multi-teacher network, equipment and medium
By constructing a facial expression recognition model through a multi-teacher network, the resource consumption problem caused by the complexity of existing models is solved, and real-time and efficient expression recognition is achieved on smart devices.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHENZHEN POLYTECHNIC
- Filing Date
- 2023-09-07
- Publication Date
- 2026-06-23
AI Technical Summary
Existing facial expression recognition models are complex in structure and consume a lot of computing resources, which makes it impossible for smart devices to recognize customer expressions in real time.
A facial expression recognition model is constructed using a multi-teacher network. Through training with several teacher networks and an ensemble network, combined with diverse knowledge integration components and relative confidence calculation, a simple student model is trained to recognize facial expressions.
It enables real-time recognition of user facial expressions on smart devices without reducing, and may even improve, the recognition accuracy.
Smart Images

Figure CN117152822B_ABST
Description
Technical Field
[0001] This invention relates to the field of machine learning technology, and in particular to a method, apparatus, device, and medium for facial expression recognition based on a multi-teacher network.
Background Technology
[0002] Thanks to the development of artificial intelligence technology, many intelligent devices have emerged to help businesses and organizations serve their customers, such as smart teller machines in banks and smart counter robots in hotels. These devices enable businesses and organizations to reduce labor costs and minimize business processing errors. However, current research and development of intelligent devices focuses more on improving performance in business processing, neglecting the essential service-oriented nature of these devices and failing to respond promptly and accurately by observing customer expressions.
[0003] Existing facial expression recognition models are not well suited to these smart devices. They have complex structures and require substantial computing resources to perform recognition. This prevents the smart devices from completing other business functions, and, because of the long processing time, they cannot recognize the customer's expression in real time.
Summary of the Invention
[0004] This invention provides a method, apparatus, device, and medium for facial expression recognition based on a multi-teacher network. It can effectively solve the problems in the prior art where facial expression recognition models are too large to be applied to a large number of smart devices, and where real-time facial recognition based on acquired facial images is not possible due to long computation time.
[0005] An embodiment of the present invention provides a facial expression recognition method based on a multi-teacher network, comprising:
[0006] Acquire the image of the face to be identified;
[0007] The image of the face to be identified is input into the facial expression recognition model so that the facial expression recognition model can identify the expression corresponding to the image of the face to be identified;
[0008] The construction of the facial expression recognition model includes:
[0009] Acquire several facial images with different expressions, and attach an expression label to each facial image to form several facial image learning samples. The several facial image learning samples are divided into several categories according to the expression labels.
[0010] Construct several teacher networks with different network structures, and combine these teacher networks into an integrated network;
[0011] Using the facial images as input and expression tags as output, the teacher networks and the ensemble network are trained.
[0012] When the training of each teacher network and the ensemble network is completed, the relative logit score and relative confidence of each network are calculated for each face image learning sample.
[0013] The relative logit score characterizes the ability of each network to predict the correct facial expression label for a learning sample; the relative confidence characterizes the relative accuracy of each network in predicting the facial expression label for each learning sample.
[0014] An initial student model is constructed and trained using the aforementioned face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples. The trained student model is then used as the facial expression recognition model.
[0015] Furthermore, when training the individual teacher networks and the integrated network, a pre-defined, diverse knowledge integration component is used to train the individual teacher networks and the integrated network.
[0016] The loss function of the diversified knowledge integration component includes:
[0017] $\Lambda = L_D + L_A$
[0018] $L_D = \alpha_1 \Delta - \alpha_2 \delta$
[0019]
[0020] $L_D$ is the loss function used to maximize the diversity of the output features of each network; $\Delta$ is the difference in direction of the face image features output by any two networks; $\delta$ characterizes the size diversity of all teacher networks on the learning samples of each category; and $\alpha_1$ and $\alpha_2$ are weights used to balance the two losses on a learning sample.
[0021] $L_A$ is the loss function used to maximize the accuracy of the output features of each network; $L_C(\Theta)$ is the classification loss of the ensemble network; $L_C(\theta_i)$ is the classification loss of the i-th teacher network; and M is the number of teacher networks.
[0022] Furthermore, the calculation of the relative logit score of each network on each face image learning sample includes:
[0023] Using a pre-defined computational component, the relative logit score of each network on each learning sample is calculated with the following formula, based on the logit values of each teacher network and the ensemble network on each learning sample.
[0024]
[0025] Here, $l_{i,j}$ is the relative logit score of the i-th network on the j-th sample; $z_{i,j,k'}$ is the logit score output by the i-th network on the j-th sample for the correct facial expression label $k'$; and the maximum term in the formula is the largest logit score over the facial expression label categories output by all networks on all training samples. A logit score is the score a network assigns to each facial expression label category on a training sample.
[0026] Furthermore, the calculation of the relative confidence of each network under each face image learning sample includes:
[0027] The computational component calculates the confidence level of each network on each sample, and the relative confidence level of each network on each sample is calculated using the following formula.
[0028]
[0029] Here, $r_{j,i}$ is the relative confidence of the i-th network on the j-th sample; $c_{i,j}$ is the confidence value of the i-th network on the j-th sample; $\tau$ is the scaling factor used to adjust the weight of the relative confidence of the teacher networks; M is the number of teacher networks; and M+1 is the total number of teacher networks plus the ensemble network.
[0030] Furthermore, the step of training the initial student model using the aforementioned face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples includes:
[0031] The loss function of the student model for each learning sample is calculated using the following formulas.
[0032] $L_s(\theta_i;\theta_s;x_j) = (1 - \beta \times l_{i,j})\,L_C(\theta_s) + \beta \times l_{i,j} \times L_K(\theta_i;\theta_s)$,
[0033]
[0034] Here, $L_s(\theta_i;\theta_s;x_j)$ represents the loss incurred when the student model learns the face image features of the j-th sample under the guidance of the i-th teacher network; $\theta_i$ represents the parameters of the teacher network or ensemble network; $\theta_s$ represents the parameters of the student model; $\beta$ is a hyperparameter; $l_{i,j}$ is the relative logit score of the i-th network on the j-th sample; $L_C$ is the classification loss of the student model guided by the i-th teacher network; $L_K$ is the distillation loss of the student model guided by the i-th teacher network; and $L_j$ is the loss function of the student model on the j-th learning sample.
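As a rough illustration of the per-teacher student loss in [0032], the following Python sketch evaluates $(1 - \beta l_{i,j})L_C + \beta l_{i,j}L_K$ for scalar loss values. The function name `student_loss`, the default value of `beta`, and the range check are assumptions made for illustration only; the aggregation over teachers in [0033] is not reproduced in the text and is therefore not sketched.

```python
def student_loss(rls: float, cls_loss: float, distill_loss: float,
                 beta: float = 0.5) -> float:
    """Per-teacher, per-sample student loss from [0032].

    rls          -- relative logit score l_ij of teacher i on sample j
    cls_loss     -- classification loss L_C of the student
    distill_loss -- distillation loss L_K between teacher i and the student
    beta         -- hyperparameter (default chosen here only for illustration)
    """
    weight = beta * rls
    # Assumed sanity check: the mixing weight should behave like a proportion.
    if not 0.0 <= weight <= 1.0:
        raise ValueError("beta * rls must lie in [0, 1]")
    return (1.0 - weight) * cls_loss + weight * distill_loss


# A teacher with a high relative logit score shifts weight toward distillation.
print(student_loss(rls=0.8, cls_loss=1.2, distill_loss=0.4))
print(student_loss(rls=0.1, cls_loss=1.2, distill_loss=0.4))
```

With a relative logit score of 0, the expression reduces to the plain classification loss, so a teacher that is unreliable on a sample contributes little to what the student learns from that sample.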
[0035] As an improvement to the above solution, another embodiment of the present invention provides a facial expression recognition device based on a multi-teacher network, comprising:
[0036] The face image acquisition module is used to acquire the face image to be identified;
[0037] A facial expression recognition module is used to input the image of the face to be recognized into the facial expression recognition model so that the facial expression recognition model can recognize the expression corresponding to the image of the face to be recognized.
[0038] A facial expression recognition model construction module is used to construct the facial expression recognition model.
[0039] Furthermore, the facial expression recognition model construction module includes:
[0040] The learning sample construction submodule is used to acquire several facial images with different expressions and attach an expression label to each facial image to form several facial image learning samples. The several facial image learning samples are divided into several categories according to the expression labels.
[0041] The teacher network construction submodule is used to construct several teacher networks with different network structures and to integrate these teacher networks into an integrated network.
[0042] The teacher network training submodule is used to train each teacher network and the ensemble network by taking the face image as input and the expression label as output.
[0043] The relative confidence calculation submodule calculates the relative logit score and relative confidence of each network for each face image learning sample after the training of each teacher network and the ensemble network is completed; wherein the relative logit score characterizes the ability of each network to predict the correct expression label of a learning sample, and the relative confidence characterizes the relative accuracy of each network in predicting the expression label of each learning sample.
[0044] The student model construction submodule is used to construct an initial student model and to train it using the aforementioned face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples. The trained student model is then used as the facial expression recognition model.
[0045] Another embodiment of the present invention provides a device, which is a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements any of the above-described facial expression recognition methods based on a multi-teacher network.
[0046] Another embodiment of the present invention provides a medium, which is a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute any of the above-described facial expression recognition methods based on a multi-teacher network.
[0047] The following benefits can be obtained by implementing the present invention:
[0048] This invention provides a method, apparatus, device, and medium for facial expression recognition based on a multi-teacher network. The method enables the trained facial expression recognition model to be applied in current service-oriented intelligent devices. This invention trains several pre-constructed teacher networks, and an ensemble network composed of these teacher networks, using several facial image learning samples, and calculates the relative confidence of each network on each facial image learning sample. The facial image learning samples and the relative confidence of each network on each learning sample are then used to train a pre-constructed, structurally simple student model, and the trained student model is used as the facial expression recognition model. The result is a facial expression recognition model with a simple structure whose recognition accuracy matches or even exceeds that of existing complex facial expression recognition models. Furthermore, the facial expression recognition model trained by this invention also has higher recognition efficiency, enabling real-time recognition of user expressions.
Attached Figure Description
[0049] Figure 1 is a flowchart illustrating a facial expression recognition method based on a multi-teacher network, provided by an embodiment of the present invention.
[0050] Figure 2 is a schematic diagram of the structure of a facial expression recognition device based on a multi-teacher network, provided in an embodiment of the present invention.
[0051] Figure 3 is a schematic diagram of the operation of a diversified knowledge integration component provided in an embodiment of the present invention.
[0052] Figure 4 is a schematic diagram of the operation of a relative confidence calculation component provided in an embodiment of the present invention.
[0053] Figure 5 is a schematic diagram illustrating the construction process of a facial expression recognition model provided in an embodiment of the present invention.
Detailed Implementation
[0054] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0055] To better illustrate the effects achieved by this invention, the following explanations are provided for some of the terms involved in this invention:
[0056] (1) Ensemble Learning: Ensemble learning is a machine learning technique. In supervised learning algorithms, ensemble learning combines several weakly supervised networks to obtain a better and more comprehensive strongly supervised network, which is a machine learning scheme used to improve the generalization ability of a network. One existing ensemble learning algorithm trains each weakly supervised network on all learning samples; negative correlation learning is a representative of this type of algorithm. It incorporates the output of the penalized classifier as a metric into the loss function of the multi-teacher network and encourages the classifier to learn expertise through cooperation and competition. Therefore, it can balance the accuracy and diversity of the multi-teacher network by setting the weights between the classification loss and the penalty.
[0057] (2) Knowledge Distillation: Knowledge distillation is a model compression method designed to build efficient networks for resource-constrained devices. The knowledge distillation method includes teacher networks and student networks. Its core idea is to enable a simple student network to learn knowledge from a complex teacher network, thus allowing the student network to adapt to more real-time application scenarios. The two-stage knowledge distillation method involves: first, generating diverse multi-teacher networks through ensemble learning; and second, the student network learning different feature knowledge from the multi-teacher networks.
[0058] (3) Multi-teacher network: In this invention, a multi-teacher network refers to an integrated network and all teacher networks that constitute the integrated network; an integrated network refers to the collection of all teacher networks.
[0059] (4) Implicit knowledge: refers to the true inter-class probabilities of samples hidden in the teacher network. It defines a measure of inter-class similarity, which makes it easier for the network to classify samples accurately.
[0060] Referring to Figure 1, which is a flowchart illustrating a facial expression recognition method based on a multi-teacher network according to an embodiment of the present invention, the method includes:
[0061] S101. Obtain the image of the face to be identified;
[0062] In a preferred embodiment of the present invention, a facial image can be obtained by a device equipped with the facial expression recognition model, or a facial image can be input into the facial expression recognition model manually.
[0063] S102. Input the face image to be identified into the facial expression recognition model so that the facial expression recognition model can identify the expression corresponding to the face image to be identified;
[0064] In a preferred embodiment of the present invention, the facial expression recognition model can identify the facial expression corresponding to the input facial image, such as: happy, sad, angry, etc.
[0065] Referring to Figure 5, the construction of the facial expression recognition model includes:
[0066] S201. Obtain several facial images with different expressions, and attach an expression label to each facial image to form several facial image learning samples, wherein the several facial image learning samples are divided into several categories according to the expression labels;
[0067] S202. Construct several teacher networks with different network structures, and combine the several teacher networks into an integrated network;
[0068] S203. Using the face image as input and the expression label as output, train each teacher network and the ensemble network;
[0069] S204. When the training of each teacher network and the ensemble network is completed, calculate the relative logit score and relative confidence of each network on each face image learning sample.
[0070] S205. Construct an initial student model, and train it using the aforementioned face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples. The trained student model is then used as the facial expression recognition model.
[0071] In step S201, in a preferred embodiment of the present invention, several facial images with different expressions are collected via the Internet, and corresponding expression tags, such as happy, sad, angry, etc., are manually attached to the facial images. Several categories of facial image learning samples are then formed based on these expression tags. It should be noted that, to ensure the generalization ability of the network trained using these facial image learning samples, the collected facial images with different expressions also include facial images of different ages, genders, and ethnicities.
[0072] In step S202, in a preferred embodiment of the present invention, the constructed teacher networks are VGG-16 network model, ResNet-50 network model, and DenseNet-121 network model, respectively. The ensemble network is composed of the teacher networks, that is, each teacher network serves as a sub-classifier in the ensemble network.
[0073] For step S203, in a preferred embodiment of the present invention, the face image learning samples are used one by one, with the face images as input and the expression labels as output, to train each teacher network and the ensemble network. It should be noted that the resolution of the input face images is 224×224.
[0074] Furthermore, when training the individual teacher networks and the integrated network, a pre-defined, diverse knowledge integration component is used to train the individual teacher networks and the integrated network.
[0075] The loss function of the diversified knowledge integration component includes:
[0076] $\Lambda = L_D + L_A$
[0077] $L_D = \alpha_1 \Delta - \alpha_2 \delta$
[0078]
[0079] $L_D$ is the loss function used to maximize the diversity of the output features of each network; $\Delta$ is the difference in direction of the face image features output by any two networks; $\delta$ characterizes the size diversity of all teacher networks on the learning samples of each category; and $\alpha_1$ and $\alpha_2$ are weights used to balance the two losses on a learning sample.
[0080] $L_A$ is the loss function used to maximize the accuracy of the output features of each network; $L_C(\Theta)$ is the classification loss of the ensemble network; $L_C(\theta_i)$ is the classification loss of the i-th teacher network; and M is the number of teacher networks.
[0081] The Diversified Knowledge Ensemble (DKE) component pre-defined in this invention (see Figure 3) can be understood as an independent computational module or component, designed to penalize the similarity in size and direction of the feature vectors generated by different networks in a multi-teacher network, thereby increasing the diversity of output logits between and within the networks. Its loss function comprises $L_D$, which maximizes the diversity of the output features of each network, and $L_A$, which maximizes the accuracy of the output features of each network.
[0082] The loss function $L_D$ is determined through the following steps:
[0083] S301. When training the individual teacher networks and the ensemble network, determine the logits output by each teacher network for each face image learning sample:
[0084] $t_i = f_i(x;\theta_i) = (t_{i,1}, \dots, t_{i,k}, \dots, t_{i,K}) \in \mathbb{R}^K$,
[0085] where $t_i$ represents the logits output by the i-th teacher network for the input face image learning sample x, and K is the number of expression label categories of the face image learning samples.
[0086] S302. Using the following formulas, negative correlation learning is introduced to maximize the distance between the logits output by the teacher networks;
[0087]
[0088]
[0089] Here, $t_j$ represents the logits output by the j-th teacher network for the input face image sample x; $t_{i,k}$ and $t_{j,k}$ are the k-th elements of $t_i$ and $t_j$, respectively. Negative correlation learning measures the correlation and proximity between two vectors; in essence, it maximizes the difference in direction between each pair of feature vectors $t_i$ and $t_j$.
[0090] S303. Calculate the correlation $\psi_{i,j}$ between $t_i$ and $t_j$ using the following formula;
[0091]
[0092] The correlation between $t_i$ and $t_j$ is determined by calculating the cosine of the angle between the two feature vectors. It should be noted that the smaller $\psi_{i,j}$ is, the smaller the correlation between $t_i$ and $t_j$, that is, the greater the difference between the i-th and j-th teacher networks.
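The cosine-based correlation described in step S303 can be sketched in Python as follows. The formula in [0091] is not reproduced in the text, so this sketch assumes the standard cosine of the angle between two logit vectors; the function name `cosine_correlation` is hypothetical.

```python
import math
from typing import Sequence


def cosine_correlation(t_i: Sequence[float], t_j: Sequence[float]) -> float:
    """Cosine of the angle between two teachers' logit vectors (assumed form of psi_ij)."""
    if len(t_i) != len(t_j):
        raise ValueError("logit vectors must have the same number of categories")
    dot = sum(a * b for a, b in zip(t_i, t_j))
    norm_i = math.sqrt(sum(a * a for a in t_i))
    norm_j = math.sqrt(sum(b * b for b in t_j))
    if norm_i == 0.0 or norm_j == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_i * norm_j)


# Two teachers that rank the K = 3 expression categories differently
# produce a smaller correlation than two teachers that agree.
agreeing = cosine_correlation([2.0, 0.5, -1.0], [4.0, 1.0, -2.0])
differing = cosine_correlation([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
print(agreeing, differing)
```

A smaller value indicates that the two teacher networks point in more different directions in logit space, which is the property the diversity loss rewards.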
[0093] S304. Using the method in step S303, calculate the directional difference in the output feature vectors of any two teacher networks, denoted as:
[0094]
[0095] It should be noted that,
[0096] S305. Using the following formula, penalize the size of the output feature vectors by means of the different probability distribution vectors that each teacher network learns on each face image sample;
[0097]
[0098] Here, T is the temperature factor. The probability distribution vector can also be referred to as the dark knowledge learned by each teacher network on each face image learning sample.
[0099] Experimental results show that, up to step S304, the feature vectors output by each teacher network for the face image learning samples are penalized only in direction, resulting in limited diversity of the generated multi-teacher networks. Since feature vectors possess both direction and magnitude, a penalty could also be imposed on the magnitude of the feature vectors output by each teacher network. However, a direct penalty ultimately leads to gradient explosion. This problem arises because the logits output by the networks are unbounded: the directional difference between the feature vectors $t_i$ and $t_j$ falls within (-1, 1), whereas the size diversity falls within (-∞, +∞). Generally, when both size diversity and accuracy are optimization objectives for multi-teacher networks, the loss function only considers the former and ignores the latter. This is because the former can quickly reduce the loss function, while the latter's contribution to the loss function is negligible.
[0100] To ensure the size diversity and accuracy of the final generated multi-teacher network, this invention penalizes the size of the feature vectors by means of the different probability distribution vectors that each teacher network learns on each face image sample, so that a sufficiently diverse multi-teacher network can be generated while ensuring accuracy.
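To illustrate why working with probability distribution vectors bounds the size penalty, the following Python sketch converts unbounded logits into probabilities with a temperature factor T. The formula in [0097] is not reproduced in the text; this sketch assumes the temperature-scaled softmax commonly used in knowledge distillation, and the function name `soften` and the default temperature are hypothetical.

```python
import math
from typing import List, Sequence


def soften(logits: Sequence[float], temperature: float = 4.0) -> List[float]:
    """Temperature-scaled softmax (assumed form of the S305 formula)."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [t / temperature for t in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Even very large logits map to probabilities in [0, 1],
# so any statistic computed on them stays bounded.
print(soften([120.0, 15.0, -300.0], temperature=4.0))
print(soften([2.0, 1.0, 0.1], temperature=1.0))
```

A higher temperature flattens the distribution, which exposes more of the inter-class similarity information that the text calls dark knowledge.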
[0101] S306. Collect in a matrix S the different probability distribution vectors that all teacher networks learn on each face image sample;
[0102]
[0103] Here, $a_1 = [p_{1,1}, \dots, p_{M,1}]$, …, $a_k = [p_{1,k}, \dots, p_{M,k}]$, …, $a_K = [p_{1,K}, \dots, p_{M,K}]$; $a_k$ represents the set of probabilities output by all teacher networks for the k-th expression label category, where $k \in N_M$.
[0104] S307. Calculate the variance of matrix S for each expression label category using the following formulas;
[0105]
[0106]
[0107] Here, the mean term denotes the average value of $a_k$, and $\delta_k$ denotes the variance of the probabilities output by all teacher networks for the k-th facial expression category. Variance represents the degree of fluctuation in the data: the larger the variance $\delta_k$, the greater the diversity of S.
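The per-category variance of step S307 can be sketched in Python as below. The formulas in [0105] and [0106] are not reproduced in the text, so the population variance (dividing by M) is an assumption, as are the function name `category_variances` and the layout `probs[i][k]` (teacher i, category k).

```python
from statistics import pvariance
from typing import List, Sequence


def category_variances(probs: Sequence[Sequence[float]]) -> List[float]:
    """Variance across the M teachers of the probability for each category k.

    probs[i][k] is the probability teacher i assigns to category k, so
    column k of this table plays the role of a_k in matrix S.
    """
    num_categories = len(probs[0])
    if any(len(row) != num_categories for row in probs):
        raise ValueError("every teacher must output the same number of categories")
    # Assumption: population variance (divide by M), not the sample variance.
    return [pvariance([row[k] for row in probs]) for k in range(num_categories)]


# Two teachers, three expression categories.
teacher_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
]
print(category_variances(teacher_probs))
```

Teachers that disagree on a category give that category a large variance, while categories on which all teachers agree contribute a variance of zero.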
[0108] S308. Calculate the size diversity of all teacher networks across all categories of facial expression learning samples using the following formula;
[0109]
[0110] S309. Combining Δ and δ from the above steps, the loss function can be obtained.
[0111] $L_D = \alpha_1 \Delta - \alpha_2 \delta$
[0112] Here, α1 and α2 are weights used to balance the two losses in a single sample. This invention aims to minimize Δ and maximize δ in the formula, thereby increasing the diversity of the multi-teacher network as much as possible.
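As a small worked illustration of step S309, the Python sketch below evaluates $L_D = \alpha_1 \Delta - \alpha_2 \delta$ for two hypothetical sets of teachers. The function name `diversity_loss`, the default weights, and the numeric inputs are assumptions for illustration; Δ and δ are taken as already computed by the preceding steps.

```python
def diversity_loss(direction_term: float, size_diversity: float,
                   alpha1: float = 1.0, alpha2: float = 1.0) -> float:
    """Return L_D = alpha1 * Delta - alpha2 * delta."""
    return alpha1 * direction_term - alpha2 * size_diversity


# Minimizing L_D pushes the directional term down and the size diversity up:
# the more diverse set of teachers receives the lower loss.
similar_teachers = diversity_loss(direction_term=0.9, size_diversity=0.01)
diverse_teachers = diversity_loss(direction_term=0.2, size_diversity=0.05)
print(similar_teachers, diverse_teachers)
```

The weights control how strongly each kind of diversity is rewarded relative to the other.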
[0113] Furthermore, the loss function $L_A$ is determined through the following steps:
[0114] S310. Improve the classification accuracy of all teacher networks using the following formula:
[0115]
[0116] Here, $y_{i,k}$ and its predicted counterpart represent the true expression label and the predicted expression label, respectively, of the i-th teacher network for the face image learning sample on the k-th category; $y_{i,k}$ is also the k-th value in the one-hot encoding of the expression label $y_i$.
[0117] S311. Calculate the predicted facial expression labels of the integrated network using the following formula;
[0118]
[0119] Here, the left-hand term is the logit output by the ensemble network for the k-th category, that is, the average of the logits output by the individual teacher networks on the face image learning sample for the k-th category.
[0120] In knowledge distillation, teacher networks typically learn different probability distribution vectors (dark knowledge) on the training dataset through pre-training. This knowledge is then distilled into a student network using the training dataset. However, teacher networks with robust learning capabilities can often classify familiar samples accurately, so the various voting mechanisms in knowledge distillation effectively reduce to an average ensemble. Therefore, this invention uses the average output of all teacher networks as the prediction result of the ensemble network to improve training efficiency.
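The averaging described above, in which the ensemble prediction is the mean of the teacher outputs, can be sketched in Python as follows. The function name `ensemble_logits` and the example values are hypothetical.

```python
from typing import List, Sequence


def ensemble_logits(teacher_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average the logits of the M teacher networks category by category."""
    num_teachers = len(teacher_logits)
    num_categories = len(teacher_logits[0])
    if any(len(t) != num_categories for t in teacher_logits):
        raise ValueError("every teacher must output the same number of categories")
    return [
        sum(t[k] for t in teacher_logits) / num_teachers
        for k in range(num_categories)
    ]


# Three teachers (for example VGG-16, ResNet-50, DenseNet-121), three categories.
logits = [
    [2.0, 0.5, -1.0],
    [1.0, 1.5, -0.5],
    [3.0, 1.0, -1.5],
]
print(ensemble_logits(logits))
```

Because the ensemble output is a plain average, it needs no parameters of its own and can be trained jointly with the individual teachers.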
[0121] S312. Using the above formula, the classification loss of the ensemble network can be determined as follows:
[0122]
[0123] Here, $y_k$ is the k-th value in the one-hot encoding of the true expression label $y_i$, and the predicted term is the k-th predicted value of the ensemble network. Both S310 and S312 use the same face image learning samples, i.e., $y_{i,k} = y_k$.
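The classification losses in steps S310 and S312 compare a one-hot label with predicted values. Since the formulas in [0115] and [0122] are not reproduced in the text, the Python sketch below assumes the standard cross-entropy loss; the function name `cross_entropy` and the clipping constant `eps` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(one_hot: Sequence[float], predicted: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Cross-entropy between a one-hot label and predicted probabilities (assumed loss)."""
    if len(one_hot) != len(predicted):
        raise ValueError("label and prediction must have the same length")
    # Clip probabilities away from zero so the logarithm stays finite.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, predicted))


# True label is the second of three expression categories.
label = [0.0, 1.0, 0.0]
print(cross_entropy(label, [0.2, 0.5, 0.3]))
print(cross_entropy(label, [0.05, 0.9, 0.05]))
```

Only the entry for the true category contributes, so the same label vector can be reused for every teacher network and for the ensemble network.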
[0124] S313. Combining the above process, the loss function can be determined:
[0125]
[0126] The method disclosed in this invention for maximizing the accuracy of output features of each network is applicable not only to ensemble networks but also to each independently trained individual teacher network. The ensemble network is a high-performing individual network, while each teacher network is a relatively low-performing individual network. This invention enables simultaneous training of both networks, resulting in excellent performance for each.
[0127] For step S204, in a preferred embodiment of the present invention, a preset computing component calculates the relative logit score of each network on each learning sample using the following formula, based on the logit values of each teacher network and the ensemble network on each learning sample.
[0128]
[0129] Here, $l_{i,j}$ is the relative logit score of the i-th network on the j-th sample; $z_{i,j,k'}$ is the logit score of the i-th network on the j-th sample for the correct expression label $k'$; and the maximum term in the formula is the largest logit score over the facial expression label categories output by all networks on all training samples. A logit score is the score a network assigns to each facial expression label category on a training sample.
[0130] Furthermore, the computational component calculates the confidence level of each network on each sample, and the relative confidence level of each network on each sample is calculated using the following formula:
[0131]
[0132] Here, $r_{j,i}$ is the relative confidence of the i-th network on the j-th sample; $c_{i,j}$ is the confidence value of the i-th network on the j-th sample; $\tau$ is the scaling factor used to adjust the weight of the relative confidence of the teacher networks; M is the number of teacher networks; and M+1 is the total number of teacher networks plus the ensemble network.
[0133] The pre-defined computational component here is the Relative Confidence Computing (RCC) component (see Figure 4). It can be understood as an independent computational module or component used to calculate the relative confidence in a multi-teacher network, guiding the student network to learn the optimal dark knowledge in the multi-teacher network. This component contains two main algorithms: Relative Logit Score (RLS) and Relative Confidence (RC).
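The relative confidence formula in [0131] is not reproduced in the text. As one plausible reading only, the Python sketch below normalizes the confidence values of the M+1 networks on a single sample with a softmax scaled by τ. The softmax form, the function name `relative_confidence`, and the example values are all assumptions rather than the invention's stated formula.

```python
import math
from typing import List, Sequence


def relative_confidence(confidences: Sequence[float], tau: float = 1.0) -> List[float]:
    """Normalize the M+1 confidence values of one sample into weights that sum to 1.

    Assumption: a softmax with scaling factor tau; the source does not confirm this form.
    """
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    scaled = [c / tau for c in confidences]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Confidence of three teachers plus the ensemble network (M + 1 = 4) on one sample.
sample_confidences = [0.90, 0.60, 0.80, 0.95]
print(relative_confidence(sample_confidences, tau=1.0))
print(relative_confidence(sample_confidences, tau=0.1))
```

Under this assumption, a smaller τ concentrates the weight on the most confident network, while a larger τ spreads it more evenly across all M+1 networks.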
[0134] It should be noted that, in existing technologies, each face image sample has a different impact on the student network, which can be summarized as follows:
[0135] $L_S = (1-\lambda)\,L_C(\theta_s) + \lambda\,L_K(\theta_t;\theta_s)$
[0136] Here, $\theta_t$ and $\theta_s$ are the parameters of the teacher and student networks, respectively; $L_C$ is the classification loss; $L_K$ is the distillation loss; and $\lambda$ is the weight used to balance the two losses.
[0137] λ is a hyperparameter, and its value varies depending on the sample. Accurately calculating the λ value for a given sample is crucial to ensuring the student network learns useful knowledge. In existing methods, the λ value for a sample is typically determined by the true probability of the teacher network's dark knowledge output for that sample. However, the inter-class relationships of dark knowledge are relatively smooth, resulting in small differences in λ values across different samples. Therefore, current techniques struggle to distinguish the accuracy of the teacher network's dark knowledge output for different samples. Most importantly, existing research relies on the prediction information of a single teacher network for a given sample. However, single teacher networks, especially well-trained ones, are prone to overconfidence in the training dataset. This makes it impossible to evaluate the accuracy of the dark knowledge generated by individual networks, as their predictions for familiar samples are often correct.
[0138] To address the aforementioned issues, this invention employs a pre-defined computational component—a relative confidence calculation component—to calculate the relative logit score (RLS) based on a multi-teacher network during knowledge distillation. The core of this calculation is to evaluate the λ value of different learning samples.
[0139] The relative logit score is calculated as follows:
[0140] S401. The value of λ is kept within the numerical range (0,1) by using the following formula;
[0141]
[0142] $\{t_{i,B,k'}\} = \bigcup_{b \in N_B} \{t_{1,b,k'}, \ldots, t_{M+1,b,k'}\}$,
[0143] Where $B$ is the number of face image samples; $t_{i,b,k'}$ is the logit score output by the i-th teacher network on the b-th face image sample for the correct expression label $k'$; $\xi$ is a small positive number, such as $10^{-6}$, used to prevent a logit score from equaling 0; and $\{t_{i,B,k'}\}$ is the set of logit scores on the correct expression labels generated by all teacher networks and the ensemble network on the face image samples.
[0144] A well-trained and stable teacher network or ensemble network can often correctly predict all samples in the face image sample set, and its logits tend to be overconfident after the softmax function. Fortunately, the logit outputs of each teacher network and the ensemble network on the same learning sample can be distinguished by comparison. Therefore, this invention calculates the λ value of a teacher network from its logit score on the correct expression label. For familiar samples, the logit scores of the trained teacher networks and ensemble network on the correct expression labels are usually positive, but this is not always the case: even for familiar face image learning samples, the logit score of a network on the correct expression label may be negative.
[0145] To address this issue, step S401 is employed: when a teacher network or the ensemble network outputs a negative logit score on the correct expression label of a sample, the logits $t_{i,k}$ of each teacher network are shifted to ensure that all teacher networks and the ensemble network have positive logit scores on the correct expression labels.
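The shifting formula itself is not reproduced in this text, so the following Python sketch shows only one plausible reading of step S401: if the smallest correct-label logit over all networks and samples is not positive, every logit is raised by the same offset so that this minimum becomes ξ. The offset rule, the nested-list layout `logits[i][b][k]`, and the function name are assumptions for illustration.

```python
def shift_logits(logits, labels, xi=1e-6):
    """Shift all logits so every correct-label logit is positive.

    logits[i][b][k]: logit of network i on sample b for class k
                     (teacher networks followed by the ensemble network).
    labels[b]:       index of the correct expression label of sample b.
    Assumed rule: add (xi - minimum correct-label logit) to every logit
    whenever that minimum is not already positive.
    """
    correct = [net[b][labels[b]] for net in logits for b in range(len(labels))]
    lowest = min(correct)
    if lowest > 0:
        return logits  # nothing to do: all correct-label logits are positive
    offset = xi - lowest
    return [[[z + offset for z in sample] for sample in net] for net in logits]
```

Because the same offset is added to every logit, the differences between class logits within a sample, and hence the softmax outputs, are unchanged.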
[0146] S402. Calculate the relative logit score of each teacher network on each face image learning sample using the following formula.
[0147]
[0148]
[0149] Where $l_{i,j}$ is the relative logit score of the i-th teacher network on the j-th face image learning sample, and $z_{i,j,k'}$ is the logit score output by the i-th network on the j-th sample for the correct expression label $k'$. The remaining term in the formula is the maximum logit score among the logit scores of every expression label output by all networks over all learning samples, where "all networks" refers to all teacher networks and the ensemble network.
[0150] This invention calculates the relative logit score of each teacher network on each face image learning sample, which makes it possible to distinguish how accurate the teacher network's dark knowledge is for each learning sample. Subsequently, when training the student network, the relative logit score can be used as the λ value to dynamically calibrate the ratio of classification loss to distillation loss for a student network guided by a single teacher network or the ensemble network, ensuring that the student network learns useful knowledge from each learning sample. The underlying principle is that when the teacher network predicts a face image learning sample accurately, the student network should lean toward learning dark knowledge from it, whereas when the teacher's prediction on that sample is poor, the student network should lean toward supervised learning from the label.
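The formula for step S402 is not reproduced in this text. As a hedged illustration, the sketch below assumes the relative logit score is the correct-label logit divided by the maximum logit over all networks, samples, and labels, which keeps the score in (0, 1] once the correct-label logits have been made positive. The ratio form and the function name are assumptions, not the formula of the invention.

```python
def relative_logit_scores(logits, labels):
    """Return scores[i][j] for network i on sample j.

    logits[i][j][k]: logit of network i on sample j for class k, assumed to
                     be already shifted so correct-label logits are positive.
    labels[j]:       index of the correct expression label of sample j.
    Assumed form: correct-label logit / global maximum logit.
    """
    z_max = max(z for net in logits for sample in net for z in sample)
    return [
        [net[j][labels[j]] / z_max for j in range(len(labels))]
        for net in logits
    ]
```

A network that is strongly right on a sample gets a score near 1 and so pushes the student toward distillation on that sample, while a weak score pushes the student toward the label.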
[0151] It should be noted that the relative logit score described above only covers the case of a single teacher network guiding a student network to learn optimal dark knowledge from the learning samples. When a multi-teacher network (i.e., all teacher networks and the ensemble network) guides the training of the student network, it is also necessary to evaluate the confidence score of each network on each learning sample. The confidence score is used to assess how accurately a teacher network predicts the class label of a sample. A high relative logit score for a single teacher network on a sample does not necessarily mean a high confidence score; therefore, the confidence score of each teacher network on each learning sample should also be evaluated.
[0152] S403. Calculate the confidence of each teacher network and ensemble network on each face image learning sample using the following formula.
[0153]
[0154] Where $c_{i,j}$ is the confidence value of the i-th network on the j-th sample, and $z_{i,j,k'}$ is the logit score of the i-th network on the j-th sample for the correct expression label $k'$.
[0155] The confidence of each teacher network and the ensemble network on a face image training sample is calculated by subtracting the largest logit score among the other expression labels from the logit score of the correct expression label output by that network for the sample. This allows for a proper evaluation of the confidence of each network on each training sample, as individual networks typically converge on the training dataset. However, in multi-teacher networks, increasing the diversity of the networks may lead to convergence failure. Therefore, when a teacher network misclassifies a training sample, its confidence is set to 0.
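The confidence rule described above can be sketched in Python as a margin between the correct-label logit and the best competing logit, clipped to 0 on a misclassification. The function name and the treatment of an exact tie as a misclassification are assumptions for illustration.

```python
def confidence(sample_logits, label):
    """Confidence of one network on one sample.

    sample_logits: logits of the network for every expression label.
    label:         index of the correct expression label.
    Returns the correct-label logit minus the largest other logit,
    or 0.0 when the network does not rank the correct label first.
    """
    correct = sample_logits[label]
    best_other = max(z for k, z in enumerate(sample_logits) if k != label)
    margin = correct - best_other
    return margin if margin > 0 else 0.0
```

For example, `confidence([3.0, 1.0, 0.5], 0)` gives a margin of 2.0, while the same call with label 1 gives 0.0 because the network's top prediction is wrong.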
[0156] S404. Calculate the relative confidence of each teacher network and ensemble network on each face image learning sample using the following formula.
[0157]
[0158] $r_j = [r_{j,1}, \ldots, r_{j,i}, \ldots, r_{j,M+1}] \in \mathbb{R}^{M+1}$
[0159] Where $r_{j,i}$ is the relative confidence value of the i-th network on the j-th sample, and $r_j$ is the relative confidence vector of all teacher networks and the ensemble network on the j-th sample. $\tau$ is a scaling factor used to adjust the weights of the relative confidence of the teacher networks: when $\tau$ is large, the differences among the confidence ratios of the networks in $r_j$ are very small; when $\tau$ is very small, the differences in $r_j$ are very large. M represents the number of teacher networks, and M+1 represents the total number of teacher networks plus the ensemble network.
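The formula for step S404 is not reproduced in this text. The described behavior of τ (large τ flattens the vector, small τ sharpens it) matches a temperature-scaled softmax over the M+1 confidence values, so the sketch below assumes that form; the softmax choice and the function name are assumptions, not a statement of the invention's formula.

```python
import math

def relative_confidence(confidences, tau=1.0):
    """Return the assumed relative confidence vector r_j for one sample.

    confidences: confidence values c_{i,j} of the M teacher networks and
                 the ensemble network on sample j (length M + 1).
    tau:         scaling factor; larger values flatten the result.
    Assumed form: softmax of c_{i,j} / tau across the networks.
    """
    scaled = [c / tau for c in confidences]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Under this assumption the entries of the vector sum to 1, so they can serve directly as per-network weights on a sample.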
[0160] This invention is based on the principle that using dark knowledge generated by a multi-teacher network to train a student network is generally more effective than using dark knowledge from a single teacher network. It proposes a relative confidence calculation component to compare the prediction results of all teacher networks and the ensemble network for each learning sample. On the same learning sample, the best dark knowledge output from each teacher network is selected, enabling it to be used to train the student network. This avoids the negative impact of erroneous or redundant dark knowledge on the student network's performance, thereby improving the accuracy and robustness of the student network.
[0161] In step S205, in a preferred embodiment of the present invention, the constructed student model is a MobileNetV2 network model, which has a relatively simple structure, low computational load, and fast processing speed. This allows the trained student model to serve as the facial expression recognition model, enabling its application in most smart devices and achieving real-time detection.
[0162] Preferably, training the initial student model using the plurality of face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples includes:
[0163] The loss function of the student model for each learning sample is calculated using the following formula.
[0164] $L_s(\theta_i; \theta_s; x_j) = (1 - \beta \times l_{i,j})\, L_C(\theta_s) + \beta \times l_{i,j} \times L_K(\theta_i; \theta_s)$,
[0165]
[0166] Where $L_s(\theta_i; \theta_s; x_j)$ denotes the loss incurred when the student model, guided by the i-th teacher network, learns the face image features of the j-th sample; $\theta_i$ denotes the parameters of a teacher network or the ensemble network; $\theta_s$ denotes the parameters of the student model; $\beta$ is a hyperparameter; $l_{i,j}$ is the relative logit score of the i-th teacher network on the j-th sample; $L_C$ is the classification loss of the student model trained under the guidance of the i-th teacher network; $L_K$ is the distillation loss of the student model trained under the guidance of the i-th teacher network; and $L_j$ is the loss function of the student model on the j-th learning sample.
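A short Python sketch of the per-teacher loss in [0164] follows. The second formula, which defines the per-sample loss, is not reproduced in this text; the sketch assumes it is the sum of the per-teacher losses weighted by the relative confidence of each network. That aggregation rule and both function names are assumptions for illustration.

```python
def guided_loss(l_c, l_k, rel_logit, beta=1.0):
    """L_s = (1 - beta * l_ij) * L_C + beta * l_ij * L_K for one teacher."""
    weight = beta * rel_logit
    return (1.0 - weight) * l_c + weight * l_k

def sample_loss(l_c, l_k_per_net, rel_logits, rel_conf, beta=1.0):
    """Assumed per-sample loss: relative-confidence-weighted sum over networks.

    l_c:         classification loss of the student on this sample.
    l_k_per_net: distillation loss against each network (teachers + ensemble).
    rel_logits:  relative logit score of each network on this sample.
    rel_conf:    relative confidence of each network on this sample.
    """
    return sum(
        r * guided_loss(l_c, l_k, l, beta)
        for r, l_k, l in zip(rel_conf, l_k_per_net, rel_logits)
    )
```

In this reading, the relative logit score decides how much a given teacher is trusted over the label, and the relative confidence decides how much that teacher counts compared with the other networks.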
[0167] This invention is based on the concept that the importance and confidence of the dark knowledge generated by different teacher networks vary from one learning sample to another. It combines all teacher networks and the ensemble network to collaboratively guide the training of the student network. The relative confidence of each teacher network and the ensemble network is calculated from the difference between the logit score of the correct expression label output by that network on each sample and the largest logit score among the other expression labels. These relative confidences are then used to weight what the student network learns on each learning sample. This approach avoids the negative impact of erroneous or redundant dark knowledge on the performance of the student network, thereby improving its accuracy and robustness.
[0168] In summary, one embodiment of the present invention provides a facial expression recognition method based on a multi-teacher network, enabling the trained facial expression recognition model to be applied in current service-oriented intelligent devices. The present invention trains several pre-constructed teacher networks, and an ensemble network composed of these teacher networks, using several face image learning samples, and calculates the relative confidence of each network on each face image learning sample. A pre-constructed, structurally simple student model is then trained using the face image learning samples and the relative confidence of each network on each learning sample, and the trained student model is used as the facial expression recognition model. The resulting facial expression recognition model has a simple structure yet achieves recognition accuracy equal to or even higher than that of existing complex facial expression recognition models. Furthermore, the facial expression recognition model trained by the present invention has higher recognition efficiency, enabling real-time recognition of user expressions.
[0169] See Figure 2, which is a schematic diagram of a facial expression recognition device based on a multi-teacher network according to an embodiment of the present invention. The device comprises:
[0170] The face image acquisition module is used to acquire the face image to be identified;
[0171] A facial expression recognition module is used to input the image of the face to be recognized into the facial expression recognition model so that the facial expression recognition model can recognize the expression corresponding to the image of the face to be recognized.
[0172] A facial expression recognition model construction module is used to construct the facial expression recognition model.
[0173] Preferably, the facial expression recognition model construction module includes:
[0174] The learning sample construction submodule is used to acquire several facial images with different expressions and attach an expression label to each facial image to form several facial image learning samples. The several facial image learning samples are divided into several categories according to the expression labels.
[0175] The teacher network construction submodule is used to construct several teacher networks with different network structures and to integrate these teacher networks into an integrated network.
[0176] The teacher network training submodule is used to train each teacher network and the ensemble network by taking the face image as input and the expression label as output.
[0177] The relative confidence calculation submodule calculates the relative logical score and relative confidence of each network for each face image learning sample after the training of each teacher network and the ensemble network is completed; wherein, the relative logical score is used to characterize the ability of each network to correctly predict the correct expression label for a learning sample; and the relative confidence is used to characterize the relative accuracy of each network in predicting the expression label for each learning sample.
[0178] The student model construction submodule is used to construct an initial student model, and train the initial student model using the aforementioned face image learning samples and a loss function weighted by the relative logical score and relative confidence of each network under the corresponding learning samples. The trained student model is then used as the face expression recognition model.
[0179] One embodiment of the present invention provides a facial expression recognition device based on a multi-teacher network, enabling the trained facial expression recognition model to be applied to current service-oriented intelligent devices. The present invention trains several pre-constructed teacher networks, and an ensemble network composed of these teacher networks, using several face image learning samples, and calculates the relative confidence of each network on each face image learning sample. A pre-constructed, structurally simple student model is then trained using the face image learning samples and the relative confidence of each network on each learning sample, and the trained student model is used as the facial expression recognition model. The resulting facial expression recognition model has a simple structure yet achieves recognition accuracy equal to or even higher than that of existing, more complex facial expression recognition models. Furthermore, the facial expression recognition model trained by the present invention has higher recognition efficiency, enabling real-time recognition of user expressions.
[0180] It should be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0181] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working process of the device described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0182] An embodiment of the present invention provides a device, which is a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements any of the above-described multi-teacher network facial expression recognition methods.
[0183] The terminal device can be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor and memory. The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor, etc. The processor is the control center of the terminal device, connecting various parts of the terminal device through various interfaces and lines. The memory can be used to store the computer program. The processor implements various functions of the terminal device by running or executing the computer program stored in the memory and by calling data stored in the memory. The memory may mainly include a program storage area and a data storage area. The program storage area may store the operating system, at least one application program required for a function, etc.; the data storage area may store data created based on the use of the mobile phone, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash memory device, or other volatile solid-state storage device.
[0184] One embodiment of the present invention provides a medium, which is a computer-readable storage medium. A computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0185] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.
Claims
1. A facial expression recognition method based on a multi-teacher network, characterized in that it comprises: acquiring the face image to be recognized; and inputting the face image to be recognized into a facial expression recognition model so that the facial expression recognition model recognizes the expression corresponding to the face image to be recognized; wherein the construction of the facial expression recognition model includes: acquiring several face images with different expressions, and attaching an expression label to each face image to form several face image learning samples, the several face image learning samples being divided into several categories according to the expression labels; constructing several teacher networks with different network structures, and combining these teacher networks into an ensemble network; training each teacher network and the ensemble network with the face images as input and the expression labels as output; when the training of each teacher network and the ensemble network is completed, calculating the relative logit score and relative confidence of each network on each face image learning sample, wherein the relative logit score is used to characterize the ability of each network to predict the correct expression label for a learning sample, and the relative confidence is used to characterize the relative accuracy of each network in predicting the expression label for each learning sample; and constructing an initial student model, training the initial student model using the several face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples, and using the trained student model as the facial expression recognition model.
Specifically, when training each teacher network and the ensemble network, a pre-set diversified knowledge integration component is used. The loss function of the diversified knowledge integration component is given by formulas whose symbols denote, in order of appearance: the loss function of the diversified knowledge integration component; a loss function used to maximize the diversity of the output features of each network, where Δ is the difference in direction between the face image features output by any two networks; a term characterizing the size diversity of all teacher networks within each category of learning samples; two coefficients used to balance the two losses on a training sample; a loss function used to maximize the accuracy of the output features of each network; the classification loss of the ensemble network; and the classification loss of the i-th teacher network. M represents the number of teacher networks.
2. The facial expression recognition method based on a multi-teacher network as described in claim 1, characterized in that the calculation of the relative logit score of each network on each face image learning sample includes: using a pre-defined computational component, calculating the relative logit score of each network on each learning sample by a formula, based on the logit values of each teacher network and the ensemble network on each learning sample; in the formula, $l_{i,j}$ is the relative logit score of the i-th teacher network on the j-th sample; $z_{i,j,k'}$ is the logit score output by the i-th network on the j-th sample for the correct expression label $k'$; and the remaining term is the maximum logit score among the logit scores of every expression label category output by all networks over all training samples; a logit score represents the score of a network's prediction for each expression label category on a training sample.
3. The facial expression recognition method based on a multi-teacher network as described in claim 2, characterized in that the calculation of the relative confidence of each network on each face image learning sample includes: calculating, by the computational component, the confidence of each network on each sample, and calculating the relative confidence of each network on each sample by a formula in which $r_{j,i}$ is the relative confidence value of the i-th network on the j-th sample; $c_{i,j}$ is the confidence value of the i-th network on the j-th sample; $\tau$ is a scaling factor used to adjust the weight of the relative confidence of the teacher networks; M represents the number of teacher networks, and M+1 represents the total number of teacher networks plus the ensemble network.
4. The facial expression recognition method based on a multi-teacher network as described in claim 3, characterized in that the step of training the initial student model using the several face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples includes: calculating the loss function of the student model for each learning sample by a formula in which $L_s(\theta_i; \theta_s; x_j)$ denotes the loss incurred when the student model, guided by the i-th teacher network, learns the face image features of the j-th sample; $\theta_i$ denotes the parameters of a teacher network or the ensemble network; $\theta_s$ denotes the parameters of the student model; $\beta$ is a hyperparameter; $l_{i,j}$ is the relative logit score of the i-th teacher network on the j-th sample; $L_C$ is the classification loss of the student model guided by the i-th teacher network; $L_K$ is the distillation loss of the student model guided by the i-th teacher network; and $L_j$ is the loss function of the student model on the j-th learning sample.
5. A facial expression recognition device based on a multi-teacher network, characterized in that it comprises: a face image acquisition module, used to acquire the face image to be recognized; a facial expression recognition module, used to input the face image to be recognized into a facial expression recognition model so that the facial expression recognition model recognizes the expression corresponding to the face image to be recognized; and a facial expression recognition model construction module, which includes a learning sample construction submodule, a teacher network construction submodule, a teacher network training submodule, a relative confidence calculation submodule, and a student model construction submodule. The learning sample construction submodule is used to acquire several face images with different expressions and attach an expression label to each face image to form several face image learning samples, the several face image learning samples being divided into several categories according to the expression labels. The teacher network construction submodule is used to construct several teacher networks with different network structures and to combine these teacher networks into an ensemble network. The teacher network training submodule is used to train each teacher network and the ensemble network by taking the face images as input and the expression labels as output, wherein a pre-defined diversified knowledge integration component is used for this training. The loss function of the diversified knowledge integration component is given by formulas whose symbols denote, in order of appearance: the loss function of the diversified knowledge integration component; a loss function used to maximize the diversity of the output features of each network, where Δ is the difference in direction between the face image features output by any two networks; a term characterizing the size diversity of all teacher networks within each category of learning samples; two coefficients used to balance the two losses on a training sample; a loss function used to maximize the accuracy of the output features of each network; the classification loss of the ensemble network; and the classification loss of the i-th teacher network. M represents the number of teacher networks.
The relative confidence calculation submodule calculates the relative logit score and relative confidence of each network on each face image learning sample after the training of each teacher network and the ensemble network is completed, wherein the relative logit score is used to characterize the ability of each network to predict the correct expression label for a learning sample, and the relative confidence is used to characterize the relative accuracy of each network in predicting the expression label for each learning sample. The student model construction submodule is used to construct an initial student model and to train the initial student model using the several face image learning samples and a loss function weighted by the relative logit score and relative confidence of each network on the corresponding learning samples. The trained student model is then used as the facial expression recognition model.
6. A device, characterized in that the device is a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements a facial expression recognition method based on a multi-teacher network as described in any one of claims 1 to 4.
7. A medium, characterized in that the medium is a computer-readable storage medium, including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute a facial expression recognition method based on a multi-teacher network as described in any one of claims 1 to 4.