Self-distillation expression recognition method based on weighted non-target category, medium and equipment

By employing a weighted non-target category self-distillation facial expression recognition method, and utilizing affine transformation and self-distillation loss function, the problem of poor facial expression recognition performance and high training costs in existing technologies is solved, achieving more stable and efficient facial expression feature extraction and recognition.

CN116343309B · Active · Publication Date: 2026-06-16 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTH CHINA UNIV OF TECH
Filing Date
2023-04-06
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, facial expression recognition methods based on hand-crafted features perform poorly, while methods based on multiple neural networks have high training costs and poor robustness. Existing methods struggle to capture facial expression features effectively and depend heavily on the training process.

Method used

A self-distillation expression recognition method based on weighted non-target categories is adopted. The method applies affine transformations (flipping and random rotation) to the training images, feeds them to a feature extraction network, constructs a feature consistency loss from class activation maps (CAMs) and a self-distillation loss function, designs a training framework that does not change the network structure, and introduces non-proportional linear weighting for non-target category distillation.

Benefits of technology

It improves the robustness and performance of the network, reduces training costs, handles subtle changes in facial images better, obtains effective features for expression recognition, and enhances the stability and feature extraction capability of the network.

✦ Generated by Eureka AI based on patent content.

Figure CN116343309B_ABST

Abstract

The application provides a self-distillation expression recognition method based on weighted non-target categories, together with a medium and a device. In the method, an image to be recognized is input into a feature extraction network, which extracts expression features, and expression recognition is then performed. The feature extraction network is trained as follows: the original training image is flipped and randomly rotated; the original, flipped, and rotated images are input into the feature extraction network to obtain the corresponding feature maps and logit vectors; the class activation map (CAM) of each feature map is computed to construct a feature consistency loss; the vectors are used to construct a cross-entropy loss function; the KL divergence computed from the vectors is decomposed to construct a self-distillation loss function; and a total loss function is then constructed. Starting from affine transformations of the original training image, the method improves feature consistency by constraining the CAMs before and after the affine transformation, and improves discrimination consistency through self-distillation of the non-target categories with non-proportional linear weighting.

Description

Technical Field

[0001] This invention relates to the field of computer image recognition technology, and more specifically, to a self-distillation facial expression recognition method, medium, and device based on weighted non-target categories.

Background Technology

[0002] Facial expressions are an important means of communication. With the development of computer technology and smart devices, expression recognition has been widely applied, for example in human-computer interaction, medical diagnosis, and driver fatigue monitoring. Expression recognition methods can be divided into methods based on hand-crafted features and methods based on deep learning. For the former, Chinese invention patent "A Facial Expression Recognition Feature Extraction Method Based on Edge Detection and SIFT" (Publication No.: CN108038476A) uses SIFT descriptors of facial expression information in face images to complete feature extraction; Chinese invention patent "An Input Method Based on Expression Recognition" (Publication No.: CN102193620A) uses principal component analysis and the Gabor wavelet method to extract expression features. With such hand-crafted feature methods, it is difficult to obtain high-quality expression features.

[0003] With the development of artificial intelligence, deep learning-based methods have become the mainstream approach for facial expression recognition. Most of these methods start from the network structure. Chinese invention patent application "A Facial Expression Recognition Algorithm Combining Multi-Level Convolutional Feature Pyramids" (Publication No.: CN110580461A) proposes a facial expression recognition algorithm combining multi-level convolutional feature pyramids; Chinese invention patent "A Facial Expression Recognition Method Based on Multi-Branch Cross-Connectivity Convolutional Neural Networks" (Publication No.: CN111639544A) proposes a multi-branch cross-connectivity convolutional neural network; Chinese invention patent "A Facial Expression Recognition Method and System Based on Local and Global Attention Mechanisms" (Publication No.: CN112784764A) proposes a network structure based on local and global attention mechanisms; and Chinese invention patent "A Facial Expression Recognition Method and System Based on an Improved Channel Attention Mechanism" (Publication No.: CN113076890A) proposes a neural network based on an improved channel attention mechanism. The performance of these methods depends on a good training process, and compared to the design of the structure, the design of the training framework has higher versatility and practical value. Chinese invention patents "A Facial Expression Recognition Method Based on Improved Deep Convolutional Generative Adversarial Networks" (Publication No.: CN113688799B) and "A Multi-Angle Facial Expression Recognition Method Based on Generative Adversarial Networks" (Publication No.: CN108446609A) use generative adversarial networks to achieve facial expression recognition. However, these methods rely on multiple neural networks, which increases training costs, and they also exhibit high instability and poor robustness.

Summary of the Invention

[0004] To overcome the shortcomings and deficiencies in the prior art, the present invention aims to provide a self-distillation expression recognition method, medium, and device based on weighted non-target categories. The method starts from the affine transformation of the original image, improves the consistency of features by restricting the class activation map (CAM) before and after the affine transformation of the original image, and improves the consistency of discrimination by performing self-distillation of non-target categories with non-proportional linear weighting.

[0005] To achieve the above objectives, the present invention is implemented through the following technical solution: a self-distillation expression recognition method based on weighted non-target categories, wherein the image to be recognized is input into a feature extraction network, expression features are extracted using the feature extraction network, and then expression recognition is performed;

[0006] The feature extraction network refers to the trained feature extraction network; the feature extraction network training method includes the following steps:

[0007] S1. The original training image is flipped and randomly rotated to obtain a flipped image and a rotated image, respectively. The original image, flipped image, and rotated image are then input into the feature extraction network to obtain the corresponding feature maps $M_o$, $M_f$, and $M_r$ and the corresponding logit vectors $f_i$, $i \in \{o, f, r\}$;

[0008] S2. Calculate the CAM $A_{i,c}$ corresponding to each feature map $M_i$;

[0009] Apply the flip and the same rotation to the CAM $A_{o,c}$ to obtain the CAMs $A'_{f,c}$ and $A'_{r,c}$, respectively. Use the difference between $A_{f,c}$ and $A'_{f,c}$ and the difference between $A_{r,c}$ and $A'_{r,c}$ to construct the feature consistency loss;

[0010] S3. Use the logit vectors $f_i$ to construct the cross-entropy loss function;

[0011] S4. Compute the KL divergence between the logit vector $f_o$ and the mean of the logit vectors $f_f$ and $f_r$, and decompose it; then construct the self-distillation loss function;

[0012] S5. Use the feature consistency loss, the cross-entropy loss function, and the self-distillation loss function to construct the total loss function, and train the feature extraction network with the total loss function.

[0013] Preferably, step S2 includes the following sub-steps:

[0014] S2.1. Calculate the CAM $A_{o,c}$ corresponding to the feature map $M_o$, the CAM $A_{f,c}$ corresponding to the feature map $M_f$, and the CAM $A_{r,c}$ corresponding to the feature map $M_r$:

[0015]

[0016] where $i \in \{o, f, r\}$; $c$ is the category index; $L$ is the input dimension of the fully connected layer, i.e., the output dimension of the image features; and $w_{l,c}$ is the weight connecting the $l$-th feature to category $c$ in the fully connected layer;
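As a non-limiting illustration, a form of the CAM consistent with these definitions is the standard weighted sum of feature-map channels. The channel subscript on $M_{i,l}$ and the explicit spatial coordinates $(x, y)$ are assumed notation:

```latex
A_{i,c}(x, y) = \sum_{l=1}^{L} w_{l,c}\, M_{i,l}(x, y), \qquad i \in \{o, f, r\}
```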

[0017] S2.2. Apply the flip and the same rotation to the CAM $A_{o,c}$ to obtain the CAMs $A'_{f,c}$ and $A'_{r,c}$, respectively;

[0018] S2.3. Use the difference between the CAMs $A_{f,c}$ and $A'_{f,c}$ and the difference between the CAMs $A_{r,c}$ and $A'_{r,c}$ to construct the feature consistency loss:

[0019]

[0020] where $C$ is the number of categories; $c$ is the category index; $w$ and $h$ are the width and height of the feature map, respectively; $N$ is the number of samples; and $(n)$ is the sample index.
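As a non-limiting illustration, one form of the feature consistency loss that uses exactly these symbols is a mean squared error over samples, categories, and spatial positions. The symbol $\mathcal{L}_{fc}$, the normalisation factor, and the equal weighting of the two terms are assumptions:

```latex
\mathcal{L}_{fc} = \frac{1}{N\,C\,w\,h} \sum_{n=1}^{N} \sum_{c=1}^{C}
\left( \left\| A^{(n)}_{f,c} - A'^{(n)}_{f,c} \right\|_2^2
     + \left\| A^{(n)}_{r,c} - A'^{(n)}_{r,c} \right\|_2^2 \right)
```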

[0021] Preferably, step S3 includes the following sub-steps:

[0022] S3.1. Input each logit vector $f_i$ into the SoftMax function to obtain the corresponding discriminant distribution vector $P_i$;

[0023] S3.2 Constructing the cross-entropy loss function

[0024]

[0025] where $N$ is the number of samples; $(n)$ is the sample index; $\mathrm{CE}(\cdot)$ is the cross-entropy function; and $l$ is the corresponding one-hot label.

[0026] Preferably, step S4 includes the following sub-steps:

[0027] S4.1. Set the mean of the logit vectors $f_f$ and $f_r$ as the logit vector $d^T$, and set the logit vector $f_o$ as the logit vector $d^S$:

[0028]

[0029] $d^S = f_o$

[0030] S4.2. Pass the logit vectors $d^T$ and $d^S$ through the Softmax function with temperature temp to obtain the outputs $p^T$ and $p^S$:

[0031]

[0032]

[0033] where $t \in \{1, 2, \ldots, C\}$ denotes the index of the target category;
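As a non-limiting illustration, and reading "mean" in S4.1 as the unweighted average, S4.1 and S4.2 can be written as follows. The class subscript $c$ and the summation index $k$ are assumed notation:

```latex
d^{T} = \tfrac{1}{2}\left(f_f + f_r\right), \qquad d^{S} = f_o, \qquad
p^{j}_{c} = \frac{\exp\!\left(d^{j}_{c} / \mathrm{temp}\right)}
                 {\sum_{k=1}^{C} \exp\!\left(d^{j}_{k} / \mathrm{temp}\right)},
\quad j \in \{T, S\}
```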

[0034] S4.3. Define a distribution that independently models the predictions over the non-target categories, $j \in \{T, S\}$, using the sum of the elements of $p^j$ over the non-target categories; calculate it separately for the two outputs:

[0035]

[0036]

[0037] S4.4. Define a binary distribution over the target category and all other non-target categories;

[0038] S4.5 Perform KL divergence decomposition and construct the NNSD function:

[0039]

[0040]

[0041] where $a$ and $b$ are adjustable parameters, representing the slope and the intercept of the linear function, respectively;
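As a non-limiting illustration, the decomposition in S4.5 can be read as the exact identity used in decoupled knowledge distillation. The names $q^{j}$ (binary distribution) and $\hat{p}^{j}$ (non-target distribution) are assumed notation, and the NNSD line is only one reading of "non-proportional linear weighting", with the weight taken as a straight line in $p^{T}_{t}$:

```latex
\begin{aligned}
q^{j} &= \left[\, p^{j}_{t},\; 1 - p^{j}_{t} \,\right], \qquad
\hat{p}^{j}_{c} = \frac{p^{j}_{c}}{1 - p^{j}_{t}} \quad (c \neq t), \\
\mathrm{KL}\!\left(p^{T} \,\|\, p^{S}\right)
  &= \mathrm{KL}\!\left(q^{T} \,\|\, q^{S}\right)
   + \left(1 - p^{T}_{t}\right) \mathrm{KL}\!\left(\hat{p}^{T} \,\|\, \hat{p}^{S}\right), \\
\mathrm{NNSD}\!\left(p^{T} \,\|\, p^{S}\right)
  &= \mathrm{KL}\!\left(q^{T} \,\|\, q^{S}\right)
   + \left(a\, p^{T}_{t} + b\right) \mathrm{KL}\!\left(\hat{p}^{T} \,\|\, \hat{p}^{S}\right).
\end{aligned}
```

Under this reading, ordinary knowledge distillation corresponds to $a = -1$, $b = 1$, and a constant non-target weight corresponds to $a = 0$.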

[0042] S4.6 Constructing the self-distillation loss function

[0043]

[0044] Preferably, in S5, the total loss function is:

[0045]

[0046] where $\alpha$, $\beta_k$, and $\gamma$ are adjustable parameters.

[0047] Preferably, among the adjustable parameters $\beta_k$, $\beta_r = 1 - \beta_f$.

[0048] Preferably, the feature extraction network refers to the ResNet-18 neural network.

[0049] A storage medium storing a computer program that, when executed by a processor, causes the processor to perform the aforementioned self-distillation facial expression recognition method based on weighted non-target categories.

[0050] A computing device includes a processor and a memory for storing a processor-executable program, wherein when the processor executes the program stored in the memory, it implements the above-described self-distillation facial expression recognition method based on weighted non-target categories.

[0051] Compared with the prior art, the present invention has the following advantages and beneficial effects:

[0052] 1. Existing technologies mostly utilize manual feature extraction or network structure modification. Manual feature extraction has poor recognition performance and is not easy to capture effective features. Network structure modification has poor versatility and is highly dependent on the training process. This invention designs a training framework based on the loss function without changing the network structure, which can more conveniently and easily improve the feature extraction of existing networks, enhance the robustness of the network, and thus improve performance.

[0053] 2. For the few solutions designed from the perspective of training framework, existing technologies rely on multiple neural networks, which increases training costs and also has extremely high instability and poor robustness; this invention only involves a single neural network, with lower training costs and better stability.

[0054] 3. This invention introduces a self-distillation method, which, through the invariance of features and discrimination in the affine transformation of the image, enables the network to better cope with subtle changes in facial images, obtain effective features, and thus perform expression recognition.

[0055] 4. As the network's grasp of the data increases, the utilization of dark knowledge becomes more important, so non-target class distillation must be taken into account. Therefore, this invention re-weights the distillation, setting the weight of non-target class distillation as a non-proportional straight line with respect to the predicted value of the target category; this preserves the dynamics of non-target class distillation while expanding the benefits of knowledge distillation.

Brief Description of the Drawings

[0056] Figure 1 is a flowchart of the training process in the self-distillation expression recognition method based on weighted non-target categories of the present invention;

[0057] Figure 2 is a flowchart of the training method in the self-distillation expression recognition method based on weighted non-target categories of the present invention;

[0058] Figures 3(a) to 3(c) are schematic diagrams illustrating the weights that traditional knowledge distillation (KD), decoupled knowledge distillation (DKD), and the NNSD of this invention, respectively, assign to the knowledge distillation of non-target categories.

Detailed Implementation

[0059] The present invention will now be described in further detail with reference to the accompanying drawings and specific embodiments.

[0060] Example 1

[0061] As shown in Figures 1 and 2, this embodiment presents a self-distillation facial expression recognition method based on weighted non-target categories. The image to be recognized is input into a feature extraction network, which extracts facial expression features, and facial expression recognition is then performed. The feature extraction network can be a ResNet-18 neural network.

[0062] The feature extraction network refers to the trained feature extraction network; the feature extraction network training method includes the following steps:

[0063] S1. The original training image is flipped and randomly rotated to obtain a flipped image and a rotated image, respectively. The original image, flipped image, and rotated image are then input into the feature extraction network to obtain the corresponding feature maps $M_o$, $M_f$, and $M_r$ and the corresponding logit vectors $f_i$, $i \in \{o, f, r\}$.
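As a non-limiting illustration, the view generation in S1 can be sketched in NumPy as follows. The helper names `hflip`, `rotate_nn`, and `make_views`, the choice of a horizontal flip, the nearest-neighbour sampling with zero fill, and the `max_angle` range are all assumptions; the text specifies only "flipping" and "random rotation".

```python
import numpy as np

def hflip(img):
    """Flip horizontally; the last two axes are (height, width)."""
    return img[..., ::-1]

def rotate_nn(img, angle_deg):
    """Rotate about the image centre using nearest-neighbour sampling.
    Output pixels whose source falls outside the image are set to zero."""
    h, w = img.shape[-2:]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Inverse mapping: locate the source pixel for every output pixel.
    x0 = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    y0 = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    xi, yi = np.rint(x0).astype(int), np.rint(y0).astype(int)
    valid = (xi >= 0) & (xi < w) & (yi >= 0) & (yi < h)
    out = np.zeros_like(img)
    out[..., valid] = img[..., yi[valid], xi[valid]]
    return out

def make_views(img, rng, max_angle=15.0):
    """Return (original, flipped, rotated, angle); max_angle is an assumed range."""
    angle = rng.uniform(-max_angle, max_angle)
    return img, hflip(img), rotate_nn(img, angle), angle

rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))  # toy (channels, height, width) image
x_o, x_f, x_r, angle = make_views(image, rng)
```

The sampled angle is returned because S2 applies the same rotation to the CAM of the original image.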

[0064] S2. Calculate the CAM $A_{i,c}$ corresponding to each feature map $M_i$. The CAM reflects the image locations that the network focuses on during classification.

[0065] Apply the flip and the same rotation to the CAM $A_{o,c}$ to obtain the CAMs $A'_{f,c}$ and $A'_{r,c}$, respectively. Use the difference between $A_{f,c}$ and $A'_{f,c}$ and the difference between $A_{r,c}$ and $A'_{r,c}$ to construct the feature consistency loss.

[0066] Specifically, S2 includes the following steps:

[0067] S2.1. Calculate the CAM $A_{o,c}$ corresponding to the feature map $M_o$, the CAM $A_{f,c}$ corresponding to the feature map $M_f$, and the CAM $A_{r,c}$ corresponding to the feature map $M_r$:

[0068]

[0069] where $i \in \{o, f, r\}$; $c$ is the category index; $L$ is the input dimension of the fully connected layer, i.e., the output dimension of the image features; and $w_{l,c}$ is the weight connecting the $l$-th feature to category $c$ in the fully connected layer;
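As a non-limiting illustration, S2.1 can be sketched as a weighted sum of feature-map channels, which is the standard CAM computation. The function name `class_activation_maps`, the array layout, and the toy sizes (including seven categories) are assumptions.

```python
import numpy as np

def class_activation_maps(feature_map, fc_weight):
    """feature_map: (N, L, h, w) last-layer features.
    fc_weight: (L, C) weights of the fully connected layer.
    Returns CAMs of shape (N, C, h, w): A[n, c] = sum_l w[l, c] * M[n, l]."""
    return np.einsum("lc,nlhw->nchw", fc_weight, feature_map)

rng = np.random.default_rng(0)
M_o = rng.random((2, 4, 3, 3))  # toy batch: N=2 samples, L=4 channels, 3x3 maps
W = rng.random((4, 7))          # C=7 categories (assumed count)
A_o = class_activation_maps(M_o, W)
```

When global average pooling precedes the fully connected layer, as in ResNet-18, the spatial mean of each CAM equals the corresponding class logit up to the bias term.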

[0070] S2.2. Apply the flip and the same rotation to the CAM $A_{o,c}$ to obtain the CAMs $A'_{f,c}$ and $A'_{r,c}$, respectively;

[0071] S2.3. Affine transformations change the image locations the network focuses on, and thus affect its ability to capture information. Intuitively, however, when different basic affine transformations are applied to the same image, the expression regions the network relies on for discrimination should be consistent. Based on this idea, this invention uses an MSE loss to construct the feature consistency loss, minimizing the difference between $A_{f,c}$ and $A'_{f,c}$ and the difference between $A_{r,c}$ and $A'_{r,c}$, which enhances the network's ability to extract important facial features.

[0072] That is, this invention uses the difference between the CAMs $A_{f,c}$ and $A'_{f,c}$ and the difference between the CAMs $A_{r,c}$ and $A'_{r,c}$ to construct the feature consistency loss:

[0073]

[0074] where $C$ is the number of categories; $c$ is the category index; $w$ and $h$ are the width and height of the feature map, respectively; $N$ is the number of samples; and $(n)$ is the sample index.

[0075] Therefore, neural networks can be constrained from the perspective of features to effectively focus on similar facial features under different affine transformations.
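As a non-limiting illustration, the MSE-based feature consistency loss of S2.2 and S2.3 can be sketched as follows. The function name `feature_consistency_loss`, the horizontal flip, the equal weighting of the two terms, and the 90-degree rotation standing in for the random rotation are assumptions.

```python
import numpy as np

def feature_consistency_loss(A_o, A_f, A_r, rotate_fn):
    """MSE between each transformed view's CAM and the identically
    transformed CAM of the original image. CAM arrays: (N, C, h, w)."""
    A_f_ref = A_o[..., ::-1]  # A'_f: flipped CAM of the original image
    A_r_ref = rotate_fn(A_o)  # A'_r: CAM of the original under the same rotation
    return np.mean((A_f - A_f_ref) ** 2) + np.mean((A_r - A_r_ref) ** 2)

# Toy check: a 90-degree rotation stands in for the random rotation.
rot = lambda a: np.rot90(a, k=1, axes=(2, 3))
rng = np.random.default_rng(0)
A_o = rng.random((2, 7, 4, 4))
perfect = feature_consistency_loss(A_o, A_o[..., ::-1], rot(A_o), rot)
noisy = feature_consistency_loss(A_o, A_o[..., ::-1] + 0.1, rot(A_o), rot)
```

The loss is zero when the CAMs of the transformed views match the transformed CAM of the original exactly, and grows with any mismatch.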

[0076] S3. Use the logit vectors $f_i$ to construct the cross-entropy loss function.

[0077] Specifically, S3 includes the following steps:

[0078] S3.1. Input each logit vector $f_i$ into the SoftMax function to obtain the corresponding discriminant distribution vector $P_i$;

[0079] S3.2 Constructing the cross-entropy loss function

[0080]

[0081] where $N$ is the number of samples; $(n)$ is the sample index; $\mathrm{CE}(\cdot)$ is the cross-entropy function; and $l$ is the corresponding one-hot label.
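As a non-limiting illustration, S3 can be sketched as follows. The function names and the unweighted sum over the three views are assumptions, since the combination of the per-view terms is not spelled out in the text; $\log P_i$ is computed directly for numerical stability.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the SoftMax output P."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, onehot):
    """Mean over samples of CE(P, l) = -sum_c l_c * log P_c."""
    return float(-(onehot * log_softmax(logits)).sum(axis=-1).mean())

def multi_view_ce(f_o, f_f, f_r, onehot):
    # Unweighted sum over original, flipped, and rotated views (an assumption).
    return sum(cross_entropy(f, onehot) for f in (f_o, f_f, f_r))

# Toy check: uniform logits over C = 7 categories give CE = log(7) per view.
zeros = np.zeros((2, 7))
labels = np.eye(7)[[0, 3]]  # one-hot labels for two samples
total = multi_view_ce(zeros, zeros, zeros, labels)
```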

[0082] S4. Compute the KL divergence between the logit vector $f_o$ and the mean of the logit vectors $f_f$ and $f_r$, and decompose it; then construct the self-distillation loss function.

[0083] Specifically, S4 includes the following sub-steps:

[0084] S4.1. Set the mean of the logit vectors $f_f$ and $f_r$ as the logit vector $d^T$, and set the logit vector $f_o$ as the logit vector $d^S$:

[0085]

[0086] $d^S = f_o$

[0087] S4.2. Pass the logit vectors $d^T$ and $d^S$ through the Softmax function with temperature temp to obtain the outputs $p^T$ and $p^S$:

[0088]

[0089]

[0090] where $t \in \{1, 2, \ldots, C\}$ denotes the index of the target category;
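As a non-limiting illustration, S4.1 and S4.2 can be sketched as follows, reading "mean" as the unweighted average. The function names and the default temperature of 4.0 are assumptions.

```python
import numpy as np

def softmax_t(logits, temp):
    """Softmax with temperature; larger temp gives a flatter distribution."""
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_student(f_o, f_f, f_r, temp=4.0):
    d_T = 0.5 * (f_f + f_r)  # teacher logits: mean of the two transformed views
    d_S = f_o                # student logits: the original view
    return softmax_t(d_T, temp), softmax_t(d_S, temp)

rng = np.random.default_rng(0)
f_o, f_f, f_r = (rng.normal(size=(2, 7)) for _ in range(3))
p_T, p_S = teacher_student(f_o, f_f, f_r)
```

Because teacher and student outputs come from the same network, no second model is needed, which matches the single-network claim of the method.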

[0091] S4.3. Define a distribution that independently models the predictions over the non-target categories, $j \in \{T, S\}$, using the sum of the elements of $p^j$ over the non-target categories; calculate it separately for the two outputs:

[0092]

[0093]

[0094] S4.4. Define a binary distribution over the target category and all other non-target categories;
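As a non-limiting illustration, the two distributions of S4.3 and S4.4 can be sketched as follows, following the construction used in decoupled knowledge distillation. The function name `split_target` and the variable names `q` and `p_hat` are assumptions.

```python
import numpy as np

def split_target(p, target):
    """p: (N, C) probabilities; target: (N,) integer class indices.
    Returns q = [p_t, 1 - p_t] with shape (N, 2), and p_hat with shape
    (N, C - 1): the non-target entries renormalised by their sum."""
    n, c = p.shape
    rows = np.arange(n)
    mask = np.ones((n, c), dtype=bool)
    mask[rows, target] = False              # drop the target column per row
    p_t = p[rows, target]
    rest = p[mask].reshape(n, c - 1)        # non-target probabilities
    q = np.stack([p_t, 1.0 - p_t], axis=1)  # binary target / non-target split
    p_hat = rest / rest.sum(axis=1, keepdims=True)
    return q, p_hat

q, p_hat = split_target(np.array([[0.7, 0.2, 0.1]]), np.array([0]))
```

For the toy input, the target probability 0.7 gives the binary split (0.7, 0.3), and the remaining mass is renormalised to (2/3, 1/3).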

[0095] S4.5 Perform KL divergence decomposition and construct the NNSD function:

[0096]

[0097]

[0098] where $a$ and $b$ are adjustable parameters, representing the slope and the intercept of the linear function, respectively;

[0099] We believe this weight should be dynamic, set as a non-proportional straight line with respect to the predicted value of the target category. The slope and intercept can be adjusted to better extract useful facial expression information during knowledge distillation. The weights assigned to the non-target categories when the target category weight is 1 are shown in Figures 3(a) to 3(c).

[0100] To address this issue and avoid the over-reliance on teacher networks in traditional knowledge distillation, this invention proposes knowledge distillation with non-proportional linear weights for the non-target categories, and uses it as a self-distillation method. Taking NNSD($p^T \| p^S$) as an example, the non-proportional linear weight behaves as follows: when the predicted value of the target category is very small, the proportion of non-target category distillation increases relative to the original knowledge distillation, while when that value is large, the original non-target class weight is reduced to mitigate the impact of non-target class distillation. Furthermore, compared to decoupled knowledge distillation, this weighting retains the dynamics of the original distillation and can better adapt to the samples.
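As a non-limiting illustration, the decomposition and weighting of S4.5 can be sketched as follows. The decomposition itself is an exact identity; the NNSD form used here, with the non-target term weighted by the line `a * p_t + b` in the teacher's target probability, is an assumed reading of "non-proportional linear weighting", and the function names are hypothetical.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) along the last axis."""
    return np.sum(p * np.log(p / q), axis=-1)

def decompose(p_T, p_S, target):
    """Return (target-class KL, non-target KL, teacher target probability)."""
    n = len(target)
    rows = np.arange(n)
    keep = np.ones_like(p_T, dtype=bool)
    keep[rows, target] = False
    pt_T, pt_S = p_T[rows, target], p_S[rows, target]
    q_T = np.stack([pt_T, 1 - pt_T], axis=1)
    q_S = np.stack([pt_S, 1 - pt_S], axis=1)
    hat_T = p_T[keep].reshape(n, -1) / (1 - pt_T)[:, None]
    hat_S = p_S[keep].reshape(n, -1) / (1 - pt_S)[:, None]
    return kl(q_T, q_S), kl(hat_T, hat_S), pt_T

def nnsd(p_T, p_S, target, a, b):
    """Assumed form: target term + (a * p_t^T + b) * non-target term."""
    tc, nc, pt_T = decompose(p_T, p_S, target)
    return tc + (a * pt_T + b) * nc

rng = np.random.default_rng(0)
p_T = rng.dirichlet(np.ones(7), size=4)  # toy teacher outputs
p_S = rng.dirichlet(np.ones(7), size=4)  # toy student outputs
y = np.array([0, 3, 6, 2])               # toy target indices
plain_kd = nnsd(p_T, p_S, y, a=-1.0, b=1.0)  # weight 1 - p_t recovers KL(p_T || p_S)
weighted = nnsd(p_T, p_S, y, a=-0.8, b=1.0)  # arbitrary illustrative slope and intercept
```

Under this reading, setting `a = -1, b = 1` reproduces the ordinary KL divergence, and setting `a = 0` gives a constant non-target weight; other values of `a` and `b` shift how strongly the non-target categories are distilled as the target prediction changes.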

[0101] S4.6 Constructing the self-distillation loss function

[0102]

[0103] S5. Use the feature consistency loss, the cross-entropy loss function, and the self-distillation loss function to construct the total loss function, and train the feature extraction network with the total loss function.

[0104] The total loss function is:

[0105]

[0106] where $\alpha$, $\beta_k$, and $\gamma$ are adjustable parameters, and $\beta_r = 1 - \beta_f$.

[0107] By optimizing this joint loss, we can not only better handle the relationship between target and non-target categories and improve the knowledge transfer effect in the network, but also make the network more robust and easier to capture important facial expression information.

[0108] This invention, based on affine transformation invariance, performs relevant affine transformations on the images input to the network, inputting them simultaneously with the original images. By constructing a feature consistency loss, the network's consistency in identifying regions of interest for images with different affine transformations is improved. Furthermore, a discriminative consistency loss is constructed, which performs self-distillation in facial expression recognition using the output of the affine-transformed images. This simplifies the traditional knowledge distillation method that utilizes multiple networks while enhancing the network's ability to capture important facial expression information. Finally, it proposes using a general straight line as the weight for non-target class knowledge transfer as the loss function in self-distillation to better handle the relationship between target and non-target classes, thereby obtaining consistent discriminative information. This method can improve the ability of traditional networks to acquire important facial features for expression recognition, and also enhances robustness to facial data collected in the field.

[0109] Example 2

[0110] This embodiment provides a storage medium, characterized in that the storage medium stores a computer program, which, when executed by a processor, causes the processor to perform the self-distillation facial expression recognition method based on weighted non-target categories as described in Embodiment 1.

[0111] Example 3

[0112] This embodiment provides a computing device, including a processor and a memory for storing processor-executable programs. The processor, when executing the program stored in the memory, implements the self-distillation facial expression recognition method based on weighted non-target categories as described in Embodiment 1.

[0113] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.

Claims

1. A self-distillation facial expression recognition method based on weighted non-target categories, characterized in that: the image to be recognized is input into a feature extraction network, which extracts facial expression features, and facial expression recognition is then performed; the feature extraction network refers to the trained feature extraction network; the feature extraction network training method includes the following steps: S1. The original training image is flipped and randomly rotated to obtain a flipped image and a rotated image, respectively; the original image, flipped image, and rotated image are then input into the feature extraction network to obtain the corresponding feature maps $M_o$, $M_f$, and $M_r$ and the corresponding logit vectors $f_i$, $i \in \{o, f, r\}$; S2. Calculate the CAM $A_{i,c}$ corresponding to each feature map $M_i$; apply the flip and the same rotation to the CAM $A_{o,c}$ to obtain the CAMs $A'_{f,c}$ and $A'_{r,c}$, respectively; use the difference between $A_{f,c}$ and $A'_{f,c}$ and the difference between $A_{r,c}$ and $A'_{r,c}$ to construct the feature consistency loss, where $C$ is the number of categories; $c$ is the category index; $w$ and $h$ are the width and height of the feature map, respectively; $N$ is the number of samples; and $(n)$ is the sample index; S3. Use the logit vectors $f_i$ to construct the cross-entropy loss function, including: S3.1. Input each logit vector $f_i$ into the SoftMax function to obtain the corresponding discriminant distribution vector $P_i$; S3.2. Construct the cross-entropy loss function, where $N$ is the number of samples; $(n)$ is the sample index; $\mathrm{CE}(\cdot)$ is the cross-entropy function; and $l$ is the corresponding one-hot label; S4. Compute the KL divergence between the logit vector $f_o$ and the mean of the logit vectors $f_f$ and $f_r$, and decompose it; then construct the self-distillation loss function,
including: S4.1. Set the mean of the logit vectors $f_f$ and $f_r$ as the logit vector $d^T$, and set the logit vector $f_o$ as the logit vector $d^S$; S4.2. Pass the logit vectors $d^T$ and $d^S$ through the Softmax function with temperature temp to obtain the outputs $p^T$ and $p^S$, where $t \in \{1, 2, \ldots, C\}$ denotes the index of the target category; S4.3. Define a distribution that independently models the predictions over the non-target categories, $j \in \{T, S\}$, using the sum of the elements of $p^j$ over the non-target categories, and calculate it separately for the two outputs; S4.4. Define a binary distribution over the target category and all other non-target categories; S4.5. Perform the KL divergence decomposition and construct the NNSD function, where $a$ and $b$ are adjustable parameters, representing the slope and the intercept of the linear function, respectively; S4.6. Construct the self-distillation loss function; S5. Use the feature consistency loss, the cross-entropy loss function, and the self-distillation loss function to construct the total loss function, and train the feature extraction network with the total loss function, where $\alpha$, $\beta_k$, and $\gamma$ are adjustable parameters.

2. The self-distillation facial expression recognition method based on weighted non-target categories according to claim 1, characterized in that: in S2, the CAM $A_{i,c}$ corresponding to the feature map $M_i$ is calculated, where $i \in \{o, f, r\}$; $c$ is the category index; $L$ is the input dimension of the fully connected layer, i.e., the output dimension of the image features; and $w_{l,c}$ is the weight connecting the $l$-th feature to category $c$ in the fully connected layer.

3. The self-distillation facial expression recognition method based on weighted non-target categories according to claim 1, characterized in that: among the adjustable parameters $\beta_k$, $\beta_r = 1 - \beta_f$.

4. The self-distillation facial expression recognition method based on weighted non-target categories according to claim 1, characterized in that: The feature extraction network refers to the ResNet-18 neural network.

5. A storage medium, characterized in that the storage medium stores a computer program that, when executed by a processor, causes the processor to perform the self-distillation facial expression recognition method based on weighted non-target categories as described in any one of claims 1-4.

6. A computing device, comprising a processor and a memory for storing a processor-executable program, characterized in that, when the processor executes the program stored in the memory, it implements the self-distillation facial expression recognition method based on weighted non-target categories as described in any one of claims 1-4.