Driver expression recognition model training method and device, medium and electronic equipment

By combining ResNet18 and Transformer encoding sub-models and using the Softmax cross-entropy loss function and center loss function, the feature distribution of the driver expression recognition model is optimized, which solves the shortcomings of the driver expression recognition model in terms of accuracy and achieves higher expression recognition accuracy.

CN116844204B · Active · Publication Date: 2026-06-30 · CHINA FAW CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA FAW CO LTD
Filing Date
2023-06-20
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing driver facial expression recognition models are insufficient in terms of accuracy and have difficulty effectively distinguishing different categories of facial expressions.

Method used

A neural network model combining a ResNet18 sub-model with multiple Transformer encoding sub-models, along with Softmax cross-entropy loss and center loss functions, is used to improve facial expression recognition capabilities through data augmentation and feature distribution optimization.

Benefits of technology

By improving the distribution of facial expression features, the correlation of long-distance feature information in facial expression images is enhanced, thereby improving the recognition accuracy of the driver's facial expression recognition model.



Abstract

This application provides a training method, apparatus, medium, and electronic device for a driver expression recognition model. The method includes: acquiring multiple batches of image training sets, and training the driver expression recognition model with these batches until a preset number of training batches is reached. The driver expression recognition model of this application builds on the structural characteristics of a Transformer encoder, combines it with a ResNet18 residual network, and introduces a center loss function to improve the distribution of expression features. The model strengthens the correlation between long-distance feature information in expression images, enabling it to extract discriminative feature information. Training parameters are updated by combining the Softmax cross-entropy loss function with the center loss function. The center loss function reduces the intra-class spacing of features of the same expression category and thereby increases the distance between the features of different expression categories, making it easier for the network to distinguish facial expression features and improving recognition accuracy.

Description

Technical Field

[0001] This application relates to the field of machine learning technology, and more specifically, to a training method, apparatus, medium, and electronic device for a driver facial expression recognition model.

Background Technology

[0002] As living standards continue to improve, cars have become a major mode of transportation, bringing great convenience to people. The driver's driving condition has a significant impact on safe driving.

[0003] Therefore, this application provides a training method for a driver facial expression recognition model to solve the above-mentioned technical problems.

Summary of the Invention

[0004] The purpose of this application is to provide a training method, apparatus, medium, and electronic device for a driver facial expression recognition model, which can solve at least one of the aforementioned technical problems.

[0005] The specific plan is as follows:

[0006] According to a specific embodiment of this application, in a first aspect, this application provides a method for training a driver's facial expression recognition model, comprising:

[0007] Multiple batches of image training sets are obtained. Each batch of image training sets includes multiple training images. Each training image includes facial feature information of the driver. The driver's expression type in each training image belongs to one of multiple preset expression types.

[0008] The driver's facial expression recognition model is trained using the multiple batches of image training sets to reach a preset training batch;

[0009] The driver expression recognition model can classify driver expressions. It includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition capability based on the calculation results of the Softmax cross-entropy loss function and the center loss function.

[0010] Optionally, training the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch includes:

[0011] Each training image in each batch of image training set is sequentially input into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image;

[0012] The facial feature sequence of each training image is input into the multiple Transformer encoding sub-models to obtain the training expression type of the corresponding training image;

[0013] When the number of batches trained is lower than the preset number of training batches, the step of sequentially inputting each training image from the next batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image is triggered.

[0014] Training ends when the number of batches trained reaches the preset number of training batches.

[0015] Optionally, the ResNet18 sub-model includes: Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0016] Accordingly, the step of sequentially inputting each training image from each batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image includes:

[0017] The feature map of each training image in each batch of training images is obtained through Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0018] The feature maps of each training image are dimensionally adjusted to obtain the facial feature sequences of the corresponding training images.

[0019] Optionally, the multiple Transformer encoding sub-models include eight Transformer encoding sub-models.

[0020] Optionally, obtaining multiple batches of image training sets includes:

[0021] Acquire multiple raw facial images, where the driver's expression in each raw facial image belongs to one of multiple preset expression types;

[0022] Each original facial image is resized to obtain a standard-sized image of the corresponding original facial image.

[0023] Data augmentation is performed on each standard-sized image to obtain multiple training images corresponding to the standard-sized images;

[0024] The multiple training images are allocated in batches to obtain multiple batches of image training sets.

[0025] Optionally, λ in the center loss function is equal to 0.5.

[0026] Optionally, the preset training batch includes 200 batches.

[0027] According to a specific embodiment of this application, in a second aspect, this application provides a training device for a driver's facial expression recognition model, comprising:

[0028] The acquisition unit is used to acquire multiple batches of image training sets, wherein each batch of image training sets includes multiple training images, each training image includes facial feature information of the driver, and the driver's expression type in each training image belongs to one of multiple preset expression types.

[0029] The training unit is used to train the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch.

[0030] The driver expression recognition model can classify driver expressions. It includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition ability based on the calculation results of the Softmax cross-entropy loss function and the center loss function.

[0031] Optionally, training the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch includes:

[0032] Each training image in each batch of image training set is sequentially input into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image;

[0033] The facial feature sequence of each training image is input into the multiple Transformer encoding sub-models to obtain the training expression type of the corresponding training image;

[0034] When the number of batches trained is lower than the preset number of training batches, the step of sequentially inputting each training image from the next batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image is triggered.

[0035] Training ends when the number of batches trained reaches the preset number of training batches.

[0036] Optionally, the ResNet18 sub-model includes: Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0037] Accordingly, the step of sequentially inputting each training image from each batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image includes:

[0038] The feature map of each training image in each batch of training images is obtained through Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0039] The feature maps of each training image are dimensionally adjusted to obtain the facial feature sequences of the corresponding training images.

[0040] Optionally, the multiple Transformer encoding sub-models include eight Transformer encoding sub-models.

[0041] Optionally, obtaining multiple batches of image training sets includes:

[0042] Acquire multiple raw facial images, where the driver's expression in each raw facial image belongs to one of multiple preset expression types;

[0043] Each original facial image is resized to obtain a standard-sized image of the corresponding original facial image.

[0044] Data augmentation is performed on each standard-sized image to obtain multiple training images corresponding to the standard-sized images;

[0045] The multiple training images are allocated in batches to obtain multiple batches of image training sets.

[0046] Optionally, λ in the center loss function is equal to 0.5.

[0047] Optionally, the preset training batch includes 200 batches.

[0048] According to a specific embodiment of this application, in a third aspect, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the training method for the driver facial expression recognition model as described in any of the preceding claims.

[0049] According to a specific embodiment of this application, in a fourth aspect, this application provides an electronic device, including: one or more processors; and a storage device for storing one or more programs, which, when executed by the one or more processors, cause the one or more processors to implement the training method for the driver facial expression recognition model as described in any of the preceding claims.

[0050] Compared with the prior art, the above-described solutions of this application have at least the following beneficial effects:

[0051] This application provides a training method, apparatus, medium, and electronic device for a driver expression recognition model. The method includes: acquiring multiple batches of image training sets, and training the driver expression recognition model with these batches until a preset number of training batches is reached. The driver expression recognition model of this application builds on the structural characteristics of a Transformer encoder, combines it with a ResNet18 residual network, and introduces a center loss function to improve the distribution of expression features. The model strengthens the correlation between long-distance feature information in expression images, enabling it to extract discriminative feature information. Training parameters are updated by combining the Softmax cross-entropy loss function with the center loss function. The center loss function reduces the intra-class spacing of features of the same expression category and thereby increases the distance between the features of different expression categories, making it easier for the network to distinguish facial expression features and improving recognition accuracy.

Description of the Drawings

[0052] Figure 1 shows a flowchart of a method for training a driver facial expression recognition model according to an embodiment of this application;

[0053] Figure 2 shows a unit block diagram of a training device for a driver facial expression recognition model according to an embodiment of this application.

Detailed Description

[0054] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0055] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms “a,” “said,” and “the” used in the embodiments of this application and the appended claims are also intended to include the plural forms, and “multiple” generally includes at least two unless the context clearly indicates otherwise.

[0056] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.

[0057] It should be understood that although the terms first, second, third, etc., may be used in the embodiments of this application, these descriptions should not be limited to these terms. These terms are only used to distinguish the descriptions. For example, first may also be referred to as second without departing from the scope of the embodiments of this application, and similarly, second may also be referred to as first.

[0058] Depending on the context, the words “if” or “suppose” as used here can be interpreted as “when” or “in response to determination” or “in response to detection.” Similarly, depending on the context, the phrases “if determination” or “if detection (of the stated condition or event)” can be interpreted as “when determination” or “in response to determination” or “when detection (of the stated condition or event)” or “in response to detection (of the stated condition or event).”

[0059] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such an article or device. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the article or device that includes said element.

[0060] It should be noted that any symbols and / or numbers present in the specification that are not marked in the accompanying drawings are not reference numerals.

[0061] The optional embodiments of this application are described in detail below with reference to the accompanying drawings.

[0062] The embodiments provided in this application are embodiments of a training method for a driver facial expression recognition model.

[0063] The embodiments of this application are described in detail below with reference to Figure 1.

[0064] Step S101: Obtain multiple batches of image training sets.

[0065] Each batch of image training sets includes multiple training images, each training image includes the driver's facial feature information, and the driver's expression type in each training image belongs to one of a variety of preset expression types.

[0066] Before training the driver expression recognition model, a large number of training images need to be collected. These images are divided into several training sets for batch training of the driver expression recognition model. Each training image includes facial feature information of various driver expression types to ensure sufficient training of the driver expression recognition model to reach the preset training batch.

[0067] The number of training images in each training set affects the convergence speed of the network. If the number is too small, the network will not be able to converge in time; if the number is too large, the GPU memory may be insufficient, and the system will run slowly. Therefore, based on the performance of the experimental environment, the number of training images in each training set was set to 64. This ensures that the driver expression recognition model can converge in a timely manner while maintaining the system's operating efficiency.

[0068] The preset expression types include: anger, fear, contempt, happiness, sadness, and surprise.

[0069] In some specific embodiments, obtaining multiple batches of image training sets includes the following steps:

[0070] Step S101-1: Obtain multiple original facial images.

[0071] In each original facial image, the driver's expression type belongs to one of several preset expression types.

[0072] The KMU-FED facial expression image dataset in driving scenarios was used as the original image dataset, which contains 1106 original images in driving scenarios.

[0073] The driver's facial region is obtained from each original image in the original image dataset using the Viola-Jones detection algorithm. The original images are then cropped based on the facial region to obtain the original facial image.

[0074] The original image dataset is divided into a training set and a test set in an 8:2 ratio. The training set is used to train the network and fit the model. The images in the training set are further divided into multiple batches according to the training batches, and the images in each batch are randomly selected. The test set is used to evaluate the driver expression recognition model and to measure the performance and classification ability of the optimal model. In this embodiment, the classification accuracy is obtained by evaluating the trained driver expression recognition model on the test set.
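
The 8:2 split and the grouping into batches of 64 can be illustrated with a minimal sketch. It works on image indices only, rounds the training share down, and uses a fixed random seed; these choices, and the function name `split_and_batch`, are assumptions made for illustration and are not stated in the source. In the described method the batches are formed from augmented images, which this sketch does not model.

```python
import random

def split_and_batch(num_images, train_ratio=0.8, batch_size=64, seed=0):
    """Shuffle image indices, split train/test, and group the training part into batches."""
    rng = random.Random(seed)
    indices = list(range(num_images))
    rng.shuffle(indices)  # random selection of the images that land in each batch
    num_train = int(num_images * train_ratio)  # assumption: round the 80% share down
    train, test = indices[:num_train], indices[num_train:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return batches, test

# 1106 is the number of original images reported for the dataset.
batches, test = split_and_batch(1106)
print(len(batches), len(batches[-1]), len(test))  # 14 52 222
```

Under these assumptions the 1106 images give 884 training images (13 full batches plus one batch of 52) and 222 test images.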

[0075] Step S101-2: Adjust the size of each original facial image to obtain a standard-sized image of the corresponding original facial image.

[0076] For example, the dimensions of a standard-sized image are 224×224.

[0077] Step S101-3: Perform data augmentation processing on each standard-size image to obtain multiple training images corresponding to the standard-size image.

[0078] The data augmentation performed on each standard-sized image expands it using four methods: random horizontal flipping, random rotation, random cropping, and random occlusion. The probability of random horizontal flipping is set to the default value of 0.5. The rotation angle for random rotation is within 40 degrees. Random cropping pads the image with a padding size of 32 and then crops a 224×224 region from a random position, so the cropped image keeps the standard size. The occlusion probability for random occlusion uses the default value. Data augmentation increases the number of training images available to the driver expression recognition model, giving the model better generalization ability.
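
Two of the four augmentations, random horizontal flipping and padded random cropping, can be sketched on a toy image as follows. The sketch uses an 8×8 grid with padding 2 in place of a 224×224 image with padding 32, and fills the padding with zeros; the fill value and the helper names are assumptions. Random rotation and random occlusion are omitted.

```python
import random

def random_horizontal_flip(img, rng, p=0.5):
    """Mirror each row with probability p; otherwise return the image unchanged."""
    return [row[::-1] for row in img] if rng.random() < p else img

def pad_and_random_crop(img, rng, padding, fill=0):
    """Pad every side by `padding`, then crop a window of the original size."""
    h, w = len(img), len(img[0])
    blank = [fill] * (w + 2 * padding)
    padded = ([blank[:] for _ in range(padding)]
              + [[fill] * padding + list(row) + [fill] * padding for row in img]
              + [blank[:] for _ in range(padding)])
    top = rng.randint(0, 2 * padding)   # inclusive bounds keep the window inside
    left = rng.randint(0, 2 * padding)
    return [row[left:left + w] for row in padded[top:top + h]]

rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]  # toy 8x8 stand-in
augmented = pad_and_random_crop(random_horizontal_flip(image, rng), rng, padding=2)
print(len(augmented), len(augmented[0]))  # 8 8
```

The output size equals the input size, which is why the cropped image in the text is still 224×224.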

[0079] Step S101-4: Distribute the multiple training images in batches to obtain multiple batches of image training sets.

[0080] Each pixel in the input training image is divided by 255 to normalize it. The normalized image is then standardized toward a standard normal distribution (zero mean and unit variance), which makes the model easier to converge and accelerates the training process.
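
A minimal sketch of the two preprocessing steps on a flat list of pixel values follows. The source does not say which mean and standard deviation are used for standardization, so the sketch assumes they are computed from the image itself; dataset-level statistics would be an equally plausible reading.

```python
from statistics import mean, pstdev

def normalize_and_standardize(pixels):
    """Scale 8-bit pixel values to [0, 1], then shift and scale to zero mean, unit variance."""
    scaled = [p / 255 for p in pixels]
    mu = mean(scaled)
    # Assumption: per-image statistics. Guard against a constant image (sigma == 0).
    sigma = pstdev(scaled) or 1.0
    return [(s - mu) / sigma for s in scaled]

standardized = normalize_and_standardize([0, 64, 128, 255])
print(standardized)
```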

[0081] Step S102: Use the multiple batches of image training sets to train the driver expression recognition model to reach the preset training batch.

[0082] The driver expression recognition model can classify driver expressions. It includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition capability based on the calculation results of the Softmax cross-entropy loss function and the center loss function.

[0083] In this embodiment, the training parameters are updated by combining the Softmax cross-entropy loss function with the center loss function. The loss function measures the difference between the predicted and actual values. For facial expression recognition tasks, deep learning features need to be not only separable but also discriminative; discriminative power means compact intra-class variation and separable inter-class differences. The Softmax loss function can only separate the features of different categories and does not constrain the features within each category. The center loss function effectively enhances the discriminative power of the features learned by the network, so the Softmax cross-entropy loss function and the center loss function are combined.

[0084] The combination of the Softmax cross-entropy loss function and the center loss function is expressed as:

[0085] L=Ls+λLc.

[0086] Where Ls represents the Softmax cross-entropy loss function, Lc represents the center loss function, and λ represents the adjustment coefficient.

[0087] By changing the value of λ, the intra-class distance of samples of the same expression category is effectively reduced, thereby increasing the distance between samples of different expression categories and improving the network's ability to judge expression features.
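
The combined loss L = Ls + λLc can be illustrated for a single sample with the sketch below. The source does not spell out the formula for Lc, so the sketch assumes the commonly used form, half the squared Euclidean distance between a feature vector and the center of its class; the function names and the toy centers are hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def center_loss(feature, center):
    """Assumed form of Lc: half the squared distance between a feature and its class center."""
    return 0.5 * sum((f - c) ** 2 for f, c in zip(feature, center))

def combined_loss(logits, feature, label, centers, lam=0.5):
    """L = Ls + lam * Lc for one sample, with lam = 0.5 as in the optional embodiment."""
    return softmax_cross_entropy(logits, label) + lam * center_loss(feature, centers[label])

centers = {0: [0.0, 0.0], 1: [4.0, 4.0]}  # toy class centers in a 2-D feature space
near = combined_loss([2.0, 0.0], [0.1, -0.1], 0, centers)
far = combined_loss([2.0, 0.0], [2.0, 2.0], 0, centers)
print(near < far)  # True
```

Both samples have the same logits and therefore the same Ls; only the distance to the class center differs, so the sample whose feature lies farther from its center receives the larger total loss.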

[0088] The center loss function is combined with the Softmax loss function to measure how close the predicted output is to the actual output. Stochastic gradient descent (SGD) is used as the optimizer to minimize the loss function during training. The momentum parameter is set to 0.9, and the initial learning rate is 0.1. Every 20 training batches, the learning rate is adjusted to 0.3 times its previous value. This allows a larger learning rate in the early stages of training so that the network quickly approaches the optimum; once it is close, the learning rate is gradually reduced to locate the optimum more precisely. After 200 training epochs, the preset training batch is reached and training is completed.
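
The step decay described above can be written as a short function. The sketch assumes that the decay interval of 20 and the total of 200 are counted in the same unit, called `epoch` here; the source uses both "training batch" and "epoch" for it.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.3, step=20):
    """Learning rate after multiplying by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

for epoch in (0, 19, 20, 40, 199):
    print(epoch, learning_rate(epoch))
```

Over 200 epochs the rate is reduced nine times, ending at 0.1 × 0.3⁹, roughly 2 × 10⁻⁶.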

[0089] The driver facial expression recognition model in this application, based on the structural characteristics of the Transformer encoder, combines it with a ResNet18 residual network and introduces a center loss function to improve the distribution of facial expression features. This driver facial expression recognition model strengthens the correlation between long-distance features in facial expression images, enabling it to extract discriminative feature information. Training parameters are updated by combining the Softmax cross-entropy loss function with the center loss function. The introduction of the center loss function improves the distribution of facial expression features by reducing the internal spacing of expressions of the same category, thereby increasing the distance between facial expression features of different categories. This makes it easier for the network to distinguish facial expression features and improves recognition accuracy.

[0090] In some specific embodiments, training the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch includes the following steps:

[0091] Step S102-1: Input each training image from each batch of image training sets into the ResNet18 sub-model in sequence to obtain the facial feature sequence of the corresponding training image.

[0092] In some specific embodiments, the ResNet18 sub-model includes: Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer.

[0093] In this specific embodiment, the first five convolutional layers are used, while the average pooling and fully connected layers in the ResNet18 sub-model are omitted. See the ResNet18 parameter table below:

[0094]

[0095] Accordingly, the step of sequentially inputting each training image from each batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image includes the following steps:

[0096] Step S102-1-1: Obtain the feature map of each training image in each batch of image training set through Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer.

[0097] The ResNet18 sub-model was downsampled four times to obtain a feature map with a size of 7×7 and 512 channels.

[0098] Step S102-1-2: Adjust the dimensions of the feature map of each training image to obtain the facial feature sequence of the corresponding training image.

[0099] To adapt the output of the ResNet18 sub-model to the input of the Transformer encoding sub-model, this specific embodiment adjusts the dimensions of the feature map of each training image, flattening the 7×7 feature map to generate a facial feature sequence. The facial feature sequence has a length of 49 (7×7 positions), with 512 channels at each position.
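
The dimension adjustment can be sketched with nested lists: a map indexed as [channel][row][column] becomes a sequence indexed as [position][channel]. The row-major ordering of positions is an assumption, and the feature values are placeholders.

```python
def flatten_feature_map(feature_map):
    """Turn a [channels][height][width] map into a [height*width][channels] sequence."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[feature_map[c][y][x] for c in range(channels)]
            for y in range(height) for x in range(width)]

# Toy stand-in for the ResNet18 output: 512 channels of 7x7 values.
feature_map = [[[float(c) for _ in range(7)] for _ in range(7)] for c in range(512)]
sequence = flatten_feature_map(feature_map)
print(len(sequence), len(sequence[0]))  # 49 512
```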

[0100] Step S102-2: Input the facial feature sequence of each training image into the multiple Transformer encoding sub-models to obtain the training expression type of the corresponding training image.

[0101] After the facial feature sequence is input into the Transformer encoder, a class token and a Position Embedding encoding are also added. The class token is a vector used for classification; this vector is a trainable parameter used to output the final classification result. The Position Embedding encoding is also a trainable parameter. Because the Transformer uses parallel computation and abandons sequential operations, position encoding needs to be added to the input sequence to obtain the order information of the input sequence. The position encoding is used to obtain the absolute or relative position information of the sequence.
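
A minimal sketch of how a class token and a position embedding might be attached to the 49-element sequence is given below. It assumes the class token is prepended and that the position embedding is added element-wise, which gives 50 tokens of 512 channels; the source does not state either detail, and the random values only stand in for trainable parameters.

```python
import random

def add_class_token_and_positions(sequence, class_token, position_embedding):
    """Prepend the class token, then add one position vector to every token."""
    tokens = [class_token] + sequence
    if len(tokens) != len(position_embedding):
        raise ValueError("need one position vector per token, class token included")
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(tokens, position_embedding)]

rng = random.Random(0)
dim, length = 512, 49
sequence = [[0.0] * dim for _ in range(length)]  # placeholder facial feature sequence
# In training these two would be learnable parameters; here they are random placeholders.
class_token = [rng.gauss(0, 0.02) for _ in range(dim)]
position_embedding = [[rng.gauss(0, 0.02) for _ in range(dim)] for _ in range(length + 1)]
encoder_input = add_class_token_and_positions(sequence, class_token, position_embedding)
print(len(encoder_input), len(encoder_input[0]))  # 50 512
```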

[0102] In some specific embodiments, the multiple Transformer encoding sub-models include eight Transformer encoding sub-models.

[0103] The Transformer encoding sub-model comprises: a norm layer, a multi-head self-attention layer, an MLP (Multilayer Perceptron), and a second Softmax unit. The facial feature sequence of each training image is input into these multiple Transformer encoding sub-models. After normalization by the norm layer, three feature vectors are obtained: a query vector, a key vector, and a value vector. The weights of the value vectors are obtained by calculating a matching function between the query vector and the key vector. The weights are then normalized by the second Softmax unit, and the normalized weights are summed with the facial feature sequence to obtain a sum. The sum is then processed by the multi-head self-attention layer to obtain an attention value. Finally, the attention value is normalized by the norm layer and input into the MLP to obtain the training expression type for each training image.
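
The query, key, and value computation can be illustrated with the standard scaled dot-product attention used in Transformer encoders. This is a sketch of the general technique rather than a transcription of the steps above: the dot product as matching function, the scaling by the square root of the dimension, a single head, and identity projections are all assumptions.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of scores."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, weight the value vectors by softmax(q . k / sqrt(d))."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: identity projections, so queries, keys and values are the tokens themselves.
attended = scaled_dot_product_attention(tokens, tokens, tokens)
print(attended[0])
```

Each output row is a weighted average of all value vectors, which is how a token at one position of the face can draw on information from distant positions.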

[0104] Step S102-3: When the number of batches trained is lower than the preset number of training batches, the step of sequentially inputting each training image in the next batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image is triggered.

[0105] Step S102-4: When the number of batches trained reaches the preset number of training batches, training ends.

[0106] When training the driver expression recognition model, the training expression type that the model outputs for a training image may be the same as or different from the preset expression type of that image. The training effect is monitored by computing the training classification accuracy. When the number of batches trained reaches the preset number of training batches, the expected effect is considered achieved and training of the driver expression recognition model stops. While the number of batches trained is below the preset number, the next batch of images is obtained and training continues until the preset number is reached.

Optionally, λ in the center loss function is equal to 0.5.

[0107] The preset training batch includes 200 batches. That is, the multi-batch image training set includes 200 batches of image training sets. After training the driver expression recognition model for 200 batches, the loss function value and training accuracy of the driver expression recognition model converge and tend to stabilize.

[0108] This application also provides an apparatus embodiment that follows the above embodiments, used to implement the method steps described in the above embodiments. The interpretation of the same names is the same as that in the above embodiments, and the same technical effects are achieved. Therefore, it will not be repeated here.

[0109] As shown in Figure 2, this application provides a training device 200 for a driver facial expression recognition model, comprising:

[0110] The acquisition unit 201 is used to acquire multiple batches of image training sets, wherein each batch of image training sets includes multiple training images, each training image includes facial feature information of the driver, and the expression type of the driver in each training image belongs to one of multiple preset expression types.

[0111] Training unit 202 is used to train the driver expression recognition model using the multiple batches of image training sets to reach a preset training batch;

[0112] The driver expression recognition model can classify driver expressions. It includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition capability based on the calculation results of the Softmax cross-entropy loss function and the center loss function.

[0113] Optionally, training the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch includes:

[0114] Each training image in each batch of image training set is sequentially input into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image;

[0115] The facial feature sequence of each training image is input into the multiple Transformer encoding sub-models to obtain the training expression type of the corresponding training image;

[0116] When the number of batches trained so far is lower than the preset training batch, the step of sequentially inputting each training image into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image is triggered again for the next batch of image training set.

[0117] Training ends when the number of batches trained reaches the preset training batch.
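The batch-count stopping rule in the steps above can be sketched as follows. This is a minimal illustration only: `train_until_preset` and `dummy_step` are hypothetical names, and `dummy_step` merely stands in for the forward pass through the ResNet18 and Transformer encoding sub-models and the parameter update, which are not reproduced here.

```python
from typing import Callable, Iterable, List, Sequence

PRESET_TRAINING_BATCHES = 200  # value given in the embodiment


def train_until_preset(
    batches: Iterable[Sequence[object]],
    train_step: Callable[[Sequence[object]], float],
    preset_batches: int = PRESET_TRAINING_BATCHES,
) -> List[float]:
    """Run one training step per batch; stop once the preset count is reached."""
    losses: List[float] = []
    for trained, batch in enumerate(batches, start=1):
        losses.append(train_step(batch))
        if trained >= preset_batches:
            break  # preset training batch reached: training ends
        # otherwise the loop moves on to the next batch of the training set
    return losses


def dummy_step(batch: Sequence[object]) -> float:
    """Hypothetical placeholder for one model update; returns a fake loss."""
    return 1.0 / (1 + len(batch))


# 500 batches are available, but training stops after the preset 200.
history = train_until_preset(([0] * 4 for _ in range(500)), dummy_step)
print(len(history))
```

If fewer batches than the preset count are supplied, the loop simply ends when the data runs out.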

[0118] Optionally, the ResNet18 sub-model includes: Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0119] Accordingly, the step of sequentially inputting each training image from each batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image includes:

[0120] The feature map of each training image in each batch of training images is obtained through Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer;

[0121] The feature maps of each training image are dimensionally adjusted to obtain the facial feature sequences of the corresponding training images.
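The text does not spell out the dimensional adjustment. One common choice in CNN plus Transformer hybrids, assumed here, is to flatten the spatial grid of the final feature map so that each spatial position becomes one token whose length equals the channel count. The sketch below uses plain nested lists, and `feature_map_to_sequence` is a hypothetical helper name.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def feature_map_to_sequence(fmap: FeatureMap) -> List[List[float]]:
    """Flatten a C x H x W feature map into H*W tokens, each of length C."""
    channels = len(fmap)
    height = len(fmap[0])
    width = len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


# Toy feature map with 2 channels and a 2 x 3 spatial grid.
toy = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[10.0, 20.0, 30.0], [40.0, 50.0, 60.0]],
]
sequence = feature_map_to_sequence(toy)
print(len(sequence), len(sequence[0]))
```

Under this assumption, a standard ResNet18 fed a 224 x 224 image would yield a 512 x 7 x 7 map after Conv5_x, that is, 49 tokens of length 512. The input size actually used is not stated in this section.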

[0122] Optionally, the multiple Transformer encoding sub-models include eight Transformer encoding sub-models.
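As a rough illustration of stacking eight encoding sub-models, the sketch below chains eight simplified blocks, each applying layer normalization, single-head scaled dot-product self-attention, and a residual sum with the input sequence. It is a simplification under stated assumptions: the learned query, key, and value projections, the multiple heads, and the MLP are all omitted, and every function name is hypothetical.

```python
import math
from typing import List

Tokens = List[List[float]]
NUM_ENCODERS = 8  # number of Transformer encoding sub-models in the embodiment


def layer_norm(x: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance (no learned scale)."""
    out = []
    for token in x:
        mean = sum(token) / len(token)
        var = sum((v - mean) ** 2 for v in token) / len(token)
        out.append([(v - mean) / math.sqrt(var + eps) for v in token])
    return out


def softmax(row: List[float]) -> List[float]:
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Tokens) -> Tokens:
    """Scaled dot-product attention with Q = K = V = x (projections omitted)."""
    dim = len(x[0])
    scale = math.sqrt(dim)
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in x]) for q in x
    ]
    return [[sum(w * v[d] for w, v in zip(row, x)) for d in range(dim)] for row in weights]


def encoder_block(x: Tokens) -> Tokens:
    """Norm, attention, then residual sum with the input sequence."""
    attended = self_attention(layer_norm(x))
    return [[a + b for a, b in zip(t, u)] for t, u in zip(x, attended)]


def encode(x: Tokens, num_blocks: int = NUM_ENCODERS) -> Tokens:
    for _ in range(num_blocks):
        x = encoder_block(x)
    return x


tokens = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0], [0.0, 2.0, 1.0]]
encoded = encode(tokens)
print(len(encoded), len(encoded[0]))
```

Because every token attends to every other token in each block, distant positions in the feature sequence can influence one another, which is the long-distance correlation the description refers to.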

[0123] Optionally, obtaining multiple batches of image training sets includes:

[0124] Acquire multiple raw facial images, where the driver's expression in each raw facial image belongs to one of multiple preset expression types;

[0125] Each original facial image is resized to obtain a standard-sized image of the corresponding original facial image.

[0126] Data augmentation is performed on each standard-sized image to obtain multiple training images corresponding to the standard-sized images;

[0127] The multiple training images are allocated in batches to obtain multiple batches of image training sets.
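The four preparation steps above can be sketched end to end as follows. The specific choices are assumptions made for illustration: nearest-neighbour resizing, a horizontal flip as the only augmentation, and sequential allocation into batches. The text does not say which resizing or augmentation operations are used, and all function names are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]     # grayscale pixels, indexed as [row][col]
Sample = Tuple[Image, int]  # (image, expression-type label)


def resize_nearest(img: Image, size: int) -> Image:
    """Resize to size x size by nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)] for y in range(size)]


def augment(img: Image) -> List[Image]:
    """Return the image plus a horizontally flipped copy (assumed augmentation)."""
    return [img, [row[::-1] for row in img]]


def build_batches(raw: List[Sample], size: int, batch_size: int) -> List[List[Sample]]:
    """Resize, augment, then allocate the training images into batches."""
    samples: List[Sample] = []
    for img, label in raw:
        standard = resize_nearest(img, size)
        samples.extend((aug, label) for aug in augment(standard))
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Two toy "raw facial images" of different sizes with labels 0 and 1.
raw_images = [
    ([[1, 2], [3, 4]], 0),
    ([[5, 6, 7], [8, 9, 10], [11, 12, 13]], 1),
]
batches = build_batches(raw_images, size=4, batch_size=3)
print([len(b) for b in batches])
```

Each augmented copy keeps the label of its source image, so augmentation enlarges the training set without changing the set of preset expression types.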

[0128] Optionally, λ in the center loss function is equal to 0.5.
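To make the role of λ concrete, the sketch below combines the two losses as total = cross-entropy + λ × center loss. It assumes the usual center loss form, half the squared distance between each feature and the center of its class, averaged over the batch here. The exact formula used by the application is not given in this section, and the function names and toy values are illustrative.

```python
import math
from typing import Dict, List

LAMBDA = 0.5  # weight of the center loss, as stated in the embodiment


def softmax_cross_entropy(logits: List[List[float]], labels: List[int]) -> float:
    """Mean Softmax cross-entropy, computed with the log-sum-exp trick."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[y]
    return total / len(labels)


def center_loss(
    features: List[List[float]], labels: List[int], centers: Dict[int, List[float]]
) -> float:
    """Half the squared distance to the class center, averaged over the batch."""
    total = 0.0
    for feat, y in zip(features, labels):
        total += sum((f - c) ** 2 for f, c in zip(feat, centers[y]))
    return 0.5 * total / len(labels)


def total_loss(logits, features, labels, centers, lam: float = LAMBDA) -> float:
    return softmax_cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)


# Toy batch: two samples, two expression classes, 2-D features.
logits = [[2.0, 0.0], [0.0, 2.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
centers = {0: [1.0, 0.0], 1: [0.0, 0.0]}
loss = total_loss(logits, features, labels, centers)
print(round(loss, 4))
```

The cross-entropy term separates the classes, while the center term pulls features of the same class toward a shared center; λ sets the balance between the two.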

[0129] Optionally, the preset training batch includes 200 batches.

[0130] The driver facial expression recognition model in this application, based on the structural characteristics of the Transformer encoder, combines it with a ResNet18 residual network and introduces a center loss function to improve the distribution of facial expression features. This driver facial expression recognition model strengthens the correlation between long-distance features in facial expression images, enabling it to extract discriminative feature information. Training parameters are updated by combining the Softmax cross-entropy loss function with the center loss function. The introduction of the center loss function improves the distribution of facial expression features by reducing the internal spacing of expressions of the same category, thereby increasing the distance between facial expression features of different categories. This makes it easier for the network to distinguish facial expression features and improves recognition accuracy.

[0131] This embodiment provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method steps described in the above embodiment.

[0132] This application provides a non-volatile computer storage medium storing computer-executable instructions that can perform the steps described in the above embodiments.

[0133] Finally, it should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the others. For parts that are similar or identical between embodiments, the embodiments may be referred to one another. Since the systems or apparatus disclosed in the embodiments correspond to the methods disclosed in the embodiments, their descriptions are relatively brief; for relevant details, refer to the description of the method.

[0134] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A training method for a driver facial expression recognition model, characterized in that it comprises: Multiple batches of image training sets are obtained. Each batch of image training sets includes multiple training images. Each training image includes facial feature information of the driver. The driver's expression type in each training image belongs to one of multiple preset expression types. The driver's facial expression recognition model is trained using the multiple batches of image training sets to reach a preset training batch; The driver expression recognition model can classify driver expressions. It includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition ability based on the calculation results of the Softmax cross-entropy loss function and the center loss function. The Transformer encoding sub-model includes: a norm layer, a multi-head self-attention layer, an MLP (Multilayer Perceptron), and a second Softmax unit. The facial feature sequence of each training image is input into these multiple Transformer encoding sub-models. After normalization by the norm layer, three feature vectors are obtained: a query vector, a key vector, and a value vector. The weights of the value vectors are obtained by calculating the matching function between the query vector and the key vector. The weights are normalized by the second Softmax unit, and the normalized weights are summed with the facial feature sequence to obtain a sum. The sum is processed by the multi-head self-attention layer to obtain an attention value. The attention value is normalized by the norm layer and then input into the MLP to finally obtain the training expression type for each training image.

2. The method according to claim 1, characterized in that, The step of training the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch includes: Each training image in each batch of image training set is sequentially input into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image; The facial feature sequence of each training image is input into the multiple Transformer encoding sub-models to obtain the training expression type of the corresponding training image; When the number of batches trained so far is lower than the preset training batch, the step of sequentially inputting each training image into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image is triggered again for the next batch of image training set. Training ends when the number of batches trained reaches the preset training batch.

3. The method according to claim 2, characterized in that, The ResNet18 sub-model includes: Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer, and Conv5_x layer; Accordingly, the step of sequentially inputting each training image from each batch of image training sets into the ResNet18 sub-model to obtain the facial feature sequence of the corresponding training image includes: The feature map of each training image in each batch of training images is obtained through Conv1 layer, Conv2_x layer, Conv3_x layer, Conv4_x layer and Conv5_x layer; The feature maps of each training image are dimensionally adjusted to obtain the facial feature sequences of the corresponding training images.

4. The method according to claim 1, characterized in that, The multiple Transformer encoding sub-models include 8 Transformer encoding sub-models.

5. The method according to claim 1, characterized in that, The acquisition of multiple batches of image training sets includes: Acquire multiple raw facial images, where the driver's expression in each raw facial image belongs to one of multiple preset expression types; Each original facial image is resized to obtain a standard-sized image of the corresponding original facial image. Data augmentation is performed on each standard-sized image to obtain multiple training images corresponding to the standard-sized images; The multiple training images are allocated in batches to obtain multiple batches of image training sets.

6. The method according to claim 1, characterized in that, The λ value in the center loss function is equal to 0.5.

7. The method according to claim 6, characterized in that, The preset training batch includes 200 batches.

8. A training device for a driver facial expression recognition model, characterized in that it comprises: An acquisition unit, used to acquire multiple batches of image training sets, wherein each batch of image training sets includes multiple training images, each training image includes facial feature information of the driver, and the driver's expression type in each training image belongs to one of multiple preset expression types. A training unit, used to train the driver's facial expression recognition model using the multiple batches of image training sets to reach a preset training batch. The driver expression recognition model includes a neural network model that combines a ResNet18 sub-model and multiple Transformer encoding sub-models. The driver expression recognition model improves expression recognition capability based on the calculation results of the Softmax cross-entropy loss function and the center loss function. The Transformer encoding sub-model includes: a norm layer, a multi-head self-attention layer, an MLP (Multilayer Perceptron), and a second Softmax unit. The facial feature sequence of each training image is input into these multiple Transformer encoding sub-models. After normalization by the norm layer, three feature vectors are obtained: a query vector, a key vector, and a value vector. The weights of the value vectors are obtained by calculating the matching function between the query vector and the key vector. The weights are normalized by the second Softmax unit, and the normalized weights are summed with the facial feature sequence to obtain a sum. The sum is processed by the multi-head self-attention layer to obtain an attention value. The attention value is normalized by the norm layer and then input into the MLP to finally obtain the training expression type for each training image.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the program is executed by a processor, it implements the method as described in any one of claims 1 to 7.

10. An electronic device, characterized in that it comprises: One or more processors; and a storage device for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method as described in any one of claims 1 to 7.