Training method for an age classification model, age classification method, and system

By constructing an age-segmented dataset with sampling weights and optimizing the model with a feature discrimination module, a ViT module, and a cost-sensitive loss, the method addresses the poor accuracy and stability of face age prediction and targets accurate prediction under diverse face image conditions.

CN116665268B · Active · Publication Date: 2026-06-23 · XIAMEN MEITUZHIJIA TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XIAMEN MEITUZHIJIA TECH
Filing Date: 2023-05-19
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing face age prediction methods suffer from poor accuracy and stability when face image information is diverse and the number of face images of different age values is unevenly distributed.

Method used

A face age training dataset is constructed, divided into multiple age group subsets, and assigned data sampling weights. The model is then optimized using a feature discrimination module and a ViT module, with multi-class supervision from a cost-sensitive loss, and the prediction output combines the Softmax confidence scores of all age categories.

Benefits of technology

It achieves accurate and stable facial age prediction under diverse facial image conditions, and improves the balance of data proportions of different age groups and feature discrimination ability.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an age classification model generation method, an age classification method and system, and a training method. The generation method includes: randomly selecting data x_i from a face age training dataset according to set sampling weights and inputting it into a face feature extraction network to obtain a face feature map f_i; inputting the face feature map f_i into a feature discrimination module D_f for a first optimization process; inputting the feature vector corresponding to the face feature map into a first fully connected layer for a second optimization process; and performing iterative training to obtain the age classification model. In terms of training data, data sampling weights are set for face images of different age groups to balance the data proportions of the age groups; in terms of model optimization, a feature discrimination module and a ViT module are introduced to reduce the ambiguity between face images of different age groups, achieving accurate and stable prediction.

Description

Technical Field

[0001] This invention relates to the field of facial age prediction technology, and in particular to an age classification model generation method, an age classification method, and a system.

Background Technology

[0002] In recent years, intelligent products based on facial image recognition have emerged in large numbers. Facial age attribute, as one of the important attributes in facial images, has been widely used in various facial recognition application scenarios, such as video surveillance, product recommendation, and market analysis. However, in practical applications, diverse facial image information, such as lighting, occlusion, and makeup, will reduce the discriminative power between faces of different ages, posing a significant challenge to facial age prediction tasks.

[0003] Existing methods for facial age prediction are mainly divided into regression methods and classification methods. Both encode the features of different facial images with a feature extraction network and then feed these features into a regression or classification output module for age prediction. Regression methods encode the true age into a vector of a certain length for regression prediction, and finally calculate the age value from the regression values based on a certain threshold. Classification methods treat each age value in the true age set as a category for classification prediction, and finally output the age value of the category with the highest classification confidence.

[0004] However, the accuracy and stability of both methods are poor when the facial image information is diverse and the number of facial images of different age values is not evenly distributed.

Summary of the Invention

[0005] The main objective of this invention is to provide an age classification model generation method, an age classification method, and a system, aiming to solve the technical problem that existing age classification model generation methods suffer from poor accuracy and stability in predicting facial age when facial image information is diverse and the number of facial images of different age values is not evenly distributed.

[0006] To achieve the above objectives, this invention provides a method for generating an age classification model, comprising the following steps: constructing a face age training dataset, wherein the training dataset includes multiple face image data labeled with real ages; dividing the training dataset into multiple age group subsets and setting weights for data sampling; randomly selecting data x_i from the training dataset according to the set weights and inputting it into a face feature extraction network to obtain a face feature map f_i; inputting the face feature map f_i into a feature discrimination module D_f and performing a first optimization process; inputting the face feature map f_i into a ViT module to obtain the corresponding feature vector, and inputting the feature vector into a first fully connected layer for a second optimization process; and iteratively training and updating the network parameters until the first and second optimization processes are completed, resulting in the age classification model.

[0007] Optionally, dividing the training dataset into multiple age group subsets and setting the weights for data sampling specifically includes the following steps: based on a preset subset age range, the training dataset is divided into n age group subsets x_1 to x_n; 1 to n are used as the data labels corresponding to the age group subsets x_1 to x_n, giving the training dataset X = {x_1, x_2, ..., x_n}, where 1 ≤ j ≤ n and j is an integer; the data sampling weight w_j of each age group subset x_j is set according to the sampling weight calculation formula [formula not reproduced in source], where a_j represents the amount of data in age group subset x_j.

[0008] Optionally, the first optimization process specifically involves: optimizing the feature discrimination module D_f according to the discrimination loss calculation formula so that the discrimination loss is minimized [formula not reproduced in source], enabling the feature discrimination module D_f to discriminate and classify the face feature maps f_i of different age groups. The formula includes a maximum likelihood estimation term, and y_i is the real age category corresponding to face feature map f_i.

[0009] Optionally, the network structure of the feature discrimination module D_f is a fully convolutional network, which contains multiple convolutional modules and a second fully connected layer. The dimension of the second fully connected layer is the same as the number of age group subsets.

[0010] Optionally, the network structure of the ViT module includes a multi-head self-attention (MHSA) structure and a multilayer perceptron (MLP) structure; the specific computation method of the ViT module is as follows:

[0011] $f_{i+1}^{B,H*W,C} = f_i^{B,H*W,C} + \mathrm{MHSA}(\mathrm{LN}(f_i^{B,H*W,C}))$;

[0012] $f_{i+2}^{B,H*W,C} = f_{i+1}^{B,H*W,C} + \mathrm{MLP}(\mathrm{LN}(f_{i+1}^{B,H*W,C}))$;

[0013] where MHSA(Q, K, V) = [formula not reproduced in source];

[0014] B represents the number of feature maps input to the MHSA structure, H is the height of the face feature map f_i, W is the width of the face feature map f_i, C is the number of channels of the face feature map f_i, and LN represents the layer normalization operation; Q, K, and V are the query vector, key vector, and value vector obtained from the face feature map f_i through learnable linear projection parameters, and b is the attention bias parameter used as the position encoding.

[0015] Optionally, the first fully connected layer has the same dimension as the number of different age values contained in the training dataset, with each age value corresponding to one dimension.

[0016] Optionally, the second optimization process specifically involves optimizing the ViT module based on the cost-sensitive regularization loss calculation formula, minimizing the cost-sensitive regularization loss.

[0017] where the predicted age is the output of the first fully connected layer; y_i is the real age category corresponding to face feature map f_i; the formula [not reproduced in source] includes a maximum likelihood estimation term; λ is the weight parameter; M^(2)(y, ·) is the cost matrix used to measure the Euclidean distance between different age categories; and the final term denotes the scalar product of the predicted output with the elements of M^(2) in the row position corresponding to y.

[0018] Optionally, after the age classification model is obtained, the network parameters of the model are fixed and forward prediction is performed. During forward age prediction, the Softmax confidence scores of all age categories are combined to produce the prediction output. The specific calculation formula is as follows: [formula not reproduced in source]; where N is the number of different age values contained in the training dataset, and the predicted age is the output of the first fully connected layer.

[0019] This invention also provides an age classification method, which uses an age classification model generated by the above-mentioned age classification model generation method for classification, and includes at least the following steps: acquiring a face image to be predicted and extracting face features from it to obtain a face feature map f_i; inputting the face feature map f_i into the feature discrimination module and the ViT module respectively to obtain the feature vector corresponding to the face feature map f_i; inputting the feature vector into the first fully connected layer to obtain the corresponding predicted age output; and combining the predicted age output of the first fully connected layer with the confidence weight of each age category to obtain the age classification result.

[0020] Corresponding to the age classification method, the present invention also provides an age classification system, including: a face feature extraction module, used to acquire a face image to be predicted and extract face features from it to obtain a face feature map f_i; a feature discrimination module, used to acquire the face feature map f_i; a ViT module, used to acquire the face feature map f_i, obtain the feature vector corresponding to the face feature map f_i, and input it into the first fully connected layer; a first fully connected layer, used to acquire the feature vector and output the corresponding predicted age; and a Softmax classification module, used to combine the predicted age output of the first fully connected layer to obtain the predicted age category and output the age classification result according to the confidence weight.

[0021] The beneficial effects of this invention are:

[0022] (1) This invention balances the proportion of data from different age groups by setting weights for data sampling of face images of different age groups in terms of training data; in terms of model optimization, on the one hand, a feature discriminator is introduced to learn the discriminative nature of face features between different age groups, and on the other hand, the ViT module is introduced to aggregate global discriminative information of features at the high level of the network, which can reduce the ambiguity of faces of different age groups and achieve accurate and stable face age prediction.

[0023] (2) This invention divides the training dataset into age groups according to a preset subset age range, and then sets the data sampling weight w_j of each age group subset x_j according to the sampling weight calculation formula, which effectively balances the proportion of data from different age groups and thus improves training effectiveness.

[0024] (3) This invention improves the discriminative power of facial features at different age values by using a cost-sensitive loss to perform multi-class supervision on different age values;

[0025] (4) When performing forward age prediction, this invention combines the Softmax confidence of each age category to produce the prediction output, which can achieve more accurate and stable face age prediction.

Description of the Drawings

[0026] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this invention, illustrate exemplary embodiments of the invention and are used to explain the invention, but do not constitute an undue limitation of the invention. In the drawings:

[0027] Figure 1 is a simplified flowchart of the age classification model generation method of the present invention;

[0028] Figure 2 is a schematic diagram of the network structure of the age classification model of the present invention;

[0029] Figure 3 is a simplified flowchart of the age classification method of the present invention.

Detailed Description

[0030] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0031] As shown in Figure 1, an age classification model generation method of the present invention includes the following steps: constructing a face age training dataset, wherein the training dataset includes multiple face image data labeled with real ages; dividing the training dataset into multiple age group subsets and setting weights for data sampling; randomly selecting data x_i (that is, a data point from one of the age group subsets) from the training dataset according to the set weights and inputting it into a face feature extraction network to obtain a face feature map f_i; inputting the face feature map f_i into a feature discrimination module D_f and performing a first optimization process; inputting the face feature map f_i into a ViT module to obtain the corresponding feature vector, and inputting the feature vector into a first fully connected layer for a second optimization process; and iteratively training and updating the network parameters until the first and second optimization processes are completed, resulting in the age classification model.

[0032] Preferably, the face image data size is 224*224 pixels, and the age range is 0 to 80 years old. Therefore, the training dataset includes face image data with labeled real ages of 0, 1, 2, 3, ..., 78, 79, 80.

[0033] This invention balances the proportion of data from different age groups by setting weights for data sampling of facial images of different age groups in the training data. In terms of model optimization, on the one hand, a feature discriminator is introduced to learn the discriminative properties of facial features between different age groups. On the other hand, a ViT module is introduced to aggregate global discriminative information of features at the high level of the network, which can reduce the ambiguity of facial information of different age groups and achieve accurate and stable facial age prediction.

[0034] This invention is further illustrated below with reference to Figure 1 and Figure 2. In this embodiment, dividing the training dataset into multiple age group subsets and setting the weights for data sampling specifically includes the following steps:

[0035] Based on the preset subset age range, the training dataset is divided into n age group subsets x_1 to x_n; 1 to n are used as the data labels corresponding to the age group subsets x_1 to x_n, giving the training dataset X = {x_1, x_2, ..., x_n}, where 1 ≤ j ≤ n and j is an integer;

[0036] The data sampling weight w_j of each age group subset x_j is set according to the sampling weight calculation formula, which is as follows:

[0037] [formula not reproduced in source]

[0038] where a_j represents the amount of data in age group subset x_j.

[0039] Preferably, the preset subset age range is 10 years, so n = 8. Subset x_1 contains the face data labeled with a real age greater than or equal to 0 and less than or equal to 10, x_2 the data with a real age greater than 10 and less than or equal to 20, x_3 the data with a real age greater than 20 and less than or equal to 30, and so on.

[0040] Therefore, in this embodiment, the training dataset is X = {x_1, x_2, ..., x_8}.
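The 10-year split described in this embodiment can be sketched as a small labeling function. This is an illustrative sketch only: the function names `age_subset_label` and `split_into_subsets` are hypothetical and do not appear in the patent, and the handling of age 0 follows the stated boundary that the first subset covers ages 0 through 10 inclusive.

```python
def age_subset_label(age: int, span: int = 10, max_age: int = 80) -> int:
    """Map a real age to its 1-based age group label j.

    Ages 0..10 fall in subset 1, 11..20 in subset 2, and so on, matching
    the "greater than / less than or equal to" boundaries in the text.
    """
    if not 0 <= age <= max_age:
        raise ValueError(f"age {age} outside 0..{max_age}")
    # Integer ceiling of age / span; age 0 would give 0, so clamp it to 1.
    return max(1, (age + span - 1) // span)


def split_into_subsets(ages, span=10, max_age=80):
    """Group sample indices by age group label; returns {label: [indices]}."""
    n = (max_age + span - 1) // span          # n = 8 for span 10, max age 80
    subsets = {j: [] for j in range(1, n + 1)}
    for idx, age in enumerate(ages):
        subsets[age_subset_label(age, span, max_age)].append(idx)
    return subsets
```

With the preferred settings, `split_into_subsets` always returns eight keys, so empty age groups remain visible when the per-subset data amounts are counted.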

[0041] This invention divides the training dataset into age groups according to a preset subset age range, and then sets the data sampling weight w_j of each age group subset x_j according to the sampling weight calculation formula, which effectively balances the proportion of data from different age groups and thereby improves training effectiveness.
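The sampling weight formula itself is not available in this text, so the sketch below substitutes a common balancing choice as an explicit assumption: every sample in subset x_j receives weight 1 / a_j, which gives each age group the same total sampling mass. The function names are hypothetical, and the patent's actual formula may differ.

```python
import random
from collections import Counter


def subset_sampling_weights(labels):
    """ASSUMED stand-in for the patent's formula: w_j = 1 / a_j.

    labels holds the age group label j of every training sample, and
    a_j is the amount of data in subset x_j.
    """
    counts = Counter(labels)
    return {j: 1.0 / a_j for j, a_j in counts.items()}


def sample_indices(labels, k, seed=0):
    """Draw k sample indices with replacement using the per-subset weights."""
    w = subset_sampling_weights(labels)
    per_sample = [w[j] for j in labels]       # each sample inherits w_j
    rng = random.Random(seed)                 # seeded for reproducibility
    return rng.choices(range(len(labels)), weights=per_sample, k=k)
```

Under this assumption a subset holding 10% of the data is drawn about as often as one holding 90%, which is the balancing effect the paragraph describes.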

[0042] Preferably, the face feature extraction network adopts the MobileOne-S0 network structure, and the final output feature map size of the face feature extraction network is 7x7 pixels. Face feature extraction includes, but is not limited to, the extraction of visual feature information such as facial contour features and texture features.

[0043] In this embodiment, the first optimization process specifically involves: optimizing the feature discrimination module D_f according to the discrimination loss calculation formula so that the discrimination loss is minimized [formula not reproduced in source], enabling the feature discrimination module D_f to discriminate and classify the face feature maps f_i of different age groups and thereby improving the model's ability to distinguish feature information between age groups. The formula includes a maximum likelihood estimation term (which can be simply understood as averaging the loss values obtained over all samples), and y_i is the real age category corresponding to face feature map f_i.

[0044] In this embodiment, the network structure of the feature discrimination module D_f is a fully convolutional network, which contains multiple convolutional modules and a second fully connected layer. The dimension of the second fully connected layer is the same as the number of age group subsets.

[0045] Preferably, the network structure of the feature discrimination module D_f consists of three convolutional modules and one fully connected layer. The convolutional modules all have 3x3 kernels, and the numbers of channels in the three convolutional modules are 16, 32, and 64, respectively. The fully connected layer has 8 dimensions.
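The discrimination loss formula is not available in this text. As a hedged illustration, the sketch below assumes it is the usual cross-entropy over the 8 age group logits produced by the second fully connected layer, averaged over the batch, which is consistent with the description of averaging the loss over all samples. The function names and the 0-based group labels are assumptions for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax of one logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def discrimination_loss(batch_logits, group_labels):
    """ASSUMED form: mean negative log-likelihood of the true age group.

    batch_logits: one list of logits per sample (8 entries in the
    preferred embodiment); group_labels: 0-based age group index per sample.
    """
    total = 0.0
    for logits, y in zip(batch_logits, group_labels):
        total -= log_softmax(logits)[y]
    return total / len(group_labels)
```

Minimizing this quantity pushes D_f to assign high probability to the correct age group for each feature map, which is the discrimination behavior the embodiment describes.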

[0046] In this embodiment, the network structure of the ViT module includes a multi-head self-attention (MHSA) structure and a multilayer perceptron (MLP) structure; the specific computation method of the ViT module is as follows:

[0047] $f_{i+1}^{B,H*W,C} = f_i^{B,H*W,C} + \mathrm{MHSA}(\mathrm{LN}(f_i^{B,H*W,C}))$;

[0048] $f_{i+2}^{B,H*W,C} = f_{i+1}^{B,H*W,C} + \mathrm{MLP}(\mathrm{LN}(f_{i+1}^{B,H*W,C}))$;

[0049] where MHSA(Q, K, V) = [formula not reproduced in source];

[0050] B represents the number of feature maps input to the MHSA structure, H is the height of the face feature map f_i, W is the width of the face feature map f_i, C is the number of channels of the face feature map f_i, and LN represents the layer normalization operation; Q, K, and V are the Query vector, Key vector, and Value vector obtained from the face feature map f_i through learnable linear projection parameters. Specifically, let the number of face feature maps f_i input to the MHSA structure be B, their size be H*W, and their number of channels be C; the linear projection parameters are W_Q, W_K, and W_V, each of dimension HWC*D, where D is the linear mapping dimension. Q, K, and V are computed as the matrix products f_i × W_Q, f_i × W_K, and f_i × W_V, respectively; b is the attention bias parameter used as the positional encoding.
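The two residual updates of the ViT module can be illustrated with a minimal single-head sketch in plain Python. Several choices here are assumptions rather than the patent's specification: attention is taken in the common form softmax(Q K^T / sqrt(D) + b) V since the source formula is not available, a single head stands in for MHSA, the projections act per token (C x D matrices), the MLP uses ReLU, and all function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """LN: normalise one token to zero mean, unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]


def attention(tokens, w_q, w_k, w_v, bias):
    """ASSUMED single-head form: softmax(Q K^T / sqrt(D) + bias) V."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    d = len(q[0])
    scores = matmul(q, [list(col) for col in zip(*k)])     # Q K^T
    probs = [softmax([s / math.sqrt(d) + b for s, b in zip(s_row, b_row)])
             for s_row, b_row in zip(scores, bias)]
    return matmul(probs, v)


def vit_block(tokens, w_q, w_k, w_v, bias, w_mlp1, w_mlp2):
    """f_{i+1} = f_i + MHSA(LN(f_i)); f_{i+2} = f_{i+1} + MLP(LN(f_{i+1}))."""
    attn = attention([layer_norm(t) for t in tokens], w_q, w_k, w_v, bias)
    f1 = [[a + b for a, b in zip(t, o)] for t, o in zip(tokens, attn)]
    hidden = [[max(0.0, h) for h in row]                   # ReLU (assumed)
              for row in matmul([layer_norm(t) for t in f1], w_mlp1)]
    mlp = matmul(hidden, w_mlp2)
    return [[a + b for a, b in zip(t, o)] for t, o in zip(f1, mlp)]
```

Here `tokens` is one feature map flattened to H*W rows of C channels (B = 1), and `bias` is an (H*W) x (H*W) matrix playing the role of b. Because both updates are residual, the block reduces to the identity when every weight is zero, which is a convenient sanity check.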

[0051] In this embodiment, the dimension of the first fully connected layer is the same as the number of different age values contained in the training dataset, with each age value corresponding to one dimension.

[0052] Preferably, the dimension of the first fully connected layer is 80.

[0053] In this embodiment, the second optimization process specifically involves optimizing the ViT module based on the cost-sensitive regularization loss calculation formula to minimize the cost-sensitive regularization loss.

[0054] where the predicted age is the output of the first fully connected layer; y_i is the real age category corresponding to face feature map f_i; the formula [not reproduced in source] includes a maximum likelihood estimation term; λ is the weight parameter; M^(2)(y, ·) is the cost matrix used to measure the Euclidean distance between different age categories; and the final term denotes the scalar product of the predicted output with the elements of M^(2) in the row position corresponding to y.

[0055] Furthermore, the cost matrix M^(2)(y, ·) has a dimension of 80×80. Let the row index of M^(2)(y, ·) be a and the column index be b; the element at that position is the Euclidean distance between the a-th age category and the b-th age category, calculated as follows:

[0056] $M^{(2)}(a, b) = \lVert a - b \rVert_2^2$

[0057] Preferably, λ = 1. The subscript 2 indicates the calculation of the L2 norm, and the superscript 2 indicates the square.

[0058] This invention utilizes cost-sensitive loss for multi-class supervision of different age values. It applies a larger loss to data samples where the predicted class differs significantly from the true class, while applying a smaller loss to samples where the predicted class is closer to the true class. This aligns well with age classification tasks, as data with closer age classes exhibit less difference, while data with more distant age classes exhibit greater difference. Therefore, it can improve the discriminative power of facial features across different age values.
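The behavior described above can be illustrated with a short sketch. The loss formula is not available in this text, so the form used here is an assumption: cross-entropy plus λ times the scalar product of the true category's cost row with the softmax of the first fully connected layer's output, with cost entries equal to the squared difference of category indices (L2 norm, squared). The function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def cost_matrix(n):
    """cost[a][b] = (a - b)^2: squared L2 distance between category indices."""
    return [[(a - b) ** 2 for b in range(n)] for a in range(n)]


def cost_sensitive_loss(logits, y, cost, lam=1.0):
    """ASSUMED form: CE(logits, y) + lam * <cost[y], softmax(logits)>."""
    probs = softmax(logits)
    ce = -math.log(probs[y])                                # classification term
    penalty = sum(c * p for c, p in zip(cost[y], probs))    # distance-aware term
    return ce + lam * penalty
```

For a true category of 30, a prediction peaked at category 31 incurs a penalty weight of 1, while one peaked at category 60 incurs a weight of 900, so the distant mistake is penalized far more heavily even though the cross-entropy term is identical in both cases.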

[0059] In this embodiment, after the age classification model is obtained, the network parameters of the model are fixed (network parameters refer to the parameters learned in the convolutional modules by optimizing the loss function; fixing the network parameters means fixing the parameters of the network's convolutional modules), and forward prediction is performed. During forward age prediction, the prediction output combines the Softmax confidence scores of all age categories, where Softmax is the Softmax calculation function and the Softmax confidence score is its output value. The specific calculation formula is as follows: [formula not reproduced in source]; where N is the number of different age values contained in the training dataset, and the predicted age is the output of the first fully connected layer.

[0060] When performing forward age prediction, this invention integrates the Softmax confidence scores of various age categories for prediction output, which can achieve more accurate and stable facial age prediction.
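One plausible reading of combining the Softmax confidence scores is a confidence-weighted sum over the N age categories. The prediction formula is not available in this text, so the sketch below is an assumption, as are the function name `predict_age` and the default mapping of category index k to age value k.

```python
import math


def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]


def predict_age(logits, ages=None):
    """ASSUMED form: sum over k of softmax(logits)[k] * ages[k].

    logits: output of the first fully connected layer (N entries).
    ages:   age value represented by each category; defaults to 0..N-1.
    """
    probs = softmax(logits)
    if ages is None:
        ages = range(len(logits))
    return sum(p * a for p, a in zip(probs, ages))
```

Compared with taking only the single most confident category, a weighted sum changes smoothly when confidence is split between neighboring ages, which is consistent with the stability claim in the text.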

[0061] As shown in Figure 3, the present invention also provides an age classification method, which uses an age classification model generated by the above-mentioned age classification model generation method for classification, and includes at least the following steps: acquiring a face image to be predicted and extracting face features from it to obtain a face feature map f_i; inputting the face feature map f_i into the feature discrimination module and the ViT module respectively to obtain the feature vector corresponding to the face feature map f_i; inputting the feature vector into the first fully connected layer to obtain the corresponding predicted age output; and combining the predicted age output of the first fully connected layer with the confidence weight of each age category to obtain the age classification result.

[0062] This invention also provides an age classification system, including: a face feature extraction module, used to acquire a face image to be predicted and extract face features from it to obtain a face feature map f_i; a feature discrimination module, used to acquire the face feature map f_i; a ViT module, used to acquire the face feature map f_i, obtain the feature vector corresponding to the face feature map f_i, and input it into the first fully connected layer; a first fully connected layer, used to acquire the feature vector and output the corresponding predicted age; and a Softmax classification module, used to combine the predicted age output of the first fully connected layer to obtain the predicted age category and output the age classification result according to the confidence weight.

[0063] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the device embodiments, equipment embodiments, and storage medium embodiments, since they are basically similar to the method embodiments, the descriptions are relatively simple, and relevant parts can be referred to the descriptions of the method embodiments.

[0064] Furthermore, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0065] The foregoing description illustrates and describes preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept by means of the foregoing teachings or techniques or knowledge in related fields. Any modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.

Claims

1. A method for generating an age classification model, characterized in that it comprises the following steps:
constructing a face age training dataset, which includes multiple face image data labeled with real ages;
dividing the training dataset into multiple age group subsets and setting the weights for data sampling;
randomly selecting data x_i from the training dataset according to the set weights and inputting it into a face feature extraction network to obtain a face feature map f_i;
inputting the face feature map f_i into a feature discrimination module D_f and performing a first optimization process;
inputting the face feature map f_i into a ViT module to obtain the corresponding feature vector, and inputting the feature vector into a first fully connected layer for a second optimization process;
iteratively training and updating the network parameters until the first and second optimization processes are completed, resulting in the age classification model;
wherein dividing the training dataset into multiple age group subsets and setting the weights for data sampling specifically comprises: based on a preset subset age range, dividing the training dataset into n age group subsets x_1 to x_n; using 1 to n as the data labels corresponding to the age group subsets x_1 to x_n, so that the training dataset is X = {x_1, x_2, ..., x_n}, where 1 ≤ j ≤ n and j is an integer; and setting the data sampling weight w_j of each age group subset x_j according to the sampling weight calculation formula [formula not reproduced in source], where a_j represents the amount of data in age group subset x_j;
the first optimization process specifically comprises: optimizing the feature discrimination module D_f according to the discrimination loss calculation formula so that the discrimination loss is minimized [formula not reproduced in source], enabling the feature discrimination module D_f to discriminate and classify the face feature maps f_i of different age groups, where the formula includes a maximum likelihood estimation term and y_i is the real age category corresponding to face feature map f_i;
the network structure of the ViT module includes a multi-head self-attention (MHSA) structure and a multilayer perceptron (MLP) structure, and the specific calculation method of the ViT module is as follows: $f_{i+1}^{B,H*W,C} = f_i^{B,H*W,C} + \mathrm{MHSA}(\mathrm{LN}(f_i^{B,H*W,C}))$; $f_{i+2}^{B,H*W,C} = f_{i+1}^{B,H*W,C} + \mathrm{MLP}(\mathrm{LN}(f_{i+1}^{B,H*W,C}))$; where MHSA(Q, K, V) = [formula not reproduced in source]; B represents the number of feature maps input to the MHSA structure, H is the height of the face feature map f_i, W is the width of the face feature map f_i, C is the number of channels of the face feature map f_i, and LN represents the layer normalization operation; Q, K, and V are the query vector, key vector, and value vector obtained from the face feature map f_i through learnable linear projection parameters, and b is the attention bias parameter used as the positional encoding;
the dimension of the first fully connected layer is the same as the number of different age values contained in the training dataset, with each age value corresponding to one dimension;
the second optimization process specifically comprises: optimizing the ViT module according to the cost-sensitive regularization loss calculation formula so that the cost-sensitive regularization loss is minimized [formula not reproduced in source], where the predicted age is the output of the first fully connected layer, y_i is the real age category corresponding to face feature map f_i, the formula includes a maximum likelihood estimation term, λ is the weight parameter, M^(2)(y, ·) is the cost matrix used to measure the Euclidean distance between different age categories, and the final term denotes the scalar product of the predicted output with the elements of M^(2) in the row position corresponding to y.

2. The age classification model generation method according to claim 1, characterized in that: the network structure of the feature discrimination module D_f is a fully convolutional network, which contains multiple convolutional modules and a second fully connected layer, and the dimension of the second fully connected layer is the same as the number of age group subsets.

3. The age classification model generation method according to claim 1, characterized in that: after the age classification model is obtained, the network parameters of the model are fixed and forward prediction is performed; during forward age prediction, the prediction output is calculated by combining the Softmax confidence scores of all age categories, and the specific calculation formula is as follows: [formula not reproduced in source]; where N is the number of different age values contained in the training dataset, and the predicted age is the output of the first fully connected layer.

4. An age classification method, characterized in that classification is performed using an age classification model generated by the age classification model generation method according to any one of claims 1-3, and the method includes at least the following steps: acquiring a face image to be predicted and extracting face features from it to obtain a face feature map f_i; inputting the face feature map f_i into the feature discrimination module and the ViT module respectively to obtain the feature vector corresponding to the face feature map f_i; inputting the feature vector into the first fully connected layer to obtain the corresponding predicted age output; and combining the predicted age output of the first fully connected layer to obtain the confidence weights of each age category, from which the age classification result is obtained.

5. An age classification system, characterized in that it is based on the age classification model generation method according to any one of claims 1-3 and includes: a face feature extraction module, used to acquire a face image to be predicted and extract face features from it to obtain a face feature map f_i; a feature discrimination module, used to acquire the face feature map f_i; a ViT module, used to acquire the face feature map f_i, obtain the feature vector corresponding to the face feature map f_i, and input it into the first fully connected layer; a first fully connected layer, used to acquire the feature vector and output the corresponding predicted age; and a Softmax classification module, used to combine the predicted age output of the first fully connected layer to obtain the predicted age category and output the age classification result according to the confidence weight.