A fine-grained face age estimation method based on horizontal pyramid matching
By combining face detection and a residual network model with horizontal pyramid matching, the method extracts both the overall and the local age features of a face. This addresses the difficulty that existing techniques have in accurately estimating subtle age differences in local areas of the face, and achieves higher accuracy in age estimation.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: NORTHEASTERN UNIV CHINA
- Filing Date: 2023-03-15
- Publication Date: 2026-06-23
AI Technical Summary
Existing age estimation methods struggle to accurately capture and assess subtle age differences in local facial features, and they fail to effectively utilize the continuity between different facial components, particularly when estimating adolescent faces.
A method based on horizontal pyramid matching is adopted, which uses a face detection algorithm and a ResNet50 residual network model to extract face age features. The overall and local age features of the face are extracted by combining pyramid segmentation with global average pooling and global max pooling.
The method achieves fine-grained estimation of facial age by extracting and combining global and local facial features, thus improving the accuracy and robustness of age estimation.
Figure CN116246330B_ABST
Description
Technical Field
[0001] This invention relates to the field of image classification technology, and in particular to a fine-grained face age estimation method based on horizontal pyramid matching.
Background Technology
[0002] Age information, as an important human biometric, has numerous applications in the field of human-computer interaction and significantly impacts the performance of facial recognition systems. Age estimation is the study of predicting true age or age group based on facial images; it is a special type of pattern recognition problem. Although age estimation has been extensively studied for many years, accurately estimating human age from a single image remains very challenging. Especially when estimating the facial age of teenagers, accurately capturing and assessing subtle age differences is a pressing issue that needs to be addressed.
[0003] Much existing research focuses on feature extraction. Anthropometric models, active appearance models, and biologically inspired features are well-known approaches to extracting features for age estimation. Most methods extract deeper global features from facial images; however, the age information presented on the face is also influenced by local features, such as crow's feet. Furthermore, the relationships between local features can affect the estimation results, but existing methods rarely focus on significant local facial features. In addition, some existing age estimation methods employ a divide-and-conquer strategy to handle the heterogeneous data caused by non-stationary aging processes; however, facial aging is also a continuous process, and the continuous relationships between different components have not yet been effectively utilized.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing a fine-grained face age estimation method based on horizontal pyramid matching, which can estimate face age under different backgrounds.
[0005] To solve the above-mentioned technical problems, the technical solution adopted by this invention is: a fine-grained face age estimation method based on horizontal pyramid matching. For face images with different backgrounds, firstly, a face detection algorithm is used to detect the position of the face in the image. Then, the face image is input into a ResNet50 residual network model to obtain an age feature map. The age feature map is then horizontally segmented at different pyramid scales to obtain multiple feature slices. Finally, global average pooling and global max pooling are performed on each feature slice to obtain the overall and local age features of the face, thus completing the face age estimation. Specifically, the method includes the following steps:
[0006] Step 1: Face image detection; acquire several images containing human faces, use an integral channel detection algorithm to detect the positions of the faces in the images at multiple angles, and bring the images to a uniform pixel size;
[0007] First, face detection is run on the original face image at 5° intervals between -60° and 60°, and additionally at angles of -90°, 90° and 180°. The face with the highest detection score is selected and rotated to a frontal position. The face region of the image is then expanded, and finally the resulting image is resized to a uniform pixel size.
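As a rough illustration of the expansion step, the sketch below grows a detected face box before the crop is resized. The 40% ratio is the value given later in the embodiment; the function name `expand_box`, the (left, top, right, bottom) box layout, the centred growth, and the clipping to the image bounds are assumptions made for the example, not details stated in the text.

```python
def expand_box(box, image_size, ratio=0.40):
    """Grow a face box (left, top, right, bottom) by `ratio` in width and height.

    Assumption: the box stays centred and is clipped to the image bounds.
    """
    left, top, right, bottom = box
    img_w, img_h = image_size
    dw = (right - left) * ratio / 2.0   # half of the extra width on each side
    dh = (bottom - top) * ratio / 2.0   # half of the extra height on each side
    return (
        max(0.0, left - dw),
        max(0.0, top - dh),
        min(float(img_w), right + dw),
        min(float(img_h), bottom + dh),
    )


# A 100x100 box grows to roughly 140x140 around the same centre.
box = expand_box((100, 120, 200, 220), image_size=(640, 480))
print(box)
```

The expanded crop would then be resized to the uniform size (256×256 in the embodiment) with whatever image library is in use.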
[0008] Step 2: Age feature extraction; The face image obtained in Step 1 is input into the ResNet50 residual network model to obtain the age feature map;
[0009] A ResNet50 residual network model was constructed, and the pooling layers and fully connected layers were removed from the model. The preprocessed face image from step 1 was input into the ResNet50 and passed through five stages (stage0, stage1, stage2, stage3, and stage4). Stage0 consists of a basic convolutional layer and a pooling layer. Stages 1 to 4 are residual blocks with bottleneck structures, with 3, 4, 6, and 3 bottleneck structures respectively. Finally, a 2048-dimensional age feature map was obtained.
[0010] Step 3: Copy the age feature map obtained in Step 2 n times, and divide the n feature maps horizontally using pyramids of different scales, so that the i-th feature map (i = 1, ..., n) is divided into $2^{i-1}$ equal horizontal feature blocks, up to $2^{n-1}$ blocks for the n-th feature map;
[0011] After the age feature map is obtained from Step 2, it is copied n times and a horizontal pyramid pooling module is constructed. The horizontal spatial pyramid pooling module is then used to obtain the local and global spatial information of each feature map, with the n feature maps partitioned horizontally into $2^{0}, 2^{1}, \dots, 2^{n-1}$ horizontal feature blocks respectively;
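Reading the partition rule as one pyramid level per copy, with the i-th copy split into $2^{i-1}$ strips (the reading that matches the 1, 2, 4, 8 split used in the embodiment), the total number of horizontal feature blocks follows from a geometric sum. This is a worked count under that assumption only:

```latex
\sum_{i=1}^{n} 2^{\,i-1} = 2^{n} - 1,
\qquad
n = 4:\quad 1 + 2 + 4 + 8 = 15 .
```

So the four-scale case yields 15 blocks, each of which later receives its own pooled feature vector and classifier.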
[0012] Step 4: Perform global average pooling and global max pooling in the horizontal direction on the feature blocks of the four pyramid scales obtained in Step 3, add the two pooled feature maps, and then apply a 1×1 convolution to reduce the computation and realize face age estimation.
[0013] After the feature blocks divided in Step 3 are obtained, global average pooling and global max pooling are performed on each horizontal feature block; the two pooled feature maps are summed, and a 1×1 convolution reduces the dimension of the feature map from 2048 to 256; each column feature is input into a non-shared fully connected layer; a cross-entropy loss function is used for supervision; finally, the Softmax function is used to estimate the age of the face in the image, and the age estimate of the category corresponding to the person in the image is output.
[0014] The beneficial effects of adopting the above technical solution are as follows. The fine-grained face age estimation method based on horizontal pyramid matching provided by this invention addresses the problem of combining global and fine-grained local facial age features. It extracts facial age features with a ResNet50 residual network, divides the feature map with a horizontal pyramid segmentation method, and performs global average pooling and global max pooling on the resulting feature blocks. This allows both the overall age features and the most recognizable local fine-grained features of the face to be extracted, which has significant application value in real-world environments.
Description of the Drawings
[0015] Figure 1 is a diagram of the overall architecture of the fine-grained face age estimation method based on horizontal pyramid matching provided in an embodiment of the present invention;
[0016] Figure 2 is a flowchart of the integral channel detection provided in an embodiment of the present invention;
[0017] Figure 3 is a diagram of the ResNet50 residual network architecture provided in an embodiment of the present invention;
[0018] Figure 4 is a schematic diagram of a single residual block provided in an embodiment of the present invention;
[0019] Figure 5 is a schematic diagram of the horizontal pyramid matching provided in an embodiment of the present invention.
Detailed Implementation
[0020] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.
[0021] This embodiment demonstrates a simulation experiment of a fine-grained face age estimation method based on horizontal pyramid matching, using a Windows 10 system as the software environment and VSCode 1.73.1-x64 as the simulation environment.
[0022] In this embodiment, a fine-grained face age estimation method based on horizontal pyramid matching is proposed. For face images with different backgrounds, the method first uses a face detection algorithm to detect the location of faces in the image. Then, the face image is input into a ResNet50 residual network model to obtain an age feature map. The age feature map is then horizontally segmented at different pyramid scales to obtain multiple feature slices. Finally, global average pooling and global max pooling are performed on each feature slice to obtain the overall age features of the face and the most recognizable local age features, thus completing the face age estimation. As shown in Figure 1, the specific steps are:
[0023] Step 1: Face image detection; acquire several images containing human faces, use an integral channel detection algorithm to detect the positions of the faces in the images at multiple angles, and bring the images to a uniform pixel size;
[0024] First, face detection is run on the original face image at 5° intervals between -60° and 60°, and additionally at angles of -90°, 90° and 180°. The face with the highest detection score is selected and rotated to a frontal position. The face region of the image is then expanded, and finally the resulting image is resized to a uniform pixel size.
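The angle sweep described here can be sketched as follows. The list of angles comes from the text; the helper names, the dictionary of detector scores, and the idea of undoing the winning rotation by applying its negative are assumptions for illustration, since the text does not say how the detector is invoked.

```python
def candidate_angles():
    """Rotation angles (degrees) at which the detector is run."""
    fine = list(range(-60, 61, 5))   # -60, -55, ..., 60 (25 angles)
    coarse = [-90, 90, 180]          # extra orientations named in the text
    return fine + coarse


def best_rotation(scores_by_angle):
    """Pick the angle whose detection score is highest."""
    return max(scores_by_angle, key=scores_by_angle.get)


angles = candidate_angles()
print(len(angles))  # 28

# Hypothetical detector scores, for illustration only.
scores = {a: 0.1 for a in angles}
scores[35] = 0.9
best = best_rotation(scores)
print(best, -best)  # the winning angle and the rotation that would undo it
```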
[0025] In this embodiment, the original face image is first processed with an integral channel detection algorithm, as shown in Figure 2. To obtain the integral channel features, 10 feature channels are computed at each of the scales 1, 1/2, and 1/4. The 10 channels are: gradient histograms in 6 directions, 3 LUV color channels, and 1 gradient magnitude channel. These channels can be computed efficiently and capture different information from the input image. After the 10 channels are computed, rectangular regions of varying sizes are randomly selected within each channel, and the sum of the pixel values within each region is calculated. In total, 30,000 rectangular regions are randomly selected to form the integral channel feature pool. In the grayscale image, the gradient magnitude and direction of pixel (x, y) are as follows:
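The reason rectangular sums over a channel are cheap is the integral image: once cumulative sums are stored, any rectangle's sum needs only four lookups. The sketch below shows this with NumPy on a random stand-in channel; the function names, the 64×64 channel size, and the rectangle size limits are assumptions for the example, not values from the text.

```python
import numpy as np


def integral_image(channel):
    """Cumulative sum over rows and columns, padded with a zero row and column."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = channel.cumsum(axis=0).cumsum(axis=1)
    return ii


def region_sum(ii, top, left, height, width):
    """Sum of channel[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


rng = np.random.default_rng(0)
channel = rng.random((64, 64))   # stand-in for one of the 10 channels
ii = integral_image(channel)

# Draw a few random rectangles (the text draws 30,000 in total).
features = []
for _ in range(5):
    h, w = rng.integers(1, 17, size=2)
    t = rng.integers(0, 64 - h + 1)
    l = rng.integers(0, 64 - w + 1)
    features.append(region_sum(ii, t, l, h, w))
```

Each entry of `features` is one candidate integral channel feature; a boosted classifier would then select among such features.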
[0026] $$G(x,y)=\sqrt{\bigl(H(x+1,y)-H(x-1,y)\bigr)^{2}+\bigl(H(x,y+1)-H(x,y-1)\bigr)^{2}}$$
[0027] $$\alpha(x,y)=\arctan\frac{H(x,y+1)-H(x,y-1)}{H(x+1,y)-H(x-1,y)}$$
[0028] Where H(x, y) is the pixel value of pixel (x, y).
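A minimal NumPy sketch of these central differences is given below. It assumes the magnitude is the usual Euclidean norm of the two differences, indexes the image as `H[x, y]`, leaves border pixels at zero, and swaps the plain arctangent for `np.arctan2` so that a zero horizontal difference does not cause a division by zero. All of these are choices made for the example.

```python
import numpy as np


def gradient_magnitude_direction(H):
    """Central-difference gradient of a grayscale image indexed as H[x, y]."""
    H = H.astype(np.float64)
    dx = np.zeros_like(H)
    dy = np.zeros_like(H)
    dx[1:-1, :] = H[2:, :] - H[:-2, :]    # H(x+1, y) - H(x-1, y)
    dy[:, 1:-1] = H[:, 2:] - H[:, :-2]    # H(x, y+1) - H(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    direction = np.degrees(np.arctan2(dy, dx))   # degrees, within [-180, 180]
    return magnitude, direction


# A linear ramp H(x, y) = 3x + 4y has differences (6, 8) at interior pixels.
ramp = np.fromfunction(lambda x, y: 3 * x + 4 * y, (5, 5))
mag, ang = gradient_magnitude_direction(ramp)
print(mag[2, 2])  # 10.0
```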
[0029] A gradient histogram is a weighted histogram in which the bin index is determined by the gradient direction and the weight by the gradient magnitude. The gradient histogram channels are computed as $Q_\theta(x, y) = G(x, y) \cdot L[\omega(x, y) = \theta]$, where G(x, y) and ω(x, y) are the gradient magnitude and the quantized gradient direction at pixel (x, y), L is an indicator function, and θ is the quantization range of the gradient direction ω(x, y). In this embodiment, four gradient histogram channels are selected, so the value ranges of θ are -60° to 60°, -90°, 90°, and 180°, respectively. After the features are extracted, they are concatenated and normalized, and then classified using a soft-cascade AdaBoost classifier. The face region of the face image is then expanded by 40% in both width and height. Finally, the resulting images are uniformly resized to 256×256 pixels.
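The channel formula can be read as "each pixel writes its gradient magnitude into the one channel whose bin matches its quantized direction". The sketch below illustrates that reading only. It assumes unsigned orientations folded into [0°, 180°) and six equal-width bins, which is a common convention and not the specific θ ranges listed in the text; the function name and bin count are likewise assumptions.

```python
import numpy as np


def gradient_histogram_channels(magnitude, direction_deg, n_bins=6):
    """Q_theta(x, y) = G(x, y) * 1[omega(x, y) == theta] for each bin theta."""
    # Fold directions into [0, 180) and quantise into equal-width bins.
    folded = np.mod(direction_deg, 180.0)
    bin_width = 180.0 / n_bins
    omega = np.minimum((folded // bin_width).astype(int), n_bins - 1)
    channels = np.zeros((n_bins,) + magnitude.shape)
    for theta in range(n_bins):
        channels[theta] = magnitude * (omega == theta)
    return channels


magnitude = np.full((4, 4), 2.0)
direction = np.linspace(-180.0, 179.0, 16).reshape(4, 4)
channels = gradient_histogram_channels(magnitude, direction)
print(channels.shape)  # (6, 4, 4)

# Every pixel contributes its magnitude to exactly one channel.
assert np.allclose(channels.sum(axis=0), magnitude)
```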
[0030] Step 2: Age feature extraction; The face image obtained in Step 1 is input into the ResNet50 residual network model to obtain the age feature map;
[0031] A ResNet50 residual network model is constructed, and the pooling layers and fully connected layers are removed from the model, as shown in Figure 3. The preprocessed face image from Step 1 is input into the ResNet50 network and passes through five stages (stage0, stage1, stage2, stage3, and stage4) to extract age features, as shown in Table 1. Stage0 consists of a basic convolutional layer and a pooling layer, while stage1 to stage4 are residual blocks with bottleneck structures, containing 3, 4, 6, and 3 bottleneck structures respectively. Finally, a 2048-dimensional age feature map is obtained.
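To make the stage layout concrete, the sketch below traces feature-map sizes through the five stages for a 256×256 input and checks where the "50" in ResNet50 comes from. The block counts (3, 4, 6, 3) and the 2048-channel output are from the text; the channel widths and strides are the standard ResNet50 values and are assumed here, since the network parameters in Table 1 are not available in this text.

```python
# (name, bottleneck blocks, output channels, stride): standard ResNet50 values,
# assumed for this example.
STAGES = [
    ("stage1", 3, 256, 1),
    ("stage2", 4, 512, 2),
    ("stage3", 6, 1024, 2),
    ("stage4", 3, 2048, 2),
]


def trace_shapes(input_size=256):
    """Return [(stage name, channels, spatial size)] for a square input."""
    size = input_size // 2      # stage0: 7x7 convolution, stride 2
    size = size // 2            # stage0: 3x3 max pooling, stride 2
    shapes = [("stage0", 64, size)]
    for name, _blocks, channels, stride in STAGES:
        size //= stride
        shapes.append((name, channels, size))
    return shapes


for name, channels, size in trace_shapes(256):
    print(f"{name}: {channels} x {size} x {size}")

# Weighted layers in the unmodified network:
# 1 stem convolution + 3 convolutions per bottleneck + 1 fully connected layer.
n_layers = 1 + 3 * sum(blocks for _, blocks, _, _ in STAGES) + 1
print(n_layers)  # 50
```

Under these assumptions the final age feature map has 2048 channels on an 8×8 grid, which is the tensor the horizontal pyramid later slices.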
[0032] In this embodiment, the residual blocks in stage1 to stage4 are as shown in Figure 4. A residual block can be expressed as $x_{l+1} = x_l + F(x_l, W_l)$. It consists of two parts, a direct mapping and a residual part: $x_l$ is the direct mapping, and $F(x_l, W_l)$ is the residual part, which consists of two convolution operations. In this embodiment, residual blocks with the same number of input and output feature channels are used.
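The residual formula can be illustrated with a toy block in NumPy. This is a sketch, not the network's actual layers: the residual part is modelled as two 1×1 convolutions (per-pixel channel mixing) with a ReLU in between, batch normalization is omitted, and the tensor sizes and weight names are made up for the example.

```python
import numpy as np


def residual_branch(x, W1, W2):
    """F(x, W): two 1x1 convolutions with a ReLU between them."""
    h = np.maximum(np.einsum("oc,chw->ohw", W1, x), 0.0)
    return np.einsum("oc,chw->ohw", W2, h)


def residual_block(x, W1, W2):
    """x_{l+1} = x_l + F(x_l, W_l); equal channel counts give an identity shortcut."""
    return x + residual_branch(x, W1, W2)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))          # (channels, height, width)
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, W1, W2)
print(y.shape)  # (8, 4, 4)

# With zero residual weights the block reduces to the direct mapping.
assert np.allclose(residual_block(x, W1, np.zeros((8, 8))), x)
```

Because input and output channel counts match, the shortcut needs no projection, which is the case the embodiment selects.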
[0033] Table 1. ResNet50 network structure parameters used for age feature extraction
[0034]
[0035] Step 3: Copy the age feature map obtained in Step 2 four times, and divide the four feature maps horizontally using pyramids of different scales, dividing them into 1, 2, 4, and 8 equal horizontal feature blocks respectively.
[0036] After obtaining the age feature map from step 2, the age feature map is copied four times and a horizontal pyramid pooling module is constructed. Then, the horizontal spatial pyramid pooling module is used to learn and enhance the discriminative age features of the face at different horizontal pyramid scales to obtain the local and global spatial information of each feature map.
[0037] Horizontal pyramid pooling divides the four feature maps horizontally into 1, 2, 4, and 8 horizontal blocks respectively. Through horizontal pyramid pooling, fixed-length vectors of the local face at different horizontal pyramid scales are obtained. These vectors are then input into a convolutional layer and a fully connected layer for learning and classification, improving the model's ability to extract local facial age features in a fine-grained manner from global to local.
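The 1, 2, 4, 8 partition can be sketched as slicing the feature map along its height. The example assumes a (channels, height, width) layout and a 2048×8×8 backbone output (the size a standard ResNet50 would give for a 256×256 input); it also uses array views instead of making four physical copies, which gives the same blocks. The function name and the `(i, j)` dictionary keys are illustrative.

```python
import numpy as np


def horizontal_pyramid(feature_map, scales=(1, 2, 4, 8)):
    """Split a (C, H, W) feature map into horizontal strips at each scale.

    Returns {(i, j): strip}, where i indexes the pyramid scale and j the strip.
    """
    _, height, _ = feature_map.shape
    blocks = {}
    for i, n_strips in enumerate(scales):
        if height % n_strips:
            raise ValueError(f"height {height} is not divisible by {n_strips}")
        step = height // n_strips
        for j in range(n_strips):
            blocks[(i, j)] = feature_map[:, j * step:(j + 1) * step, :]
    return blocks


feature_map = np.zeros((2048, 8, 8))   # assumed backbone output size
blocks = horizontal_pyramid(feature_map)
print(len(blocks))            # 15
print(blocks[(3, 0)].shape)   # (2048, 1, 8)
```

The coarsest scale keeps the whole face as one block, while the finest scale isolates single-row strips, which is what lets later layers see both global and local age cues.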
[0038] Step 4: Perform global average pooling and global max pooling in the horizontal direction on the feature blocks of the four pyramid scales obtained in Step 3, add the two pooled feature maps, and then apply a 1×1 convolution to reduce the computation and realize face age estimation.
[0039] After the feature blocks divided in Step 3 are obtained, global average pooling and global max pooling are performed on each horizontal feature block; the two pooled feature maps are summed, and a 1×1 convolution reduces the dimension of the feature map from 2048 to 256; each column feature is input into a non-shared fully connected layer; a cross-entropy loss function is used for supervision; finally, the Softmax function is used to estimate the age of the face in the image, and the age estimate of the category corresponding to the person in the image is output.
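The pooling and dimension-reduction part of this step can be sketched for a single horizontal block as follows. The 2048 and 256 dimensions are from the text; the block size, the random placeholder weights, and the function names are assumptions. Since the pooled feature has a 1×1 spatial extent, the 1×1 convolution is written as a matrix-vector product.

```python
import numpy as np


def pool_block(block):
    """Global average pooling plus global max pooling over a (C, h, W) block."""
    return block.mean(axis=(1, 2)) + block.max(axis=(1, 2))


def reduce_dim(vector, conv_weight, conv_bias):
    """A 1x1 convolution on a 1x1 spatial map is a matrix-vector product."""
    return conv_weight @ vector + conv_bias


rng = np.random.default_rng(0)
block = rng.standard_normal((2048, 2, 8))               # one horizontal block
conv_weight = rng.standard_normal((256, 2048)) * 0.01   # placeholder weights
conv_bias = np.zeros(256)

pooled = pool_block(block)                              # shape (2048,)
reduced = reduce_dim(pooled, conv_weight, conv_bias)    # shape (256,)
print(pooled.shape, reduced.shape)  # (2048,) (256,)
```

Average pooling summarizes the whole strip while max pooling keeps its strongest response, so their sum carries both kinds of evidence into the 256-dimensional feature.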
[0040] In this embodiment, as shown in Figure 5, let each horizontal block be denoted $F_{i,j}$, where i and j are the index of the pyramid scale and the index of the horizontal block within that scale. Each horizontal block is pooled with global average pooling and global max pooling, and the two pooled feature maps are added to generate a column feature vector:

$$G_{i,j} = \mathrm{avgpool}(F_{i,j}) + \mathrm{maxpool}(F_{i,j})$$

Each $G_{i,j}$ is then fed into a 1×1 convolutional layer that reduces its dimension from 2048 to 256; the result is denoted $H_{i,j}$. Each column feature vector $H_{i,j}$ is input into its corresponding classifier $FC_{i,j}$ in the non-shared fully connected layers, and the softmax function is used to estimate the age value. During training, the output for a given image I is a set of predictions $\hat{y}_{i,j}$, each of which can be formalized as:

$$\hat{y}_{i,j} = \arg\max_{c \in \{1, \dots, P\}} \frac{\exp\left((W_{i,j}^{c})^{\top} H_{i,j}\right)}{\sum_{p=1}^{P} \exp\left((W_{i,j}^{p})^{\top} H_{i,j}\right)}$$

where P is the total number of face images, $W_{i,j}$ is the learned weights of $FC_{i,j}$, and y is the true age of the input image I. The loss function is the sum of the cross-entropy losses of each output $\hat{y}_{i,j}$:

$$L = \sum_{n=1}^{N} \sum_{i,j} \mathrm{CE}(\hat{y}_{i,j}, y)$$

where N is the batch size and CE is the cross-entropy loss.
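The per-block classifiers and the summed loss can be sketched for one image as below. The 15 blocks and 256-dimensional features follow the text; the number of age categories (101), the random features and weights, and the function names are hypothetical, and the batch sum over N images is left out for brevity.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(probs, label):
    return -np.log(probs[label] + 1e-12)


def hpm_loss(features, classifiers, label):
    """Sum of cross-entropy losses over all blocks (i, j) for one image."""
    total = 0.0
    for key, h in features.items():
        W = classifiers[key]             # non-shared: one weight matrix per block
        total += cross_entropy(softmax(W @ h), label)
    return total


rng = np.random.default_rng(0)
n_classes = 101                          # hypothetical number of age categories
keys = [(i, j) for i, n in enumerate((1, 2, 4, 8)) for j in range(n)]
features = {k: rng.standard_normal(256) for k in keys}
classifiers = {k: rng.standard_normal((n_classes, 256)) * 0.01 for k in keys}

loss = hpm_loss(features, classifiers, label=25)
print(len(keys), float(loss))
```

Keeping a separate weight matrix per block means each strip learns its own mapping from local appearance to age category, and summing the losses supervises all of them jointly.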
[0041] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.
Claims
1. A fine-grained face age estimation method based on horizontal pyramid matching, characterized in that: For face images with different backgrounds, the face detection algorithm is first used to detect the position of the face in the image. Then, the face image is input into the ResNet50 residual network model to obtain the age feature map. The age feature map is then horizontally segmented at different pyramid scales to obtain multiple feature slices. Finally, global average pooling and global max pooling are performed on each feature slice to obtain the overall age features and local age features of the face, thus completing the face age estimation. The method specifically includes the following steps:
Step 1: Face image detection; acquire several images containing human faces, and use an integral channel detection algorithm to detect the multi-angle positions of the faces in the images, and unify the image pixels; First, face detection is performed on the original face image at 5° intervals between -60° and 60°, and face detection is also performed at angles of -90°, 90° and 180° to obtain the face with the highest detection score, and then rotated to a frontal position; then the facial size of the face image is expanded; finally, the generated image is compressed to a uniform pixel size.
Step 2: Age feature extraction; The face image obtained in Step 1 is input into the ResNet50 residual network model to obtain the age feature map; Construct a ResNet50 residual network model and remove the pooling and fully connected layers from the model; input the preprocessed face image from step 1 into ResNet50 and pass through 5 stages of the ResNet50 network to finally obtain a 2048-dimensional age feature map.
Step 3: Copy the age feature map obtained in Step 2 four times, and divide the four feature maps horizontally using pyramids of different scales, dividing them into 1, 2, 4, and 8 equal horizontal feature blocks respectively.
Step 4: Perform global average pooling and global max pooling in the horizontal direction on the feature blocks of the four pyramid scales obtained in Step 3, add the two pooled feature maps, and then apply a 1×1 convolution to reduce the computation and realize face age estimation. After the horizontal feature blocks divided in Step 3 are obtained, global average pooling and global max pooling are performed on each horizontal feature block $F_{i,j}$, where i and j represent the pyramid scale index and the index of each horizontal block at that scale. The two pooled feature maps are added together to generate a column feature vector $G_{i,j}$. Each $G_{i,j}$ is then fed into a 1×1 convolutional layer to reduce the dimension from 2048 to 256, represented as $H_{i,j}$; each column feature vector $H_{i,j}$ is input into the corresponding classifier $FC_{i,j}$ in the non-shared fully connected layers, and the age value is estimated using the Softmax function; during training, the cross-entropy loss function is used for supervision.
2. The fine-grained face age estimation method based on horizontal pyramid matching according to claim 1, characterized in that: The ResNet50 network has five stages: stage0, stage1, stage2, stage3, and stage4. Stage0 consists of a basic convolutional layer and a pooling layer. Stages 1 to 4 are residual blocks with bottleneck structures, with 3, 4, 6, and 3 bottleneck structures respectively.