A face detection method and device
By introducing the RepViT-M0_9 model and the FPIoU2 loss function into the YOLOv8n model, the feature extraction and bounding box loss function of YOLOv8n are improved, solving the problems of detection accuracy for small targets and detection stability in complex scenes, and achieving higher detection accuracy and robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HOHAI UNIV
- Filing Date
- 2024-12-26
- Publication Date
- 2026-06-26
AI Technical Summary
The YOLOv8 model suffers from low accuracy in detecting small objects and limited ability to capture detailed features in high-precision face detection tasks, and it is difficult to achieve stable detection results in complex scenes.
The RepViT-M0_9 model is used to replace the Backbone module of YOLOv8n for feature extraction, and a P2 layer is introduced in the Neck module to enhance feature fusion. The loss function FPIoU2 is used to replace the bounding box loss function CIoU of YOLOv8n.
The model improves the detection accuracy and robustness of small target faces, and can stably detect faces in complex backgrounds, significantly improving the detection effect, especially in application scenarios with dense crowds and significant differences in face size, providing accurate and reliable detection results.
Description
Technical Field
[0001] This invention relates to the field of face recognition technology, specifically to a face detection method and apparatus based on an improved YOLOv8n model.
Background Technology
[0002] In recent years, with the development of computer vision and deep learning technologies, face detection has become increasingly widely used in fields such as security monitoring, identity verification, and intelligent human-computer interaction. Traditional face detection methods mostly rely on manually designed features, making them difficult to adapt to complex and ever-changing scenarios. Deep learning models based on convolutional neural networks have advantages in image feature extraction, significantly improving the accuracy and robustness of face detection. The YOLO (You Only Look Once) series, as a single-stage detection model, has gained widespread attention in the field of object detection due to its efficient end-to-end detection performance. YOLOv8, as a widely used version of this series, further optimizes the network structure, improving detection accuracy and real-time performance. However, in high-precision face detection tasks, the YOLOv8 model still has some shortcomings, such as low accuracy in detecting small objects, limited ability to capture detailed features, and difficulty in achieving stable detection results in complex scenes.
Summary of the Invention
[0003] To address the aforementioned issues, this invention proposes a face detection method and apparatus based on an improved YOLOv8n model. By replacing the backbone network, adding a feature fusion layer, and improving the loss function, the model's detection performance for small target faces and variable scenes is enhanced.
[0004] This invention is achieved using the following technical solution:
[0005] In a first aspect, the present invention provides a face detection method, comprising:
[0006] The image to be detected is input into a pre-trained target detection model for detection to obtain the face recognition result;
[0007] The object detection model is obtained by replacing the original Backbone module with the RepViT-M0_9 model on the basis of the YOLOv8n model for feature extraction, and replacing the bounding box loss function CIoU of the YOLOv8n model with the loss function FPIoU2. The RepViT-M0_9 model consists of a stem module, RepViTBlock modules of layers 1-26, and an SPPF module connected in sequence. The output of the RepViTBlock module of layer 3 is used as feature map P2, the output of the RepViTBlock module of layer 7 is used as feature map P3, the output of the RepViTBlock module of layer 23 is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are fused in the Neck module of the YOLOv8n model.
[0008] Furthermore, the stem module includes:
[0009] The input image to be detected is convolved by a 3×3 convolutional layer with a stride of 2, so that the number of output channels is half the number of input channels. After batch normalization and GELU activation function operation, it is convolved again by a 3×3 convolutional layer with a stride of 2 to restore the number of output channels to the original number of input channels. After batch normalization, the low-level feature map is obtained.
[0010] Furthermore, the RepViTBlock modules in layers 1-3, 5-7, 9-23, and 25-26 include:
[0011] The input feature map is convolved with a 3×3 depthwise convolutional layer with a stride of 1 and a 1×1 depthwise convolutional layer to extract spatial features. The extracted spatial features are then added together to obtain feature map A1.
[0012] The feature map A1 is input into the Squeeze-and-Excitation module for processing to obtain the channel attention map A2.
[0013] The channel attention map A2 is increased in dimensionality through a 1×1 pointwise convolution operation, passed through a GELU activation function, and then reduced in dimensionality through a 1×1 pointwise convolution operation to obtain the feature map A3. In the RepViTBlock modules of layers 1-3 and 5-7, the channel attention map A2 is increased in dimensionality through a 1×1 pointwise convolution operation and then reduced in dimensionality directly through a 1×1 pointwise convolution operation, without the GELU activation function operation, to obtain the feature map A3.
[0014] Add the channel attention map A2 to the feature map A3 to obtain the feature map A4;
[0015] The RepViTBlock modules in layers 4, 8, and 24 include:
[0016] The input feature map is convolved by a 3×3 depthwise convolutional layer with a stride of 2 to obtain feature map B1.
[0017] The feature map B1 is input into the Squeeze-and-Excitation module for processing to obtain the channel attention map B2.
[0018] The channel attention map B2 is processed by a 1×1 pointwise convolution operation to obtain the feature map B3;
[0019] Feature map B3 is increased in dimensionality through a 1×1 pointwise convolution, then reduced in dimensionality again through a 1×1 pointwise convolution after the GELU activation function operation, resulting in feature map B4. In the 4th layer RepViTBlock module, feature map B3 is increased in dimensionality through a 1×1 pointwise convolution, then reduced in dimensionality directly through a 1×1 pointwise convolution without the GELU activation function operation, resulting in feature map B4.
[0020] Add feature map B3 to feature map B4 to obtain feature map B5;
[0021] Specifically, the Squeeze-and-Excitation module is enabled in the RepViTBlock modules of layers 1, 5, 9, 11, 13, 15, 17, 19, 21, and 25, while the Squeeze-and-Excitation module is not enabled in the RepViTBlock modules of the remaining layers.
[0022] Furthermore, the Squeeze-and-Excitation module includes:
[0023] After performing global average pooling on the input feature map, it is fed into a 1×1 convolutional layer for dimensionality reduction. After batch normalization and ReLU activation, it is fed into a 1×1 convolutional layer again for dimensionality increase. Then, the sigmoid activation function is used to generate the weights for each channel. The generated weights are multiplied by the original input feature map to obtain the channel attention map.
[0024] Furthermore, the feature fusion of feature maps P2, P3, P4, and P5 in the Neck module of the YOLOv8n model includes:
[0025] The feature map P5 is input into the first upsampling layer, and the feature map Q1 is obtained by upsampling the feature map P5 through the first upsampling layer.
[0026] After concatenating feature map Q1 and feature map P4 through the first Concat layer, the concatenation is sent to the first C2f module. The first C2f module outputs feature map Q2, and then upsamples feature map Q2 through the second upsampling layer to obtain feature map Q3.
[0027] After concatenating feature map Q3 and feature map P3 through the second Concat layer, the concatenation is sent to the second C2f module, which outputs feature map Q4. The feature map Q5 is then obtained by upsampling feature map Q4 through the third upsampling layer.
[0028] After concatenating feature map Q5 and feature map P2 through the third Concat layer, the concatenation is sent to the third C2f module. The third C2f module outputs the detection feature map E1 and sends it to the first detection head. At the same time, the detection feature map E1 is passed through the first convolutional layer to generate feature map Q6.
[0029] After concatenating feature map Q6 and feature map Q4 through the fourth Concat layer, the concatenation is sent to the fourth C2f module. The fourth C2f module outputs the detection feature map E2 and sends it to the second detection head. At the same time, the detection feature map E2 is passed through the second convolutional layer to generate feature map Q7.
[0030] After concatenating feature map Q7 and feature map Q2 through the fifth Concat layer, the concatenation is sent to the fifth C2f module. The fifth C2f module outputs the detection feature map E3 and sends it to the third detection head. At the same time, the detection feature map E3 is passed through the third convolutional layer to generate feature map Q8.
[0031] After concatenating feature map Q8 and feature map P5 through the sixth Concat layer, the concatenation is sent to the sixth C2f module, which then outputs the detection feature map E4 and sends it to the fourth detection head.
[0032] Furthermore, the loss function FPIoU2 is:
[0033]
[0034]
[0035]
[0036] $q = e^{-p},\ q \in (0, 1]$
[0037] Here, L_FPIoU2 is the loss function; λ is a hyperparameter; q is the parameter used to calculate the non-monotonic attention weight in the loss function; IoU^Focaler is the intersection over union (IoU) under the Focaler mechanism; p is the geometric penalty term; and IoU is the ratio of the intersection area to the union area of the predicted and ground-truth boxes. d and u are preset thresholds with [d,u]∈[0,1]; adjusting d and u makes IoU^Focaler focus on regression samples of different difficulty. dw1, dw2, dh1, and dh2 are the absolute values of the distances between corresponding edges of the predicted bounding box and the target bounding box, and w_gt and h_gt are the width and height of the target bounding box, respectively. The penalty term uses the target bounding box size as its denominator, and the non-monotonic attention function is controlled by the hyperparameter λ.
[0038] Furthermore, the image to be detected is a preprocessed image, and the preprocessing includes scaling the image to be detected using LetterBox scaling.
[0039] In a second aspect, the present invention provides a face detection device, comprising:
[0040] The face detection module is used to input the image to be detected into a pre-trained target detection model for detection and to obtain face recognition results;
[0041] The object detection model is obtained by replacing the original Backbone module with the RepViT-M0_9 model on the basis of the YOLOv8n model for feature extraction, and replacing the bounding box loss function CIoU of the YOLOv8n model with the loss function FPIoU2. The RepViT-M0_9 model consists of a stem module, RepViTBlock modules of layers 1-26, and an SPPF module connected in sequence. The output of the RepViTBlock module of layer 3 is used as feature map P2, the output of the RepViTBlock module of layer 7 is used as feature map P3, the output of the RepViTBlock module of layer 23 is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are fused in the Neck module of the YOLOv8n model.
[0042] Furthermore, the face detection device further includes:
[0043] The preprocessing module is used to preprocess the image to be detected.
[0044] Thirdly, the present invention provides a computing device including a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the steps of the face detection method according to any one of the first aspects.
[0045] Fourthly, the present invention provides a computer-readable storage medium storing a computer program thereon, wherein when the computer program is executed by a processor, it implements the steps of the face detection method as described in any one of the first aspects.
[0046] Compared with the prior art, the beneficial technical effects of the present invention are as follows:
[0047] This invention improves the YOLOv8n model by replacing the original Backbone module with the RepViT-M0_9 model for feature extraction. It also introduces a P2 layer into the Neck module of the YOLOv8n model to enhance feature fusion for small targets. Furthermore, it replaces the bounding box loss function CIoU in the YOLOv8n model with the FPIoU2 loss function, resulting in a more robust and accurate model for detecting small faces. This improved model not only effectively identifies faces of different sizes but also stably detects faces in complex backgrounds, significantly enhancing its performance in detecting small targets. This method is widely applicable to scenarios with dense crowds and significant differences in face size, providing accurate and reliable detection results for practical applications.
Attached Figure Description
[0048] Figure 1 is a schematic diagram of the target detection model structure of the present invention;
[0049] Figure 2 is a schematic diagram of the stem module structure;
[0050] Figure 3 shows the cfgs parameter configuration for the RepViTBlock modules in layers 1-26;
[0051] Figure 4 is a structural diagram of the RepViTBlock module in layers 1-3, 5-7, 9-23, and 25-26;
[0052] Figure 5 is a structural diagram of the RepViTBlock module in layers 4, 8, and 24;
[0053] Figure 6 is a schematic diagram of the Squeeze-and-Excitation module;
[0054] Figure 7 is a schematic diagram of the Neck module;
[0055] Figure 8 is a comparison chart of the detection results of the traditional YOLOv8n (a) and the target detection model of the present invention (b).
Detailed Implementation
[0056] The present invention will be further described below with reference to specific embodiments. These embodiments are only used to more clearly illustrate the technical solutions of the present invention and should not be construed as limiting the scope of protection of the present invention.
[0057] This invention provides a face detection method, including:
[0058] The image to be detected is input into a pre-trained target detection model for detection to obtain the face recognition result;
[0059] The target detection model is based on the YOLOv8n model, using the RepViT-M0_9 model to replace the original Backbone module for feature extraction. The P2 layer is introduced into the Neck module of the YOLOv8n model to enhance feature fusion for small targets, and the loss function FPIoU2 is used to replace the bounding box loss function CIoU of the YOLOv8n model.
[0060] As shown in Figure 1, the target detection model includes the RepViT-M0_9 model, the Neck module, and the detection module.
[0061] The RepViT-M0_9 model consists of a stem module (layer 0), RepViTBlock modules (layers 1-26), and an SPPF module connected in sequence. The output of the RepViTBlock module (layer 3) is used as feature map P2, the output of the RepViTBlock module (layer 7) is used as feature map P3, the output of the RepViTBlock module (layer 23) is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are then fused in the Neck module of the YOLOv8n model.
[0062] As shown in Figure 2, the stem module is responsible for initial feature extraction and downsampling, including:
[0063] (1) Convolution + Batch Normalization
[0064] By using a 3×3 convolutional layer with a stride of 2, the input channels are expanded from three channels of RGB to input_channel / 2, and the spatial dimension of the input is reduced by half before batch normalization is performed.
[0065] (2) GELU activation function
[0066] By using the GELU activation function, the model has a strong ability to extract nonlinear features in the initial stage, thereby helping to capture complex patterns and local features.
[0067] The GELU activation function is:
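[0068] $\mathrm{GELU}(x) = x\,\Phi(x)$
[0069] Where Φ(x) is the standard normal cumulative distribution function of the input, and the specific formula is:
[0070] $\Phi(x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$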
[0071] Here, erf is the error function used to calculate the Gaussian integral of x, which gives GELU the properties of being smooth and non-monotonic.
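The GELU definition above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `phi` and `gelu` are illustrative and do not come from the patent.

```python
import math

def phi(x: float) -> float:
    """Standard normal cumulative distribution function, written with erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """GELU(x) = x * Phi(x)."""
    return x * phi(x)

# On the negative axis the value first falls and then rises back toward 0,
# which is the non-monotonic behaviour mentioned in the text.
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  GELU(x)={gelu(x):+.4f}")
```

For large positive inputs GELU approaches the identity, and for large negative inputs it approaches zero, so it behaves like a smoothed ReLU.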
[0072] (3) Convolution + Batch Normalization
[0073] By using a 3×3 convolutional layer with a stride of 2, the number of channels is further expanded to input_channel, and the spatial dimension is halved again.
[0074] The stem module extracts low-level features through two 3×3 convolutions, enabling the model to capture preliminary edge and texture information and laying the foundation for feature extraction in subsequent layers. Each convolution operation halves the spatial size of the input, significantly reducing the computational cost of subsequent layers. This is particularly important in deep networks, helping to improve model efficiency. The GELU activation function is used after the first convolution, allowing the model to capture non-linear features during low-level feature extraction and enhancing its feature representation capabilities from an early stage.
[0075] The RepViTBlock module is the core component of the RepViT-M0_9 model, which consists of a series of module configurations. The parameters of each RepViTBlock module include kernel size, expansion factor, number of output channels, Squeeze-and-Excitation (SE) module, and stride, which can be flexibly configured according to the cfgs parameter.
[0076] The RepViT-M0_9 model includes 26 layers of RepViTBlock modules. Each item in the configuration array cfgs of the RepViT-M0_9 model has the format [k,t,c,SE,s], representing:
[0077] · k: kernel size
[0078] · t: expansion factor
[0079] · c: number of output channels
[0080] · SE: whether the SE module is enabled (1 indicates enabled, 0 indicates disabled)
[0081] · s: stride
[0082] The specific parameter configurations for the RepViTBlock modules in layers 1-26 are shown in Figure 3.
[0083] The RepViTBlock module has two basic structures, both of which consist of two parts: token_mixer and channel_mixer.
[0084] When the stride parameter in the cfgs configuration is 1, the token_mixer is a residual structure, as shown in Figure 4; when the parameter is 2, the token_mixer is a linear structure, as shown in Figure 5.
[0085] In the cfgs configuration, the SE module is enabled when the SE parameter is 1, and disabled when the parameter is 0.
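The per-layer switches described in the text can be collected into a small table. The sketch below is illustrative only: it encodes the stride, SE, and GELU facts stated in this description, while the channel widths and other values of Figure 3 are not reproduced here and are deliberately left out. The names `layer_plan`, `DOWNSAMPLE_LAYERS`, `SE_LAYERS`, and `FIRST_GELU_LAYER` are hypothetical.

```python
# Facts taken from the description; channel widths (Figure 3) are omitted.
DOWNSAMPLE_LAYERS = {4, 8, 24}                      # layers with stride 2
SE_LAYERS = {1, 5, 9, 11, 13, 15, 17, 19, 21, 25}   # layers with the SE flag set to 1
FIRST_GELU_LAYER = 8                                # layers 1-7 skip GELU in channel_mixer

def layer_plan(num_layers: int = 26):
    """Build one record per RepViTBlock layer from the stated switches."""
    plan = []
    for i in range(1, num_layers + 1):
        stride = 2 if i in DOWNSAMPLE_LAYERS else 1
        plan.append({
            "layer": i,
            "stride": stride,
            "use_se": i in SE_LAYERS,
            "use_gelu": i >= FIRST_GELU_LAYER,
            # stride 1 selects the residual token_mixer, stride 2 the linear one
            "token_mixer": "linear" if stride == 2 else "residual",
        })
    return plan

plan = layer_plan()
for row in plan[:8]:
    print(row)
```

Listing the switches this way makes it easy to see that, under the stated layer numbers, none of the three stride-2 layers has the SE module enabled.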
[0086] As shown in Figure 4, the structure of the RepViTBlock module (stride=1) in layers 1-3, 5-7, 9-23, and 25-26 includes:
[0087] (1) token_mixer, including the RepVGGDW module and the SE module;
[0088] RepVGGDW module:
[0089] The input feature map is convolved with 3×3 depthwise convolution with stride 1 and 1×1 depthwise convolution to extract spatial features. The extracted spatial features are then added together to obtain feature map A1.
[0090] Spatial features are extracted using 3×3 depthwise convolutions and 1×1 depthwise convolutions, along with residual connections and batch normalization, ensuring the consistency of features within each channel.
[0091] SE module (optional):
[0092] The feature map A1 is input into the SE module for processing to obtain the channel attention map A2.
[0093] The SE module generates channel-level attention weights to enhance the response of important channels.
[0094] As shown in Figure 6, the SE module includes:
[0095] After performing global average pooling on the input feature map, it is fed into a 1×1 convolutional layer for dimensionality reduction. After batch normalization and ReLU activation, it is fed into a 1×1 convolutional layer again for dimensionality increase. Then, the sigmoid activation function is used to generate the weights for each channel. The generated weights are multiplied by the original input feature map to obtain the channel attention map.
[0096] More specifically, the SE module includes:
[0097] Squeeze operation: Performs global average pooling on the input x to obtain the global statistics x_se for each channel;
[0098] Excitation operation: The number of channels is compressed to the reduced number of channels through a 1×1 convolution, and then normalized and activated to extract the reduced channel features; then, the number of channels is restored to the original number of channels through a 1×1 convolution.
[0099] Channel weighting: Finally, a sigmoid activation function is used to generate weights for each channel. These weights are then multiplied by the original input feature map to obtain the channel attention map. This increases the network's focus on important features and reduces the influence of irrelevant features.
[0100] The SE module generates attention weights based on global information for each channel, enabling the model to adaptively adjust the response intensity of different channels, thereby enhancing the focus on key features. By reducing the number of channels and using 1×1 convolution operations, the SE module achieves significant performance improvements with relatively low computational cost. The SE module also supports custom activation functions, normalization layers, and max pooling settings, facilitating optimization across different tasks.
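The squeeze, excitation, and channel-weighting steps can be sketched on a tiny feature map. This is a simplified illustration, not the patented implementation: batch normalization is treated as an identity, the weights are arbitrary example values, and the names `se_module`, `w_reduce`, and `w_expand` are hypothetical. Because the pooled map is 1×1, each 1×1 convolution reduces to a matrix-vector product.

```python
import math

def se_module(x, w_reduce, w_expand):
    """x: C x H x W nested lists; w_reduce: Cr x C; w_expand: C x Cr."""
    # Squeeze: global average pooling gives one statistic per channel.
    x_se = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: reduce channels, apply ReLU (batch normalization assumed identity).
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x_se))) for row in w_reduce]
    # Restore the original channel count.
    logits = [sum(w * v for w, v in zip(row, hidden)) for row in w_expand]
    # Channel weighting: sigmoid weights rescale every channel of the input.
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    out = [[[weights[c] * v for v in row] for row in ch] for c, ch in enumerate(x)]
    return out, weights

x = [[[1.0, 2.0], [3.0, 4.0]],      # channel 0
     [[0.0, 0.0], [0.0, 8.0]]]      # channel 1
w_reduce = [[0.5, 0.5]]             # 2 channels -> 1 reduced channel
w_expand = [[1.0], [-1.0]]          # 1 reduced channel -> 2 channels
out, weights = se_module(x, w_reduce, w_expand)
print(weights)
```

Each weight lies strictly between 0 and 1, so the module can only attenuate or preserve a channel relative to the others; it never changes the spatial layout of the feature map.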
[0101] When stride=1, the RepViTBlock module is used to extract spatial features while maintaining feature consistency within channels, strengthening the focus on key channels, and concentrating the model's attention on important features, thereby improving overall performance.
[0102] (2) channel_mixer
[0103] As shown in Figure 4, the structure of channel_mixer is as follows:
[0104] (a) Pointwise convolution:
[0105] Conv2d_BN(oup,2*oup,1,1,0) is used, which is a 1×1 convolution used to expand features in the channel dimension.
[0106] (b) Activation function (GELU) (optional) (not available in the first 7 layers)
[0107] Nonlinear activation functions are used to introduce nonlinearity to enhance expressive power.
[0108] (c) Pointwise convolution:
[0109] Using Conv2d_BN(2*oup,oup,1,1,0,bn_weight_init=0), the number of channels is restored to the original number of output channels, and the final enhanced features are output.
[0110] The specific steps are as follows: the channel attention map A2 is increased in dimension by a 1×1 pointwise convolution operation, and then reduced in dimension by a 1×1 pointwise convolution operation after the GELU activation function operation to obtain the feature map A3; the channel attention map A2 and the feature map A3 are added together to obtain the feature map A4.
[0111] The channel_mixer first expands the number of channels to the hidden dimension, which is twice the number of output channels (2*oup), and then reduces it back to the number of output channels to capture more inter-channel interaction information. It introduces non-linearity through an activation function to enhance the model's ability to distinguish features. A residual connection adds the input to the output, so that stronger feature representation is introduced while the stability of the feature information is maintained.
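A per-pixel sketch of the channel_mixer may help make the expand, activate, reduce, and add sequence concrete. It is a simplification under stated assumptions: each 1×1 convolution is written as a matrix-vector product on the channel vector of a single pixel, the first batch normalization is omitted, and the second one is reduced to a single scale factor `bn_scale` (a hypothetical stand-in for the weight that `bn_weight_init=0` initializes to zero). All weights are arbitrary example values.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(w, v):
    """A 1x1 convolution at one pixel is a matrix-vector product over channels."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]

def channel_mixer(v, w_expand, w_reduce, bn_scale, use_gelu=True):
    """v: channel vector of one pixel, length oup."""
    hidden = matvec(w_expand, v)                  # oup -> 2*oup
    if use_gelu:                                  # skipped in the first 7 layers
        hidden = [gelu(h) for h in hidden]
    out = [bn_scale * o for o in matvec(w_reduce, hidden)]  # 2*oup -> oup
    return [a + b for a, b in zip(v, out)]        # residual connection

v = [1.0, -2.0]
w_expand = [[0.5, 0.1], [-0.3, 0.2], [0.0, 1.0], [1.0, 0.0]]   # 4 x 2
w_reduce = [[0.2, -0.1, 0.4, 0.0], [0.0, 0.3, -0.2, 0.5]]      # 2 x 4

at_init = channel_mixer(v, w_expand, w_reduce, bn_scale=0.0)
trained = channel_mixer(v, w_expand, w_reduce, bn_scale=1.0)
print(at_init, trained)
```

With the scale at zero the branch contributes nothing and the block passes its input through unchanged, which is one common reason for zero-initializing the last normalization layer of a residual branch.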
[0112] As shown in Figure 5, the RepViTBlock modules (stride=2) in layers 4, 8, and 24 include:
[0113] (1) token_mixer
[0114] Conv2d_BN (Depthwise Convolution):
[0115] Feature map B1 is obtained by performing a 3×3 depthwise convolution with a stride of 2 on the input feature map. This depthwise convolution achieves downsampling and expands the receptive field, thus reducing the feature map resolution.
[0116] SE module (optional):
[0117] The structure is the same as the SE module mentioned above.
[0118] Input feature map B1 into the Squeeze-and-Excitation module to obtain channel attention map B2.
[0119] Conv2d_BN (pointwise convolution):
[0120] The channel attention map B2 is processed by a 1×1 pointwise convolution operation to obtain the feature map B3. The output is then mapped to a new channel space through pointwise convolution, further integrating the features.
[0121] This structure reduces computational cost through downsampling and enlarges the receptive field, which strengthens global feature extraction and helps the model capture key features in important regions. While reducing computational cost, it also enhances the model's ability to express and focus on spatial and channel features.
[0122] (2) channel_mixer
[0123] The structure is the same as the aforementioned channel_mixer structure.
[0124] Feature map B3 is increased in dimensionality by a 1×1 pointwise convolution, then reduced in dimensionality by a 1×1 pointwise convolution after the GELU activation function operation, resulting in feature map B4; feature map B3 and feature map B4 are added together to obtain feature map B5.
[0125] Among them, the SE module is enabled in the RepViTBlock modules of layers 1, 5, 9, 11, 13, 15, 17, 19, 21, and 25, while the SE module is not enabled in the RepViTBlock modules of the remaining layers.
[0126] Figure 7 is a structural diagram of the Neck module of the present invention. The output of the 3rd layer RepViTBlock module is P2, the output of the 7th layer RepViTBlock module is P3, the output of the 23rd layer RepViTBlock module is P4, and the output of the SPPF module is P5.
[0127] Feature maps P2, P3, P4, and P5 are fused in the Neck module, including:
[0128] The feature map P5 is input into the first upsampling layer, and the feature map Q1 is obtained by upsampling the feature map P5 through the first upsampling layer.
[0129] After concatenating feature map Q1 and feature map P4 through the first Concat layer, the concatenation is sent to the first C2f module. The first C2f module outputs feature map Q2, and then upsamples feature map Q2 through the second upsampling layer to obtain feature map Q3.
[0130] After concatenating feature map Q3 and feature map P3 through the second Concat layer, the concatenation is sent to the second C2f module, which outputs feature map Q4. The feature map Q5 is then obtained by upsampling feature map Q4 through the third upsampling layer.
[0131] After concatenating feature map Q5 and feature map P2 through the third Concat layer, the concatenation is sent to the third C2f module. The third C2f module outputs the detection feature map E1 and sends it to the first detection head. At the same time, the detection feature map E1 is passed through the first convolutional layer to generate feature map Q6.
[0132] After concatenating feature map Q6 and feature map Q4 through the fourth Concat layer, the concatenation is sent to the fourth C2f module. The fourth C2f module outputs the detection feature map E2 and sends it to the second detection head. At the same time, the detection feature map E2 is passed through the second convolutional layer to generate feature map Q7.
[0133] After concatenating feature map Q7 and feature map Q2 through the fifth Concat layer, the concatenation is sent to the fifth C2f module. The fifth C2f module outputs the detection feature map E3 and sends it to the third detection head. At the same time, the detection feature map E3 is passed through the third convolutional layer to generate feature map Q8.
[0134] After concatenating feature map Q8 and feature map P5 through the sixth Concat layer, the concatenation is sent to the sixth C2f module, which then outputs the detection feature map E4 and sends it to the fourth detection head.
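The fusion order above can be traced by following only the downsampling factor (stride) of each feature map. The sketch below assumes that every upsampling layer doubles the resolution, that the first to third convolutional layers use a stride of 2, and that C2f modules keep the spatial size; these values are assumptions in the style of YOLOv8 and are not stated in this passage. The helper names are hypothetical.

```python
def up(stride):
    """2x upsampling halves the downsampling factor (assumed)."""
    return stride // 2

def down(stride):
    """A stride-2 convolution doubles the downsampling factor (assumed)."""
    return stride * 2

def concat(a, b):
    """Concatenation along channels requires equal spatial size."""
    if a != b:
        raise ValueError(f"cannot concatenate maps with strides {a} and {b}")
    return a  # the following C2f module keeps the spatial size

def neck(P2=4, P3=8, P4=16, P5=32):
    Q1 = up(P5)
    Q2 = concat(Q1, P4); Q3 = up(Q2)
    Q4 = concat(Q3, P3); Q5 = up(Q4)
    E1 = concat(Q5, P2); Q6 = down(E1)   # first detection head
    E2 = concat(Q6, Q4); Q7 = down(E2)   # second detection head
    E3 = concat(Q7, Q2); Q8 = down(E3)   # third detection head
    E4 = concat(Q8, P5)                  # fourth detection head
    return {"E1": E1, "E2": E2, "E3": E3, "E4": E4}

heads = neck()
print(heads)
```

Every concatenation in the described order pairs two maps of the same stride, so the top-down and bottom-up paths line up without any extra resizing.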
[0135] The backbone network of the original YOLOv8n model performs five downsampling operations, yielding five feature layers (P1, P2, P3, P4, and P5). Its detection heads use the feature maps from three of these layers (P3, P4, and P5), with scales of 80×80, 40×40, and 20×20 pixels, respectively. This invention adds a new detection head to the YOLOv8n model that is derived from the P2 layer features; the P2 detection head feature map has a resolution of 160×160 pixels.
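The quoted resolutions follow from halving a 640×640 input once per downsampling step. A minimal arithmetic sketch (the function name `feature_map_sizes` is illustrative):

```python
def feature_map_sizes(input_size=640, levels=(2, 3, 4, 5)):
    """Level Pk has been downsampled k times by a factor of 2 (stride 2**k)."""
    return {f"P{k}": input_size // 2 ** k for k in levels}

sizes = feature_map_sizes()
print(sizes)  # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}

# Number of grid positions per head: the P2 head alone has more positions
# than the three original heads combined.
cells = {name: s * s for name, s in sizes.items()}
print(cells["P2"], cells["P3"] + cells["P4"] + cells["P5"])  # 25600 8400
```

The finer P2 grid is what gives small faces more candidate positions, at the cost of a larger detection feature map.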
[0136] Furthermore, the original YOLOv8n network model uses CIoU as its bounding box loss function. This function considers the center-point distance and the diagonal distance of the bounding boxes, making it better suited for object detection tasks of various shapes. However, it does not directly capture the shape difference between the predicted box and the target box, which causes the predicted box to keep expanding during regression, and it does not consider the distribution of easy and difficult samples.
[0137] To address this issue, this invention changes the bounding box loss function of the YOLOv8n network model to FPIoU2. Through a dynamic focusing mechanism and a size penalty term, FPIoU2 addresses three problems of CIoU regression: bounding box enlargement, bounding box shape differences, and the regression balance between high-quality and low-quality samples. In particular, by using a non-monotonic attention function and dynamically adjusting the values of d and u, FPIoU2 can adjust the weights of samples of different difficulty levels, thereby improving the performance of the object detection model.
[0138] The bounding box loss function FPIoU2 is:
[0139]
[0140]
[0141]
[0142] $q = e^{-p},\ q \in (0, 1]$
[0143] Here, L_FPIoU2 is the loss function; λ is a hyperparameter; q is the parameter used to calculate the non-monotonic attention weight in the loss function; and IoU^Focaler is the intersection over union (IoU) under the Focaler mechanism. p is a geometric penalty term that measures the edge distances between the predicted bounding box and the target bounding box, normalizes them by the target bounding box size, and thereby increases the penalty for predicted boxes with large size differences. IoU is the ratio of the intersection area to the union area of the predicted and ground-truth bounding boxes. d and u are preset thresholds that control the sensitivity of the Focaler mechanism, with [d,u]∈[0,1]; adjusting d and u makes IoU^Focaler focus on regression samples of different difficulty. dw1, dw2, dh1, and dh2 are the absolute values of the distances between corresponding edges of the predicted bounding box and the target bounding box, and w_gt and h_gt are the width and height of the target bounding box, respectively. Because the penalty term uses the target bounding box size as its denominator and the non-monotonic attention function is controlled by the hyperparameter λ, the loss avoids enlarging the detection box and adapts better to the target size.
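The formulas for FPIoU2 are not legible in this text, so the following is only a sketch built from the symbols defined above. The individual pieces follow the commonly published forms of Focaler-IoU (a linear remapping of IoU between d and u) and PIoU v2 (an edge-distance penalty p, q = e^{-p}, and a non-monotonic attention factor); the way `fpiou2_sketch` combines them, and the default values of `d`, `u`, and `lam`, are assumptions and should not be read as the patent's exact formula.

```python
import math

def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def focaler_iou(iou, d=0.0, u=0.95):
    """Remap IoU linearly onto [0, 1] between the thresholds d and u."""
    return min(1.0, max(0.0, (iou - d) / (u - d)))

def penalty_p(pred, gt):
    """Edge distances normalized by the target box width and height."""
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]
    dw1, dw2 = abs(pred[0] - gt[0]), abs(pred[2] - gt[2])
    dh1, dh2 = abs(pred[1] - gt[1]), abs(pred[3] - gt[3])
    return (dw1 / w_gt + dw2 / w_gt + dh1 / h_gt + dh2 / h_gt) / 4.0

def attention(q, lam=1.3):
    """Non-monotonic attention factor controlled by the hyperparameter lam."""
    x = lam * q
    return 3.0 * x * math.exp(-x * x)

def fpiou2_sketch(pred, gt, d=0.0, u=0.95, lam=1.3):
    """Assumed combination: attention-weighted PIoU loss plus a Focaler term."""
    iou = iou_xyxy(pred, gt)
    p = penalty_p(pred, gt)
    q = math.exp(-p)                                   # q in (0, 1]
    piou_loss = (1.0 - iou) + (1.0 - math.exp(-p * p))
    return attention(q, lam) * piou_loss + iou - focaler_iou(iou, d, u)

gt = (0.0, 0.0, 10.0, 10.0)
print(fpiou2_sketch(gt, gt))                       # a perfect match gives zero loss
print(fpiou2_sketch((1.0, 0.0, 11.0, 10.0), gt))   # a shifted box gives a positive loss
```

Because p is normalized by the target box size rather than by an enclosing box, enlarging the predicted box cannot reduce the penalty, which is the property the text attributes to the size penalty term.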
[0144] After the model is built, it is trained using a dataset. First, the images in the dataset are preprocessed.
[0145] Preprocessing includes LetterBox scaling and Mosaic data augmentation.
[0146] (1) LetterBox scaling
[0147] The default image input size of YOLOv8n is 640×640, so input images must be resized to this standard size. However, a plain resize may distort the image, so LetterBox scaling is used: the image is scaled proportionally and the remaining area is filled with a background color. After scaling, the resulting image size is 640×640×3.
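The proportional scaling and padding can be sketched as a small geometry calculation. This is an illustrative helper rather than the implementation used in the invention; the name `letterbox_params` and the choice to center the image with equal padding on both sides are assumptions.

```python
def letterbox_params(width, height, target=640):
    """Return the scale, resized size, and padding for LetterBox scaling."""
    # One scale for both axes keeps the aspect ratio and avoids distortion.
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # The remaining area is filled with the background color.
    pad_w, pad_h = target - new_w, target - new_h
    left, top = pad_w // 2, pad_h // 2
    return scale, (new_w, new_h), (left, top, pad_w - left, pad_h - top)

scale, resized, padding = letterbox_params(1280, 720)
print(scale, resized, padding)  # 0.5 (640, 360) (0, 140, 0, 140)
```

Ground-truth boxes must be transformed with the same scale and offsets so that they stay aligned with the padded image.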
[0148] (2) Mosaic data augmentation
[0149] In the dataset, four images are randomly selected and stitched together using random scaling, random cropping, and random arrangement.
[0150] Mosaic data augmentation takes four random images, scales them randomly, and then arranges and stitches them together at random. This greatly enriches the detection dataset; in particular, random scaling adds many small targets, which makes the network more robust. Because each stitched sample carries data from four images at once, good results can be achieved without a large mini-batch size.
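The stitching step of Mosaic augmentation can be sketched in plain Python. This is a simplified illustration, not the training pipeline used here: it only picks a random split point, derives the four canvas regions, and maps a bounding box from a scaled source image into its region. The function names, the range of the split point, and the rule for discarding clipped boxes are all assumptions.

```python
import random

def mosaic_layout(canvas=640, rng=None):
    """Pick a random split point and return four (x1, y1, x2, y2) regions."""
    rng = rng or random.Random()
    # Assumption: keep the split point away from the canvas borders.
    cx = rng.randint(canvas // 4, 3 * canvas // 4)
    cy = rng.randint(canvas // 4, 3 * canvas // 4)
    return [
        (0, 0, cx, cy),            # top-left image
        (cx, 0, canvas, cy),       # top-right image
        (0, cy, cx, canvas),       # bottom-left image
        (cx, cy, canvas, canvas),  # bottom-right image
    ]

def place_box(box, scale, region):
    """Scale a source-image box, shift it into a region, and clip it.

    Returns None when less than one pixel of the box survives the crop.
    """
    x1, y1, x2, y2 = box
    rx1, ry1, rx2, ry2 = region
    nx1 = min(max(x1 * scale + rx1, rx1), rx2)
    ny1 = min(max(y1 * scale + ry1, ry1), ry2)
    nx2 = min(max(x2 * scale + rx1, rx1), rx2)
    ny2 = min(max(y2 * scale + ry1, ry1), ry2)
    if nx2 - nx1 < 1 or ny2 - ny1 < 1:
        return None
    return (nx1, ny1, nx2, ny2)

if __name__ == "__main__":
    regions = mosaic_layout(640, random.Random(0))
    print(regions)
    print(place_box((10, 10, 50, 50), 0.5, regions[1]))
```

A scale factor below 1 shrinks the faces in the source image, which is how the augmentation produces additional small targets.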
[0151] Then, the object detection model is trained using the preprocessed dataset to obtain the final model.
[0152] To verify the performance of the model on the face detection task, this invention compares the evaluation results of existing object detection algorithms on the WiderFace dataset validation set. The comparison methods include ACF-WIDER, Two-stageCNN, LDCF+, CMS-RCNN, and SSH (evaluation results data are from the WiderFace website). Furthermore, the evaluation results of YOLOv8n and the object detection model of this invention were obtained under the same experimental environment. The evaluation comparison results are shown in Table 1.
[0153] Table 1
[0156] As shown in Table 1, the model of this invention significantly improves detection accuracy compared to the original YOLOv8n, with an improvement of 1.4% in Easy mAP, 2.1% in Medium mAP, and 6.3% in Hard mAP. Compared to other methods, the model of this invention also exhibits superior performance. This advantage is mainly attributed to the well-designed YOLOv8n model and the specific improvements of this invention for small-sized face detection environments.
[0157] The detection results on the test set are shown in Figure 8. The comparison between the algorithms before and after the improvement demonstrates a significant performance enhancement. In complex scenes containing numerous small targets, the model of this invention exhibits higher detection accuracy. Specifically, it captures small target faces more accurately and significantly reduces false positives and false negatives, especially in crowded or occluded environments. Comparing the number of faces detected by the two methods, the original YOLOv8n model detected 227 targets, while the model of this invention detected 300 targets, indicating a significant improvement in both detection accuracy and quantity. This improvement provides more reliable support for the practical application of face detection in complex scenarios.
[0158] The present invention also provides a face detection device, comprising:
[0159] The face detection module is used to input the image to be detected into a pre-trained target detection model for detection and to obtain face recognition results;
[0160] The object detection model is obtained by replacing the original Backbone module with the RepViT-M0_9 model on the basis of the YOLOv8n model for feature extraction, and replacing the bounding box loss function CIoU of the YOLOv8n model with the loss function FPIoU2. The RepViT-M0_9 model consists of a stem module, RepViTBlock modules of layers 1-26, and an SPPF module connected in sequence. The output of the RepViTBlock module of layer 3 is used as feature map P2, the output of the RepViTBlock module of layer 7 is used as feature map P3, the output of the RepViTBlock module of layer 23 is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are fused in the Neck module of the YOLOv8n model.
[0161] The face detection device also includes:
[0162] The preprocessing module is used to preprocess the image to be detected.
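The backbone layout described in [0160] can be turned into a short worked example of feature-map resolutions. The sketch below uses the two stride-2 convolutions of the stem (claim 2) and the stride-2 RepViTBlock modules at layers 4, 8, and 24 (claim 3). It assumes that all other blocks and the SPPF module preserve spatial resolution, which is not stated explicitly in the text; the function names are hypothetical.

```python
def backbone_strides(num_blocks=26, downsample_layers=(4, 8, 24), stem_stride=4):
    """Cumulative stride after each RepViTBlock layer (1-indexed)."""
    stride = stem_stride  # two stride-2 convolutions in the stem
    strides = {}
    for layer in range(1, num_blocks + 1):
        if layer in downsample_layers:
            stride *= 2  # stride-2 depthwise convolution halves resolution
        strides[layer] = stride
    return strides

def feature_map_sizes(input_size=640):
    """Spatial size of P2-P5 for a square input."""
    s = backbone_strides()
    # P5 comes from SPPF after layer 26; assumed to keep layer 26's stride.
    taps = {"P2": s[3], "P3": s[7], "P4": s[23], "P5": s[26]}
    return {name: input_size // stride for name, stride in taps.items()}

if __name__ == "__main__":
    print(feature_map_sizes(640))
```

Under these assumptions a 640×640 input yields P2 at 160×160, P3 at 80×80, P4 at 40×40, and P5 at 20×20, so the added P2 level gives the Neck a map with four times as many positions per side as P4 for locating small faces.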
[0163] The present invention provides a computing device including a processor and a memory, wherein the processor is configured to execute a computer program stored in the memory to implement the steps of the face detection method according to any one of the first aspects.
[0164] The present invention provides a computer-readable storage medium storing a computer program thereon, wherein when the computer program is executed by a processor, it implements the steps of the face detection method as described in any of the first aspects.
[0165] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0166] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0167] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0168] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0169] The present invention has been disclosed above with reference to preferred embodiments, but it is not intended to limit the present invention. All technical solutions obtained by adopting equivalent substitutions or equivalent transformations fall within the protection scope of the present invention.
Claims
1. A face detection method, characterized in that it comprises: inputting the image to be detected into a pre-trained object detection model for detection to obtain the face recognition result; wherein the object detection model is obtained by replacing the original Backbone module with the RepViT-M0_9 model on the basis of the YOLOv8n model for feature extraction, and replacing the bounding box loss function CIoU of the YOLOv8n model with the loss function FPIoU2. The RepViT-M0_9 model consists of a stem module, RepViTBlock modules of layers 1-26, and an SPPF module connected in sequence. The output of the RepViTBlock module of layer 3 is used as feature map P2, the output of the RepViTBlock module of layer 7 is used as feature map P3, the output of the RepViTBlock module of layer 23 is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are fused in the Neck module of the YOLOv8n model.
2. The face detection method according to claim 1, characterized in that, The stem module includes: The input image to be detected is convolved by a 3×3 convolutional layer with a stride of 2, so that the number of output channels is half the number of input channels. After batch normalization and GELU activation function operation, it is convolved again by a 3×3 convolutional layer with a stride of 2 to restore the number of output channels to the original number of input channels. After batch normalization, the low-level feature map is obtained.
3. The face detection method according to claim 1, characterized in that, in the RepViTBlock modules of layers 1-3, 5-7, 9-23, and 25-26: the input feature map is convolved by a 3×3 depthwise convolutional layer with a stride of 1 and by a 1×1 depthwise convolutional layer to extract spatial features, and the extracted spatial features are added together to obtain feature map A1; feature map A1 is input into the Squeeze-and-Excitation module for processing to obtain channel attention map A2; channel attention map A2 is increased in dimensionality through a 1×1 pointwise convolution, passed through a GELU activation function, and then reduced in dimensionality through a 1×1 pointwise convolution to obtain feature map A3, except that in the RepViTBlock modules of layers 1-3 and 5-7, channel attention map A2 is increased in dimensionality through a 1×1 pointwise convolution and then reduced in dimensionality directly through a 1×1 pointwise convolution, without a GELU activation function, to obtain feature map A3; and channel attention map A2 is added to feature map A3 to obtain feature map A4. In the RepViTBlock modules of layers 4, 8, and 24: the input feature map is convolved by a 3×3 depthwise convolutional layer with a stride of 2 to obtain feature map B1; feature map B1 is input into the Squeeze-and-Excitation module for processing to obtain channel attention map B2; channel attention map B2 is processed by a 1×1 pointwise convolution to obtain feature map B3; feature map B3 is increased in dimensionality through a 1×1 pointwise convolution, passed through a GELU activation function, and then reduced in dimensionality through a 1×1 pointwise convolution to obtain feature map B4, except that in the RepViTBlock module of layer 4, feature map B3 is increased in dimensionality through a 1×1 pointwise convolution and then reduced in dimensionality directly through a 1×1 pointwise convolution, without a GELU activation function, to obtain feature map B4; and feature map B3 is added to feature map B4 to obtain feature map B5. The Squeeze-and-Excitation module is enabled in the RepViTBlock modules of layers 1, 5, 9, 11, 13, 15, 17, 19, 21, and 25, and is not enabled in the RepViTBlock modules of the remaining layers.
4. The face detection method according to claim 3, characterized in that, The Squeeze-and-Excitation module includes: After performing global average pooling on the input feature map, it is fed into a 1×1 convolutional layer for dimensionality reduction. After batch normalization and ReLU activation, it is fed into a 1×1 convolutional layer again for dimensionality increase. Then, the sigmoid activation function is used to generate the weights for each channel. The generated weights are multiplied by the original input feature map to obtain the channel attention map.
5. The face detection method according to claim 1, characterized in that, The feature fusion of feature maps P2, P3, P4, and P5 in the Neck module of the YOLOv8n model includes: The feature map P5 is input into the first upsampling layer, and the feature map Q1 is obtained by upsampling the feature map P5 through the first upsampling layer. After concatenating feature map Q1 and feature map P4 through the first Concat layer, the concatenation is sent to the first C2f module. The first C2f module outputs feature map Q2, and then upsamples feature map Q2 through the second upsampling layer to obtain feature map Q3. After concatenating feature map Q3 and feature map P3 through the second Concat layer, the concatenation is sent to the second C2f module, which outputs feature map Q4. The feature map Q5 is then obtained by upsampling feature map Q4 through the third upsampling layer. After concatenating feature map Q5 and feature map P2 through the third Concat layer, the concatenation is sent to the third C2f module. The third C2f module outputs the detection feature map E1 and sends it to the first detection head. At the same time, the detection feature map E1 is passed through the first convolutional layer to generate feature map Q6. After concatenating feature map Q6 and feature map Q4 through the fourth Concat layer, the concatenation is sent to the fourth C2f module. The fourth C2f module outputs the detection feature map E2 and sends it to the second detection head. At the same time, the detection feature map E2 is passed through the second convolutional layer to generate feature map Q7. After concatenating feature map Q7 and feature map Q2 through the fifth Concat layer, the concatenation is sent to the fifth C2f module. The fifth C2f module outputs the detection feature map E3 and sends it to the third detection head. At the same time, the detection feature map E3 is passed through the third convolutional layer to generate feature map Q8. 
After concatenating feature map Q8 and feature map P5 through the sixth Concat layer, the concatenation is sent to the sixth C2f module, which then outputs the detection feature map E4 and sends it to the fourth detection head.
6. The face detection method according to claim 1, characterized in that the loss function FPIoU2 is: $q = e^{-p},\quad q \in (0, 1]$; wherein $L_{FPIoU2}$ is the loss function, $\lambda$ is a hyperparameter, $q$ is the parameter used to calculate the non-monotonic attention weight in the loss function, $IoU^{Focaler}$ is the intersection over union (IoU) under the Focaler mechanism, and $p$ is the geometric penalty term; $IoU$ is the ratio of the intersection area to the union area of the predicted and ground-truth bounding boxes; $d$ and $u$ are preset thresholds with $[d, u] \in [0, 1]$, and by adjusting $d$ and $u$, $IoU^{Focaler}$ is made to focus on regression samples of different difficulty; $dw_1$, $dw_2$, $dh_1$, and $dh_2$ are the absolute values of the distances between corresponding edges of the predicted and target bounding boxes, and $w_{gt}$ and $h_{gt}$ are the width and height of the target bounding box, respectively. The penalty term uses the target bounding box size as the denominator, and the non-monotonic attention function is controlled by the hyperparameter $\lambda$.
7. The face detection method according to claim 1, characterized in that, The image to be detected is a preprocessed image, and the preprocessing includes scaling the image to be detected using LetterBox scaling.
8. A face detection device, characterized in that it comprises: a face detection module, used to input the image to be detected into a pre-trained object detection model for detection and to obtain the face recognition result; wherein the object detection model is obtained by replacing the original Backbone module with the RepViT-M0_9 model on the basis of the YOLOv8n model for feature extraction, and replacing the bounding box loss function CIoU of the YOLOv8n model with the loss function FPIoU2. The RepViT-M0_9 model consists of a stem module, RepViTBlock modules of layers 1-26, and an SPPF module connected in sequence. The output of the RepViTBlock module of layer 3 is used as feature map P2, the output of the RepViTBlock module of layer 7 is used as feature map P3, the output of the RepViTBlock module of layer 23 is used as feature map P4, and the output of the SPPF module is used as feature map P5. Feature maps P2, P3, P4, and P5 are fused in the Neck module of the YOLOv8n model.
9. A computing device, characterized in that it includes a processor and a memory, the processor being used to execute a computer program stored in the memory to implement the steps of the face detection method as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, it implements the steps of the face detection method as described in any one of claims 1 to 7.