An MDSSD face detection method based on model quantization

By constructing the MDSSD model and its quantization model MDSSD Lite, the model structure and feature map of the SSD algorithm are optimized, improving the recall rate and detection speed of small face detection, which is suitable for real-time face detection on smart terminal devices.

CN112232270B · Active · Publication Date: 2026-06-26 · GUANGXI UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
GUANGXI UNIVERSITY OF TECHNOLOGY
Filing Date
2020-10-29
Publication Date
2026-06-26


Abstract

The application discloses an MDSSD face detection method based on model quantization, comprising: calculating the integral image of an input image based on a convolutional neural network and extracting the features of all samples using feature templates of different sizes; reading the feature values of all samples and selecting the feature value with the minimum loss as the classification attribute of the first weak classifier; calculating the weights of the features for the next round and the weight of the weak classifier according to a lightweight strategy; obtaining a plurality of weak classifiers in turn and combining them into a strong classifier; and inputting the preselected positions in the candidate boxes into the strong classifier one by one for detection until all the weak classifiers confirm that a preselected position is a face, at which point the classification ends. The application establishes the MDSSD Lite lightweight model by quantizing and compressing the MDSSD face detection model; compared with SSD, the recall rate for small and blurred faces is higher, while detection speed and detection accuracy are maintained.

Description

Technical Field

[0001] This invention relates to the technical field of face detection, and in particular to a model-quantized MDSSD face detection method.

Background Technology

[0002] With the rise of deep learning, intelligent analysis technologies related to faces have become a key focus and hot topic in the field of artificial intelligence. New algorithms are constantly improving the scores of face-related tasks, and face recognition technology has now surpassed the highest human level. Simultaneously, face-related industrial applications are the most widespread. For example, applications related to face detection include intelligent security, city brains, safe driving, and China's Skynet system; applications related to face recognition include face payment, intelligent access control, face attendance, and face verification on various intelligent terminal devices. Face-related technologies are closely related to the security of various systems. At the same time, face-related technologies are increasingly being applied to all aspects of life, such as finding missing children and smart education. Furthermore, with the improvement of computer computing power and the application of 5G networks, the cost of data storage and the latency of data transmission will decrease, and face-related applications will be deployed on more and more intelligent terminals, truly realizing an intelligent society and benefiting humanity. Face detection is the process by which intelligent terminals determine whether a face exists in an input image and locate its position. The premise of face detection technology is the ability to accurately detect faces, unaffected by the background of the face image. Therefore, face detection, as the foundation and core technology for face-related tasks, has received widespread attention from researchers.

[0003] Face detection models based on the SSD algorithm can quickly and accurately identify faces in natural scene images, and the algorithm also has a high detection speed. However, the SSD face detection algorithm still has significant room for improvement in recall when detecting small faces in both natural and unnatural scenes. Therefore, a new network model, MDSSD (Mix Deconvolution Single Shot MultiBox Detector), and its quantized model, MDSSD Lite, are constructed for face detection. The MDSSD algorithm addresses several shortcomings of the SSD algorithm in face detection, including the model structure, detection feature maps, parameter configuration, and loss function. Furthermore, machine learning methods are used to configure the model to reduce human intervention, significantly improving the model's detection performance.

Summary of the Invention

[0004] The purpose of this section is to outline some aspects of embodiments of the present invention and to briefly describe some preferred embodiments. Simplifications or omissions may be made in this section, as well as in the abstract and title of this application, to avoid obscuring the purpose of these documents; however, such simplifications or omissions should not be construed as limiting the scope of the invention.

[0005] In view of the aforementioned existing problems, the present invention is proposed.

[0006] Therefore, this invention provides a model-quantized MDSSD face detection method that can solve the problems of low recall rate and low detection speed for small faces and blurred faces.

[0007] To address the aforementioned technical problems, this invention provides the following technical solution, which includes: calculating the integral image of the input image based on a convolutional neural network and extracting features from all samples using feature templates of different sizes; reading the feature values of all samples and selecting the feature value with the minimum loss as the classification attribute of the first weak classifier; calculating the weights of the features for the next round and the weight of the weak classifier according to a lightweight strategy; sequentially obtaining multiple weak classifiers and combining them into a strong classifier; and inputting the pre-selected positions within the candidate boxes into the strong classifier for detection one by one until all the weak classifiers confirm that a pre-selected position is a face, at which point the classification ends.

[0008] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the convolutional neural network includes: a convolutional layer, a pooling layer, and an activation layer; the convolutional layer includes multiple convolutional kernels, which slide with a fixed stride when inputting the image, scanning the entire image and performing discrete convolution calculations, and nonlinearly mapping the output of the convolution operation through an activation function to obtain the input features of the next layer; the pooling layer divides the obtained feature image into blocks after the convolution operation, and calculates the maximum or average value within each block to obtain the pooled image; the activation layer uses the activation function to nonlinearly map the output of the previous layer, thereby introducing nonlinearity into the network, enabling the network to capture more complex nonlinear patterns.

[0009] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the convolutional layer further includes:

[0010]

[0011]

[0012] $F_{x,y} = -I_{x-1,y-1} - 2I_{x,y-1} - I_{x+1,y-1} + I_{x-1,y+1} + 2I_{x,y+1} + I_{x+1,y+1}$

[0013] The stride of the convolution kernel k in each direction can be greater than 1. When the stride is s (s>1), the size of the output feature map is as follows:

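[0014] $\left(\left\lfloor \frac{m + 2\,\mathrm{padding} - k}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2\,\mathrm{padding} - k}{s} \right\rfloor + 1\right)$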

[0015] Where padding is the expansion, m*n is the input image size, k is the convolution kernel, I is the input image sub-image, and x and y are the coordinate values.

[0016] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the pooling layer further includes: performing pooling operations on the output feature map of the convolutional layer to compress the image size and reduce overfitting; and using max pooling and average pooling to replace the entire candidate region.

[0017] As a preferred embodiment of the model-quantized MDSSD face detection method described in this invention, the activation layer further includes:

[0018] f(x) = max(0,x)

[0019] The gradient of this activation is either 1 or 0, which mitigates the problems of gradient vanishing and gradient explosion. When the input is positive, the gradient of the activation is always 1, which greatly reduces the amount of computation during model training.

[0020] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the lightweight strategy includes: using TensorFlow to convert the floating-point parameters into integers through a linear transformation; performing computation with the converted parameters; and using a linear transformation to restore the final result to the floating-point type.

[0021]

[0022] Where r represents the original model parameter value, B represents the number of bits for quantization, q represents the quantized model parameter value, and z represents the quantized value corresponding to the real value 0 (the zero point).

[0023] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the method further includes: using Tensorflow to quantize and compress the constructed MDSSD model; after the MDSSD model has been trained, using the lightweight strategy to convert the MDSSD model parameters from 32-bit floating-point type to 8-bit integer type for storage; finally obtaining the MDSSD Lite lightweight model.

[0024] As a preferred embodiment of the model-quantized MDSSD face detection method described in this invention, the construction of the MDSSD model includes: the MDSSD algorithm using k-means to perform cluster analysis on the ground truth boxes to find the optimal number, size, and proportion of prior boxes, and using a custom IOU distance as a metric for cluster analysis.

[0025] $d_{IOU}(box, centroid) = 1 - IOU(box, centroid)$

[0026] The clustering loss is the IOU distance between the ground truth boxes and the cluster centers; the smaller the distance, the larger the IOU value. The number of clusters k is specified, and the cluster centers $(W_i, H_i)$, $i \in \{1, 2, \ldots, k\}$, are randomly initialized, where $W_i$ and $H_i$ represent the length and width of the cluster center, respectively; the cluster center and the ground truth center are placed at the origin of the coordinate system, and the IOU distance between each ground truth box and each cluster is calculated; each ground truth box is assigned to the cluster with the smallest IOU distance; after all the ground truth boxes have been assigned, the cluster centers are recalculated and updated until they no longer change; the median of each cluster center is used as the final prior box size and proportion.

[0027] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the method for calculating the integral image to extract the features includes dividing the feature template into two regions and calculating the sum of pixel values in each of the two regions, with the difference between the two sums serving as the feature value of the feature template; the integral image uses a matrix to describe global image information, and the value of each point in the integral image is equal to the sum of all pixel values above and to the left of that point, as follows.

[0028] $I(x,y) = \sum_{x' \le x,\; y' \le y} f(x', y')$

[0029] I(x,y)=f(x,y)+I(x-1,y)+I(x,y-1)-I(x-1,y-1)

[0030] Where I represents the integral image, f represents the original image, and x,y,x′,y′ represent pixel positions.

[0031] As a preferred embodiment of the MDSSD face detection method based on model quantization described in this invention, the strong classifier is obtained by: continuously adjusting the data distribution during training to reduce the weight of correctly classified samples; sequentially learning each base classifier until the number of weak classifiers reaches a predetermined value; and constructing a linear combination of classifiers using a weighted averaging strategy to obtain the strong classifier. Given training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_n$ is the feature vector of a training sample and $y_n$ is its label, taking a value of +1 or -1, each training data point is assigned an initial weight value, with all samples having equal weights.

[0032] $D_1 = (\omega_{11}, \omega_{12}, \ldots, \omega_{1i}, \ldots, \omega_{1n})$

[0033] $\omega_{1i} = \frac{1}{n}, \quad i = 1, 2, \ldots, n$

[0034] For the base classifier $G_m(x)$, the error rate on the weighted training samples is as follows:

[0035] $e_m = \sum_{i=1}^{n} \omega_{m,i}\, I(G_m(x_i) \ne y_i)$

[0036] where $I(G_m(x_i) \ne y_i)$ is an indicator function taking the value 0 or 1. The weight $\alpha_m$ of the current classifier $G_m(x)$ is then calculated as follows:

[0037] $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$

[0038] After updating the weight distribution of all training samples, the final strong classifier is as follows:

[0039] $D_{m+1} = (\omega_{m+1,1}, \omega_{m+1,2}, \ldots, \omega_{m+1,n})$

[0040] $\omega_{m+1,i} = \frac{\omega_{m,i}}{Z_m} \exp\left(-\alpha_m y_i G_m(x_i)\right)$

[0041] $Z_m = \sum_{i=1}^{n} \omega_{m,i} \exp\left(-\alpha_m y_i G_m(x_i)\right)$

[0042] $G(x) = \mathrm{sign}\left(\sum_{m} \alpha_m G_m(x)\right)$

[0043] where $Z_m$ is a normalization factor that normalizes the values of $\omega_{m,i}$ to the range [0,1], so that the sum of the weights of all samples is equal to 1. For m = 1, 2, ..., n, each weak classifier is trained in sequence according to the above steps.

[0044] The beneficial effects of this invention are as follows: This invention establishes a lightweight model, MDSSD Lite, by quantizing and compressing the MDSSD face detection model. Compared with SSD, it has a higher recall rate for small and blurred faces while maintaining detection speed and accuracy.

Description of the Drawings

[0045] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:

[0046] Figure 1 is a schematic flowchart of the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0047] Figure 2 is a schematic diagram of the convolution operation of the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0048] Figure 3 is a schematic diagram of the convolution operation with padding of the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0049] Figure 4 is a schematic diagram of pooling in the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0050] Figure 5 is a schematic diagram of the integral image of the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0051] Figure 6 is a schematic diagram of the MDSSD structure of the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0052] Figure 7 is a schematic diagram of the Wider Face dataset for the MDSSD face detection method based on model quantization according to an embodiment of the present invention;

[0053] Figure 8 is a schematic diagram comparing the PR curves of various models in the MDSSD face detection method based on model quantization according to an embodiment of the present invention.

Detailed Description

[0054] To make the above-mentioned objects, features, and advantages of the present invention more apparent and understandable, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the protection scope of the present invention.

[0055] Many specific details are set forth in the following description in order to provide a full understanding of the invention. However, the invention may also be practiced in other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the invention. Therefore, the invention is not limited to the specific embodiments disclosed below.

[0056] Secondly, the term "one embodiment" or "embodiment" as used herein refers to a specific feature, structure, or characteristic that may be included in at least one implementation of the present invention. The phrase "in one embodiment" appearing in different places in this specification does not necessarily refer to the same embodiment, nor is it a single or selective embodiment that is mutually exclusive with other embodiments.

[0057] This invention is described in detail with reference to the schematic diagrams. When detailing the embodiments of this invention, for ease of explanation, the cross-sectional views illustrating the device structure may be partially enlarged, not adhering to the usual scale. Furthermore, the schematic diagrams are merely examples and should not be construed as limiting the scope of protection of this invention. In actual fabrication, the three-dimensional spatial dimensions of length, width, and depth should be included.

[0058] Furthermore, in the description of this invention, it should be noted that the terms "upper," "lower," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. These terms are used solely for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. In addition, the terms "first," "second," or "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0059] Unless otherwise explicitly specified and limited, the terms "installation," "connection," and "joining" in this invention should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; similarly, they can refer to mechanical connections, electrical connections, or direct connections, or indirect connections through an intermediate medium, or internal connections between two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0060] Example 1

[0061] Referring to Figures 1-6, the first embodiment of the present invention provides a model-quantized MDSSD face detection method, comprising:

[0062] S1: Calculate the integral image of the input image using a convolutional neural network and extract features from all samples by setting feature templates of different sizes. It should be noted that the convolutional neural network includes:

[0063] Convolutional layers, pooling layers, and activation layers;

[0064] A convolutional layer consists of multiple convolutional kernels. Each kernel slides over the input image with a fixed stride, scanning the entire image and performing discrete convolution calculations. The output of the convolution operation is nonlinearly mapped through an activation function to obtain the input features of the next layer of the network.

[0065] After the convolution operation, the pooling layer divides the obtained feature image into blocks and calculates the maximum or average value within each block to obtain the pooled image.

[0066] The activation layer uses the activation function to perform a nonlinear mapping on the output of the previous layer, thereby introducing nonlinearity into the network and enabling the network to capture more complex nonlinear patterns.

[0067] Referring to Figure 2 and Figure 3, the convolutional layer also includes:

[0068]

[0069]

[0070] $F_{x,y} = -I_{x-1,y-1} - 2I_{x,y-1} - I_{x+1,y-1} + I_{x-1,y+1} + 2I_{x,y+1} + I_{x+1,y+1}$

[0071] The stride of the convolution kernel k in each direction can be greater than 1. When the stride is s (s>1), the size of the output feature map is as follows:

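[0072] $\left(\left\lfloor \frac{m + 2\,\mathrm{padding} - k}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2\,\mathrm{padding} - k}{s} \right\rfloor + 1\right)$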

[0073] Where padding is the expansion, m*n is the input image size, k is the convolution kernel, I is the input image sub-image, and x and y are the coordinate values;
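As an illustration of the convolution step, the following minimal Python sketch applies the 3×3 kernel implied by the expression for $F_{x,y}$ and computes the output size from the stride and padding. It assumes a square k×k kernel, zero padding, and the usual floor-division size relation; the function names `conv_output_size` and `conv2d` are hypothetical and do not come from the patent.

```python
def conv_output_size(m, n, k, stride=1, padding=0):
    """Output (height, width) for an m x n input and a k x k kernel.

    Assumption: the usual floor-division relation for strided convolution.
    """
    out_h = (m + 2 * padding - k) // stride + 1
    out_w = (n + 2 * padding - k) // stride + 1
    return out_h, out_w


# Kernel matching the expression for F_{x,y}: rows correspond to y-1, y, y+1.
KERNEL = [[-1, -2, -1],
          [0, 0, 0],
          [1, 2, 1]]


def conv2d(image, kernel, stride=1, padding=0):
    """Sliding-window convolution (cross-correlation form) on a list-of-lists image."""
    k = len(kernel)
    m, n = len(image), len(image[0])
    # Zero-pad the image on all four sides.
    padded = [[0] * (n + 2 * padding) for _ in range(m + 2 * padding)]
    for y in range(m):
        for x in range(n):
            padded[y + padding][x + padding] = image[y][x]
    out_h, out_w = conv_output_size(m, n, k, stride, padding)
    out = []
    for oy in range(out_h):
        row = []
        for ox in range(out_w):
            acc = 0
            for ky in range(k):
                for kx in range(k):
                    acc += kernel[ky][kx] * padded[oy * stride + ky][ox * stride + kx]
            row.append(acc)
        out.append(row)
    return out
```

With a 300×300 input, a 3×3 kernel, and padding 1, a stride of 1 keeps the size at 300×300, while a stride of 2 halves it to 150×150.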

[0074] Referring to Figure 4, the pooling layer also includes:

[0075] Pooling is performed on the output feature maps of convolutional layers to compress image size and reduce overfitting.

[0076] The entire candidate region is replaced using max pooling and average pooling.
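The block-wise pooling described above can be sketched as follows. This is an illustrative example only: it assumes non-overlapping square blocks, and the function name `pool2d` and its parameters are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: each size x size block is replaced by one value.

    mode="max" keeps the block maximum, any other value keeps the block average.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for y in range(0, h - size + 1, size):
        row = []
        for x in range(0, w - size + 1, size):
            block = [feature_map[y + dy][x + dx]
                     for dy in range(size) for dx in range(size)]
            row.append(max(block) if mode == "max" else sum(block) / len(block))
        pooled.append(row)
    return pooled
```

A 4×4 feature map pooled with 2×2 blocks yields a 2×2 map, which is how pooling compresses the image size.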

[0077] The activation layer also includes,

[0078] f(x) = max(0,x)

[0079] The gradient of this activation is either 1 or 0, which mitigates the problems of gradient vanishing and gradient explosion. When the input is positive, the gradient of the activation is always 1, which greatly reduces the amount of computation during model training.
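A minimal sketch of this activation and its gradient follows. The function names are hypothetical, and taking the gradient at exactly 0 to be 0 is a common convention assumed here, not something the patent specifies.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Gradient of f: 1 for positive inputs, 0 otherwise (0 at x == 0 by convention)."""
    return 1.0 if x > 0 else 0.0
```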

[0080] Referring to Figure 5, the calculation of integral image features includes:

[0081] The feature template is divided into two regions, and the sum of the pixel values in each region is calculated separately. The difference between the two sums is used as the feature value of the feature template.

[0082] Integral images use matrices to describe global image information. The value of each point in an integral image is equal to the sum of the values of all pixels above and to the left of that point, as follows:

[0083] $I(x,y) = \sum_{x' \le x,\; y' \le y} f(x', y')$

[0084] I(x,y)=f(x,y)+I(x-1,y)+I(x,y-1)-I(x-1,y-1)

[0085] Where I represents the integral image, f represents the original image, and x,y,x′,y′ represent pixel positions.
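The recurrence for the integral image, and its use for a two-region feature, can be sketched in Python as below. This is an illustration under stated assumptions: the template is split into left and right halves (the patent does not fix the split direction), and the names `integral_image`, `region_sum`, and `two_rect_feature` are hypothetical.

```python
def integral_image(f):
    """Build ii where ii[y][x] is the sum of f over all rows <= y and columns <= x."""
    h, w = len(f), len(f[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            left = ii[y][x - 1] if x > 0 else 0
            up = ii[y - 1][x] if y > 0 else 0
            diag = ii[y - 1][x - 1] if x > 0 and y > 0 else 0
            # I(x,y) = f(x,y) + I(x-1,y) + I(x,y-1) - I(x-1,y-1)
            ii[y][x] = f[y][x] + left + up - diag
    return ii


def region_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle (x0, y0)-(x1, y1) using four lookups."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total


def two_rect_feature(ii, x, y, w, h):
    """Feature value: sum of the left half minus sum of the right half (w must be even)."""
    half = w // 2
    left = region_sum(ii, x, y, x + half - 1, y + h - 1)
    right = region_sum(ii, x + half, y, x + w - 1, y + h - 1)
    return left - right
```

Because each region sum needs only four lookups, the cost of a feature does not depend on the template size, which is what makes evaluating templates of many sizes practical.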

[0086] S2: Read the feature values of all samples and select the feature value with the minimum loss as the classification attribute of the first weak classifier.

[0087] S3: Calculate the weights of the features for the next round and the weights of the weak classifier based on the lightweight strategy. It should be noted in this step that the lightweight strategy includes:

[0088] TensorFlow converts the floating-point parameters into integers using a linear transformation.

[0089] Computation is performed with the converted parameters, and a linear transformation is used to restore the final result to a floating-point type;

[0090]

[0091] Where r represents the original model parameter value, B represents the number of bits for quantization, q represents the parameter value of the quantized model, and z represents the quantized value corresponding to the real value 0 (the zero point);

[0092] The constructed MDSSD model was quantized and compressed using Tensorflow;

[0093] After the MDSSD model is trained, a lightweight strategy is used to convert the MDSSD model parameters from 32-bit floating-point type to 8-bit integer type for storage.

[0094] The final result is the MDSSD Lite lightweight model.
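To make the linear transformation between floating-point and integer parameters concrete, here is a hedged sketch of a common affine quantization scheme with a scale and a zero point. It is an assumption that this matches the formula used in the patent, which is not reproduced in the text; the helper names `quant_params`, `quantize`, and `dequantize` are hypothetical and are not TensorFlow APIs.

```python
def quant_params(r_min, r_max, bits=8):
    """Scale and zero point for mapping [r_min, r_max] onto [0, 2**bits - 1]."""
    # Assumption: the range is widened to include 0 so that real 0 is representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    levels = 2 ** bits - 1
    scale = (r_max - r_min) / levels
    if scale == 0.0:
        scale = 1.0  # degenerate case: all parameters are 0
    zero_point = int(round(-r_min / scale))
    return scale, zero_point


def quantize(r, scale, zero_point, bits=8):
    """Float r -> integer q, clamped to the B-bit range."""
    q = int(round(r / scale)) + zero_point
    return max(0, min(2 ** bits - 1, q))


def dequantize(q, scale, zero_point):
    """Integer q -> approximate float r."""
    return scale * (q - zero_point)
```

With B = 8, each parameter is stored in one byte instead of four, and the round-trip error stays within about one quantization step.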

[0095] Referring to Figure 6, building the MDSSD model includes:

[0096] The MDSSD algorithm uses k-means to cluster ground truth boxes to find the optimal number, size, and proportion of prior boxes, and uses a custom IOU distance as the metric for cluster analysis.

[0097] $d_{IOU}(box, centroid) = 1 - IOU(box, centroid)$

[0098] The clustering loss is the IOU distance between the Ground Truth and the cluster center; the smaller the distance, the larger the IOU value.

[0099] Specify the number of clusters k and randomly initialize the cluster centers $(W_i, H_i)$, $i \in \{1, 2, \ldots, k\}$, where $W_i$ and $H_i$ represent the length and width of the cluster center, respectively.

[0100] Place the cluster center and the ground truth center at the origin of the coordinate system and calculate the IOU distance between each ground truth and the cluster;

[0101] Assign the Ground Truth to the cluster with the smallest IOU distance. After all Ground Truth boxes have been assigned, recalculate the cluster center and keep updating until the cluster center no longer changes.

[0102] The median of the cluster center is used as the final prior box size and proportion.
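The clustering steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each box is given as a (width, height) pair, that centers are initialized from randomly chosen boxes, and that each center is updated to the per-dimension median of its members. The names `iou_wh` and `kmeans_priors` are hypothetical.

```python
import random
from statistics import median


def iou_wh(box, centroid):
    """IOU of two (w, h) boxes whose centers are both placed at the origin."""
    inter = min(box[0], centroid[0]) * min(box[1], centroid[1])
    union = box[0] * box[1] + centroid[0] * centroid[1] - inter
    return inter / union


def kmeans_priors(boxes, k, seed=0, max_iter=100):
    """Cluster (w, h) boxes with d = 1 - IOU and return the k prior box sizes."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # random initialization from the data
    assignment = None
    for _ in range(max_iter):
        # Assign every box to the cluster with the smallest IOU distance.
        new_assignment = [
            min(range(k), key=lambda c: 1 - iou_wh(b, centroids[c]))
            for b in boxes
        ]
        if new_assignment == assignment:
            break  # assignments, and therefore centers, no longer change
        assignment = new_assignment
        for c in range(k):
            members = [b for b, a in zip(boxes, assignment) if a == c]
            if members:
                centroids[c] = (median(w for w, _ in members),
                                median(h for _, h in members))
    return sorted(centroids)
```

For example, six boxes forming one group of small squares and one group of tall rectangles yield two priors, one per group.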

[0103] S4: Obtain multiple weak classifiers sequentially and combine them into a strong classifier. It should also be noted that the combined strong classifier includes:

[0104] During training, the data distribution is continuously adjusted to reduce the weight of correctly classified samples;

[0105] The learning process continues until the number of weak classifiers reaches a predetermined value.

[0106] A strong classifier is obtained by constructing a linear combination of classifiers using a weighted average strategy;

[0107] Given training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_n$ is the feature vector of a training sample and $y_n$ is its label, taking a value of +1 or -1;

[0108] Each training data point is assigned an initial weight value, with all samples having equal weights.

[0109] $D_1 = (\omega_{11}, \omega_{12}, \ldots, \omega_{1i}, \ldots, \omega_{1n})$

[0110] $\omega_{1i} = \frac{1}{n}, \quad i = 1, 2, \ldots, n$

[0111] For the base classifier $G_m(x)$, the error rate on the weighted training samples is as follows:

[0112] $e_m = \sum_{i=1}^{n} \omega_{m,i}\, I(G_m(x_i) \ne y_i)$

[0113] where $I(G_m(x_i) \ne y_i)$ is an indicator function taking the value 0 or 1. The weight $\alpha_m$ of the current classifier $G_m(x)$ is then calculated as follows:

[0114] $\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$

[0115] After updating the weight distribution of all training samples, the final strong classifier is as follows:

[0116] $D_{m+1} = (\omega_{m+1,1}, \omega_{m+1,2}, \ldots, \omega_{m+1,n})$

[0117] $\omega_{m+1,i} = \frac{\omega_{m,i}}{Z_m} \exp\left(-\alpha_m y_i G_m(x_i)\right)$

[0118] $Z_m = \sum_{i=1}^{n} \omega_{m,i} \exp\left(-\alpha_m y_i G_m(x_i)\right)$

[0119] $G(x) = \mathrm{sign}\left(\sum_{m} \alpha_m G_m(x)\right)$

[0120] where $Z_m$ is a normalization factor that normalizes the values of $\omega_{m,i}$ to the range [0,1], so that the sum of the weights of all samples is equal to 1. For m = 1, 2, ..., n, each weak classifier is trained in sequence according to the above steps.
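The boosting procedure of step S4 can be sketched as below using the standard AdaBoost update. This is an illustration under assumptions: the weak classifiers are one-dimensional threshold stumps (the patent uses feature-template values instead), and the names `train_adaboost` and `predict` are hypothetical.

```python
import math


def train_adaboost(xs, ys, rounds):
    """Train `rounds` threshold stumps on 1-D features xs with labels ys in {+1, -1}."""
    n = len(xs)
    weights = [1.0 / n] * n          # initial distribution: equal weights
    ensemble = []                    # list of (alpha, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Pick the stump with the minimum weighted error.
        for thr in sorted(set(xs)):
            for pol in (1, -1):
                preds = [pol if x >= thr else -pol for x in xs]
                err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
                if best is None or err < best[0]:
                    best = (err, thr, pol, preds)
        err, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)     # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)   # weight of this weak classifier
        ensemble.append((alpha, thr, pol))
        # Down-weight correctly classified samples, then normalize by Z_m.
        weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, ys, preds)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble


def predict(ensemble, x):
    """Strong classifier: sign of the alpha-weighted vote of the stumps."""
    score = sum(alpha * (pol if x >= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1
```

On a small one-dimensional dataset that no single threshold can separate, one stump misclassifies some samples, while three boosted stumps classify all of them correctly.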

[0121] S5: Input the pre-selected positions within the candidate boxes into the strong classifiers for detection one by one, until all the weak classifiers confirm that the pre-selected positions are faces, and then end the classification.
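Step S5 amounts to accepting a position only if every classifier in the sequence accepts it. The sketch below illustrates that control flow only; representing each classifier as a Python callable over a feature dictionary, and the names `is_face` and `detect`, are assumptions made for the example.

```python
def is_face(features, classifiers):
    """Return True only if every classifier accepts; stop at the first rejection."""
    for classify in classifiers:
        if not classify(features):
            return False
    return True


def detect(candidates, classifiers):
    """candidates: list of (position, features); returns the positions judged to be faces."""
    return [pos for pos, feats in candidates if is_face(feats, classifiers)]
```

Rejecting a position at the first failing classifier means most background positions are discarded after only a few checks.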

[0122] Preferably, this embodiment constructs a new network model, MDSSD (Mix Deconvolution Single Shot MultiBox Detector), and its quantized model, MDSSD Lite, for face detection. The MDSSD algorithm improves upon several shortcomings of the SSD algorithm in face detection, including the model structure, detection feature maps, parameter configuration, and model lightweighting. Furthermore, it uses deep neural network learning methods to configure the model, reducing reliance on manual experience and thereby significantly improving the model's detection performance.

[0123] Example 2

[0124] Referring to Figure 7, this embodiment uses the Wider Face (face detection benchmark) dataset for comparative experiments. This dataset contains 32,203 images of different sizes and aspect ratios, comprising 61 event categories, and 393,703 faces with different skin tones, scales, and poses. While the dataset includes manually annotated ground truth boxes for both training and validation data, it does not provide corresponding ground truth boxes for the test face images. Therefore, this embodiment uses the Wider Face training data to train the model and tune its hyperparameters, and uses the validation data to test the lightweight model.

[0125] Since the input images for the SSD, MDSSD, and MDSSD Lite algorithms are all 300×300 pixels, the images need to be resized to 300×300 pixels before model training, and the ground truth boxes of the faces input to the model need to be scaled accordingly. Experiments show that faces whose ground truth boxes are less than 11 pixels in length or width cause the loss to fail to converge during model training, resulting in ineffective learning. Therefore, these tiny face samples are first removed in the data preprocessing stage. In addition, to shorten the data processing time during model training, the dataset is converted to VOC format, which is a specific format of text data.
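The box scaling and small-face filtering can be sketched as follows. Two points are assumptions rather than statements from the patent: boxes are given as (x, y, w, h) in pixels, and the 11-pixel threshold is applied to the box size after resizing. The function name `rescale_boxes` is hypothetical.

```python
def rescale_boxes(boxes, img_w, img_h, target=300, min_side=11):
    """Scale (x, y, w, h) boxes to a target x target image and drop tiny faces."""
    sx, sy = target / img_w, target / img_h
    kept = []
    for x, y, w, h in boxes:
        bw, bh = w * sx, h * sy
        # Assumption: the size threshold is checked after resizing.
        if bw < min_side or bh < min_side:
            continue
        kept.append((x * sx, y * sy, bw, bh))
    return kept
```

For a 600×600 image, a 200×100 box becomes 100×50 and is kept, while a 20×20 box becomes 10×10 and is removed.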

[0126] The SSD model, MDSSD model, and MDSSD Lite model are all implemented using Python 3.6 based on the Tensorflow 1.14 framework. The machine configurations for model training and testing are shown in the table below:

[0127] Table 1: Experimental Environment Configuration Table.

[0128]
Item | Configuration
Server | DELL Tower
Operating system | Windows 10
GPU | NVIDIA GTX 1080Ti
CPU | Intel Core i7-8700 @ 3.20GHz
Memory | 32G
Video memory | 8G

[0129] The SSD network training uses transfer learning, specifically training a VGG16 image classification model on the ImageNet dataset. The convolutional layer parameters of this pre-trained model are used to initialize the first five convolutional blocks of the SSD network. The first three convolutional blocks of the SSD network are then fixed. The deep layers of the backbone network and the classification / regression module are fine-tuned using the Wider Face VOC format dataset to train the face detection model. Similarly, the MDSSD face detection model uses the pre-trained SSD face detection model to initialize the parameters of the backbone network, then uses the Wider Face VOC format dataset to fine-tune all model parameters and train the feature fusion module and the classification / regression module. In contrast, the MDSSD Lite model of this invention uses post-training quantization, so it directly uses the MDSSD parameters for quantization without retraining using the training data.

[0130] Both the SSD and MDSSD networks are optimized using Adam, with the learning rate decayed to a minimum of 0.0001 so that the loss gradually approaches its minimum.

[0131] Table 2: Training Hyperparameter Settings Table.

[0132]
Parameter | SSD Network | MDSSD Network
Backbone network initialization method | VGG16 | SSD
Batch size | 32 | 32
Optimization method | Adam | Adam
Adam_beta1 | 0.9 | 0.9
Adam_beta2 | 0.999 | 0.999
Learning rate | 0.001 | 0.001
Learning rate decay rate | 0.90 | 0.90
Number of iterations | 50000 | 50000

[0133] There are two evaluation methods for model evaluation: ROC curve and PR curve. However, for object detection tasks, since it is necessary to evaluate precision and recall, the PR curve is more intuitive. The PR curve is a curve connecting the recall and precision of the model at different thresholds. Precision and recall are commonly used evaluation metrics in binary classification tasks, which can be calculated from the confusion matrix.

[0134] Table 3: Confusion Matrix Table.

[0135]
 | Marked as a face | Marked as background
Detected as a face | TP | FP
Detected as background | FN | TN

[0136] If the Intersection over Union (IOU) between the ground truth bounding box of a labeled face in the image and the bounding box of the face predicted by the model is greater than 0.5, the face detection is considered correct. According to this matching rule, TP in the table is the total number of faces correctly detected by the model, FN is the total number of faces missed by the model (faces incorrectly classified as background), and FP is the total number of background regions incorrectly detected as faces. The evaluation target is the accuracy of face detection; therefore, the confusion matrix in face detection discards TN, the total number of correctly detected backgrounds.

[0137] Precision and recall can be calculated based on the above metrics, as follows:

[0138] $Precision = \frac{TP}{TP + FP}$

[0139] $Recall = \frac{TP}{TP + FN}$

[0140] Precision indicates the proportion of real faces among detected faces, while recall indicates the proportion of faces detected among labeled faces in the test data. In face detection, the PR curve is the curve connecting the maximum detection precision for each given recall. The PR curve allows for an intuitive evaluation and comparison of the performance of different models.
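A minimal sketch of the matching rule and the two metrics is given below. It assumes boxes are (x1, y1, x2, y2) corner coordinates and that each ground truth box can be matched by at most one detection; the greedy matching order and the names `iou` and `precision_recall` are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(detections, ground_truth, threshold=0.5):
    """Count TP, FP, FN with IOU > threshold matching, then compute the metrics."""
    matched = set()
    tp = 0
    for det in detections:
        best_j, best_iou = None, threshold
        for j, gt in enumerate(ground_truth):
            if j in matched:
                continue  # each ground truth box is matched at most once
            value = iou(det, gt)
            if value > best_iou:
                best_j, best_iou = j, value
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp       # detections with no matching face
    fn = len(ground_truth) - tp     # faces that were missed
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall
```

With two labeled faces and two detections, of which one overlaps a face with IOU 0.81 and the other hits background, both precision and recall are 0.5.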

[0141] Average precision (AP) is the quantitative evaluation metric for the Wider Face dataset. Its physical meaning is the area enclosed by the model's PR curve and the coordinate axes, as follows:

[0142] $$AP = \int_0^1 p(r)\,dr$$

[0143] Where p is precision and r is recall, AP represents the integral of precision p over recall r.

[0144] However, recall and precision are discrete in actual calculations, so AP must be approximated. Commonly used calculation methods are those of PASCAL VOC2007 and PASCAL VOC2012. VOC2007 uses the MaxIntegral method, also known as the 11-point method. For the recall values [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], the maximum precision at each of these 11 points is computed, and the average of these values is taken as the AP, as follows:

[0145] $$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1.0\}} \max p(r)$$

[0146] Where max p(r) represents the maximum precision p at a given recall r. VOC2012 uses the Integral method, which numerically integrates the area enclosed by the PR curve and the coordinate axes. Specifically, at each point where precision drops, it multiplies the precision by the corresponding change in recall and sums these products. According to the Interpolated Average Precision standard, AP is calculated using the following formula:

[0147] $$AP = \sum_{i} p(i)\,\Delta r_i$$

[0148] Where i represents the number of faces detected so far, and Δr_i represents the change in recall when a change in the confidence threshold causes the precision to change. This paper uses the evaluation method (5-5) to calculate the AP value.
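The two approximations of AP described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `recalls` is sorted in ascending order, the 11-point variant takes the maximum precision over all points whose recall is at least the given recall value (the usual PASCAL VOC convention), and the function names `ap_11_point` and `ap_integral` are introduced only for this example.

```python
def ap_11_point(recalls, precisions):
    """11-point AP: average of the maximum precision at recall 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for i in range(11):
        t = i / 10
        # Assumed convention: best precision among points with recall >= t.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        total += max(candidates) if candidates else 0.0
    return total / 11


def ap_integral(recalls, precisions):
    """Integral AP: sum of precision times the change in recall."""
    # Interpolate so that precision never increases as recall grows.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += p * (r - prev_recall)  # p(i) * delta r_i
        prev_recall = r
    return ap
```

For a toy PR curve with the two points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5), the 11-point method gives 8.5/11 and the integral method gives 0.75, which shows that the two approximations generally differ on the same curve.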

[0149] This embodiment compares and analyzes the trained SSD face detection model, MDSSD face detection model, and MDSSD Lite model in terms of detection speed, average precision, model size, and actual detection effect to test the effectiveness of the improved model.

[0150] Table 4: Experimental Results Data Table.

[0151]

[0152] Refer to Table 4 and Figure 8. All experiments were conducted on CPUs with identical configurations, without GPU acceleration, and the tests used the Wider Face validation set. The comparison shows that the SSD face detection model has a fast detection speed of 28 frames per second and a model size of only 97MB, but its detection accuracy is lower, especially its recall. The MDSSD network, an improvement on the SSD network, has a larger model size and more parameters because of the added detection modules, layers, and prior boxes, so its detection speed is slightly lower than that of the SSD network, at 25 frames per second. The MDSSD network still meets the requirements of real-time face detection, and its speed loss relative to SSD is small. The MDSSD network also has higher detection accuracy and face confidence, with an average precision of 0.813, and significantly improves the recall for small faces. Compared with the SSD model, its average precision is improved by 20.9%, which demonstrates the effectiveness of the model improvement. The MDSSD Lite model, obtained by quantizing and compressing the MDSSD network, has high detection accuracy and the fastest detection speed, reaching 34 frames per second, with the smallest model size of only 63MB.

[0153] Therefore, the detection performance of the SSD, MDSSD, and MDSSD Lite models is comparable. However, due to the lack of rich semantic features in the low-level feature maps of the SSD network, it cannot detect slightly blurry faces. Furthermore, for normal face detection, the bounding box regression of the SSD face detection model is relatively inaccurate, failing to completely locate all face regions. In moderately complex scenes, the SSD model has a high false positive rate, while the MDSSD and MDSSD Lite models can detect faces well in natural scenes, with low false positive and false negative rates. In complex scenes, especially in images with dense faces, SSD almost completely fails to detect small or occluded faces. For complex scenes with simple backgrounds, the MDSSD and MDSSD Lite models can still detect almost all faces, while for complex scenes with complex backgrounds, only a few faces are missed. However, these missed faces often have very low resolution and are severely occluded, with indistinct facial features.

[0154] Overall, the MDSSD face detection model has high detection accuracy and speed, but it has more model parameters and is relatively complex to calculate, making it suitable for real-time face detection needs of high-performance devices. The MDSSD Lite model has faster detection speed and higher detection accuracy, and can be deployed on most devices for real-time face detection. The MDSSD model has a similar detection speed to the SSD model, but its detection accuracy is better than that of the SSD model. The MDSSD Lite model has both better detection accuracy and speed than the SSD model. Therefore, the MDSSD model and the MDSSD Lite model are more suitable for industrial face detection applications.

[0155] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A model-quantization-based MDSSD face detection method, characterized by comprising:

calculating the integral image of the input image based on a convolutional neural network, and extracting the features of all samples by setting feature templates of different sizes;

the convolutional neural network includes convolutional layers, pooling layers, and activation layers;

the convolutional layer includes multiple convolution kernels; the convolution kernels slide over the input image with a fixed stride to scan the entire image and perform discrete convolution calculations, and the output of the convolution operation is non-linearly mapped through an activation function to obtain the input features of the next layer of the network;

the convolutional layer also includes:

$$F_{x,y} = -I_{x-1,y-1} - 2I_{x,y-1} - I_{x+1,y-1} + I_{x-1,y+1} + 2I_{x,y+1} + I_{x+1,y+1}$$

the stride of the convolution kernel k in each direction can be greater than 1; when the stride is s (s>1), the size of the output feature map is as follows:

where padding is the expansion, m*n is the input image size, k is the convolution kernel, I is the input image sub-image, and x and y are the coordinate values;

the pooling layer divides the feature image obtained after the convolution operation into blocks and calculates the maximum or average value within each block to obtain the pooled image;

the pooling layer also includes: pooling the output feature map of the convolutional layer to compress the image size and reduce overfitting; and using max pooling and average pooling to replace the entire candidate region;

the activation layer uses the activation function to perform a nonlinear mapping on the output of the previous layer, thereby introducing nonlinearity into the network and enabling the network to capture more complex nonlinear patterns;

the activation layer also includes:

$$f(x) = \max(0, x)$$

the gradient is either 1 or 0, which avoids the problems of gradient vanishing or gradient explosion; when the input is positive, the gradient of the loss function is always 1, which greatly reduces the amount of computation during model training;

reading the feature values of all the samples and selecting the feature value with the minimum loss as the classification attribute of the first weak classifier;

calculating the weights of the features in the next round based on the lightweight strategy, and calculating the weights of the weak classifier;

the lightweight strategy includes: Tensorflow converts the fractional part of the floating-point parameters into integers using a linear transformation, calculates the conversion parameters, and uses a linear transformation to restore the final result to the floating-point type;

where r represents the original model parameter value, B represents the number of bits for quantization, q represents the parameter value of the quantized model, and z represents the value of 0 after quantization;

the constructed MDSSD model is quantized and compressed using Tensorflow;

constructing the MDSSD model includes: the MDSSD algorithm uses k-means to cluster the Ground Truth boxes to find the optimal number, size, and proportion of prior boxes, and uses a custom IOU distance as the metric for cluster analysis:

$$d_{IOU}(box, centroid) = 1 - IOU(box, centroid)$$

the clustering loss is the IOU distance between the Ground Truth and the cluster center; the smaller the distance, the larger the IOU value;

specifying the number of clusters k and randomly initializing the cluster centers $(W_i, H_i)$, $i \in \{1, 2, \ldots, k\}$, where $W_i$ and $H_i$ represent the length and width of the cluster center, respectively;

placing the cluster center and the Ground Truth center at the origin of the coordinate system and calculating the IOU distance between each Ground Truth and each cluster center;

assigning each Ground Truth to the cluster with the smallest IOU distance;

after all the Ground Truth boxes have been assigned, recalculating the cluster centers and updating them continuously until the cluster centers no longer change, and using the median of the cluster centers as the final prior box size and proportion;

after the MDSSD model is trained, using the lightweight strategy to convert the MDSSD model parameters from 32-bit floating-point type to 8-bit integer type for storage, the final result being the MDSSD Lite lightweight model;

obtaining multiple weak classifiers sequentially and combining them into a strong classifier;

calculating the integral image to extract the features includes: dividing the feature template into two regions and calculating the sum of pixel values in each of the two regions, the difference between the sums of the two regions being used as the feature value of the feature template;

the integral image uses a matrix to describe the global information of the image; the value of each point in the integral image is equal to the sum of the pixel values above and to the left of that point, as follows:

$$I(x, y) = f(x, y) + I(x-1, y) + I(x, y-1) - I(x-1, y-1)$$

where I represents the integral image, f represents the original image, and x, y, x', y' represent pixel positions;

the combination to obtain the strong classifier includes: during training, continuously adjusting the data distribution to reduce the weight of correctly classified samples; continuing the learning process until the number of weak classifiers reaches a predetermined value; and obtaining the strong classifier by constructing a linear combination of classifiers using a weighted average strategy;

given training samples $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_n$ is the training sample feature vector and $y_n$ is the training sample label, which takes a value of +1 or -1;

each training data point is assigned an initial weight value, with all samples having equal weights:

$$D_1 = (\omega_{11}, \omega_{12}, \ldots, \omega_{1i}, \ldots, \omega_{1n})$$

for the base classifier $G_m(x)$, the error rate of the weighted training samples under the classifier is as follows:

where $I(G_m(x_i) \neq y_i)$ is an indicator function, which takes the value 0 or 1; the formula for calculating the weight of the current classifier $G_m(x)$ is as follows:

after the weight distribution of all training samples is updated, the final strong classifier is obtained as follows:

$$D_{m+1} = (\omega_{m+1,1}, \omega_{m+1,2}, \ldots, \omega_{m+1,n})$$

where $Z_m$ is a normalization factor that normalizes the values of $\omega_{m,i}$ to the range [0,1], so that the sum of the weights of all samples equals 1;

for m = 1, 2, ..., n, training each weak classifier in sequence according to the above steps;

inputting the pre-selected locations within the candidate boxes into the strong classifier for detection one by one until all the weak classifiers confirm that the pre-selected locations are faces, at which point the classification ends.
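The linear transformation recited in the lightweight strategy can be illustrated with a generic affine quantization sketch. The exact formula of the claim does not survive in this text, so the scheme below is an assumption: a scale S = (r_max - r_min) / (2^B - 1), a zero point z = round(-r_min / S), quantization q = round(r / S) + z, and restoration r ≈ S (q - z). The function names are hypothetical, and the sketch uses plain Python rather than any Tensorflow API.

```python
def quantization_params(values, bits=8):
    """Compute an assumed (scale, zero_point) pair for B-bit affine quantization."""
    # The range is widened to include 0 so that 0.0 is represented exactly.
    r_min = min(min(values), 0.0)
    r_max = max(max(values), 0.0)
    levels = 2 ** bits - 1
    scale = (r_max - r_min) / levels if r_max > r_min else 1.0
    zero_point = round(-r_min / scale)  # integer that represents r = 0
    return scale, zero_point


def quantize(values, scale, zero_point, bits=8):
    """Map floating-point values r to integers q in [0, 2^B - 1]."""
    levels = 2 ** bits - 1
    return [min(levels, max(0, round(r / scale) + zero_point)) for r in values]


def dequantize(q_values, scale, zero_point):
    """Restore approximate floating-point values from the integers q."""
    return [scale * (q - zero_point) for q in q_values]


# Usage with hypothetical weights: the round trip loses at most one step.
weights = [-1.0, 0.0, 0.5, 1.0]
scale, zero_point = quantization_params(weights)
restored = dequantize(quantize(weights, scale, zero_point), scale, zero_point)
```

Storing each parameter as an 8-bit integer instead of a 32-bit float is what reduces the stored parameter size, at the cost of a reconstruction error bounded by the step size `scale`.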