The technical solutions in the embodiments of the present invention are described below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Figures 1-8, the present invention provides the following technical solution:
A lightweight distraction recognition method based on deep-learning facial recognition, comprising the following steps:
S1: The ResNet10-SSD face detection algorithm is used to detect faces in key frames of the video stream. The face detection module uses ResNet10 as its backbone to extract deep features from the input image, which are then passed through a series of successively stacked convolution modules. The features extracted at different stages of the network, at different scales, are fed into separate SSD detection heads, which simultaneously predict the position and the confidence of the face boxes in the image.
The network structure is shown in Figure 1. In the figure, four numbers in parentheses denote the input channels, output channels, convolution kernel size, and stride; two numbers denote the convolution kernel size and stride. The network takes a single image as input. After normalization, the image first passes through a convolution layer with kernel size 7 and stride 2 followed by a BN layer and a ReLU layer (the Conv module in the figure), which maps the image to a 32-channel tensor and halves the resolution. A max-pooling layer with kernel size 3 and stride 2 then halves the resolution of the tensor again. The tensor then passes in sequence through stacked residual modules (the ResBlock modules in the figure) and convolution modules (the ConvBlock modules in the figure) to extract features. The outputs of layers 4, 6, 7, 8, 9, and 10 are each fed into an SSD detection head (the ConvSSD module in the figure), where a convolution layer predicts the position and the confidence of the face boxes. The structure of the core network modules is shown in Figure 2.
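The two halving steps at the start of the network can be checked with the usual output-size formula for convolution and pooling layers. The sketch below is illustrative only: it assumes a padding of `kernel // 2` for both layers, which the description does not state, and `conv_out_size` is a hypothetical helper name. The 300-pixel input side is the detection input resolution given in step S2.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumption: padding = kernel // 2 for both layers (not stated in the text).
after_conv = conv_out_size(300, kernel=7, stride=2, padding=3)
after_pool = conv_out_size(after_conv, kernel=3, stride=2, padding=1)
print(after_conv, after_pool)
```

Under these assumptions each of the two layers halves the side length of the feature map, consistent with the description.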
S2: For the face boxes obtained in step S1, each SSD detection head predicts six values $(x_c, y_c, w, h, o, c)$, where $(x_c, y_c)$ is the center of the box, $(w, h)$ are the width and height of the box, $o$ is the confidence that the box contains background, and $c$ is the confidence that the box contains a face. To improve the prediction of the face box position, scaling factors are applied to $(x_c, y_c, w, h)$:
$$b_w = d_w \cdot \exp(\mathrm{var}_w \cdot l_w)$$
$$b_h = d_h \cdot \exp(\mathrm{var}_h \cdot l_h)$$
where $l_w$ and $l_h$ are the values predicted by the network and $b_w$ and $b_h$ are the absolute width and height of the face box; $(d_w, d_h)$ are the width and height of the prior box; the center of the target box is expressed relative to the top-left corner of the cell that contains it, with offset = 0.5 as the center-coordinate offset; and $\mathrm{var}_w$ and $\mathrm{var}_h$ are the scaling factors. Different prior boxes $(d_w, d_h)$ are selected at different feature layers. In general, each layer is assigned a size range $(s_{min}, s_{max})$. Two square prior boxes are generated first, one of which has side length $s_{min}$; several aspect ratios $r$ are then assigned, and each value of $r$ yields two rectangular prior boxes. Specifically, the input image resolution is 300 × 300 and the face detection network has m = 6 output layers, whose output resolutions are [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]. The assigned prior sizes $(s_{min}, s_{max})$ are [(30, 60), (60, 111), (111, 162), (162, 213), (213, 264), (264, 315)], and the assigned aspect ratios are r = [(2), (2, 3), (2, 3), (2, 3), (2, 3), (2)], so the numbers of prior boxes assigned to the layers are [4, 6, 6, 6, 4, 4], respectively. The face boxes predicted by all output layers are merged, and the NMS algorithm is then applied to obtain the final face detection result.
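The prior-box construction and the width/height decoding can be sketched as follows. This is a minimal illustration, not the patented implementation. The side length $\sqrt{s_{min} s_{max}}$ of the second square box and the $\sqrt{r}$ form of the rectangular boxes follow the conventional SSD recipe and are assumptions, as is the value 0.2 for the scaling factors. `prior_sizes` and `decode_size` are hypothetical helper names.

```python
import math
from typing import List, Tuple


def prior_sizes(s_min: float, s_max: float,
                ratios: Tuple[int, ...]) -> List[Tuple[float, float]]:
    """Return the (width, height) of every prior box at one feature-map cell."""
    # Two square priors; the second side length is the conventional SSD
    # choice sqrt(s_min * s_max) and is an assumption here.
    side = math.sqrt(s_min * s_max)
    sizes = [(s_min, s_min), (side, side)]
    # Two rectangular priors per aspect ratio r (assumed conventional form).
    for r in ratios:
        root = math.sqrt(r)
        sizes.append((s_min * root, s_min / root))
        sizes.append((s_min / root, s_min * root))
    return sizes


def decode_size(d_w: float, d_h: float, l_w: float, l_h: float,
                var_w: float = 0.2, var_h: float = 0.2) -> Tuple[float, float]:
    """b_w = d_w * exp(var_w * l_w) and b_h = d_h * exp(var_h * l_h).

    The default scaling factors of 0.2 are an assumption.
    """
    return d_w * math.exp(var_w * l_w), d_h * math.exp(var_h * l_h)


first_layer = prior_sizes(30, 60, (2,))
second_layer = prior_sizes(60, 111, (2, 3))
print(len(first_layer), len(second_layer))
print(decode_size(30.0, 30.0, 0.0, 0.0))
```

With one aspect ratio a cell receives 2 + 2 = 4 prior boxes, and with two aspect ratios it receives 2 + 4 = 6. A network prediction of zero leaves the prior box size unchanged.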
S3: For the face box information obtained in step S2, a confidence threshold is set. Among the boxes whose confidence exceeds the threshold, the face box closest to the center of the image is selected as the target to be recognized. The face region is cropped according to this box and recognized with the distraction recognition algorithm based on MobileNet + GRU.
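The target selection in this step can be sketched as below. The box layout `(x_c, y_c, w, h, confidence)` in pixels, the threshold value 0.5, and the helper names `select_target` and `crop_bounds` are assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple

# Assumed layout: (x_c, y_c, w, h, confidence), all in pixels.
Box = Tuple[float, float, float, float, float]


def select_target(boxes: List[Box], image_w: int, image_h: int,
                  threshold: float = 0.5) -> Optional[Box]:
    """Pick the confident face box whose center is nearest the image center."""
    cx, cy = image_w / 2, image_h / 2
    candidates = [b for b in boxes if b[4] > threshold]
    if not candidates:
        return None
    return min(candidates, key=lambda b: (b[0] - cx) ** 2 + (b[1] - cy) ** 2)


def crop_bounds(box: Box, image_w: int, image_h: int) -> Tuple[int, int, int, int]:
    """Pixel bounds (left, top, right, bottom) of the box, clipped to the image."""
    x_c, y_c, w, h, _ = box
    left = max(0, math.floor(x_c - w / 2))
    top = max(0, math.floor(y_c - h / 2))
    right = min(image_w, math.ceil(x_c + w / 2))
    bottom = min(image_h, math.ceil(y_c + h / 2))
    return left, top, right, bottom


detections = [
    (150.0, 150.0, 40.0, 40.0, 0.3),   # centered but below the threshold
    (100.0, 120.0, 50.0, 60.0, 0.9),
    (250.0, 250.0, 30.0, 30.0, 0.95),
]
target = select_target(detections, 300, 300)
print(target, crop_bounds(target, 300, 300))
```

Filtering by confidence before measuring the distance means a low-confidence box at the exact center never displaces a reliable detection.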
The distraction recognition module borrows from the YOLO structure to build its network backbone and adds a CBAM attention module that receives additional supervision from facial key point information. Global pooling yields the face feature vector of the current frame. On the one hand, an auxiliary branch is attached directly to this vector to estimate the head pose, which serves as additional supervision. On the other hand, a GRU fuses temporal information, and the main branch performs distraction recognition and classification to produce the final recognition result. The network structure is shown in Figure 3. In the figure, four numbers in parentheses denote the input channels, output channels, convolution kernel size, and stride; two numbers denote the input channels and output channels; a single number denotes the number of hidden units. The network takes the current key-frame image and the feature vectors extracted from previous key frames as input, and outputs the head pose estimate and the distraction recognition result for the current frame.
S4: In the network backbone of step S3, the first layer uses an ordinary convolution module (the Conv module in the figure) to extract pixel-level features from the original input image; all other layers use depthwise separable convolutions in place of traditional convolutions to reduce the number of network parameters and the amount of computation. The backbone stacks BottleneckCSP modules and depthwise separable convolution modules with stride 2 (the DWConv modules in the figure) to extract deep image features, the latter also performing downsampling. Each convolution module consists of a convolution layer, a BN layer, and a ReLU layer; the depthwise separable convolution module replaces the convolution layer with a depthwise separable convolution layer and is otherwise unchanged.
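The parameter saving from depthwise separable convolution can be illustrated with a simple count. The sketch assumes the usual decomposition into a k × k depthwise convolution followed by a 1 × 1 pointwise convolution and ignores bias terms. The channel numbers 64 and 128 are hypothetical and are not taken from Figure 3.

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a k x k depthwise plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernel.
standard = conv_params(64, 128, 3)
separable = depthwise_separable_params(64, 128, 3)
print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer the separable form needs 8768 weights instead of 73728, roughly an eightfold reduction.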
The BottleneckCSP module combines residual connections with cross-stage partial connections, retaining the strong feature extraction capability of the residual network while reducing the amount of computation; its structure is shown in Figure 4. The tensor input to the module is split into two paths. One path reduces the number of channels through a convolution module, passes through N residual modules for feature extraction, and then passes through a separate convolution layer that adjusts the feature dimensions. The other path passes directly through a separate convolution layer. The two tensors are concatenated along the channel dimension, passed through a BN layer and a ReLU activation layer, and finally passed through a separate convolution layer that adjusts the output dimensions.
S5: An attention module that retains the identity mapping is added at the tail of the backbone network of step S3, with additional supervision from a heat map generated from facial key points, as shown in Figure 5.
On the one hand, for the distraction recognition task, the network should pay more attention to key regions such as the eyes and mouth and should appropriately ignore unimportant regions such as the cheeks, so a CBAM attention module is needed for feature enhancement and noise suppression. On the other hand, calculation shows that in the final feature map output by the backbone, the receptive field of each cell reaches 117 pixels. Since the input image of the distraction recognition task has a resolution of 112 × 112, the receptive field of each cell already covers the entire image. A conventional CBAM module might then suppress some feature vectors to near zero so that they cannot contribute to the final classification output, which is clearly unreasonable; the identity mapping therefore needs to be retained. CBAM is computed as follows:
$$Z_C = F_e(U) \odot U$$
$$Z_{CBAM} = F_p(Z_C) \odot Z_C$$
where $U$ is the input feature map of the attention module, $F_e$ is the function that produces the channel attention, $\odot$ denotes element-wise multiplication, $Z_C$ is the feature map after channel attention is applied, $F_p$ is the function that produces the spatial attention, and $Z_{CBAM}$ is the feature map after spatial attention is applied.
The structure of the attention module is shown in Figure 6. $F_e$ compresses the input feature map $U$ into one-dimensional vectors along two paths, max pooling and average pooling. Each vector then passes through two convolution layers with kernel size 1 and stride 1, which act as fully connected layers; the parameters of these convolution layers are shared between the max-pooling path and the average-pooling path. A sigmoid activation function finally yields the channel attention in the range [0, 1], which is multiplied element-wise with $U$ to obtain $Z_C$. $F_p$ applies max pooling and average pooling to $Z_C$ along the channel dimension, concatenates the two resulting single-channel maps along the channel dimension, and passes them through one convolution layer followed by a sigmoid activation to obtain the spatial attention in the range [0, 1], which is multiplied element-wise with $Z_C$ to obtain $Z_{CBAM}$, the final output of the attention module.
The feature map after attention is applied then undergoes global average pooling to obtain the frame feature vector.
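The attention computation and the global average pooling can be sketched in plain Python on a small `[channel][row][column]` tensor. Several simplifications are assumptions rather than details taken from the description: the two pooled paths are summed before the sigmoid and a ReLU sits between the two shared layers (both as in the original CBAM design), the spatial convolution is reduced to a 1 × 1 kernel with hypothetical weights `w_max`, `w_avg`, and `bias`, and the identity-mapping path is omitted because its exact form is only given in Figure 6.

```python
import math
from typing import List

Tensor = List[List[List[float]]]  # [channel][row][column]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def shared_mlp(vec: List[float], w1: List[List[float]],
               w2: List[List[float]]) -> List[float]:
    """Two 1 x 1 convolutions acting as fully connected layers (assumed ReLU between)."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(u: Tensor, w1: List[List[float]],
                      w2: List[List[float]]) -> Tensor:
    """Z_C = F_e(U) * U, with the MLP weights shared by both pooled paths."""
    flat = [[v for row in ch for v in row] for ch in u]
    max_vec = [max(f) for f in flat]
    avg_vec = [sum(f) / len(f) for f in flat]
    # Assumption: the two paths are summed before the sigmoid.
    logits = [a + b for a, b in zip(shared_mlp(max_vec, w1, w2),
                                    shared_mlp(avg_vec, w1, w2))]
    att = [sigmoid(x) for x in logits]
    return [[[v * a for v in row] for row in ch] for ch, a in zip(u, att)]


def spatial_attention(z_c: Tensor, w_max: float, w_avg: float,
                      bias: float) -> Tensor:
    """Z_CBAM = F_p(Z_C) * Z_C, using a 1 x 1 convolution as a simplification."""
    rows, cols = len(z_c[0]), len(z_c[0][0])
    att = [[sigmoid(w_max * max(ch[i][j] for ch in z_c)
                    + w_avg * sum(ch[i][j] for ch in z_c) / len(z_c)
                    + bias)
            for j in range(cols)] for i in range(rows)]
    return [[[ch[i][j] * att[i][j] for j in range(cols)]
             for i in range(rows)] for ch in z_c]


def global_average_pool(z: Tensor) -> List[float]:
    """Frame feature vector: the mean of each channel."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in z]


u = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 4.0], [4.0, 4.0]]]
w1 = [[0.0, 0.0]]        # 2 channels -> 1 hidden unit
w2 = [[0.0], [0.0]]      # 1 hidden unit -> 2 channels
z_c = channel_attention(u, w1, w2)
z_cbam = spatial_attention(z_c, w_max=0.0, w_avg=0.0, bias=0.0)
print(global_average_pool(z_cbam))
```

With all weights set to zero, both attention maps equal 0.5 everywhere, so the output is 0.25 · U. This makes the example easy to check by hand.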
S6: For the head pose estimation auxiliary branch in step S3, the frame feature vector obtained in step S5 is used to directly regress the head pose, namely the pitch angle (PITCH), yaw angle (YAW), and roll angle (ROLL), as shown in Figure 7.
S7: For the distraction recognition main branch in step S3, the frame feature vector obtained in step S5 is combined with historical information: a GRU unit extracts temporal features to obtain a fused feature vector, which passes through one fully connected layer that classifies the state of the face, and a SoftMax activation function yields the score of each class. The GRU is computed as follows:
$$r = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$z = \sigma(W_z \cdot [h_{t-1}, x_t])$$
Given the feature vector $x_t$ input at the current frame and the memory content $h_{t-1}$ of the historical frames, the reset gate $r$ is computed first; it controls how much the memory content $h_{t-1}$ contributes to the mixed state $h'$ of the current frame. The update gate $z$ is then computed; it weighs the memory content $h_{t-1}$ against the current state $h'$, retaining useful information and forgetting invalid information, and finally yields the output feature vector $h_t$. The network structure of the GRU module is shown in Figure 8; each gate is computed by a fully connected layer.
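One GRU step can be sketched as follows. The two gate equations are those given above. The description does not write out the mixed state and the final update, so the sketch assumes the standard forms $h' = \tanh(W_h \cdot [r \odot h_{t-1}, x_t])$ and $h_t = (1 - z) \odot h_{t-1} + z \odot h'$; the name `w_h` and the omission of bias terms are also assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w: Matrix, v: Vector) -> Vector:
    """Fully connected layer without bias."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def gru_step(x_t: Vector, h_prev: Vector,
             w_r: Matrix, w_z: Matrix, w_h: Matrix) -> Vector:
    concat = h_prev + x_t                          # [h_{t-1}, x_t]
    r = [sigmoid(v) for v in matvec(w_r, concat)]  # reset gate
    z = [sigmoid(v) for v in matvec(w_z, concat)]  # update gate
    # Assumed standard form of the mixed state h'.
    gated = [a * b for a, b in zip(r, h_prev)] + x_t
    h_cand = [math.tanh(v) for v in matvec(w_h, gated)]
    # Assumed standard form of the update that mixes memory and h'.
    return [(1.0 - zi) * hp + zi * hc
            for zi, hp, hc in zip(z, h_prev, h_cand)]


hidden, inputs = 2, 3
zeros = [[0.0] * (hidden + inputs) for _ in range(hidden)]
h_t = gru_step([1.0, -1.0, 0.5], [0.4, -0.2], zeros, zeros, zeros)
print(h_t)
```

With all weights at zero, both gates equal 0.5 and the mixed state is zero, so the step simply halves the previous memory.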
S8: The three tasks, namely the CBAM attention module, the head pose estimation auxiliary branch, and the distraction recognition main branch, are supervised simultaneously. For the CBAM attention module, 51 facial key points covering regions such as the eyes and nose are used to generate a heat map with a Gaussian kernel, where $(x, y)$ are the horizontal and vertical coordinates of a point in the heat map, $(x_0, y_0)$ are the horizontal and vertical coordinates of a facial key point, and $\sigma = 10$ is the standard deviation of the Gaussian kernel. Where the Gaussian distributions of several key points overlap, the maximum value is taken as the corresponding value of the heat map. The loss function is the binary cross-entropy between $y_i$ and $p_i$, which are the heat map value at the corresponding position and the value predicted by the CBAM attention, respectively.
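The heat map supervision can be sketched as below. The unnormalized Gaussian $\exp(-((x - x_0)^2 + (y - y_0)^2) / (2\sigma^2))$ and the mean reduction of the binary cross-entropy are assumptions, since the formulas themselves are not reproduced in the text. The heat map is drawn at 112 × 112 purely for illustration, and the two key points are hypothetical.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def gaussian_heatmap(width: int, height: int,
                     keypoints: List[Tuple[float, float]],
                     sigma: float = 10.0) -> Grid:
    """Heat map in which overlapping Gaussians are combined with max().

    Requires at least one key point.
    """
    return [[max(math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
                 for x0, y0 in keypoints)
             for x in range(width)]
            for y in range(height)]


def binary_cross_entropy(target: Grid, pred: Grid, clip: float = 1e-7) -> float:
    """Mean binary cross-entropy over all positions (assumed reduction)."""
    total, count = 0.0, 0
    for t_row, p_row in zip(target, pred):
        for y, p in zip(t_row, p_row):
            p = min(max(p, clip), 1.0 - clip)  # avoid log(0)
            total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# Two hypothetical key points, 10 pixels apart on the same row.
heat = gaussian_heatmap(112, 112, [(30, 40), (40, 40)])
uniform = [[0.5] * 112 for _ in range(112)]
print(heat[40][30], heat[40][38], binary_cross_entropy(heat, uniform))
```

Taking the maximum rather than the sum keeps every heat map value in [0, 1], which the binary cross-entropy requires of its targets.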
For the head pose estimation auxiliary branch, since the three pose angles are continuous values, the mean squared error is used as the loss function; to accelerate convergence, the network output is scaled. Here $y_i$ and $p_i$ are the true value and the predicted value of the head pose angle, and $\varepsilon = 5$ is the scaling coefficient.
For the distraction recognition main branch, which is a multi-class classification task, the cross-entropy loss function is used, where $y_{ij}$ and $p_{ij}$ are the true value and the predicted value of the i-th sample for the j-th class, respectively.
The losses of the three sub-tasks are weighted to obtain the total network loss:
$$L = \lambda_{att} L_{att} + \lambda_{pose} L_{pose} + \lambda_{cls} L_{cls}$$
The present invention takes $\lambda_{att} = \lambda_{pose} = \lambda_{cls} = 1.0$.
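The pose loss, the classification loss, and the weighted total can be sketched as follows. How the scaling coefficient $\varepsilon$ enters the pose loss is not spelled out in the text; the sketch assumes the raw network output is multiplied by $\varepsilon$ before it is compared with the true angle, and a different placement is equally possible. The mean reductions and all function names are likewise assumptions.

```python
import math
from typing import List


def scaled_mse(y_true: List[float], y_pred: List[float],
               epsilon: float = 5.0) -> float:
    """Mean squared error on pose angles.

    Assumption: the raw network output is multiplied by epsilon.
    """
    return sum((y - epsilon * p) ** 2
               for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_true: List[List[float]], y_pred: List[List[float]],
                  clip: float = 1e-7) -> float:
    """Mean multi-class cross-entropy over samples i and classes j."""
    total = 0.0
    for y_row, p_row in zip(y_true, y_pred):
        total -= sum(y * math.log(max(p, clip)) for y, p in zip(y_row, p_row))
    return total / len(y_true)


def total_loss(l_att: float, l_pose: float, l_cls: float,
               lam_att: float = 1.0, lam_pose: float = 1.0,
               lam_cls: float = 1.0) -> float:
    """L = lam_att * L_att + lam_pose * L_pose + lam_cls * L_cls."""
    return lam_att * l_att + lam_pose * l_pose + lam_cls * l_cls


l_pose = scaled_mse([10.0, -5.0, 0.0], [2.0, -1.0, 0.0])
l_cls = cross_entropy([[1.0, 0.0]], [[0.5, 0.5]])
print(l_pose, l_cls, total_loss(0.1, l_pose, l_cls))
```

With all three weights at 1.0, as in the present invention, the total loss is the plain sum of the three sub-task losses.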
S9: An initial angle is generated from the output of the head pose estimation branch in step S3, which allows the network output to be corrected using the head pose angles of subsequent frames. Combined with smoothing of the prediction scores of the network, this yields the final result of the distraction recognition algorithm.
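A possible form of this post-processing is sketched below. The description does not specify how the initial angle is generated or which smoothing operation is used, so everything in the sketch is an assumption: the initial angle is taken from the first frame, the correction is a subtraction of that angle, the smoothing is a moving average, and the class name `PostProcessor` and the window length are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Pose = Tuple[float, float, float]  # (pitch, yaw, roll)


class PostProcessor:
    """Moving-average score smoothing plus pose relative to an initial angle."""

    def __init__(self, window: int = 5) -> None:
        self.scores: Deque[float] = deque(maxlen=window)
        self.initial_pose: Optional[Pose] = None

    def update(self, score: float, pose: Pose) -> Tuple[float, Pose]:
        # Assumption: the first observed pose serves as the initial angle.
        if self.initial_pose is None:
            self.initial_pose = pose
        relative = tuple(p - p0 for p, p0 in zip(pose, self.initial_pose))
        # Assumption: smoothing is a moving average over the last frames.
        self.scores.append(score)
        smoothed = sum(self.scores) / len(self.scores)
        return smoothed, relative


post = PostProcessor(window=2)
results = [
    post.update(0.2, (10.0, 5.0, 0.0)),
    post.update(0.6, (12.0, 5.0, -1.0)),
    post.update(1.0, (10.0, 5.0, 0.0)),
]
print(results)
```

A bounded `deque` discards the oldest score automatically, so the smoothed value always reflects only the most recent frames.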
The present invention applies a deep-learning-based approach to facial distraction recognition. With a lightweight network design, it effectively extracts spatial features and temporal features to generate a recognition score, balances speed and accuracy, and achieves good practical results.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the invention is defined by the appended claims and their equivalents.