Face picture feature extraction method and device
By constructing a spatial multi-level network and an attention-based mixed-style network, a spatial multi-level mixed-style model is formed, which solves the problem of low accuracy in facial feature extraction and achieves higher accuracy in facial feature extraction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING ZHIDA TIANJIE COMMERCIAL OPERATION MANAGEMENT CO LTD
- Filing Date
- 2022-08-09
- Publication Date
- 2026-06-19
AI Technical Summary
The problem of low accuracy in facial feature extraction in existing technologies.
A spatial multi-level network and an attention-based mixed-style network are constructed to form a spatial multi-level mixed-style model. The accuracy of facial features is improved through multi-stage training.
It improves the accuracy of facial feature extraction and solves the problem of low accuracy in existing technologies.
Figure CN115359526B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of facial recognition technology, and in particular to a method and apparatus for extracting facial image features.
Background Technology
[0002] Facial feature extraction actually requires more receptive fields of different orders, efficient interaction between information of different spatial orders, and effective suppression of the influence of different styles, backgrounds, and noise. Classical neural networks are not specifically designed for facial feature extraction, resulting in lower accuracy in the extracted facial features.
[0003] In realizing the present invention, the inventors discovered at least the following technical problem in the related technology: the low accuracy of the extracted facial features.
Summary of the Invention
[0004] In view of this, the present disclosure provides a method, apparatus, electronic device, and computer-readable storage medium for extracting facial image features, in order to solve the problem of low accuracy of extracted facial features in the prior art.
[0005] A first aspect of this disclosure provides a method for extracting facial image features, comprising: constructing a spatial multi-level network and an attention-based mixed-style network, and constructing a spatial multi-level mixed-style network based on the spatial multi-level network and the attention-based mixed-style network; constructing a multi-stage network based on the spatial multi-level mixed-style network, convolutional layers, and normalization layers to obtain a spatial multi-level mixed-style model; acquiring a facial image, and extracting high-precision features of the facial image using the spatial multi-level mixed-style model.
[0006] A second aspect of this disclosure provides a facial image feature extraction apparatus, comprising: a first construction module configured to construct a spatial multi-level network and an attention-based mixed-style network, and to construct a spatial multi-level mixed-style network based on the spatial multi-level network and the attention-based mixed-style network; a second construction module configured to construct a multi-stage network based on the spatial multi-level mixed-style network, convolutional layers, and normalization layers to obtain a spatial multi-level mixed-style model; and an extraction module configured to acquire a facial image and extract high-precision features of the facial image using the spatial multi-level mixed-style model.
[0007] A third aspect of this disclosure provides an electronic device including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.
[0008] A fourth aspect of this disclosure provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.
[0009] Compared with the prior art, the embodiments of this disclosure have the following beneficial effects: a facial image feature extraction model is constructed, which includes a facial image feature extraction branch and a quality regression branch; a pre-trained model is obtained and used to initialize the facial image feature extraction model; the initialized facial image feature extraction model is trained in multiple stages; a pedestrian dataset to be classified is obtained, which includes multiple images of multiple people; and the multi-stage trained facial image feature extraction model is used to classify the images in the pedestrian dataset, obtaining one or more images of each person. By adopting the above technical means, the problem of low accuracy of extracted facial features in the prior art can be solved, thereby improving the accuracy of extracted facial features.
Attached Figure Description
[0010] To more clearly illustrate the technical solutions in the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure;
[0012] Figure 2 is a schematic flowchart of a method for extracting facial image features provided in an embodiment of this disclosure;
[0013] Figure 3 is a schematic diagram of the structure of a facial image feature extraction device provided in an embodiment of this disclosure;
[0014] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of this disclosure.
Detailed Implementation
[0015] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, so as to provide a thorough understanding of the embodiments of this disclosure. However, those skilled in the art will understand that this disclosure may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this disclosure with unnecessary detail.
[0016] A method and apparatus for extracting facial image features according to an embodiment of the present disclosure will now be described in detail with reference to the accompanying drawings.
[0017] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this disclosure. The application scenario may include terminal devices 101, 102, and 103, server 104, and network 105.
[0018] Terminal devices 101, 102, and 103 can be hardware or software. When terminal devices 101, 102, and 103 are hardware, they can be various electronic devices with displays that support communication with server 104, including but not limited to smartphones, tablets, laptops, and desktop computers. When terminal devices 101, 102, and 103 are software, they can be installed in the aforementioned electronic devices. Terminal devices 101, 102, and 103 can be implemented as multiple software programs or software modules, or as a single software program or software module; this disclosure does not impose any limitations on this. Furthermore, various applications can be installed on terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, shopping applications, etc.
[0019] Server 104 can be a server that provides various services, such as a backend server that receives requests sent by terminal devices with which it has established communication connections. This backend server can receive and analyze the requests sent by the terminal devices and generate processing results. Server 104 can be a single server, a server cluster consisting of several servers, or a cloud computing service center. This embodiment of the disclosure does not impose any limitations on these aspects.
[0020] It should be noted that server 104 can be either hardware or software. When server 104 is hardware, it can be various electronic devices that provide various services to terminal devices 101, 102, and 103. When server 104 is software, it can be multiple software programs or software modules that provide various services to terminal devices 101, 102, and 103, or it can be a single software program or software module that provides various services to terminal devices 101, 102, and 103. This disclosure does not limit the scope of the embodiments.
[0021] Network 105 can be a wired network using coaxial cable, twisted pair, and fiber optic connection, or it can be a wireless network that enables interconnection of various communication devices without wiring, such as Bluetooth, Near Field Communication (NFC), Infrared, etc. This disclosure does not limit the scope of the network.
[0022] Users can establish a communication connection with server 104 via network 105 through terminal devices 101, 102, and 103 to receive or send information, etc. It should be noted that the specific types, quantities, and combinations of terminal devices 101, 102, and 103, server 104, and network 105 can be adjusted according to the actual needs of the application scenario, and this disclosure embodiment does not impose any limitations on this.
[0023] Figure 2 is a schematic flowchart of a method for extracting facial image features provided in an embodiment of this disclosure. The method of Figure 2 can be executed by the computer or server of Figure 1, or by software running on that computer or server. As shown in Figure 2, the method for extracting facial image features includes:
[0024] S201: Construct a spatial multi-level network and an attention-based mixed-style network, and construct a spatial multi-level mixed-style network based on the spatial multi-level network and the attention-based mixed-style network.
[0025] S202: Construct a multi-stage network based on the spatial multi-level mixed-style network, convolutional layers, and normalization layers to obtain a spatial multi-level mixed-style model.
[0026] S203: Obtain a face image and extract high-precision features from the face image using a spatial multi-level mixed style model.
[0027] Spatial multi-level networks are designed to provide sufficient spatial level information of face images that are different from each other; attention-mixed style networks are designed to suppress the influence of different styles, backgrounds, and noise on the extraction of face features.
[0028] In existing technologies, classic neural networks are composed of blocks. Currently popular blocks have the concept of "expansion." For example, the classic EfficientNet has an expansion rate of 6.0; the classic RegNetZ has an expansion rate of 4.0. That is, if the input feature map has 64 channels, then in an EfficientNet block, it will be expanded to 64 x 6 = 384 channels for computation; in a RegNetZ block, it will be expanded to 64 x 4 = 256 channels for computation.
[0029] Practical experience has shown that the above structural design has the drawback that the calculation method is the same whether the scale is increased by 6 times or 4 times (e.g., uniformly using 3x3 depthwise convolution or uniformly using 3x3 grouped convolution). Therefore, it not only has computational redundancy, but also limited information that can be obtained (single method, limited means). Therefore, this disclosure constructs a spatial multi-level mixed style model, which can obtain sufficient facial information and suppress the influence of different styles, backgrounds, and noise on the extraction of facial features.
[0030] The expansion rate can be simply understood as the number of feature types extracted by the model. The following text also involves a large number of basic concepts in neural network models. Since these basic concepts are well-known and recognized, this disclosure only uses them to build a new model, and their meaning has not changed. Therefore, the concepts cited later will not be explained in detail; readers unfamiliar with them may consult the relevant literature.
[0031] According to the technical solution provided in this disclosure, a facial image feature extraction model is constructed, which includes: a facial image feature extraction branch and a quality regression branch; a pre-trained model is obtained, and the facial image feature extraction model is initialized using the pre-trained model; the initialized facial image feature extraction model is trained in multiple stages; a pedestrian dataset to be classified is obtained, which includes: multiple images of multiple people; the images in the pedestrian dataset are classified using the facial image feature extraction model trained in multiple stages to obtain one or more images of each person. Therefore, by adopting the above technical means, the problem of low accuracy of extracted facial features in the prior art can be solved, thereby improving the accuracy of extracted facial features.
[0032] In step S201, constructing a spatial multi-level network and an attention-mixed style network includes: obtaining the expansion rate N of the spatial multi-level mixed style model to be constructed, where N is a positive integer; when N is even, determining that the number of network layers contained in both the spatial multi-level network and the attention-mixed style network is N / 2; when N is odd, determining that the number of network layers contained in the spatial multi-level network is (N+1) / 2, and determining that the number of network layers contained in the attention-mixed style network is (N-1) / 2; and constructing the spatial multi-level network and the attention-mixed style network based on the number of network layers contained in both the spatial multi-level network and the attention-mixed style network.
[0033] For example, when N is 5, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence, while the attention-mixed style network consists of a first mixed style network layer and a second mixed style network layer connected in sequence. When N is 6, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence, while the attention-mixed style network consists of a first mixed style network layer, a second mixed style network layer, and a third mixed style network layer connected in sequence. N can be any positive integer because the later spatial network layers and the later mixed style network layers have similar characteristics to the earlier ones. For example, the fourth and fifth spatial network layers can be derived from the first, second, and third spatial network layers, and the third mixed style network layer can be derived from the first and second mixed style network layers. To simplify the text and avoid being too verbose, the following explanation uses N=5 as an example to illustrate the spatial multi-level network and the attention-mixed style network.
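The split of the expansion rate N into layer counts described above can be sketched as follows. This is a minimal illustration only; the function name `layer_counts` is hypothetical and does not appear in the disclosure.

```python
def layer_counts(n):
    """Split expansion rate N into (spatial layers, mixed-style layers)."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    if n % 2 == 0:
        # Even N: both networks contain N / 2 layers.
        return n // 2, n // 2
    # Odd N: the spatial multi-level network gets the extra layer.
    return (n + 1) // 2, (n - 1) // 2


for n in (5, 6, 7):
    spatial, mixed = layer_counts(n)
    print(f"N={n}: {spatial} spatial layers, {mixed} mixed-style layers")
```

For N = 5 and N = 6 this reproduces the layer counts given in the examples above (three and two layers, and three and three layers, respectively).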
[0034] In an optional embodiment, when N is 5, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence. The first spatial network layer performs a first convolution calculation, a batch normalization calculation, and a first activation calculation on the face image in sequence to obtain a first spatial feature. The second spatial network layer performs a second convolution calculation and a batch normalization calculation on the first spatial feature in sequence to obtain a first result, and performs a first activation calculation on the sum of the first result and the first spatial feature to obtain a second spatial feature. The third spatial network layer performs feature stacking calculation, a second convolution calculation, and a batch normalization calculation on the first spatial feature and the second spatial feature in sequence to obtain a second result, and performs a first activation calculation on the sum of the second result and the second spatial feature to obtain a third spatial feature.
[0035] Convolutional computation is equivalent to convolutional layers; one can understand it as convolutional layers providing the convolutional computation. To illustrate this in more detail, the following examples demonstrate the computation of each network layer:
[0036] The first convolution calculation is a 1x1 kernel, C channels, 0 padding, 1 stride, and 1 dilation rate (Conv). The batch normalization calculation is BatchNorm (BN) normalization. The first activation calculation uses the PReLU activation function. The second convolution calculation is a 3x3 kernel, C channels, 1 padding, 1 stride, and 1 dilation rate (Conv). The feature stacking calculation is concat processing, which combines multiple features to obtain a new feature. For example, the first spatial feature and the second spatial feature are stacked sequentially to obtain a new feature. Then, the second convolution calculation and batch normalization calculation are performed on this new feature to obtain the second result.
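As a hedged illustration of why the first and second convolution calculations can feed a residual sum and a feature stack, the sketch below applies the standard convolution output-size formula to their parameters. The helper name `conv_out_size` and the input height of 112 are assumptions chosen for the example, not values from the disclosure.

```python
def conv_out_size(size, kernel, padding, stride=1, dilation=1):
    """Spatial output size of a convolution along one dimension."""
    # A dilated kernel spans dilation * (kernel - 1) + 1 input positions.
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


h = 112  # hypothetical input height
first = conv_out_size(h, kernel=1, padding=0)   # first convolution calculation
second = conv_out_size(h, kernel=3, padding=1)  # second convolution calculation
print(first, second)
```

Both calculations leave the spatial size unchanged, so the sum of a result and a spatial feature, and the stacking of several spatial features along the channel axis, are well defined. The same helper applies to any dilated convolution through its `dilation` argument.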
[0037] The above embodiment is expressed by the following formula: the first spatial feature is f1, the second spatial feature is f2, the third spatial feature is f3, the first result is Δx1, the second result is Δx2, then:
[0038] Δx1=BN(Conv(f1,3x3))
[0039] f2 = PReLU(f1 + Δx1)
[0040] Δx2=BN(Conv(concat(f1,f2),3x3))
[0041] f3 = PReLU(f2 + Δx2)
[0042] If N is 7 or 8, then the fourth spatial feature is f4, and the corresponding result is Δx3:
[0043] Δx3=BN(Conv(BN(Conv(concat(f1,f2,f3),1x1)),3x3))
[0044] f4 = PReLU(f3 + Δx3)
[0045] As can be seen from the above, this disclosure proposes a "spatial multi-order" technique, which includes operations such as residual calculation, convolution calculation, feature connection (feature stacking calculation), batch normalization, and activation calculation. The above is the detailed calculation process of the 3rd order. By repeating this process, the 4th order, 5th order, and so on up to the kth order can be derived (repeating the calculation process of the 3rd order).
[0046] If we define the "spatial multi-order" technique as a function M, with the order as its parameter, then we have the following formula:
[0047] f2, f3, f4 = M(f1; 3)
[0048] Generalizing to order k:
[0049] f2, f3, …, f_{k+1} = M(f1; k)
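The recurrence behind M(f1; k) can be sketched by generating the formulas symbolically. This is only an illustration under a stated assumption: every order repeats the pattern of Δx1 and Δx2 (a 3x3 convolution and batch normalization over the stack of all earlier spatial features, followed by a residual sum and PReLU). The Δx3 expression given above additionally applies a 1x1 convolution and batch normalization before the 3x3 convolution, which this sketch omits. The function name and the `dx` spelling of Δx are hypothetical.

```python
def spatial_multi_order_formulas(k):
    """Return formula strings for f2 .. f_{k+1} produced by M(f1; k)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    lines = []
    for i in range(1, k + 1):
        inputs = [f"f{j}" for j in range(1, i + 1)]
        # A single input needs no stacking; otherwise stack all earlier features.
        stacked = inputs[0] if i == 1 else "concat(" + ",".join(inputs) + ")"
        lines.append(f"dx{i} = BN(Conv({stacked},3x3))")
        lines.append(f"f{i + 1} = PReLU(f{i} + dx{i})")
    return lines


for line in spatial_multi_order_formulas(2):
    print(line)
```

With k = 2 the generated lines match the four formulas given for f2 and f3.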
[0050] In an optional embodiment, the following steps are included: when N is 5, the attention-mixed style network consists of a first mixed style network layer and a second mixed style network layer connected in sequence; the first mixed style network layer performs the following operations: performing a third convolution calculation and instance normalization calculation on the first spatial features in sequence to obtain a third result; performing a first convolution calculation and a second activation calculation on the third result in sequence to obtain a fourth result; multiplying the third result and the fourth result to obtain a fifth result; performing feature stacking calculation, interactive sorting operation, first convolution calculation and group normalization calculation on the first spatial features and the fifth result in sequence to obtain a first mixed style feature.
[0051] The third convolution calculation is a convolution calculation (Conv) with a 3x3 kernel, C channels, 2 padding, 1 stride, and 3 dilation. Instance normalization is calculated using IN (InstanceNorm). The second activation calculation uses the sigmoid activation function. The interactive sorting operation is a shuffle process, where the features of the first spatial feature are located in channels 1, 3, 5, and 7, and the features of the fifth result are located in channels 2, 4, 6, and 8. Group normalization is calculated using GN (GroupNorm).
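The interactive sorting (shuffle) operation can be illustrated with channel labels instead of real feature maps. The sketch assumes that shuffle means strict alternation of the two inputs, as the channel positions above suggest, and that each input has four channels; the function name and the labels are hypothetical.

```python
def interleave_channels(spatial, gated):
    """Alternate channels: spatial feature first, fifth result second."""
    if len(spatial) != len(gated):
        raise ValueError("both inputs need the same number of channels")
    shuffled = []
    for s, g in zip(spatial, gated):
        shuffled.extend([s, g])
    return shuffled


spatial = ["f1_c1", "f1_c2", "f1_c3", "f1_c4"]  # first spatial feature
gated = ["r5_c1", "r5_c2", "r5_c3", "r5_c4"]    # fifth result
mixed = interleave_channels(spatial, gated)

# 1-based channel positions occupied by each input after the shuffle
spatial_positions = [i + 1 for i, name in enumerate(mixed) if name.startswith("f1")]
gated_positions = [i + 1 for i, name in enumerate(mixed) if name.startswith("r5")]
print(spatial_positions, gated_positions)
```

Alternating the channels places every channel of the spatial feature next to a channel of the fifth result before the following first convolution calculation and group normalization are applied.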
[0052] The above embodiment is expressed by the following formulas, in which u1 denotes the third result, a1 the fourth result, v1 the fifth result, and s1 the first mixed-style feature; Conv(·,3x3,d=3) denotes the third convolution calculation, and Conv(·,1x1) denotes the first convolution calculation:
[0053] u1 = IN(Conv(f1,3x3,d=3))
[0054] a1 = Sigmoid(Conv(u1,1x1)), v1 = u1 * a1
[0055] s1 = GN(Conv(shuffle(concat(f1,v1)),1x1))
[0056] If we define the "attention-mixed style" technique as a function T, then the formula is as follows:
[0057] s1 = T(f1)
[0058] Obviously, if the input is f2, then:
[0059] s2 = T(f2)
[0060] If the input is f3, then:
[0061] s3 = T(f3)
[0062] The second mixed-style network layer performs the following operations: the second spatial features are sequentially subjected to the third convolution calculation and instance normalization calculation to obtain the sixth result; the sixth result is sequentially subjected to the first convolution calculation and the second activation calculation to obtain the seventh result; the sixth result and the seventh result are multiplied to obtain the eighth result; the second spatial features and the eighth result are sequentially subjected to feature stacking calculation, interactive sorting operation, first convolution calculation and group normalization calculation to obtain the second mixed-style features.
[0063] The second mixed-style network layer is similar to the first mixed-style network layer, and will not be described further.
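The attention step shared by both mixed-style network layers (a first convolution calculation and a sigmoid, followed by multiplication with the normalized input) can be illustrated numerically. The sketch below is a single-channel simplification: with one channel, a 1x1 convolution reduces to a scalar weight and bias. The weight, bias, and input values are hypothetical, and the helper names are not from the disclosure.

```python
import math


def sigmoid(x):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate(normalized, weight, bias=0.0):
    """Multiply each value by its attention weight sigmoid(weight * value + bias)."""
    gated = []
    for value in normalized:
        attention = sigmoid(weight * value + bias)  # lies strictly in (0, 1)
        gated.append(value * attention)
    return gated


third_result = [-2.0, 0.0, 0.5, 3.0]  # hypothetical normalized values
fifth_result = gate(third_result, weight=1.0)
print(fifth_result)
```

Because the attention weight always lies between 0 and 1, the product never exceeds the normalized input in magnitude. In the multi-channel case the 1x1 convolution also mixes channels, which this sketch does not model.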
[0064] The first spatial feature, second spatial feature, third spatial feature, first mixed style feature and second mixed style feature are sequentially stacked and activated to obtain the spatial multi-level mixed style features of the face image output by the spatial multi-level mixed style network. Among them, the spatial multi-level mixed style features are related to the high-precision features.
[0065] As mentioned above, the expansion rate can be simply understood as the number of feature types extracted from the model. If the expansion rate N is 5, then the model extracts 5 types of features, namely: first spatial features, second spatial features, third spatial features, first mixed style features, and second mixed style features.
[0066] To make it more convincing that N can be any positive integer, this embodiment of the disclosure takes N as 7 as an example: when N is 7, there are first spatial features, second spatial features, third spatial features and fourth spatial features, first mixed style features, second mixed style features and third mixed style features.
[0067] Spatial multi-level mixed style networks can be represented as:
[0068] f1 = BN(Conv(F,1x1))
[0069] f2, f3, f4 = M(f1; 3)
[0070] s1 = T(f1)
[0071] s2 = T(f2)
[0072] s3 = T(f3)
[0073] F is the face image, and s1, s2, and s3 are the first, second, and third mixed-style features.
[0074] Because the first spatial feature is the lowest-order spatial feature, it contains relatively little information and can optionally be discarded. Finally, the six remaining features are stacked, and then the PReLU activation is calculated; the formula is as follows:
[0075] PReLU(concat(f2,f3,f4,s1,s2,s3))
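The effect of stacking on the channel count can be checked with a short calculation. The sketch assumes that every spatial feature and every mixed-style feature has C channels, as the C-channel convolution calculations suggest, and uses a hypothetical C of 64; the function name is also hypothetical.

```python
def stacked_channels(c, n_spatial, n_mixed, drop_first=False):
    """Channel count after stacking all kept features of C channels each."""
    kept_spatial = n_spatial - 1 if drop_first else n_spatial
    return c * (kept_spatial + n_mixed)


C = 64  # hypothetical channel count per feature
with_first = stacked_channels(C, n_spatial=4, n_mixed=3)
without_first = stacked_channels(C, n_spatial=4, n_mixed=3, drop_first=True)
print(with_first, without_first)
```

With four spatial features and three mixed-style features, keeping all seven gives 7C channels, while discarding the first spatial feature gives 6C channels, which corresponds to an expansion rate of 6.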
[0076] It should be noted that spatial features can also be updated with other more informative features, such as updating the first spatial feature to f_out:
[0077] The first spatial feature is subjected to a convolution calculation (Conv) with a 1x1 kernel, C channels, 0 padding, 1 stride, and 1 dilation rate, followed by batch normalization, to obtain f_out:
[0078] f_out = BN(Conv(f1,1x1))
[0079] It is worth noting that the above is only one possible form of the "spatial multi-level mixed style" module. For example, when the expansion rate is 6 (discarding the first spatial feature), there are many other combinations, such as:
[0080] f2, f3, f4 = M(f1; 3)
[0081]
[0082]
[0083]
[0084] or:
[0085] f2,f3=M(f1;2)
[0086]
[0087]
[0088]
[0089]
[0090] Both are acceptable.
[0091] In other words, various combinations can be formed from "spatial multi-level" technology and "attention mixed style" technology to create a variety of "spatial multi-level mixed style" modules.
[0092] In step S202, constructing a multi-stage network based on the spatial multi-level mixed style network, convolutional layers, and normalization layers to obtain the spatial multi-level mixed style model includes the following. The convolutional layers include a first convolutional layer, a second convolutional layer, a third convolutional layer, and a fourth convolutional layer, which perform the fourth, fifth, sixth, and seventh convolution calculations, respectively. The normalization layer performs the batch normalization calculation. The first stage network is formed by connecting the first convolutional layer, the normalization layer, and three spatial multi-level mixed style networks in sequence; the second stage network is formed by connecting the second convolutional layer, the normalization layer, and three spatial multi-level mixed style networks in sequence; the third stage network is formed by connecting the third convolutional layer, the normalization layer, and nine spatial multi-level mixed style networks in sequence; the fourth stage network is formed by connecting the fourth convolutional layer, the normalization layer, and three spatial multi-level mixed style networks in sequence; and the spatial multi-level mixed style model is formed by connecting the first stage network, the second stage network, the third stage network, and the fourth stage network in sequence.
[0093] The fourth convolution calculation can be a convolution calculation with a kernel of 3x3, 64 channels, a stride of 2, and padding of 1; the fifth convolution calculation can be a convolution calculation with a kernel of 3x3, 128 channels, a stride of 2, and padding of 1; the sixth convolution calculation can be a convolution calculation with a kernel of 3x3, 256 channels, a stride of 2, and padding of 1; the seventh convolution calculation can be a convolution calculation with a kernel of 3x3, 512 channels, a stride of 2, and padding of 1.
[0094] It should be noted that the number of spatial multi-level mixed style networks in each of the above four stage networks can be adjusted.
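The four-stage layout can be summarized as a small configuration table together with the spatial size after each stage. The sketch assumes a hypothetical 112 x 112 input and assumes that the spatial multi-level mixed style networks inside a stage keep the spatial size unchanged; only the stride-2 stage convolutions (3x3 kernel, padding 1) reduce it. The names `STAGES` and `stage_shapes` are illustrative.

```python
STAGES = [
    # (output channels of the stage convolution, number of mixed style networks)
    (64, 3),
    (128, 3),
    (256, 9),
    (512, 3),
]


def stage_shapes(size):
    """Return (channels, spatial size, block count) after each stage."""
    shapes = []
    for channels, blocks in STAGES:
        # 3x3 kernel, stride 2, padding 1: roughly halves the spatial size.
        size = (size + 2 * 1 - 3) // 2 + 1
        shapes.append((channels, size, blocks))
    return shapes


shapes = stage_shapes(112)  # hypothetical input size
for channels, size, blocks in shapes:
    print(f"{channels} channels, {size}x{size}, {blocks} blocks")
```

Changing the block counts in `STAGES` corresponds to adjusting the number of spatial multi-level mixed style networks per stage.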
[0095] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.
[0096] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.
[0097] Figure 3 is a schematic diagram of a facial image feature extraction device provided in an embodiment of this disclosure. As shown in Figure 3, the facial image feature extraction device includes:
[0098] The first building module 301 is configured to build a spatial multi-level network and an attention-based mixed-style network, and to build a spatial multi-level mixed-style network based on the spatial multi-level network and the attention-based mixed-style network.
[0099] The second building module 302 is configured to build a multi-stage network based on a spatial multi-level mixed style network, convolutional layers and normalization layers to obtain a spatial multi-level mixed style model.
[0100] The extraction module 303 is configured to acquire face images and extract high-precision features from the face images using a spatial multi-level mixed style model.
[0101] Spatial multi-level networks are designed to provide sufficient spatial level information of face images that are different from each other; attention-mixed style networks are designed to suppress the influence of different styles, backgrounds, and noise on the extraction of face features.
[0102] In existing technologies, classic neural networks are composed of blocks. Currently popular blocks have the concept of "expansion." For example, the classic EfficientNet has an expansion rate of 6.0; the classic RegNetZ has an expansion rate of 4.0. That is, if the input feature map has 64 channels, then in an EfficientNet block, it will be expanded to 64 x 6 = 384 channels for computation; in a RegNetZ block, it will be expanded to 64 x 4 = 256 channels for computation.
[0103] Practical experience has shown that the above structural design has the drawback that the calculation method is the same whether the scale is increased by 6 times or 4 times (e.g., uniformly using 3x3 depthwise convolution or uniformly using 3x3 grouped convolution). Therefore, it not only has computational redundancy, but also limited information that can be obtained (single method, limited means). Therefore, this disclosure constructs a spatial multi-level mixed style model, which can obtain sufficient facial information and suppress the influence of different styles, backgrounds, and noise on the extraction of facial features.
[0104] The expansion rate can be simply understood as the number of feature types extracted by the model. The following text also involves a large number of basic concepts in neural network models. Since these basic concepts are well-known and recognized, this disclosure only uses them to build a new model, and their meaning has not changed. Therefore, the concepts cited later will not be explained in detail; readers unfamiliar with them may consult the relevant literature.
[0105] According to the technical solution provided in this disclosure, a facial image feature extraction model is constructed, which includes: a facial image feature extraction branch and a quality regression branch; a pre-trained model is obtained, and the facial image feature extraction model is initialized using the pre-trained model; the initialized facial image feature extraction model is trained in multiple stages; a pedestrian dataset to be classified is obtained, which includes: multiple images of multiple people; the images in the pedestrian dataset are classified using the facial image feature extraction model trained in multiple stages to obtain one or more images of each person. Therefore, by adopting the above technical means, the problem of low accuracy of extracted facial features in the prior art can be solved, thereby improving the accuracy of extracted facial features.
[0106] Optionally, the first building module 301 is further configured to obtain the expansion rate N of the spatial multi-level mixed style model to be built, where N is a positive integer; when N is even, the number of network layers contained in the spatial multi-level network and the attention mixed style network is determined to be N / 2; when N is odd, the number of network layers contained in the spatial multi-level network is determined to be (N+1) / 2, and the number of network layers contained in the attention mixed style network is determined to be (N-1) / 2; based on the number of network layers contained in the spatial multi-level network and the attention mixed style network, the spatial multi-level network and the attention mixed style network are constructed.
[0107] For example, when N is 5, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence, while the attention-mixed style network consists of a first mixed style network layer and a second mixed style network layer connected in sequence. When N is 6, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence, while the attention-mixed style network consists of a first mixed style network layer, a second mixed style network layer, and a third mixed style network layer connected in sequence. N can be any positive integer because the later spatial network layers and the later mixed style network layers have similar characteristics to the earlier ones. For example, the fourth and fifth spatial network layers can be derived from the first, second, and third spatial network layers, and the third mixed style network layer can be derived from the first and second mixed style network layers. To simplify the text and avoid being too verbose, the following explanation uses N=5 as an example to illustrate the spatial multi-level network and the attention-mixed style network.
[0108] When N is 5, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected sequentially. The first spatial network layer performs a first convolution calculation, a batch normalization calculation, and a first activation calculation on the face image sequentially to obtain a first spatial feature. The second spatial network layer performs a second convolution calculation and a batch normalization calculation on the first spatial feature sequentially to obtain a first result, and performs a first activation calculation on the sum of the first result and the first spatial feature to obtain a second spatial feature. The third spatial network layer performs feature stacking calculation, a second convolution calculation, and a batch normalization calculation on the first spatial feature and the second spatial feature sequentially to obtain a second result, and performs a first activation calculation on the sum of the second result and the second spatial feature to obtain a third spatial feature.
[0109] Each convolution calculation is carried out by a convolutional layer; in other words, the convolutional layer provides the convolution calculation. The calculation performed by each network layer is illustrated in more detail below:
[0110] The first convolution calculation is a convolution (Conv) with a 1x1 kernel, C channels, padding 0, stride 1, and dilation rate 1. The batch normalization calculation is BatchNorm (BN). The first activation calculation uses the PReLU activation function. The second convolution calculation is a convolution (Conv) with a 3x3 kernel, C channels, padding 1, stride 1, and dilation rate 1. The feature stacking calculation is concat processing, which combines multiple features into a new feature. For example, the first spatial feature and the second spatial feature are stacked in sequence to obtain a new feature; the second convolution calculation and the batch normalization calculation are then performed on this new feature to obtain the second result.
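As a quick check of these parameters, the standard convolution output-size formula shows that both the first and the second convolution calculation leave the spatial size unchanged, which is what allows the results to be added to, or stacked with, earlier features. The sketch below is an illustration only; the function name and the input size of 56 are assumptions, not values from the disclosure.

```python
def conv_out_size(size: int, kernel: int, padding: int,
                  stride: int = 1, dilation: int = 1) -> int:
    """Output length along one spatial axis for a standard convolution."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


H = 56  # hypothetical feature-map height, chosen only for illustration

first = conv_out_size(H, kernel=1, padding=0)   # first convolution: 1x1, padding 0
second = conv_out_size(H, kernel=3, padding=1)  # second convolution: 3x3, padding 1
print(first, second)
```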
[0111] The above embodiment is expressed by the following formula: the first spatial feature is f1, the second spatial feature is f2, the third spatial feature is f3, the first result is Δx1, the second result is Δx2, then:
[0112] Δx1=BN(Conv(f1,3x3))
[0113] f2 = PReLU(f1 + Δx1)
[0114] Δx2=BN(Conv(concat(f1,f2),3x3))
[0115] f3 = PReLU(f2 + Δx2)
[0116] If N is 7 or 8, there is also a fourth spatial feature f4, and the corresponding intermediate result is Δx3:
[0117] Δx3=BN(Conv(BN(Conv(concat(f1,f2,f3),1x1)),3x3))
[0118] f4 = PReLU(f3 + Δx3)
[0119] As can be seen from the above, this disclosure proposes a "spatial multi-order" technique, which includes operations such as residual calculation, convolution calculation, feature connection (feature stacking calculation), batch normalization, and activation calculation. The above is the detailed calculation process of the 3rd order. By repeating this process, the 4th order, 5th order, and so on up to the kth order can be derived (repeating the calculation process of the 3rd order).
[0120] If we define the "spatial multi-order" technique as an M-function, with the order as its parameter, then we have the following formula:
[0121] f2, f3, f4 = M(f1; 3)
[0122] Generalizing to order k, we have:
[0123] f2, f3, …, f(k+1) = M(f1; k)
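The data flow of the M-function can be sketched as a short recursion: each new feature is the activation of the previous feature plus a residual branch computed from all features so far. The sketch below is a toy under stated assumptions: features are plain lists of floats, and `mean_branch` is a hypothetical stand-in for the BN(Conv(concat(...))) branch, since a real convolution is outside the scope of a few lines of standard-library Python. Only the recursion structure is meant to mirror the formulas above.

```python
from typing import Callable, List, Sequence

Feature = List[float]


def prelu(x: Feature, slope: float = 0.25) -> Feature:
    """PReLU with a fixed slope (the slope is learnable in a real network)."""
    return [v if v >= 0 else slope * v for v in x]


def mean_branch(features: Sequence[Feature]) -> Feature:
    """Hypothetical stand-in for BN(Conv(concat(f1, ..., fj))): element-wise mean."""
    n = len(features)
    return [sum(vals) / n for vals in zip(*features)]


def spatial_multi_order(f1: Feature, k: int,
                        branch: Callable[[Sequence[Feature]], Feature] = mean_branch
                        ) -> List[Feature]:
    """Return [f2, ..., f(k+1)], i.e. M(f1; k)."""
    features = [f1]
    for _ in range(k):
        delta = branch(features)   # residual term from all features so far
        prev = features[-1]        # most recent spatial feature
        features.append(prelu([p + d for p, d in zip(prev, delta)]))
    return features[1:]


f2, f3, f4 = spatial_multi_order([1.0, -2.0], k=3)
print(f2, f3, f4)
```

Passing a different `branch` callable is how a real implementation would plug in the convolution and normalization layers.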
[0124] When N is 5, the attention mixed style network consists of a first mixed style network layer and a second mixed style network layer connected in sequence. The first mixed style network layer performs the following operations: the third convolution calculation and the instance normalization calculation are performed on the first spatial feature in sequence to obtain the third result; the first convolution calculation and the second activation calculation are performed on the third result in sequence to obtain the fourth result; the third result and the fourth result are multiplied to obtain the fifth result; and feature stacking, the interactive sorting operation, the first convolution calculation, and the group normalization calculation are performed in sequence on the first spatial feature and the fifth result to obtain the first mixed style feature.
[0125] The third convolution calculation is a convolution (Conv) with a 3x3 kernel, C channels, padding 2, stride 1, and dilation rate 3. The instance normalization calculation is IN (InstanceNorm). The second activation calculation uses the sigmoid activation function. The interactive sorting operation is a shuffle process that interleaves the channels of its two inputs: for example, the channels of the first spatial feature are placed in channels 1, 3, 5, and 7, and the channels of the fifth result are placed in channels 2, 4, 6, and 8. The group normalization calculation is GN (GroupNorm).
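Two details of this paragraph can be checked with small sketches. The first function interleaves two channel lists in the way the interactive sorting example describes. The second evaluates the standard output-size formula for the dilated convolution: with the stated padding of 2 and dilation rate of 3 it yields H - 2, whereas a padding of 3 would keep H unchanged. Both function names are hypothetical, the input size 56 is an assumption, and the sketch only evaluates the formula without asserting how the authors' implementation handles the size.

```python
def interleave_channels(first, second):
    """Alternate channels: first[0], second[0], first[1], second[1], ..."""
    if len(first) != len(second):
        raise ValueError("both inputs need the same number of channels")
    shuffled = []
    for a, b in zip(first, second):
        shuffled.extend((a, b))
    return shuffled


def dilated_conv_out_size(size, kernel=3, padding=2, stride=1, dilation=3):
    """Standard convolution output-size formula along one spatial axis."""
    effective_kernel = dilation * (kernel - 1) + 1  # 7 for a 3x3 kernel with dilation 3
    return (size + 2 * padding - effective_kernel) // stride + 1


spatial = ["f1_c1", "f1_c2", "f1_c3", "f1_c4"]      # channels of the first spatial feature
gated = ["r5_c1", "r5_c2", "r5_c3", "r5_c4"]        # channels of the fifth result
print(interleave_channels(spatial, gated))          # f1 in positions 1,3,5,7 (1-based)

print(dilated_conv_out_size(56, padding=2))         # parameters as stated in the text
print(dilated_conv_out_size(56, padding=3))         # size-preserving alternative
```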
[0126] The above embodiment is expressed by the following formulas, where r3 is the third result, r4 is the fourth result, r5 is the fifth result, t1 is the first mixed style feature, and d=3 denotes the dilation rate of the third convolution calculation:
[0127] r3 = IN(Conv(f1,3x3,d=3))
[0128] r4 = sigmoid(Conv(r3,1x1)); r5 = r3 × r4
[0129] t1 = GN(Conv(shuffle(concat(f1,r5)),1x1))
[0130] If we define the "attention mixed style" technique as a T-function, then the formula is as follows:
[0131] t1 = T(f1)
[0132] Obviously, if the input is f2, then:
[0133] t2 = T(f2)
[0134] If the input is f3, then:
[0135] t3 = T(f3)
[0136] The second mixed-style network layer performs the following operations: the second spatial features are sequentially subjected to the third convolution calculation and instance normalization calculation to obtain the sixth result; the sixth result is sequentially subjected to the first convolution calculation and the second activation calculation to obtain the seventh result; the sixth result and the seventh result are multiplied to obtain the eighth result; the second spatial features and the eighth result are sequentially subjected to feature stacking calculation, interactive sorting operation, first convolution calculation and group normalization calculation to obtain the second mixed-style features.
[0137] The second mixed-style network layer is similar to the first mixed-style network layer, and will not be described further.
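The gating step shared by both mixed style network layers (a 1x1 convolution, a sigmoid, then a multiplication with the normalized input) can be sketched at a single spatial position, where a 1x1 convolution reduces to a matrix-vector product over channels. This is a minimal illustration; the weights and input values are hypothetical, and instance normalization is assumed to have been applied already.

```python
import math


def conv1x1(pixel, weights):
    """1x1 convolution at one spatial position: a matrix-vector product over channels."""
    return [sum(w * x for w, x in zip(row, pixel)) for row in weights]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def attention_gate(pixel, weights):
    """Scale each channel by a sigmoid gate computed from all channels."""
    gate = [sigmoid(v) for v in conv1x1(pixel, weights)]
    return [x * g for x, g in zip(pixel, gate)]


normalized = [0.5, -1.0, 2.0]      # hypothetical normalized channel values at one pixel
weights = [[1.0, 0.0, 0.0],        # hypothetical 1x1 convolution weights (C x C)
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
print(attention_gate(normalized, weights))
```

Because every gate value lies strictly between 0 and 1, the multiplication can only attenuate a channel, which is consistent with the stated aim of suppressing style, background, and noise interference.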
[0138] Optionally, the first building module 301 is further configured to perform feature stacking calculation and first activation calculation on the first spatial feature, the second spatial feature, the third spatial feature, the first mixed style feature and the second mixed style feature in sequence to obtain the spatial multi-level mixed style features of the face image output by the spatial multi-level mixed style network, wherein the spatial multi-level mixed style features are related to the high-precision features.
[0139] As mentioned above, the expansion rate can be simply understood as the number of feature types extracted from the model. If the expansion rate N is 5, then the model extracts 5 types of features, namely: first spatial features, second spatial features, third spatial features, first mixed style features, and second mixed style features.
[0140] To make it more convincing that N can be any positive integer, this embodiment of the disclosure takes N as 7 as an example: when N is 7, there are first spatial features, second spatial features, third spatial features and fourth spatial features, first mixed style features, second mixed style features and third mixed style features.
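The set of features that is stacked for a given expansion rate can be enumerated with a short sketch. The function name and the `drop_first_spatial` flag are hypothetical; the flag models the option of discarding the lowest-order spatial feature.

```python
def stacked_feature_names(n, drop_first_spatial=False):
    """List the feature types produced for expansion rate n."""
    if n < 1:
        raise ValueError("expansion rate N must be a positive integer")
    spatial = (n + 1) // 2   # n / 2 for even n, (n + 1) / 2 for odd n
    mixed = n // 2           # n / 2 for even n, (n - 1) / 2 for odd n
    names = [f"spatial feature {i}" for i in range(1, spatial + 1)]
    names += [f"mixed style feature {i}" for i in range(1, mixed + 1)]
    return names[1:] if drop_first_spatial else names


print(stacked_feature_names(5))
print(stacked_feature_names(7, drop_first_spatial=True))
```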
[0141] The spatial multi-level mixed style network can be represented as:
[0142] f1 = BN(Conv(F,1x1))
[0143] f2, f3, f4 = M(f1; 3)
[0144]
[0145]
[0146]
[0147] F is the face image.
[0148] Because the first spatial feature is the lowest-order spatial feature, it contains relatively little information and may optionally be discarded. Finally, the six features (with the first spatial feature discarded) are stacked, and PReLU activation is then applied; the formula is as follows:
[0149]
[0150] It should be noted that a spatial feature can also be replaced with a more informative feature, for example by updating the first spatial feature to f_out:
[0151] The first spatial feature is subjected to a convolution (Conv) with a 1x1 kernel, C channels, padding 0, stride 1, and dilation rate 1, followed by batch normalization, to obtain f_out:
[0152] f_out = BN(Conv(f1,1x1))
[0153] It is worth noting that the above is only one possible form of the "spatial multi-level mixed style" module. For example, when the expansion rate is 6 (discarding the first spatial feature), there are many other combinations, such as:
[0154] f2, f3, f4 = M(f1; 3)
[0155]
[0156]
[0157]
[0158] or:
[0159] f2,f3=M(f1;2)
[0160]
[0161]
[0162]
[0163]
[0164] Both are acceptable.
[0165] In other words, various combinations can be formed from "spatial multi-level" technology and "attention mixed style" technology to create a variety of "spatial multi-level mixed style" modules.
[0166] The convolutional layers include a first convolutional layer, a second convolutional layer, a third convolutional layer, and a fourth convolutional layer, which perform the fourth convolution calculation, the fifth convolution calculation, the sixth convolution calculation, and the seventh convolution calculation, respectively. The normalization layer performs the batch normalization calculation.
[0167] Optionally, the second building module 302 is further configured to sequentially connect the first convolutional layer, the normalization layer, and three spatial multi-level mixed style networks to form a first-stage network; sequentially connect the second convolutional layer, the normalization layer, and three spatial multi-level mixed style networks to form a second-stage network; sequentially connect the third convolutional layer, the normalization layer, and nine spatial multi-level mixed style networks to form a third-stage network; sequentially connect the fourth convolutional layer, the normalization layer, and three spatial multi-level mixed style networks to form a fourth-stage network; and sequentially connect the first-stage network, the second-stage network, the third-stage network, and the fourth-stage network to form a spatial multi-level mixed style model.
[0168] The fourth convolution calculation can be a convolution calculation with a kernel of 3x3, 64 channels, a stride of 2, and padding of 1; the fifth convolution calculation can be a convolution calculation with a kernel of 3x3, 128 channels, a stride of 2, and padding of 1; the sixth convolution calculation can be a convolution calculation with a kernel of 3x3, 256 channels, a stride of 2, and padding of 1; the seventh convolution calculation can be a convolution calculation with a kernel of 3x3, 512 channels, a stride of 2, and padding of 1.
[0169] It should be noted that the number of spatial multi-level mixed style networks in each of the above four stage networks can be adjusted.
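The four-stage layout can be summarized as a small configuration table, together with the spatial size after each stride-2 convolution. The sketch below rests on two assumptions that are not stated in the disclosure: a 112x112 input, and spatial multi-level mixed style networks that preserve the spatial size within a stage. The names `STAGES` and `stage_plan` are hypothetical.

```python
# (channels, number of spatial multi-level mixed style networks) per stage,
# taken from the example values in the description.
STAGES = [
    {"channels": 64, "blocks": 3},
    {"channels": 128, "blocks": 3},
    {"channels": 256, "blocks": 9},
    {"channels": 512, "blocks": 3},
]


def stride2_out(size):
    """Output size of a 3x3 convolution with stride 2 and padding 1."""
    return (size + 2 * 1 - 3) // 2 + 1


def stage_plan(input_size):
    """Return (stage index, channels, blocks, spatial size) for each stage."""
    plan = []
    size = input_size
    for index, stage in enumerate(STAGES, start=1):
        size = stride2_out(size)  # each stage starts with a stride-2 convolution
        plan.append((index, stage["channels"], stage["blocks"], size))
    return plan


for index, channels, blocks, size in stage_plan(112):  # 112 is an assumed input size
    print(f"stage {index}: {channels} channels, {blocks} blocks, {size}x{size}")
```

Adjusting the number of networks per stage, as the text permits, only requires changing the `blocks` entries.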
[0170] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this disclosure.
[0171] Figure 4 is a schematic diagram of the electronic device 4 provided in an embodiment of this disclosure. As shown in Figure 4, the electronic device 4 of this embodiment includes: a processor 401, a memory 402, and a computer program 403 stored in the memory 402 and executable on the processor 401. When the processor 401 executes the computer program 403, it implements the steps in the various method embodiments described above. Alternatively, when the processor 401 executes the computer program 403, it implements the functions of each module / unit in the various device embodiments described above.
[0172] Electronic device 4 can be a desktop computer, laptop, handheld computer, cloud server, or other electronic device. Electronic device 4 may include, but is not limited to, processor 401 and memory 402. Those skilled in the art will understand that Figure 4 is merely an example of electronic device 4 and does not constitute a limitation on electronic device 4. It may include more or fewer components than shown, or different components.
[0173] The processor 401 may be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
[0174] The memory 402 can be an internal storage unit of the electronic device 4, such as a hard disk or RAM of the electronic device 4. The memory 402 can also be an external storage device of the electronic device 4, such as a plug-in hard disk, Smart Media Card (SMC), Secure Digital (SD) card, Flash Card, etc., equipped on the electronic device 4. The memory 402 can also include both internal and external storage units of the electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
[0175] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0176] If an integrated module / unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program may include computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium may include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in a computer-readable medium may be appropriately added to or subtracted according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0177] The above embodiments are only used to illustrate the technical solutions of this disclosure, and are not intended to limit it. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be included within the protection scope of this disclosure.
Claims
1. A method for extracting features of a face image, characterized in that it comprises: constructing a spatial multi-level network and an attention mixed style network, and constructing a spatial multi-level mixed style network based on the spatial multi-level network and the attention mixed style network; constructing a multi-stage network based on the spatial multi-level mixed style network, convolutional layers, and normalization layers to obtain a spatial multi-level mixed style model; and acquiring a face image and extracting high-precision features of the face image using the spatial multi-level mixed style model; wherein constructing the spatial multi-level network and the attention mixed style network comprises: obtaining the expansion rate N of the spatial multi-level mixed style model to be constructed, where N is a positive integer; when N is even, determining that the number of network layers contained in each of the spatial multi-level network and the attention mixed style network is N / 2; when N is odd, determining that the number of network layers contained in the spatial multi-level network is (N+1) / 2 and that the number of network layers contained in the attention mixed style network is (N-1) / 2; and constructing the spatial multi-level network and the attention mixed style network based on the number of network layers contained in each; wherein the spatial multi-level network is used to provide spatial order information of multiple different face images, and the attention mixed style network is used to suppress the interference of different styles, backgrounds, and noise on face feature extraction.
2. The method of claim 1, wherein: when N is 5, the spatial multi-level network consists of a first spatial network layer, a second spatial network layer, and a third spatial network layer connected in sequence; the first spatial network layer sequentially performs a first convolution calculation, a batch normalization calculation, and a first activation calculation on the face image to obtain a first spatial feature; the second spatial network layer sequentially performs a second convolution calculation and the batch normalization calculation on the first spatial feature to obtain a first result, and performs the first activation calculation on the sum of the first result and the first spatial feature to obtain a second spatial feature; and the third spatial network layer sequentially performs a feature stacking calculation, the second convolution calculation, and the batch normalization calculation on the first spatial feature and the second spatial feature to obtain a second result, and performs the first activation calculation on the sum of the second result and the second spatial feature to obtain a third spatial feature.
3. The method of claim 2, wherein: when N is 5, the attention mixed style network consists of a first mixed style network layer and a second mixed style network layer connected in sequence; and the first mixed style network layer performs the following operations: sequentially performing a third convolution calculation and an instance normalization calculation on the third spatial feature to obtain a third result; sequentially performing the first convolution calculation and a second activation calculation on the third result to obtain a fourth result; multiplying the third result by the fourth result to obtain a fifth result; and sequentially performing the feature stacking calculation, an interactive sorting operation, the first convolution calculation, and a group normalization calculation on the first spatial feature and the fifth result to obtain a first mixed style feature.
4. The method of claim 3, wherein the second mixed style network layer performs the following operations: sequentially performing the third convolution calculation and the instance normalization calculation on the second spatial feature to obtain a sixth result; sequentially performing the first convolution calculation and the second activation calculation on the sixth result to obtain a seventh result; multiplying the sixth result by the seventh result to obtain an eighth result; and sequentially performing the feature stacking calculation, the interactive sorting operation, the first convolution calculation, and the group normalization calculation on the second spatial feature and the eighth result to obtain a second mixed style feature.
5. The method of claim 4, wherein the feature stacking calculation and the first activation calculation are sequentially performed on the first spatial feature, the second spatial feature, the third spatial feature, the first mixed style feature, and the second mixed style feature to obtain the spatial multi-level mixed style features of the face image output by the spatial multi-level mixed style network, wherein the spatial multi-level mixed style features are related to the high-precision features.
6. The method of claim 1, wherein constructing the multi-stage network based on the spatial multi-level mixed style network, the convolutional layers, and the normalization layers to obtain the spatial multi-level mixed style model comprises: the convolutional layers include a first convolutional layer, a second convolutional layer, a third convolutional layer, and a fourth convolutional layer, which perform a fourth convolution calculation, a fifth convolution calculation, a sixth convolution calculation, and a seventh convolution calculation, respectively; the normalization layer performs a batch normalization calculation; the first convolutional layer, the normalization layer, and three spatial multi-level mixed style networks are connected sequentially to form a first-stage network; the second convolutional layer, the normalization layer, and three spatial multi-level mixed style networks are connected sequentially to form a second-stage network; the third convolutional layer, the normalization layer, and nine spatial multi-level mixed style networks are connected sequentially to form a third-stage network; the fourth convolutional layer, the normalization layer, and three spatial multi-level mixed style networks are connected sequentially to form a fourth-stage network; and the first-stage network, the second-stage network, the third-stage network, and the fourth-stage network are connected sequentially to form the spatial multi-level mixed style model.
7. A device for extracting facial image features, characterized in that it comprises: a first construction module configured to construct a spatial multi-level network and an attention mixed style network, and to construct a spatial multi-level mixed style network based on the spatial multi-level network and the attention mixed style network; a second construction module configured to construct a multi-stage network based on the spatial multi-level mixed style network, convolutional layers, and normalization layers to obtain a spatial multi-level mixed style model; and an extraction module configured to acquire a face image and extract high-precision features of the face image using the spatial multi-level mixed style model; wherein the first construction module is specifically configured to: obtain the expansion rate N of the spatial multi-level mixed style model to be constructed, where N is a positive integer; when N is even, determine that the number of network layers contained in each of the spatial multi-level network and the attention mixed style network is N / 2; when N is odd, determine that the number of network layers contained in the spatial multi-level network is (N+1) / 2 and that the number of network layers contained in the attention mixed style network is (N-1) / 2; and construct the spatial multi-level network and the attention mixed style network based on the number of network layers contained in each; wherein the spatial multi-level network is used to provide spatial order information of multiple different face images, and the attention mixed style network is used to suppress the interference of different styles, backgrounds, and noise on face feature extraction.
8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 6.
Citation Information
Patent Citations
Face recognition method and device and electronic equipment
CN112597941A