Face attribute detection method based on deep self-attention network

By employing a face attribute detection method based on deep self-attention networks, combined with shared feature learning and attention mechanisms, the problem of low accuracy in complex environments for face attribute detection is solved, achieving higher detection accuracy and robustness while reducing computational costs.

CN115588217B · Active · Publication Date: 2026-06-19 · XIDIAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XIDIAN UNIV
Filing Date: 2022-06-23
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing face attribute detection methods are limited by complex backgrounds, pose variations, and lighting changes when processing different face images, resulting in low detection accuracy and ignoring the inherent relationship between face attributes and identity information.

Method used

A face attribute detection method based on deep self-attention network is adopted. By combining a shared attribute feature learning module and a specific attention feature learning module with a semi-automatic face attribute grouping strategy and an identity-related hierarchical loss function, global and local features of face images are extracted, and attention mechanism is used to focus on different content of the image.

Benefits of technology

It improves the accuracy of face attribute detection, especially when dealing with face attributes with similar heatmaps, significantly enhancing the robustness and discriminativeness of detection, reducing computational costs, and effectively utilizing face identity information.

Abstract

This invention relates to a face attribute detection method based on a deep self-attention network, comprising: Step 1, obtaining a training sample set, wherein the training sample set includes N face images and identity information for each face image, wherein each face image contains A face attribute labels, and N and A are natural numbers greater than 0; Step 2, training a face attribute detection model using the training sample set to obtain a trained deep face attribute detection model, wherein the deep face attribute detection model includes a shared attribute feature learning module and a specific attention feature learning module; Step 3, inputting the face image to be detected into the trained deep face attribute detection model to obtain the detection result. This invention proposes an identity-related hierarchical face attribute loss function. By simultaneously inputting face attributes and face identity, the task of learning the relationship between face attributes and face identity can guide the model to better learn the face attribute detection task, thereby improving the detection accuracy.

Description

Technical Field

[0001] This invention belongs to the field of face attribute detection technology, and relates to a face attribute detection method based on deep self-attention networks.

Background Technology

[0002] Face attribute detection can analyze semantic information (including age, gender, etc.) in face image data. It has wide applications in video surveillance, face retrieval, and social media. In recent years, with the development of deep learning technologies, image classification tasks have made significant progress. However, due to the large differences between different face attributes, the accuracy of simultaneously analyzing and detecting different attributes in face images still falls short of practical applications. Deep learning-based face attribute detection methods can combine feature extraction and attribute classification to achieve end-to-end learning, thereby improving the detection rate. This has led to a continuous increase in the popularity of face attribute detection in the field of image processing.

[0003] Facial attributes can be used as a standalone task in face recognition, or as supplementary information to assist other tasks. Although current facial attribute detection algorithms have achieved excellent performance, there are still some challenging problems and shortcomings that require attention.

[0004] Facial image analysis plays a crucial role in biometric security and computer vision. Facial attributes, considered key semantic information, can be applied to various real-world scenarios (e.g., surveillance, image retrieval, and facial attribute tampering). The core challenge of facial attribute detection is extracting suitable features that connect visual descriptive words and image pixels—two distinct domains. Despite significant advancements in facial attribute detection thanks to convolutional neural networks, many challenges remain in real-world applications.

[0005] Facial images are captured in many different scenarios, with complex backgrounds and large variations in pose and lighting between images. All of these factors inevitably affect the accuracy of facial attribute detection.

Summary of the Invention

[0006] To address the aforementioned problems in existing technologies, this invention provides a face attribute detection method based on deep self-attention networks. The technical problem to be solved by this invention is achieved through the following technical solution:

[0007] This invention provides a face attribute detection method based on a deep self-attention network, the face attribute detection method comprising:

[0008] Step 1: Obtain a training sample set, which includes N face images and identity information for each face image. Each face image contains A face attribute labels, where N and A are natural numbers greater than 0.

[0009] Step 2: Train the face attribute detection model using the training sample set to obtain a trained deep face attribute detection model. The deep face attribute detection model includes a shared attribute feature learning module and a specific attention feature learning module.

[0010] Step 3: Input the face image to be detected into the trained deep face attribute detection model to obtain the detection result.

[0011] In one embodiment of the present invention, step 2 includes:

[0012] Step 2.1: Segment the face images in the training sample set into several non-overlapping windows;

[0013] Step 2.2: Input the segmented face image and the identity information of the face image into the deep face attribute detection model, so as to establish a hierarchical identity information constraint loss function based on the output of the deep face attribute detection model;

[0014] Step 2.3: Use the stochastic gradient descent algorithm to minimize the hierarchical identity information constraint loss function;

[0015] Step 2.4: Obtain the trained deep face attribute detection model from the minimized hierarchical identity information constraint loss function.

[0016] In one embodiment of the present invention, the shared attribute feature learning module includes a linear embedding layer, m image patch fusion layers, (m+1) Swin Transformer Layers, a pooling layer, and a first fully connected layer. The linear embedding layer, the (m+1) Swin Transformer Layers, and the first fully connected layer are connected sequentially. An image patch fusion layer is set before each of the second to (m+1)th Swin Transformer Layers, and a pooling layer is set after the (m+1)th Swin Transformer Layer.

[0017] In one embodiment of the present invention, the specific attention feature learning module includes: a global attribute branch module and several local region branch modules and an identity branch module.

[0018] In one embodiment of the present invention, the global attribute branch module and each of the local region branch modules each include a second fully connected layer, a third fully connected layer, a first ReLU activation function layer, a first dropout layer and a first batch normalization layer connected in sequence.

[0019] In one embodiment of the present invention, the identity branch module includes a fourth fully connected layer, a fifth fully connected layer, a second ReLU activation function layer, a second dropout layer, and a second batch normalization layer connected in sequence. The outputs of the second fully connected layers of all the local region branch modules are connected and then input to the identity branch module. The output of the identity branch module is used to calculate the global identity loss.

[0020] In one embodiment of the present invention, the hierarchical identity information constraint loss function is:

[0021]

[0022] Where α, λ, and β are weighting parameters;

[0023]

[0024]

[0025]

[0026] Where C is the number of identities in the training sample set; x_n is the n-th face image, which is paired with its identity information; w_{i,j} is 1 if the two input face images have the same identity and 0 otherwise; the features of the i-th and j-th face images are those obtained by passing each image through the deep face attribute detection model; G is the number of attribute groups under the face attribute grouping strategy; A_g is the number of face attributes in the g-th attribute group; and, for the i-th face image, the a-th attribute of the g-th attribute group has an attribute label and a predicted probability.

[0027] In one embodiment of the present invention, the method further includes the following step before step 3:

[0028] Obtain a test sample set, which includes M face images and identity information for each face image, wherein each face image contains A face attribute tags;

[0029] The trained deep face attribute detection model is used to detect the face attributes of the test sample set to obtain the test results.

[0030] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0031] 1. This invention proposes a semi-automatic facial attribute grouping strategy. Since facial attributes are interconnected, this strategy is applied to facial attribute detection tasks. Multi-task facial attribute detection methods based on this strategy group attributes according to their correlations, assigning each group a learning task. However, effectively grouping these diverse facial attributes remains a challenge. Based on the facial attribute visualization heatmap obtained from a deep self-attention neural network, this invention groups facial attributes with similar heatmaps into the same attribute group. Therefore, attributes describing global facial features, such as attractiveness, youthfulness, and weight, are grouped into one group, while attributes primarily depicting local facial features, such as a large nose, baldness, and bangs, are grouped into another. Unlike previous facial attribute grouping methods that always depend on prior knowledge (such as data type, semantics, and subjectivity), the proposed strategy effectively categorizes facial attributes based on their different semantic information. Specifically, the 40 facial attributes were divided into 7 attribute groups, which can further improve the inherent spatial relationship within the same attribute group and help the model learn the inherent connection between different attributes within the same semantic group.

[0032] 2. This invention uses a model based on a deep self-attention mechanism to extract shared facial attribute features. The attention mechanism helps the model prioritize different content within an image during image processing, enabling it to efficiently extract meaningful information from large amounts of image data using limited attention. To enable efficient processing of facial images, they need to be converted into sequential features. Specifically, the original facial image is first segmented into n smaller image blocks. Each block is then flattened, and can be considered a one-dimensional feature. Positional information is embedded in these blocks, and finally, category information is embedded into the total feature to convert the image into sequential features. Unlike traditional convolutional neural network models that only focus on features in adjacent regions, the model used in this invention focuses more on the global features of the image. To better focus on the global features, the model uses not only an attention mechanism but also a shifted window. The shifted window-based attention mechanism allows the model to pay more attention to the global features of the image.

[0033] 3. Considering the inherent relationship between semantic attributes and facial identity, this invention proposes an identity-related hierarchical facial attribute loss function to model the correlation between facial attributes and facial identity. Typically, two facial images with the same identity have an inherent correlation; these images share highly similar facial attributes, such as both having high cheekbones. Conversely, facial images with different identities have less similar facial attributes; for example, one image may have a large nose, while the other does not. This indicates that high-level identity information helps the model learn discriminative features for facial attribute detection. To model the relationship between semantic attributes and facial identity, this invention proposes an identity-related hierarchical facial attribute loss function. By simultaneously inputting facial attributes and facial identity, the task of learning the relationship between facial attributes and facial identity guides the model to better learn the facial attribute detection task, improving detection accuracy.
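The shifted-window idea mentioned in item 2 above can be illustrated with a small sketch. It assumes a cyclic shift by half the window size, as in the publicly documented Swin Transformer design; the patent text does not state the shift amount, and the function names here are hypothetical.

```python
def window_ids(height, width, window):
    """Label each grid cell with the index of the non-overlapping window containing it."""
    per_row = width // window
    return [[(r // window) * per_row + (c // window) for c in range(width)]
            for r in range(height)]


def cyclic_shift(grid, shift):
    """Roll a 2-D grid up and to the left by `shift` cells, wrapping around the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def shifted_window_members(height, width, window):
    """For each window of the shifted layout, collect the regular-window ids it covers.

    Assumption: the shift is window // 2 (not specified in the patent text).
    """
    regular = window_ids(height, width, window)
    shifted = cyclic_shift(regular, window // 2)
    layout = window_ids(height, width, window)
    members = {}
    for r in range(height):
        for c in range(width):
            members.setdefault(layout[r][c], set()).add(shifted[r][c])
    return members


members = shifted_window_members(4, 4, 2)
```

On this 4×4 toy grid with 2×2 windows, every shifted window contains cells from all four regular windows, which is how attention computed inside windows can still pass information across window boundaries.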

[0034] Other aspects and features of the invention will become apparent from the following detailed description with reference to the accompanying drawings. However, it should be understood that the drawings are for illustrative purposes only and not as a limitation of the scope of the invention, as reference should be made to the appended claims. It should also be understood that, unless otherwise indicated, the drawings are not necessarily drawn to scale; they are merely intended to conceptually illustrate the structures and processes described herein.

Attached Figure Description

[0035] Figure 1 is a flowchart illustrating a face attribute detection method based on a deep self-attention network provided in an embodiment of the present invention.

[0036] Figure 2 is a schematic diagram of the structure of a deep face attribute detection model provided in an embodiment of the present invention.

Detailed Implementation

[0037] The present invention will be further described in detail below with reference to specific embodiments, but the implementation of the present invention is not limited thereto.

[0038] Example 1

[0039] Please see Figure 1, which is a flowchart illustrating a face attribute detection method based on a deep self-attention network provided by an embodiment of the present invention. The face attribute detection method includes:

[0040] Step 1: Obtain the training sample set, which includes N face images and identity information for each face image. Each face image contains A facial attribute labels, where N and A are natural numbers greater than 0. The facial attribute labels are used to annotate various attributes on the face image, including head, eyes, nose, mouth, cheeks, and neck. The identity information is used to represent the identity of the face image.

[0041] In addition, this embodiment also includes a test sample set, which includes M face images and identity information for each face image. Each face image contains A face attribute labels, and M is a natural number greater than 0.

[0042] Step 2: Train the face attribute detection model using the training sample set to obtain the trained deep face attribute detection model. The deep face attribute detection model includes a shared attribute feature learning module and a specific attention feature learning module.

[0043] In one specific embodiment, step 2 may include:

[0044] Step 2.1: Segment the face images in the training sample set into several non-overlapping windows.

[0045] Specifically, given an input face image, the input face image is processed into multiple non-overlapping windows using a conventional window partitioning strategy, for example, each local window is 4×4 in size.
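A minimal sketch of the non-overlapping window partitioning in Step 2.1 follows. The 8×8 toy image, the requirement that the image sides be multiples of the window size, and the function name are assumptions made for illustration; only the 4×4 window size comes from the text.

```python
def partition_windows(image, window=4):
    """Split a 2-D image (a list of rows) into non-overlapping window x window tiles.

    Tiles are returned in row-major order; each tile is itself a list of rows.
    """
    height, width = len(image), len(image[0])
    if height % window or width % window:
        raise ValueError("image sides must be multiples of the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in image[top:top + window]])
    return tiles


# Toy 8x8 "image" whose pixel value encodes its position (row * 8 + column).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = partition_windows(image, window=4)
```

Here the 8×8 input yields four 4×4 tiles, and no pixel belongs to more than one tile.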

[0046] Step 2.2: Input the segmented face image and the identity information of the face image into the deep face attribute detection model, so as to establish a hierarchical identity information constraint loss function based on the output of the deep face attribute detection model.

[0047] In one specific embodiment, please refer to Figure 2. The shared attribute feature learning module includes a linear embedding layer, m image patch fusion layers, (m+1) Swin Transformer Layers (a deep self-attention model based on shifted windows), a pooling layer, and a first fully connected layer. The linear embedding layer, the (m+1) Swin Transformer Layers, and the first fully connected layer are connected sequentially. An image patch fusion layer is placed before each of the 2nd to (m+1)th Swin Transformer Layers, and a pooling layer is placed after the (m+1)th Swin Transformer Layer. The linear embedding layer maps the segmented face image into 96-dimensional features. In the Swin Transformer Layers, standard multi-head self-attention (MSA) first processes the input, and a multilayer perceptron (MLP) is then applied to improve the transformation capability. Finally, the features output from the first fully connected layer are fed to the branches of the specific attention feature learning module.
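The layer ordering of the shared module can be traced with a short shape calculation. This is a sketch under stated assumptions: a 224×224 input, 4×4 patches, m = 3, and image patch fusion layers that halve the spatial resolution while doubling the channel count (the behaviour of patch merging in the public Swin Transformer design). Only the 96-dimensional embedding is given in the text.

```python
def stage_shapes(image_size=224, patch=4, embed_dim=96, m=3):
    """Return (tokens_per_side, channels) after each of the (m+1) Swin stages.

    Assumptions (not stated in the patent): input size, patch size, m, and the
    halve-resolution / double-channels rule for each image patch fusion layer.
    """
    side, channels = image_size // patch, embed_dim
    shapes = [(side, channels)]            # stage 1 follows the linear embedding directly
    for _ in range(m):                     # stages 2..(m+1) are each preceded by patch fusion
        side, channels = side // 2, channels * 2
        shapes.append((side, channels))
    return shapes


shapes = stage_shapes()
```

Under these assumptions the final stage produces a 7×7 grid of 768-channel tokens, which the pooling layer would reduce to a single vector before the first fully connected layer.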

[0048] In one specific embodiment, please refer to Figure 2. The specific attention feature learning module includes a global attribute branch module, several local region branch modules, and an identity branch module. The global attribute branch module outputs global attributes, which are overall attributes of the image, such as a round, chubby, heavily made-up, or oval face. The local region branch modules output local attributes, such as eyes and nose. Specifically, the local region branch modules include a head region branch module, an eye region branch module, a nose region branch module, a mouth region branch module, a cheek region branch module, and a neck region branch module. The global attribute branch module and each local region branch module include a second fully connected layer, a third fully connected layer, a first ReLU activation function layer, a first dropout layer, and a first batch normalization layer, all connected in sequence. The identity branch module includes a fourth fully connected layer, a fifth fully connected layer, a second ReLU activation function layer, a second dropout layer, and a second batch normalization layer, all connected in sequence. The outputs of the second fully connected layers of the global attribute branch module and all local region branch modules are concatenated and then input into the identity branch module. The output of the identity branch module is used to calculate the global identity loss.

[0049] The specific attention feature learning module, based on the proposed semi-automatic face attribute grouping strategy, adds a dedicated local region branch for each attribute group to detect the highly correlated face attributes within that group. In each local region branch, features from the shared attribute feature learning module are first fed to the two fully connected (FC) layers of the branch module (i.e., the second and third fully connected layers), which have 2048 and 512 neurons, respectively. To avoid overfitting and enhance the model's non-linear fitting ability, a ReLU (rectified linear unit) activation function, a dropout layer (dropout probability = 0.5), and a batch normalization (BN) layer are added after each fully connected layer. In addition to the local region branches, an identity-related branch is introduced in the specific attention feature learning module to model the relationship between face attributes and identity from a global perspective. This branch concatenates the features output by the first of the two fully connected layers in each branch module (i.e., the second fully connected layer) and passes them to two further fully connected layers (the fourth and fifth fully connected layers), with 4096 and 2048 neurons, respectively. A ReLU activation function, a dropout layer (dropout probability = 0.5), and a batch normalization layer are also added after these fully connected layers. The features obtained from the identity-related branch are used to calculate the global identity loss, helping the model further explore the intrinsic relationship between facial attributes and facial identity.
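The layer widths above can be checked with a small dimension sketch. It assumes a 768-dimensional shared feature (a hypothetical value) and that the identity branch receives the 2048-dimensional outputs of all seven branches (global plus six local regions), following paragraph [0048]; the helper names are illustrative.

```python
BRANCHES = ["global", "head", "eye", "nose", "mouth", "cheek", "neck"]


def fc_params(n_in, n_out):
    """Number of weights plus biases in one fully connected layer."""
    return n_in * n_out + n_out


def head_dimensions(shared_dim=768):
    """Describe (input, output) sizes of the FC layers in the branch modules.

    Assumption: shared_dim is hypothetical; 2048, 512, 4096 and 2048 come from the text.
    """
    attribute_branch = [(shared_dim, 2048), (2048, 512)]   # second and third FC layers
    identity_in = len(BRANCHES) * 2048                     # concatenated second-FC outputs
    identity_branch = [(identity_in, 4096), (4096, 2048)]  # fourth and fifth FC layers
    return {
        "attribute_branch": attribute_branch,
        "identity_in": identity_in,
        "identity_branch": identity_branch,
    }


dims = head_dimensions()
identity_fc4_params = fc_params(*dims["identity_branch"][0])
```

With seven branches, the identity branch input is 7 × 2048 = 14336 features, so its first layer dominates the parameter count of the branch modules.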

[0050] In one specific embodiment, the hierarchical identity information constraint loss function is:

[0051]

[0052] Where α, λ, and β are weighting parameters;

[0053]

[0054]

[0055]

[0056] Where C is the number of identities in the training sample set; x_n is the n-th face image, which is paired with its identity information; w_{i,j} is 1 if the two input face images have the same identity and 0 otherwise; the features of the i-th and j-th face images are those obtained by passing each image through the deep face attribute detection model; G is the number of attribute groups under the face attribute grouping strategy; A_g is the number of face attributes in the g-th attribute group; and, for the i-th face image, the a-th attribute of the g-th attribute group has an attribute label and a predicted probability. The probability of the a-th attribute is a value between 0 and 1 output by the deep face attribute detection model for a single attribute: a value greater than 0.5 indicates that the image possesses that face attribute, otherwise it does not. αLoss_F + (1-α)Loss_C is the global identity loss function. The attribute groups include the global attribute group, head group, eye group, nose group, mouth group, cheek group, and neck group. Please refer to Table 1 for the specific contents of the attribute groups.
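Three of the quantities defined above are simple enough to sketch directly: the 0.5 decision threshold on attribute probabilities, the same-identity indicator w_{i,j}, and the weighted combination αLoss_F + (1-α)Loss_C. The function names are hypothetical, and the two loss terms are passed in as plain numbers because their formulas are not reproduced in this text.

```python
def predict_attributes(probabilities, threshold=0.5):
    """An attribute is present only when its probability is strictly greater than the threshold."""
    return [p > threshold for p in probabilities]


def same_identity_indicator(identities):
    """Build the matrix w[i][j]: 1 if images i and j share an identity, else 0."""
    n = len(identities)
    return [[1 if identities[i] == identities[j] else 0 for j in range(n)]
            for i in range(n)]


def global_identity_loss(loss_f, loss_c, alpha):
    """Weighted combination alpha * Loss_F + (1 - alpha) * Loss_C."""
    return alpha * loss_f + (1 - alpha) * loss_c


present = predict_attributes([0.2, 0.5, 0.9])
w = same_identity_indicator(["id_a", "id_b", "id_a"])
combined = global_identity_loss(2.0, 4.0, alpha=0.25)
```

Note that a probability of exactly 0.5 is treated as "attribute absent", matching the "greater than 0.5" wording.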

[0057] Table 1. Face attributes and their corresponding attribute groups

[0058]

[0059]

[0060] This invention adds a semantic attention-specific region branch based on an attribute grouping strategy to detect strongly correlated facial attributes, grouping attributes with similar attention regions into the same group. Subsequently, facial attributes are divided into seven attention-specific attribute groups according to the proposed semi-automatic grouping strategy.

[0061] Step 2.3: Use the stochastic gradient descent algorithm to minimize the hierarchical identity information constraint loss function;

[0062] Step 2.4: Obtain the trained deep face attribute detection model from the minimized hierarchical identity information constraint loss function.

[0063] In other words, once the stochastic gradient descent algorithm has minimized the hierarchical identity information constraint loss function, the model parameters at that point are taken as the parameters of the final trained deep face attribute detection model, which can then be used directly to detect face attributes.
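The minimization in Steps 2.3 and 2.4 can be sketched with a generic stochastic gradient descent loop. This is a toy illustration only: it minimizes a one-parameter squared error rather than the hierarchical identity information constraint loss, and the learning rate, epoch count, and data are assumptions.

```python
import random


def sgd(grad_fn, params, samples, lr=0.1, epochs=50, seed=0):
    """Plain SGD: visit samples in shuffled order and step against each sample's gradient."""
    rng = random.Random(seed)
    params = list(params)
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)
        for sample in order:
            grads = grad_fn(params, sample)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params


def squared_error_grad(params, sample):
    """Gradient of the toy loss (w * x - y)^2 with respect to w."""
    (w,), (x, y) = params, sample
    return [2 * (w * x - y) * x]


# Toy data generated by y = 3x; the fitted weight should approach 3.
data = [(x / 10, 3 * x / 10) for x in range(1, 11)]
(w_fit,) = sgd(squared_error_grad, [0.0], data)
```

The parameters returned after the final epoch play the role of the "trained model" in Step 2.4.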

[0064] This invention applies a hierarchical identity information constraint loss function to mine facial identity information and the relationship between facial identities from both global and local perspectives. By considering the intrinsic relationship between semantic attributes and facial identities, this function can help the model learn robust discriminative information better.

[0065] Step 3: Use the trained deep face attribute detection model to detect the face attributes of the test sample set and obtain the test results. This will determine the accuracy of the trained deep face attribute detection model in detecting face attributes.

[0066] Step 4: Input the face image to be detected into the trained deep face attribute detection model to obtain the detection results.

[0067] In this embodiment, the face image to be detected is the face image that needs to be detected. After it is segmented, it is input into the trained deep face attribute detection model, and the trained deep face attribute detection model can output the corresponding attributes.
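The test-set evaluation in Step 3 amounts to comparing predicted and labelled attributes for each of the M test images. A minimal sketch, assuming binary (0/1) predictions and labels and using made-up toy data:

```python
def per_attribute_accuracy(predictions, labels):
    """Fraction of images on which each attribute is predicted correctly.

    predictions, labels: lists of M rows, each holding A binary values.
    """
    n_images, n_attrs = len(labels), len(labels[0])
    return [
        sum(predictions[i][a] == labels[i][a] for i in range(n_images)) / n_images
        for a in range(n_attrs)
    ]


# Toy example with M = 4 images and A = 3 attributes (values are illustrative).
preds = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
labels = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
accuracy = per_attribute_accuracy(preds, labels)
mean_accuracy = sum(accuracy) / len(accuracy)
```

Reporting accuracy per attribute, as well as the mean, matches the way the comparison with DMTL is stated for individual attributes.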

[0068] 1. Currently, most face attribute detection algorithms are based on convolutional neural networks (CNNs). Although deep CNN-based models can achieve good face attribute detection rates, the convolution operation can only focus on a neighborhood, extracting only local features, which limits its ability to capture global feature representations. To address this shortcoming of CNNs in face attribute detection, this invention proposes a face attribute detection method based on a deep self-attention network. This method can better capture long-distance feature dependencies and effectively improve the face attribute detection rate.

[0069] 2. Most existing face attribute detection algorithms treat face attribute detection as an independent multi-label classification task, ignoring the inherent relationship between attributes and facial identity information. To uncover the relationship between face attributes and facial identity information, this invention proposes a hierarchical identity-related loss function. This loss function, by considering the inherent relationship between facial identity and semantic attributes, helps the model better learn features containing robust discriminative information.

[0070] 3. Unlike most previous facial attribute detection methods that rely on prior knowledge (such as data type, semantics, and subjectivity) for attribute classification, this invention employs a semi-automatic attribute grouping strategy: using gradient-weighted class activation map (Grad-CAM) visualization technology to create heatmaps of different facial attributes, and then grouping attributes with similar regions of interest into the same group. Therefore, facial attributes are divided into 7 groups: global attributes, head, eyes, nose, mouth, cheeks, and neck. This grouping strategy helps the model better learn the intrinsic relationships between different attributes within the same attribute group.
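The grouping of attributes with similar heatmaps, described in item 3 above, can be sketched as follows. The similarity measure (cosine), the 0.8 threshold, the greedy assignment, and the four-value toy heatmaps are all assumptions made for illustration; the text does not specify how heatmap similarity is measured, and in a semi-automatic workflow the resulting groups would still be reviewed by a person.

```python
import math


def cosine(u, v):
    """Cosine similarity between two flattened heatmaps."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def group_by_heatmap(heatmaps, threshold=0.8):
    """Greedy grouping: an attribute joins the first group whose seed heatmap is similar enough."""
    groups = []  # each entry is (seed_heatmap, [attribute names])
    for name, vec in heatmaps.items():
        for seed, members in groups:
            if cosine(seed, vec) >= threshold:
                members.append(name)
                break
        else:
            groups.append((vec, [name]))
    return [members for _, members in groups]


# Hypothetical flattened heatmaps; the values are made up for this example.
heatmaps = {
    "big_nose": [0.0, 0.1, 0.9, 0.1],
    "pointy_nose": [0.0, 0.2, 0.8, 0.1],
    "bangs": [0.9, 0.1, 0.0, 0.0],
    "bald": [0.8, 0.2, 0.1, 0.0],
}
groups = group_by_heatmap(heatmaps)
```

In this toy case the two nose attributes share one group and the two attributes near the top of the head share another, because their heatmaps concentrate on the same cells.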

[0071] 1. As shown in Table 2, compared with the best existing technology, this invention achieves the best recognition accuracy for 20 out of 40 facial attributes. For example, when detecting the attribute "bald" and the attribute "bangs," the method of this invention achieves recognition accuracies of 91.04% and 96.50%, respectively, which are 6.04 and 5.50 percentage points higher than the best existing method, DMTL (Deep Multi-Task Learning). By using a semi-automatic facial attribute grouping strategy, facial attributes with similar visual heatmaps are grouped into the same attribute group, helping the model to better learn the facial attribute features within the same group. These features are more robust and discriminative, so the model achieves a higher detection rate on individual facial attributes.

[0072] Table 2

[0073]

[0074]

[0075] 2. The best existing technology, DMTL, introduces an additional face normalization step to reduce the impact of large facial variations and complex facial images. This method first uses the SeetaFaceEngine engine to detect facial key points and then performs face cropping based on these key points, which increases computational costs. In contrast, this invention directly inputs the facial image into the backbone model without additional facial image processing steps, directly extracting features from the entire input facial image, thus reducing the computational cost of the model when performing facial attribute detection.

[0076] 3. DMTL uses AlexNet, a deep convolutional neural network, as its backbone. Because each convolution only operates on adjacent locations of the region of interest, this model focuses more on local information within the image. In contrast, this invention uses a deep self-attention network to extract shared facial attribute features, emphasizing different parts of the image. This allows the model to efficiently extract meaningful information from large amounts of image data using limited attention. Using attention mechanisms helps the model learn global connections between elements and also pays attention to local connections. Compared to DMTL, the model used in this invention focuses more on global information in the image during feature extraction, rather than just information from adjacent regions.

[0077] 4. In DMTL, only the face attribute detection task is considered, ignoring the correlation between face attributes and identity information. This invention uses a face identity information constraint loss function to limit and guide the face attribute detection task. Typically, two face images with the same identity have an inherent correlation; these images share highly similar face attributes, such as both having high cheekbones. Conversely, face images with different identities have less similar face attributes; for example, one face image may have a large nose, while the other does not. This demonstrates that high-level identity information helps the model learn discriminative features for face attribute detection. To model the relationship between semantic attributes and face identity, this invention proposes an identity-related hierarchical face attribute loss function. This function considers the inherent relationship between semantic attributes and face identity. By simultaneously inputting face attributes and face identity, the task of learning the relationship between face attributes and face identity guides the model to better learn the face attribute detection task.

[0078] In the description of this invention, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0079] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or feature data point described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or feature data points described may be combined in any suitable manner in one or more embodiments or examples. In addition, those skilled in the art can combine and integrate the different embodiments or examples described in this specification.

[0080] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.

Claims

1. A face attribute detection method based on a deep self-attention network, characterized in that the face attribute detection method includes:
Step 1: Obtain a training sample set, which includes N face images and identity information for each face image. The identity information characterizes the identity of the face image. Each face image contains A face attribute labels, which label the attributes on the face image. N and A are natural numbers greater than 0.
Step 2: Train the face attribute detection model using the training sample set to obtain a trained deep face attribute detection model. The deep face attribute detection model includes a shared attribute feature learning module and a specific attention feature learning module. The shared attribute feature learning module includes a linear embedding layer, m image patch fusion layers, (m+1) Swin Transformer layers, a pooling layer, and a first fully connected layer. The linear embedding layer, the (m+1) Swin Transformer layers, and the first fully connected layer are connected sequentially. An image patch fusion layer is placed before each of the 2nd to (m+1)-th Swin Transformer layers, and the pooling layer is placed after the (m+1)-th Swin Transformer layer.
Step 2 includes:
Step 2.1: Segment the face images in the training sample set into several non-overlapping windows;
Step 2.2: Input the segmented face image and its identity information into the deep face attribute detection model, and establish a hierarchical identity information constraint loss function based on the output of the deep face attribute detection model. The hierarchical identity information constraint loss function is as follows: [formula image not reproduced in this text]. The quantities appearing in the loss function are: the weight parameters; C, the number of identities in the training sample set; the identity information of the n-th face image; an indicator that equals 1 if the two input face images have the same identity and 0 otherwise; the features obtained by passing the i-th face image through the deep face attribute detection model; the features obtained by passing the j-th face image through the deep face attribute detection model; G, the number of attribute groups determined by the face attribute grouping strategy; the number of face attributes in the g-th attribute group; the label of the a-th attribute in the g-th attribute group of the i-th face image; and the probability of the a-th attribute in the g-th attribute group of the i-th face image;
Step 2.3: Use the stochastic gradient descent algorithm to minimize the hierarchical identity information constraint loss function;
Step 2.4: Obtain the trained deep face attribute detection model from the minimized hierarchical identity information constraint loss function.
Step 3: Input the face image to be detected into the trained deep face attribute detection model to obtain the detection result.
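Two structural points of claim 1 can be made concrete with a small sketch: the segmentation of an image into non-overlapping windows (Step 2.1) and the order in which the layers of the shared attribute feature learning module are connected. This is an illustration under stated assumptions, not the implementation of the invention: the function names `partition_windows` and `shared_module_layer_order`, the layer name strings, the row-major tile order, and the requirement that the image size be divisible by the window size are all hypothetical choices.

```python
def partition_windows(image, window):
    """Split a 2-D grid (list of equal-length rows) into non-overlapping
    window x window tiles, returned in row-major order.

    Assumes the height and width are multiples of `window`.
    """
    height, width = len(image), len(image[0])
    if height % window or width % window:
        raise ValueError("image size must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in image[top:top + window]])
    return tiles


def shared_module_layer_order(m):
    """Return the layer sequence described in claim 1 for a given m:
    a linear embedding, (m+1) transformer layers with a patch fusion layer
    before each of the 2nd to (m+1)-th, then pooling and a fully connected layer.
    Layer names are illustrative labels only.
    """
    layers = ["linear_embedding", "transformer_layer_1"]
    for k in range(2, m + 2):
        layers += [f"patch_fusion_{k - 1}", f"transformer_layer_{k}"]
    layers += ["pooling", "fully_connected_1"]
    return layers
```

For example, calling `partition_windows` on a 4 x 4 grid with `window=2` yields four 2 x 2 tiles, and `shared_module_layer_order(3)` lists four transformer layers with three patch fusion layers interleaved between them.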

2. The face attribute detection method based on deep self-attention network according to claim 1, characterized in that the specific attention feature learning module includes a global attribute branch module, several local region branch modules, and an identity branch module.

3. The face attribute detection method based on deep self-attention network according to claim 2, characterized in that the global attribute branch module and each of the local region branch modules include a second fully connected layer, a third fully connected layer, a first ReLU activation function layer, a first dropout layer, and a first batch normalization layer connected in sequence.

4. The face attribute detection method based on deep self-attention network according to claim 3, characterized in that the identity branch module includes a fourth fully connected layer, a fifth fully connected layer, a second ReLU activation function layer, a second dropout layer, and a second batch normalization layer connected in sequence; the outputs of the second fully connected layers of all the local region branch modules are concatenated and then input to the identity branch module; and the output of the identity branch module is used to calculate the global identity loss.

5. The face attribute detection method based on deep self-attention network according to claim 1, characterized in that, before Step 3, the method further includes: obtaining a test sample set, which includes M face images and identity information for each face image, wherein each face image contains A face attribute labels; and detecting the face attributes of the test sample set using the trained deep face attribute detection model to obtain a test result.