Face liveness detection methods, devices, computer equipment, and storage media

By introducing multi-task joint loss training into the face liveness detection model and combining live face classification and feature detection tasks, the detection accuracy and defense capability of the model are improved, solving the problem of low detection accuracy in existing technologies.

CN115880740B · Active · Publication Date: 2026-06-30 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2021-09-27
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing face liveness detection technologies have low accuracy when facing various types of attacks, making it difficult to effectively improve the model's defense capabilities.

Method used

A constrained training method using multi-task joint loss is adopted. By connecting the live face classification task branch and the live feature detection task branch after the backbone network of the face liveness detection model, the training process of the model is optimized by multi-task joint loss, so that the live face classification task branch is affected by the live feature detection task branch, thereby improving the model's ability to defend against different attack types.

Benefits of technology

It improves the accuracy of face liveness detection, enhances the model's defense against different types of attacks, reduces material costs, and does not require additional sensor input.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to a method, apparatus, computer device, and storage medium for face liveness detection, belonging to the field of computer vision technology. The method includes: acquiring a face image to be tested; detecting the face image using a trained face liveness detection model, wherein the face liveness detection model is pre-trained based on constraints of a multi-task joint loss, the multi-task joint loss including at least the joint loss of a liveness face classification task and a liveness feature detection task; and obtaining a liveness face detection result based on the output of the liveness face classification task branch. This method can improve detection accuracy. Embodiments of this invention can be applied to various scenarios such as cloud technology, artificial intelligence, smart transportation, and assisted driving.

Description

Technical Field

[0001] This application relates to the fields of artificial intelligence, computer vision and image processing technology, and in particular to a method, apparatus, computer device and storage medium for human face liveness detection.

Background Technology

[0002] With the development of facial recognition technology, it is being applied in more and more business scenarios, such as facial recognition payment. To improve the security performance of facial recognition in various business scenarios, more and more facial liveness detection technologies are being applied to products, effectively ensuring business security.

[0003] Specifically, facial liveness detection technology is the first line of defense for facial recognition security. By collecting digital image data, it determines whether a face is real. If the face is identified as a real person, the process can proceed to subsequent business processes, such as payment and access control. However, if the image is identified as a malicious image (such as a high-resolution photo), an error message will be displayed.

[0004] Currently, the commonly used technique for face liveness detection is a multi-classification model training method based on digital images. The main idea of this method is to constrain the model parameters using a multi-classification loss function. A decreasing classification loss indicates that the model's accuracy in face liveness detection is improving. Once the loss converges, the model is considered to have completed training. The drawback of this method is that it only uses digital images as the basis and relies on a single classification task to constrain the model's accuracy. When faced with numerous and complex attack types, the detection accuracy is relatively low.

Summary of the Invention

[0005] Therefore, it is necessary to provide a face liveness detection method, device, computer equipment, and storage medium that can improve the detection accuracy in response to the above-mentioned technical problems.

[0006] A face liveness detection method, the method comprising:

[0007] Acquire the face image of the person to be tested;

[0008] Detect the face image to be tested using a trained face liveness detection model. The backbone network of the face liveness detection model is connected to a task branch for each task. One of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on live features. The face liveness detection model is trained in advance based on the constraints of a multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task.

[0009] Based on the output of the live face classification task branch, the live face detection result is obtained.

[0010] A face liveness detection device, the device comprising:

[0011] The image acquisition module is used to acquire the face image of the person to be tested;

[0012] The detection module is used to detect the face image to be tested using a trained face liveness detection model. The backbone network of the face liveness detection model is connected to task branches for each task. One task branch is a liveness face classification task branch based on face image features, and at least one of the other task branches is a liveness feature detection task branch based on liveness features. The face liveness detection model is pre-trained based on constraints of a multi-task joint loss, which includes at least the joint loss of the liveness face classification task and the liveness feature detection task.

[0013] The detection output module is used to obtain the live face detection result based on the output of the live face classification task branch.

[0014] A computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program performing the following steps:

[0015] Acquire the face image of the person to be tested;

[0016] Detect the face image to be tested using a trained face liveness detection model. The backbone network of the face liveness detection model is connected to a task branch for each task. One of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on live features. The face liveness detection model is trained in advance based on the constraints of a multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task.

[0017] Based on the output of the live face classification task branch, the live face detection result is obtained.

[0018] A computer-readable storage medium having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0019] Acquire the face image of the person to be tested;

[0020] Detect the face image to be tested using a trained face liveness detection model. The backbone network of the face liveness detection model is connected to a task branch for each task. One of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on live features. The face liveness detection model is trained in advance based on the constraints of a multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task.

[0021] Based on the output of the live face classification task branch, the live face detection result is obtained.

[0022] A computer program product or computer program includes computer instructions stored in a computer-readable storage medium, wherein a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the following steps:

[0023] Acquire the face image of the person to be tested;

[0024] Detect the face image to be tested using a trained face liveness detection model. The backbone network of the face liveness detection model is connected to a task branch for each task. One of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on live features. The face liveness detection model is trained in advance based on the constraints of a multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task.

[0025] Based on the output of the live face classification task branch, the live face detection result is obtained.

[0026] The aforementioned face liveness detection method, apparatus, computer equipment, and storage medium predict the face image to be detected using a trained face liveness detection model. Since the face liveness detection model is a multi-task model, it is trained based on the constraints of the joint loss of the liveness classification task and the liveness feature detection task. During the training process, the liveness detection and classification tasks are trained together, so that the two are mutually constrained. Thus, the liveness classification task branch is affected by the liveness feature detection task branch, so that the output of the liveness classification task branch takes into account the liveness features. Compared with single-dimensional image features, this improves the model's defense against different attack types, thereby improving the detection accuracy.

Attached Figure Description

[0027] Figure 1 is a diagram illustrating the application environment of a face liveness detection method in one embodiment;

[0028] Figure 2 is a flowchart illustrating a face liveness detection method in one embodiment;

[0029] Figure 3 is a schematic diagram of the structure of a face liveness detection model in one embodiment;

[0030] Figure 4 is a schematic diagram illustrating the loss relationships of various tasks in a face liveness detection model with two network branches in one embodiment;

[0031] Figure 5 is a flowchart illustrating a face liveness detection method in another embodiment;

[0032] Figure 6 is a schematic diagram of the structure of a face liveness detection model in another embodiment;

[0033] Figure 7 is a structural block diagram of a face liveness detection device in one embodiment;

[0034] Figure 8 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0035] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0036] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0037] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0038] Computer vision (CV) is a science that studies how to enable machines to "see." More specifically, it refers to machine vision, which uses cameras and computers to replace human eyes in tasks such as target recognition, tracking, and measurement, and further performs image processing to create images more suitable for human observation or transmission to instruments. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems capable of extracting information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content / behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping (SLAM), and common biometric recognition technologies such as facial recognition and fingerprint recognition.

[0039] The solutions provided in this application involve technologies such as artificial intelligence and computer vision, which are specifically illustrated through the following embodiments:

[0040] The face liveness detection method provided in this application can be applied to the application environment shown in Figure 1. Terminal 102 communicates with server 104 via a network, with the server providing face liveness detection services. Terminals include, but are not limited to, mobile phones, computers, smart voice interaction devices, smart home appliances, and vehicle terminals. They can also be terminals set up in fixed business locations, such as self-service terminals equipped with cameras (e.g., ATMs, self-service libraries, etc.). This invention can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, smart transportation, and assisted driving. The terminal collects the user's facial image and sends it to the server, which thereby acquires the face image to be tested. A pre-trained face liveness detection model is used to detect the face image. This model is a multi-task model, with a task branch for each task connected to the backbone network of the face liveness detection model. One task branch is a liveness face classification task branch based on facial image features, and at least one of the other task branches is a liveness feature detection task branch based on liveness features. The face liveness detection model is pre-trained based on constraints of a multi-task joint loss, which includes at least the joint loss of the liveness face classification task and the liveness feature detection task. The liveness face detection result is obtained based on the output of the liveness face classification task branch. The server 104 can be implemented using a standalone server or a server cluster consisting of multiple servers.

[0041] In one embodiment, as shown in Figure 2, a face liveness detection method is provided. Taking its application to the server in Figure 1 as an example, the method includes the following steps:

[0042] Step 202: Obtain the face image of the person to be tested.

[0043] The face image to be tested is a face image captured in real time by the terminal's camera, triggered by face detection in a business scenario. Business scenarios include all face recognition-related business scenarios, including but not limited to mobile phone face unlock, APP (application) face login, remote face verification, face recognition access control system, offline face recognition payment, and automatic face recognition clearance.

[0044] In order to ensure that the acquired face images meet the input requirements of the face liveness detection model, the acquired face images can be processed. For example, if the face liveness detection model requires the input image to be 224*224*3 pixels, the acquired face images can be cropped to a specified size centered on the face region to serve as the face image to be tested.
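The cropping described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `crop_centered_on_face`, the `(x0, y0, x1, y1)` box format, and the zero-padding at image borders are all assumptions introduced here.

```python
import numpy as np

def crop_centered_on_face(image, face_box, size=224):
    """Return a size x size x 3 window centered on the face box.

    face_box is assumed to be (x0, y0, x1, y1) in pixel coordinates.
    Zero-padding (an assumption) keeps the window valid near borders.
    """
    x0, y0, x1, y1 = face_box
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    half = size // 2
    # Pad every side by half the window so the crop never leaves the array.
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    # Pixel (cy, cx) of the original sits at (cy + half, cx + half) after
    # padding, so the window centered on it starts at (cy, cx).
    return padded[cy:cy + size, cx:cx + size, :]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in camera frame
patch = crop_centered_on_face(frame, face_box=(300, 200, 400, 320))
print(patch.shape)  # (224, 224, 3)
```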

[0045] Step 204: Detect the face image to be tested using the trained face liveness detection model. The backbone network of the face liveness detection model is connected to the task branches of each task. One of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on live features. The face liveness detection model is pre-trained based on the constraints of multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task.

[0046] Specifically, the face liveness detection model is pre-trained, and the parameters of the face liveness detection model are continuously adjusted during the training process so that it can accurately predict whether the input face image is a live face.

[0047] In this embodiment, the face liveness detection model is a multi-task model. A multi-task model refers to a neural network model whose structure can perform multiple tasks. In this example, it can perform both the live face classification task and the liveness feature detection task. Specifically, the structure of the face liveness detection model 30 is shown in Figure 3. It includes a backbone network 301 and multiple branch networks connected to the backbone network, each branch network implementing a task. One of the task branches in the multiple branch networks is a live face classification task branch 302 based on face image features, and at least one of the other task branches is a live feature detection task branch 303 based on live features.

[0048] Among them, the live face classification task branch 302, based on facial image features, determines whether a face is live by extracting facial image features from the image to be tested. Its task is to achieve live face classification, and the output results include two categories: one category for live faces and one category for attack images. The facial image features include the external features of the face as shown in the image, such as the features of the facial features.

[0049] Liveness features refer to the characteristics of a living person as reflected in an image. For example, a living person is usually in motion, and when captured in space, they possess depth information. The liveness feature detection task branch 303, based on liveness features, determines whether the subject is alive by extracting liveness features from the face image being tested. Its task is to extract liveness features, and the output varies depending on the type of liveness feature. If the liveness feature is depth information, the output is a depth map; if the liveness feature is an animation, the output is an animation detection classification; if the liveness feature is a specified action, the output is a classification indicating whether the specified action was performed.
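The two-branch structure described above (shared backbone, classification branch, liveness feature branch) can be illustrated with a toy NumPy model. This is only a structural sketch under stated assumptions: the class name `TwoBranchLivenessModel`, the single linear layer standing in for the backbone, the 32*32*3 toy input size, and the 8*8 depth output are all hypothetical and chosen to keep the example small.

```python
import numpy as np

class TwoBranchLivenessModel:
    """Toy stand-in: one shared backbone feeding two task branches."""

    def __init__(self, in_shape=(32, 32, 3), feat_dim=64, depth_hw=8, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = int(np.prod(in_shape))
        self.depth_hw = depth_hw
        # Shared backbone parameters (a single linear layer, for illustration).
        self.w_backbone = rng.normal(0.0, 0.01, (in_dim, feat_dim))
        # Branch 1: live face classification (2 classes: live / attack).
        self.w_cls = rng.normal(0.0, 0.01, (feat_dim, 2))
        # Branch 2: liveness feature detection, here a coarse depth map.
        self.w_depth = rng.normal(0.0, 0.01, (feat_dim, depth_hw * depth_hw))

    def forward(self, images):
        x = images.reshape(images.shape[0], -1)
        feat = np.maximum(x @ self.w_backbone, 0.0)  # shared features (ReLU)
        logits = feat @ self.w_cls
        logits = logits - logits.max(axis=1, keepdims=True)  # stable softmax
        exp = np.exp(logits)
        probs = exp / exp.sum(axis=1, keepdims=True)
        depth = 1.0 / (1.0 + np.exp(-(feat @ self.w_depth)))  # values in (0, 1)
        return probs, depth.reshape(-1, self.depth_hw, self.depth_hw)

model = TwoBranchLivenessModel()
batch = np.random.default_rng(1).random((4, 32, 32, 3))
cls_probs, depth_maps = model.forward(batch)
print(cls_probs.shape, depth_maps.shape)  # (4, 2) (4, 8, 8)
```

Both outputs are computed from the same `feat` array, which is the property the multi-task design relies on: gradients from either branch would update the shared backbone parameters.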

[0050] The live face classification task branch, based on facial image features, identifies the face image to determine whether it represents a live person. The liveness feature detection task branch, based on liveness features, also identifies the face image to determine whether it represents a live person. Since the parameters of the face liveness detection model are trained based on the constraints of the joint loss of both the live face classification task and the liveness feature detection task, both facial image features and liveness features influence the model. Compared to single-task methods that constrain the model through a single classification loss, this embodiment employs multi-task training, with model training based at least on the constraints of the joint loss of the live face classification task and the live feature detection task. This allows for joint optimization of each task. The constraints of the joint loss ensure that the live face classification task and the live feature detection task mutually constrain and promote each other during training. In other words, the live face classification task branch is influenced by the live feature detection task branch, ensuring that the output of the live face classification task branch considers live features. Compared to single-dimensional image features, this enhances the model's defense against different attack types, thereby improving detection accuracy.

[0051] In practical applications, a common approach to training liveness detection models involves assigning a specific input object to each task, combining multiple input objects to obtain multi-dimensional features, and then fusing these features for face liveness detection. For example, given an RGB digital image as input, the detection model, in addition to outputting the classification result and calculating the classification loss, will also output a depth map. This depth map will then be compared with the actual depth map corresponding to the RGB image to calculate a regression loss. This method requires multi-dimensional feature inputs, necessitating the collection of more dimensional features, such as depth maps. Infrared and depth sensing devices are typically more expensive than RGB cameras.

[0052] In this embodiment, apart from the face image to be tested, no feature information of other dimensions needs to be input. No additional sensing devices are therefore required, which reduces material costs.

[0053] Step 206: Obtain the live face detection result based on the output of the live face classification task branch.

[0054] The face liveness detection model outputs a prediction result for each task branch. These include at least the prediction result for live face classification (whether the input is a live image or an attack image) and the prediction result for liveness feature detection (information related to liveness features, such as a depth map). In practical applications, only the prediction result from the live face classification branch is used. The category label at the position of the maximum value in that branch's probability vector is the model's final real person / attack prediction.
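Reading the final result from the classification branch amounts to an argmax over its probability vector. A minimal sketch follows; the label order in `LABELS` (index 0 = live, index 1 = attack) is an assumption, since the source does not state which index maps to which class.

```python
import numpy as np

# Assumed index-to-label mapping; the source does not specify the order.
LABELS = ("live", "attack")

def liveness_result(cls_probs):
    """Return the label at the position of the largest probability."""
    cls_probs = np.asarray(cls_probs, dtype=float)
    return LABELS[int(np.argmax(cls_probs))]

# The liveness feature branch output (e.g. a depth map) is ignored here.
print(liveness_result([0.91, 0.09]))  # live
print(liveness_result([0.20, 0.80]))  # attack
```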

[0055] The face liveness detection model can be deployed directly before the face recognition model to detect the face recognition input image. If it is a live face, it will proceed to the subsequent recognition process; if it is an attack image, an error message will be displayed and a retry will be performed.

[0056] The aforementioned face liveness detection method uses a trained face liveness detection model to predict the face image to be tested. Since the face liveness detection model is a multi-task model, it is trained based on the constraints of the joint loss of the liveness classification task and the liveness feature detection task. During the training process, the liveness detection and classification tasks are trained together, so that the two are mutually constrained. Thus, the liveness classification task branch is affected by the liveness feature detection task branch, so that the output of the liveness classification task branch takes into account the liveness features. Compared with single-dimensional image features, this improves the model's defense against different attack types, thereby improving the detection accuracy.

[0057] In another embodiment, the liveness feature detection task is any one of the following: depth information detection task, animation detection task, and specified action detection task.

[0058] Specifically, liveness features refer to the characteristics of a living body as reflected in an image. For example, a living body is usually in motion. When a living body is photographed in space, it has depth information and dynamic information and can perform a specified action. Therefore, the liveness feature detection task can be any one of the following: depth information detection task, animation detection task, and specified action detection task.

[0059] Specifically, when an object reflects light back, the sensor calculates the distance to the object being photographed by measuring the time difference or phase difference between the emission and reflection of light, thus generating depth information. Therefore, when photographing a living object, depth information can be detected. However, when photographing malicious images, such as high-resolution printed paper or re-photographed pictures, the printed face lies on a single plane and does not possess the depth information of a real face. Therefore, depth information is one of the characteristics for detecting liveness.
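The time-difference and phase-difference measurements mentioned above correspond to the standard time-of-flight relations d = c·Δt / 2 and d = c·φ / (4π·f). The source does not give these formulas; the sketch below uses the textbook relations, and the function names and the 20 MHz modulation frequency in the example are assumptions.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def distance_from_time(delta_t_s):
    """Pulsed time of flight: light travels to the object and back."""
    return C * delta_t_s / 2.0

def distance_from_phase(phase_rad, mod_freq_hz):
    """Continuous-wave time of flight: phase shift of the modulated signal."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

# A 2 ns round trip corresponds to roughly 0.3 m.
print(round(distance_from_time(2e-9), 4))
# A phase shift of pi at an assumed 20 MHz modulation frequency.
print(round(distance_from_phase(math.pi, 20e6), 4))
```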

[0060] Animation detection refers to determining whether the subject's motion changes across a sequence of images.

[0061] Specified action detection refers to whether the subject being photographed can perform a specified action, such as blinking or shaking their head.

[0062] In this embodiment, the detection of whether a face image is a live face can be performed from any of the aforementioned dimensions, enriching the detection methods. Multiple feature combinations can be used to train the face liveness detection model, such as jointly training a face liveness detection model using RGB images and depth information, or jointly training a face liveness detection model using RGB images and motion detection, or jointly training a face liveness detection model using RGB images and specified action detection. Furthermore, RGB images can also be replaced with infrared images.

[0063] In another embodiment, the multi-task joint loss is obtained as a weighted sum of the losses of the tasks, each multiplied by its weight coefficient. In each round of iterative training, the weight coefficients are dynamically determined based on the loss of each task in that round of iterative training.

[0064] The loss for each task is calculated by combining the prediction results of each task with the labeled results of the input samples. It can be the cross-entropy or mean square error between the prediction results and the labeled results.
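The two per-task loss options named above can be written out as a short sketch. The function names are illustrative, and pairing cross-entropy with the classification branch and mean square error with a depth-map branch is an assumption about a typical setup rather than something the source fixes.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy between predicted class probabilities and integer labels."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    picked = probs[np.arange(len(labels)), labels]  # probability of the true class
    return float(-np.mean(np.log(picked + eps)))

def mean_squared_error(pred, target):
    """Mean squared error, e.g. between predicted and annotated depth maps."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

l_cls = cross_entropy([[0.9, 0.1], [0.3, 0.7]], [0, 1])
l_depth = mean_squared_error([[0.2, 0.4]], [[0.0, 0.4]])
print(l_cls, l_depth)
```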

[0065] In this embodiment, the weight coefficients of the loss of each task are not fixed. The weight coefficients of the loss of each task in each round of iterative training are dynamically determined based on the loss of each task in that round of iterative training. The sum of the weight coefficients of the loss of each task is 1.
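As a small illustration of the weighted combination with coefficients that sum to 1, consider the sketch below. The function name `joint_loss` and the tolerance check are assumptions; how the weights are chosen in each round is a separate question and is not shown here.

```python
def joint_loss(losses, weights, tol=1e-9):
    """Weighted sum of per-task losses; weights must sum to 1."""
    if len(losses) != len(weights):
        raise ValueError("one weight per task loss is required")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weight coefficients must sum to 1")
    return sum(w * l for w, l in zip(weights, losses))

# Two tasks: classification loss 0.8, liveness feature loss 0.2.
print(joint_loss([0.8, 0.2], [0.25, 0.75]))
```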

[0066] The constraint objective of multi-task joint loss is to optimize each branch of the network to train a more accurate face liveness detection model. Therefore, multi-objective optimization can dynamically determine the weight coefficients of each task's loss in each iteration, thus balancing the tasks. For example, multi-objective optimization can be implemented using uncertainty-based methods, or it can be determined based on the gradient ratio of the losses of each task.

[0067] In traditional multi-objective optimization, the weights for each task are usually fixed and assigned manually. For example, the weights for three tasks might be 0.2, 0.4, and 0.4, or for two tasks, the weights might be 0.5, assuming both tasks are of equal importance. This method, where weights are arbitrarily defined, is both time-consuming and cannot guarantee that the selected weights represent the optimal solution for the model.

[0068] In this embodiment, the weights of the tasks are allocated by learning from the loss of each task in each round of iterative training. Through learning from a large number of samples, the importance of the liveness feature detection task and the live face classification task is balanced and the real requirements are reflected, thereby improving the detection accuracy.

[0069] In another embodiment, the weight coefficients in each round of training iteration are determined based on the gradient of the loss of each task in that round of training iteration.

[0070] Specifically, in machine learning, evaluating whether an algorithm is good requires defining a loss function beforehand to determine if the algorithm is optimal. Subsequent optimization using gradient descent minimizes the loss function, thus achieving a meaningfully optimal solution. Therefore, there is a correlation between loss, gradient, and model accuracy.

[0071] In this embodiment, this correlation is used: the loss of each task is calculated first, the gradient of each task is calculated from its loss, and the weight coefficients are determined from the gradients. Specifically, the loss of each task is first calculated for the live face classification task and the live feature detection task, and the gradient is calculated based on the loss. After calculating the gradient of the loss of each task, the weight coefficient of the loss of the live feature detection task is calculated based on the gradient and the loss using the univariate convex quadratic programming theorem, and then the loss of the live feature detection task is further calculated. Take a face liveness detection model with two tasks as an example: one task is the live face classification task, and the other is the live feature detection task. After the two losses are obtained from the outputs of the two branches, the loss functions of the two tasks must be jointly optimized. Since the optimization objectives of the two tasks differ, the optimal ratio of weight coefficients between the two objectives must be found. As shown in Figure 4, consider the parameters of the shared layer: the optimization directions of the two loss functions in parameter space are shown on the two coordinate axes, and the optimal weight coefficients should lie on the boundary. Since this involves a trade-off between two tasks, according to the theorem of univariate convex quadratic programming:

[0072]

[0073] Here, ∇ denotes the gradient operation, L_depth denotes the loss of the liveness feature detection task, and L_cls denotes the loss of the live face classification task. The weight coefficient for the loss of the live face classification task can be obtained from the above formula. Geometrically, it is the ratio given by the opposite side and the perpendicular of the triangle formed by the gradient directions of the two tasks. Once the weighting ratio of the two task losses is obtained in each training round, it is used to calculate the final joint optimization loss function, as shown below:

[0074]
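The formulas themselves are not reproduced in this text. As a hedged illustration only, the sketch below uses a well-known closed-form solution for two tasks: choose α in [0, 1] to minimize the norm of α·g_depth + (1 − α)·g_cls, where g_depth and g_cls are the gradients of the two losses with respect to the shared parameters. This matches the geometric description (a triangle formed by the two gradient directions), but it is an assumption that the patent uses exactly this form, and which task receives α versus 1 − α is also assumed here.

```python
import numpy as np

def min_norm_weight(g_depth, g_cls):
    """Closed-form alpha minimizing ||alpha*g_depth + (1-alpha)*g_cls||^2 on [0, 1]."""
    g_depth = np.asarray(g_depth, dtype=float).ravel()
    g_cls = np.asarray(g_cls, dtype=float).ravel()
    diff = g_depth - g_cls
    denom = float(diff @ diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any split gives the same direction
    alpha = float((g_cls - g_depth) @ g_cls) / denom
    return min(1.0, max(0.0, alpha))  # clip to the valid range

def joint_loss(l_depth, l_cls, alpha):
    """Joint loss with weights alpha and (1 - alpha), which sum to 1."""
    return alpha * l_depth + (1.0 - alpha) * l_cls

# Orthogonal gradients of equal length give an even split.
alpha = min_norm_weight([1.0, 0.0], [0.0, 1.0])
print(alpha)  # 0.5
print(joint_loss(0.4, 0.8, alpha))
```

When one gradient is a scaled-up copy of the other, the clipped solution puts all weight on the task with the smaller gradient, which is the boundary case of the trade-off.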

[0075] In this embodiment, the target is optimized by using the joint loss of liveness features and classification tasks, which better balances the importance of different tasks and makes the two tasks complement and promote each other in the training of the face liveness detection model, thereby achieving higher liveness detection accuracy.

[0076] In another embodiment, the method for training the face liveness detection model, as shown in Figure 5, includes:

[0077] S502, Obtain the training sample set, which includes live face samples and attack samples, as well as the annotation results of each sample. The annotation results of each sample include at least the live face classification annotation results and live annotation features.

[0078] To train a face liveness detection model, a certain amount of training data is needed. The training sample set includes positive and negative samples. Live face photos are used as positive samples, and attack photos are used as negative samples. Live face samples are face images obtained by capturing live faces. Attack samples are images of non-live faces, including high-resolution images of faces on paper or taken from a screen.

[0079] For each training sample in the training sample set, face classification annotations and liveness detection features are provided. The face classification annotations can be categorized by sample type, including live face samples and attack samples. Liveness detection features include depth maps. These are obtained by capturing images of the sample using a camera with depth map acquisition capabilities, yielding both the face sample and its depth map.

[0080] The model input is a 224*224*3 image. To achieve this, all samples need to be aligned to this scale. Furthermore, since the focus is on face liveness detection, the input image requires face matting preprocessing. The specific steps are as follows:

[0081] 1) First, use a face detection tool to perform face detection on the real person and the original high-definition reproduced image;

[0082] 2) Then, based on the detection results, extract the face region from the original RGB image;

[0083] 3) Align the extracted face area to 256*256*3;

[0084] 4) Finally, a 224*224*3 input image is obtained using a random cropping method.
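The four preprocessing steps above can be sketched as follows. This is a minimal illustration only: the image is held as nested Python lists, the resize is nearest-neighbour, and `face_box` stands in for the output of a face detection tool that the text does not name. The function names and the box layout are hypothetical.

```python
import random

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an H x W x C image stored as nested lists."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def crop(img, top, left, size):
    """Cut a size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]

def preprocess_for_training(rgb, face_box, rng=random):
    """face_box = (top, left, height, width), assumed to come from step 1."""
    top, left, h, w = face_box
    # Step 2: extract the face region from the original RGB image.
    face = [row[left:left + w] for row in rgb[top:top + h]]
    # Step 3: align the face region to 256*256*3.
    aligned = resize_nearest(face, 256, 256)
    # Step 4: random crop to the 224*224*3 model input.
    off_r = rng.randint(0, 256 - 224)
    off_c = rng.randint(0, 256 - 224)
    return crop(aligned, off_r, off_c, 224)
```

A production pipeline would normally use an image library with proper interpolation; the nested-list version is used here only to keep the sketch dependency-free.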

[0085] S504: Input the samples from the training sample set into the face liveness detection model to be trained. After processing by the backbone network of the face liveness detection model to be trained, at least the live face classification prediction result and the liveness prediction feature are obtained through the task branches of the respective tasks. One of the task branches is the live face classification task branch based on face image features, and at least one of the other task branches is the live feature detection task branch based on liveness features.

[0086] Specifically, the structure of the face liveness detection model is shown in Figure 3. The network includes a backbone network and multiple task branch networks connected to the backbone network, each implementing a corresponding task. For example, the liveness detection task branch 303, based on liveness features, determines whether the subject is alive by extracting liveness features from the face image being tested. Its task is to extract liveness features, and the output varies depending on the type of liveness feature. If the liveness feature is depth information, the output is a depth map; if the liveness feature is an animation, the output is an animation detection classification; if the liveness feature is a specified action, the output is a classification indicating whether the specified action was performed.

[0087] Each task branch outputs prediction results. For example, the live face classification task branch outputs live face classification prediction results, and the live feature detection task branch outputs live prediction features.

[0088] S506: Calculate the loss for the live face classification task based on the live face classification annotation results and the live face classification prediction results; calculate the loss for the live feature detection task based on the live annotation features and the live prediction features.

[0089] The loss is calculated using a loss function, which measures the discrepancy between the model's predicted value f(x) and the true value Y. In neural network models, the backpropagation algorithm optimizes the parameter values of the network against a predefined loss function, thereby minimizing the loss of the neural network model on the training dataset. Commonly used loss functions include the mean squared error (MSE) loss, which measures the difference between the predicted and actual values, and the cross-entropy loss, which compares the distribution p of the true labels with the distribution q of the labels predicted by the trained model; the smaller the cross-entropy, the closer q is to p.
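For reference, the two loss functions named above are commonly written as follows. The index names $N$ (number of predicted values) and $k$ (class index) are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
L_{MSE} &= \frac{1}{N} \sum_{i=1}^{N} \bigl( f(x_i) - Y_i \bigr)^2 \\
H(p, q) &= - \sum_{k} p_k \log q_k
\end{align}
```

When p is a one-hot label distribution, $H(p, q)$ reduces to the negative logarithm of the probability that the model assigns to the true class.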

[0090] Specifically, in this embodiment, the loss of the live face classification task can be the difference or similarity between the live face classification annotation result and the live face classification prediction result, and the loss of the live feature detection task can be the difference or similarity between the live labeled features and the live predicted features.

[0091] S508: Calculate the multi-task joint loss based at least on the loss of the live face classification task and the loss of the live feature detection task, and adjust the parameters of the face liveness detection model to be trained based on the multi-task joint loss.

[0092] The live face classification task branch, based on face image features, identifies whether the face image represents a live person. The live feature detection task branch, based on liveness features, also identifies whether the face image represents a live person. Because the parameters of the face liveness detection model are trained under the constraint of the joint loss of the live face classification task and the live feature detection task, both face image features and liveness features influence the model. Compared with single-task methods that constrain the model through a single classification loss, this embodiment employs multi-task training in which the model is trained at least under the constraint of the joint loss of the live face classification task and the live feature detection task, so that the tasks are optimized jointly. The joint loss ensures that the two tasks constrain and promote each other during training. In other words, the live face classification task branch is influenced by the live feature detection task branch, so the output of the live face classification task branch takes liveness features into account. Compared with relying on image features of a single dimension, this enhances the model's defense against different attack types and thereby improves detection accuracy.

[0093] S510: When the training termination condition is met, a well-trained face liveness detection model is obtained.

[0094] In this embodiment, the model is obtained by constraint training based on multi-task joint loss, which includes live face classification task and live feature detection task. It can train the model together with live detection and classification tasks during the training process, so that the two constrain each other. As a result, the model can be trained from multiple dimensions, which improves the accuracy of the model.

[0095] In another embodiment, calculating the multi-task joint loss based at least on the loss of the live face classification task and the loss of the live feature detection task, and adjusting the parameters of the face liveness detection model to be trained based on the multi-task joint loss, includes: calculating the gradient of each loss based at least on the loss of the live face classification task and the loss of the live feature detection task; determining the weight coefficient of the loss of each task based on the gradients; calculating the multi-task joint loss based on the loss of each task and the weight coefficients; and adjusting the parameters of the face liveness detection model to be trained based on the multi-task joint loss.

[0096] Specifically, in machine learning, evaluating whether an algorithm is good requires defining a loss function beforehand to determine if the algorithm is optimal. Subsequent optimization using gradient descent minimizes the loss function, thus achieving a meaningfully optimal solution. Therefore, there is a correlation between loss, gradient, and model accuracy.

[0097] In this embodiment, this correlation is utilized to first calculate the loss of each task, then calculate the gradient of each task based on the loss, and finally determine the weight coefficients using the gradients. Specifically, the loss of each task is first calculated based on the live face classification task and the live feature detection task, and the gradient is calculated based on the loss. After calculating the gradient of the loss of each task, the weight coefficients of the loss of the live feature detection task are calculated based on the univariate convex quadratic programming theorem, according to the gradient and the loss. Finally, the loss of the live feature detection task is calculated.

[0098] Take a face liveness detection model with two tasks as an example: one is live face classification, and the other is liveness feature detection. After the two losses are obtained from the two branches, the loss functions of the two tasks must be jointly optimized. Because the optimization objectives of the two tasks differ, the question is how to find the optimal ratio of weight coefficients between the two objectives. As shown in Figure 4, consider the parameters of the shared layer. The optimization directions of the two loss functions in the parameter space are shown on the two coordinate axes, and the optimal weighting coefficients should be located on the boundary. Because this involves a trade-off between the two tasks, according to the univariate convex quadratic programming theorem:

[0099]

[0100] Here, $\nabla$ denotes the gradient calculation operation, $L_{depth}$ denotes the loss of the liveness feature detection task, and $L_{cls}$ denotes the loss of the live face classification task. The weight coefficient of the loss of the live face classification task is obtained from the above formula. In essence, it is the ratio of the opposite side to the perpendicular of the triangle formed by the gradient directions of the two tasks. It is obtained in each training round, and the resulting weighting ratio of the losses of the two tasks is used to calculate the final joint optimization loss function, as shown below:

[0101]

[0102] Based on the loss function of joint optimization, backpropagation is performed on the model to adjust the model parameters.
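The per-round weighting described in this embodiment can be sketched in a few lines. The sketch assumes the standard closed-form solution of the two-task minimum-norm problem, which is one way to realize a univariate convex quadratic program over two gradients; the patent's own formulas are not reproduced in this text, so the function names and the exact form are assumptions. Gradients are given as flat lists of the shared-layer gradient components.

```python
def min_norm_weight(g_cls, g_depth):
    """Return alpha in [0, 1] minimising ||alpha*g_cls + (1-alpha)*g_depth||^2.

    alpha is the (assumed) weight of the classification loss; 1 - alpha is
    the weight of the liveness feature (depth) loss.
    """
    diff = [a - b for a, b in zip(g_cls, g_depth)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        # Identical gradients: every alpha gives the same norm.
        return 0.5
    numer = sum((b - a) * b for a, b in zip(g_cls, g_depth))
    # Clip the unconstrained minimiser to the feasible interval.
    return min(1.0, max(0.0, numer / denom))

def joint_loss(l_cls, l_depth, alpha):
    """Weighted sum of the two task losses for one training round."""
    return alpha * l_cls + (1.0 - alpha) * l_depth

# Toy round: orthogonal gradients of equal length give equal weights.
alpha = min_norm_weight([1.0, 0.0], [0.0, 1.0])
loss = joint_loss(0.8, 0.2, alpha)
```

In a real training loop the weight would be recomputed from the current gradients in every round, and the resulting joint loss would be backpropagated through the whole model.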

[0103] In this embodiment, the target is optimized by using the joint loss of liveness features and classification tasks, which better balances the importance of different tasks and makes the two tasks complement and promote each other in the training of the face liveness detection model, thereby achieving higher liveness detection accuracy.

[0104] In another embodiment, the liveness annotation feature includes an annotated depth map; the liveness prediction feature includes a predicted depth map; and the liveness feature detection task is a depth information detection task based on depth information.

[0105] Based on the live face classification annotation results and the live face classification prediction results, the loss of the live face classification task is calculated, including: calculating the cross-entropy loss based on the live face classification annotation results and the live face classification prediction results to obtain the loss of the live face classification task.

[0106] Based on the liveness annotation features and liveness prediction features, the loss of the liveness feature detection task is calculated, including: calculating the mean squared error loss based on the predicted depth map and the annotation depth map to obtain the loss of the liveness feature detection task.

[0107] Specifically, when an object reflects light back, the sensor calculates the distance to the object being photographed by measuring the time difference or phase difference between the emission and reflection of light, thus generating depth information. Therefore, when photographing a living object, depth information can be detected. However, when photographing malicious images, such as high-resolution pieces of paper or copied photographs, the high-resolution paper and copied photographs are on the same plane and do not possess depth information. Therefore, depth information is one of the characteristics for detecting liveness.

[0108] In this embodiment, a face liveness detection model is jointly trained using face image features and depth information features.

[0109] Specifically, as shown in Figure 6, a dual-branch network framework for the face liveness detection model is designed, using ResNet18 as the backbone. Besides ResNet18, other network models can also be selected as the backbone. To ensure the timeliness of forward inference, a network model with fewer parameters can also be found using methods such as NAS to serve as the backbone.

[0110] Meanwhile, its relatively shallow network layers also ensure the timeliness of forward inference. The dual-branch structure is split at layer 10 of ResNet18, and then the framework from layer 11 to the last layer is copied to build dual branches to handle different tasks.

[0111] After the network structure is built, the samples (cropped images of size 224*224*3) are input into the network. Through forward propagation, the two branches of the network produce two outputs: the multi-class classification output (logits_p in the figure, a vector of size 1*c, where c is the total number of classes) and the depth map (DM_p in the figure, a 24*24 matrix).

[0112] The classification loss and the depth regression loss are calculated separately. For the former, the traditional multi-class cross-entropy loss function is used. The output logits_p is passed through a softmax layer to obtain a probability vector of size 1*c. The position of the maximum value of this vector is the class label predicted by the network. The cross-entropy between this vector and the true class label gives the classification loss. Minimizing this loss constrains the output probability vector of the classification branch so that the class label at the position of its maximum value is consistent with the true class label. For depth information, the mean squared error (MSE) loss function is used to calculate the depth regression loss. As described above, the depth regression branch outputs a 24*24 real-valued matrix, and the annotated depth map (ground truth) is a matrix of the same size. Therefore, the MSE between the two directly measures the depth regression loss. Minimizing this loss constrains the output of the depth regression branch so that the error between each element of DM_p and the corresponding element of the ground-truth matrix becomes smaller and smaller.
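The two branch losses described above can be illustrated with a small dependency-free sketch. The helper names, the two-class example and the constant depth maps are hypothetical values chosen for illustration; they are not data from the patent.

```python
import math

def softmax(logits):
    """Convert raw classification outputs into a probability vector."""
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_label):
    """Multi-class cross-entropy against a one-hot true label."""
    return -math.log(softmax(logits)[true_label])

def depth_mse(pred, target):
    """Mean squared error between two depth maps of equal size."""
    count = len(pred) * len(pred[0])
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / count

# Classification branch: c = 2 classes, true label 0 (illustrative).
logits_p = [2.0, 0.5]
probs = softmax(logits_p)
predicted_label = probs.index(max(probs))
l_cls = cross_entropy(logits_p, true_label=0)

# Depth branch: 24*24 predicted map versus 24*24 annotated map.
dm_p = [[0.5] * 24 for _ in range(24)]
dm_gt = [[0.25] * 24 for _ in range(24)]
l_depth = depth_mse(dm_p, dm_gt)
```

The two scalars `l_cls` and `l_depth` are the quantities that the joint optimization then combines.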

[0113] Finally, after the two losses are obtained from the two branches, the loss functions of the two tasks must be jointly optimized. Because the optimization objectives of the two tasks differ, the question is how to find the optimal ratio of weight coefficients between the two objectives. As shown in Figure 6, consider the parameters of the shared layer. The optimization directions of the two loss functions in the parameter space are shown on the two coordinate axes, and the optimal weighting coefficients should be located on the boundary. Because optimizing the objective involves a trade-off between the two tasks, according to the univariate convex quadratic programming theorem:

[0114]

[0115] Here, $\nabla$ denotes the gradient calculation operation. As can be seen from the above formula, the weight coefficient is essentially the ratio of the opposite side to the perpendicular of the triangle formed by the gradient directions of the two tasks. It is obtained in each training round, and the resulting weighting ratio of the losses of the two tasks is used to calculate the final joint optimization loss function, as shown below:

[0116]

[0117] After the dual-branch model is trained, the face liveness detection model is tested. Suppose an online application sends a face image of unknown type (that is, it is not known whether the image shows a real person or a high-resolution recapture attack). Using the same method as in training, a cropped face image (224*224*3, centered on the face region) is obtained and input into the model. Although the model outputs both a classification prediction and a depth map regression result, only the prediction of the classification branch is used here. The class label at the position of the maximum value of the probability vector is the model's final prediction of real person or attack.
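The test-time decision described above can be sketched as follows. Two points are assumptions for illustration: that "centered on the face region" means a center crop of the 256*256 aligned face, and that the class order is ("live", "attack"). The function names are hypothetical.

```python
def center_crop_offsets(size=256, crop=224):
    """Top and left offsets of a crop window centered in a size x size image."""
    off = (size - crop) // 2
    return off, off

def liveness_decision(class_probs, class_names=("live", "attack")):
    """Pick the class whose probability is the largest."""
    idx = max(range(len(class_probs)), key=class_probs.__getitem__)
    return class_names[idx]

def predict(model_outputs):
    """model_outputs = (probability vector, depth map) from the two branches."""
    class_probs, _depth_map = model_outputs   # the depth map is not used
    return liveness_decision(class_probs)
```

For example, `predict(([0.1, 0.9], None))` yields `"attack"` under the assumed class order, regardless of what the depth branch produced.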

[0118] This application proposes a multi-task, multi-objective optimization method and applies it to the joint optimization of classification and depth regression tasks in face liveness detection. This method better balances the importance of different tasks, allowing them to complement and promote each other in the model training process for face liveness detection, thereby achieving higher liveness detection accuracy. Simultaneously, the dual-branch multi-task structure eliminates the need for additional images from different sensors as input, enhances the model's defense capabilities against different attack types, significantly reduces camera costs, and improves user experience due to its shorter processing time.

[0119] It should be understood that, although the steps in the flowcharts of Figure 2 and Figure 5 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figure 2 and Figure 5 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

[0120] In one embodiment, as shown in Figure 7, a face liveness detection device is provided. This device can be a software module, a hardware module, or a combination of both as part of a computer device. Specifically, the device includes:

[0121] The image acquisition module 602 is used to acquire the face image to be tested;

[0122] The detection module 604 is used to detect the face image to be tested using a pre-trained face liveness detection model. The face liveness detection model has task branches connected after the backbone network of the face liveness detection model. One of the task branches is a liveness face classification task branch based on face image features, and at least one of the other task branches is a liveness feature detection task branch based on liveness features. The face liveness detection model is pre-trained based on constraints of multi-task joint loss, which includes at least the joint loss of the liveness face classification task and the liveness feature detection task.

[0123] The detection output module 606 is used to obtain the live face detection result based on the output of the live face classification task branch.

[0124] The aforementioned face liveness detection device uses a trained face liveness detection model to predict the face image to be tested. Since the face liveness detection model is a multi-task model, it is trained based on the constraints of the joint loss of the liveness classification task and the liveness feature detection task. During the training process, the liveness detection and classification tasks are trained together, so that the two are mutually constrained. Thus, the liveness classification task branch is affected by the liveness feature detection task branch, so that the output of the liveness classification task branch takes into account the liveness features. Compared with single-dimensional image features, this improves the model's defense against different attack types and thus improves the detection accuracy.

[0125] In one embodiment, the joint loss is obtained as a weighted sum of the losses of the tasks, each multiplied by its weight coefficient. In each round of iterative training, the weight coefficients are dynamically determined based on the loss of each task in that round of iterative training.

[0126] In one embodiment, the weight coefficients in each round of iterative training are determined based on the gradient of the loss of each task in that round of iterative training.

[0127] In one embodiment, the liveness feature detection task is any one of the following: a depth information detection task, an animation detection task, and a specified action detection task.

[0128] In one embodiment, the image acquisition module is used to acquire a training sample set, which includes live face samples and attack samples, as well as the annotation results of each sample. The annotation results of each sample include at least the live face classification annotation results and live annotation features.

[0129] The detection module is also used to input the samples in the training sample set into the face liveness detection model to be trained. After processing by the backbone network of the face liveness detection model to be trained, at least the live face classification prediction result and the liveness prediction feature are obtained through the task branches of the respective tasks. One of the task branches is the live face classification task branch based on face image features, and at least one of the other task branches is the live feature detection task branch based on liveness features.

[0130] The device further includes:

[0131] The loss calculation module is used to calculate the loss of the live face classification task based on the live face classification annotation results and the live face classification prediction results, and to calculate the loss of the live feature detection task based on the live annotation features and the live prediction features.

[0132] The adjustment module is used to calculate the joint loss based on at least the loss of the live face classification task and the loss of the live feature detection task, and to adjust the parameters of the face liveness detection model to be trained based on the joint loss.

[0133] The training module is used to obtain a trained face liveness detection model when the training termination condition is met.

[0134] In another embodiment, the adjustment module is used to calculate the gradient of each loss based at least on the loss of the live face classification task and the loss of the live feature detection task; determine the weight coefficient of the loss of each task based on the gradients; calculate the multi-task joint loss based on the loss of each task and the weight coefficients; and adjust the parameters of the face liveness detection model to be trained based on the multi-task joint loss.

[0135] In another embodiment, the liveness feature includes depth information, the liveness annotation feature includes an annotated depth map, and the liveness prediction feature includes a predicted depth map; the liveness feature detection task is a depth information detection task based on depth information.

[0136] The loss calculation module is used to calculate the cross-entropy loss based on the live face classification annotation results and the live face classification prediction results, thus obtaining the loss for the live face classification task; and to calculate the mean squared error loss based on the predicted depth map and the labeled depth map, thus obtaining the loss for the live feature detection task.

[0137] Specific limitations regarding the face liveness detection device can be found in the limitations of the face liveness detection method described above, and will not be repeated here. Each module in the aforementioned face liveness detection device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0138] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 8. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database stores face images to be tested and a training sample set. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a face liveness detection method.

[0139] Those skilled in the art will understand that the structure shown in Figure 8 is merely a block diagram of the part of the structure related to the present application and does not limit the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.

[0140] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0141] In one embodiment, a computer-readable storage medium is provided storing a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0142] In one embodiment, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, causing the computer device to perform the steps in the above method embodiments.

[0143] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.

[0144] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0145] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A method for face liveness detection, the method comprising: acquiring a face image to be tested, wherein no feature information other than the face image to be tested is required; detecting the face image to be tested using a trained face liveness detection model, wherein task branches are connected after the backbone network of the face liveness detection model, one of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a live feature detection task branch based on liveness features, and wherein the face liveness detection model is trained in advance under the constraint of a multi-task joint loss, which includes at least the joint loss of the live face classification task and the live feature detection task; and obtaining a live face detection result based on the output of the live face classification task branch; wherein the face liveness detection model is trained using a training sample set that includes live face samples, attack samples, and annotation results for each sample; the annotation results for each sample include at least a live face classification annotation result and liveness annotation features; the live face samples are face images captured from live faces, and the attack samples are images of non-live faces; during training, samples from the training sample set are input into the face liveness detection model to be trained, the live face classification task branch outputs a live face classification prediction result, and the live feature detection task branch outputs liveness prediction features; the loss of the live face classification task is calculated from the live face classification annotation result and the live face classification prediction result; the loss of the live feature detection task is calculated from the liveness annotation features and the liveness prediction features; the multi-task joint loss is calculated from the loss of the live face classification task and the loss of the live feature detection task; and the parameters of the face liveness detection model to be trained are adjusted using the multi-task joint loss.

2. The method according to claim 1, characterized in that the multi-task joint loss is obtained as a weighted sum of the losses of the tasks, each multiplied by its weight coefficient, and in each round of iterative training the weight coefficients are dynamically determined based on the loss of each task in that round of iterative training.

3. The method according to claim 2, characterized in that, in each round of iterative training, the weight coefficients are determined based on the gradient of the loss of each task in that round of iterative training.

4. The method according to claim 1, characterized in that the live feature detection task is any one of the following: a depth information detection task, an animation detection task, and a specified action detection task.

5. The method according to any one of claims 1 to 4, characterized in that training the face liveness detection model includes:
obtaining a training sample set, which includes live face samples and attack samples as well as annotation results for each sample, the annotation results for each sample including at least a live face classification annotation result and a liveness annotation feature;
inputting the samples in the training sample set into the face liveness detection model to be trained and, after processing by the backbone network of the face liveness detection model to be trained, obtaining at least a live face classification prediction result and a liveness prediction feature through the task branches of the respective tasks, wherein one of the task branches is the live face classification task branch based on face image features, and at least one of the other task branches is the liveness feature detection task branch based on liveness features;
calculating the loss of the live face classification task based on the live face classification annotation result and the live face classification prediction result, and calculating the loss of the liveness feature detection task based on the liveness annotation feature and the liveness prediction feature;
calculating the multi-task joint loss based at least on the loss of the live face classification task and the loss of the liveness feature detection task, and adjusting the parameters of the face liveness detection model to be trained based on the multi-task joint loss; and
obtaining the trained face liveness detection model when a training termination condition is met.
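The training steps of claim 5 can be sketched end to end on a deliberately tiny model. Everything concrete here is an assumption for illustration: the scalar "image" input, the one-parameter backbone, the sigmoid classification branch with binary cross-entropy, the linear feature branch with squared error, fixed task weights, the learning rate, and the termination condition (loss below `tol` or `max_epochs` reached).

```python
import math

# Toy training set: (image feature x, live label y, annotated liveness feature t).
SAMPLES = [(1.0, 1, 1.0),   # live face sample
           (-1.0, 0, 0.0)]  # attack sample


def forward(params, x):
    h = params["w"] * x                            # shared backbone feature
    p = 1.0 / (1.0 + math.exp(-params["a"] * h))   # live face classification branch
    d = params["b"] * h + params["c"]              # liveness feature detection branch
    return h, p, d


def train(lr=0.1, max_epochs=500, tol=0.05, w_cls=1.0, w_feat=1.0):
    params = {"w": 0.5, "a": 0.5, "b": 0.5, "c": 0.0}
    history = []
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for x, y, t in SAMPLES:
            h, p, d = forward(params, x)
            p = min(max(p, 1e-12), 1.0 - 1e-12)    # keep log() finite
            loss_cls = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            loss_feat = (d - t) ** 2
            epoch_loss += w_cls * loss_cls + w_feat * loss_feat  # joint loss
            # Analytic gradients of the joint loss.
            g_z = w_cls * (p - y)                  # w.r.t. classification logit
            g_d = w_feat * 2.0 * (d - t)           # w.r.t. feature branch output
            g_h = g_z * params["a"] + g_d * params["b"]  # both branches reach the backbone
            grads = {"a": g_z * h, "b": g_d * h, "c": g_d, "w": g_h * x}
            for name in params:
                params[name] -= lr * grads[name]   # adjust model parameters
        history.append(epoch_loss / len(SAMPLES))
        if history[-1] < tol:                      # training termination condition
            break
    return params, history


trained_params, loss_history = train()
```

The line computing `g_h` is where the feature detection branch influences the shared backbone, and through it the classification branch, which is the effect the joint loss is meant to achieve.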

6. The method according to claim 5, characterized in that calculating the multi-task joint loss based at least on the loss of the live face classification task and the loss of the liveness feature detection task, and adjusting the parameters of the face liveness detection model to be trained based on the multi-task joint loss, includes:
calculating the gradient of the loss based at least on the loss of the live face classification task and the loss of the liveness feature detection task;
determining the weight coefficient of the loss of each task based on the gradient of the loss, and calculating the multi-task joint loss based on the loss of each task and the weight coefficients; and
adjusting the parameters of the face liveness detection model to be trained based on the multi-task joint loss.

7. The method according to claim 5, characterized in that the liveness feature includes depth information, the liveness annotation feature includes an annotated depth map, the liveness prediction feature includes a predicted depth map, and the liveness feature detection task is a depth information detection task based on depth information;
calculating the loss of the live face classification task based on the live face classification annotation result and the live face classification prediction result includes: calculating a cross-entropy loss based on the live face classification annotation result and the live face classification prediction result to obtain the loss of the live face classification task; and
calculating the loss of the liveness feature detection task based on the liveness annotation feature and the liveness prediction feature includes: calculating a mean squared error loss based on the predicted depth map and the annotated depth map to obtain the loss of the liveness feature detection task.
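The two losses named in claim 7 can be written out directly. In this sketch the depth maps are plain nested lists, the classification prediction is a list of class probabilities, and the function names and the clamping constant are assumptions made for the example.

```python
import math
from typing import Sequence


def cross_entropy_loss(class_probs: Sequence[float], label: int) -> float:
    """Cross-entropy between predicted class probabilities and the annotated class."""
    return -math.log(max(class_probs[label], 1e-12))  # clamp to keep log() finite


def depth_mse_loss(predicted: Sequence[Sequence[float]],
                   annotated: Sequence[Sequence[float]]) -> float:
    """Mean squared error over all pixels of the predicted and annotated depth maps."""
    squared = [(p - a) ** 2
               for pred_row, ann_row in zip(predicted, annotated)
               for p, a in zip(pred_row, ann_row)]
    return sum(squared) / len(squared)


# Classes: index 0 = attack, index 1 = live. The sample is annotated as live.
cls_loss = cross_entropy_loss([0.2, 0.8], label=1)

# A flat (all-zero) annotated depth map, as might be used for an attack sample,
# compared with a prediction that wrongly reports depth at two pixels.
depth_loss = depth_mse_loss([[0.0, 1.0], [1.0, 0.0]],
                            [[0.0, 0.0], [0.0, 0.0]])
```

These two values are the per-task losses that would then be combined into the multi-task joint loss.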

8. A face liveness detection device, characterized in that the device includes:
an image acquisition module, used to acquire a face image to be tested, wherein no feature information of other dimensions needs to be input apart from the face image to be tested;
a detection module, used to detect the face image to be tested using a trained face liveness detection model, wherein task branches are connected to the backbone network of the face liveness detection model, one of the task branches is a live face classification task branch based on face image features, and at least one of the other task branches is a liveness feature detection task branch based on liveness features, and wherein the face liveness detection model is trained in advance under the constraint of a multi-task joint loss that includes at least the joint loss of the live face classification task and the liveness feature detection task; and
a detection output module, used to obtain a live face detection result based on the output of the live face classification task branch;
wherein the face liveness detection model is trained using a training sample set that includes live face samples, attack samples, and annotation results for each sample; the annotation results for each sample include at least a live face classification annotation result and a liveness annotation feature; the live face samples are face images captured from live faces, and the attack samples are images of non-live faces; and
wherein, during training, samples from the training sample set are input into the face liveness detection model to be trained; the live face classification task branch outputs a live face classification prediction result, and the liveness feature detection task branch outputs a liveness prediction feature; the loss of the live face classification task is calculated from the live face classification annotation result and the live face classification prediction result; the loss of the liveness feature detection task is calculated from the liveness annotation feature and the liveness prediction feature; the multi-task joint loss is calculated from the loss of the live face classification task and the loss of the liveness feature detection task; and the parameters of the face liveness detection model to be trained are adjusted using the multi-task joint loss.

9. The face liveness detection device according to claim 8, characterized in that the multi-task joint loss is obtained as a weighted combination of the losses of the tasks, the loss of each task being weighted by its weight coefficient; and in each round of iterative training, the weight coefficients are dynamically determined based on the loss of each task in that round of iterative training.

10. The face liveness detection device according to claim 9, characterized in that, in each round of iterative training, the weight coefficients are determined based on the gradient of the loss of each task in that round of iterative training.

11. The face liveness detection device according to claim 8, characterized in that the liveness feature detection task is any one of the following: a depth information detection task, an animation detection task, or a specified action detection task.

12. The face liveness detection device according to any one of claims 8 to 11, characterized in that:
the image acquisition module is also used to acquire a training sample set, which includes live face samples and attack samples as well as annotation results for each sample, the annotation results for each sample including at least a live face classification annotation result and a liveness annotation feature;
the detection module is also used to input the samples in the training sample set into the face liveness detection model to be trained and, after processing by the backbone network of the face liveness detection model to be trained, to obtain at least a live face classification prediction result and a liveness prediction feature through the task branches of the respective tasks, wherein one of the task branches is the live face classification task branch based on face image features, and at least one of the other task branches is the liveness feature detection task branch based on liveness features; and
the device further includes:
a loss calculation module, used to calculate the loss of the live face classification task based on the live face classification annotation result and the live face classification prediction result, and to calculate the loss of the liveness feature detection task based on the liveness annotation feature and the liveness prediction feature;
an adjustment module, used to calculate the multi-task joint loss based at least on the loss of the live face classification task and the loss of the liveness feature detection task, and to adjust the parameters of the face liveness detection model to be trained based on the multi-task joint loss; and
a training module, used to obtain the trained face liveness detection model when a training termination condition is met.

13. The face liveness detection device according to claim 12, characterized in that the adjustment module is further configured to: calculate the gradient of the loss based at least on the loss of the live face classification task and the loss of the liveness feature detection task; determine the weight coefficient of the loss of each task based on the gradient of the loss; calculate the multi-task joint loss based on the loss of each task and the weight coefficients; and adjust the parameters of the face liveness detection model to be trained based on the multi-task joint loss.

14. The face liveness detection device according to claim 12, characterized in that the liveness feature includes depth information, the liveness annotation feature includes an annotated depth map, the liveness prediction feature includes a predicted depth map, and the liveness feature detection task is a depth information detection task based on depth information; and the loss calculation module is further configured to calculate a cross-entropy loss based on the live face classification annotation result and the live face classification prediction result to obtain the loss of the live face classification task, and to calculate a mean squared error loss based on the predicted depth map and the annotated depth map to obtain the loss of the liveness feature detection task.

15. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 7.

16. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.

17. A computer program product comprising computer instructions, characterized in that the computer instructions are stored in a computer-readable storage medium, and a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the steps of the method according to any one of claims 1 to 7.