Living body detection method, device, system and medium based on fusion features
By combining mouth region images and feature point images in face recognition, the reliability and robustness issues of face liveness detection in existing technologies are solved, and the accuracy and stability of lip recognition are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN BANK CO LTD
- Filing Date
- 2023-06-09
- Publication Date
- 2026-06-12
AI Technical Summary
In existing technologies, lip recognition methods based on facial recognition have low reliability and robustness in face liveness detection due to differences in the color and texture of the user's mouth and the influence of external shooting factors.
By collecting video of the face to be identified while the user reads the prompt text, the image frames to be identified are extracted, and the mouth region image and mouth feature point image are obtained by using facial key points and grid structure information. Combined with the trained lip shape recognition model, the lip shape recognition result is obtained and matched with the prompt text to determine the face liveness detection result.
It effectively avoids the influence of mouth color and texture or external factors on lip shape recognition, and improves the accuracy and robustness of face liveness detection.
Figure CN116665317B_ABST
Description
Technical Field
[0001] This invention relates to the field of financial technology, and in particular to a liveness detection method, apparatus, system and medium based on fusion features.
Background Technology
[0002] With the development of mobile internet, smartphones have changed people's work and lifestyles, allowing them to enjoy various convenient services without leaving home. The banking industry is no exception; online services enable users to conduct banking inquiries, transfers, and wealth management anytime, anywhere. During these transactions, identity verification is a necessary process for user account security, and facial recognition is one of the commonly used methods.
[0003] Facial recognition works by recording video with a camera to collect facial information and comparing it with a pre-registered photo to verify identity. However, with advancements in attack techniques, conventional facial recognition methods, such as silent liveness detection and random motion liveness detection, are becoming increasingly insecure. Therefore, lip-reading-based liveness detection is receiving increasing attention due to its security implications.
[0004] Current solutions only utilize facial images for lip-reading recognition. However, due to variations in the color and texture of different users' mouths, as well as the influence of external shooting factors, the resulting video sequences can differ significantly even when different users are speaking the same text or the same user is speaking the same text in different environments. This can negatively impact the performance and robustness of lip-reading recognition, reducing the reliability of liveness detection.
Summary of the Invention
[0005] In view of the shortcomings of the prior art, the purpose of this invention is to provide a liveness detection method, device, system and medium based on fusion features that can be applied to financial technology or other related fields, aiming to improve the reliability of face liveness detection based on lip recognition.
[0006] The technical solution of the present invention is as follows:
[0007] A liveness detection method based on fused features, comprising:
[0008] Collect a video of the face to be identified while the user reads the prompt text, and extract the image frames to be identified from the video of the face to be identified;
[0009] Face detection is performed on the image frame to be identified to obtain facial key points and grid structure information;
[0010] Based on the facial key points and grid structure information, obtain the mouth region image and mouth feature point image of the image frame to be identified;
[0011] The mouth region image and the mouth feature point image are input into the trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result;
[0012] The face liveness detection result is obtained by matching the lip shape recognition result with the prompt text.
[0013] In one embodiment, the step of acquiring the video of the face to be identified from the user reading the prompt text, and extracting the image frames to be identified from the video of the face to be identified, includes:
[0014] Collect video of the face to be recognized as the user reads the prompt text aloud;
[0015] Extract the audio data from the video of the face to be identified;
[0016] Determine the spoken portion of the video of the face to be identified based on the audio data;
[0017] Extract the image frames of the spoken portion as the image frames to be identified.
[0018] In one embodiment, determining the spoken portion of the face video to be identified based on the audio data includes:
[0019] Determine the start and end times when the volume in the audio data exceeds a preset threshold;
[0020] The video of the face to be identified is segmented according to the start and end times to obtain the spoken portion.
[0021] In one embodiment, obtaining the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and mesh structure information includes:
[0022] Calculate the transformation matrix based on the position of the facial key points in the image frame to be identified;
[0023] After performing an affine transformation on the image frame to be identified based on the transformation matrix, the mouth region image is extracted.
[0024] Based on the grid structure information in the image frame to be identified and the preset average frontal face grid information, face grid alignment is performed on the image frame to be identified.
[0025] The mouth feature point image is constructed and output based on the grid points of the aligned image frame to be identified.
[0026] In one embodiment, the step of inputting the mouth region image and the mouth feature point image into a trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result includes:
[0027] Feature extraction is performed on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0028] The first feature and the second feature are spliced and fused to obtain the fused feature;
[0029] After classifying the lip shape based on the fusion features, the lip shape recognition result is output.
[0030] In one embodiment, the step of inputting the mouth region image and the mouth feature point image into a trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result includes:
[0031] Feature extraction is performed on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0032] After classifying mouth shapes based on the first feature and the second feature respectively, a first classification score and a second classification score are obtained;
[0033] After fusing the first classification score and the second classification score, the lip shape type with the highest score is determined, and the lip shape recognition result is output.
[0034] In one embodiment, the step of matching the lip shape recognition result with the prompt text to obtain the face liveness detection result includes:
[0035] The lip shape type in the lip shape recognition result is compared with the standard lip shape type corresponding to the prompt text to determine whether they are consistent;
[0036] If they are consistent, a face liveness detection result indicating that detection passed is displayed; otherwise, a face liveness detection result indicating that detection failed is displayed.
[0037] A liveness detection device based on fusion features, comprising:
[0038] The acquisition and extraction module is used to acquire the video of the face to be identified when the user reads the prompt text, and to extract the image frames to be identified from the video of the face to be identified;
[0039] The face detection module is used to perform face detection on the image frame to be identified and obtain facial key points and grid structure information;
[0040] The image extraction module is used to obtain the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information.
[0041] The lip shape recognition module is used to input the mouth region image and the mouth feature point image into the trained lip shape recognition model for fusion feature recognition to obtain the lip shape recognition result;
[0042] The detection output module is used to match the lip shape recognition result with the prompt text to obtain the face liveness detection result.
[0043] A liveness detection system based on fused features, the system comprising at least one processor; and,
[0044] A memory communicatively connected to the at least one processor; wherein,
[0045] The memory stores instructions that can be executed by the at least one processor, which enables the at least one processor to perform the above-described liveness detection method based on fused features.
[0046] A non-volatile computer-readable storage medium storing computer-executable instructions, which, when executed by one or more processors, cause the one or more processors to perform the above-described liveness detection method based on fused features.
[0047] Beneficial effects: This invention discloses a liveness detection method, device, system, and medium based on fusion features. Compared with the prior art, the embodiments of this invention obtain lip shape recognition results by fusing features from mouth region images and mouth feature point images, effectively avoiding the influence of mouth color and texture or external factors on lip shape recognition, and enhancing the accuracy and robustness of face liveness detection based on lip shape recognition.
Attached Figure Description
[0048] The present invention will be further described below with reference to the accompanying drawings and embodiments. In the accompanying drawings:
[0049] Figure 1 is a flowchart of a liveness detection method based on fused features provided in an embodiment of the present invention;
[0050] Figure 2 is a flowchart of step S100 in the liveness detection method based on fusion features provided in an embodiment of the present invention;
[0051] Figure 3 is a flowchart of step S103 in the liveness detection method based on fusion features provided in an embodiment of the present invention;
[0052] Figure 4 is a flowchart of step S300 in the liveness detection method based on fusion features provided in an embodiment of the present invention;
[0053] Figure 5 is a flowchart of step S400 in the liveness detection method based on fused features provided in an embodiment of the present invention;
[0054] Figure 6 is another flowchart of step S400 in the liveness detection method based on fusion features provided in an embodiment of the present invention;
[0055] Figure 7 is a schematic diagram of the functional modules of the liveness detection device based on fusion features provided in an embodiment of the present invention;
[0056] Figure 8 is a schematic diagram of the hardware structure of a liveness detection system based on fusion features provided in an embodiment of the present invention.
Detailed Implementation
[0057] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention is further described in detail below. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The embodiments of the invention are described below in conjunction with the accompanying drawings.
[0058] With the development of mobile internet, smartphones have changed people's work and lifestyles, allowing them to enjoy various convenient services without leaving home. The banking industry is no exception; online services enable users to conduct banking inquiries, transfers, and wealth management anytime, anywhere. During these transactions, identity verification is a necessary process for user account security, and facial recognition is one of the commonly used methods.
[0059] Facial recognition works by recording video with a camera to collect facial information and comparing it with a pre-registered photo to verify identity. However, with advancements in attack techniques, conventional facial recognition methods, such as silent liveness detection and random motion liveness detection, are becoming increasingly insecure. Therefore, lip-reading-based liveness detection is receiving increasing attention due to its security implications.
[0060] Current solutions only utilize facial images for lip-reading recognition. However, due to variations in the color and texture of different users' mouths, as well as the influence of external shooting factors, the resulting video sequences can differ significantly even when different users are speaking the same text or the same user is speaking the same text in different environments. This can negatively impact the performance and robustness of lip-reading recognition, reducing the reliability of liveness detection.
[0061] To address the aforementioned problems, this invention proposes a liveness detection method based on fused features. Please refer to Figure 1, which is a flowchart of an embodiment of the liveness detection method based on fusion features provided by the present invention. The method provided in this embodiment is applied to a system comprising a terminal device, a network, and a server. The network is the medium that provides a communication link between the terminal device and the server, and it can include various connection types, such as wired links, wireless communication links, or fiber optic cables. The operating system on the terminal device can be iOS (the iPhone operating system), Android, or another operating system. The terminal device connects to the server through the network to interact with it, for example to receive or send data. Specifically, the terminal device can be any of various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablets, portable computers, and desktop servers. As shown in Figure 1, the method specifically includes the following steps:
[0062] S100. Collect the video of the face to be identified as the user reads the prompt text, and extract the image frames to be identified from the video of the face to be identified;
[0063] S200. Perform face detection on the image frame to be identified, and obtain facial key points and grid structure information;
[0064] S300. Obtain the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information;
[0065] S400. Input the mouth region image and mouth feature point image into the trained lip shape recognition model for fusion feature recognition to obtain the lip shape recognition result;
[0066] S500. Match the lip shape recognition result with the prompt text to obtain the face liveness detection result.
[0067] In this embodiment, when a user triggers the identity authentication function while using a bank's app, such as during login, payment, or fund transfer, facial liveness detection is required to ensure asset security. At this time, a prompt text can be displayed on the front-end interface, prompting the user to read it aloud. This prompt text is randomly generated from a preset dictionary and can include text or numbers to ensure the reliability of lip-reading-based liveness detection.
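The random prompt generation described above can be sketched as follows. This is a minimal illustration only: the source does not disclose the contents of the preset dictionary or the prompt length, so `PRESET_DICTIONARY`, `generate_prompt`, and the default length are hypothetical.

```python
import secrets

# Hypothetical preset dictionary; the source does not list its contents.
PRESET_DICTIONARY = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]


def generate_prompt(length: int = 4) -> str:
    """Draw `length` entries at random from the preset dictionary."""
    if length <= 0:
        raise ValueError("length must be positive")
    # secrets.choice is used so the prompt is hard for an attacker to predict.
    return "".join(secrets.choice(PRESET_DICTIONARY) for _ in range(length))
```

Drawing a fresh prompt for every authentication attempt is what makes a pre-recorded video unlikely to match the expected lip shape sequence.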
[0068] When the prompt text is displayed, the terminal's camera is activated to capture the audio and video of the user reading the prompt text. After the capture is completed, a video of the face to be recognized is obtained. The image frames of the portion in which the user is speaking are extracted from this video for subsequent recognition, which prevents other, invalid image frames from taking part in recognition and reducing the efficiency of liveness detection.
[0069] Face detection is performed on the extracted image frames to be recognized during user reading. It can be understood that the image frames to be recognized are image sequences including multiple frames. Therefore, face detection is performed on each frame of the image to be recognized. For example, the face detection algorithm in MediaPipe is used to obtain the user's face region. Then, the face mesh estimation algorithm in MediaPipe is used on the face region image to obtain the face key point information (such as the positions of the eyes, nose and mouth) and the mesh structure information of the entire face, which serves as the data basis for lip recognition.
[0070] Based on facial key points and mesh structure information, the mouth region image and mouth feature point image are further obtained from the image frame to be recognized. These two images are then input into a pre-trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result. This lip shape recognition model can be an existing model, such as a temporal convolutional neural network model; this embodiment does not limit this. The lip shape recognition result is then matched with the prompt text to obtain the face liveness detection result.
[0071] Since a single mouth region image may be affected by the color and texture of the mouth of different users, as well as external shooting factors, affecting the accuracy of lip shape recognition, this embodiment combines the mouth feature point image, which is not affected by external factors, for fusion feature recognition. This effectively avoids the influence of mouth color and texture or external factors on lip shape recognition, and enhances the accuracy and robustness of face liveness detection based on lip shape recognition.
[0072] In one embodiment, as shown in Figure 2, step S100 includes:
[0073] S101. Collect the video of the face to be recognized as the user reads the prompt text aloud;
[0074] S102. Extract the audio data from the video of the face to be identified;
[0075] S103. Determine the spoken portion of the video of the face to be identified based on the audio data;
[0076] S104. Extract the image frames of the spoken portion as the image frames to be identified.
[0077] In this embodiment, the terminal device's camera can be activated upon receiving a user's confirmation command for data collection, or a countdown timer can be displayed on the screen. The camera is activated when the countdown ends to capture a video of the user reading the prompt text. This ensures the user is prepared before video capture, reducing the collection and recognition of invalid videos and avoiding resource waste. Audio data is then extracted from the video to identify the user's spoken portion. Image frames are extracted only from the spoken portion of the video data to obtain the corresponding image frames to be recognized, reducing the data processing load for face detection and lip-reading recognition, thereby improving the efficiency of face liveness detection.
[0078] In one embodiment, as shown in Figure 3, step S103 includes:
[0079] S1031. Determine the start and end times of the volume exceeding a preset threshold in the audio data;
[0080] S1032. Segment the video of the face to be identified according to the start time and end time to obtain the spoken portion.
[0081] In this embodiment, when determining the spoken portion of the face video to be recognized, the determination is specifically based on the volume changes in the audio data. After extracting the audio data, noise reduction processing is preferably performed to eliminate background noise interference with the spoken portion audio. Then, in all the audio data, the start and end times when the volume exceeds a preset threshold are determined. Specifically, since there may be pauses when the user reads multiple words in the prompt text, the first time the volume exceeds the preset threshold is taken as the start time, and the last time the volume exceeds the preset threshold is taken as the end time. By using these start and end times to extract the spoken portion from the face video to be recognized, the integrity of the video extraction is ensured, avoiding errors in lip recognition and matching detection caused by incomplete extraction.
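The volume-threshold rule above (first and last time the volume exceeds the threshold) can be sketched as follows. This is an illustrative sketch under stated assumptions: the audio is assumed to be already denoised and reduced to one volume value per fixed-length audio frame, and the names `speech_span` and `frames_in_span` are hypothetical.

```python
from typing import List, Optional, Tuple


def speech_span(volumes: List[float], frame_seconds: float,
                threshold: float) -> Optional[Tuple[float, float]]:
    """Return (start, end) in seconds of the spoken portion, or None.

    volumes[i] is the volume of the i-th audio frame; every audio frame
    lasts frame_seconds. Pauses between words are kept inside the span.
    """
    loud = [i for i, v in enumerate(volumes) if v > threshold]
    if not loud:
        return None
    start = loud[0] * frame_seconds        # first time the threshold is exceeded
    end = (loud[-1] + 1) * frame_seconds   # end of the last loud audio frame
    return start, end


def frames_in_span(num_frames: int, fps: float,
                   span: Tuple[float, float]) -> List[int]:
    """Indices of video frames whose timestamp lies inside the span."""
    start, end = span
    return [i for i in range(num_frames) if start <= i / fps < end]
```

For example, with volumes `[0.0, 0.1, 0.6, 0.7, 0.1, 0.8, 0.05, 0.0]`, half-second audio frames, and a threshold of 0.5, the quiet frame between the two loud runs stays inside the extracted span, which matches the handling of pauses described above.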
[0082] In one embodiment, as shown in Figure 4, step S300 includes:
[0083] S301. Calculate the transformation matrix based on the position of the facial key points in the image frame to be identified;
[0084] S302. After performing an affine transformation on the image frame to be identified according to the transformation matrix, the mouth region image is extracted.
[0085] S303. Based on the grid structure information in the image frame to be identified and the preset average frontal face grid information, perform face grid alignment on the image frame to be identified.
[0086] S304. Construct and output the mouth feature point image based on the grid points of the aligned image frame to be identified.
[0087] In this embodiment, for each image frame to be recognized, a transformation matrix is calculated based on the positions of the facial key points within it. An affine transformation is then performed on the image frame based on this transformation matrix to achieve face alignment and minimize the impact of factors such as the user's shooting angle on lip recognition. The aligned image frame is then cropped locally, for example based on the positions of several facial key points at the mouth, to obtain the mouth region image.
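One way to obtain such a transformation matrix can be sketched as follows. The source does not state which key points are used or how the matrix is solved, so this sketch assumes a similarity transform (rotation, uniform scale, and translation, a special case of an affine transform) fitted to two key points such as the eye centers. The function names and the margin parameter are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def similarity_matrix(src_a: Point, src_b: Point,
                      dst_a: Point, dst_b: Point) -> List[List[float]]:
    """2x3 matrix that maps src_a to dst_a and src_b to dst_b."""
    p1, p2 = complex(*src_a), complex(*src_b)
    q1, q2 = complex(*dst_a), complex(*dst_b)
    if p1 == p2:
        raise ValueError("source key points must be distinct")
    s = (q2 - q1) / (p2 - p1)  # rotation and scale as one complex factor
    t = q1 - s * p1            # translation
    return [[s.real, -s.imag, t.real],
            [s.imag, s.real, t.imag]]


def transform(matrix: List[List[float]], point: Point) -> Point:
    """Apply the 2x3 matrix to a single point."""
    (a, b, tx), (c, d, ty) = matrix
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def mouth_box(mouth_points: List[Point],
              margin: float) -> Tuple[float, float, float, float]:
    """Crop box (left, top, right, bottom) around the aligned mouth key points."""
    xs = [x for x, _ in mouth_points]
    ys = [y for _, y in mouth_points]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)
```

In use, the matrix would be applied to the whole frame by an image library's affine warp, and the crop box would be computed from the transformed mouth key points.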
[0088] Furthermore, for each frame of the image to be identified, the face grid of the current frame is aligned based on the estimated face grid structure information and the pre-calculated average frontal face grid information. The mouth feature point image is then constructed based on the aligned grid points, ensuring that the mouth feature point image is also constructed under the premise of face alignment, thereby improving the accuracy of the mouth feature point image.
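The grid alignment and feature point image construction can be sketched as follows. The source does not specify the alignment algorithm or the image format, so this sketch makes two assumptions: alignment only matches the centroid and root-mean-square spread of the detected mouth grid points to those of the average frontal face, and the feature point image is a binary grid with a 1 at each aligned point. All names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def _spread(points: List[Point], center: Point) -> float:
    """Root-mean-square distance of the points from the center."""
    total = sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in points)
    return (total / len(points)) ** 0.5


def align_to_mean(points: List[Point], mean_points: List[Point]) -> List[Point]:
    """Match the centroid and spread of `points` to those of `mean_points`."""
    c_src, c_dst = _centroid(points), _centroid(mean_points)
    s_src = _spread(points, c_src)
    if s_src == 0:
        raise ValueError("grid points are degenerate")
    scale = _spread(mean_points, c_dst) / s_src
    return [((x - c_src[0]) * scale + c_dst[0],
             (y - c_src[1]) * scale + c_dst[1]) for x, y in points]


def rasterize(points: List[Point], width: int, height: int) -> List[List[int]]:
    """Binary image (rows of 0/1) with a 1 at every in-bounds grid point."""
    image = [[0] * width for _ in range(height)]
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= col < width and 0 <= row < height:
            image[row][col] = 1
    return image
```

Because the resulting image contains only point positions, it carries no color or texture information, which is the property the method relies on.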
[0089] In one embodiment, as shown in Figure 5, step S400 includes:
[0090] S411. Extract features from the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0091] S412. The first feature and the second feature are spliced and fused to obtain the fused feature;
[0092] S413. After classifying the lip shape based on the fusion features, output the lip shape recognition result.
[0093] In this embodiment, when performing fusion feature recognition on two input images using the trained lip-shape recognition model, features are first extracted from the input mouth region image and mouth feature point image to obtain the first feature and the second feature. Then, the extracted features are concatenated and fused to obtain the fused feature. For example, the features extracted by the temporal convolutional neural network are concatenated and then subjected to a convolution operation to obtain the fused feature. Then, the lip-shape classification is performed on the fused feature through a fully connected network. Since the prompt text usually includes several words, the lip shape will change during reading. Therefore, when classifying the lip shape based on the fused feature, each reading action in the fused feature can be identified sequentially as an open mouth or a closed mouth shape, such as open mouth-closed mouth-open mouth, thereby obtaining the lip-shape recognition result, which serves as a reliable matching basis for face liveness detection.
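The splice-then-classify flow of this embodiment can be sketched as follows. This is a deliberately simplified stand-in, not the model described above: the temporal convolutional feature extractor and the convolution after concatenation are omitted, a single linear layer with softmax replaces the fully connected network, and the weights, biases, and function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def concat_fuse(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Splice the first and second feature vectors into one fused vector."""
    return list(first) + list(second)


def classify(fused: Sequence[float], weights: Sequence[Sequence[float]],
             biases: Sequence[float],
             labels: Sequence[str]) -> Tuple[str, List[float]]:
    """Linear layer plus softmax over lip shape types (weights are assumed)."""
    if any(len(row) != len(fused) for row in weights):
        raise ValueError("each weight row must match the fused feature length")
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs
```

In a sequence setting, `classify` would be called once per reading action to produce a sequence such as open mouth, closed mouth, open mouth.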
[0094] In one embodiment, as shown in Figure 6, step S400 includes:
[0095] S421. Extract features from the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0096] S422. After classifying mouth shapes according to the first feature and the second feature respectively, a first classification score and a second classification score are obtained;
[0097] S423. After fusing the first classification score and the second classification score, determine the lip shape type with the highest score and output the lip shape recognition result.
[0098] In this embodiment, similar to the previous embodiment, when implementing fusion feature lip shape recognition, features are first extracted from the input mouth region image and mouth feature point image to obtain the first feature and the second feature. Unlike the previous embodiment, this embodiment performs lip shape classification separately for the first feature and the second feature, obtaining two sets of classification results, namely the first classification score and the second classification score. Each classification score includes, for each of several reading actions, the probability value of every lip shape type. For example, over three successive reading actions, the first classification score is 0.8 for open mouth and 0.2 for closed mouth; 0.5 for open mouth and 0.5 for closed mouth; and 0.9 for open mouth and 0.1 for closed mouth. The second classification score is 0.9 for open mouth and 0.1 for closed mouth; 0.2 for open mouth and 0.8 for closed mouth; and 0.9 for open mouth and 0.1 for closed mouth. Thus, the lip shape classification results for the two images are obtained.
[0099] The two classification results are then fused to determine the lip shape type with the highest score. This can be done by averaging or adding the scores from both classifications. This determines the lip shape type with the highest probability after combining the two classifications for each reading action. For example, averaging the scores yields scores of 0.85 for open mouth and 0.15 for closed mouth; 0.35 for open mouth and 0.65 for closed mouth; and 0.9 for open mouth and 0.1 for closed mouth. This results in a lip shape recognition sequence of open mouth, closed mouth, open mouth. By using classification results from mouth feature point images that are unaffected by skin color, texture, or shooting conditions, the classification results of the mouth region image are optimized, ensuring the accuracy of lip shape recognition.
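The averaging variant of score fusion can be sketched with the example values given above. The function names are hypothetical, and each reading action is assumed to be represented as a mapping from lip shape type to probability.

```python
from typing import Dict, List

Scores = Dict[str, float]


def fuse_scores(first: List[Scores], second: List[Scores]) -> List[Scores]:
    """Average the two classification scores for every reading action."""
    if len(first) != len(second):
        raise ValueError("both score sequences must cover the same actions")
    return [{label: (a[label] + b[label]) / 2 for label in a}
            for a, b in zip(first, second)]


def top_labels(fused: List[Scores]) -> List[str]:
    """Lip shape type with the highest fused score for each reading action."""
    return [max(step, key=step.get) for step in fused]


# Example values from the description: three successive reading actions.
first_scores = [{"open": 0.8, "closed": 0.2},
                {"open": 0.5, "closed": 0.5},
                {"open": 0.9, "closed": 0.1}]
second_scores = [{"open": 0.9, "closed": 0.1},
                 {"open": 0.2, "closed": 0.8},
                 {"open": 0.9, "closed": 0.1}]

sequence = top_labels(fuse_scores(first_scores, second_scores))
```

The second reading action shows the benefit: the mouth region image alone is undecided (0.5 versus 0.5), and the feature point image tips the fused score toward closed mouth.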
[0100] In one embodiment, step S500 includes:
[0101] The lip shape type in the lip shape recognition result is compared with the standard lip shape type corresponding to the prompt text to determine whether they are consistent;
[0102] If they are consistent, a face liveness detection result indicating that detection passed is displayed; otherwise, a face liveness detection result indicating that detection failed is displayed.
[0103] In this embodiment, the lip-shape recognition results output by the model, i.e., the sequentially arranged lip-shape types, are compared with the standard lip-shape types corresponding to the prompt text to determine whether the lip-shape types and order are consistent. For example, if the standard lip-shape types corresponding to the current prompt text are open mouth, closed mouth, open mouth, then the current face liveness detection is confirmed to be successful only if the user's lip-shape action is also open mouth, closed mouth, open mouth; otherwise, it fails. Based on the detection results, the corresponding prompt page is displayed on the terminal device's display interface, achieving efficient, accurate, and intuitive liveness detection and result display.
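The comparison of lip shape types and their order can be sketched as follows. The mapping from prompt text to standard lip shape types is not given in the source, so the `shape_of` table passed in below is a hypothetical input, as are the function names.

```python
from typing import Dict, List, Sequence


def standard_sequence(prompt_tokens: Sequence[str],
                      shape_of: Dict[str, str]) -> List[str]:
    """Look up the standard lip shape type for each token of the prompt."""
    return [shape_of[token] for token in prompt_tokens]


def liveness_passed(recognized: Sequence[str], standard: Sequence[str]) -> bool:
    """Pass only if the types and their order are identical."""
    return list(recognized) == list(standard)
```

For instance, with a hypothetical table `{"A": "open", "B": "closed"}` and the prompt tokens `["A", "B", "A"]`, a recognized sequence of open, closed, open passes, while open, open, closed fails even though it contains the same types.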
[0104] Another embodiment of the present invention provides a liveness detection device based on fusion features. As shown in Figure 7, device 1 includes:
[0105] The acquisition and extraction module 11 is used to acquire the video of the face to be identified when the user reads the prompt text, and extract the image frames to be identified from the video of the face to be identified.
[0106] The face detection module 12 is used to perform face detection on the image frame to be identified and obtain face key points and grid structure information;
[0107] Image extraction module 13 is used to obtain the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information;
[0108] The lip shape recognition module 14 is used to input the mouth region image and the mouth feature point image into the trained lip shape recognition model for fusion feature recognition to obtain the lip shape recognition result;
[0109] The detection output module 15 is used to match the lip shape recognition result with the prompt text to obtain the face liveness detection result.
[0110] The module referred to in this invention is a series of computer program instruction segments that can perform specific functions. It is more suitable than a program for describing the execution process of liveness detection based on fusion features. For specific implementation methods of each module, please refer to the corresponding method embodiments above, which will not be repeated here.
[0111] In one embodiment, the acquisition and extraction module 11 includes:
[0112] The acquisition unit is used to acquire video of the face to be recognized when the user reads the prompt text.
[0113] An audio extraction unit is used to extract audio data from the video of the face to be identified;
[0114] The video extraction unit is used to determine the spoken portion of the video of the face to be identified based on the audio data;
[0115] An image extraction unit is used to extract the image frames of the spoken portion as the image frames to be identified.
[0116] In one embodiment, the video extraction unit includes:
[0117] A time determination unit is used to determine the start and end times when the volume of the audio data exceeds a preset threshold.
[0118] The video segmentation unit is used to segment the video of the face to be identified according to the start time and end time to obtain the spoken portion.
[0119] In one embodiment, the image extraction module 13 includes:
[0120] The calculation unit is used to calculate the transformation matrix based on the position of the facial key points in the image frame to be identified;
[0121] The transformation and cropping unit is used to perform an affine transformation on the image frame to be identified according to the transformation matrix and then crop the mouth region image.
[0122] The grid alignment unit is used to align the face grid of the image frame to be identified based on the grid structure information in the image frame to be identified and the preset average frontal face grid information.
[0123] The construction and output unit is used to construct and output the mouth feature point image based on the grid points of the aligned image frame to be identified.
[0124] In one embodiment, the lip shape recognition module 14 includes:
[0125] The feature extraction unit is used to extract features from the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0126] The feature fusion unit is used to splice and fuse the first feature and the second feature to obtain the fused feature;
[0127] The lip shape classification unit is used to classify lip shapes based on the fusion features and then output the lip shape recognition result.
[0128] In another embodiment, the lip shape recognition module 14 includes:
[0129] The feature extraction unit is used to extract features from the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0130] The lip shape classification unit is used to classify lip shapes according to the first feature and the second feature respectively to obtain a first classification score and a second classification score;
[0131] The classification fusion unit is used to fuse the first classification score and the second classification score, determine the lip shape type with the highest fused score, and output the lip shape recognition result.
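The score-level fusion path can be sketched in a similar way. The fusion rule is not given in the description; a weighted average of the two per-class probability scores is assumed here, and the labels and score values are hypothetical.

```python
def fuse_scores(first_scores, second_scores, first_weight=0.5):
    """Weighted average of two per-class probability dicts (assumed rule);
    returns the best label and the fused scores."""
    second_weight = 1.0 - first_weight
    fused = {label: first_weight * first_scores[label]
                    + second_weight * second_scores[label]
             for label in first_scores}
    best = max(fused, key=fused.get)
    return best, fused

# The image branch slightly prefers "a"; the feature point branch corrects it.
image_scores = {"a": 0.45, "o": 0.40, "e": 0.15}
point_scores = {"a": 0.20, "o": 0.70, "e": 0.10}
best, fused = fuse_scores(image_scores, point_scores)
```

In this toy case the mouth region image alone would have chosen "a", while the fused score selects "o", which is the kind of correction by the feature point branch that the fusion is meant to provide.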
[0132] In one embodiment, the detection output module 15 includes:
[0133] The lip shape comparison unit is used to compare the lip shape type in the lip shape recognition result with the standard lip shape type corresponding to the prompt text to determine whether they are consistent.
[0134] The detection output unit is used to display a face liveness detection result indicating that detection passed if they are consistent, and otherwise to display a face liveness detection result indicating that detection failed.
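The comparison performed by the detection output module can be sketched as a table lookup followed by an equality check. The mapping from prompt characters to standard lip shape types, the label names, and the function name `liveness_result` are hypothetical; the description only states that the prompt text has corresponding standard lip shape types.

```python
# Hypothetical table: standard lip shape type for each prompt character.
STANDARD_LIP_SHAPES = {"3": "spread", "5": "rounded", "8": "open"}

def liveness_result(recognized_shapes, prompt_text, table=STANDARD_LIP_SHAPES):
    """Return "pass" if the recognized lip shape types match the standard
    types for the prompt text in order, otherwise "fail"."""
    expected = [table[ch] for ch in prompt_text]
    return "pass" if list(recognized_shapes) == expected else "fail"

passed = liveness_result(["rounded", "open", "spread"], "583")
failed = liveness_result(["rounded", "rounded", "spread"], "583")
```

An exact match is the strictest reading of "consistent"; a deployed system might tolerate a small number of mismatches, which is outside what the description states.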
[0135] Another embodiment of the present invention provides a liveness detection system based on fusion features. As shown in Figure 8, system 10 includes:
[0136] One or more processors 110 and a memory 120. Figure 8 takes one processor 110 as an example. The processor 110 and the memory 120 can be connected via a bus or other means; Figure 8 takes a bus connection as an example.
[0137] Processor 110 is used to perform various control logics of system 10, and can be a general-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), microcontroller, ARM (Acorn RISC Machine) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. Furthermore, processor 110 can also be any conventional processor, microprocessor, or state machine. Processor 110 can also be implemented as a combination of computing devices, such as a combination of DSP and microprocessor, multiple microprocessors, one or more microprocessors combined with DSP and / or any other such configuration.
[0138] The memory 120, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions corresponding to the liveness detection method based on fusion features in the embodiments of the present invention. The processor 110 executes various functional applications and data processing of the system 10 by running the non-volatile software programs, instructions, and units stored in the memory 120, thereby implementing the liveness detection method based on fusion features in the above method embodiments.
[0139] The memory 120 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created according to the use of the system 10. Furthermore, the memory 120 may include high-speed random access memory and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 120 may optionally include memory remotely located relative to the processor 110, and these remote memories may be connected to the system 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
[0140] One or more units are stored in memory 120, and when executed by one or more processors 110, perform the following steps:
[0141] Collect a video of the face to be identified while the user reads the prompt text, and extract the image frames to be identified from the video of the face to be identified;
[0142] Face detection is performed on the image frame to be identified to obtain facial key points and grid structure information;
[0143] Based on the facial key points and grid structure information, obtain the mouth region image and mouth feature point image of the image frame to be identified;
[0144] The mouth region image and the mouth feature point image are input into the trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result;
[0145] The lip shape recognition result is matched with the prompt text to obtain the face liveness detection result.
[0146] In one embodiment, the step of acquiring the video of the face to be identified from the user reading the prompt text, and extracting the image frames to be identified from the video of the face to be identified, includes:
[0147] Collect video of the face to be recognized as the user reads the prompt text aloud;
[0148] Extract the audio data from the video of the face to be identified;
[0149] The audio data is used to determine the spoken portion of the face video to be identified.
[0150] The image frames of the spoken portion are extracted as the image frames to be identified.
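Once the spoken portion is known as a time range, selecting its image frames reduces to index arithmetic. The sketch below assumes a constant frame rate and that frame i is captured at time i / fps; the function name `frames_in_span` is hypothetical.

```python
import math

def frames_in_span(start_s, end_s, fps, total_frames):
    """Indices of frames whose timestamp i / fps lies in [start_s, end_s)."""
    first = max(0, math.ceil(start_s * fps))
    last = min(total_frames, math.ceil(end_s * fps))   # exclusive bound
    return list(range(first, last))

# A 3 s clip at 25 fps in which the user speaks from 1.0 s to 2.0 s.
frames = frames_in_span(1.0, 2.0, fps=25, total_frames=75)
```

Only these frames would be passed on to face detection, so silent frames before and after the reading do not reach the lip shape recognition model.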
[0151] In one embodiment, determining the spoken portion of the face video to be identified based on the audio data includes:
[0152] Determine the start and end times when the volume in the audio data exceeds a preset threshold;
[0153] The video of the face to be identified is segmented according to the start and end times to obtain the spoken portion.
[0154] In one embodiment, obtaining the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and mesh structure information includes:
[0155] Calculate the transformation matrix based on the position of the facial key points in the image frame to be identified;
[0156] After performing an affine transformation on the image frame to be identified based on the transformation matrix, the mouth region image is extracted.
[0157] Based on the grid structure information in the image frame to be identified and the preset average frontal face grid information, face grid alignment is performed on the image frame to be identified.
[0158] The mouth feature point image is constructed and output based on the grid points of the aligned image frame to be identified.
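The grid alignment and feature point image construction can be sketched as a least-squares similarity alignment followed by rasterization. The alignment method is not given in the description; a 2-D Procrustes-style fit (rotation, uniform scale, translation) is assumed, the four mouth grid points are hypothetical, and a real face grid would contain many more points, possibly in 3-D.

```python
def align_mesh(mesh, mean_mesh):
    """Least-squares similarity alignment of `mesh` onto `mean_mesh`,
    treating each (x, y) grid point as a complex number."""
    src = [complex(x, y) for x, y in mesh]
    dst = [complex(x, y) for x, y in mean_mesh]
    src_mean = sum(src) / len(src)
    dst_mean = sum(dst) / len(dst)
    num = sum((s - src_mean).conjugate() * (d - dst_mean)
              for s, d in zip(src, dst))
    den = sum(abs(s - src_mean) ** 2 for s in src)
    a = num / den                     # rotation and scale
    b = dst_mean - a * src_mean       # translation
    aligned = [a * s + b for s in src]
    return [(z.real, z.imag) for z in aligned]

def draw_points(points, size):
    """Binary size x size image with a 1 at each rounded point position."""
    image = [[0] * size for _ in range(size)]
    for x, y in points:
        col, row = round(x), round(y)
        if 0 <= col < size and 0 <= row < size:
            image[row][col] = 1
    return image

# Hypothetical mouth grid points of the average frontal face (32x32 canvas).
mean_mouth = [(8, 16), (16, 12), (24, 16), (16, 20)]
# The same mouth as observed in a frame: rotated 90 degrees, scaled x2, shifted.
observed = [(68, 66), (76, 82), (68, 98), (60, 82)]
aligned = align_mesh(observed, mean_mouth)
point_image = draw_points(aligned, 32)
```

Since the resulting image contains only point positions, it carries mouth geometry without any color or texture information, which is the property the fusion with the mouth region image relies on.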
[0159] In one embodiment, the step of inputting the mouth region image and the mouth feature point image into a trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result includes:
[0160] Feature extraction is performed on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0161] The first feature and the second feature are spliced and fused to obtain the fused feature;
[0162] After classifying the lip shape based on the fusion features, the lip shape recognition result is output.
[0163] In another embodiment, the step of inputting the mouth region image and the mouth feature point image into a trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result includes:
[0164] Feature extraction is performed on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
[0165] Lip shapes are classified based on the first feature and the second feature respectively to obtain a first classification score and a second classification score;
[0166] The first classification score and the second classification score are fused, the lip shape type with the highest fused score is determined, and the lip shape recognition result is output.
[0167] In one embodiment, the step of matching the lip shape recognition result with the prompt text to obtain the face liveness detection result includes:
[0168] The lip shape type in the lip shape recognition result is compared with the standard lip shape type corresponding to the prompt text to determine whether they are consistent;
[0169] If they are consistent, a face liveness detection result indicating that detection passed is displayed; otherwise, a face liveness detection result indicating that detection failed is displayed.
[0170] The present invention also provides a non-volatile computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, cause them to perform, for example, method steps S100 to S500 in Figure 1 described above.
[0171] As examples, non-volatile storage media can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which serves as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The memory components or memories disclosed in the operating environment described herein are intended to include one or more of these and / or any other suitable types of memory.
[0172] In summary, the liveness detection method, apparatus, system, and medium based on fusion features disclosed in this invention involve: acquiring a video of a face to be identified while a user reads a prompt text; extracting image frames to be identified from the video; performing face detection on the image frames to be identified to obtain facial key points and grid structure information; obtaining mouth region images and mouth feature point images of the image frames to be identified based on the facial key points and grid structure information; inputting the mouth region images and mouth feature point images into a trained lip shape recognition model for fusion feature recognition to obtain lip shape recognition results; and matching the lip shape recognition results with the prompt text to obtain the face liveness detection result. By using mouth region images and mouth feature point images for fusion feature recognition to obtain lip shape recognition results, the influence of mouth color and texture or external factors on lip shape recognition is effectively avoided, enhancing the accuracy and robustness of face liveness detection based on lip shape recognition.
[0173] Of course, those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware (such as a processor, controller, etc.). The computer program can be stored in a non-volatile, computer-readable storage medium, and when executed, it can include the processes described in the above method embodiments. The storage medium can be a memory, magnetic disk, floppy disk, flash memory, optical storage, etc.
[0174] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
Claims
1. A liveness detection method based on fusion features, characterized in that it comprises:
collecting a video of the face to be identified while the user reads the prompt text, and extracting the image frames to be identified from the video of the face to be identified;
performing face detection on the image frame to be identified to obtain facial key points and grid structure information;
obtaining the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information;
inputting the mouth region image and the mouth feature point image into the trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result;
matching the lip shape recognition result with the prompt text to obtain the face liveness detection result;
wherein the step of obtaining the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information includes:
calculating the transformation matrix based on the positions of the facial key points in the image frame to be identified;
performing an affine transformation on the image frame to be identified according to the transformation matrix and then cropping the mouth region image;
performing face grid alignment on the image frame to be identified based on the grid structure information in the image frame to be identified and the preset average frontal face grid information;
constructing and outputting the mouth feature point image based on the grid points of the aligned image frame to be identified;
and wherein the step of inputting the mouth region image and the mouth feature point image into the trained lip shape recognition model for feature fusion recognition to obtain the lip shape recognition result includes:
performing feature extraction on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
splicing and fusing the first feature and the second feature to obtain the fused feature;
classifying the lip shape based on the fused feature and then outputting the lip shape recognition result;
or,
performing feature extraction on the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
classifying the lip shape based on the first feature and the second feature respectively to obtain a first classification score and a second classification score, both of which include probability values of the lip shape types corresponding to several reading actions;
fusing the first classification score and the second classification score, determining the lip shape type with the highest score, optimizing the classification result of the mouth region image based on the classification result of the mouth feature point image, and outputting the lip shape recognition result.
2. The liveness detection method based on fusion features according to claim 1, characterized in that the step of collecting the video of the face to be identified while the user reads the prompt text, and extracting the image frames to be identified from the video of the face to be identified, includes: collecting the video of the face to be identified while the user reads the prompt text aloud; extracting the audio data from the video of the face to be identified; determining the spoken portion of the video of the face to be identified based on the audio data; and extracting the image frames of the spoken portion as the image frames to be identified.
3. The liveness detection method based on fusion features according to claim 2, characterized in that the step of determining the spoken portion of the video of the face to be identified based on the audio data includes: determining the start and end times when the volume in the audio data exceeds a preset threshold; and segmenting the video of the face to be identified according to the start and end times to obtain the spoken portion.
4. The liveness detection method based on fusion features according to claim 1, characterized in that the step of matching the lip shape recognition result with the prompt text to obtain the face liveness detection result includes: comparing the lip shape type in the lip shape recognition result with the standard lip shape type corresponding to the prompt text to determine whether they are consistent; if they are consistent, displaying a face liveness detection result indicating that detection passed; otherwise, displaying a face liveness detection result indicating that detection failed.
5. A liveness detection device based on fusion features, characterized in that it comprises:
an acquisition and extraction module, used to acquire the video of the face to be identified while the user reads the prompt text, and to extract the image frames to be identified from the video of the face to be identified;
a face detection module, used to perform face detection on the image frame to be identified and obtain facial key points and grid structure information;
an image extraction module, used to obtain the mouth region image and mouth feature point image of the image frame to be identified based on the facial key points and grid structure information;
a lip shape recognition module, used to input the mouth region image and the mouth feature point image into the trained lip shape recognition model for fusion feature recognition to obtain the lip shape recognition result;
a detection output module, used to match the lip shape recognition result with the prompt text to obtain the face liveness detection result;
wherein the image extraction module includes:
a calculation unit, used to calculate the transformation matrix based on the positions of the facial key points in the image frame to be identified;
a transformation and cropping unit, used to perform an affine transformation on the image frame to be identified according to the transformation matrix and then crop the mouth region image;
a grid alignment unit, used to align the face grid of the image frame to be identified based on the grid structure information in the image frame to be identified and the preset average frontal face grid information;
a construction and output unit, used to construct and output the mouth feature point image based on the grid points of the aligned image frame to be identified;
and wherein the lip shape recognition module includes:
a feature extraction unit, used to extract features from the mouth region image and the mouth feature point image respectively to obtain the first feature and the second feature;
a feature fusion unit, used to splice and fuse the first feature and the second feature to obtain the fused feature;
a lip shape classification unit, used to classify lip shapes based on the fused feature and then output the lip shape recognition result;
the lip shape classification unit is also used to classify lip shapes according to the first feature and the second feature respectively to obtain a first classification score and a second classification score, both of which include probability values of the lip shape types corresponding to several reading actions;
the lip shape recognition module also includes a classification fusion unit, used to fuse the first classification score and the second classification score, determine the lip shape type with the highest score, optimize the classification result of the mouth region image based on the classification result of the mouth feature point image, and output the lip shape recognition result.
6. A liveness detection system based on fusion features, characterized in that the system includes at least one processor, and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the liveness detection method based on fusion features as described in any one of claims 1-4.
7. A non-volatile computer-readable storage medium, characterized in that the non-volatile computer-readable storage medium stores computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform the liveness detection method based on fusion features as described in any one of claims 1-4.