A face anti-counterfeiting identification method, device, equipment and storage medium

By fusing binocular image acquisition and neural network model features for extraction and identification, the problem of distinguishing between genuine and fake faces in existing technologies has been solved. This achieves high-precision, low-cost anti-counterfeiting identification capabilities and is applicable to various face recognition systems.

CN122244929A · Pending · Publication Date: 2026-06-19 · JILIN UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JILIN UNIVERSITY
Filing Date
2026-05-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing facial recognition technologies struggle to effectively distinguish between real and fake faces, especially when faced with attacks from high-precision simulation masks and 3D-printed masks, resulting in insufficient security. Furthermore, relying on single-modal feature acquisition and recognition or multi-modal data stitching is not explicit enough to meet the authentication requirements of high-security scenarios.

Method used

Binocular images are acquired using a binocular acquisition device, and a neural network model consisting of a transformer encoder, a CLIP model, and a classification head is used for training. Through semantic feature extraction, disparity cost volume construction, and geometric embedding feature fusion, the authenticity of human faces can be identified.

Benefits of technology

It significantly improves the accuracy of anti-counterfeiting identification, reduces the rate of missed detections and false detections, and has high anti-counterfeiting identification capabilities against various counterfeiting attacks, reducing identification costs and making it suitable for various facial recognition systems.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244929A_ABST
Patent Text Reader

Abstract

This application discloses a face anti-counterfeiting authentication method, apparatus, device, and storage medium, relating to the field of binocular vision technology. The method includes: inputting binocular images into a trained face anti-counterfeiting authentication model to extract semantic features from the binocular images using a CLIP model to obtain facial semantic features; constructing a disparity cost volume based on the binocular images and aggregating the disparity cost volume to generate a disparity cost space; encoding the disparity cost space into word embeddings of the same dimension as the facial semantic features using a transformer encoder to obtain geometric embedding features; concatenating the geometric embedding features with the facial semantic features using a bidirectional modulation attention mechanism, and inputting the concatenated features into a classification head to authenticate faces in the binocular images. This application can improve the accuracy of face anti-counterfeiting authentication, reduce authentication costs, and possess high anti-counterfeiting authentication capabilities against various types of forgery attacks.

Description

Technical Field

[0001] This application relates to the field of binocular vision technology, and in particular to a method, device, equipment and storage medium for anti-counterfeiting identification of human faces.

Background Technology

[0002] Currently, facial recognition has become deeply integrated into daily life due to its convenience and efficiency, and is widely used in high-frequency scenarios such as unlocking smart terminals, corporate attendance tracking, and mobile payments, becoming a mainstream biometric authentication technology. This technology primarily acquires two-dimensional facial images and extracts visual features such as facial structure, contours, and texture details to achieve rapid identity verification and identification. However, with the continuous development of simulation and forgery technology, high-precision simulation masks, 3D-printed masks, and silicone skin masks can highly replicate real facial features, easily bypassing traditional facial recognition verification logic. This makes it difficult for the system to distinguish between real faces and forged devices, easily leading to problems such as facial information misuse and identity authentication failure, resulting in privacy leaks, account theft, and financial losses, and failing to meet the authentication requirements of high-security scenarios.

[0003] To address the security shortcomings of traditional solutions, improved methods such as 2D liveness detection, 3D vision technology, and infrared / multispectral recognition have been proposed. However, significant drawbacks remain: these solutions often rely on single-modal feature acquisition and recognition, or simply stitch together multimodal data heuristically. They fail to explicitly model the inherent consistency of real faces in geometric, physiological, and textural dimensions, resulting in limited defense against novel mask attacks and insufficient generalization against forged faces generated by unknown models. Consequently, they cannot fundamentally solve the core problems of facial recognition's vulnerability to forgery and insufficient security. Furthermore, identification methods relying on dedicated acquisition equipment such as dot projectors, infrared cameras, and depth sensors are costly to repair in case of hardware damage and are difficult to reuse and promote on conventional equipment.

Summary of the Invention

[0004] In view of this, the purpose of this application is to provide a face anti-counterfeiting authentication method, device, equipment, and storage medium, which can significantly improve the accuracy of anti-counterfeiting authentication, avoid missed detections and false detections, and reduce authentication costs, while possessing high anti-counterfeiting authentication capabilities against various types of counterfeiting attacks. The specific solution is as follows:

In a first aspect, this application discloses a face anti-counterfeiting authentication method, applied to a face recognition system, comprising:

acquiring the binocular image captured by the current binocular acquisition device to obtain the current binocular image; the current binocular image includes a left view and a right view;

inputting the current binocular image into the trained face anti-spoofing identification model; the face anti-spoofing identification model is a model obtained by training a neural network model containing a transformer encoder, a CLIP model and a classification head using historical binocular images; the CLIP model contains a visual encoder based on the ViT architecture;

extracting semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current facial semantic features;

constructing a disparity cost volume based on the current binocular image, and performing cost aggregation on the disparity cost volume to generate a disparity cost space;

encoding the disparity cost space, through the transformer encoder, into word embeddings of the same dimension as the current facial semantic features to obtain geometric embedding features;

concatenating the geometric embedding features and the current facial semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain the current concatenated features;

inputting the current concatenated features into the classification head to distinguish between real and fake target faces in the binocular image, obtaining the face identification result.

[0005] Optionally, inputting the current binocular image into the trained face anti-spoofing identification model includes: Obtain the binocular stereo calibration parameters of the binocular acquisition device; Based on the binocular stereo calibration parameters, distortion correction and epipolar correction are performed on the left and right views in the current binocular image respectively, so that the corresponding pixels of the left and right views are located on the same scan line, and the corrected binocular image is obtained. The corrected binocular image is input into the trained face anti-spoofing identification model.

[0006] Optionally, the step of extracting semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current facial semantic features includes: The left and right views in the corrected binocular image are segmented according to a preset resolution to obtain multiple segmented image blocks corresponding to a single view. The CLIP model is used to extract semantic features from each segmented image block corresponding to a single view to obtain initial local semantic features. The initial local semantic features corresponding to a single view are concatenated with a learnable first classification marker to obtain concatenated semantic features. The concatenated semantic features are then input into the visual encoder in the CLIP model, which encodes them using a self-attention mechanism to obtain the current global semantic features. The first classification marker is removed from the initial local semantic features corresponding to a single view to obtain the current local semantic features; the current local semantic features contain the position information of the image blocks.

[0007] Optionally, the step of constructing a disparity cost volume based on the current binocular image and performing cost aggregation on the disparity cost volume to generate a disparity cost space includes: The matching cost of each pixel in the left and right views of the corrected binocular image within a preset disparity range is calculated using a convolutional network to generate a disparity cost volume. The disparity cost volume is aggregated using a differentiable semi-global matching algorithm to obtain the disparity cost space.

[0008] Optionally, the step of encoding the disparity cost space into word embeddings of the same dimension as the current face semantic features through the transformer encoder to obtain geometric embedding features includes: The disparity cost space is projected onto the word embedding with the same dimension as the current local semantic feature in the disparity dimension to obtain the feature map; The feature map is flattened into a one-dimensional sequence to obtain a first feature sequence, and a second classification marker and position code are added to the first feature sequence to obtain a second feature sequence; The second feature sequence is input into the transformer encoder to encode the second feature sequence, thereby obtaining global geometric features representing the entire face and local geometric features representing different face regions.
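The projection, flattening and marker steps in [0008] can be illustrated with a minimal NumPy sketch. It is not the applicant's implementation. The function name `build_geometry_sequence`, the linear projection matrix `proj`, and in particular the average pooling down to a 16×16 grid (chosen here only so that the token count matches the 256 image blocks of the semantic branch) are assumptions that the source does not state.

```python
import numpy as np

def build_geometry_sequence(cost_space, proj, cls_token, pos_embed, grid=16):
    """Turn a disparity cost space [D, H, W] into a token sequence [1 + grid*grid, E].

    proj:      [D, E] linear projection over the disparity dimension (hypothetical).
    cls_token: [E] second classification marker.
    pos_embed: [1 + grid*grid, E] position code.
    """
    d, h, w = cost_space.shape
    # ASSUMPTION: average-pool H x W down to grid x grid so there is one token per region.
    pooled = cost_space.reshape(d, grid, h // grid, grid, w // grid).mean(axis=(2, 4))
    # Project the disparity dimension D onto the embedding dimension E.
    feature_map = np.einsum("dhw,de->hwe", pooled, proj)      # [grid, grid, E]
    # Flatten the feature map into a one-dimensional sequence (first feature sequence).
    first_seq = feature_map.reshape(grid * grid, -1)          # [N, E]
    # Prepend the classification marker and add the position code (second feature sequence).
    second_seq = np.concatenate([cls_token[None, :], first_seq], axis=0) + pos_embed
    return second_seq
```

With a [64, 256, 256] cost space and E = 768, the result has shape [257, 768]: one marker token followed by 256 region tokens, ready to be passed to a transformer encoder.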

[0009] Optionally, encoding the second feature sequence to obtain global geometric features representing the entire face and local geometric features representing different face regions includes: Based on the second feature sequence and using a self-attention mechanism, a long-range dependency relationship of the three-dimensional structure of the face in the corrected binocular image is established, and the second feature sequence is encoded based on the long-range dependency relationship to obtain global geometric features representing the three-dimensional geometric structure of the entire face and local geometric features representing different face regions; wherein, the dimension of the global geometric features is the same as the dimension of the current global semantic features.

[0010] Optionally, the step of using a bidirectional modulation attention mechanism to concatenate the geometric embedding features and the current facial semantic features along the feature dimension to obtain the current concatenated features includes: The global geometric features and the current global semantic features are concatenated along the feature dimension to obtain the concatenated global features. The local geometric features and the current local semantic features are concatenated along the feature dimension to obtain the locally concatenated features. The global and local concatenated features are enhanced using a bidirectional modulation attention mechanism to obtain enhanced global and local features. Perform global average pooling on the enhanced local features to obtain local pooled features; The enhanced global features and the local pooled features are concatenated along the feature dimension to obtain the current concatenated features.
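The shape bookkeeping of the fusion in [0010] can be sketched as follows. This is only an illustration under assumptions: the bidirectional modulation attention mechanism is not specified in this section, so it is represented by an optional, hypothetical `modulate` callback that defaults to a pass-through, and the function name `fuse` is likewise made up for the example.

```python
import numpy as np

def fuse(geo_cls, geo_patch, clip_cls, clip_patch, modulate=None):
    """Concatenate geometric and semantic features as described in [0010].

    geo_cls, clip_cls:     [1, E]    global features
    geo_patch, clip_patch: [1, N, E] local features
    modulate: hypothetical stand-in for the bidirectional modulation attention;
              takes and returns (global_cat, local_cat). Pass-through if None.
    """
    global_cat = np.concatenate([geo_cls, clip_cls], axis=-1)      # [1, 2E]
    local_cat = np.concatenate([geo_patch, clip_patch], axis=-1)   # [1, N, 2E]
    if modulate is not None:
        global_cat, local_cat = modulate(global_cat, local_cat)
    # Global average pooling over the N local tokens.
    local_pooled = local_cat.mean(axis=1)                          # [1, 2E]
    # Final concatenation along the feature dimension.
    return np.concatenate([global_cat, local_pooled], axis=-1)     # [1, 4E]
```

With E = 768 and N = 256, the current concatenated feature passed to the classification head would have shape [1, 3072] under these assumptions.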

[0011] In a second aspect, this application discloses a face anti-counterfeiting authentication device, applied to a face recognition system, comprising:

an image acquisition module, used to acquire the binocular image captured by the current binocular acquisition device to obtain the current binocular image; the current binocular image includes a left view and a right view;

an image input module, used to input the current binocular image into the trained face anti-spoofing identification model; the face anti-spoofing identification model is a model obtained by training a neural network model containing a transformer encoder, a CLIP model and a classification head using historical binocular images; the CLIP model contains a visual encoder based on the ViT architecture;

a feature extraction module, used to extract semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current facial semantic features;

a construction module, used to construct the disparity cost volume based on the current binocular image;

an aggregation module, used to aggregate the disparity cost volume to generate a disparity cost space;

an encoding module, used to encode the disparity cost space into word embeddings of the same dimension as the current facial semantic features through the transformer encoder, thereby obtaining geometric embedding features;

a splicing module, used to concatenate the geometric embedding features and the current facial semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain the current concatenated features;

a face identification module, used to input the current concatenated features into the classification head to identify the authenticity of the target face in the binocular image and obtain the face identification result.

[0012] In a third aspect, this application discloses an electronic device, including a processor and a memory; wherein, when the processor executes a computer program stored in the memory, it implements the aforementioned face anti-counterfeiting identification method.

[0013] In a fourth aspect, this application discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned face anti-counterfeiting identification method.

[0014] As can be seen, this application is applied to a face recognition system. First, it acquires the binocular image captured by the current binocular acquisition device to obtain the current binocular image. Then, the current binocular image is input into a face anti-spoofing identification model obtained by training a neural network model containing a transformer encoder, a CLIP model (containing a visual encoder based on the ViT architecture), and a classification head using historical binocular images. The CLIP model extracts semantic features from the left and right views of the current binocular image to obtain the current facial semantic features. Then, a disparity cost volume is constructed based on the current binocular image, and the disparity cost volume is aggregated to generate a disparity cost space. The transformer encoder encodes the disparity cost space into word embeddings of the same dimension as the current facial semantic features to obtain geometric embedding features. A bidirectional modulation attention mechanism is used to concatenate the geometric embedding features and the current facial semantic features along the feature dimension to obtain the current concatenated features. Finally, the current concatenated features are input into the classification head to identify the authenticity of the target face in the binocular image and obtain the face identification result.

[0015] This application utilizes a neural network model comprising a transformer encoder, a CLIP model (including a visual encoder based on the ViT architecture), and a classification head to distinguish between real and fake faces appearing in binocular images. The specific identification process is as follows: First, the CLIP model extracts semantic features from the left and right views of the binocular image to obtain facial semantic features. Then, a disparity cost volume is constructed based on the current binocular image, and cost aggregation is performed on the disparity cost volume to generate a disparity cost space. Next, the transformer encoder encodes the disparity cost space into word embeddings of the same dimension as the current facial semantic features, obtaining geometric embedding features. Finally, a bidirectional modulation attention mechanism is used to fuse the two different-dimensional features (geometric embedding features and facial semantic features), and the fused multi-dimensional features are then used for further analysis. The above-mentioned method for authenticating facial features integrates features from multiple dimensions for facial recognition. These features are complementary and do not overlap, thus covering attack blind spots. It can capture both planar textures and forgery artifacts of the face, as well as the true three-dimensional structure, enabling physical verification of authenticity. Compared to single-dimensional facial recognition methods, this significantly improves the accuracy of anti-counterfeiting authentication, avoiding missed and false detections. Furthermore, this application uses binocular images acquired by a binocular acquisition device (such as a binocular RGB camera) for facial recognition, resulting in higher recognition / authentication accuracy compared to monocular image-based methods. 
Moreover, this application does not require dedicated image acquisition equipment, thereby reducing authentication costs. It can be widely applied to various types of facial recognition systems and possesses high anti-counterfeiting authentication capabilities against multiple types of forgery attacks.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0017] Figure 1 is a flowchart of a face anti-counterfeiting authentication method disclosed in this application;
Figure 2 is a flowchart of a specific face anti-counterfeiting authentication method disclosed in this application;
Figure 3 is a schematic diagram of the structure of a face anti-counterfeiting identification device disclosed in this application;
Figure 4 is a structural diagram of an electronic device disclosed in this application.

Detailed Implementation

[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0019] This application discloses a face anti-counterfeiting authentication method, applied to a face recognition system. As shown in Figure 1, the method includes:

Step S11: Acquire the binocular image captured by the current binocular acquisition device to obtain the current binocular image; the current binocular image includes a left view and a right view.

[0020] It should be noted that the face anti-counterfeiting identification scheme proposed in this application can be applied to face recognition systems in scenarios such as robot navigation, mobile phone unlocking, corporate attendance check-in, and mobile payment. The recognition system can monitor the binocular acquisition device in real time. When it detects that the binocular acquisition device has acquired a binocular image containing a left view and a right view, it can obtain the currently acquired binocular image from the binocular acquisition device.

[0021] The binocular acquisition device includes, but is not limited to, a binocular RGB camera (a visual sensor that acquires images based on the three primary colors of red, green, and blue), a binocular camera, and a binocular vision sensor.

[0022] It should be noted that binocular image acquisition in every scenario is conducted under preset authorization conditions and does not infringe on personal privacy. These preset authorization conditions mean that the facial recognition system has obtained authorization in advance from each user in the current frame for facial information collection and recognition operations. For example, in a corporate attendance tracking scenario, after obtaining prior consent from employees, binocular acquisition devices are deployed in designated public areas of the company to collect facial information and perform facial verification on employees passing by the devices on their way to and from work.

[0023] Step S12: Input the current binocular image into the trained face anti-spoofing identification model; the face anti-spoofing identification model is a model obtained by training a neural network model containing a transformer encoder, CLIP model and a classification head using historical binocular images; the CLIP model contains a visual encoder based on the ViT architecture.

[0024] In this embodiment, after acquiring the current binocular image captured by the binocular acquisition device, the current binocular image is input into a face anti-spoofing identification model that has been trained using a large number of historical binocular images (which may be labeled binocular images). This model includes a transformer encoder, a CLIP model (Contrastive Language-Image Pre-training), and a classification head. This model is then used to perform anti-spoofing identification on the faces appearing in the current binocular image, i.e., to determine whether the faces appearing in the image are real or fake.

[0025] It should be noted that the CLIP model includes a visual encoder based on the ViT (Vision Transformer, a deep learning model based on the Transformer architecture) architecture.

[0026] In this embodiment, inputting the current binocular image into the trained face anti-spoofing identification model may specifically include: acquiring the binocular stereo calibration parameters of the binocular acquisition device; performing distortion correction and epipolar correction on the left and right views of the current binocular image based on the binocular stereo calibration parameters, so that the corresponding pixels of the left and right views are located on the same scan line, obtaining a corrected binocular image; and inputting the corrected binocular image into the trained face anti-spoofing identification model.

In this embodiment, to improve the consistency of the subsequent facial semantic features and the accuracy of the disparity cost space construction, the current binocular image can be preprocessed and corrected before it is input into the face anti-spoofing identification model, so that the corresponding pixels of the left and right views fall on the same horizontal line, without vertical offset or geometric distortion. Specifically, the pre-calibrated stereo calibration parameters of the binocular acquisition device (such as a binocular camera), e.g. the camera's intrinsic and extrinsic parameters, are obtained first. Then the `stereoRectify` function (which computes the rectification transforms for a calibrated stereo camera) and the `initUndistortRectifyMap` function (which generates the undistortion and rectification maps used to remap an image) of OpenCV (an open-source computer vision library) are called to perform distortion correction and epipolar correction on the left and right views synchronously acquired by the binocular camera, obtaining the corrected binocular image.

It should be noted that the left and right views should meet preset image acquisition requirements, such as a resolution of at least 640×480, and the face occupying more than 30% of the entire image. The corresponding pixels in the left and right views of the corrected binocular image should be located on the same scan line, providing standard input for the subsequent construction of the disparity cost space. The camera intrinsic parameters include, but are not limited to, focal length, principal point coordinates, and distortion coefficients, while the camera extrinsic parameters include, but are not limited to, rotation matrix and translation vector.

[0027] In addition, the corrected binocular image can be output according to a preset output method. For example, the corrected binocular image can be output as two images of different scales to obtain a first corrected binocular image of 224×224 (adapted to the subsequent CLIP model for global feature extraction) and a second corrected binocular image of 256×256 (preserving the original disparity information for constructing the disparity cost space). Then, the two images of different scales (i.e., the first corrected binocular image and the second corrected binocular image) are input into the face anti-spoofing identification model to perform face authenticity identification.

[0028] Step S13: Extract semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current facial semantic features.

[0029] In this embodiment, the CLIP model (such as the CLIP-ViT-L/14 model) in the face anti-spoofing identification model can be used to extract global and local semantic features from the left and right views of the current binocular image, respectively, to obtain the current facial semantic features. The CLIP-ViT-L/14 model is pre-trained by OpenAI on a large number of image-text pairs, and its weights are kept frozen throughout the training process without any fine-tuning, thus avoiding overfitting.

[0030] In this embodiment, the step of extracting semantic features from the left and right views of the current binocular image using the CLIP model to obtain current facial semantic features specifically includes: segmenting the left and right views of the corrected binocular image according to a preset resolution to obtain multiple segmented image blocks corresponding to a single view; extracting semantic features from each segmented image block corresponding to a single view using the CLIP model to obtain initial local semantic features; concatenating the initial local semantic features corresponding to a single view with a learnable first classification marker to obtain concatenated semantic features, and inputting the concatenated semantic features into the visual encoder in the CLIP model to encode the concatenated semantic features using a self-attention mechanism to obtain current global semantic features; removing the first classification marker from the initial local semantic features corresponding to a single view to obtain current local semantic features; the current local semantic features include the position information of the image blocks.

In this embodiment, the left and right views of the 224×224 first corrected binocular image are first divided into multiple image patches according to a preset resolution (i.e., pixel size, such as 14×14), resulting in 256 segmented image patches corresponding to a single view. Then, the CLIP model is used to extract semantic features from each segmented image patch corresponding to a single view, obtaining initial local semantic features corresponding to each segmented image patch. These features contain local details such as facial features and skin texture. Next, the initial local semantic features corresponding to the segmented image patches are concatenated with a learnable <cls> token (Classification Token, a special marker used to represent the global semantic information of the entire input sequence), and the concatenated semantic features are input into the visual encoder of the CLIP model, which encodes them using a multi-layer self-attention mechanism to obtain the current global semantic features.

It should be noted that, after encoding through the multi-layer self-attention mechanism, the <cls> token aggregates global information; that is, it integrates / summarizes the initial local semantic features of the multiple segmented image patches corresponding to a single view. The output layer of the visual encoder in the CLIP model outputs the aforementioned current global semantic feature clip_cls (dimension [1, 768]), which can represent the global semantic information of the entire face. Additionally, the <cls> token can be removed from the multiple initial local semantic features corresponding to a single view, yielding a set of 256 768-dimensional features (i.e., the current local semantic feature clip_patch, with dimensions [1, 256, 768]). Each of these features describes what the corresponding image patch is (e.g., an eye, a piece of skin, a strand of hair, etc.) and retains the positional information of the image patch. In other words, the current local semantic feature clip_patch retains the spatial structure of the image patches. For example, the feature of the image patch in the upper left corner corresponds to the left eye region, while the feature of the image patch in the lower right corner corresponds to the mouth region. Furthermore, the current local semantic feature clip_patch with dimensions [1, 256, 768] can be reshaped into a 16×16×768 feature, which can characterize the semantic distribution of local facial regions (e.g., the semantic features of the eyes, nose, and mouth).
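The token bookkeeping in [0030] can be checked with a short NumPy sketch. It does not run a CLIP model; it only splits an encoder output of the stated shape. The function name `split_tokens` and the convention that the <cls> token sits at index 0 of the sequence (the usual ViT layout) are assumptions.

```python
import numpy as np

IMAGE_SIZE = 224   # side length of the first corrected binocular image
PATCH_SIZE = 14    # preset patch resolution
EMBED_DIM = 768    # feature dimension stated in the text

def split_tokens(encoder_output):
    """Split a visual-encoder output [1, 1 + N, E] into global and local features.

    ASSUMPTION: the <cls> token is the first token of the sequence.
    """
    clip_cls = encoder_output[:, 0, :]        # [1, E]    global semantic feature
    clip_patch = encoder_output[:, 1:, :]     # [1, N, E] local semantic features
    side = IMAGE_SIZE // PATCH_SIZE           # 224 / 14 = 16 patches per side
    # Restore the spatial layout of the patches: row-major, top-left first.
    patch_grid = clip_patch.reshape(side, side, -1)
    return clip_cls, clip_patch, patch_grid
```

Since 224 / 14 = 16, each view yields 16 × 16 = 256 patches, which is why clip_patch has shape [1, 256, 768] and reshapes cleanly to 16×16×768.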

[0031] Step S14: Construct a disparity cost volume based on the current binocular image, and perform cost aggregation on the disparity cost volume to generate a disparity cost space.

[0032] In this embodiment, after semantic feature extraction of the left and right views in the current binocular image, the matching degree of each pixel in the left and right views under different disparity assumptions can be quantitatively calculated based on the binocular vision imaging principle to construct a disparity cost volume. Then, cost aggregation and disparity regression are performed on the disparity cost volume to generate a pixel-level three-dimensional feature space (i.e., disparity cost space) to characterize the three-dimensional depth distribution characteristics of the face target.

[0033] It is understandable that when observing a real 3D human face, there are significant differences in the horizontal disparity produced by different depth regions under the left and right views. For example, the disparity is large in protruding areas such as the tip of the nose and the forehead, while the disparity is small in concave or flat areas such as the eye sockets and cheeks. The disparity cost space can quantify this pixel-level positional deviation caused by depth changes into a computable feature representation, which essentially constitutes a 3D depth feature map of the human face.
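The link between depth and disparity in [0033] follows from the standard relation for a rectified pinhole stereo pair, disparity = focal length × baseline / depth. The source does not state this formula; the sketch below, including the function name `disparity_px` and the numeric values in the usage note, is illustrative only.

```python
def disparity_px(depth_m, focal_px, baseline_m):
    """Horizontal disparity in pixels for a point at the given depth.

    Standard rectified-stereo relation: d = f * B / Z.
    """
    return focal_px * baseline_m / depth_m
```

For example, with a hypothetical focal length of 800 pixels and a 6 cm baseline, a nose tip at 0.40 m gives a disparity of about 120 pixels, while a cheek at 0.43 m gives about 112 pixels. A flat printed photo held at a single depth would instead give nearly the same disparity at every pixel.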

[0034] In this embodiment, the step of constructing a disparity cost volume based on the current binocular image and performing cost aggregation on the disparity cost volume to generate a disparity cost space may specifically include: calculating, using a convolutional network, the matching cost of each pixel in the left and right views of the corrected binocular image within a preset disparity range to generate a disparity cost volume; and performing cost aggregation on the disparity cost volume using a differentiable semi-global matching algorithm to obtain the disparity cost space. In this embodiment, a 3×3 convolutional network (which fits the fine structure of facial features and avoids blurring details) is first used to calculate the matching relationship of each pixel in the left and right views of the 256×256 second corrected binocular image within a preset disparity range (e.g., 0-64, covering the complete depth span of the face from the forehead to the chin), generating a five-dimensional cost volume [B, C, D, H, W], where B is the batch size (default 1), C is the number of feature channels (default 1), D is the disparity dimension (default 64), and H and W are the height and width, respectively (both 256). Next, the five-dimensional cost volume [B, C, D, H, W] is reshaped by merging the number of feature channels C and the disparity dimension D into a single channel dimension C×D, which compresses it into a standard four-dimensional tensor [B, C×D, H, W]. Each channel value of this tensor corresponds to the matching confidence of the pixel at the corresponding disparity, fully preserving the depth differences and structural details of each region of the face. Furthermore, cost aggregation is performed on the four-dimensional cost volume using the differentiable semi-global matching (SGM) algorithm, resulting in a disparity cost space with dimensions [1, 64, 256, 256]. This space fully records the matching cost of each pixel under different disparities and constitutes a stable three-dimensional depth feature carrier.
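The construction and reshaping of the cost volume can be illustrated with the following minimal sketch. It is an assumption-laden simplification: the matching cost is a plain absolute difference rather than the 3×3 convolutional network of the embodiment, the function name `build_cost_volume` is hypothetical, and the toy sizes are chosen only to keep the example small.

```python
import numpy as np


def build_cost_volume(left, right, max_disp):
    """Absolute-difference cost volume of shape [B, C, D, H, W].

    left, right: rectified feature maps of shape [B, C, H, W].
    Entry [b, c, d, y, x] compares left[b, c, y, x] with right[b, c, y, x - d].
    """
    b, c, h, w = left.shape
    # Columns with x < d have no valid match; keep them at a large cost.
    volume = np.full((b, c, max_disp, h, w), 1e3, dtype=np.float32)
    for d in range(max_disp):
        if d == 0:
            volume[:, :, d] = np.abs(left - right)
        else:
            volume[:, :, d, :, d:] = np.abs(left[..., d:] - right[..., :-d])
    return volume


rng = np.random.default_rng(0)
right = rng.standard_normal((1, 1, 8, 16)).astype(np.float32)
left = np.roll(right, 2, axis=-1)  # synthetic 2-pixel disparity

volume5d = build_cost_volume(left, right, max_disp=4)  # [1, 1, 4, 8, 16]
b, c, d, h, w = volume5d.shape
volume4d = volume5d.reshape(b, c * d, h, w)            # merge C and D
best_disp = volume4d.argmin(axis=1)                    # winner-take-all check
```

In this toy case the lowest cost for every column with a valid match lies at disparity 2, which is the shift that was applied to create the left view.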

[0035] It should be noted that during the generation of the disparity cost space, the smoothness and global constraints of the cost space can be achieved through multi-path energy optimization. In this way, the ability of the semi-global matching algorithm to finely depict the three-dimensional structure of the face can be preserved, while ensuring that the entire process is differentiable and can be trained end-to-end.
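For illustration only, the path-wise energy optimization of semi-global matching can be sketched for a single left-to-right path using the classic recurrence with penalties P1 and P2. The penalty values and the function name are assumptions. The hard minimum below corresponds to classic SGM; a differentiable variant might, for example, replace it with a soft minimum, which is an assumption and not something the text specifies. A multi-path version would sum the results of several such passes in different directions.

```python
import numpy as np


def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate a [D, H, W] cost volume along the left-to-right path."""
    num_disp, height, width = cost.shape
    agg = np.empty_like(cost)
    agg[:, :, 0] = cost[:, :, 0]
    for x in range(1, width):
        prev = agg[:, :, x - 1]                     # [D, H]
        prev_min = prev.min(axis=0, keepdims=True)  # [1, H]
        # Neighbouring disparities d-1 and d+1 (infinite cost outside range).
        pad = np.pad(prev, ((1, 1), (0, 0)), constant_values=np.inf)
        minus = pad[:-2] + p1   # previous pixel at disparity d-1
        plus = pad[2:] + p1     # previous pixel at disparity d+1
        jump = prev_min + p2    # any larger disparity change
        best = np.minimum(np.minimum(prev, minus), np.minimum(plus, jump))
        # Subtracting prev_min keeps the values bounded along the path.
        agg[:, :, x] = cost[:, :, x] + best - prev_min
    return agg


rng = np.random.default_rng(0)
cost = rng.random((8, 4, 12))  # toy [D, H, W] matching costs
agg = aggregate_left_to_right(cost)
```

Each aggregated value stays between the raw cost and the raw cost plus P2, so the smoothness term penalizes disparity jumps without letting the accumulated energy grow along the path.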

[0036] Step S15: Encode the disparity cost space into word embeddings of the same dimension as the current face semantic features using the transformer encoder to obtain geometric embedding features.

[0037] In this embodiment, after generating the disparity cost space, the disparity cost space with dimensions [1, 64, 256, 256] can be encoded into word embeddings of the same dimension as the current face semantic features by the transformer encoder in the model, thus obtaining geometric embedding features.

[0038] In this embodiment, the step of encoding the disparity cost space into word embeddings of the same dimension as the current facial semantic features through the transformer encoder to obtain geometric embedding features may specifically include: projecting the disparity cost space, along the disparity dimension, into word embeddings of the same dimension as the current local semantic features to obtain a feature map; flattening the feature map into a one-dimensional sequence to obtain a first feature sequence, and adding a second classification marker and a position encoding to the first feature sequence to obtain a second feature sequence; and inputting the second feature sequence into the transformer encoder, which encodes it to obtain global geometric features representing the entire face and local geometric features representing different face regions. In this embodiment, the disparity cost space of dimensions [1, 64, 256, 256] is first projected along the disparity dimension (64 channels) to obtain a feature map with a resolution of 16×16 and 768 channels (the same dimension as the current local semantic features, thus achieving dimension alignment). Then, the feature map is flattened into a one-dimensional sequence to obtain a token sequence (i.e., the first feature sequence) with a shape of [1, 256, 768], where 256 = 16×16 corresponds to the number of local regions (i.e., patches) into which the face image is divided. Next, a learnable geometric classification token (geo_cls) is added to the beginning of the first feature sequence; its function is the same as that of the <cls> token described above, namely to aggregate the global geometric information of the entire sequence. Then, a learnable positional encoding (Position Embedding) is added to the sequence containing the geometric classification token to inject spatial position information, resulting in the second feature sequence. Further, the second feature sequence is input into the transformer encoder for encoding, yielding the global geometric feature geo_cls (corresponding to the output of the first token in the second feature sequence, with dimensions [1, 768], which represents the overall three-dimensional geometric structure of the entire face) and the local geometric feature geo_patch (corresponding to the outputs of the other 256 patch tokens in the second feature sequence, excluding the first token, with dimensions [1, 256, 768], which represents the fine geometric distribution of each local region of the face).
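A minimal sketch of how the second feature sequence might be assembled is given below. The tensor shapes follow the text; how the 256×256 map is reduced to 16×16 is not detailed there, so the 16×16 average pooling, the random projection weights `w_proj`, the zero-initialized `geo_cls_token` and the random `pos_embed` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, W, GRID, DIM = 64, 256, 256, 16, 768

# Stand-in for the aggregated disparity cost space, [1, 64, 256, 256].
cost_space = rng.standard_normal((1, D, H, W)).astype(np.float32)

# Assumption: average-pool each 16x16 pixel block so the map becomes 16x16.
patch = H // GRID
pooled = cost_space.reshape(1, D, GRID, patch, GRID, patch).mean(axis=(3, 5))

# Linear projection along the disparity dimension: 64 -> 768 channels.
w_proj = (rng.standard_normal((D, DIM)) * 0.02).astype(np.float32)
feature_map = np.einsum("bdhw,dc->bhwc", pooled, w_proj)   # [1, 16, 16, 768]

# Flatten the 16x16 grid into the first feature sequence.
first_sequence = feature_map.reshape(1, GRID * GRID, DIM)  # [1, 256, 768]

# Prepend the geometric classification token and add position embeddings.
geo_cls_token = np.zeros((1, 1, DIM), dtype=np.float32)
pos_embed = (rng.standard_normal((1, GRID * GRID + 1, DIM)) * 0.02).astype(np.float32)
second_sequence = np.concatenate([geo_cls_token, first_sequence], axis=1) + pos_embed
```

The resulting sequence has 257 tokens of 768 dimensions, which matches the layout of the semantic tokens and is what allows the two modalities to be concatenated later without extra dimension mapping.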

[0039] As can be seen, this application first constructs the disparity cost volume and aggregates it through the differentiable SGM algorithm to capture the original depth information of the face, and then performs global relation modeling and feature enhancement through the Transformer encoder, finally generating global geometric features (geo_cls) and local geometric features (geo_patch) with the same dimension as the semantic features. This approach unifies the semantic features and geometric features in terms of dimension and structure, which lays the foundation for efficient subsequent feature fusion.

[0040] In this embodiment, encoding the second feature sequence to obtain global geometric features representing the entire face and local geometric features representing different face regions may specifically include: establishing a long-range dependency relationship of the three-dimensional structure of the face in the corrected binocular image based on the second feature sequence and using a self-attention mechanism, and encoding the second feature sequence based on the long-range dependency relationship to obtain global geometric features representing the three-dimensional geometric structure of the entire face and local geometric features representing different face regions; wherein, the dimension of the global geometric features is the same as the dimension of the current global semantic features. In this embodiment, during the encoding of the second feature sequence, the Transformer encoder can establish long-range dependencies and spatial relationships of the three-dimensional structure of the face through a self-attention mechanism, such as the overall contour from the forehead to the jaw and the spatial positional relationship between facial features. Then, based on the long-range dependencies and spatial relationships, the second feature sequence is encoded to obtain local geometric features geo_patch (dimension [1, 256, 768]) representing different facial regions, and global geometric features geo_cls (dimension [1, 768]) representing the three-dimensional geometric structure of the entire face and having the same dimension as the current global semantic features.
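The long-range dependency modeling described above can be illustrated with a single-head scaled dot-product self-attention layer. This is a sketch only: the actual encoder depth, number of heads and weights are not given in the text, and the random inputs and weight matrices below are hypothetical stand-ins.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over [B, N, C] tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # [B, N, N]
    weights = softmax(scores, axis=-1)  # each token attends to all tokens
    return weights @ v, weights


rng = np.random.default_rng(0)
DIM = 768
second_sequence = rng.standard_normal((1, 257, DIM))  # stand-in input
w_q, w_k, w_v = (rng.standard_normal((DIM, DIM)) * 0.02 for _ in range(3))

encoded, weights = self_attention(second_sequence, w_q, w_k, w_v)
geo_cls = encoded[:, 0, :]     # global geometric feature, [1, 768]
geo_patch = encoded[:, 1:, :]  # local geometric features, [1, 256, 768]
```

Every row of `weights` spans all 257 tokens, so the output for a patch near the forehead can draw on patches near the jaw, which is the sense in which the dependencies are long-range.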

[0041] It should be noted that by establishing the long-range dependencies and spatial relationships of the three-dimensional structure of the face through the self-attention mechanism, that is, the correlation between different local geometric features, isolated deep features can be aggregated into a three-dimensional representation containing global structural constraints of the face. This mechanism not only enables the global geometric feature geo_cls to effectively learn the overall three-dimensional structural information of the face, but also enables the local geometric feature geo_patch to fuse cross-regional contextual information, thereby improving the robustness and structural consistency of the geometric features. This can provide high-quality geometric features with structural matching and information alignment for subsequent multi-dimensional feature splicing and cross-modal fusion.

[0042] Step S16: Use a bidirectional modulation attention mechanism to concatenate the geometric embedding features and the current facial semantic features along the feature dimension to obtain the current concatenated features.

[0043] In this embodiment, after obtaining the geometric embedding features (used to characterize the three-dimensional structural features of the face, such as depth, concavity and convexity, and spatial position) and the current facial semantic features (used to characterize the planar appearance features of the face, such as texture, contour, and forgery traces), these two features from different dimensions can be concatenated (i.e., fused) along the feature dimension using a bidirectional modulation attention mechanism (which achieves fine-grained cross-modal fusion and generates complementary fused features) to obtain the current concatenated features.

[0044] In this embodiment, the step of using a bidirectional modulation attention mechanism to concatenate the geometric embedding features and the current facial semantic features along the feature dimension to obtain the current concatenated features includes: concatenating the global geometric features and the current global semantic features along the feature dimension to obtain the global concatenated features; concatenating the local geometric features and the current local semantic features along the feature dimension to obtain the local concatenated features; performing feature enhancement on the global concatenated features and the local concatenated features using a bidirectional modulation attention mechanism to obtain enhanced global features and enhanced local features; performing a global average pooling operation on the enhanced local features to obtain local pooled features; and concatenating the enhanced global features and the local pooled features along the feature dimension to obtain the current concatenated features. In this embodiment, after obtaining the local geometric feature geo_patch (dimension [1, 256, 768], which describes the fine-grained three-dimensional depth geometric information of local areas of the face), the global geometric feature geo_cls (dimension [1, 768], which aggregates the global three-dimensional depth structure information of the entire face), the current global semantic feature clip_cls (dimension [1, 768], which aggregates the global semantic information of the entire face), and the current local semantic feature clip_patch (dimension [1, 256, 768], which describes the fine-grained semantic information of the local facial features and textures), a bidirectional modulation attention mechanism can be used to concatenate the above four features along the feature dimension (i.e., the channel dimension) to obtain the fused current concatenated features. Since the two local features have the same dimension, as do the two global features, they can be directly concatenated along the channel dimension without additional dimension mapping. This fusion method preserves the complete and independent information of the semantic and geometric features, without compressing the feature dimension or losing detailed discriminative information. Compared with element-wise addition and attention-weighted fusion, it better accommodates both semantic artifact discrimination and three-dimensional geometric authenticity verification, thus ensuring that the two types of complementary features are completely preserved.

[0045] Specifically, the two global features (the global geometric feature geo_cls and the current global semantic feature clip_cls) are first concatenated along the feature dimension (i.e., the channel dimension) to fuse the overall semantic attributes and the overall geometric structure of the face. This takes into account both the semantic anomalies of forgery at the global level and the authenticity of the three-dimensional structure, forming a dual-dimension (semantic and geometric) global basis for discrimination. After fusion, a global concatenated feature fused_cls (dimension [1, 1536]) is obtained. Similarly, the two local features (i.e., the local geometric feature geo_patch and the current local semantic feature clip_patch) are concatenated token by token along the feature dimension (i.e., the channel dimension) to fuse the semantic details and geometric details of local facial regions and to ensure accurate matching of the semantic and geometric features of local regions such as the facial features and skin. This facilitates the subsequent accurate localization of forgery traces. After fusion, a local concatenated feature fused_patch (dimension [1, 256, 1536]) is obtained.
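The channel-wise concatenation described above amounts to the following shape arithmetic. The random arrays stand in for the four features, and the ordering of the semantic and geometric halves is an assumption, since the text does not fix it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the four features named in the text.
clip_cls = rng.standard_normal((1, 768))
geo_cls = rng.standard_normal((1, 768))
clip_patch = rng.standard_normal((1, 256, 768))
geo_patch = rng.standard_normal((1, 256, 768))

# Global concatenation along the channel dimension: 768 + 768 = 1536.
fused_cls = np.concatenate([clip_cls, geo_cls], axis=-1)        # [1, 1536]

# Token-by-token local concatenation: patch i keeps both of its halves.
fused_patch = np.concatenate([clip_patch, geo_patch], axis=-1)  # [1, 256, 1536]
```

Because concatenation does not mix values, the first 768 channels of every fused token remain exactly the semantic feature and the last 768 channels remain exactly the geometric feature.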

[0046] Furthermore, a bidirectional modulation attention mechanism is used to integrate semantic and geometric information into the aforementioned globally and locally stitched features. This preserves the independent discriminative ability of both types of features while also establishing their correlation, resulting in the enhanced global feature `fused_global` and the enhanced local feature `fused_local`. It should be noted that the bidirectional modulation attention mechanism constructs bidirectional interaction pathways between semantics and geometry, and between geometry and semantics, during the bidirectional modulation of the two types of features. Specifically, in the semantic-geometric attention calculation, semantic features serve as the query vector to guide geometric features to focus on high-risk areas of forgery. Conversely, in the geometry-semantic attention calculation, geometric features serve as the query vector to filter artifact information in semantic features that does not conform to the three-dimensional physical structure, thereby achieving bidirectional enhancement and isomorphic fusion of cross-modal features.
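One possible reading of the two interaction pathways is sketched below as a pair of cross-attention passes over the semantic and geometric halves of the concatenated features. The text does not give the exact formulation, so the absence of learned projections, the residual additions, the processing of the global token and the patch tokens as one 257-token sequence, and the function names are all assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context):
    """Each query token gathers a weighted average of the context tokens."""
    scores = query @ context.transpose(0, 2, 1) / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ context


def bidirectional_modulation(semantic, geometric):
    # Semantic -> geometric: semantic tokens act as queries over geometry.
    geo_enhanced = geometric + cross_attention(semantic, geometric)
    # Geometric -> semantic: geometric tokens act as queries over semantics.
    sem_enhanced = semantic + cross_attention(geometric, semantic)
    return np.concatenate([sem_enhanced, geo_enhanced], axis=-1)


rng = np.random.default_rng(0)
semantic = rng.standard_normal((1, 257, 768))   # clip_cls + clip_patch stand-ins
geometric = rng.standard_normal((1, 257, 768))  # geo_cls + geo_patch stand-ins

fused = bidirectional_modulation(semantic, geometric)  # [1, 257, 1536]
fused_global = fused[:, 0, :]   # enhanced global feature, [1, 1536]
fused_local = fused[:, 1:, :]   # enhanced local features, [1, 256, 1536]
```

The residual form keeps each modality's own information intact while adding evidence selected by the other modality, which is one way to preserve independent discriminative ability while establishing a correlation between the two.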

[0047] Next, to balance macroscopic discriminative power and fine-grained local recognition capability, a global average pooling operation can be performed on the enhanced local feature `fused_local`. This involves averaging the values along the sequence dimension to obtain a local pooled feature (dimension [1, 1536]). Then, the enhanced global feature `fused_global` (dimension [1, 1536]) and the local pooled feature are concatenated along the feature dimension (channel dimension) to obtain a dimensionally regular, information-complementary multi-scale fusion feature (i.e., the current concatenated feature). Through these feature enhancement, global average pooling, and multi-scale feature concatenation operations, the limitations of traditional single-modal feature representation and the lack of feature interaction in simple concatenation fusion are effectively addressed. Furthermore, this provides high-quality discriminative features with physical realism, high robustness, and high interpretability for the subsequent classification head.
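The pooling and final concatenation can be sketched as follows, using random stand-ins for the enhanced features. Only the shapes are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
fused_global = rng.standard_normal((1, 1536))      # enhanced global feature
fused_local = rng.standard_normal((1, 256, 1536))  # enhanced local features

# Global average pooling over the sequence (token) dimension.
local_pooled = fused_local.mean(axis=1)            # [1, 1536]

# Multi-scale concatenation: 1536 + 1536 = 3072 channels.
current_feature = np.concatenate([fused_global, local_pooled], axis=-1)
```

The resulting [1, 3072] vector is the input size expected by the classification head.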

[0048] Step S17: Input the current concatenated features into the classification head to identify the authenticity of the target face in the binocular image and obtain the face identification result.

[0049] In this embodiment, after the features from the two different dimensions are fused, the fused 3072-dimensional current concatenated features can be input into a lightweight classifier, such as an MLP (Multilayer Perceptron), SVM (Support Vector Machine), or lightweight CNN (Convolutional Neural Network), to identify the authenticity of the target faces appearing in the binocular images and obtain the face identification result. The number of target faces can be one or more. When a binary classifier is used, the authenticity of the current target face can be determined directly from the output. For example, the binary classifier outputs 1 for a real face and 0 for a fake face.
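As a non-limiting illustration of the MLP option, a two-layer classification head over the 3072-dimensional feature might look like the sketch below. The hidden width of 256, the random untrained weights, the sigmoid output and the 0.5 decision threshold are assumptions and are not specified in the text.

```python
import numpy as np


def mlp_head(x, w1, b1, w2, b2):
    """Two-layer MLP: ReLU hidden layer, sigmoid probability of 'real'."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    logit = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))


rng = np.random.default_rng(0)
IN_DIM, HIDDEN = 3072, 256  # hidden width is a hypothetical choice

current_feature = rng.standard_normal((1, IN_DIM))  # stand-in input
w1 = rng.standard_normal((IN_DIM, HIDDEN)) * 0.02
b1 = np.zeros(HIDDEN)
w2 = rng.standard_normal((HIDDEN, 1)) * 0.02
b2 = np.zeros(1)

prob_real = mlp_head(current_feature, w1, b1, w2, b2)  # [1, 1]
label = (prob_real >= 0.5).astype(int)                 # 1 = real, 0 = fake
```

With trained weights, `label` would follow the convention in the text, where 1 denotes a real face and 0 denotes a fake face.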

[0050] As can be seen, this embodiment uses a neural network model including a transformer encoder, a CLIP model (including a visual encoder based on the ViT architecture), and a classification head to distinguish between real and fake faces appearing in binocular images. The specific identification process is as follows: First, the CLIP model extracts semantic features from the left and right views of the binocular image to obtain facial semantic features. Then, a disparity cost volume is constructed based on the current binocular image, and cost aggregation is performed on the disparity cost volume to generate a disparity cost space. Next, the transformer encoder encodes the disparity cost space into word embeddings of the same dimension as the current facial semantic features to obtain geometric embedding features. Finally, a bidirectional modulation attention mechanism is used to fuse the two features from different dimensions (the geometric embedding features and the facial semantic features), and face authenticity is determined based on the fused multi-dimensional features. Because this method fuses multiple features from different dimensions, it uses complementary, non-overlapping features to cover attack blind spots. It can capture both the planar textures and forgery artifacts of the face and its true three-dimensional structure, enabling physical verification of authenticity. Compared with single-dimensional face identification methods, this significantly improves the accuracy of anti-counterfeiting identification and reduces missed and false detections. Furthermore, this application uses binocular images acquired by a binocular acquisition device (such as a binocular RGB camera) for face identification, resulting in higher identification accuracy than monocular image-based methods. Moreover, this application does not require dedicated image acquisition equipment; a common binocular RGB camera can be used to defend against different types of forgery attacks (such as attacks involving photos, screen captures, 3D masks, and unknown generative models). This reduces identification costs while maintaining high security, making the method widely applicable to various types of face recognition systems and providing strong anti-counterfeiting identification capabilities against multiple types of forgery attacks.

[0051] Specifically, as shown in Figure 2, the binocular images acquired by the current binocular acquisition device are first preprocessed and corrected (for example, by distortion correction and epipolar correction) to ensure that corresponding pixels in the left and right views of the binocular images fall on the same horizontal row. Then, the corrected images are input into the CLIP model for semantic feature extraction to obtain facial semantic features. At the same time, a disparity cost volume is constructed based on the corrected images, and cost aggregation is performed on the disparity cost volume to generate a disparity cost space. The disparity cost space is then input into the transformer encoder, which encodes it into word embeddings of the same dimension as the facial semantic features, resulting in geometric embedding features. Then, a bidirectional modulation attention mechanism is used to concatenate the geometric embedding features and facial semantic features along the feature dimension to obtain the concatenated features. Finally, the concatenated features are input into the classification head to identify the authenticity of faces in the binocular images. This application uses dual-modal features (i.e., geometric embedding features and facial semantic features) that are complementary, non-overlapping, and cover attack blind spots for face anti-counterfeiting identification. It not only considers the spatial depth information of the face, but also recognizes the subtle physiological features inherent in living beings, such as skin physiological details and micro-movements. It can effectively distinguish between real faces and fake masks, thereby achieving accurate anti-counterfeiting detection against various advanced forgery attacks.

[0052] Accordingly, this application also discloses a face anti-counterfeiting identification device, applied to a face recognition system. As shown in Figure 3, the device includes: an image acquisition module 11, used to acquire the binocular image captured by the current binocular acquisition device to obtain the current binocular image, the current binocular image including a left view and a right view; an image input module 12, used to input the current binocular image into the trained face anti-spoofing identification model, the face anti-spoofing identification model being a model obtained by training, using historical binocular images, a neural network model containing a transformer encoder, a CLIP model and a classification head, and the CLIP model containing a visual encoder based on the ViT architecture; a feature extraction module 13, used to extract semantic features from the left and right views in the current binocular image using the CLIP model to obtain the current facial semantic features; a construction module 14, used to construct a disparity cost volume based on the current binocular image; an aggregation module 15, used to perform cost aggregation on the disparity cost volume to generate a disparity cost space; an encoding module 16, used to encode the disparity cost space into word embeddings of the same dimension as the current face semantic features through the transformer encoder, thereby obtaining geometric embedding features; a splicing module 17, used to concatenate the geometric embedding features and the current face semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain the current concatenated features; and a face recognition module 18, used to input the current concatenated features into the classification head to identify the authenticity of the target face in the binocular image and obtain the face identification result.

[0053] The specific workflow of each of the above modules can be found in the relevant content disclosed in the foregoing embodiments, and will not be repeated here.

[0054] Furthermore, embodiments of this application also disclose an electronic device. Figure 4 is a structural diagram of an electronic device 20 according to an exemplary embodiment. The content of the diagram should not be construed as limiting the scope of this application.

[0055] Figure 4 is a schematic diagram of the structure of an electronic device 20 provided in an embodiment of this application. Specifically, the electronic device 20 may include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input / output interface 25, and a communication bus 26. The memory 22 stores a computer program, which is loaded and executed by the processor 21 to implement the relevant steps in the face anti-spoofing identification method disclosed in any of the foregoing embodiments. Furthermore, the electronic device 20 in this embodiment may specifically be an electronic computer.

[0056] In this embodiment, the power supply 23 is used to provide operating voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and external devices, and the communication protocol it follows can be any communication protocol applicable to the technical solution of this application, and is not specifically limited here; the input / output interface 25 is used to acquire external input data or output data to the outside world, and its specific interface type can be selected according to specific application needs, and is not specifically limited here.

[0057] In addition, the memory 22, as a carrier for resource storage, can be a read-only memory, random access memory, disk or optical disk, etc. The resources stored thereon can include operating system 221, computer program 222, etc., and the storage method can be temporary storage or permanent storage.

[0058] The operating system 221, which may be Windows Server, Netware, Unix, Linux, etc., is used to manage and control the various hardware devices on the electronic device 20 and the computer program 222. In addition to the computer program that, when executed by the electronic device 20, performs the face anti-spoofing identification method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs for performing other specific tasks.

[0059] Furthermore, this application also discloses a computer-readable storage medium for storing a computer program; wherein, when the computer program is executed by a processor, it implements the aforementioned face anti-spoofing identification method. Specific steps of this method can be found in the corresponding content disclosed in the foregoing embodiments, and will not be repeated here.

[0060] Furthermore, embodiments of this application also disclose a computer program product, including a computer program / instructions, which, when executed by a processor, implement the steps of the aforementioned face anti-counterfeiting identification method.

[0061] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to in the method section.

[0062] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0063] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software module executed by a processor, or a combination of both. The software module can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0064] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0065] The foregoing has provided a detailed description of the face anti-counterfeiting identification method, device, equipment, and storage medium provided in this application. Specific examples have been used herein to illustrate the principles and implementations of this application, and the descriptions of the above embodiments are intended only to help in understanding the method and core ideas of this application. Meanwhile, those skilled in the art may, based on the ideas of this application, make changes to the specific implementations and scope of application. Therefore, the content of this specification should not be construed as limiting this application.

Claims

1. A face anti-counterfeiting identification method, characterized in that it is applied to a face recognition system and comprises: acquiring the binocular image captured by the current binocular acquisition device to obtain the current binocular image, the current binocular image including a left view and a right view; inputting the current binocular image into the trained face anti-spoofing identification model, the face anti-spoofing identification model being a model obtained by training, using historical binocular images, a neural network model containing a transformer encoder, a CLIP model and a classification head, and the CLIP model containing a visual encoder based on the ViT architecture; extracting semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current face semantic features; constructing a disparity cost volume based on the current binocular image, and performing cost aggregation on the disparity cost volume to generate a disparity cost space; encoding the disparity cost space into word embeddings of the same dimension as the current face semantic features through the transformer encoder to obtain geometric embedding features; concatenating the geometric embedding features and the current face semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain the current concatenated features; and inputting the current concatenated features into the classification head to identify the authenticity of the target face in the binocular image and obtain the face identification result.

2. The face anti-counterfeiting identification method according to claim 1, characterized in that, The step of inputting the current binocular image into the trained face anti-spoofing identification model includes: Obtain the binocular stereo calibration parameters of the binocular acquisition device; Based on the binocular stereo calibration parameters, distortion correction and epipolar correction are performed on the left and right views in the current binocular image respectively, so that the corresponding pixels of the left and right views are located on the same scan line, and the corrected binocular image is obtained. The corrected binocular image is input into the trained face anti-spoofing identification model.

3. The face anti-counterfeiting identification method according to claim 2, characterized in that the step of extracting semantic features from the left and right views of the current binocular image using the CLIP model to obtain the current facial semantic features includes: segmenting the left and right views in the corrected binocular image according to a preset resolution to obtain multiple segmented image patches corresponding to a single view; extracting semantic features from each segmented image patch corresponding to a single view using the CLIP model to obtain initial local semantic features; concatenating the initial local semantic features corresponding to a single view with a learnable first classification marker to obtain concatenated semantic features; inputting the concatenated semantic features into the visual encoder in the CLIP model to encode the concatenated semantic features using a self-attention mechanism to obtain the current global semantic features; and removing the first classification marker from the initial local semantic features corresponding to a single view to obtain the current local semantic features, the current local semantic features containing the position information of the image patches.

4. The face anti-counterfeiting identification method according to claim 3, characterized in that the step of constructing a disparity cost volume based on the current binocular image and performing cost aggregation on the disparity cost volume to generate a disparity cost space comprises:
calculating, using a convolutional network, the matching cost of each pixel of the left view and the right view of the corrected binocular image within a preset disparity range to generate the disparity cost volume; and
performing cost aggregation on the disparity cost volume using a differentiable semi-global matching algorithm to obtain the disparity cost space.
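As a rough illustration of the cost volume and its aggregation, the sketch below uses two simplifications that are not part of the claim: the matching cost is a plain absolute intensity difference rather than the output of a convolutional network, and the aggregation is a single left-to-right pass of the classic (hard-minimum, non-differentiable) semi-global matching recurrence. The penalties `p1` and `p2` and the fill value are illustrative assumptions.

```python
import numpy as np

def build_cost_volume(left, right, max_disp, fill=1.0):
    """cost[d, y, x] = |left[y, x] - right[y, x - d]| on rectified views.

    Positions with x < d have no match candidate and keep the fill cost.
    """
    h, w = left.shape
    cost = np.full((max_disp, h, w), fill, dtype=np.float64)
    for d in range(max_disp):
        # Left pixel x is compared with right pixel x - d on the same scan line.
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return cost

def aggregate_left_to_right(cost, p1=0.05, p2=0.5):
    """One SGM path: penalise disparity changes of 1 by p1, larger ones by p2."""
    _, _, w = cost.shape
    agg = np.empty_like(cost)
    agg[:, :, 0] = cost[:, :, 0]
    for x in range(1, w):
        prev = agg[:, :, x - 1]                       # (D, H)
        prev_min = prev.min(axis=0, keepdims=True)    # (1, H)
        pad = np.pad(prev, ((1, 1), (0, 0)), constant_values=np.inf)
        minus = pad[:-2] + p1                         # previous cost at d - 1
        plus = pad[2:] + p1                           # previous cost at d + 1
        jump = np.broadcast_to(prev_min + p2, prev.shape)
        best = np.minimum.reduce([prev, minus, plus, jump])
        agg[:, :, x] = cost[:, :, x] + best - prev_min
    return agg

rng = np.random.default_rng(0)
left = rng.random((4, 12))
right = np.roll(left, -2, axis=1)       # synthetic pair with true disparity 2
cost = build_cost_volume(left, right, max_disp=4)
disparity = cost.argmin(axis=0)         # winner-takes-all on the raw volume
cost_space = aggregate_left_to_right(cost)
```

A full semi-global matcher would sum several path directions, and a differentiable variant would replace the hard minimum with a smooth approximation.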

5. The face anti-counterfeiting identification method according to claim 3, characterized in that the step of encoding, by the transformer encoder, the disparity cost space into word embeddings of the same dimension as the current facial semantic features to obtain geometric embedding features comprises:
projecting the disparity cost space, along the disparity dimension, onto word embeddings of the same dimension as the current local semantic features to obtain a feature map;
flattening the feature map into a one-dimensional sequence to obtain a first feature sequence, and adding a second classification token and a position encoding to the first feature sequence to obtain a second feature sequence; and
inputting the second feature sequence into the transformer encoder to encode the second feature sequence, thereby obtaining global geometric features representing the entire face and local geometric features representing different face regions.
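The projection, flattening and token preparation of claim 5 can be sketched as follows. The claim does not specify the form of the position encoding, so a fixed sinusoidal encoding is assumed here; the projection matrix `proj`, the zero-initialised `cls_token` and all sizes are likewise hypothetical.

```python
import numpy as np

def sinusoidal_positions(n: int, dim: int) -> np.ndarray:
    """Fixed sinusoidal position encoding of shape (n, dim); dim must be even."""
    assert dim % 2 == 0
    pos = np.arange(n)[:, None]
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000 ** (2 * i / dim))
    enc = np.zeros((n, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def cost_space_to_sequence(cost_space, proj, cls_token):
    """Turn a (D, H, W) cost space into an (H * W + 1, E) token sequence."""
    _, h, w = cost_space.shape
    # Project the disparity dimension D onto the embedding dimension E.
    feature_map = np.einsum("dhw,de->hwe", cost_space, proj)    # (H, W, E)
    first_seq = feature_map.reshape(h * w, -1)                  # (H * W, E)
    second_seq = np.concatenate([cls_token[None, :], first_seq], axis=0)
    return second_seq + sinusoidal_positions(h * w + 1, proj.shape[1])

rng = np.random.default_rng(0)
cost_space = rng.random((4, 2, 3))            # toy (D, H, W) cost space
proj = rng.standard_normal((4, 16)) * 0.02    # hypothetical learned projection
cls_token = np.zeros(16)                      # learnable in a real model
second_seq = cost_space_to_sequence(cost_space, proj, cls_token)
```

In practice the spatial resolution of the feature map would presumably be reduced to match the image patch grid, so that each geometric token lines up with one semantic token.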

6. The face anti-counterfeiting identification method according to claim 5, characterized in that the step of encoding the second feature sequence to obtain global geometric features representing the entire face and local geometric features representing different face regions comprises:
establishing, based on the second feature sequence and using a self-attention mechanism, long-range dependencies of the three-dimensional structure of the face in the corrected binocular image, and encoding the second feature sequence based on the long-range dependencies to obtain global geometric features representing the three-dimensional geometric structure of the entire face and local geometric features representing different face regions;
wherein the dimension of the global geometric features is the same as the dimension of the current global semantic features.
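The self-attention step can be illustrated with a single-head scaled dot-product attention in NumPy. A real transformer encoder would add multiple heads, residual connections, layer normalisation and feed-forward layers; the weight matrices and sizes below are random placeholders, not trained parameters.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(seq, wq, wk, wv):
    """Single-head scaled dot-product self-attention over an (N, E) sequence."""
    q, k, v = seq @ wq, seq @ wk, seq @ wv
    # Every token attends to every other token, which is what allows
    # long-range dependencies across face regions to be modelled.
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
second_seq = rng.standard_normal((5, 8))   # toy sequence: 1 class token + 4 tokens
wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
encoded, weights = self_attention(second_seq, wq, wk, wv)
global_geometric = encoded[0]              # output at the classification token
local_geometric = encoded[1:]              # one row per face region
```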

7. The face anti-counterfeiting identification method according to claim 5, characterized in that the step of concatenating the geometric embedding features and the current facial semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain the current concatenated features comprises:
concatenating the global geometric features and the current global semantic features along the feature dimension to obtain global concatenated features;
concatenating the local geometric features and the current local semantic features along the feature dimension to obtain local concatenated features;
enhancing the global concatenated features and the local concatenated features using the bidirectional modulation attention mechanism to obtain enhanced global features and enhanced local features;
performing global average pooling on the enhanced local features to obtain local pooled features; and
concatenating the enhanced global features and the local pooled features along the feature dimension to obtain the current concatenated features.
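The data flow of claim 7 can be sketched as below. The claims do not define the internals of the bidirectional modulation attention mechanism, so the sigmoid gating used here (global features gate the local ones, and pooled local features gate the global ones) is purely a hypothetical placeholder; the weight matrices `w_g2l` and `w_l2g` and all sizes are assumptions as well.

```python
import numpy as np

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def fuse(global_geo, global_sem, local_geo, local_sem, w_g2l, w_l2g):
    """Concatenate, modulate in both directions, pool, and concatenate again."""
    global_cat = np.concatenate([global_geo, global_sem])          # (2E,)
    local_cat = np.concatenate([local_geo, local_sem], axis=1)     # (N, 2E)
    # Placeholder modulation (assumption): each branch gates the other.
    local_enh = local_cat * sigmoid(global_cat @ w_g2l)            # (N, 2E)
    global_enh = global_cat * sigmoid(local_cat.mean(axis=0) @ w_l2g)
    local_pooled = local_enh.mean(axis=0)                          # global average pooling
    return np.concatenate([global_enh, local_pooled])              # (4E,)

rng = np.random.default_rng(0)
e, n = 8, 4                                   # toy embedding width and token count
global_geo, global_sem = rng.standard_normal(e), rng.standard_normal(e)
local_geo, local_sem = rng.standard_normal((n, e)), rng.standard_normal((n, e))
w_g2l = np.zeros((2 * e, 2 * e))              # zero weights give a constant gate of 0.5
w_l2g = np.zeros((2 * e, 2 * e))
fused = fuse(global_geo, global_sem, local_geo, local_sem, w_g2l, w_l2g)
```

The fused vector is what a classification head would consume. The sketch requires the local geometric and local semantic sequences to have the same number of tokens, which is consistent with concatenating them along the feature dimension.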

8. A face anti-counterfeiting identification device, characterized in that it is applied to a facial recognition system and comprises:
an image acquisition module, used to acquire the binocular image captured by a binocular acquisition device to obtain a current binocular image, the current binocular image including a left view and a right view;
an image input module, used to input the current binocular image into a trained face anti-counterfeiting identification model, wherein the face anti-counterfeiting identification model is obtained by training, on historical binocular images, a neural network model containing a transformer encoder, a CLIP model and a classification head, and the CLIP model contains a visual encoder based on the ViT architecture;
a feature extraction module, used to extract semantic features from the left view and the right view of the current binocular image using the CLIP model to obtain current facial semantic features;
a construction module, used to construct a disparity cost volume based on the current binocular image;
an aggregation module, used to perform cost aggregation on the disparity cost volume to generate a disparity cost space;
an encoding module, used to encode, by the transformer encoder, the disparity cost space into word embeddings of the same dimension as the current facial semantic features to obtain geometric embedding features;
a concatenation module, used to concatenate the geometric embedding features and the current facial semantic features along the feature dimension using a bidirectional modulation attention mechanism to obtain current concatenated features; and
a face identification module, used to input the current concatenated features into the classification head to identify whether the target face in the current binocular image is real or fake, thereby obtaining a face identification result.

9. An electronic device, characterized in that it comprises a processor and a memory, wherein the processor, when executing a computer program stored in the memory, implements the face anti-counterfeiting identification method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it is used to store a computer program, wherein the computer program, when executed by a processor, implements the face anti-counterfeiting identification method according to any one of claims 1 to 7.