Method and device for generating face pictures of multiple attack styles

By generating face images in a variety of attack styles with a generative adversarial network and a multimodal model, the method addresses the difficulty of training face recognition systems against masks made of new materials. This improves the system's defense capability, reduces cost, and protects the privacy of the people whose face data is collected.

CN116912902B · Active · Publication Date: 2026-06-16 · BEIJING XUEZHITU NETWORK TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING XUEZHITU NETWORK TECH
Filing Date
2023-06-07
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

In existing technologies, facial recognition systems are continually confronted with attack masks made of new materials or produced with new processes, and face liveness detection models are difficult to train effectively against them. This reduces the reliability of those models. In addition, producing attack masks is costly and infringes on the privacy of the people whose face data is collected.

Method used

A generative adversarial network and a fine-tuned multimodal model are trained, and face images in various attack styles (print attacks, plaster mask attacks, resin mask attacks, and replay attacks) are generated to enrich the training data of the face liveness detection model.

Benefits of technology

This greatly enriches the training data for face liveness detection models, reduces training costs, improves the defense capabilities of face recognition systems, and protects the privacy of the people whose face data is collected.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the field of face recognition technology and discloses a generation method and device for face pictures of multiple attack styles, wherein the generation method comprises the following steps: training a generative adversarial network according to a preset face picture data set and a high-definition face data set and reserving a generator in the generative adversarial network; training a multi-modal model based on model fine-tuning according to multiple face pictures in the face picture data set and description texts corresponding to each face picture; and generating face pictures corresponding to multiple attack styles respectively through the generator and the multi-modal model according to description texts corresponding to the multiple attack styles respectively.

Description

Technical Field

[0001] This application relates to the field of facial recognition technology, for example to a method and apparatus for generating facial images with various attack styles.

Background Technology

[0002] Currently, facial recognition systems have been applied in many areas of people's lives, such as unlocking electronic devices, access control, and financial payments. At the same time, facial recognition systems also face many security risks, such as attacks using printed face images, playing face videos on electronic devices, and wearing masks. To address this challenge, training a face liveness detection model is essential. For a real face, the face liveness detection model outputs a true result; for an attacked face, the face liveness detection model outputs a false result. In addition, training a face liveness detection model usually requires a large amount of real face datasets and attack face data. In related technologies, the process of obtaining attack face data is generally as follows: (1) Obtain 3D point cloud data of the subject's face; (2) Print a corresponding 3D mask using a specific material; (3) Have the attacker wear the mask and then take a photo of the person to obtain a face image of the masked person.

[0003] In the process of implementing the embodiments of this disclosure, at least the following problems were found in the related art:

[0004] In practical applications, the attack challenges faced by facial recognition systems are an open-set problem, meaning that new attack masks made of new materials or using new processes are constantly emerging on the market, such as resin masks, plaster masks, and silicone headgear masks. This means that the data from these attacked faces is not used in the training of facial liveness detection models. Furthermore, the cost of producing a single resin mask is typically over two thousand RMB, and it does not adequately protect the privacy of the user's facial biometric information. Therefore, generating facial images with various attack styles is of significant importance for facial liveness detection models.

[0005] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0006] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or important components or to delimit the scope of protection of these embodiments. It serves as a prelude to the detailed description that follows.

[0007] This disclosure provides a method and apparatus for generating face images with various attack styles, a computing device and a storage medium, which can generate face images with various attack styles, greatly enriching the training data of the face liveness detection model and improving the security of the entire face recognition system.

[0008] In some embodiments, the method for generating face images for multiple attack styles includes:

[0009] Based on a pre-defined dataset of face images and a high-resolution face dataset, a generative adversarial network is trained and the generator in the generative adversarial network is retained.

[0010] Based on multiple face images in the face image dataset and the corresponding descriptive text for each face image, a multimodal model based on model fine-tuning is trained.

[0011] Based on the descriptive text corresponding to various attack styles, a generator and a multimodal model are used to generate face images corresponding to different attack styles.

[0012] Optionally, a generative adversarial network (GAN) is trained based on a pre-defined dataset of face images and a high-resolution face dataset, and the generator in the GAN is retained, including:

[0013] Collect multiple real face images from the same batch, as well as multiple face images with preset attack styles corresponding to each real face image, to form a face image dataset;

[0014] A generative adversarial network (GAN) is trained based on the StyleGAN architecture using multiple real face images, multiple face images with multiple preset attack styles, and multiple high-definition face images from a high-definition face dataset, while retaining the generator in the GAN.

[0015] Optionally, based on multiple face images in the face image dataset and the corresponding descriptive text for each face image, a multimodal model based on model fine-tuning is trained, including:

[0016] Obtain the text description corresponding to each real face image in the face image dataset;

[0017] Obtain the text description corresponding to each face image with a preset attack style in the face image dataset;

[0018] The multimodal model is trained using real face images and their corresponding text descriptions, as well as face images and their corresponding text descriptions with preset attack styles, until the training stops.

[0019] Optionally, based on the descriptive text corresponding to different attack styles, a generator and a multimodal model are used to generate face images corresponding to different attack styles, including:

[0020] Based on the randomly generated initial latent code and the descriptive text corresponding to the target attack style, obtain the model loss value output by the multimodal model;

[0021] Backpropagation is performed using the model loss value to update the initial latent encoding, thereby obtaining the latent encoding corresponding to the target attack style;

[0022] Based on the latent encoding corresponding to the target attack style, a generator is used to generate a face image corresponding to the target attack style.

[0023] Optionally, based on the randomly generated initial latent encoding and the descriptive text corresponding to the target attack style, the model loss value output by the multimodal model is obtained, including:

[0024] The randomly generated initial latent code is input into the generator to obtain the initial face image;

[0025] The text description corresponding to the target attack style and the initial face image are input into the multimodal model, and the model loss value output by the multimodal model is obtained.

[0026] Optionally, backpropagation is performed using the model loss value to update the initial latent encoding, obtaining the latent encoding corresponding to the target attack style, including:

[0027] Obtain the current attacking face image output by the multimodal model through inverse mapping based on the current model loss value;

[0028] The current attacking face image is input into the generator, and the updated latent encoding is obtained through inverse mapping;

[0029] The updated latent encoding is repeatedly obtained through the generator and the multimodal model until the model loss value of the multimodal model converges, thus obtaining the latent encoding corresponding to the target attack style.

[0030] Optionally, attack styles include print attacks on paper, face mask attacks made of plaster, face mask attacks made of resin, and replay attacks displayed on electronic screens.

[0031] In some embodiments, an apparatus for generating facial images with various attack styles includes:

[0032] The data collection module is configured to train a generative adversarial network based on a preset face image dataset and a high-resolution face dataset, and retain the generator in the generative adversarial network;

[0033] The model training module is configured to train a multimodal model based on model fine-tuning using multiple face images from the face image dataset and the corresponding descriptive text for each face image.

[0034] The image generation module is configured to generate face images corresponding to various attack styles based on descriptive text corresponding to different attack styles, using a generator and a multimodal model.

[0035] In some embodiments, a computing device includes a processor and a memory storing program instructions, the processor being configured to execute, when running the program instructions, a method for generating face images for various attack styles as described in this application.

[0036] In some embodiments, the storage medium stores program instructions that, when executed, perform the method for generating face images for various attack styles as described in this application.

[0037] The method and apparatus for generating face images with various attack styles, computing devices, and storage media provided in this disclosure can achieve the following technical effects:

[0038] This application employs techniques from the field of machine learning technology. By training a generator in an adversarial network and a multimodal model based on model fine-tuning, and using descriptive text corresponding to various attack styles, the generator and multimodal model generate face images corresponding to various attack styles. This enables the output of face images with multiple attack styles, greatly enriching the training data of the face liveness detection model, significantly reducing the cost of training the face liveness detection model, and improving the defense capability of the face recognition system.

[0039] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.

Attached Figure Description

[0040] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not constitute a limitation on the embodiments. Elements having the same reference numerals in the drawings denote similar elements. The drawings are not drawn to scale. In the drawings:

[0041] Figure 1 is a schematic diagram of the system architecture of a generative adversarial network;

[0042] Figure 2 is a schematic diagram illustrating the working principle of an encoder;

[0043] Figure 3 is a schematic diagram of a method for generating face images with various attack styles provided in an embodiment of this disclosure;

[0044] Figure 4 is a schematic diagram of another method for generating face images with various attack styles provided in an embodiment of this disclosure;

[0045] Figure 5 is a schematic diagram of another method for generating face images with various attack styles provided in an embodiment of this disclosure;

[0046] Figure 6 is a schematic diagram of another method for generating face images with various attack styles provided in an embodiment of this disclosure;

[0047] Figure 7 is a schematic diagram of a specific application process provided in an embodiment of this disclosure;

[0048] Figure 8 is a schematic diagram of another method for generating face images with various attack styles provided in an embodiment of this disclosure;

[0049] Figure 9 is a schematic diagram of an apparatus for generating face images with various attack styles provided in an embodiment of this disclosure;

[0050] Figure 10 is a schematic diagram of a computing device provided in an embodiment of this disclosure.

Detailed Implementation

[0051] To provide a more detailed understanding of the features and technical content of the embodiments of this disclosure, the implementation of the embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this disclosure. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.

[0052] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this disclosure described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.

[0053] Unless otherwise stated, the term "multiple" means two or more.

[0054] In this embodiment of the disclosure, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.

[0055] The term "and / or" describes an association between objects, indicating that three relationships can exist. For example, A and / or B means: A or B, or A and B.

[0056] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.

[0057] As shown in Figure 1, a generative adversarial network (GAN) is an unsupervised learning method in which a generator and a discriminator compete in a continuous game, so that the generator, itself a neural network, learns the data distribution. During training, the generator's goal is to produce images realistic enough to deceive the discriminator. The discriminator's goal is to distinguish real images from the fake images produced by the generator. This creates a dynamic "game" between the generator and the discriminator. Ultimately, the fake samples produced by the generator become almost indistinguishable from real samples, and the discriminator can no longer tell them apart. At this point, the generator and discriminator reach equilibrium, and the training process ends.
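The game described in this paragraph is commonly written as the minimax objective of the original GAN formulation. The patent does not state a loss function, and practical architectures such as StyleGAN commonly train with variants of this objective, so the form below is only the standard textbook illustration. The symbols $G$, $D$, $p_{\text{data}}$ (distribution of real images) and $p_z$ (distribution of random latent inputs) are notation introduced here.

```latex
\min_{G} \max_{D} \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

In the idealized equilibrium, the generated distribution matches $p_{\text{data}}$ and the best possible discriminator outputs $D(x) = 1/2$ for every input, which corresponds to the discriminator being unable to tell real from generated samples.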

[0058] In related technologies, numerous GAN models, such as PCGAN, BigGAN, and StyleGAN, have been developed to generate high-quality, diverse images from random noise input. Recent research has shown that GANs effectively encode rich semantic information in their intermediate features and latent space. These findings suggest that images with diverse characteristics can be synthesized by altering the encoding in the latent space. However, because GANs lack inference capabilities and encoders, this processing can only be applied to images generated by GANs, not to real-world images.

[0059] As shown in Figure 2, the goal of an encoder is to encode input images, text, or audio into low-dimensional latent codes or feature representations. Encoders are typically implemented using neural networks, including convolutional layers, pooling layers, and batch normalization layers. Convolutional layers extract local features of the image, pooling layers downsample the image and pass scale-invariant features to the next layer, and batch normalization layers primarily normalize the distribution of training images and accelerate learning. Taking the encoding of a face image as an example, the encoder extracts features from the face image to form a latent code, a vector that contains the main information of the face image. For example, the elements of this vector might represent skin color, eyebrow position, eye size, etc.
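Two of the layer types named above can be illustrated with a minimal pure-Python sketch. It is not taken from the patent: the function names `max_pool_2x2` and `batch_norm` and the sample values are hypothetical, and real batch normalization additionally works per channel and applies a learned scale and shift, which are omitted here.

```python
from statistics import fmean, pvariance


def max_pool_2x2(image):
    """Downsample a 2D grid by keeping the maximum of each 2x2 block."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def batch_norm(values, eps=1e-5):
    """Shift and scale a batch of activations to zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# Hypothetical 4x4 single-channel "image".
image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 1, 1],
]
pooled = max_pool_2x2(image)                   # [[4, 5], [3, 7]]
normalized = batch_norm([2.0, 4.0, 6.0, 8.0])  # values centered on zero
```

Pooling halves each spatial dimension while keeping the strongest response in every block, which is the downsampling role the paragraph describes.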

[0060] In related technologies, facial recognition systems face numerous security risks. To address this challenge, training a facial liveness detection model is objectively valuable. For a real face, the liveness detection model outputs a true result; however, for an attacked face, the model outputs a false result. Therefore, training a facial liveness detection model typically requires a large dataset of both real and attacked faces. However, the cost of producing a single resin mask is usually over 2,000 RMB, and it cannot adequately protect the privacy of the user's facial biometric information. This means that once new materials or processes for attacking masks emerge, the reliability of existing facial liveness detection models will decrease. Therefore, how to generate new attack-style facial mask images at low cost using existing attack-style face images is of significant importance for facial liveness detection tasks.

[0061] Therefore, as shown in Figure 3, this disclosure provides a method for generating face images with various attack styles, including:

[0062] Step 301: Based on the preset face image dataset and high-definition face dataset, train a generative adversarial network and retain the generator in the generative adversarial network.

[0063] Step 302: Based on multiple face images in the face image dataset and the corresponding descriptive text for each face image, a multimodal model based on model fine-tuning is trained.

[0064] Step 303: Based on the descriptive text corresponding to each of the various attack styles, generate face images corresponding to each of the various attack styles using the generator and the multimodal model.

[0065] The method provided in this disclosure trains the generator of a generative adversarial network and a fine-tuned multimodal model, and then uses the generator and the multimodal model to generate a face image for each attack style from the descriptive text corresponding to that style. This enables the output of face images with multiple attack styles, greatly enriching the training data of the face liveness detection model, greatly reducing the cost of training the face liveness detection model, and improving the defense capability of the face recognition system.

[0066] Optionally, as shown in Figure 4, the step of training a generative adversarial network (GAN) based on a preset face image dataset and a high-resolution face dataset, and retaining the generator in the GAN, includes:

[0067] Step 401: Collect multiple real face images from the same batch, and multiple face images with preset attack styles corresponding to each real face image, to form the face image dataset.

[0068] Step 402: Using the multiple real face images, multiple face images with preset attack styles, and multiple high-definition face images from the high-definition face dataset, a generative adversarial network is trained based on the StyleGAN architecture, and the generator in the generative adversarial network is retained.

[0069] In the embodiments of this application, the face recognition system of this application can collect a batch of face images with multiple preset attack styles, including real face images M1 of the same group of people, face images M2 printed on paper in the style of print attack, face images M3 made of plaster mask, face images M4 made of resin mask, and face images M5 displayed on an electronic screen in the style of replay attack.
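One way to picture the collected batch is as a manifest that pairs every subject with one image per style, M1 through M5. The sketch below is an assumption about how such a dataset could be organized, not the patent's storage format. The names `Style`, `FaceSample`, `build_manifest`, the subject identifiers, and the file paths are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Style(Enum):
    """The five image groups named in the text (M1 to M5)."""
    REAL = "M1"
    PRINT = "M2"
    PLASTER_MASK = "M3"
    RESIN_MASK = "M4"
    REPLAY = "M5"


@dataclass(frozen=True)
class FaceSample:
    subject_id: str
    style: Style
    path: str  # hypothetical file location

    @property
    def is_attack(self) -> bool:
        # Everything except the real face counts as an attack sample.
        return self.style is not Style.REAL


def build_manifest(subject_ids):
    """Create one entry per (subject, style) pair for the same group of people."""
    return [
        FaceSample(subject, style, f"{style.value}/{subject}.png")
        for subject in subject_ids
        for style in Style
    ]


manifest = build_manifest(["subject_001", "subject_002"])
attack_count = sum(sample.is_attack for sample in manifest)  # 8 of 10 entries
```

Keeping the subject identifier on every entry preserves the correspondence between each real face and its attack-style counterparts.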

[0070] Furthermore, the face recognition system of this application is based on the StyleGAN architecture. Using data M1 to M5 from step 402, as well as the high-definition face dataset (Flickr-Faces-High-Quality, FFHQ), a generative adversarial network is trained. FFHQ is a high-quality face dataset containing 70,000 high-definition PNG face images with a resolution of 1024x1024. It is rich and diverse in terms of age, race, and image background, and also has a great deal of variation in face attributes, including different ages, genders, races, skin colors, expressions, face shapes, hairstyles, and face poses. It also includes various face accessories such as ordinary glasses, sunglasses, hats, hair accessories, and scarves. Therefore, this dataset can also be used to develop some face attribute classification or face semantic segmentation models.

[0071] This generative adversarial network consists of a generator G and a discriminator D. After training, the generator G is retained. The generator G is a neural network whose input is a latent code, for example, a matrix with a dimension of 16×512. The output of the generator G is a face image, such as a 3×512×512 RGB image.
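The example shapes quoted above can be checked with a few lines of arithmetic. This is only a sketch of the input and output sizes. The constant and function names are hypothetical, and the patent gives these dimensions as examples rather than requirements.

```python
from math import prod

# Example shapes quoted in the text.
LATENT_SHAPE = (16, 512)      # latent code matrix fed to generator G
IMAGE_SHAPE = (3, 512, 512)   # RGB face image produced by generator G


def element_count(shape):
    """Number of scalar values in a tensor of the given shape."""
    return prod(shape)


latent_size = element_count(LATENT_SHAPE)  # 8192
image_size = element_count(IMAGE_SHAPE)    # 786432
expansion = image_size / latent_size       # 96.0
```

Under these example shapes, the generator expands each latent value into 96 pixel values on average, which is why editing the compact latent code is a convenient handle on the whole image.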

[0072] This ensures better generalization ability of the generator and guarantees the quality and reliability of face images with various attack styles.

[0073] Optionally, as shown in Figure 5, the step of training a multimodal model based on model fine-tuning using multiple face images from the face image dataset and the corresponding descriptive text for each face image includes:

[0074] Step 501: Obtain the text description corresponding to each real face image in the face image dataset.

[0075] Step 502: Obtain the text description corresponding to each face image with a preset attack style in the face image dataset.

[0076] Step 503: Train the multimodal model using the real face images and their corresponding text descriptions, and face images and their corresponding text descriptions with preset attack styles, until the training stops.

[0077] In the embodiments of this application, the fine-tuned Contrastive Language-Image Pre-training (CLIP) multimodal model has two inputs: a face image and a text description. The output is the similarity between the face image and the text description: the closer the content of the given face image is to the text description, the smaller the model loss value. For example, given a face image printed on paper and the description "a print face image", if the CLIP model returns a small loss value, it can be considered that the CLIP model has completed transfer learning. At this point, the real face images and their corresponding text descriptions, as well as face images with a preset attack style and their corresponding text descriptions, are input into the multimodal model after transfer learning is completed, and the multimodal model is trained until the training stopping condition is met.
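The relationship between similarity and loss can be sketched with a CLIP-style symmetric contrastive loss over a small batch. The patent does not give its loss formula, so this is an illustration under assumptions: the embeddings are hand-written stand-ins for encoder outputs, the temperature value is an assumed default, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image k should match text k, and text k image k."""
    n = len(image_embs)
    sims = [
        [cosine_similarity(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    image_to_text = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    columns = [[sims[r][k] for r in range(n)] for k in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


# Hypothetical embeddings: image 0 resembles text 0, image 1 resembles text 1.
image_embs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2]]
text_embs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1]]

matched = contrastive_loss(image_embs, text_embs)
mismatched = contrastive_loss(image_embs, list(reversed(text_embs)))
```

With correctly paired descriptions the loss is small, and swapping the descriptions makes it much larger, which mirrors the statement that a closer image and text give a smaller loss value.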

[0078] This approach can improve the computational efficiency and accuracy of multimodal models while reducing computational resources and time.

[0079] Optionally, as shown in Figure 6, the step of generating face images corresponding to various attack styles based on descriptive text corresponding to different attack styles, through the generator and the multimodal model, includes:

[0080] Step 601: Obtain the model loss value output by the multimodal model based on the randomly generated initial latent code and the descriptive text corresponding to the target attack style.

[0081] Step 602: Backpropagation is performed using the model loss value to update the initial latent encoding, thereby obtaining the latent encoding corresponding to the target attack style.

[0082] Step 603: Generate a face image corresponding to the target attack style using the generator based on the latent encoding corresponding to the target attack style.

[0083] In the embodiments of this application, as shown in Figure 7, a random initial latent code and a textual description corresponding to the target attack style, such as a printed plaster mask face image, are provided and input into the CLIP model for forward computation to obtain the model loss value. Then, the backpropagation algorithm is used to repeatedly update the latent code corresponding to the attack face image until the model loss value of the multimodal model converges. Finally, the generator outputs the attack face image with the target attack style corresponding to the latent code.
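The update loop described above, where the networks stay fixed and only the latent code is adjusted by gradients of the loss, can be sketched on a toy problem. Everything below is a stand-in chosen so the example runs without a deep learning framework: a fixed linear map plays the frozen generator, a squared distance to a target vector plays the multimodal loss, and the gradient is derived by hand with the chain rule instead of an automatic backpropagation engine. The names, dimensions, learning rate, and step count are assumptions.

```python
import random

# Hypothetical frozen "generator": a fixed linear map from a 3-value latent
# code to a 2-value image feature. Its weights are never updated.
WEIGHTS = [
    [1.0, 0.5, -0.5],
    [0.0, 1.0, 0.5],
]


def generate(latent):
    """Forward pass of the toy generator."""
    return [sum(w * z for w, z in zip(row, latent)) for row in WEIGHTS]


def loss_and_grad(latent, text_feature):
    """Squared-distance loss and its gradient with respect to the latent."""
    image_feature = generate(latent)
    residual = [i - t for i, t in zip(image_feature, text_feature)]
    loss = sum(r * r for r in residual)
    # Chain rule: dL/dlatent = 2 * W^T * residual.
    grad = [
        2.0 * sum(WEIGHTS[r][c] * residual[r] for r in range(len(WEIGHTS)))
        for c in range(len(latent))
    ]
    return loss, grad


def optimize_latent(text_feature, steps=200, lr=0.1, seed=0):
    """Start from a random latent code and descend the loss gradient."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(3)]
    history = []
    for _ in range(steps):
        loss, grad = loss_and_grad(latent, text_feature)
        history.append(loss)
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent, history


target = [0.8, -0.3]  # stand-in for the embedded attack-style description
latent, history = optimize_latent(target)
final_image_feature = generate(latent)
```

The key design point carried over from the text is that the optimizer changes only the latent code. In a real setup the generator and the multimodal model would be neural networks and the gradient would come from automatic differentiation.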

[0084] In this way, by using algorithms that generate various styles of attack faces, the training data for the face liveness detection model can be greatly enriched, improving the accuracy of the face liveness detection model and thus ensuring the security of the entire face recognition system.

[0085] Specifically, in the embodiments of this application and as shown in Figure 8, the step of generating face images corresponding to various attack styles based on descriptive text corresponding to different attack styles, through the generator and the multimodal model, includes:

[0086] Step 801: Input the randomly generated initial latent code into the generator to obtain the initial face image.

[0087] Step 802: Input the text description corresponding to the target attack style and the initial face image into the multimodal model to obtain the model loss value output by the multimodal model.

[0088] Step 803: Obtain the current attacking face image output by the multimodal model through inverse mapping based on the current model loss value.

[0089] Step 804: Input the current attacking face image into the generator and obtain the updated latent code through inverse mapping.

[0090] Step 805: Repeatedly obtain the updated latent encoding through the generator and the multimodal model until the model loss value of the multimodal model converges to obtain the latent encoding corresponding to the target attack style.

[0091] Step 806: Generate a face image corresponding to the target attack style using the generator based on the latent encoding corresponding to the target attack style.
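Step 805 repeats until the model loss value converges, but the patent does not define a convergence test. One possible criterion, shown purely as an assumption, is to stop once several consecutive loss changes fall below a tolerance. The function names, the tolerance, the patience value, and the toy loss sequence are all hypothetical.

```python
def has_converged(losses, tolerance=1e-4, patience=3):
    """True once the last `patience` loss changes are all below `tolerance`."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(abs(a - b) < tolerance for a, b in zip(recent, recent[1:]))


def run_until_converged(step, max_iterations=1000):
    """Call `step()` (which returns the current loss) until convergence."""
    losses = []
    for _ in range(max_iterations):
        losses.append(step())
        if has_converged(losses):
            break
    return losses


# Toy step standing in for one pass through steps 802 to 804:
# the loss simply halves on every call.
state = {"loss": 1.0}


def toy_step():
    state["loss"] *= 0.5
    return state["loss"]


losses = run_until_converged(toy_step)
```

The `max_iterations` cap guards against a loss that oscillates and never satisfies the tolerance, a case the text does not address.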

[0092] In this way, generating target face images with multiple attack styles through a generator can greatly reduce the cost of obtaining such images. It eliminates the need to print corresponding 3D masks using materials such as plaster, resin, and silicone, have attackers wear such masks, and then photograph them to obtain face images of the masked person. This makes the application highly feasible and versatile, and can effectively protect the privacy of the facial biometric information of the people whose face data is collected.

[0093] As shown in Figure 9, this disclosure provides an apparatus for generating facial images with various attack styles, including:

[0094] The data collection module 901 is configured to train a generative adversarial network based on a preset face image dataset and a high-definition face dataset, and retain the generator in the generative adversarial network;

[0095] The model training module 902 is configured to train a multimodal model based on model fine-tuning based on multiple face images in the face image dataset and the descriptive text corresponding to each face image.

[0096] Image generation module 903 is configured to generate face images corresponding to various attack styles based on description text corresponding to different attack styles, through the generator and the multimodal model.

[0097] Optionally, the data collection module 901 is specifically configured as follows:

[0098] Collect multiple real face images from the same batch, as well as multiple face images with preset attack styles corresponding to each real face image, to form the face image dataset;

[0099] Using multiple real face images, multiple face images with preset attack styles, and multiple high-definition face images from a high-definition face dataset, a generative adversarial network is trained based on the StyleGAN architecture, and the generator in the generative adversarial network is retained.

[0100] Optionally, the model training module 902 is specifically configured as follows:

[0101] Obtain the text description corresponding to each real face image in the face image dataset;

[0102] Obtain the text description corresponding to each face image with a preset attack style in the face image dataset;

[0103] The multimodal model is trained using real face images and their corresponding text descriptions, as well as face images and their corresponding text descriptions with preset attack styles, until the training stops.

[0104] Optionally, the image generation module 903 is specifically configured as follows:

[0105] Based on the randomly generated initial latent code and the descriptive text corresponding to the target attack style, the model loss value output by the multimodal model is obtained;

[0106] Backpropagation is performed using the model loss value to update the initial latent encoding, thereby obtaining the latent encoding corresponding to the target attack style;

[0107] Based on the latent encoding corresponding to the target attack style, the generator generates a face image corresponding to the target attack style.

[0108] The apparatus for generating face images with multiple attack styles provided in this disclosure generates face images corresponding to multiple attack styles by training a generator in an adversarial network and a multimodal model based on model fine-tuning, according to the descriptive text corresponding to each attack style. This allows for the output of face images with multiple attack styles, greatly enriching the training data of the face liveness detection model, significantly reducing the cost of training the face liveness detection model, and improving the defense capability of the face recognition system.
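The division of work among modules 901, 902, and 903 can be sketched as three small classes wired in sequence. This is a structural illustration only: the class names follow the module names in the text, but the callables they wrap, the type aliases, and the trivial stand-in functions are hypothetical and do no real training.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Latent = List[float]
Image = List[float]
Generator = Callable[[Latent], Image]
LossModel = Callable[[Image, str], float]


@dataclass
class DataCollectionModule:
    """Module 901: trains the GAN and hands back only the generator."""
    train_gan: Callable[[], Generator]

    def run(self) -> Generator:
        return self.train_gan()


@dataclass
class ModelTrainingModule:
    """Module 902: fine-tunes the multimodal model that scores images."""
    fine_tune: Callable[[], LossModel]

    def run(self) -> LossModel:
        return self.fine_tune()


@dataclass
class ImageGenerationModule:
    """Module 903: produces one image per attack-style description."""
    find_latent: Callable[[Generator, LossModel, str], Latent]

    def run(self, generator: Generator, loss_model: LossModel,
            descriptions: List[str]) -> Dict[str, Image]:
        return {
            text: generator(self.find_latent(generator, loss_model, text))
            for text in descriptions
        }


# Trivial stand-ins so the wiring can be exercised end to end.
def toy_train_gan() -> Generator:
    return lambda latent: [2.0 * z for z in latent]


def toy_fine_tune() -> LossModel:
    return lambda image, text: abs(sum(image) - len(text))


def toy_find_latent(generator, loss_model, text) -> Latent:
    return [len(text) / 2.0]  # chosen so the toy loss is exactly zero


generator = DataCollectionModule(toy_train_gan).run()
loss_model = ModelTrainingModule(toy_fine_tune).run()
images = ImageGenerationModule(toy_find_latent).run(
    generator, loss_model, ["print attack", "resin mask attack"])
```

Passing the generator and the loss model into module 903 as arguments reflects the text's description that the image generation module works through both of them.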

[0109] As shown in Figure 10, this embodiment of the present disclosure provides a computing device including a processor 100 and a memory 101. Optionally, the device may further include a communication interface 102 and a bus 103. The processor 100, communication interface 102, and memory 101 can communicate with each other via the bus 103. The communication interface 102 can be used for information transmission. The processor 100 can call logical instructions in the memory 101 to execute the method for generating face images with various attack styles described in the above embodiment.

[0110] Furthermore, the logic instructions in the aforementioned memory 101 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.

[0111] The memory 101, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this disclosure. The processor 100 executes functional applications and data processing by running the program instructions / modules stored in the memory 101, that is, it implements the method for generating face images with various attack styles in the above embodiments.

[0112] The memory 101 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 101 may include high-speed random access memory and may also include non-volatile memory.

[0113] This disclosure provides a computer-readable storage medium storing computer-executable instructions configured to execute the above-described method for generating face images with various attack styles.

[0114] The aforementioned computer-readable storage medium may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.

[0115] The technical solutions of this disclosure can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes one or more instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the method described in this disclosure. The aforementioned storage medium can be a non-transitory storage medium, such as a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or another medium capable of storing program code; it can also be a transitory storage medium.

[0116] The foregoing description and accompanying drawings sufficiently illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operations may vary. Parts and features of some embodiments may be included in or substituted for parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of the embodiments and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and / or" as used in this application covers any one of the associated listed items and all possible combinations of them. Additionally, when used in this application, the term "comprise" and its variations "comprises" and / or "comprising" specify the presence of stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Without further limitation, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on its differences from other embodiments, and similar or identical parts of the embodiments may be cross-referenced. For methods, products, etc., disclosed in the embodiments, where they correspond to the method section disclosed in the embodiments, reference may be made to the description of the method section for the relevant parts.

[0117] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the embodiments of this disclosure. Those skilled in the art will clearly understand that, for convenience and brevity, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, devices, and units described above, which are not repeated here.

[0118] The methods and products (including but not limited to devices and equipment) disclosed in the embodiments herein can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For instance, the division into units may be merely a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to implement this embodiment. Furthermore, the functional units in the embodiments of this disclosure may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

[0119] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. In the descriptions corresponding to the flowcharts and block diagrams in the accompanying drawings, the operations or steps corresponding to different blocks may also occur in a different order than disclosed in the description, and sometimes there is no specific order between different operations or steps. For example, two consecutive operations or steps may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in a block diagram and / or flowchart, and combinations of blocks in a block diagram and / or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

Claims

1. A method for generating face images with various attack styles, characterized in that it comprises: collecting multiple real face images from the same batch, and multiple face images with preset attack styles corresponding to each real face image, to form a face image dataset; wherein the face images with preset attack styles include face images of print attacks printed on paper, face images of face masks made of plaster, face images of face masks made of resin, and face images of replay attacks displayed on an electronic screen; training a generative adversarial network based on the StyleGAN architecture using the multiple real face images, the multiple face images with preset attack styles, and multiple high-definition face images from a high-definition face dataset, and retaining the generator in the generative adversarial network; training a multimodal model based on model fine-tuning using multiple face images in the face image dataset and the descriptive text corresponding to each face image; and generating, through the generator and the multimodal model, face images corresponding to various attack styles according to the descriptive texts corresponding to the respective attack styles.

2. The generation method according to claim 1, characterized in that the step of training a multimodal model based on model fine-tuning using multiple face images in the face image dataset and the descriptive text corresponding to each face image includes: obtaining the text description corresponding to each real face image in the face image dataset; obtaining the text description corresponding to each face image with a preset attack style in the face image dataset; and training the multimodal model using the real face images and their corresponding text descriptions, together with the face images with preset attack styles and their corresponding text descriptions, until training stops.

3. The generation method according to claim 1, characterized in that the step of generating, through the generator and the multimodal model, face images corresponding to various attack styles according to the descriptive texts corresponding to the respective attack styles includes: obtaining the model loss value output by the multimodal model based on a randomly generated initial latent code and the descriptive text corresponding to the target attack style; performing backpropagation using the model loss value to update the initial latent code, thereby obtaining the latent code corresponding to the target attack style; and generating, by the generator, a face image corresponding to the target attack style based on the latent code corresponding to the target attack style.

4. The generation method according to claim 3, characterized in that the step of obtaining the model loss value output by the multimodal model based on the randomly generated initial latent code and the descriptive text corresponding to the target attack style includes: inputting the randomly generated initial latent code into the generator to obtain an initial face image; and inputting the descriptive text corresponding to the target attack style and the initial face image into the multimodal model to obtain the model loss value output by the multimodal model.

5. The generation method according to claim 4, characterized in that the step of performing backpropagation using the model loss value to update the initial latent code and obtain the latent code corresponding to the target attack style includes: obtaining, through inverse mapping based on the current model loss value, the current attack face image output by the multimodal model; inputting the current attack face image into the generator and obtaining the updated latent code through inverse mapping; and repeatedly obtaining the updated latent code through the generator and the multimodal model until the model loss value of the multimodal model converges, thereby obtaining the latent code corresponding to the target attack style.

6. The generation method according to any one of claims 1 to 5, characterized in that the attack styles include print attacks printed on paper, attacks using plaster masks, attacks using resin masks, and replay attacks displayed on electronic screens.

7. An apparatus for generating face images with various attack styles, characterized in that it comprises: a data collection module configured to collect multiple real face images from the same batch, as well as multiple face images with preset attack styles corresponding to each real face image, to form a face image dataset; wherein the face images with preset attack styles include face images of print attacks printed on paper, face images of face masks made of plaster, face images of face masks made of resin, and face images of replay attacks displayed on an electronic screen; wherein a generative adversarial network is trained based on the StyleGAN architecture using the multiple real face images, the multiple face images with preset attack styles, and multiple high-definition face images from a high-definition face dataset, and the generator in the generative adversarial network is retained; a model training module configured to train a multimodal model based on model fine-tuning using multiple face images in the face image dataset and the descriptive text corresponding to each face image; and an image generation module configured to generate, through the generator and the multimodal model, face images corresponding to various attack styles according to the descriptive texts corresponding to the respective attack styles.

8. A computing device, comprising a processor and a memory storing program instructions, characterized in that the processor is configured to perform, when executing the program instructions, the method for generating face images with various attack styles according to any one of claims 1 to 6.

9. A storage medium storing program instructions, characterized in that the program instructions, when executed, perform the method for generating face images with various attack styles according to any one of claims 1 to 6.