Image recognition method and device
By using unsupervised training methods, a gaze recognition model is trained using face images with and without occlusion of the eyes. This solves the problem that gaze estimation technology relies on labeled data and achieves efficient gaze recognition and accurate estimation even in the absence of labeled samples.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YINWANG INTELLIGENT TECHNOLOGIES CO LTD
- Filing Date
- 2022-08-09
- Publication Date
- 2026-06-26
AI Technical Summary
Existing gaze estimation techniques based on deep neural network learning rely on a large number of labeled samples with gaze data, resulting in high costs and low efficiency, making it difficult to apply effectively in various scenarios, especially in vehicle scenarios where it is difficult to obtain labeled data and the training cycle is long.
An unsupervised training method is adopted to train a gaze recognition model using face images with and without occlusion of the eyes. The model extracts gaze features from the unoccluded images and combines them with facial information to reconstruct human eye information, thereby improving the robustness and accuracy of the model.
Even in the absence of labeled samples, it can accurately identify gaze information in face images, improve the performance and estimation accuracy of gaze recognition models, adapt to large-angle head poses, and reduce training costs and time.
[Figure: abstract drawing, CN115424318B]
Description
Technical Field
[0001] This application relates to the field of artificial intelligence, and more particularly to an image recognition method and device.
Background Technology
[0002] In the process of human interaction with the outside world, a person's gaze can usually intuitively and naturally reflect the objects of interest, purposes, and needs. Therefore, gaze recognition can be applied to various aspects such as behavior understanding, intent analysis, and human-computer interaction. For example, in an in-vehicle scenario, driver gaze recognition can detect whether the driver is distracted.
[0003] With the continuous development of deep neural networks, gaze estimation technology based on deep neural network learning has gradually been widely used as a gaze recognition technique. However, current gaze estimation techniques based on deep neural network learning rely on a large number of labeled samples with gaze data, and the labeling of this gaze data requires a lot of human and material resources, resulting in high cost and low efficiency in current gaze recognition.
Summary of the Invention
[0004] This application provides an image recognition method and device that can train a gaze recognition model for recognizing gaze information in face images even when there are no labeled samples with gaze data. It can also improve the stability and accuracy of the gaze recognition model's recognition results. Even when a face image with a large-angle head posture is input, the gaze recognition model of this application can accurately recognize the gaze information in the face image.
[0005] To achieve the above objectives, this application adopts the following technical solution:
[0006] In a first aspect, this application provides an image recognition method, which includes: acquiring a face image to be recognized; analyzing the face image through a pre-trained gaze recognition model to obtain a gaze recognition result, wherein the gaze recognition model is trained on a neural network model based on training data, and the training data includes a first image and a second image, the first image being an unobstructed face sample image, and the second image being a face sample image with obstructed eyes.
[0007] In other words, this application can obtain a gaze recognition model for identifying gaze information in face images through unsupervised training using unobstructed face sample images (i.e., the first image) and their corresponding face sample images with occluded eyes (i.e., the second image), without relying on a large number of labeled samples with gaze data for training. Furthermore, after obtaining the aforementioned pre-trained gaze recognition model, this application can also directly utilize this gaze recognition model to accurately identify gaze information in the face image to be identified. Moreover, since this application trains the gaze recognition model directly using face images, the trained gaze recognition model can combine full-face information such as eye information and head posture information to accurately learn the gaze characteristics of a person. Thus, even if the face image to be identified has a large-angle head posture, the gaze recognition model trained by this application can still have good robustness and accurately identify gaze information in the face image. This improves the performance of the gaze recognition model and also increases the accuracy of gaze estimation.
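As an illustration of how a training pair might be built, the sketch below derives a second image from a first image by blanking a rectangular eye region. The function name `occlude_eyes`, the box coordinates, and the zero fill value are assumptions made for this example; the application does not state how the occlusion is produced.

```python
import numpy as np

def occlude_eyes(face, eye_box, fill=0.0):
    """Return a copy of `face` with the rectangle `eye_box` blanked out.

    eye_box = (top, left, height, width) is a hypothetical eye region;
    in practice it would come from some face landmark detector.
    """
    top, left, h, w = eye_box
    occluded = face.copy()                       # leave the first image untouched
    occluded[top:top + h, left:left + w] = fill  # hide the eye information
    return occluded

rng = np.random.default_rng(0)
first_image = rng.random((32, 32))               # stand-in for an unoccluded face
second_image = occlude_eyes(first_image, (8, 4, 6, 24))
```

Neither image carries a gaze label, so a pair like this can be produced from any face image.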
[0008] In one possible implementation, the image recognition method provided in this application may further include: analyzing a first image and a second image through a preset neural network model to obtain the human eye reconstruction result of the second image; calculating the error loss between the human eye reconstruction result and the original human eye; and when the error loss does not meet preset conditions, training a preset neural network model based on the error loss to obtain a trained gaze recognition model.
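The control flow of this training step can be sketched as a loop that keeps updating parameters until the error loss meets a preset condition. This is a minimal sketch: the threshold value, the learning rate, plain gradient descent, and the toy loss (recovering a fixed vector) are all assumptions, not details taken from the application.

```python
import numpy as np

def train_until(loss_and_grad, params, threshold=1e-3, lr=0.1, max_steps=1000):
    """Update `params` until the loss meets the preset condition (loss <= threshold)."""
    loss = float("inf")
    for _ in range(max_steps):
        loss, grad = loss_and_grad(params)
        if loss <= threshold:          # preset condition met: stop training
            break
        params = params - lr * grad    # otherwise train based on the error loss
    return params, loss

# Toy stand-in for eye reconstruction: learn a vector that matches `target`.
target = np.array([0.2, -0.5, 0.7])

def loss_and_grad(p):
    diff = p - target
    return float(np.mean(diff ** 2)), 2.0 * diff / diff.size

params, final_loss = train_until(loss_and_grad, np.zeros(3))
```

In a real system `loss_and_grad` would run the reconstruction network and backpropagate. Only the stop-or-update structure is illustrated here.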
[0009] It is understood that this application uses a preset neural network model to reconstruct the human eye in a second image in which the human eye information is occluded, and that the model analyzes both the first and second images to do so. Because the human eye information is occluded in the second image, the model can perceive from it only facial information other than the human eye (such as head posture information). Because the human eye information in the first image is not occluded, the model can perceive the human eye information from the first image. The model therefore has to combine human eye information and facial information to complete the human eye reconstruction of the second image, which forces it to learn the gaze characteristics of a person during reconstruction and yields a well-trained gaze recognition model. This gaze recognition model is robust with respect to both head posture and gaze information, which improves its performance and increases the accuracy of gaze estimation.
[0010] It should be noted that the image recognition method provided in this application includes a method for training a gaze recognition model and a method for performing image recognition using the gaze recognition model trained by the training method. These are inventions based on the same concept and can be understood as two parts of a system or two stages of a whole process: such as the model training stage and the model application stage.
[0011] In one possible implementation, the aforementioned preset neural network model may include a first neural network and a second neural network. The process of analyzing the first and second images using the preset neural network model to obtain the human eye reconstruction result of the second image may include: extracting gaze features from the first image using the second neural network and injecting these gaze features into the first neural network; extracting facial features from the second image using the first neural network, and generating the human eye reconstruction result based on the facial features and the gaze features injected by the second neural network.
[0012] In other words, the pre-defined neural network model provided in this application can include two neural networks. The first neural network can be used to attempt to reconstruct the human eye information from a second image that has obscured the human eye information. The second neural network can be used to extract gaze features containing gaze information from an unobstructed first image that contains human eye information, and inject these gaze features into the first neural network to help the first neural network reconstruct the human eye information. Thus, this application can train the pre-defined neural network model to learn the gaze features of a person.
[0013] In one possible implementation, the first neural network may include an encoder and a decoder. The process described above, which extracts facial features from the second image using the first neural network and generates a human eye reconstruction result based on these facial features and the gaze features injected by the second neural network, may include: extracting facial features from the second image using the encoder; concatenating the facial features with the gaze features injected by the second neural network to obtain a concatenated feature vector; and reconstructing the human eye image from the feature vector using the decoder to obtain the human eye reconstruction result.
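The encode, concatenate, and decode steps can be sketched with stand-in linear layers. Everything concrete here is an assumption: the sizes `N`, `P`, `D`, the random weights, the `tanh` encoder, and the choice to tile the gaze feature onto every patch and concatenate along the feature axis. The application only states that facial and gaze features are concatenated and then decoded.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, D = 16, 48, 32   # patches per image, values per patch, feature width (assumed)

W_enc = rng.normal(scale=0.1, size=(P, D))       # stand-in encoder weights
W_dec = rng.normal(scale=0.1, size=(2 * D, P))   # stand-in decoder weights

def encoder(patches):
    """Map (N, P) patch vectors to (N, D) facial features."""
    return np.tanh(patches @ W_enc)

def decoder(features):
    """Map (N, 2D) concatenated features back to (N, P) pixel patches."""
    return features @ W_dec

def reconstruct(second_patches, gaze_feat):
    face_feats = encoder(second_patches)                        # from the occluded image
    gaze_tiled = np.broadcast_to(gaze_feat, face_feats.shape)   # same gaze vector per patch
    joined = np.concatenate([face_feats, gaze_tiled], axis=1)   # concatenated feature vector
    return decoder(joined)

second_patches = rng.random((N, P))      # stand-in for the eye-occluded second image
gaze_feat = rng.normal(size=D)           # stand-in for the injected gaze features
eye_reconstruction = reconstruct(second_patches, gaze_feat)
```

Because the decoder sees the gaze features, changing them changes the reconstruction. The reconstruction loss relies on that dependence to push gaze information into the injected features.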
[0014] In other words, the pre-defined neural network model provided in this application includes an autoencoder structure for reconstructing the occluded eyes and a gaze injection structure. The encoder in the autoencoder structure can extract facial features from a second image in which the human eye information is occluded, and the decoder in the autoencoder structure can reconstruct the eye image based on the facial features and the gaze features injected by the second neural network. Thus, this application can train the pre-defined neural network model to learn the gaze features of a person.
[0015] In one possible implementation, the second neural network shares an encoder with the first neural network. The above-mentioned extraction of gaze features of the first image through the second neural network may include: extracting gaze features of the first image through the encoder.
[0016] It is understandable that the network weights of the encoder in the first neural network reflect what that encoder has learned about facial information. When the encoder in the second neural network shares its network weights with the encoder in the first neural network, the encoder in the second neural network can learn facial information as well as human eye gaze information. Thus, the encoder in the second neural network can learn more accurate human gaze features based on full-face information.
[0017] In one possible implementation, the pre-defined neural network model may further include an embedding module. The above-described extraction of gaze features from the first image via the encoder may include: converting the first image into a first embedding block sequence via the embedding module; and extracting gaze features from the first embedding block sequence via the encoder. The above-described extraction of facial features from the second image via the encoder may include: converting the second image into a second embedding block sequence via the embedding module; and extracting facial features from the second embedding block sequence via the encoder.
[0018] It is understandable that the input and output of the encoder are both one-dimensional vector sequences, so the data format of image data is not directly compatible with the encoder. Therefore, the preset neural network model provided in this application may include an embedding module that segments the image into blocks and converts the resulting image blocks into an input sequence that meets the requirements of the encoder.
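A minimal sketch of such an embedding module follows: it cuts an image into square blocks, flattens each block, and projects it to a vector. The patch size, the embedding width, the random projection `W_embed`, and the added position embedding `pos` are assumptions borrowed from common ViT practice rather than details stated in this application.

```python
import numpy as np

def image_to_patch_sequence(image, patch):
    """Split an (H, W, C) image into a (num_patches, patch*patch*C) sequence."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must tile evenly"
    grid = image.reshape(H // patch, patch, W // patch, patch, C)
    grid = grid.transpose(0, 2, 1, 3, 4)       # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * C)

def embed(patches, W_embed, pos):
    """Linear projection plus a position embedding, one vector per patch."""
    return patches @ W_embed + pos

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))                         # stand-in face image
patches = image_to_patch_sequence(image, patch=8)       # 16 blocks of 192 values
W_embed = rng.normal(scale=0.02, size=(192, 64))
pos = rng.normal(scale=0.02, size=(16, 64))
tokens = embed(patches, W_embed, pos)                   # sequence fed to the encoder
```

The same function would be applied to the first and the second image to obtain the first and second embedding block sequences.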
[0019] In one possible implementation, both the first neural network and the second neural network can be neural networks based on a vision transformer (ViT) structure. The first neural network includes a ViT encoder and a ViT decoder, and the second neural network shares the ViT encoder with the first neural network.
[0020] It can be understood that the ViT structure can treat an image as a sequence of image patches, and the ViT encoder can encode a series of feature vectors corresponding to each image patch based on the correlation between any two image patches in the input series. In other words, the feature vector corresponding to each image patch, in addition to the feature information inherent in each image patch itself, also incorporates feature information from other image patches based on its correlation with them. Therefore, the pre-defined neural network model provided in this application, using the ViT structure as the main network structure, can fully integrate global information from face images to extract feature information from face images, improving the training effect of the gaze recognition model and increasing the accuracy of gaze estimation.
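The way each patch feature absorbs information from the other patches can be illustrated with single-head scaled dot-product self-attention, the core operation of a transformer encoder. This is a sketch under assumptions: one head, random weights, and no layer normalization, residual connection, or feed-forward sublayer, all of which a full ViT block would normally have.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a (num_patches, D) sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # correlation between every pair of patches
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v, weights                      # every output mixes all patches

rng = np.random.default_rng(0)
N, D = 16, 32                                        # assumed sequence length and width
x = rng.normal(size=(N, D))                          # stand-in patch embeddings
Wq, Wk, Wv = (rng.normal(scale=D ** -0.5, size=(D, D)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv)
```

Since every row of `weights` is strictly positive, a change in one patch (for example an eye patch) alters the output feature of every other patch. This is the global integration the paragraph describes.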
[0021] In one possible implementation, the second neural network may include a first fully connected layer and a second fully connected layer. The process of injecting gaze features into the first neural network may include: downsampling the gaze features through the first fully connected layer to obtain bottleneck features; upsampling the bottleneck features through the second fully connected layer to obtain bottleneck features with the same dimension as the facial features; and injecting the bottleneck features with the same dimension as the facial features into the first neural network.
[0022] It is understandable that after the second neural network extracts gaze features from the complete, unobstructed first image, it can compress these gaze features to obtain low-dimensional features containing gaze information. Then, the second neural network can inject these low-dimensional features into the first neural network to help it reconstruct the human eye image. In this way, this application can train a pre-defined neural network model to learn low-dimensional gaze features.
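The two fully connected layers can be sketched as a down-projection followed by an up-projection. The widths `D` and `B`, the random weights, and the omission of biases and activation functions are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 64, 8                                    # feature width and bottleneck width (assumed)
W_down = rng.normal(scale=0.1, size=(D, B))     # first fully connected layer (bias omitted)
W_up = rng.normal(scale=0.1, size=(B, D))       # second fully connected layer (bias omitted)

def bottleneck(gaze_feats):
    """Compress gaze features to B dimensions, then expand back to D for injection."""
    low = gaze_feats @ W_down     # downsampling: low-dimensional bottleneck features
    up = low @ W_up               # upsampling: same dimension as the facial features
    return low, up

gaze_feats = rng.normal(size=(20, D))           # stand-in features for 20 first images
low, injected = bottleneck(gaze_feats)
```

Although `injected` has width `D`, it carries at most `B` independent dimensions. That limit restricts how much of the unoccluded image can pass through, so the injected features stay compact.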
[0023] In one possible implementation, the above-mentioned calculation of the error loss between the reconstructed human eye and the original human eye may include: calculating the error loss between the reconstructed human eye and the human eye image in the first image. Thus, when training a preset neural network model for human eye reconstruction, this application can use the human eye image in the first image as the training target to ensure that the error loss between the model's output human eye reconstruction result and the human eye image in the first image meets preset conditions.
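One way to compute such an error loss is a mean squared error restricted to the eye region, with the first image as the target. Mean squared error and the boolean `eye_mask` are assumptions here; the application does not name a specific loss function.

```python
import numpy as np

def eye_reconstruction_loss(reconstruction, first_image, eye_mask):
    """Mean squared error over the pixels selected by the boolean `eye_mask`."""
    diff = (reconstruction - first_image)[eye_mask]
    return float(np.mean(diff ** 2))

first_image = np.zeros((4, 4))              # toy training target
eye_mask = np.zeros((4, 4), dtype=bool)
eye_mask[:2] = True                         # pretend the top two rows are the eyes

outside_only = first_image.copy()
outside_only[3, 3] = 5.0                    # error outside the eye region is ignored
loss_outside = eye_reconstruction_loss(outside_only, first_image, eye_mask)
loss_all_ones = eye_reconstruction_loss(np.ones((4, 4)), first_image, eye_mask)
```

Here `loss_outside` is 0.0 and `loss_all_ones` is 1.0, so only errors inside the eye region count.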
[0024] In one possible implementation, the above-mentioned method of training a preset neural network model based on the error loss to obtain a trained gaze recognition model when the error loss does not meet the preset conditions may include: updating the parameters in the preset neural network model based on the error loss to obtain a trained neural network model; obtaining an initial gaze recognition model based on the second neural network and the linear regression layer in the trained neural network model, wherein the linear regression layer is used to convert the gaze features extracted by the second neural network into gaze information, including the horizontal and vertical angles of the gaze; and training the initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model.
[0025] It is understood that the second neural network in the trained neural network model of this application can learn more accurate gaze features based on full-face information, but it cannot directly obtain intuitive gaze information. Therefore, in order to apply the model, a linear regression layer can be trained using a small number of face sample images labeled with gaze information to map the gaze features output by the model from the feature space to the gaze space, thereby obtaining intuitive gaze information. In this embodiment, the gaze information can be set as two-dimensional data composed of the horizontal and vertical angles of the gaze.
[0026] In one possible implementation, the above-mentioned training of the initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model may include: training the linear regression layer in the initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model.
[0027] It is understood that this application can fix the network parameters of the second neural network in the pre-trained neural network model, and only train the linear regression layer to learn the linear mapping relationship between gaze features and gaze information. Thus, intuitive gaze angle information can be regressed from abstract gaze features.
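Training only the linear regression layer on top of fixed features amounts to an ordinary linear fit. The sketch below uses synthetic stand-in features and synthetic horizontal and vertical angle labels, and solves the fit in closed form with least squares. The data, the sizes, and the use of `np.linalg.lstsq` instead of gradient-based training are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 16, 200                                   # feature width and labeled sample count (assumed)
features = rng.normal(size=(n, D))               # stand-in for frozen gaze features
true_W = rng.normal(size=(D, 2))                 # synthetic ground-truth mapping
angles = features @ true_W + 0.01 * rng.normal(size=(n, 2))   # synthetic (horizontal, vertical) labels

def fit_linear_head(feats, targets):
    """Least-squares fit of targets ~ feats @ W + b; the feature extractor is not touched."""
    X = np.hstack([feats, np.ones((len(feats), 1))])   # extra column learns the bias
    coef, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return coef[:-1], coef[-1]

def predict_gaze(feats, W, b):
    """Map gaze features to (horizontal angle, vertical angle)."""
    return feats @ W + b

W, b = fit_linear_head(features, angles)
pred = predict_gaze(features, W, b)
```

Only `W` and `b` are learned. That small parameter count is why a small number of labeled samples can suffice for this stage.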
[0028] Optionally, when there are many face sample images labeled with gaze information, this application can also use face sample images labeled with gaze information to train the entire gaze recognition model in order to fine-tune the gaze recognition model and obtain gaze recognition results.
[0029] In one possible implementation, the first and second images mentioned above can be face sample images without labeled gaze information. Thus, this application can also train a gaze recognition model for identifying gaze information in face images even in the absence of labeled samples with gaze data.
[0030] In a second aspect, this application provides a computing device, which includes an acquisition module and a recognition module. The acquisition module is used to acquire a face image to be recognized; the recognition module is used to analyze the face image using a pre-trained gaze recognition model to obtain a gaze recognition result. The gaze recognition model is trained on a neural network model based on training data, which includes a first image and a second image. The first image is an unobstructed face sample image, and the second image is a face sample image with obscured eyes.
[0031] As can be seen, this application can obtain a gaze recognition model for recognizing gaze information in face images through unobstructed face sample images (i.e., the first image) and their corresponding face sample images with occluded eyes (i.e., the second image) in an unsupervised manner, without relying on a large number of labeled samples with gaze data for training. Furthermore, after obtaining the aforementioned pre-trained gaze recognition model, this application can also directly utilize this gaze recognition model to accurately identify gaze information in the face image to be recognized when it is acquired.
[0032] In one possible implementation, the computing device provided in this application further includes a training module. This training module is used to: analyze the first image and the second image using a preset neural network model to obtain the human eye reconstruction result of the second image; calculate the error loss between the human eye reconstruction result and the original human eye; and when the error loss does not meet preset conditions, train the preset neural network model based on the error loss to obtain a trained gaze recognition model.
[0033] It is understood that this application uses a pre-set neural network model to reconstruct the human eye in a second image that has obscured human eye information. This can force the pre-set neural network model to learn the gaze characteristics of the person during the human eye reconstruction process, thereby obtaining a well-trained gaze recognition model.
[0034] In one possible implementation, the aforementioned pre-defined neural network model may include a first neural network and a second neural network. The training module analyzes the first and second images using the pre-defined neural network model to obtain the human eye reconstruction result of the second image. This may include: extracting gaze features from the first image using the second neural network and injecting these gaze features into the first neural network; extracting facial features from the second image using the first neural network, and generating the human eye reconstruction result based on the facial features and the gaze features injected by the second neural network. This application can thus train the pre-defined neural network model to learn the gaze features of a person.
[0035] In one possible implementation, the first neural network may include an encoder and a decoder. The training module described above extracts facial features from the second image using the first neural network, and generates a human eye reconstruction result based on the facial features and the gaze features injected by the second neural network. This process may include: extracting facial features from the second image using the encoder; concatenating the facial features with the gaze features injected by the second neural network to obtain a concatenated feature vector; and reconstructing the human eye image from the feature vector using the decoder to obtain the human eye reconstruction result.
[0036] In other words, the pre-defined neural network model provided in this application includes an autoencoder structure for reconstructing the occluded eyes and a gaze injection structure. The encoder in the autoencoder structure can extract facial features from a second image in which the human eye information is occluded, and the decoder in the autoencoder structure can reconstruct the eye image based on the facial features and the gaze features injected by the second neural network. Thus, this application can train the pre-defined neural network model to learn the gaze features of a person.
[0037] In one possible implementation, the second neural network shares an encoder with the first neural network. The training module described above extracts the gaze features of the first image through the second neural network, which may include: extracting the gaze features of the first image through the encoder.
[0038] It is understandable that the network weights of the encoder in the first neural network reflect what that encoder has learned about facial information. When the encoder in the second neural network shares its network weights with the encoder in the first neural network, the encoder in the second neural network can learn facial information as well as human eye gaze information. Thus, the encoder in the second neural network can learn more accurate human gaze features based on full-face information.
[0039] In one possible implementation, the pre-defined neural network model may further include an embedding module. The extraction of gaze features from the first image via the encoder in the training module may include: converting the first image into a first embedding block sequence via the embedding module; and extracting gaze features from the first embedding block sequence via the encoder. Similarly, the extraction of facial features from the second image via the encoder may include: converting the second image into a second embedding block sequence via the embedding module; and extracting facial features from the second embedding block sequence via the encoder.
[0040] It is understandable that the input and output of the encoder are both one-dimensional vector sequences, so the data format of image data is not directly compatible with the encoder. Therefore, the preset neural network model provided in this application may include an embedding module that segments the image into blocks and converts the resulting image blocks into an input sequence that meets the requirements of the encoder.
[0041] In one possible implementation, both the first and second neural networks can be based on a vision transformer (ViT) structure. The first neural network includes a ViT encoder and a ViT decoder, while the second neural network shares the ViT encoder with the first. Therefore, the pre-defined neural network model provided in this application, using a ViT structure as its core, can fully integrate global information from face images to extract feature information, thus improving the training effect of the gaze recognition model and increasing the accuracy of gaze estimation.
[0042] In one possible implementation, the second neural network may include a first fully connected layer and a second fully connected layer. The injection of gaze features into the first neural network in the training module described above may include: downsampling the gaze features through the first fully connected layer to obtain bottleneck features; upsampling the bottleneck features through the second fully connected layer to obtain bottleneck features with the same dimension as the facial features; and injecting the bottleneck features with the same dimension as the facial features into the first neural network. Thus, this application can train a pre-defined neural network model to learn low-dimensional gaze features.
[0043] In one possible implementation, calculating the error loss between the reconstructed human eye and the original human eye in the training module may include calculating the error loss between the reconstructed human eye and the human eye image in the first image. Thus, when training a preset neural network model for human eye reconstruction, this application can use the human eye image in the first image as the training target to ensure that the error loss between the model's output reconstructed human eye and the human eye image in the first image meets preset conditions.
[0044] In one possible implementation, when the error loss in the above training module does not meet the preset conditions, a preset neural network model is trained based on the error loss to obtain a trained gaze recognition model. This may include: updating the parameters in the preset neural network model based on the error loss when the error loss does not meet the preset conditions to obtain a trained neural network model; obtaining an initial gaze recognition model based on the second neural network and the linear regression layer in the trained neural network model, wherein the linear regression layer is used to convert the gaze features extracted by the second neural network into gaze information, including the horizontal and vertical angles of the gaze; and training the initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model.
[0045] It is understandable that the second neural network in the trained neural network model of this application can learn more accurate gaze features based on full-face information, but it cannot directly obtain intuitive gaze information. Therefore, in order to apply the model, a linear regression layer can be trained using a small number of face sample images labeled with gaze information to map the gaze features output by the model from the feature space to the gaze space, thereby obtaining intuitive gaze information. Specifically, this application can set the gaze information as two-dimensional data composed of the horizontal and vertical angles of the gaze.
[0046] In one possible implementation, the training module described above trains an initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model. This may include: training the linear regression layer in the initial gaze recognition model based on face sample images labeled with gaze information to obtain a trained gaze recognition model.
[0047] It is understood that this application can fix the network parameters of the second neural network in the pre-trained neural network model, and only train the linear regression layer to learn the linear mapping relationship between gaze features and gaze information. Thus, intuitive gaze angle information can be regressed from abstract gaze features.
[0048] Optionally, when there are many face sample images labeled with gaze information, the above training module can also be used to: train the entire gaze recognition model based on the face sample images labeled with gaze information, so as to fine-tune the gaze recognition model and obtain gaze recognition results.
[0049] In one possible implementation, the first and second images mentioned above can be face sample images without labeled gaze information. Thus, this application can also train a gaze recognition model for identifying gaze information in face images even in the absence of labeled samples with gaze data.
[0050] In a third aspect, this application provides a computing device including one or more processors and one or more memories. The one or more memories are coupled to the one or more processors, and the one or more memories are used to store computer program code, which includes computer instructions. When the one or more processors execute the computer instructions, the computing device performs the image recognition method in any possible implementation of the first aspect described above.
[0051] In a fourth aspect, this application provides a computer storage medium including computer instructions that, when executed on a computing device, cause the computing device to perform the image recognition method in any possible implementation of the first aspect described above.
[0052] In a fifth aspect, this application provides a computer program product that, when run on a computing device, causes the computing device to execute the image recognition method in any possible implementation of the first aspect described above.
[0053] In a sixth aspect, this application provides a chip including a processor and a data interface. The processor reads instructions stored in a memory through the data interface to execute the image recognition method in any possible implementation of the first aspect described above.
[0054] Alternatively, as one implementation, the chip may also include a memory storing instructions, and a processor is used to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the image recognition method in any possible implementation of the first aspect described above.
[0055] In a seventh aspect, this application provides a server or electronic device, including one or more processors and one or more memories. The one or more memories are coupled to the one or more processors, and the one or more memories are used to store computer program code, including computer instructions, which, when executed by the one or more processors, cause the server or electronic device to perform the image recognition method in any possible implementation of the first aspect described above.
[0056] Understandably, for the beneficial effects achieved by the computing device of the third aspect, the computer storage medium of the fourth aspect, the computer program product of the fifth aspect, the chip of the sixth aspect, and the server or electronic device of the seventh aspect, reference may be made to the beneficial effects of the first aspect and any possible implementation thereof, which will not be repeated here.
Attached Figure Description
[0057] Figure 1 is a schematic diagram of a system architecture provided in an embodiment of this application;
[0058] Figure 2 is a schematic diagram of another system architecture provided in an embodiment of this application;
[0059] Figure 3 is a schematic diagram of a chip hardware structure provided in an embodiment of this application;
[0060] Figure 4 is a schematic diagram of a training model provided in an embodiment of this application;
[0061] Figure 5 is a schematic diagram of a vision transformer (ViT) structure provided in an embodiment of this application;
[0062] Figure 6 is a schematic diagram of a self-attention module in a ViT structure provided in an embodiment of this application;
[0063] Figure 7 is a schematic diagram of a transformer module in a ViT structure provided in an embodiment of this application;
[0064] Figure 8 is a flowchart of an unsupervised training method for a gaze recognition model provided in an embodiment of this application;
[0065] Figure 9 is a schematic diagram of the process of injecting a bottleneck in a ViT structure provided in an embodiment of this application;
[0066] Figure 10 is a schematic diagram of the structure of a linear regressor in a training model provided in an embodiment of this application;
[0067] Figure 11 is a schematic diagram of an overall system provided in an embodiment of this application;
[0068] Figure 12 is a flowchart of an image recognition method based on a gaze recognition model provided in an embodiment of this application.
Detailed Implementation
[0069] The technical solutions of the embodiments of this application will now be described with reference to the accompanying drawings. Hereinafter, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise stated, "multiple" means two or more.
[0070] Eye gaze estimation is a crucial technique for understanding human intentions. It infers the direction of a person's gaze, revealing their focus and allowing for a more accurate understanding of their intent. In recent years, with the continuous development of deep neural networks, eye gaze estimation techniques based on deep neural network learning have demonstrated superior performance in numerous eye gaze recognition applications.
[0071] However, current deep neural network-based gaze estimation methods rely on a large amount of labeled gaze data to train the deep neural network. Gaze labeling is expensive and requires significant time, manpower, and resources, making it difficult to collect large amounts of labeled gaze data for every scenario. This results in poor training performance and inaccurate gaze estimation for the deep neural network. Furthermore, in some scenarios, the labeled gaze data needs to be compatible with the hardware devices in the scene. For example, in automotive scenarios, labeled gaze data needs to be compatible with different car models and cameras, further increasing the difficulty of acquiring labeled gaze data. This makes the training cycle for deep neural networks long and inefficient.
[0072] To address the aforementioned issues, this application provides an unsupervised training method for deep neural networks. This method eliminates the need for extensive labeled gaze data to train the deep neural network, saving time and resources associated with gaze labeling, reducing the difficulty of acquiring training data, and shortening the training cycle of the deep neural network. Unsupervised training can refer to a training method that trains a deep neural network using unlabeled training data.
[0073] As a possible solution, a large number of unlabeled human eye images can be used as training data and input into a gaze estimation neural network model. The gaze estimation neural network model can then gradually identify the correlations and potential rules between the training data until it can be used to judge or identify the gaze features of the input human eye images, thereby completing the unsupervised training of the gaze estimation neural network model based on human eye images.
[0074] However, a gaze estimation neural network model obtained through unsupervised training on human eye images accepts only human eye images as input and therefore loses full-face information, especially head pose information. Such a model cannot handle full-face images as input and is not sensitive to head pose, making it difficult to deal with images with large-angle head poses.
[0075] Based on this, embodiments of this application provide an unsupervised training method for a gaze recognition model. This method can obtain the gaze recognition model through unsupervised training using a large number of unlabeled face images, enabling the model to judge or recognize the gaze features of input face images. Thus, during unsupervised training, the gaze recognition model can learn the gaze features of a person by combining eye and facial information. Therefore, the gaze recognition model obtained through unsupervised training can use face images as input. Even with face images showing large head poses, the gaze recognition model exhibits good robustness and can accurately identify gaze information in face images, improving both the performance and accuracy of gaze estimation.
[0076] Furthermore, based on the gaze recognition model obtained through unsupervised training, this application embodiment also provides an image recognition method. When a face image to be recognized is obtained, the face image to be recognized can be used as input to the gaze recognition model obtained through unsupervised training. This allows the gaze recognition model to identify the gaze information of the person based on the full-face information of the face image to be recognized and output the gaze recognition result. Thus, the gaze direction of the person in the face image to be recognized can be obtained based on the gaze recognition result output by the gaze recognition model.
[0077] It should be noted that the application scenarios of the image processing method provided in this application embodiment are not limited, and can be applied to any scenario involving technologies such as gaze estimation and gaze tracking, such as in-vehicle scenarios, game interaction, and home scenarios. For example, in an in-vehicle scenario, the image processing method provided in this application embodiment can be used to identify the driver's gaze direction in the cabin in order to analyze whether the driver is exhibiting abnormal behaviors such as fatigue driving or distraction.
[0078] The system architecture provided in the embodiments of this application will be introduced below.
[0079] Referring to Figure 1, Figure 1 is a schematic diagram of a system architecture provided in an embodiment of this application. As shown in Figure 1, the system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data acquisition device 160.
[0080] The data acquisition device 160 is used to acquire training data. For example, the training data in this embodiment may be sample images used to train a gaze recognition model. These sample images may be images or videos containing human faces.
[0081] Optionally, the training data may include a first image and a second image. The first image may be an unoccluded face sample image, and the second image corresponds to the first image; the second image may be a face sample image with the eyes obscured. Optionally, the second image may be a face sample image obtained by obscuring the eyes in the first image.
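As a minimal sketch of how a second image might be derived from a first image, the following example overwrites an eye region with a constant value. The helper name `occlude_region`, the toy image, and the eye bounding box are illustrative assumptions; the source does not specify how the eye region is located or what value is used to obscure it.

```python
from copy import deepcopy

def occlude_region(image, box, fill=0):
    """Return a copy of `image` with the rectangle `box` overwritten by `fill`.

    image: 2D list of pixel values (rows of a grayscale face image).
    box:   (top, left, bottom, right) with bottom/right exclusive. This is a
           hypothetical eye bounding box, e.g. from a landmark detector.
    """
    top, left, bottom, right = box
    occluded = deepcopy(image)  # leave the first image untouched
    for r in range(max(top, 0), min(bottom, len(image))):
        for c in range(max(left, 0), min(right, len(image[r]))):
            occluded[r][c] = fill
    return occluded

# Toy 6x8 "face" whose pixel value encodes its position (10 * row + column).
first_image = [[10 * r + c for c in range(8)] for r in range(6)]
eye_box = (1, 2, 3, 6)  # assumed eye region: rows 1-2, columns 2-5
second_image = occlude_region(first_image, eye_box)
```

The first image is kept unchanged so that the pair (first image, second image) can be used together as one training sample.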
[0082] After collecting the training data, the data acquisition device 160 can store this training data in the database 130, and the training device 120 can train the target model 101 based on the training data maintained in the database 130. In the embodiments of this application, the target model can also be referred to as the target rule.
[0083] Optionally, the training device 120 in this embodiment can input a first image containing human eye information and a corresponding second image with the human eye information obscured into a preset neural network model to obtain gaze information in the first image and human eye reconstruction results in the second image. Then, the training device 120 trains the preset neural network model based on the human eye reconstruction results to obtain a target model 101 that meets application requirements. The preset neural network model can be a neural network model set according to the application scenario of the training model, or it can be a pre-stored neural network model (e.g., a training model obtained through previous model training).
[0084] In this embodiment of the application, the preset neural network model may include two neural networks. The first neural network can be used to attempt to reconstruct human eye information from a second image that obscures human eye information. The second neural network can be used to extract gaze features containing gaze information from an unobstructed first image that contains human eye information, and inject the gaze features into the first neural network to help the first neural network reconstruct human eye information.
[0085] How the training device 120 obtains the target model 101 based on the training data will be described in more detail below with reference to Figure 4. This trained target model 101 can be used to implement the image processing method provided in the embodiments of this application. For example, after the execution device 110 receives a face image to be recognized from the client device 140, the execution device 110 can input the face image to be recognized into the target model 101 for processing to obtain the gaze information of the face image. In the embodiments of this application, the target model 101 can be used to recognize gaze information in face images; this target model 101 is also called a gaze recognition model.
[0086] It should be noted that in practical applications, the training data maintained in the database 130 may not all come from the data acquisition device 160; it may also be received from other devices. Furthermore, the training device 120 does not necessarily train the target model 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application. It should also be noted that at least a portion of the training data maintained in the database 130 can also be used when the execution device 110 processes the face image to be recognized.
[0087] Optionally, the target model 101 trained using the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in Figure 1. The execution device 110 can be a terminal, such as a mobile phone, tablet, laptop, augmented reality (AR) / virtual reality (VR) device, in-vehicle terminal, etc. Alternatively, it can be a chip that can be applied to these devices, or it can be a server or cloud service. In Figure 1, the execution device 110 is configured with an input / output (I / O) interface 112 for data interaction with external devices. Users can input data to the I / O interface 112 through the client device 140. For example, the input data in this embodiment may be a face image to be processed that requires gaze recognition.
[0088] Preprocessing module 113 and / or preprocessing module 114 can be used to preprocess the input data received by I / O interface 112. For example, in this embodiment, preprocessing module 113 can be used to crop the input face image to be recognized to obtain a standardized face image. Optionally, the size (also called resolution) of the standardized face image can conform to the input size of target model 101. For example, in this embodiment, preprocessing module 114 can be used to segment the input face image to be recognized to obtain a series of image patches, which can be input to target model 101 for processing in the form of a patch sequence.
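The two preprocessing steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `center_crop` and `to_patch_sequence`, the center-crop strategy, and the toy sizes are not taken from the source, which does not say how the crop window is chosen.

```python
def center_crop(image, size):
    """Crop a size x size square from the centre of a 2D list image."""
    h, w = len(image), len(image[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def to_patch_sequence(image, patch):
    """Split a square image into non-overlapping patch x patch blocks,
    scanned row by row, and flatten each block into a 1D list."""
    size = len(image)
    assert size % patch == 0, "image size must be a multiple of the patch size"
    sequence = []
    for top in range(0, size, patch):
        for left in range(0, size, patch):
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            sequence.append(block)
    return sequence

raw = [[10 * r + c for c in range(10)] for r in range(6)]  # 6x10 toy image
face = center_crop(raw, 4)            # 4x4 standardized image
patches = to_patch_sequence(face, 2)  # 4 patches of 4 pixels each
```

Here `center_crop` plays the role attributed to preprocessing module 113 and `to_patch_sequence` the role attributed to preprocessing module 114.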
[0089] In this embodiment, the execution device 110 may include a calculation module 111, which includes a target model 101 trained by the training device 120 based on training data. In this embodiment, preprocessing modules 113 and 114 may be omitted (or only one of them may be used), and the calculation module 111 may be used directly to process the input data. It should be noted that preprocessing module 113 or preprocessing module 114 may preprocess all or only a portion of the input data.
[0090] It should be noted that preprocessing module 113 and / or preprocessing module 114 may also be trained in training device 120. Calculation module 111 can be used to perform calculations and other related processing on the input data from preprocessing module 113 or I / O interface 112 according to the target model 101 described above.
[0091] During the preprocessing of input data by the execution device 110, or during the calculation module 111 of the execution device 110 performing calculations and other related processes, the execution device 110 can call data, code, etc. in the data storage system 150 for corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing into the data storage system 150.
[0092] Finally, the I / O interface 112 can present the processing results, such as the gaze recognition results calculated by the target model 101, to the client device 140, thereby providing them to the user. Alternatively, the I / O interface 112 can also feed the processing results back to other devices for corresponding processing. This application embodiment does not limit this.
[0093] It is understood that the execution device 110 and the client device 140 in the embodiments of this application may be the same device, such as the same terminal device.
[0094] It is worth noting that the training device 120 can train a target model 101 based on different training data for different objectives or tasks (or business). The target model 101 can then be used to achieve the aforementioned objectives or complete the aforementioned tasks, thereby providing the user with the desired results.
[0095] In the system shown in Figure 1, users can manually provide input data, such as a facial image or video of the person whose gaze information is to be recognized. This manual input can be performed through the interface provided by the I / O interface 112. Alternatively, the client device 140 can automatically send input data to the I / O interface 112. If user authorization is required for the client device 140 to automatically send input data, the user can set the corresponding permissions in the client device 140. Users can view the output results of the execution device 110 on the client device 140, which can be presented in various forms such as display, sound, or motion. The client device 140 can also act as a data acquisition terminal that collects the input data fed to the I / O interface 112 and the output results returned by the I / O interface 112, as shown in Figure 1, and stores them in the database 130 as new training data. Alternatively, without going through the client device 140, the I / O interface 112 can directly store the input data fed to it and the output results it returns, as shown in Figure 1, in the database 130 as new training data.
[0096] Understandably, Figure 1 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the diagram do not constitute any limitation. For example, in Figure 1, the data storage system 150 is an external memory relative to the execution device 110. In other cases, the data storage system 150 may also be placed in the execution device 110.
[0097] Referring to Figure 2, Figure 2 is a schematic diagram of another system architecture provided in an embodiment of this application. This system architecture 200 may include at least one electronic device 21 and at least one server 22. The electronic device 21 and the server 22 establish a communication connection through one or more networks. This network may be a local area network (LAN) or a wide area network (WAN), such as the Internet. The network can be implemented using any known network communication protocol, which may be various wired or wireless communication protocols, such as Ethernet, Universal Serial Bus (USB), any cellular network communication protocol (such as 3G / 4G / 5G), Bluetooth, Wireless Fidelity (Wi-Fi), or any other suitable communication protocol.
[0098] The aforementioned server 22 may be, for example, a cloud server, a network server, an application server, or a management server, or other devices or servers with data processing capabilities. The aforementioned electronic device 21 may be, for example, a mobile phone, tablet computer, personal computer (PC), personal digital assistant (PDA), smartwatch, netbook, wearable electronic device, augmented reality (AR) device, virtual reality (VR) device, in-vehicle equipment, smart car, smart speaker, robot, etc.
[0099] In some examples, the electronic device 21 can be the client device 140 shown in Figure 1, and the server 22 can be the execution device 110 and the training device 120 shown in Figure 1. In other examples, the electronic device 21 can be the execution device 110 and the client device 140 shown in Figure 1, and the server 22 can be the training device 120 shown in Figure 1. In some other examples, the electronic device 21 can be the client device 140, the execution device 110, and the training device 120 shown in Figure 1.
[0100] In some embodiments, server 22 can pre-acquire a large amount of training data containing face images, and based on this large amount of training data, train a gaze recognition model using machine learning / deep learning or other methods. This gaze recognition model can be used to predict the gaze information corresponding to a face image, that is, to recognize the gaze of the person in the face image. The gaze information obtained by the gaze recognition model can be a feature vector containing gaze information, or it can be a specific gaze angle value, such as the horizontal and vertical angles of the gaze direction.
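To illustrate how a gaze angle pair relates to a gaze direction, the following sketch converts horizontal (yaw) and vertical (pitch) angles into a 3D unit vector and back. The axis convention used here (z pointing along the zero-angle gaze, y upward) and the function names are assumptions chosen for illustration; the source does not fix a coordinate system.

```python
import math

def angles_to_vector(yaw, pitch):
    """Convert yaw/pitch in radians to a unit gaze vector (assumed convention)."""
    x = math.cos(pitch) * math.sin(yaw)
    y = math.sin(pitch)
    z = math.cos(pitch) * math.cos(yaw)
    return (x, y, z)

def vector_to_angles(vector):
    """Inverse of angles_to_vector for |pitch| < pi/2 and |yaw| < pi."""
    x, y, z = vector
    norm = math.sqrt(x * x + y * y + z * z)
    pitch = math.asin(y / norm)
    yaw = math.atan2(x, z)
    return yaw, pitch

direction = angles_to_vector(0.3, -0.2)
yaw, pitch = vector_to_angles(direction)
```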
[0101] It is understood that the unsupervised training method provided in this application embodiment is applied to the training process of the gaze recognition model. That is, in this embodiment, the server 22 can execute the unsupervised training method for the gaze recognition model provided in this application embodiment.
[0102] Optionally, after training the gaze recognition model, when the server 22 receives a face image whose gaze is to be recognized, it can directly input the face image into the gaze recognition model for processing to obtain the gaze information corresponding to the face image.
[0103] It is understood that the image recognition method provided in this application embodiment is applied to the application process of the trained gaze recognition model. That is to say, in this embodiment, the server 22 can also execute the image recognition method provided in this application embodiment.
[0104] Optionally, server 22 can also perform corresponding operations based on the identified gaze information. For example, server 22 can output different content to the user based on the identified gaze information. Alternatively, server 22 can send the identified gaze information to electronic device 21, which will then present it to the user, or electronic device 21 can output different content to the user based on the identified gaze information. For example, in a vehicle scenario, a gaze recognition model can identify the driver's gaze direction in the cabin, and when driver distraction is detected, a warning message can be output, effectively reducing the occurrence of abnormal driving situations. Furthermore, in gaming, a gaze recognition model can identify the user's gaze direction for game interaction, improving the human-computer interaction experience.
[0105] Optionally, after training the gaze recognition model, the server 22 can also send the gaze recognition model to the electronic device 21. In this way, after receiving a face image whose gaze is to be recognized, either input by the user or captured by the electronic device 21 itself, the electronic device 21 can directly input the face image into the gaze recognition model for processing to obtain the gaze information corresponding to the face image. That is, in this embodiment, the electronic device 21 can also execute the image recognition method provided in this application embodiment.
[0106] Optionally, the system architecture 200 may not include the server 22. That is, the electronic device 21 can acquire a large amount of training data containing face images and train a gaze recognition model based on this data. Then, after receiving a face image whose gaze is to be recognized, either input by the user or captured by the device itself, the electronic device 21 can directly input the face image into the gaze recognition model for processing to obtain the gaze information corresponding to the face image. In other words, in this embodiment, the electronic device 21 alone can execute both the unsupervised training method of the gaze recognition model and the image recognition method provided in this application embodiment.
[0107] The following describes a chip hardware structure provided by an embodiment of this application.
[0108] Referring to Figure 3, Figure 3 shows the hardware structure of a chip system provided in an embodiment of this application. The chip system includes a neural processing unit (NPU) 300. The chip system can be provided in the execution device 110 shown in Figure 1 to perform the calculations of the calculation module 111. The chip system can also be provided in the training device 120 shown in Figure 1 to complete the training work of the training device 120 and output the target model 101. The algorithms of each neural network in the preset neural network model provided in the embodiments of this application can all be implemented in the chip system shown in Figure 3.
[0109] The Neural Processing Unit (NPU) 300 is mounted as a coprocessor on the main central processing unit (CPU) (Host CPU), and tasks are assigned by the host CPU. The core of the NPU is the arithmetic circuit 303, which is controlled by the controller 304 to retrieve matrix data from the memory (weight memory or input memory) and perform multiplication operations.
[0110] In some implementations, the arithmetic circuit 303 internally includes multiple processing engines (PEs). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.
[0111] For example, suppose we have an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit retrieves the corresponding data of matrix B from the weight memory 302 and caches it in each PE of the arithmetic circuit. The arithmetic circuit retrieves the data of matrix A from the input memory 301 and performs matrix operations with matrix B. The partial result or the final result of the obtained matrix is stored in the accumulator 308.
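The data flow in this example can be imitated in software as a matrix multiplication that accumulates partial results. The sketch below is a sequential illustration only, not a model of the actual hardware: the function name `tiled_matmul`, the slice width `tile`, and the loop order are assumptions, and a real arithmetic circuit performs these multiply-accumulate steps in parallel.

```python
def tiled_matmul(a, b, tile=2):
    """Compute C = A x B by accumulating partial products over slices
    of the shared (inner) dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    # Plays the role of the accumulator: holds partial, then final, results.
    accumulator = [[0] * cols for _ in range(rows)]
    for start in range(0, inner, tile):
        stop = min(start + tile, inner)
        b_tile = b[start:stop]  # slice of weight matrix B held fixed (cached)
        for i in range(rows):   # rows of input matrix A stream past the weights
            for j in range(cols):
                partial = sum(a[i][k] * b_tile[k - start][j]
                              for k in range(start, stop))
                accumulator[i][j] += partial
    return accumulator

A = [[1, 2, 3], [4, 5, 6]]        # input matrix A
B = [[7, 8], [9, 10], [11, 12]]   # weight matrix B
C = tiled_matmul(A, B)            # [[58, 64], [139, 154]]
```

After the first slice the accumulator holds only partial sums; it holds the final output matrix C once every slice of the inner dimension has been processed.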
[0112] The vector computation unit 307 can further process the output of the arithmetic circuit 303, such as vector multiplication, vector addition, exponentiation, logarithmic operations, size comparisons, etc. For example, the vector computation unit 307 can be used for network computation in non-convolutional / non-FC layers of neural networks, such as pooling, batch normalization, local response normalization, etc.
[0113] In some implementations, the vector computation unit 307 stores the processed output vector into the unified memory 306. For example, the vector computation unit 307 can apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values.
[0114] In some implementations, vector computation unit 307 generates normalized values, merged values, or both.
[0115] In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 303, for example, for use in subsequent layers of the neural network.
[0116] The unified memory 306 is used to store input data and output data. Through the direct memory access controller (DMAC) 305, input data in the external memory is transferred to the input memory 301 and / or the unified memory 306, weight data in the external memory is stored in the weight memory 302, and data in the unified memory 306 is stored in the external memory.
[0117] The bus interface unit (BIU) 310 is used to enable interaction between the host CPU, the DMAC, and the instruction fetch memory 309 via the bus.
[0118] The instruction fetch memory 309, connected to the controller 304, is used to store the instructions used by the controller 304. The controller 304 uses the instructions cached in the instruction fetch memory 309 to control the operation of the arithmetic accelerator.
[0119] Generally, the unified memory 306, input memory 301, weight memory 302, and instruction fetch memory 309 are all on-chip memories, while the external memory is memory outside the NPU. This external memory can be double data rate synchronous dynamic random access memory (DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.
[0120] In the preset neural network model provided in this application embodiment, the operations of each neural network can be performed by the arithmetic circuit 303 or the vector computation unit 307.
[0121] Optionally, the methods provided in the various embodiments of this application can be processed by a CPU, or by a combination of a CPU and a GPU, or they can be processed without a GPU, using other processors suitable for neural network computing. This application does not impose any restrictions.
[0122] The method provided in this application will be described below with reference to the accompanying drawings, focusing on both the training and application aspects of the gaze recognition model.
[0123] Referring to the previous introduction to the training process of the gaze recognition model (also known as target model 101), the training data is symbolized and formalized through intelligent information modeling, extraction, preprocessing, and training by a preset neural network model, and finally the trained target model 101 can be obtained.
[0124] The following describes the network architecture of the preset neural network model involved in the training phase of the gaze recognition model provided in the embodiments of this application.
[0125] The preset neural network model can include two neural networks. The first neural network can be used to attempt to reconstruct human eye information from a second image that obscures human eye information. The second neural network can be used to extract gaze features containing gaze information from an unobstructed first image that contains human eye information, and inject these gaze features into the first neural network to help the first neural network reconstruct human eye information.
[0126] Referring to Figure 4, in this embodiment, the preset neural network model can be a neural network model based on an encoder-decoder architecture. Optionally, the first neural network in the preset neural network model can be an autoencoder neural network. An autoencoder neural network can learn the implicit features of the input data, i.e., encoding, and simultaneously reconstruct the original input data using the learned implicit features, i.e., decoding.
[0127] As shown in Figure 4, an autoencoder neural network can contain an encoder and a decoder. The encoder encodes the input second image, which obscures human eye information, into a feature vector, which reflects the implicit features present in the input second image. The decoder reconstructs the human eye information based on this feature vector.
[0128] It is understandable that, since the human eye information in the second image is obscured, most of the latent features extracted by the encoder here are abstract features of facial information. That is, the encoder in the first neural network can learn the facial information in the face image, which may include head pose information.
[0129] Optionally, the second neural network in the preset neural network model can be an encoder for extracting hidden features from the input data. This encoder can be used to extract gaze features containing gaze information from the input unobstructed first image that includes human eye information. That is, the encoder in the second neural network can learn the gaze information of the human eyes in the face image.
[0130] In this embodiment, the gaze features extracted by the second neural network also need to be injected into the first neural network, so that the first neural network can reconstruct human eye information through the decoder based on the facial information it has learned and the gaze information injected by the second neural network.
[0131] Optionally, as shown in Figure 4, the second neural network may also include fully connected (FC) layers. A fully connected layer is one where all nodes in the previous layer are connected to all nodes in the next layer, allowing for the synthesis of previously extracted features. In some embodiments, the fully connected layer can act as an upsampling layer to increase the dimensionality of the feature vector. In other embodiments, the fully connected layer can also act as a downsampling layer to reduce the dimensionality of the feature vector, compressing its information and ensuring its high compactness.
[0132] In this embodiment, after the encoder in the second neural network extracts the gaze features, it can reduce the dimensionality of these gaze features through a fully connected layer to obtain a compressed, low-dimensional gaze-related feature, which can also be called a bottleneck feature. This reduces the computational load of the neural network when injecting the bottleneck feature while avoiding information loss.
[0133] Since the compressed low-dimensional gaze features obtained from the second neural network need to have the same dimension as the feature vector extracted from the first neural network when injected into the first neural network, after obtaining the low-dimensional gaze features, it is also necessary to increase the dimension of the low-dimensional gaze features through a fully connected layer to obtain gaze features with the same dimension as the feature vector extracted from the first neural network.
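The dimension reduction and subsequent dimension increase described in the two preceding paragraphs can be sketched with two fully connected layers. All sizes, the random initialization, and the helper names `make_linear` and `linear` are illustrative assumptions; the source does not state the feature or bottleneck dimensions.

```python
import random

def make_linear(in_dim, out_dim, rng):
    """Create a fully connected layer as (weights, bias) with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias

def linear(x, layer):
    """Apply a fully connected layer: every input feeds every output."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

rng = random.Random(0)
FEATURE_DIM, BOTTLENECK_DIM = 16, 2  # assumed sizes, not from the source

fc_down = make_linear(FEATURE_DIM, BOTTLENECK_DIM, rng)  # dimension reduction
fc_up = make_linear(BOTTLENECK_DIM, FEATURE_DIM, rng)    # dimension increase

# Stand-in for the gaze feature produced by the second network's encoder.
gaze_feature = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
bottleneck = linear(gaze_feature, fc_down)  # compressed bottleneck feature
restored = linear(bottleneck, fc_up)        # matches the first network's dimension
```

Because everything the decoder receives from the first image must pass through the low-dimensional `bottleneck`, the bottleneck width limits how much information from the unoccluded image can reach the first neural network.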
[0134] In this embodiment, after the first neural network extracts the feature vector of the second image, it can concatenate this feature vector with the gaze features extracted by the second neural network to obtain a concatenated feature vector, thus injecting the gaze features extracted by the second neural network into the first neural network. Then, the first neural network can reconstruct the human eye image from the concatenated feature vector using a decoder. In this way, by injecting small-dimensional features containing gaze information into the first neural network, the second neural network can help the first neural network reconstruct human eye information.
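The injection by concatenation can be sketched as follows. Several details here are assumptions made only for illustration: the concatenation is done along the feature axis, the decoder is replaced by a single linear map, and a mean squared error is used as the reconstruction loss. The source states only that the features are concatenated and that the model is trained based on the human eye reconstruction results.

```python
import random

def linear(x, weights):
    """Stand-in decoder layer: one weight row per output pixel."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def mse(pred, target):
    """Mean squared error between a reconstruction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

rng = random.Random(1)
DIM, EYE_PIXELS = 8, 12  # assumed sizes, not from the source

# Feature vector from the occluded second image (first neural network).
face_feature = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
# Gaze feature from the first image, after the fully connected layers.
gaze_feature = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

# Injection: concatenate the two features into one vector.
concatenated = face_feature + gaze_feature

decoder_weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * DIM)]
                   for _ in range(EYE_PIXELS)]
reconstructed_eyes = linear(concatenated, decoder_weights)

# Target: the eye pixels of the unoccluded first image (random stand-in here).
true_eyes = [rng.uniform(0.0, 1.0) for _ in range(EYE_PIXELS)]
loss = mse(reconstructed_eyes, true_eyes)
```

No gaze label appears anywhere in this computation: the training signal comes only from comparing the reconstructed eye region with the eye region of the unoccluded image.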
[0135] In this embodiment, the encoder in the second neural network needs to share network weights with the encoder in the first neural network. It can be understood that the network weights of the encoder in the first neural network reflect the learning ability of facial information. When the network weights of the encoder in the second neural network are shared with those of the encoder in the first neural network, the encoder in the second neural network can learn both human gaze information and facial information. Therefore, the encoder in the second neural network can learn more accurate gaze characteristics by combining the learning of facial features.
[0136] Optionally, the encoder in the second neural network can be the same encoder as the encoder in the first neural network. That is, when the first image containing human eye information and its corresponding second image with human eye information obscured are input into the preset neural network model, they pass through the same encoder. After extracting the gaze information of the human eyes from the first image and the facial information from the second image, the encoder can pass each to its respective subsequent neural network.
[0137] Optionally, the encoder in the second neural network can be a different encoder from the encoder in the first neural network, but have the same network weights.
[0138] Since the encoder's input and output are both one-dimensional vector sequences, and the data format of image data is clearly incompatible with the encoder's, this embodiment uses an embedding module to convert the image data to obtain an input sequence that meets the encoder's requirements.
[0139] Optionally, the embedding module can work together with the aforementioned encoding / decoding architecture to form a preset neural network model. In this case, when an input image is fed into the preset neural network model, the neural network model can first convert the image into an input sequence through the embedding module, which is then input to the encoder for subsequent processing. Optionally, the embedding module can also be independent of the neural network model; that is, the input image is first converted into an input sequence through the embedding module, and then input into the preset neural network model for processing. This application does not impose limitations on the embodiments described.
[0140] In some embodiments, the encoder-decoder architecture in the preset neural network model can be a vision transformer (ViT) structure. As shown in Figure 5, the vision transformer structure mainly consists of an embedding module and a transformer module.
[0141] The embedding module converts the image input to the neural network model into a series of embedded patches, which then serve as the input sequence conforming to the transformer module's input format. The transformer module encodes a series of feature vectors corresponding to each embedded patch based on the correlation between any two embedded patches. In other words, each feature vector, in addition to its own feature information, incorporates feature information from other embedded patches based on their correlation. Thus, the transformer module can effectively integrate global information from the face image to extract its features, achieving good performance in many computer vision tasks.
[0142] As shown in Figure 5, after receiving an input image, the vision transformer in the neural network model can segment the image into blocks, resulting in a series of image patches, also known as a patch sequence. For example, a 224×224 input image can be segmented into a 14×14 grid of image patches, forming a patch sequence of length 196.
[0143] Then, the neural network model can input the image patch sequence into the embedding module. The embedding module obtains a series of embedding block vectors, i.e., the embedding block sequence, by performing linear mapping (i.e., patch embedding) on each image patch in the image patch sequence and adding position embedding. The neural network model can use this embedding block sequence as the input sequence of the encoder of the transformer module and input it into the transformer module.
[0144] In this embodiment, the embedding module can first flatten each image patch in the image patch sequence, that is, unfold each image patch into a one-dimensional vector of a certain length. This vector is called the flattened patch vector. The one-dimensional vectors of all image patches can form a vector sequence. Then, the embedding module can use a fully connected layer to reduce the dimensionality of the vector sequence to obtain a fixed-length embedding patch vector sequence, i.e., the embedding patch sequence. This process is also called the linear projection of flattened patches.
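The segmentation, flattening, and linear projection steps described in paragraphs [0142] to [0144] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `patchify`, the embedding length of 192, the three colour channels, and the random projection weights are all assumptions made for the example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))        # synthetic stand-in for a face image
flat = patchify(image, patch_size=16)    # flattened patch vectors, one per patch

embed_dim = 192                          # assumed length of an embedded patch vector
w_proj = rng.normal(0.0, 0.02, (flat.shape[1], embed_dim))  # untrained weights
b_proj = np.zeros(embed_dim)
embedded = flat @ w_proj + b_proj        # linear projection of flattened patches
```

With a 224×224 image and 16×16 patches, `flat` holds 196 vectors of length 768 and `embedded` holds 196 vectors of the assumed length 192.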
[0145] Since the positional information of each image patch relative to the original image is lost when the original image is broken down into image patches, it is necessary to encode the original positional information of the image and add it to the embedding block vector, i.e., positional embedding. Positional embedding encodes the relative position of the image patch in the image using a sine function, so that the input embedding block sequence contains positional information. The formula for calculating positional embedding is:
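[0146] $$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
[0147] $$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$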
[0148] Where pos represents the position of the image patch in the image patch sequence, i represents the dimension index of the embedded patch vector, and d_model represents the length (i.e., the vector dimension) of the embedded block vector.
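A positional embedding table can be computed as in the sketch below. It assumes the standard sinusoidal form used in Transformer models, with sine on even dimensions and cosine on odd dimensions; the function name, the constant 10000, and the embedding length of 192 are assumptions for illustration and are not confirmed by this application.

```python
import numpy as np

def sinusoidal_position_embedding(num_positions, d_model):
    """Return a (num_positions, d_model) table; d_model is assumed to be even."""
    pos = np.arange(num_positions)[:, None]        # patch position in the sequence
    i = np.arange(d_model // 2)[None, :]           # index of each sine/cosine pair
    angle = pos / np.power(10000.0, 2 * i / d_model)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angle)                 # even dimensions
    table[:, 1::2] = np.cos(angle)                 # odd dimensions
    return table

# One row per image patch; each row is added to the matching embedded patch vector.
pe = sinusoidal_position_embedding(num_positions=196, d_model=192)
```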
[0149] As shown in Figure 5, the neural network model can input the sequence of embedded blocks containing positional information into the transformer module, and the transformer module can encode the sequence of embedded blocks to obtain stacked feature vectors as output. These stacked feature vectors can form a feature vector sequence of a certain length.
[0150] The transformer module, also known as the transformer model or transformer structure, is a multi-layer neural network based on a self-attention module. When the transformer module processes an embedding block, the self-attention module allows it to examine other locations in the input embedding block sequence to find cues that help better encode that embedding block. Specifically, the self-attention module is a neural network structure that can be used to calculate the correlation between each embedding block in the input embedding block sequence and extract information between the embedding blocks according to their correlation.
[0151] For example, please refer to Figure 6, which shows a schematic diagram of the structure of a self-attention module. As shown in Figure 6, for an input embedding block x, the self-attention module first converts it into three vectors and then multiplies these three vectors by three weight matrices, which can be denoted as W1, W2, and W3, to obtain three new vectors: Query (Q), Key (K), and Value (V). Then, the self-attention module calculates the dot product between Q and K (which can be done using the matmul function) to obtain the relevance of each embedding block to the other embedding blocks. This relevance can be understood as an attention score. To prevent the dot product result from being too large, it is scaled, i.e., divided by the scale factor √d_k, where d_k is the dimension of the Q and K vectors. Then, the self-attention module performs a softmax operation on the calculated relevance to normalize the result into a probability distribution. Finally, the self-attention module multiplies the normalized result by the V vector to obtain the V vector with attention weights. The calculation formula for this process is:
[0152] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
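The self-attention computation described in paragraph [0151] can be sketched in NumPy as follows. The weight matrices are random and untrained, and the sizes (196 embedding blocks of length 64, projected to 32) are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of embedding blocks."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v       # Query, Key, Value
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)           # scaled relevance between all pairs
    weights = softmax(scores, axis=-1)        # each row becomes a probability distribution
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(196, 64))                # stand-in embedding block sequence
w_q, w_k, w_v = (rng.normal(0.0, 0.1, (64, 32)) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` sums to 1, so every output vector is a weighted average of the Value vectors of all embedding blocks.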
[0153] The transformer module can be further divided into an encoder and a decoder, which can also be called the encoding module and decoding module. The encoder and decoder mentioned above can both be transformer modules in the ViT architecture. Since the encoder and decoder have basically similar structures, the following description will use the structure of the encoder as an example, and the structure of the decoder will not be repeated.
[0154] The encoder of a transformer module can include any number of encoding submodules. Similarly, the decoder of a transformer module can also include any number of decoding submodules. The number of encoding submodules and the number of decoding submodules can be different.
[0155] For example, please refer to Figure 7, which shows a schematic diagram of the structure of an encoding submodule, denoted Lx in Figure 7. Each encoding submodule in the encoder described above can have the structure of Lx shown in Figure 7. As shown in Figure 7, each encoding submodule mainly consists of a multi-head attention module, a layer normalization module, a multi-layer perceptron (MLP), and skip connections. The multi-head attention module is composed of multiple parallel self-attention modules.
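One possible arrangement of the four components named above is sketched below. The ordering (layer normalization before each sub-block, as in common ViT implementations), the ReLU activation, the omission of learnable normalization parameters and biases, and all sizes are assumptions; the exact wiring of Lx in Figure 7 is not recoverable from the text.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, num_heads):
    n, d = x.shape
    d_head = d // num_heads

    def split(m):
        # Split the feature axis into parallel heads: (heads, n, d_head)
        return m.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["w_q"]), split(x @ p["w_k"]), split(x @ p["w_v"])
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v                                  # (heads, n, d_head)
    merged = heads.transpose(1, 0, 2).reshape(n, d)      # concatenate the heads
    return merged @ p["w_o"]

def mlp(x, p):
    hidden = np.maximum(x @ p["w_1"], 0.0)               # ReLU, chosen for brevity
    return hidden @ p["w_2"]

def encoding_submodule(x, p, num_heads=4):
    x = x + multi_head_attention(layer_norm(x), p, num_heads)  # first skip connection
    x = x + mlp(layer_norm(x), p)                              # second skip connection
    return x

rng = np.random.default_rng(0)
d = 64
shapes = {"w_q": (d, d), "w_k": (d, d), "w_v": (d, d), "w_o": (d, d),
          "w_1": (d, 4 * d), "w_2": (4 * d, d)}
params = {name: rng.normal(0.0, 0.05, shape) for name, shape in shapes.items()}
tokens = rng.normal(size=(196, d))
encoded = encoding_submodule(tokens, params)
```

Stacking several such submodules, each with its own parameters, gives an encoder whose output sequence has the same length as its input sequence.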
[0156] In this embodiment, the embedded patches are encoded by the stacked encoding submodules in the transformer module to obtain stacked feature vector outputs. For example, a 224×224 face image is input, divided into 14×14 image patches, each image patch being 16×16 in size. All image patches can form an embedded patch sequence of length 196, which can be processed by the transformer encoder to obtain a feature vector sequence of length 196.
[0157] The following will use the ViT structure as an example to illustrate the unsupervised training method of the gaze recognition model provided in this application embodiment.
[0158] In this embodiment of the application, the unsupervised training method for the gaze recognition model can be performed by the training device 120 shown in Figure 1. The training device 120 can be the electronic device 21 shown in Figure 2 or the server 22 shown in Figure 2; this embodiment does not limit this. Here, an electronic device is taken as an example to specifically illustrate the unsupervised training method for the gaze recognition model provided in this embodiment. As shown in Figure 8, the training method for the gaze recognition model provided in this application embodiment may include S801-S804:
[0159] S801, The electronic device acquires a first image and a second image, wherein the first image is an unobstructed face sample image and the second image is a face sample image with eyes obscured.
[0160] In this embodiment of the application, when the electronic device trains a preset neural network model to obtain a trained gaze recognition model, the electronic device first acquires training data for model training. This training data may include multiple sets of training samples. Optionally, an unobstructed face sample image (i.e., the first image) and a face sample image with occluded eyes (i.e., the second image) corresponding to the unobstructed face sample image constitute a set of training samples. The first image is a full-face image containing eye information, and the second image is a full-face image without eye information.
[0161] In some embodiments, the electronic device can acquire a large number of first images containing human faces as training data for a preset neural network model. The electronic device can then perform eye occlusion processing on the acquired first images to obtain a second image with the eyes occluded corresponding to the first image.
[0162] Optionally, the electronic device can generate an eye mask for the first image. Here, a mask, also called a veil, refers to a portion of the image that is obscured, typically represented by completely black pixels. In some embodiments, the electronic device can identify the human eye region in the first image to directly generate a mask for that region. In other embodiments, the electronic device can also randomly generate a mask for regions in the first image to occlude the human eye region. For example, the electronic device can randomly generate a mask for 75% of the image region in the first image to obtain the second image shown in Figure 4.
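The random masking option can be sketched as below. The helper names, the 16×16 patch granularity, and the synthetic image are assumptions for illustration. Note that a purely random mask covers the eye region only with high probability; an implementation that must guarantee occlusion of the eyes would use the eye-region detection option instead.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array in which True marks a patch to occlude."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def occlude(image, mask, patch_size):
    """Black out the masked patches of an (H, W, C) image."""
    out = image.copy()
    grid_w = image.shape[1] // patch_size
    for idx in np.flatnonzero(mask):
        r, c = divmod(idx, grid_w)           # patch row and column in the grid
        out[r * patch_size:(r + 1) * patch_size,
            c * patch_size:(c + 1) * patch_size] = 0.0
    return out

rng = np.random.default_rng(0)
first_image = rng.random((224, 224, 3)) + 0.1    # synthetic, strictly positive pixels
mask = random_patch_mask(196, 0.75, rng)         # 147 of the 196 patches are masked
second_image = occlude(first_image, mask, patch_size=16)
```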
[0163] Optionally, the electronic device can acquire facial images with different head poses as training data to train a pre-defined neural network model, thereby obtaining a trained gaze recognition model. This improves the sensitivity of the trained gaze recognition model to head poses in facial images. Even with facial images of large-angle head poses as input, the gaze recognition model of this application exhibits good robustness and can accurately identify gaze information in facial images, improving both the performance of the gaze recognition model and the accuracy of gaze estimation.
[0164] Optionally, after collecting unlabeled face images, the electronic device can preprocess the face images to obtain training samples that meet the training requirements. For example, the electronic device can crop the collected face images to a standardized size before using them as training samples.
[0165] S802, the electronic device inputs the first image and the second image into a preset neural network model to obtain the human eye reconstruction result of the second image.
[0166] In this embodiment of the application, the preset neural network model may include two neural networks. The first neural network can be used to attempt to reconstruct human eye information from a second image that obscures human eye information. The second neural network can be used to extract gaze features containing gaze information from an unobstructed first image that contains human eye information, and inject the gaze features into the first neural network to help the first neural network reconstruct human eye information.
[0167] As one implementation, the preset neural network model can have the neural network structure shown in Figure 4. Both the first and second neural networks can be ViT structures. As shown in Figure 4, the first neural network includes a ViT encoder and a ViT decoder, while the second neural network shares the ViT encoder with the first. Both the encoder and decoder have the aforementioned transformer module structure. Optionally, the ViT decoder can be smaller than the ViT encoder. For example, if the ViT encoder contains 6 encoding sub-modules, the ViT decoder can contain 4 decoding sub-modules.
[0168] When the electronic device inputs the first image and the second image into the preset neural network model, since both the first and second neural networks are ViT structures, the ViT structure can treat the first and second images as sequences of image blocks, that is, segment each image into blocks for processing, as with the first and second images shown in Figure 4. The image blocks are then encoded by the ViT encoder to obtain the stacked feature vectors corresponding to the first image and the stacked feature vectors corresponding to the second image.
[0169] When the electronic device inputs the first image and the second image into a preset neural network model, the input for the first neural network is the second image with the human eye information obscured, while the input for the second neural network is the first image of the human face without obscuration and with complete features.
[0170] Referring to the aforementioned introduction to the ViT structure, the vision transformer structure mainly consists of an embedding module and a transformer module. Therefore, when the electronic device inputs a first image and a second image into a preset neural network model, the embedding module in the preset neural network model can convert the second image input to the first neural network into a series of embedding blocks. These embedding block sequences are then encoded by the ViT encoder of the first neural network to obtain stacked feature vectors. Similarly, the embedding module in the preset neural network model can also convert the first image input to the second neural network into a series of embedding blocks. These embedding block sequences are then encoded by the ViT encoder of the second neural network to obtain stacked feature vectors.
[0171] For example, when the first neural network is input with 14×14=196 image blocks obtained after the second image segmentation process, if the second image occludes 75% of the image blocks, it is equivalent to inputting only 49 image blocks carrying facial information. After being encoded by the ViT encoder, 49 feature vectors can be obtained. Adding 147 mask feature vectors containing position embeddings but no information, 196 stacked feature vectors can be obtained.
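The bookkeeping in this example (49 encoded vectors plus 147 mask vectors giving 196 stacked vectors) can be sketched as follows. The encoder output and position embeddings are random stand-ins, and representing a mask feature vector as a zero token plus its position embedding is an assumption about how "position embeddings but no information" is realised.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 64                          # dim is an assumed feature length

mask = np.zeros(num_patches, dtype=bool)            # True marks an occluded image block
mask[rng.choice(num_patches, size=147, replace=False)] = True

pos_embed = rng.normal(size=(num_patches, dim))     # stand-in position embeddings
visible_features = rng.normal(size=(49, dim))       # stand-in for the encoder output
mask_token = np.zeros(dim)                          # carries no image information

full = np.empty((num_patches, dim))
full[~mask] = visible_features                      # 49 encoded feature vectors
full[mask] = mask_token + pos_embed[mask]           # 147 mask feature vectors
```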
[0172] For the second neural network, since the first input image is a complete, unobstructed face image, the 14×14=196 image blocks obtained after segmentation of the first input image can be directly encoded by the ViT encoder to obtain 196 stacked feature vectors. These feature vectors contain the gaze features in the first image.
[0173] Since the features extracted by the second neural network need to be fed into the first neural network, these features can be compressed to a lower dimension to reduce the training complexity of the neural network. Optionally, after the ViT encoder of the second neural network produces the stacked feature vectors, these stacked feature vectors can be average-pooled to obtain a single feature vector. This single feature vector is then downsampled into a low-dimensional vector (e.g., 16-dimensional) using a fully connected layer (also called a linear layer), yielding the low-dimensional gaze feature shown in Figure 4. This low-dimensional feature is also called a bottleneck feature. For example, please refer to Figure 9, which illustrates the algorithm flow for the bottleneck feature. The encoder of the second neural network extracts features from the image containing the eyes, then downsamples them into low-dimensional gaze-related features through a fully connected layer.
[0174] Since the compressed low-dimensional gaze feature obtained from the second neural network needs to have the same dimension as the feature vector extracted by the first neural network when injected into the first neural network, the features extracted by the second neural network also need to undergo dimensionality upscaling. Optionally, as shown in Figure 4, after obtaining a single low-dimensional gaze feature, the second neural network can use a fully connected layer to upscale this low-dimensional gaze feature, resulting in a single gaze feature with the same dimension as the feature vector extracted by the first neural network. Then, the second neural network injects this upscaled single gaze feature into the first neural network, concatenating the single gaze feature with the feature vector extracted by the first neural network.
[0175] For example, referring again to Figure 9, the second neural network downsamples the encoder output features into low-dimensional gaze-related features, i.e., bottleneck features, through a fully connected layer, and then upsamples these low-dimensional gaze-related features through another fully connected layer to obtain the feature to be injected into the first neural network.
[0176] For example, when the ViT encoder of the second neural network outputs 196 stacked feature vectors, the second neural network can use average pooling to obtain a single feature vector from the 196 stacked feature vectors. After downsampling the single feature vector to obtain the bottleneck feature, the bottleneck feature can be up-dimensionalized and concatenated with the 196 stacked feature vectors output by the ViT encoder of the first neural network to obtain 197 feature vectors.
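The pooling, bottleneck, upscaling, and concatenation steps in this example can be sketched as below. Both encoder outputs are random stand-ins, the fully connected layers are untrained and written without biases, the feature length of 64 is assumed, and appending the injected vector at the end of the sequence (rather than at the start) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, bottleneck_dim = 64, 16

gaze_tokens = rng.normal(size=(196, dim))    # stand-in: second network's encoder output
face_tokens = rng.normal(size=(196, dim))    # stand-in: first network's stacked vectors

w_down = rng.normal(0.0, 0.1, (dim, bottleneck_dim))   # fully connected layer, down
w_up = rng.normal(0.0, 0.1, (bottleneck_dim, dim))     # fully connected layer, up

pooled = gaze_tokens.mean(axis=0)            # average pooling to a single vector
bottleneck = pooled @ w_down                 # 16-dimensional bottleneck feature
injected = bottleneck @ w_up                 # back to the first network's dimension

# Concatenate along the sequence axis: 196 + 1 = 197 feature vectors for the decoder.
decoder_input = np.concatenate([face_tokens, injected[None, :]], axis=0)
```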
[0177] After obtaining a feature vector that incorporates low-dimensional gaze features, the first neural network can decode this feature vector using a ViT decoder to obtain the human eye reconstruction result of the second image. This human eye reconstruction result is the actual output of the preset neural network model. Optionally, the ViT decoder can reconstruct only the eye region in the second image, or it can reconstruct the complete, unoccluded second image.
[0178] S803, the electronic device calculates the distance between the human eye reconstruction result and the original human eye.
[0179] In this embodiment, after obtaining the human eye reconstruction result of the second image output by a preset neural network model, the electronic device can calculate the distance between the human eye reconstruction result and the original human eye. Here, the human eye reconstruction result is the actual output of the preset neural network model, and the original human eye is the expected output of the preset neural network model, i.e., the training target. The distance between the human eye reconstruction result and the original human eye can be understood as the error between the reconstructed human eye and the original human eye. In some embodiments, the original human eye can be the human eye image in an unobstructed first image.
[0180] In this embodiment of the application, the electronic device can calculate the distance between the reconstructed human eye and the original human eye using a loss function.
[0181] It is understandable that, during neural network training, the actual output of the network is compared with the expected output (the training objective) so that the actual output approaches the expected output as closely as possible. The weight matrix of each layer is then updated according to the difference between the two (an initialization process usually precedes the first update, in which parameters are preconfigured for each layer). For example, if the actual output is too high, the weight matrices are adjusted to lower it, and the adjustment continues until the network produces the expected value. It is therefore necessary to predefine how the difference between the actual output and the expected output is measured, which is the role of the loss function. The higher the value of the loss function, the greater the difference, so training a neural network becomes a process of minimizing this loss.
[0182] Optionally, the loss function can be the mean squared error loss (MSE Loss), also known as L2 Loss, used to calculate the L2 distance between the reconstructed eye region and the original eye region. The formula for calculating the mean squared error loss function is as follows:
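[0183] $$\mathrm{MSE}(t, y) = \frac{1}{n}\sum_{i=1}^{n}\left(t_i - y_i\right)^2$$
[0184] Where t represents the expected output value of the preset neural network model, y represents the actual output value of the preset neural network model, and n is the number of output values being compared.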
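The mean squared error described in paragraph [0182] can be computed as in the short sketch below, which averages the squared differences over all compared values. The function name and the four sample values are made up for illustration.

```python
import numpy as np

def mse_loss(expected, actual):
    """Mean of the squared element-wise differences between t and y."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((expected - actual) ** 2))

t = np.array([0.0, 0.5, 1.0, 0.25])   # expected output, e.g. original eye pixels
y = np.array([0.0, 0.0, 1.0, 0.75])   # actual output, e.g. reconstructed eye pixels
loss = mse_loss(t, y)                 # (0 + 0.25 + 0 + 0.25) / 4
```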
[0185] Optionally, the loss function can also be other loss functions, such as L1 norm loss function, L2 norm loss function, and other loss functions in the prior art, or various modifications of existing loss functions. This application does not limit this.
[0186] S804. If the distance between the reconstructed human eye and the original human eye does not meet the preset conditions, the electronic device trains a preset neural network model based on the reconstructed human eye to obtain a gaze recognition model.
[0187] The preset condition can be a preset convergence condition corresponding to the convergence of the neural network model. Optionally, the preset condition can be a preset convergence threshold. When the distance between the reconstructed human eye result and the original human eye is less than the convergence threshold, the preset neural network model can be considered to have converged, and the preset neural network model has been trained. This trained neural network model is the gaze recognition model implemented in this application. It can be understood that the smaller the preset threshold, the higher the requirements for model training, and the better the effect that the gaze recognition model can achieve after training. It can be understood that when the distance between the reconstructed human eye result and the original human eye is greater than or equal to the convergence threshold, the preset neural network model can be considered not to have converged, and the preset neural network model has not been trained yet, and further training of the neural network model is still required.
[0188] Optionally, if the distance between the reconstructed human eye and the original human eye is calculated using the mean squared error loss function, then when the mean squared error loss function does not meet the preset conditions, the electronic device can train a preset neural network model based on the mean squared error loss function to obtain a gaze recognition model.
[0189] Optionally, the neural network can employ backpropagation (BP) to correct the parameters in the preset neural network model during training, thereby reducing the reconstruction error loss. Specifically, forward propagation of the input signal to the output generates an error loss. This error loss is propagated backward to update the parameters in the preset neural network model, completing one training iteration for the current first and second images. The same method is used to train on the other first and second images. After several rounds of training iterations, the error loss decreases only slowly or converges, at which point the model can be considered converged, resulting in a well-trained gaze recognition model. The backpropagation algorithm is a backward pass driven by the error loss, aimed at obtaining the optimal parameters of the neural network model, such as the weight matrices. Generally, the longer the training time, the better the model's performance; however, after a certain number of rounds, the model's performance will no longer improve.
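The loop of S802 to S804 (forward pass, distance calculation, threshold check, parameter update) can be illustrated with a deliberately simplified stand-in. The sketch below replaces the ViT model with a single linear layer trained by plain gradient descent on synthetic data; the threshold, learning rate, and round limit are assumed values, and none of this reflects the actual model size or optimiser.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 8))       # stand-in for the model's inputs
true_w = rng.normal(size=(8, 4))
targets = inputs @ true_w                # stand-in for the original eye pixels

w = np.zeros((8, 4))                     # parameters being trained
threshold, lr, max_rounds = 1e-4, 0.1, 5000

for round_idx in range(max_rounds):
    recon = inputs @ w                   # forward propagation
    error = recon - targets
    loss = np.mean(error ** 2)           # MSE distance to the training target
    if loss < threshold:                 # preset convergence condition is met
        break
    grad = 2.0 * inputs.T @ error / error.size   # gradient of the MSE w.r.t. w
    w -= lr * grad                       # parameter update from the error loss
```

A smaller `threshold` demands more training rounds before the loop stops, which matches the trade-off described in paragraph [0187].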
[0190] It is understandable that after the electronic device trains the preset neural network model to convergence based on the reconstruction results from the human eye, the ViT encoder in the second neural network of the trained preset neural network model, which is used to extract feature vectors from the complete and unobstructed first image, also accurately learns the gaze characteristics of the human eye based on the global information of the person's eyes and face in the face image. In this way, a gaze recognition model is trained.
[0191] It is understood that the unsupervised training method for the gaze recognition model provided in this application embodiment can train the model to learn gaze representations using largely unlabeled first and second images when gaze annotation data is lacking. This allows the trained model to perform gaze estimation. Furthermore, since both the first and second images used for training are face images containing head pose information, the gaze recognition model trained in this application also exhibits good robustness to face images with large-angle head poses.
[0192] Furthermore, the embodiments of this application employ a vision transformer network architecture during the pre-training stage of the gaze recognition model. The Vision Transformer can treat an image as a sequence of image patches, and through multi-head attention operations, it can fully integrate global information of the image. Therefore, the gaze recognition model of this application achieves excellent performance in multiple gaze estimation applications. Compared to the poor performance of other existing unsupervised training methods in 100-shot experiments, the unsupervised training method of the gaze recognition model in this application also demonstrates excellent performance on benchmarks such as MPIIGaze, EyeDiap, ColumbiaGaze, Eth-XGaze, and Gaze360.
[0193] Optionally, the unsupervised training method for the gaze recognition model provided in this application involves image segmentation and is not applicable to traditional convolutional neural networks (CNNs). However, this application can use techniques such as data distillation to transfer the unsupervised gaze recognition model to a CNN.
[0194] It is understandable that the aforementioned unsupervised, pre-trained neural network model can extract a relatively accurate gaze representation from face images, namely the low-dimensional gaze features mentioned above, but it cannot directly obtain specific gaze direction values. Therefore, in order to apply this pre-trained neural network model, this embodiment of the application can use a small amount of gaze-annotated training data to train a small low-dimensional linear regressor to convert the extracted low-dimensional gaze features into specific gaze direction values.
[0195] For example, please refer to Figure 10. The portion within the dashed box in Figure 10 represents the network architecture of the low-dimensional linear regressor. This low-dimensional linear regressor can be any linear regression network or a simple linear regression layer; this application does not limit the specific linear regressor.
[0196] In this embodiment, after obtaining the pre-set unsupervised trained neural network model, i.e., the gaze recognition model, the electronic device can add a linear regression layer after the second neural network. The electronic device can then acquire a small number of gaze-annotated training images, which are face images annotated with gaze directions. These gaze-annotated training images are then used to train the linear regression layer, converting the low-dimensional gaze representation injected into the bottleneck part of the second neural network into gaze angle values. These gaze angle values can be two-dimensional, including both horizontal and vertical gaze angles. Optionally, the electronic device can use L1 distance as a loss function to train the linear regression layer. Optionally, the electronic device can also use other loss functions to train the linear regression layer; this embodiment is not limited to these methods.
[0197] For example, for a 16-dimensional gaze bottleneck feature, a linear regressor with a bias value can be trained, consisting of 16 x 2 (i.e., the dimension of the output gaze angle value) + 2 (i.e., the bias value) = 34 parameters, to regress the gaze angle value from the low-dimensional gaze feature. Thus, with other model parameters fixed, we only need as few as 100 gaze-annotated samples to train a simple linear regressor.
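The parameter count in this example can be checked with the sketch below, which fits a 16-to-2 linear regressor with bias on 100 synthetic samples. For brevity it uses an ordinary least-squares fit, which is a swapped-in technique and not the L1-distance training mentioned in paragraph [0196]; the features and labels are synthetic, not real gaze data.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, feat_dim, out_dim = 100, 16, 2

features = rng.normal(size=(num_samples, feat_dim))   # frozen bottleneck features
true_w = rng.normal(size=(feat_dim, out_dim))
true_b = np.array([0.1, -0.2])
angles = features @ true_w + true_b      # synthetic horizontal and vertical angles

# Append a constant column so the bias is fitted together with the weights.
design = np.hstack([features, np.ones((num_samples, 1))])
params, *_ = np.linalg.lstsq(design, angles, rcond=None)
weights, bias = params[:-1], params[-1]

num_parameters = weights.size + bias.size             # 16 * 2 + 2
predicted = features @ weights + bias
```

Because only these 34 values are fitted while the encoder stays fixed, a small labeled set is enough to determine them in this noise-free toy setting.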
[0198] In some embodiments, when the electronic device can acquire more training images with gaze annotations, it can also use these gaze-annotated training images to fine-tune the gaze recognition model to obtain more accurate gaze estimation. Optionally, after the electronic device has trained the preset neural network model in an unsupervised manner, it can use a small number of gaze-annotated face images to directly train the entire neural network model, obtaining a trained gaze recognition model that can be applied to various gaze recognition scenarios. The loss function used during training is not limited; for example, it can be L1 Loss.
[0199] It is understandable that after obtaining a trained gaze recognition model, an electronic device can use this model to perform gaze recognition on facial images to be identified in practical applications. For example, please refer to Figure 11, which shows a schematic diagram of the system flow of the gaze recognition model provided in an embodiment of this application. In the absence of training data with gaze annotations, a gaze recognition model can be trained using the aforementioned unsupervised training method by acquiring a large number of unlabeled face images. Then, a small number of face images with gaze labels can be acquired to fine-tune the gaze recognition model in a supervised manner, resulting in the final gaze recognition model applied to the execution device. As shown in Figure 11, the finally trained gaze recognition model can be configured in the central processing unit of the execution device. The execution device can be any device that requires gaze estimation.
[0200] For example, as shown in Figure 11, when a camera or other sensor module captures a facial image, it can send the captured image to a central processing unit (CPU). The CPU can then call the configured gaze recognition model to process the facial video or image and output the gaze angle predicted by the model for later use. The facial image can be a video frame or a photograph containing a face.
[0201] Furthermore, this application embodiment also provides an image recognition method based on the gaze recognition model trained as described above. The image recognition method can be performed by, for example, the execution device 110 shown in Figure 1. This execution device 110 can be the electronic device 21 shown in Figure 2 or the server 22 shown in Figure 2; this embodiment does not limit this. Here, taking an electronic device as the execution device as an example, the image recognition method based on the gaze recognition model provided in this embodiment will be specifically described. As shown in Figure 12, the image recognition method provided in this application embodiment may include S1201-S1202:
[0202] S1201. The electronic device acquires the face image to be identified.
[0203] Optionally, the electronic device may be equipped with a sensor module, such as a camera, for sampling images or videos. The electronic device can acquire the face image to be identified through the sensor module. The face image to be identified may include head pose information.
[0204] Optionally, the electronic device can also acquire the facial image to be identified from other devices. These other devices can be other electronic devices, servers, etc.
[0205] Alternatively, the electronic device can also receive a facial image to be identified input by the user.
[0206] S1202. The electronic device inputs a face image into a pre-trained gaze recognition model and obtains the gaze recognition result output by the gaze recognition model. The gaze recognition model is trained on a neural network model based on training data. The training data includes a first image and a second image. The first image is an unobstructed face sample image, and the second image is a face sample image with obstructed eyes.
[0207] In this embodiment, the electronic device can input a face image to be recognized into a pre-trained gaze recognition model to obtain the gaze recognition result output by the gaze recognition model. The gaze recognition model is trained on a neural network model using training data, which includes a first image and a second image. The first image is an unobstructed face sample image, and the second image is a face sample image with obscured eyes. The specific training process is described above and will not be repeated here.
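As a rough illustration of S1201-S1202, the following Python sketch passes each acquired face image to a model and collects the outputs. The names `recognize_gaze`, `GazeModel`, and `stub_model` are hypothetical and do not come from this application; the stub stands in for the trained gaze recognition model, whose interface is not specified here.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Hypothetical types, for illustration only.
Image = Sequence[Sequence[float]]          # a grayscale face image as rows of pixels
GazeAngles = Tuple[float, float]           # (horizontal, vertical) gaze angles
GazeModel = Callable[[Image], GazeAngles]  # stands in for the trained model


def recognize_gaze(images: Iterable[Image], model: GazeModel) -> List[GazeAngles]:
    """S1201: take each acquired face image; S1202: feed it to the model."""
    results = []
    for image in images:
        if not image or not image[0]:
            raise ValueError("empty face image")
        results.append(model(image))
    return results


def stub_model(image: Image) -> GazeAngles:
    """Placeholder that always predicts a straight-ahead gaze (assumption)."""
    return (0.0, 0.0)


# Two tiny 2x2 "frames" standing in for camera output.
frames = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
predictions = recognize_gaze(frames, stub_model)
```

In a real deployment the images could equally come from another device or from user input, as described for S1201; only the source of `frames` would change.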
[0208] Since the aforementioned preset neural network model contains two neural networks, and the second neural network extracts low-dimensional gaze features from an unobstructed face image, while the first neural network is used to reconstruct the human eye, in this embodiment of the application, the second neural network in the pre-trained neural network model can also be used as a pre-trained gaze recognition model to perform gaze recognition on the face image to be recognized.
[0209] In some embodiments, since the second neural network in the pre-trained neural network model can learn more accurate gaze features based on full-face information, but cannot directly obtain intuitive gaze information, in order to apply the model, a linear regression layer can be trained using a small number of face sample images labeled with gaze information to map the gaze features output by the second neural network from the feature space to the gaze space, thereby obtaining intuitive gaze information.
[0210] Optionally, the second neural network in a pre-trained neural network model, along with a linear regression layer for regressing gaze angle values from low-dimensional gaze features, can be used as a pre-trained gaze recognition model to perform gaze recognition on the face image to be recognized. In this way, after the electronic device inputs the face image to be recognized into the pre-trained gaze recognition model, the model can output the angle value of the gaze direction in the face image; this output angle value is the gaze recognition result of the face image.
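The mapping from low-dimensional gaze features to gaze angles can be sketched as follows. This is a minimal illustration under stated assumptions: the feature vectors are taken as already extracted by the second neural network and kept fixed, the loss is mean squared error, and the layer is fitted by plain gradient descent. The function names, the three-dimensional features, and the synthetic labels are hypothetical and are not specified by this application.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def apply_linear_layer(W: Sequence[Vector], b: Vector, x: Vector) -> List[float]:
    """Compute W @ x + b for one feature vector."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def train_linear_layer(
    features: Sequence[Vector],
    angles: Sequence[Vector],
    lr: float = 0.1,
    epochs: int = 3000,
) -> Tuple[List[List[float]], List[float]]:
    """Fit a linear regression layer with full-batch gradient descent on MSE."""
    n, d, k = len(features), len(features[0]), len(angles[0])
    W = [[0.0] * d for _ in range(k)]
    b = [0.0] * k
    for _ in range(epochs):
        grad_W = [[0.0] * d for _ in range(k)]
        grad_b = [0.0] * k
        for x, y in zip(features, angles):
            pred = apply_linear_layer(W, b, x)
            for j in range(k):
                err = pred[j] - y[j]
                grad_b[j] += 2.0 * err / n
                for i in range(d):
                    grad_W[j][i] += 2.0 * err * x[i] / n
        for j in range(k):
            b[j] -= lr * grad_b[j]
            for i in range(d):
                W[j][i] -= lr * grad_W[j][i]
    return W, b


# Synthetic "labeled samples": features with (horizontal, vertical) angle labels
# generated from a known linear map, purely for demonstration.
true_W = [[10.0, -5.0, 2.0], [-3.0, 4.0, 8.0]]
true_b = [1.0, -2.0]
features = [
    (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0),
    (1.0, 1.0, 0.0), (0.0, 1.0, 1.0), (1.0, 0.0, 1.0),
]
angles = [apply_linear_layer(true_W, true_b, f) for f in features]
W, b = train_linear_layer(features, angles)
```

Because only the small linear layer is fitted, a handful of labeled samples is enough in this toy setting, which mirrors the motivation given above for using a small number of labeled face sample images.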
[0211] In summary, this application provides an unsupervised training method and device for a gaze recognition model. It can obtain the gaze recognition model through unsupervised training using a large number of unlabeled face images, enabling the model to judge or recognize the gaze features of input face images. Thus, during unsupervised training, the gaze recognition model can learn the gaze features of a person by combining eye and facial information. Therefore, the gaze recognition model obtained through unsupervised training can use face images as input. Even with face images showing large head poses, the gaze recognition model exhibits good robustness and can accurately identify gaze information in face images, improving both the performance and accuracy of gaze estimation.
[0212] Furthermore, based on the gaze recognition model obtained through unsupervised training described above, this application also provides an image recognition method and execution device. When a face image to be recognized is acquired, the gaze recognition model obtained through unsupervised training can be used to analyze the face image and obtain the gaze recognition result. Thus, this application can obtain the gaze direction of a person in the face image to be recognized, and then perform subsequent applications based on the gaze direction. For example, when the face image to be recognized is the face image of a driver inside a vehicle cabin, the gaze recognition model trained by this application can be used to analyze the driver's face image to obtain the driver's gaze direction. Based on the driver's gaze direction, it is possible to analyze whether the driver is exhibiting abnormal behaviors such as fatigued driving or distraction.
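One possible downstream use of the gaze direction can be sketched as below. This application does not specify how distraction is determined; the rule shown here (gaze outside an angular window for several consecutive frames) and all threshold values are assumptions chosen only for illustration, and `is_distracted` is a hypothetical name.

```python
from typing import Iterable, Tuple


def is_distracted(
    gaze_sequence: Iterable[Tuple[float, float]],
    max_horizontal_deg: float = 30.0,   # assumed threshold
    max_vertical_deg: float = 20.0,     # assumed threshold
    min_consecutive: int = 3,           # assumed number of frames
) -> bool:
    """Return True if the gaze stays off-road for min_consecutive frames in a row."""
    run = 0
    for horizontal, vertical in gaze_sequence:
        off_road = abs(horizontal) > max_horizontal_deg or abs(vertical) > max_vertical_deg
        run = run + 1 if off_road else 0
        if run >= min_consecutive:
            return True
    return False


# Per-frame (horizontal, vertical) gaze angles in degrees, as output by the model.
looking_away = [(0.0, 0.0), (35.0, 0.0), (40.0, 5.0), (38.0, -2.0), (0.0, 0.0)]
brief_glances = [(35.0, 0.0), (0.0, 0.0), (35.0, 0.0), (0.0, 0.0), (35.0, 0.0)]
```

Requiring several consecutive frames is a design choice that keeps a single mirror check from being flagged; a production system would tune or replace this rule.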
[0213] It should be noted that the unsupervised training method for the gaze recognition model provided in the embodiments of this application, and the method for image recognition using the gaze recognition model trained by the training method provided in this application, are inventions based on the same concept. They can also be understood as two parts of a system or two stages of an overall process: such as the model training stage and the model application stage.
[0214] Optionally, the method provided in this application embodiment can also be used to expand the training database. For example, the I/O interface 112 of the execution device 110 shown in Figure 1 can send the user-input face image to be recognized, together with the gaze recognition result calculated by the gaze recognition model in the execution device 110, to the database 130 as a training data pair. The training data maintained by the database 130 thus becomes richer, providing richer training data for the training work of the training device 120.
[0215] This application embodiment also provides a training device. The training device may include an acquisition unit, an input unit, a calculation unit, and a training unit. The acquisition unit is used to acquire a first image and a second image, wherein the first image is an unobstructed face sample image, and the second image is a face sample image with occluded eyes. The input unit is used to input the first image and the second image into a preset neural network model to obtain the eye reconstruction result of the second image. The calculation unit is used to calculate the distance between the eye reconstruction result and the original eye. The training unit is used to, if the distance between the eye reconstruction result and the original eye does not meet a preset condition, train the preset neural network model based on the eye reconstruction result to obtain a gaze recognition model.
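The interplay of the calculation unit and the training unit can be sketched as a simple loop. This is a toy illustration under stated assumptions: the distance is taken to be mean squared error, the preset condition is taken to be "distance at or below a threshold", and the "model" is a list of pixel values nudged toward the original eye rather than a neural network. The names `reconstruction_distance` and `train_until_condition_met` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Patch = Sequence[float]  # hypothetical flattened eye-region pixels


def reconstruction_distance(reconstructed: Patch, original: Patch) -> float:
    """Mean squared error between two equally sized patches (assumed metric)."""
    if len(reconstructed) != len(original):
        raise ValueError("patches must have the same length")
    return sum((r - o) ** 2 for r, o in zip(reconstructed, original)) / len(original)


def train_until_condition_met(
    reconstruct: Callable[[], Patch],
    update: Callable[[float], None],
    original: Patch,
    threshold: float,
    max_steps: int = 1000,
) -> Tuple[int, float]:
    """Repeat reconstruct -> measure -> update until the distance meets the threshold."""
    for step in range(max_steps):
        distance = reconstruction_distance(reconstruct(), original)
        if distance <= threshold:      # preset condition met: stop training
            return step, distance
        update(distance)               # condition not met: keep training
    return max_steps, reconstruction_distance(reconstruct(), original)


# Toy stand-in for the model: its "parameters" are the reconstructed pixels.
original_eye: List[float] = [0.2, 0.4, 0.6, 0.8]
state: List[float] = [0.0] * len(original_eye)


def reconstruct() -> List[float]:
    return list(state)


def update(_distance: float) -> None:
    # Move each pixel halfway toward the target; a real model would
    # backpropagate the loss through the first and second neural networks.
    for i, target in enumerate(original_eye):
        state[i] += 0.5 * (target - state[i])


steps, final_distance = train_until_condition_met(
    reconstruct, update, original_eye, threshold=1e-6
)
```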
[0216] This application also provides a recognition device. The recognition device may include an acquisition unit and a recognition unit. The acquisition unit is used to acquire a face image to be recognized; the recognition unit is used to input the face image into a pre-trained gaze recognition model to obtain a gaze recognition result output by the gaze recognition model. The gaze recognition model is trained on a neural network model based on training data, which includes a first image and a second image. The first image is an unobstructed face sample image, and the second image is a face sample image with obstructed eyes.
[0217] This application embodiment also provides a training device, which includes the aforementioned training apparatus. The training device can be a computer device, and may be the electronic device 21 shown in Figure 2 or the server 22 shown in Figure 2.
[0218] This application also provides an execution device, which includes the aforementioned execution apparatus. The execution device can be a computer device, and may be the electronic device 21 shown in Figure 2 or the server 22 shown in Figure 2.
[0219] Optionally, the execution device and the training device can be the same device. For example, after the server trains the gaze recognition model of this application as a training device, the server can also act as an execution device to process the user-input face image to be recognized through its own trained gaze recognition model.
[0220] This application also provides a system comprising the aforementioned execution device and training device. Optionally, the execution device may be the electronic device 21 shown in Figure 2, and the training device may be the server 22 shown in Figure 2.
[0221] This application also provides a computer storage medium storing computer instructions. When the computer instructions are executed on an electronic device, the electronic device performs the aforementioned method steps to implement the unsupervised training method of the gaze recognition model or the image recognition method in the above embodiments.
[0222] This application also provides a computer program product that, when run on a computer, causes the computer to perform the aforementioned related steps to implement the unsupervised training method of the gaze recognition model or the image recognition method executed by the electronic device in the above embodiments.
[0223] This application also provides a chip system including a processor for implementing the technical methods of this application. In one possible design, the chip system further includes a memory for storing program instructions and / or data necessary for the embodiments of this application. In another possible design, the chip system further includes a memory for the processor to call application code stored in the memory. This chip system may be composed of one or more chips, or may include chips and other discrete devices; this application does not specifically limit this.
[0224] In this embodiment, the computing device (such as an execution device and a training device), computer storage medium, computer program product or chip are all used to execute the corresponding methods provided above. Therefore, the beneficial effects that can be achieved can be referred to the beneficial effects in the corresponding methods provided above, and will not be repeated here.
[0225] It is understood that the aforementioned computing devices, etc., include hardware structures and / or software modules corresponding to the execution of each function in order to achieve the above-mentioned functions. Those skilled in the art should readily recognize that, based on the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in a hardware-driven or software-driven manner depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this application.
[0226] This application embodiment can divide the above-mentioned computing device into functional modules according to the above method example. For example, each function can be divided into a separate functional module, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module.
[0227] It should be noted that the module division in this embodiment of the invention is illustrative and is only a logical functional division. In actual implementation, there may be other division methods.
[0228] For example, a computing device may include memory, a processor, a communication interface, and a bus. The memory, processor, and communication interface are interconnected via the bus.
[0229] The memory can be read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory can store programs, and when the programs stored in the memory are executed by the processor, the processor and the communication interface are used to execute the various steps of the unsupervised training method of the gaze recognition model of the embodiments of this application or the image recognition method based on the trained gaze recognition model.
[0230] The processor can be a general-purpose central processing unit (CPU), microprocessor, application-specific integrated circuit (ASIC), graphics processing unit (GPU), or one or more integrated circuits, used to execute related programs to achieve the functions required by the units in the training device or execution device of this application embodiment, or to execute the unsupervised training method of the gaze recognition model or the image recognition method based on the trained gaze recognition model of this application method embodiment.
[0231] The processor can also be an integrated circuit chip with signal processing capabilities. In implementation, the various steps of the unsupervised training method for the gaze recognition model or the image recognition method based on the trained gaze recognition model can be completed by integrated logic circuits in the processor's hardware or by software instructions. The aforementioned processor can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly implemented by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules can reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in the memory 801. The processor 802 reads the information in the memory 801 and, in conjunction with its hardware, performs the functions required by the units included in the training device or execution device of this application embodiment, or executes the unsupervised training method of the gaze recognition model or the image recognition method based on the trained gaze recognition model of this application method embodiment.
[0232] The communication interface uses transceiver devices, such as, but not limited to, transceivers, to enable communication between the device and other devices or communication networks. For example, training data (such as the first and second images in the embodiments of this application) can be obtained through the communication interface.
[0233] A bus can include a path for transmitting information between various components of a computing device (e.g., memory, processor, communication interface).
[0234] It should be understood that the acquisition unit in the training device is equivalent to the communication interface in a computing device, and the input unit, computing unit, and training unit in the training device can be equivalent to the processor in a computing device. The acquisition unit in the execution device is equivalent to the communication interface in a computing device, and the recognition unit in the execution device can be equivalent to the processor in a computing device.
[0235] Through the above description of the embodiments, those skilled in the art can clearly understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
[0236] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0237] The units described as separate components may or may not be physically separate. A component shown as a unit can be one or more physical units; that is, it can be located in one place or distributed in multiple different locations. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0238] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0239] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application in essence, or the part that contributes over the prior art, or all or part of the technical solutions, can be embodied in the form of a software product. This software product is stored in a storage medium and includes several instructions to cause a device (which may be a microcontroller, chip, etc.) or processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0240] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. An image recognition method, characterized in that, The method includes: Obtain the face image to be identified; The face image is analyzed by a pre-trained gaze recognition model to obtain gaze recognition results, wherein the gaze recognition results include gaze angle values; the gaze recognition model is trained on a neural network model based on training data, wherein the training data includes a first image and a second image, the first image being an unobstructed face sample image, and the second image being a face sample image with obstructed eyes; The gaze features of the first image are extracted using the second neural network, and the gaze features are then injected into the first neural network. The facial features of the second image are extracted using the first neural network, and a human eye reconstruction result is generated based on the facial features and the gaze features injected by the second neural network. Calculate the error loss between the reconstructed human eye and the original human eye; When the error loss does not meet the preset conditions, the preset neural network model is trained according to the error loss to obtain the trained gaze recognition model; the preset neural network model includes the first neural network and the second neural network.
2. The method according to claim 1, characterized in that, The first neural network includes an encoder and a decoder. The process of extracting facial features from the second image using the first neural network and generating the human eye reconstruction result based on the facial features and the gaze features injected by the second neural network includes: The encoder extracts facial features from the second image; The facial features and the gaze features injected by the second neural network are concatenated to obtain a concatenated feature vector; The human eye image is reconstructed from the concatenated feature vector using the decoder, resulting in the human eye reconstruction result.
3. The method according to claim 2, characterized in that, The second neural network shares the encoder with the first neural network. Extracting the gaze features of the first image through the second neural network includes: The encoder extracts the gaze features of the first image.
4. The method according to claim 3, characterized in that, The preset neural network model further includes an embedding module, and the extraction of gaze features from the first image through the encoder includes: The first image is converted into a first embedding block sequence through the embedding module; The encoder extracts the gaze features of the first embedded block sequence; The step of extracting facial features from the second image using the encoder includes: The embedding module converts the second image into a second embedding block sequence. The encoder extracts facial features from the second embedded block sequence.
5. The method according to any one of claims 1-4, characterized in that, Both the first neural network and the second neural network are neural networks based on the vision transformer (ViT) structure. The first neural network includes a ViT encoder and a ViT decoder, and the second neural network shares the ViT encoder with the first neural network.
6. The method according to any one of claims 1-4, characterized in that, The second neural network includes a first fully connected layer and a second fully connected layer. The step of injecting the gaze features into the first neural network includes: The bottleneck features are obtained by downsampling the gaze features using the first fully connected layer; The bottleneck features are upsampled by the second fully connected layer to obtain bottleneck features with the same dimension as the facial features; The bottleneck features, which have the same dimension as the facial features, are injected into the first neural network.
7. The method according to any one of claims 1-4, characterized in that, The calculation of the error loss between the reconstructed human eye and the original human eye includes: Calculate the error loss between the reconstructed human eye and the human eye image in the first image.
8. The method according to any one of claims 1-4, characterized in that, When the error loss does not meet the preset conditions, training the preset neural network model based on the error loss to obtain the trained gaze recognition model includes: When the error loss does not meet the preset conditions, the parameters in the preset neural network model are updated according to the error loss to obtain the trained neural network model. Based on the second neural network and the linear regression layer in the trained neural network model, an initial gaze recognition model is obtained, wherein the linear regression layer is used to convert the gaze features extracted by the second neural network into gaze information, and the gaze information includes the horizontal angle and vertical angle of the gaze. The initial gaze recognition model is trained based on face sample images labeled with gaze information to obtain the trained gaze recognition model.
9. The method according to claim 8, characterized in that, The process of training the initial gaze recognition model based on face sample images labeled with gaze information to obtain the trained gaze recognition model includes: Based on face sample images labeled with gaze information, the linear regression layer in the initial gaze recognition model is trained to obtain the trained gaze recognition model.
10. The method according to any one of claims 1-4, characterized in that, The first image and the second image are sample images of faces without labeled gaze information.
11. A computing device, characterized in that, The computing device includes a memory and one or more processors, the memory and the processors being coupled together, the memory storing programs or instructions, and when the programs or instructions are executed by the processors, the computing device performs the method as described in any one of claims 1-10.
12. A chip, characterized in that, The chip includes a processor and a data interface. The processor reads instructions stored in the memory through the data interface to execute the method as described in any one of claims 1-10.
13. A computer-readable storage medium, characterized in that, Includes computer instructions that, when executed on a computing device, cause the computing device to perform the method as described in any one of claims 1-10.
14. A computer program product, characterized in that, When the computer program product is run on a computing device, it causes the computing device to perform the method as described in any one of claims 1-10.