Face feature detection method, neural network model training method, and electronic device
By preprocessing facial images and combining multiple cameras, the problems of detection failure and false detection in non-contact facial feature detection are solved, improving detection accuracy and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HONOR DEVICE CO LTD
- Filing Date
- 2023-10-20
- Publication Date
- 2026-06-26
AI Technical Summary
Existing non-contact facial feature detection technologies have shortcomings in terms of detection failure or false detection, resulting in a poor user experience.
By preprocessing the captured face images, features relevant to face feature detection are retained while irrelevant features are weakened. Combined with multiple cameras and self-supervised training of the neural network model, detection accuracy is improved.
It reduces the chances of detection failure and false detection, and improves the accuracy of facial feature detection and user experience.
Description
Technical Field
[0001] This application belongs to the field of electronic equipment technology, and in particular relates to a face feature detection method, a neural network model training method, and an electronic device.
Background Technology
[0002] With the rapid development of electronic information technology and the maturity of communication technology, people have begun to explore new human-computer interaction methods that are free from touch operation or keyboard and mouse input, namely non-contact interaction methods, such as voice recognition, gesture recognition, face recognition, and eye tracking, which can provide users with more convenient and diversified interaction methods and improve the user experience.
[0003] However, the technology related to contactless interaction is not yet mature enough and still has some shortcomings. For example, when implementing contactless interaction by detecting facial features, there are still cases of detection failure or false detection, resulting in a poor user experience.
Summary of the Invention
[0004] This application provides a face feature detection method, a neural network model training method, and an electronic device, which can improve the accuracy of face feature detection.
[0005] Firstly, a face feature detection method is provided, applied to an electronic device with a first camera. The method includes: capturing a first image using the first camera, the first image including a first face image; processing the first face image based on a first neural network model to output a second face image; and performing a face feature detection service based on the second face image. The feature intensity of a first feature in the second face image is greater than or equal to the feature intensity of a second feature in the first face image; the first and second features are related to the face feature detection service, and the first feature corresponds to the second feature. The feature intensity of a third feature in the second face image is less than the feature intensity of a fourth feature in the first face image; the third and fourth features are not related to the face feature detection service, and the third feature corresponds to the fourth feature.
[0006] In the above scheme, after the first face image is obtained, the face feature detection service is not performed on it directly. Instead, the first face image is preprocessed to obtain the second face image. On the one hand, features in the first face image that are relevant to the face feature detection service, that is, features needed for subsequent face feature detection, are retained or even enhanced. For example, the second feature is a feature in the first face image that is relevant to the face feature detection service. When the first face image is processed, this feature can be retained or enhanced. Therefore, the feature intensity of the corresponding feature in the second face image (i.e., the first feature) is greater than (when the feature is enhanced) or equal to (when the feature is retained) the feature intensity of the second feature. On the other hand, features in the first face image that are irrelevant to the face feature detection service, that is, noise features not needed for subsequent face feature detection, are weakened. This preprocessing reduces interference, allowing the subsequent face feature detection to focus on the relevant features, thereby improving detection accuracy and reducing the probability of detection failure and false detection.
[0007] Optionally, performing a facial feature detection service based on the second facial image includes: detecting whether the user's gaze point is located within the display screen of the electronic device based on the second facial image. After performing the facial feature detection service, the method further includes: controlling the on / off state of the display screen of the electronic device based on the detection result of the facial feature detection service.
[0008] The solution provided in this application can be applied to scenarios of "viewing without turning off the screen" or "viewing while the screen is on." In other words, the electronic device can detect whether the user's gaze point is within the display screen based on a second facial image, i.e., whether the user is currently looking at the display screen, and control the electronic device to switch between a screen-off state and a screen-on state. For example, in the screen-on state, if the user is not currently looking at the display screen, the electronic device can automatically switch from the screen-on state to the screen-off state, thereby saving power; if the user is currently looking at the display screen, the electronic device can remain in the screen-on state, or switch from the screen-off state to the screen-on state, thereby improving the user experience.
[0009] Optionally, the electronic device is currently in a screen-on state. At this time, capturing a first image using the first camera includes: capturing a first image using the first camera when the screen-off countdown is less than or equal to a first threshold; and controlling the on / off state of the electronic device's display screen based on the detection result of the face feature detection service, including: maintaining the screen-on state of the electronic device's display screen and resetting the screen-off countdown when the detection result indicates that the user's gaze point is within the display screen of the electronic device.
[0010] When the solution provided in this application is applied to a "stay-on display" scenario, it can use the first camera to acquire a facial image to detect whether the user's gaze is within the display screen of the electronic device when the screen-off countdown is about to end (i.e., when the screen-off countdown is less than or equal to a first threshold). Based on the detection result, it decides whether to automatically turn off the screen, thus improving the user experience by determining whether to automatically turn off the screen according to the user's actual usage. If the user's gaze is within the display screen, the screen does not automatically turn off. Furthermore, the electronic device resets the screen-off countdown, meaning the countdown starts counting down from the beginning, so that the electronic device can still automatically turn off the screen after the user no longer uses it.
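The stay-on decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of this application: `COUNTDOWN_S`, `FIRST_THRESHOLD_S`, and `on_countdown_tick` are hypothetical names and values, and the gaze detector is passed in as a stand-in callable.

```python
from typing import Callable, Tuple

COUNTDOWN_S = 30.0       # assumed screen-off countdown duration
FIRST_THRESHOLD_S = 3.0  # assumed value of the "first threshold"


def on_countdown_tick(
    remaining_s: float, gaze_on_screen: Callable[[], bool]
) -> Tuple[str, float]:
    """Return (screen_state, new_remaining_s) for one countdown tick."""
    if remaining_s > FIRST_THRESHOLD_S:
        # Countdown is not about to end yet: the camera is not used.
        return "on", remaining_s
    if gaze_on_screen():
        # User is looking at the display: stay on and reset the countdown.
        return "on", COUNTDOWN_S
    if remaining_s <= 0:
        # Countdown ended and the user is not looking: turn the screen off.
        return "off", 0.0
    # Not looking, but the countdown has not ended yet: keep counting down.
    return "on", remaining_s
```

Because the gaze check runs only once the remaining time falls to the threshold, the camera stays idle for most of the countdown in this sketch.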
[0011] Optionally, the first and second features include eye features and / or facial contour features, and the third and fourth features include one or more of the following: light and shadow features affected by illumination, facial features other than the eyes, biological features other than the facial features, jewelry features, and makeup features.
[0012] Therefore, in the above scheme, when processing the first face image, eye features and / or facial contour features can be preserved or enhanced, while light and shadow features affected by illumination, facial features other than the eyes, biological features other than the facial features, jewelry features, makeup features, etc. are weakened. In this way, when performing gaze point detection, more focus can be placed on eye features and / or facial contour features, thereby improving the accuracy of detection.
[0013] Optionally, detecting whether the user's gaze point is within the display screen of the electronic device based on the second face image includes: determining a left-eye image and a right-eye image based on the second face image; processing the left-eye image and the right-eye image based on an open-eye recognition model and outputting an open-eye result, which is used to indicate whether the user is in an open-eye state; and, if the user is in an open-eye state, detecting whether the user's gaze point is within the display screen of the electronic device based on the second face image, the left-eye image, and the right-eye image.
[0014] In the above scheme, after the second face image is acquired, gaze recognition is not performed directly. Instead, the electronic device first identifies whether the user's eyes are open based on the left-eye and right-eye images, and continues with gaze recognition only if they are. First, the open-eye recognition model is much easier to train than the gaze recognition model and can therefore reach very high accuracy, so filtering on the open-eye result in advance helps improve the overall accuracy of gaze recognition. Second, compared with the deep network structure and large number of parameters of the face feature extraction model, the very shallow open-eye recognition model requires extremely little computation. Skipping meaningless full-face branch operations further reduces power consumption, which helps meet low-power requirements.
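The gating order described above (cheap open-eye check first, costly gaze recognition second) might look like the sketch below. The function name `detect_gaze_with_eye_gate`, the `Image` alias, and the choice to return `None` when the eyes are closed are assumptions made for illustration; both models are stand-in callables.

```python
from typing import Callable, Optional, Sequence

Image = Sequence[Sequence[float]]  # placeholder image type for this sketch


def detect_gaze_with_eye_gate(
    face: Image,
    left_eye: Image,
    right_eye: Image,
    eyes_open: Callable[[Image, Image], bool],
    gaze_model: Callable[[Image, Image, Image], bool],
) -> Optional[bool]:
    """Run gaze recognition only if the shallow open-eye model says 'open'."""
    if not eyes_open(left_eye, right_eye):
        # Assumption: report "no gaze result" and skip the full-face branch.
        return None
    # True means the gaze point is within the display screen.
    return gaze_model(face, left_eye, right_eye)
```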
[0015] Optionally, detecting whether the user's gaze point is within the display screen of the electronic device based on the second face image, the left eye image, and the right eye image includes: extracting left-eye features and right-eye features from the left-eye image and the right-eye image respectively using a human eye feature extraction model; extracting face features from the second face image using a face feature extraction model; fusing the left-eye features, the right-eye features, and the face features to obtain fused data; processing the fused data using a gaze recognition model and outputting a gaze result, the gaze result being used to indicate whether the user's gaze point is within the display screen of the electronic device.
[0016] In the above scheme, different models are used to extract features from human eye images and face images, which can improve the precision of feature extraction. Since different images have different characteristics, if the same model is used to extract features from different types of images, it may affect the accuracy of subsequent gaze recognition.
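As a rough sketch of the extraction and fusion steps: the text does not state how the features are fused, so plain concatenation is assumed here, and `fuse_features`, `recognize_gaze`, and the extractor callables are hypothetical names.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Image = Sequence[Sequence[float]]  # placeholder image type for this sketch


def fuse_features(left_eye: Vector, right_eye: Vector, face: Vector) -> Vector:
    """Fuse by concatenation (an assumption; other fusion schemes are possible)."""
    return list(left_eye) + list(right_eye) + list(face)


def recognize_gaze(
    left_eye_img: Image,
    right_eye_img: Image,
    face_img: Image,
    eye_extractor: Callable[[Image], Vector],
    face_extractor: Callable[[Image], Vector],
    gaze_model: Callable[[Vector], bool],
) -> bool:
    """Use separate extractors for eye and face images, then classify."""
    left = eye_extractor(left_eye_img)
    right = eye_extractor(right_eye_img)
    face = face_extractor(face_img)
    return gaze_model(fuse_features(left, right, face))
```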
[0017] Optionally, the electronic device further includes a second camera; the method further includes: when the confidence level of the detection result of the face feature detection service is less than a second threshold or the face feature detection service fails, obtaining the average pixel value of the first face image; when the average pixel value of the first face image is lower than a third threshold, capturing a second image using the second camera, the second image including a third face image; processing the third face image using a second neural network model to output a fourth face image; performing a face feature detection service based on the fourth face image; wherein, the feature intensity of the fifth feature in the fourth face image is greater than or equal to the feature intensity of the sixth feature in the third face image, the feature intensity of the seventh feature in the fourth face image is less than the feature intensity of the eighth feature in the third face image, the fifth feature corresponds to the sixth feature, the seventh feature corresponds to the eighth feature, the fifth feature and the sixth feature are related to the face feature detection service, the seventh feature and the eighth feature are not related to the face feature detection service, the first image is a color image, and the second image is an infrared (IR) image.
Therefore, in the above scheme, the first camera is used to capture a color image (i.e., the first image) to detect facial features. Since the power consumption caused by capturing a color image is low, and most electronic devices are used in bright light scenes, the imaging quality of the color image is satisfactory in most scenarios. Therefore, by calling the first camera to detect facial features, both low power consumption and a good detection success rate can be maintained.
[0018] If a face feature detection operation is performed based on the first image, but the reliability of the detection result is low (e.g., below the second threshold) or the detection fails, there could be many reasons for this. These could be due to the lighting conditions at the time of capture, the nature of the target itself, or other unexpected errors. In this case, the average pixel value of the first face image can be obtained. If this average pixel value is below the third threshold, it indicates that the image quality of the first face image is poor. Therefore, poor image quality is likely the main reason for low reliability or detection failure. For example, in low-light, backlight, sidelight, or reflective scenes, the facial features in the first face image may not be clear enough, and the electronic device may not be able to detect facial or eye features from the first face image. In this situation, the electronic device can continue to use the second camera to acquire an infrared image (i.e., the second image) to perform the face feature detection operation. Since the imaging quality of infrared images is almost unaffected by lighting conditions, high image quality can be guaranteed even in poor lighting conditions, thereby improving the success rate and accuracy of face feature detection.
[0019] If the face feature detection is performed based on the first image and the detection result has high reliability (e.g., greater than or equal to the second threshold), and the detection result indicates that the user's gaze point is outside the electronic device's display screen, then the current detection process ends or restarts. For example, it might re-evaluate whether the screen-off countdown is less than or equal to the first threshold. If so, a color image is captured using the first camera to perform the face feature detection task. The specific implementation scheme will not be repeated here. In this case, it is not necessary to calculate the average pixel value of the first face image, nor is it necessary to use the second camera to capture a second image. In other words, if the detection result has high reliability, it is directly used instead of re-detecting using the second camera, thus reducing unnecessary power consumption.
[0020] In summary, the solution provided in this application involves two types of cameras (i.e., a first camera and a second camera). The first camera can capture color images, and the second camera can capture infrared images. In scenarios with good lighting conditions, using the first camera to capture color images for facial feature detection can maintain low power consumption while ensuring a good detection success rate. In scenarios with poor lighting conditions, using the second camera to capture infrared images can ensure a high detection success rate under various lighting conditions. However, since lighting conditions are not easily quantifiable, if the camera is selected solely based on ambient light intensity, facial feature detection will still fail in many scenarios, or even if facial features are detected, the reliability of the detection results will be relatively low. For example, in backlit, sidelit, or mirror-reflective lighting scenarios, the ambient light intensity may be high, but the quality of the acquired color image may be poor. The solution provided in this application first detects facial features based on the first image. Only when the reliability of the detection result is lower than the second threshold (or the detection fails), and analysis shows that the likely cause is the poor image quality of the first image, is the second camera called to capture an infrared image. Since capturing an infrared image consumes more power, this reduces the number of infrared image captures, and thus power consumption, while ensuring a high detection success rate, or at least reduces the cases where facial feature detection fails because of image quality.
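The two-camera fallback summarized above can be sketched as follows. All names and numeric thresholds (`Detection`, `SECOND_THRESHOLD`, `THIRD_THRESHOLD`, `detect_with_fallback`) are hypothetical, the detectors are stand-in callables that are assumed to include the neural network preprocessing, and the behavior when the color image is not dark (return the color result unchanged) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Image = Sequence[Sequence[float]]  # placeholder grayscale image, values 0..255


@dataclass
class Detection:
    gaze_on_screen: bool
    confidence: float


SECOND_THRESHOLD = 0.6  # assumed confidence threshold
THIRD_THRESHOLD = 40.0  # assumed average-pixel-value threshold


def mean_pixel(image: Image) -> float:
    """Average pixel value of a face image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def detect_with_fallback(
    color_face: Image,
    detect_color: Callable[[Image], Optional[Detection]],
    capture_ir_face: Callable[[], Image],
    detect_ir: Callable[[Image], Optional[Detection]],
) -> Optional[Detection]:
    result = detect_color(color_face)  # None stands for "detection failed"
    if result is not None and result.confidence >= SECOND_THRESHOLD:
        return result  # reliable result: the IR camera is never activated
    if mean_pixel(color_face) >= THIRD_THRESHOLD:
        # Image is not dark, so darkness is unlikely to be the cause.
        # Assumption: keep the color result (possibly None) as it is.
        return result
    # Dark image: retry with an infrared capture from the second camera.
    return detect_ir(capture_ir_face())
```

In this sketch the infrared capture is reached only after both checks fail, which mirrors the goal of limiting the number of infrared captures.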
[0021] It is understandable that, when the first image is a color image, subsequent image processing and detection procedures can be performed directly based on the first image, or based on the image obtained after conversion of the first image. In other words, the first face image can be an image directly cropped from the first image, or an image cropped from an image obtained after conversion of the first image. For example, after capturing the first image (color image) using the first camera, the first image can be converted to a grayscale image, and then the first face image can be cropped from the grayscale image.
[0022] Optionally, in one possible implementation, the first image is a color image. Obtaining the first image using the first camera includes: acquiring the current ambient light intensity; and capturing the first image using the first camera when the ambient light intensity is greater than or equal to a fourth threshold. Optionally, in another possible implementation, the first image is an infrared image. Obtaining the first image using the first camera includes: acquiring the current ambient light intensity; and capturing the first image using the first camera when the ambient light intensity is less than a fourth threshold.
[0023] Therefore, in the above scheme, when the light intensity is relatively high (ambient light intensity greater than or equal to the fourth threshold), facial features are detected by acquiring color images. When the light intensity is relatively low (ambient light intensity less than the fourth threshold), facial features can be detected directly by acquiring infrared images. Because the ambient light intensity is relatively low, indicating a dark scene, capturing a color image in this situation would likely result in poor image quality. Directly using the second camera to detect facial features eliminates the need for the processes of acquiring the first image, processing the first facial image to obtain the second facial image, and detecting facial features based on the second facial image. This saves resources, reduces latency, and improves the user experience.
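The ambient-light selection rule above reduces to a single comparison. In this sketch, `FOURTH_THRESHOLD_LUX` and `choose_image_type` are hypothetical names, and the unit (lux) and value are assumptions.

```python
FOURTH_THRESHOLD_LUX = 50.0  # assumed value and unit of the "fourth threshold"


def choose_image_type(ambient_lux: float) -> str:
    """Pick the image type to capture first, based on ambient light intensity."""
    # At or above the threshold: color image; below it: infrared image.
    return "color" if ambient_lux >= FOURTH_THRESHOLD_LUX else "infrared"
```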
[0024] Optionally, before capturing the first image using the first camera, the method further includes: determining to perform face feature detection. That is, the electronic device activates the first camera to capture the image after determining that face feature detection is to be performed. The detection can be triggered by the user controlling the electronic device, or performed automatically by the electronic device based on preset rules; this application does not limit this.
[0025] Optionally, the first camera and the second camera can be one camera or two different cameras.
[0026] Optionally, the first camera and the second camera may be front-facing cameras of an electronic device.
[0027] Optionally, the first camera is used to acquire two-dimensional images, and the second camera is used to acquire images containing depth information.
[0028] Optionally, the first camera is a red-green-blue (RGB) camera, and the second camera is a time-of-flight (ToF) camera.
[0029] Optionally, the first camera is an always-on (AO) camera. The AO camera is a low-power RGB camera; capturing the first image using an AO camera further reduces power consumption.
[0030] Optionally, before processing the first face image based on the first neural network model, the method further includes: determining that the first image includes the first face image.
[0031] In the above scheme, after the electronic device captures the first image using the first camera, it can detect whether the first image includes a first face image. If it does, the subsequent process continues. If it does not, the subsequent process can be stopped, or the first camera can be called again to capture the image, or the other frames captured by the first camera can be detected to see if they include a face image (the first image and the other frames can be multiple frames captured by the first camera at one time). This avoids the situation where the subsequent face feature detection fails because the first image does not include a face image, omits the detection process in this case, saves resources and reduces latency.
[0032] Optionally, before capturing the first image using the first camera, the method further includes: displaying a first interface on the screen, wherein the first interface is any one of the following interfaces: the interface of a first application installed in the electronic device, the system desktop, the negative one screen, and the lock screen interface.
[0033] Optionally, the first neural network model is trained based on training data, which includes multiple sample images. These sample images are obtained by augmenting features unrelated to the face feature detection business from the same original image. The original image includes a third face image.
[0034] Optionally, the first neural network model is trained based on the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value. The first loss function value is obtained by comparing multiple sample images with multiple reconstructed images, which are generated based on multiple feature vectors extracted from the training data; the multiple feature vectors, multiple sample images, and multiple reconstructed images correspond one-to-one. The second loss function value is obtained by comparing the average vector distance between the multiple feature vectors with the target distance. The third loss function value is obtained by comparing the detection result obtained by performing face feature detection based on the multiple feature vectors with the target result.
[0035] Secondly, a method for training a neural network model is provided. This method includes: acquiring training data, which includes multiple sample images obtained by augmenting the same original image with features unrelated to face feature detection; extracting multiple feature vectors from the training data; generating multiple reconstructed images based on the multiple feature vectors, with the feature vectors, sample images, and reconstructed images in one-to-one correspondence; performing face feature detection based on the multiple feature vectors to obtain detection results; comparing each sample image with its corresponding reconstructed image to obtain a first loss function value; comparing the average vector distance between the multiple feature vectors with the target distance to obtain a second loss function value; comparing the detection results with the target results to obtain a third loss function value; and adjusting the parameters of the neural network model based on the first, second, and third loss function values.
[0036] According to the above scheme, the parameters of the neural network model can be adjusted through self-supervised training and adversarial training, so that when the neural network model processes images, it retains features related to face feature detection and weakens features unrelated to face feature detection, thereby improving the accuracy of face feature detection.
[0037] Optionally, adjusting the parameters of the neural network model based on the first loss function value, the second loss function value, and the third loss function value includes: adjusting the parameters of the neural network model with the aim of reducing the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value.
[0038] Optionally, the adversarial loss function value of the first loss function value is equal to the difference between 1 and the first loss function value.
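The three loss terms and the adversarial form (1 minus the first loss function value) can be illustrated numerically. The text only states what each loss compares, so the concrete metrics below are assumptions: mean absolute pixel difference with pixels scaled to [0, 1], mean pairwise Euclidean distance between feature vectors, mean absolute error for detection, and an unweighted sum as the training objective. All function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def reconstruction_loss(samples: Sequence[Vector], recons: Sequence[Vector]) -> float:
    """First loss: compare each sample image with its reconstructed image."""
    diffs = [abs(a - b) for s, r in zip(samples, recons) for a, b in zip(s, r)]
    return sum(diffs) / len(diffs)


def distance_loss(vectors: Sequence[Vector], target_distance: float) -> float:
    """Second loss: average pairwise vector distance vs. the target distance."""
    pairs = list(combinations(vectors, 2))
    average = sum(math.dist(u, v) for u, v in pairs) / len(pairs)
    return abs(average - target_distance)


def detection_loss(predicted: Vector, target: Vector) -> float:
    """Third loss: compare detection results with the target results."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def total_loss(first: float, second: float, third: float) -> float:
    """Objective to reduce; the unweighted sum is an assumption."""
    adversarial = 1.0 - first  # adversarial loss value of the first loss
    return adversarial + second + third
```

Since the adversarial term is 1 minus the first loss function value, reducing the objective pushes the first loss function value up while pushing the second and third down.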
[0039] Thirdly, an electronic device is provided, which has a first camera. Specifically, the electronic device includes: a shooting module for capturing a first image using the first camera, the first image including a first face image; a processing module for processing the first face image based on a first neural network model and outputting a second face image, wherein the feature intensity of a first feature in the second face image is greater than or equal to the feature intensity of a second feature in the first face image, and the feature intensity of a third feature in the second face image is less than the feature intensity of a fourth feature in the first face image; and a detection module for performing face feature detection based on the second face image; wherein the first and second features are related to the face feature detection service, while the third and fourth features are not related to the face feature detection service, and the first feature corresponds to the second feature, and the third feature corresponds to the fourth feature.
[0040] Fourthly, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program that can run on the processor, and when the processor executes the computer program, the electronic device performs the steps of the face feature detection method as described in any one of the first or second aspects above.
[0041] Fifthly, a computer-readable storage medium is provided that stores a computer program, which, when executed by a processor, implements the steps of the face feature detection method as described in any one of the first or second aspects above.
[0042] Sixthly, a computer program product is provided, which, when run on an electronic device, causes the electronic device to execute the face feature detection method described in either the first or second aspect.
[0043] In a seventh aspect, a chip system is provided, the chip system including a processor coupled to a memory, the processor executing a computer program stored in the memory to implement the face feature detection method described in any one of the first or second aspects above.
[0044] The chip system can be a single chip or a chip module composed of multiple chips.
[0045] It is understood that the beneficial effects of the third to seventh aspects mentioned above can be found in the relevant descriptions in the first or second aspects mentioned above, and will not be repeated here.
Attached Figure Description
[0046] Figure 1 shows an application example of an automatic screen-off rule provided in an embodiment of this application;
[0047] Figure 2 shows an application scenario provided by an embodiment of this application;
[0048] Figure 3 shows another application scenario provided by an embodiment of this application;
[0049] Figure 4 shows another application scenario provided by an embodiment of this application;
[0050] Figure 5 shows a schematic block diagram of a face feature detection method provided in an embodiment of this application;
[0051] Figure 6 shows an example image captured by an electronic device using a camera;
[0052] Figure 7 shows an exemplary process for identifying and cropping a face image;
[0053] Figure 8 shows an exemplary process for determining a face image and a human eye image;
[0054] Figure 9 shows example images of a face and of a face image after adaptive equalization;
[0055] Figure 10 shows an exemplary flowchart of image acquisition and image processing;
[0056] Figure 11 shows an exemplary process for gaze recognition;
[0057] Figure 12 shows another application scenario provided by an embodiment of this application;
[0058] Figure 13 shows a schematic diagram of an electronic device provided in an embodiment of this application;
[0059] Figure 14 shows an exemplary framework and process for gaze recognition by an electronic device;
[0060] Figure 15 shows a schematic flowchart of a face feature detection method provided in an embodiment of this application;
[0061] Figure 16 shows a system architecture diagram provided by an embodiment of this application;
[0062] Figure 17 shows an exemplary block diagram of a neural network model training method provided in an embodiment of this application;
[0063] Figure 18 shows a schematic diagram of a process for training a neural network model according to an embodiment of this application;
[0064] Figure 19 shows a block diagram of the hardware and software system of an electronic device provided in an embodiment of this application;
[0065] Figure 20 shows a block diagram of a hardware system of an electronic device according to an embodiment of this application.
Detailed Implementation
[0066] The technical solutions of the embodiments of this application will be described below with reference to the accompanying drawings. To more clearly describe the embodiments of this application, some terms or technologies involved in the embodiments of this application will be briefly introduced first.
[0067] With the rapid development of electronic information technology and the maturity of communication technology, contactless interaction methods have begun to be applied to various electronic devices. The electronic devices described in this application can be mobile phones, tablets, wearable devices, in-vehicle devices, augmented reality (AR) / virtual reality (VR) devices, laptops, ultra-mobile personal computers (UMPCs), netbooks, personal digital assistants (PDAs), etc. This application does not limit the specific type of electronic device. For convenience, a mobile phone will be used as an example for the following description.
[0068] As an example, in contactless interaction methods, mobile phones can use their cameras to detect facial features to perform corresponding operations. For instance, facial feature detection can be used for unlocking or making payments, eye tracking, and controlling the brightness or on / off state of the display screen. Examples will be provided below.
[0069] Often, users do not actively press the power button to turn off the screen after using their phones. If the screen remains on in this situation, battery power is wasted. To solve this problem, an automatic screen-off rule can be configured for the phone.
[0070] For example, Figure 1 shows an application example of an automatic screen-off rule. The phone is currently displaying the interface of a shopping application (App), as shown in Figure 1(a). If the user does not perform any operation on the phone within 30 seconds, that is, the phone does not receive any command from the user within 30 seconds, the phone automatically switches from the screen-on state to the screen-off state, as shown in Figure 1(b). It is understandable that the 30 seconds here could be the duration of the screen-off countdown set by the user.
[0071] In this embodiment, the screen-off countdown timer is used by the electronic device to automatically complete the screen-off operation. For example, when the electronic device, which is in a screen-on state, receives an operation command from the user, the electronic device automatically starts a screen-off countdown timer. If the electronic device receives another command from the user before the screen-off countdown ends, the electronic device resets the screen-off countdown. Here, the command can be a touch operation by the user at any position on the display screen, or a voice operation, etc. If the electronic device does not receive any command from the user before the screen-off countdown ends, the electronic device switches the display screen to a screen-off state.
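The countdown behavior described above can be sketched as follows. This is a minimal illustration only: the class name `ScreenOffCountdown`, its methods, and the externally supplied `now` timestamps are hypothetical and are not part of this application.

```python
class ScreenOffCountdown:
    """Hypothetical screen-off countdown driven by caller-supplied timestamps (seconds)."""

    def __init__(self, duration_s: float = 30.0):
        self.duration_s = duration_s  # countdown duration, e.g. set by the user
        self.deadline = None          # None means no countdown is running
        self.screen_on = True

    def on_user_command(self, now: float) -> None:
        """A touch or voice command received while the screen is on (re)starts the countdown."""
        if self.screen_on:
            self.deadline = now + self.duration_s

    def tick(self, now: float) -> bool:
        """Switch to the screen-off state once the countdown has ended; return the screen state."""
        if self.screen_on and self.deadline is not None and now >= self.deadline:
            self.screen_on = False
            self.deadline = None
        return self.screen_on
```

In this sketch, a second command received before the deadline simply moves the deadline forward, which corresponds to resetting the screen-off countdown.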
[0072] In this embodiment, the "screen-on state" refers to the state in which the display screen of the electronic device is lit. At this time, the display screen may show the lock screen (i.e., the screen to be unlocked), the system desktop or the negative one screen, or the application interface of an application installed on the electronic device (e.g., the application interface of the shopping app shown in (a) of Figure 1). The screen-on state can also be called the on-screen state, etc. The "screen-off state" refers to the state in which the display screen of the electronic device is not lit, that is, the display screen is off. The screen-off state can also be called the off-screen state or the black screen state, etc.
[0073] It is understandable that a screen being in an off state does not necessarily mean that the screen is completely devoid of any content. For example, even when the screen is off, it can still display information such as the time, date, battery level, application notifications, and wallpaper in certain areas of the screen (i.e., the always-on display function of the electronic device).
[0074] It is also understandable that if an electronic device includes multiple displays, and the main screen is on while the secondary screen is off (or the main screen is off while the secondary screen is on), the electronic device may be determined to be on or off based on the actual strategy. This application does not make specific limitations on this situation.
[0075] Electronic devices can switch between screen-on and screen-off states based on user commands. In one example, the electronic device is equipped with a power button, which the user can press to switch the display's on / off state; pressing the power button while the screen is on switches to screen-off mode, and pressing the power button while the screen is off switches to screen-on mode. Alternatively, the electronic device can automatically switch between screen-on and screen-off states based on preset rules.
[0076] Alternatively, in one implementation, the phone can first reduce the display brightness before switching to the screen-off state. For example, if the user does not perform any operation on the phone within 25 seconds, the phone automatically reduces the brightness of the display (as shown in (c) of Figure 1). After the brightness is reduced, if the user performs any operation on the phone within 5 seconds, the phone restores the screen brightness to the state shown in (a) of Figure 1; if the user does not perform any operation on the phone within 5 seconds, the phone automatically switches to the screen-off state (as shown in (b) of Figure 1).
[0077] The above solution allows the phone to automatically turn off its screen if it does not receive any touch input from the user for a period of time, thus saving resources.
[0078] However, in some scenarios, users may still be using their phones even without performing any operation. For example, when reading an article or thinking about a design solution on an electronic device, a user may look at the screen without performing any touch operation. As another example, after opening a shopping app on the phone, the user may enter the purchase interface shown in (a) of Figure 1, where 3 minutes and 6 seconds remain before the product goes on sale. To avoid missing the sale, the user may stare at the currently displayed page without performing any operation. If the phone automatically enters the screen-off state at this time, the user experience is affected.
[0079] Based on this, in the solution of this application, the mobile phone can detect the user's gaze before entering the screen-off state, and control the screen-on/off state of the phone based on the user's gaze. Referring to Figure 2, if the user does not perform any operation within 30 seconds, the phone can use the front-facing camera 210 to detect whether the user is looking at the display screen before entering the screen-off state. If the user is looking at the display screen, that is, the user's gaze point is within the display screen, it can be assumed that the user is using the phone at this time. In this case, the phone does not automatically enter the screen-off state (i.e., it keeps the screen on) and resets the screen-off countdown (i.e., the countdown starts again from the beginning). This reduces the occurrence of the phone automatically turning off the screen while the user is using it, thus improving the user experience.
[0080] It should be noted that, in the embodiments of this application, the gaze point refers to a point on the target object that the user's gaze is focused on during the visual perception process. If the user's gaze point is located within the display area of the electronic device's screen (hereinafter referred to as the user's gaze point being within the screen), it can be considered that the user is looking at the screen; otherwise, it is considered that the user is not looking at the screen. In the embodiments of this application, "detecting whether the user's gaze point is within the screen" can be referred to as "gaze point detection" or "gaze recognition".
[0081] Referring to Figure 3, if the user does not perform any operation within 30 seconds, the phone can use the front-facing camera 310 to detect whether the user is looking at the display screen before entering the screen-off state. If the user is not looking at the display screen at this time (that is, the user's gaze point is outside the display screen), it can be assumed that the user is not using the phone, and the phone automatically enters the screen-off state according to the original logic, thereby reducing power consumption.
[0082] As can be seen from the above, the solutions illustrated in Figure 2 and Figure 3 use the "constantly looking at the screen" scenario as an example. In this scenario, the electronic device detecting facial features refers to detecting whether the user's gaze point is within the display screen. However, it should be understood that the embodiments of this application are not limited to the above scenario. For example, Figure 4 shows another application scenario for detecting facial features.
[0083] As shown in (a) of Figure 4, the electronic device enters the lock screen interface (or locked screen). At this time, identifier 410 is used to indicate to the user that the electronic device is currently in a locked state. The electronic device then performs facial recognition through the front-facing camera. If the recognition is successful, the electronic device is unlocked, and the unlocked interface is as shown in (b) of Figure 4, where identifier 420 is used to indicate to the user that the electronic device is in an unlocked state. Therefore, in this scenario, detecting facial features refers to capturing a facial image through a camera, extracting facial features, and comparing the extracted facial features with information in a local or cloud database to determine whether the facial features match successfully.
[0084] Optionally, in the scenario shown in Figure 4, before facial recognition is performed, it can be determined whether the user is looking at the display screen. If so, facial recognition is performed; otherwise, it is not, thus reducing the chance of false unlocking. Therefore, in the scenario shown in Figure 4, detecting facial features can also include determining whether the gaze point of the human eye is within the display screen.
[0085] It is understood that the above application scenarios are merely examples, and this application is not limited to them. Other scenarios requiring facial feature detection are also within the scope of protection of this application. One example is the gaze-on screen-on scenario, in which an electronic device automatically turns on the screen after detecting that a user has been looking at the display for a period of time, or in which an electronic device that has reduced its brightness to save power (e.g., as shown in (c) of Figure 1) automatically restores the screen brightness after detecting that the user is looking at the screen. Another example is the eye-tracking control scenario, in which the electronic device detects the user's gaze point on the screen to open applications or notifications in the gazed area, to turn or zoom pages, or to track the user's dynamic gaze point for contactless control, such as interactive video games played through eye tracking. Yet another example is the electronic payment scenario, in which the user opens the payment interface and completes password verification through facial recognition. For convenience, the "constantly looking at the screen" scenario shown in Figure 2 is used as an example in the following description.
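The decision made just before the screen would turn off can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the action strings, and the representation of the gaze point as pixel coordinates are hypothetical, and the embodiments may instead obtain a direct gaze or non-gaze result from a model.

```python
from typing import Optional, Tuple


def gaze_point_in_screen(point: Tuple[float, float], width_px: int, height_px: int) -> bool:
    """Return True if the gaze point lies inside the display area (origin at top-left)."""
    x, y = point
    return 0 <= x < width_px and 0 <= y < height_px


def action_before_screen_off(point: Optional[Tuple[float, float]],
                             width_px: int, height_px: int) -> str:
    """Choose what to do when the screen-off countdown is about to end."""
    if point is not None and gaze_point_in_screen(point, width_px, height_px):
        # The user is looking at the screen: keep it on and restart the countdown.
        return "keep_on_and_reset_countdown"
    # No gaze point, or the gaze point is outside the display: follow the original logic.
    return "screen_off"
```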
[0086] However, facial feature detection often involves complex algorithms and requires high-quality captured facial images, so detection failures or false detections frequently occur. Regardless of the application scenario, if facial feature detection fails, subsequent solutions based on the detection results may not be able to be implemented smoothly.
[0087] Therefore, this application provides a face feature detection method that can improve detection accuracy. Specifically, before face feature detection is performed based on the captured face image, the captured face image can be preprocessed: on the one hand, features in the face image that are related to the face feature detection process can be retained or even enhanced to improve detection accuracy; on the other hand, features in the face image that are not related to the face feature detection process can be weakened to reduce interference during the face feature detection process. This approach can reduce the probability of detection failure and false detection. The solution provided in the embodiments of this application is described in detail below with reference to method 100 shown in Figure 5.
[0088] S110: Capture a first image using a first camera.
[0089] For example, the electronic device has a first camera, which is, for example, the front-facing camera of the electronic device, such as the first camera 210 shown in Figure 2. It is understood that the first camera can also be an external camera of the electronic device, and this application does not limit it in this regard.
[0090] The electronic device can capture a first image using the first camera. This first image can be one frame from a series of images captured by the first camera. It is understood that after capturing the first image, the electronic device can store it in an internal data buffer, without displaying it on a screen. In other words, capturing the first image is an internal operation of the electronic device, and the user is usually unaware of this process.
[0091] This application does not limit the image type of the first image.
[0092] In one example, the first image is a color image. That is, the first camera can be used to capture color images, or in other words, the first camera can be used to acquire two-dimensional images. For example, the first camera is an RGB camera, specifically an AO camera among RGB cameras. Because its related algorithms can run on a low-power smart sensor hub, using an AO camera can further reduce power consumption.
[0093] In another example, the first image is an infrared image. That is, the first camera can be used to capture infrared images, or in other words, the first camera can be used to acquire images containing depth information; for example, the first camera is a ToF camera.
[0094] Optionally, prior to S110, the electronic device can also acquire the current ambient light intensity. For example, if the first image is a color image, the electronic device can determine in advance whether the current ambient light intensity is greater than or equal to a fourth threshold. If so, the lighting conditions are likely to be good, and the first camera can be used to capture a color image. Since capturing a color image consumes relatively little power and image quality can be ensured under good lighting conditions, using a color image for subsequent face feature detection can reduce power consumption and, to some extent, ensure the accuracy of face feature detection. As another example, if the first image is an infrared image, the electronic device can determine in advance whether the current ambient light intensity is less than the fourth threshold. If so, the lighting conditions are poor, and the first camera can be used to capture an infrared image. Since the image quality of an infrared image is almost unaffected by lighting conditions, using an infrared image for subsequent face feature detection can improve the accuracy of face feature detection.
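The threshold comparison above can be sketched as a single selection function. This is only an illustration: combining the two examples into one choice, the function name, and the use of lux as the unit are assumptions, and the application does not specify a value for the fourth threshold.

```python
def choose_image_type(ambient_light: float, fourth_threshold: float) -> str:
    """Pick the image type to capture from the ambient light intensity.

    Both arguments are assumed to use the same unit (for example, lux).
    """
    if ambient_light >= fourth_threshold:
        return "color"     # good lighting: lower-power color capture is sufficient
    return "infrared"      # poor lighting: infrared is almost unaffected by illumination
```

For instance, with a hypothetical threshold of 50, a reading of 50 selects a color image and a reading of 49.9 selects an infrared image.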
[0095] For convenience, the following embodiments will be described using a color image as an example.
[0096] For example, the first image includes a first face image. The face image described in the embodiments of this application refers to an image containing part or all of a face. For example, assume that image 610 in Figure 6 corresponds to the first image (for convenience, image 610 is displayed in black and white); the partial image 620 therein is then the first face image.
[0097] The first face image is used to perform face feature detection, or in other words, to detect face features. For example, in the application scenario shown in Figure 2, the first face image is used to detect whether the user's gaze point is within the display screen. In this scenario, an electronic device in a screen-on state can capture a first image using the first camera when the screen-off countdown is less than a preset value (e.g., 4 seconds); or the electronic device can obtain the current ambient light intensity when the screen-off countdown is less than the preset value, and then capture a first image using the first camera based on the ambient light intensity.
[0098] Optionally, after S110 and before S120, the method further includes: determining that the first image includes a first face image. For example, the first image is detected based on a face detection model, and a detection result is output, indicating that the first image includes the first face image.
[0099] In other words, after the electronic device captures a first image using the first camera, it detects whether the first image includes a face image. If it does, it continues to execute the subsequent operation S120; if it does not, it can re-execute S110, or continue to detect whether other images captured by the first camera include face images, or stop the face feature detection process. This application does not limit this.
[0100] It is understood that this application does not limit the specific method for detecting whether an image includes a human face. One possible implementation is illustrated below by way of example with reference to Figure 7 and Figure 8.
[0101] Figure 7 presents an exemplary process for recognizing and cropping face images. Optionally, basic preprocessing, such as scaling and downsampling, can first be performed on the original image (i.e., the first image described above; image 610 in Figure 6 is taken as an example for illustration) according to the input requirements of different models.
[0102] The processed image is then input into a face detection model, such as a neural network model. The face detection model processes the input image and outputs a detection result. If the input image contains a face image, the output detection result includes the face image; optionally, it may also output eye images (including left and right eye images). If the input image does not contain a face image, the output detection result indicates that the image does not contain a face image.
[0103] Referring to Figure 8, the face detection model can use a facial landmark detection algorithm (such as DeepBlueface (Dbface), Practical Facial Landmark Detector (PFLD), Face-Landmark Factory, Multi-Task Convolutional Neural Network (MTCNN), or Centerface, which are not limited here) to determine the facial key points in the input image, namely left eye A, right eye B, nose C, left lip corner D, and right lip corner E, and to determine the coordinate position of each key point (as shown in (a) of Figure 8). Optionally, if the face is tilted, for example if key points A and B are not on the same horizontal line, the face image can be corrected using a face correction algorithm; the specific process is not limited in this application. It is understood that if the face detection model does not detect facial key points in the input image (or does not detect all five key points), it is considered that there is no face image in the input image.
[0104] Furthermore, the face detection model outputs a detection result that includes a face image. For example, the face detection model can define a rectangle that is centered on the nose C and contains the other key points; the image extracted from this rectangle is the face image (corresponding to the first face image described above), as shown in (b) of Figure 8. Optionally, the detection result also includes a left eye image. For example, the face detection model defines a fixed-size rectangle centered on the left eye A, and the image extracted from this rectangle is the left eye image, as shown in (c) of Figure 8. Optionally, the detection result also includes a right eye image. For example, the face detection model defines a fixed-size rectangle centered on the right eye B, and the image extracted from this rectangle is the right eye image, as shown in (d) of Figure 8.
[0105] S120: Process the first face image based on a first neural network model, and output a second face image.
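The key-point-based tilt check and cropping described above can be sketched with NumPy as follows. This is a minimal illustration: the helper names, the `margin` parameter, the square face rectangle, and the assumption that key point A lies to the left of key point B in image coordinates are all hypothetical and are not specified in this application.

```python
import math
import numpy as np


def eye_tilt_degrees(left_eye, right_eye) -> float:
    """Angle of the line A->B relative to the horizontal; 0 means the eyes are level."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def face_rectangle_size(nose, others, margin: float = 0.5) -> int:
    """Side length of a square centered on the nose that contains the other key points."""
    half = max(max(abs(x - nose[0]), abs(y - nose[1])) for x, y in others)
    return int(round(2 * half * (1 + margin)))


def crop_centered(image: np.ndarray, center, width: int, height: int) -> np.ndarray:
    """Crop a width x height rectangle centered on (x, y), clipped to the image bounds."""
    x0 = int(round(center[0] - width / 2))
    y0 = int(round(center[1] - height / 2))
    x1, y1 = x0 + width, y0 + height
    x0, y0 = max(x0, 0), max(y0, 0)
    x1, y1 = min(x1, image.shape[1]), min(y1, image.shape[0])
    return image[y0:y1, x0:x1]
```

The same `crop_centered` helper can serve both the face rectangle (centered on C) and the fixed-size eye rectangles (centered on A and B); a nonzero tilt angle would indicate that face correction may be applied first.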
[0106] For example, before performing the face feature detection service, the first face image can be pre-processed with adaptive equalization to make targeted improvements to the first face image for the face feature detection service, thereby improving the accuracy of subsequent face feature detection.
[0107] For example, a first face image is processed based on a first neural network model, and a second face image is output. This means that the second face image is obtained by processing the first face image using the first neural network model.
[0108] As an example, the first neural network model is a lightweight graph-to-graph transformation neural network model used for adaptive equalization processing of images. The adaptive equalization processing in this application refers to targeted adjustments to image features for face feature detection, including retaining or enhancing features relevant to face feature detection and weakening features irrelevant to face feature detection, in order to specifically improve image quality.
[0109] Therefore, the features in the second face image are specifically improved compared to the features in the first face image. For example, the feature strength of a first feature in the second face image is greater than or equal to the feature strength of a second feature in the first face image, and the feature strength of a third feature in the second face image is less than the feature strength of a fourth feature in the first face image. Here, the first and second features are related to face feature detection, while the third and fourth features are not. The first feature corresponds to the second feature, and the third feature corresponds to the fourth feature. Specifically, the first feature corresponding to the second feature means that the first feature is the feature obtained by processing the second feature using the first neural network model; similarly, the third feature corresponding to the fourth feature means that the third feature is the feature obtained by processing the fourth feature using the first neural network model. In other words, the first feature is the same as the second feature or is obtained by enhancing the second feature, and the third feature is obtained by weakening the fourth feature.
[0110] It should be noted that, in this embodiment, a feature related to the face feature detection service refers to a feature that can be used in the face feature detection process; that is, when performing the face feature detection service, it is necessary to detect face features based on this feature. Conversely, a feature unrelated to the face feature detection service refers to a feature that is not used in the face feature detection process; that is, this feature is an interfering feature in the face feature detection process.
[0111] For example, when the face feature detection service is gaze point detection, the first and second features may include eye features and/or facial contour features, and the third and fourth features may include one or more of the following: light and shadow features affected by illumination (such as highlights or shadows on the face), facial features other than the eyes (such as the nose and mouth), biological features other than facial features (such as freckles and moles on the face), accessory features (such as earrings and nose studs), and makeup features (such as lipstick and eyeliner). Figure 9 shows a concrete example. Assume that the first face image shown in (a) of Figure 9 contains a mole 910 and earrings 920a and 920b, and that the lighting in the image is uneven. After this first face image is input into the first neural network model, a second face image is output. Assume that the second face image is as shown in (b) of Figure 9: the lighting features have been balanced, the eye features and facial contour features have been preserved, and the mole 910 and the earrings 920a and 920b have been weakened (assuming they are weakened to the point of being invisible). Alternatively, the second face image may be as shown in (c) of Figure 9, in which the feature strength of the eye features and facial contour features is enhanced, while other features are weakened (assuming they are weakened to the point of being invisible). This reduces interference and improves the accuracy, precision, and recall of detection when subsequent face feature detection operations are performed using the second face image.
[0112] As shown above, the proposed method can acquire face images and perform adaptive equalization processing on them to retain or even enhance features relevant to face feature detection while weakening features irrelevant to it. In other words, it retains or enhances features to be used later and weakens other noisy features. For example, in gaze detection scenarios, it balances the changes in image features caused by different lighting conditions, reducing the fluctuation of model performance with changes in lighting; it suppresses other high-frequency facial details unrelated to gaze, focusing on retaining key eye features, thereby reducing the possibility of the model being interfered with by other irrelevant high-frequency details (such as earrings, mouth expressions, etc.).
[0113] The face image acquisition and image preprocessing process described above is summarized below with reference to Figure 10. First, the first camera is activated, and a first image is acquired using it. Then, basic preprocessing is performed on the first image, such as image scaling, downsampling, and pixel normalization. Next, face detection and key point detection are performed on the processed image to extract the face image and, optionally, the eye images. Further, the extracted face image can be input into the adaptive equalization network model for adaptive image equalization processing to improve the original image.
[0114] It is understood that this application does not limit the training method of the first neural network model. In one implementation, the first neural network model is trained based on training data, which includes multiple sample images. These multiple sample images are obtained by augmenting the same original image with features unrelated to the face feature detection service. The original image includes a third face image. Specifically, for example, the first neural network model is trained based on a first loss function value, a second loss function value, and a third loss function value. The first loss function value is obtained by comparing the multiple sample images with multiple reconstructed images, which are generated based on multiple feature vectors extracted from the training data. These feature vectors, sample images, and reconstructed images correspond one-to-one. The second loss function value is obtained by comparing the average vector distance between the multiple feature vectors with a target distance. The third loss function value is obtained by comparing a detection result, obtained by performing the face feature detection service based on the multiple feature vectors, with a target result. For the specific model training method, refer to the detailed description of method 200 below, which is not repeated here.
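The basic preprocessing step (downsampling and pixel normalization) can be sketched with NumPy as follows. This is an illustration under stated assumptions: block averaging as the downsampling method, an 8-bit input, the [0, 1] target range, and the name `basic_preprocess` are choices made here and are not specified in this application.

```python
import numpy as np


def basic_preprocess(image: np.ndarray, factor: int = 2) -> np.ndarray:
    """Downsample by averaging factor x factor blocks, then scale 8-bit pixels to [0, 1].

    Accepts an (H, W) or (H, W, C) uint8 image and returns an (H//factor, W//factor, C)
    float32 array (C is 1 for a grayscale input).
    """
    h, w = image.shape[:2]
    h, w = h - h % factor, w - w % factor          # drop rows/columns that do not fill a block
    img = image[:h, :w].astype(np.float32)
    blocks = img.reshape(h // factor, factor, w // factor, factor, -1)
    return blocks.mean(axis=(1, 3)) / 255.0
```

The output of such a step would then be passed to face detection and key point detection, followed by the adaptive equalization model.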
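The three loss terms described above can be sketched with NumPy as follows. This is a hedged illustration rather than the training method of this application: the mean squared error for the reconstruction term, the Euclidean distance and absolute difference for the distance term, the binary cross-entropy for the detection term, and the weighting in `total_loss` are all assumptions chosen for concreteness.

```python
import numpy as np


def reconstruction_loss(samples: np.ndarray, reconstructions: np.ndarray) -> float:
    """Compare sample images with their one-to-one reconstructed images (assumed: MSE)."""
    return float(np.mean((samples - reconstructions) ** 2))


def distance_loss(features: np.ndarray, target_distance: float = 0.0) -> float:
    """Compare the average pairwise distance between feature vectors with a target distance."""
    n = len(features)                                # requires at least two feature vectors
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    mean_dist = dists[np.triu_indices(n, k=1)].mean()
    return float(abs(mean_dist - target_distance))


def detection_loss(pred_probs: np.ndarray, targets: np.ndarray, eps: float = 1e-7) -> float:
    """Compare detection results with target results (assumed: binary cross-entropy)."""
    p = np.clip(pred_probs, eps, 1 - eps)
    return float(-np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p)))


def total_loss(l1: float, l2: float, l3: float, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three loss values; the weights are hypothetical."""
    return weights[0] * l1 + weights[1] * l2 + weights[2] * l3
```

Because the sample images are augmentations of the same original image, a small target distance in this sketch would push their feature vectors toward one another, so that the features unrelated to the detection service have little influence on the extracted representation.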
[0115] It is understood that the embodiments of this application are described using adaptive equalization processing on a first face image as an example, but this application is not limited thereto. For example, histogram equalization processing can also be performed on the first face image to improve the image quality of the first face image, and this application does not limit this.
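As a point of comparison with adaptive equalization, classic global histogram equalization can be sketched with NumPy as follows. The function name is hypothetical and the sketch assumes an 8-bit grayscale input; a color face image would first need to be converted or processed per channel.

```python
import numpy as np


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Global histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)   # pixel count per gray level
    cdf = hist.cumsum()                               # cumulative distribution
    cdf_min = cdf[cdf > 0][0]                         # CDF value at the darkest level present
    total = gray.size
    if total == cdf_min:                              # constant image: nothing to equalize
        return gray.copy()
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)       # lookup table: old level -> new level
    return lut[gray]
```

Unlike the adaptive equalization model, this mapping depends only on the gray-level distribution and cannot distinguish features relevant to face feature detection from irrelevant ones.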
[0116] S130: Perform face feature detection based on the second face image.
[0117] For example, after the first face image is processed using the first neural network model to obtain the second face image, a face feature detection process is performed based on the second face image. It is understood that performing the face feature detection process described in the embodiments of this application refers to the process of capturing a face image through a camera and performing feature detection. The embodiments of this application can be applied to various scenarios involving face feature detection. For example, in the application scenario shown in Figure 2, detecting face features based on the second face image refers to detecting, based on the second face image, whether the user's gaze point is within the display screen of the electronic device. This application does not limit the specific detection method. As an example, the electronic device can determine the left eye image and the right eye image based on the second face image; for the specific implementation, refer to the solution shown in Figure 7 and Figure 8. The second face image, the left eye image, and the right eye image are then used together to detect whether the user's gaze point is within the display screen of the electronic device.
[0118] For example, the left and right eye images can be processed based on an eye-opening/closing recognition model, which outputs an open/closed-eye result indicating whether the user's eyes are open.
[0119] In one example, the eye-opening / closing recognition model is a shallow classification model consisting of fewer neural network layers, thus requiring only a very low computational cost to achieve accurate eye-opening / closing judgment. It is understood that this eye-opening / closing recognition model can determine that the user is in an open-eye state when both eyes are open, or it can determine that the user is in an open-eye state when only one eye is open; this application does not limit this.
[0120] When the user's eyes are open, the system detects whether the user's gaze point is within the display screen of the electronic device based on the second facial image, the left eye image, and the right eye image. When the user's eyes are closed, the system determines that the user's gaze point is not within the display screen of the electronic device.
[0121] The specific detection process is illustrated below with reference to Figure 11. Left-eye features and right-eye features are extracted from the left-eye image and the right-eye image, respectively, using a human eye feature extraction model. This human eye feature extraction model is a neural network model used to extract eye image features to generate high-order representations of the left and right eye images.
[0122] The left and right eye features are input into the eye-opening / closing recognition model to determine whether the user's eyes are currently open. If the user's eyes are closed, it is determined that the user's gaze point is not within the display screen of the electronic device. If the user's eyes are open, the face recognition branch is activated: a face feature extraction model is used to extract face features from the second face image. This face feature extraction model is a neural network model used to extract features from the entire face image to generate a high-order representation of the entire face image. Then, the left eye features, right eye features, and face features are fused to obtain fused data. Finally, the gaze recognition model is used to process the fused data and output a gaze result. This gaze result indicates whether the user's gaze point is within the display screen of the electronic device. The gaze recognition model is a shallow classification model composed of a few neural network layers, which can achieve high-precision judgment of gaze and non-gaze.
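The gating logic of this detection process can be sketched as follows. This is a structural illustration only: the models are passed in as plain callables, the function and parameter names are hypothetical, and fusing features by concatenation is an assumption, since this application does not specify the fusion method.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def detect_gaze(left_eye_feat: Features,
                right_eye_feat: Features,
                second_face_image: object,
                eyes_open_model: Callable[[Features, Features], bool],
                face_feature_model: Callable[[object], Features],
                gaze_model: Callable[[Sequence[float]], bool]) -> bool:
    """Return True if the user's gaze point is judged to be within the display screen."""
    # Cheap gate first: if the eyes are closed, skip the full-face branch entirely.
    if not eyes_open_model(left_eye_feat, right_eye_feat):
        return False
    # Eyes are open: run the heavier face branch and fuse all three feature sets.
    face_feat = face_feature_model(second_face_image)
    fused = list(left_eye_feat) + list(right_eye_feat) + list(face_feat)  # assumed fusion
    return bool(gaze_model(fused))
```

Placing the shallow eye-state check ahead of the face feature extraction means that the deeper model runs only when its result can change the outcome.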
[0123] Therefore, in the above scheme, gaze recognition is not performed directly. Instead, the system first identifies whether the user's eyes are open based on the left and right eye images. If so, the gaze recognition process continues. Since the training difficulty of the open / closed eye recognition model is much lower than that of the gaze recognition model, its accuracy is very high. Therefore, performing accurate open / closed eye filtering in advance helps improve the overall accuracy of gaze recognition. Secondly, compared to the deep neural network structure and large number of parameters of the facial feature extraction model, the very shallow open / closed eye recognition model requires extremely low computation. By reducing meaningless full-face branch operations, the power consumption can be further reduced, which helps to achieve the low power consumption requirement.
[0124] It is understandable that in another implementation, instead of performing eye-opening / closing recognition, the left eye features, right eye features, and facial features are directly fused to obtain fused data, and then gaze recognition is performed based on the fused data. This application does not limit this.
[0125] Optionally, after it is detected whether the user's gaze point is within the display screen, the on/off state of the electronic device's display screen can be controlled based on the detection result. In this embodiment, the on/off state of the display screen indicates whether the display screen of the electronic device is currently in the screen-on state or the screen-off state. For example, in the application scenario shown in Figure 2, the detection result can be fed back to the automatic screen-off control program to control the on/off state of the display. Specifically, if the detection result indicates that the user's gaze point is within the display screen of the electronic device, the control program can keep the electronic device in the screen-on state and reset the screen-off countdown.
[0126] In summary, the above-mentioned solution provided in this application embodiment can perform adaptive equalization processing on the captured face image to retain or enhance features related to the face feature detection business and weaken features unrelated to the face feature detection business, thereby improving detection accuracy.
[0127] However, if the first camera is a color camera, the detection result may be affected by factors such as lighting when color images are captured for facial feature detection. For example, in extremely dark environments, backlit scenes, or scenes with glare from glasses, the facial features in the images captured by the first camera may be unclear, which affects the detection result. Figure 12 provides a specific example scenario. Referring to Figure 12, the electronic device has a front-facing camera 1210, which is a two-dimensional camera used to capture two-dimensional color images, such as an RGB camera. If a user uses the phone while lying in bed at night with the lights off, and the front-facing camera is used to capture a face image, the resulting image may be of poor quality, and the user's gaze point may not be detected successfully. By contrast, if the first camera is a ToF (time-of-flight) camera, the detection results are generally better when it is used to capture infrared images for face feature detection. This is because an infrared image is captured by having an infrared emitter continuously emit modulated infrared light pulses onto the surface of an object; after reflection, the pulses are received by a receiver, the time difference is calculated from the phase change, and, combined with the speed of light, the depth information of the object is obtained. The image quality is generally unaffected by ambient light; therefore, even in extremely dark or unevenly lit scenes, the infrared image captured by the electronic device can contain a relatively clear image of the user's face. However, the algorithms for capturing infrared images run at the hardware abstraction layer (HAL), resulting in relatively high power consumption.
[0128] Therefore, embodiments of this application provide an electronic device with two cameras: one camera for acquiring color images, such as an RGB camera, and another camera for acquiring infrared images, for example a camera that can capture images containing depth information (a 3D camera), such as a ToF camera. As shown in Figure 13, the mobile phone has a front-facing camera module with two cameras: camera 1310 is an RGB camera, and camera 1320 is a ToF camera. These two cameras can detect facial features under different ambient light conditions. Specifically, when the light intensity is higher than a preset value, the RGB camera captures a color image; when the light intensity is lower than the preset value, the ToF camera captures an infrared image. In addition, if the reliability of the detection result obtained using the color image is lower than a threshold, the ToF camera can then capture an infrared image.
[0129] The following explanation uses method 100 as an example: Assume the first image in method 100 is a color image, meaning the first camera is used to capture color images. After determining the second face image based on the first image and performing face feature detection based on the second face image, if the confidence level of the face feature detection result is lower than a second threshold or the face feature detection process fails, the average pixel value of the first face image is obtained. If the average pixel value of the first face image is lower than a third threshold, the second camera is used to capture a second image, which includes the third face image. If the average pixel value of the first face image is higher than the third threshold, the first camera can be used to re-capture the face image and perform the face feature detection process again, or the face feature detection process can be directly determined to have failed, or the aforementioned low-confidence detection result can be used. The specific configuration can be based on actual needs.
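The decision described in this paragraph can be sketched as a small function. This is only an illustration under stated assumptions: the names `NextAction` and `decide_next_action` are hypothetical, a failed detection is represented by `confidence=None`, and the behavior when a value exactly equals a threshold is an assumption because the text does not specify it.

```python
from enum import Enum
from typing import Optional


class NextAction(Enum):
    ACCEPT = "accept the detection result"
    USE_SECOND_CAMERA = "capture a second (infrared) image with the second camera"
    RETRY_OR_FAIL = "re-capture with the first camera, report failure, or keep the low-confidence result"


def decide_next_action(
    confidence: Optional[float],  # None means the detection process failed
    mean_pixel: float,            # average pixel value of the first face image
    second_threshold: float,
    third_threshold: float,
) -> NextAction:
    # Detection succeeded with sufficient confidence: nothing more to do.
    if confidence is not None and confidence >= second_threshold:
        return NextAction.ACCEPT
    # Low confidence or failure: a dark face image suggests an image-quality problem.
    if mean_pixel < third_threshold:
        return NextAction.USE_SECOND_CAMERA
    # Image quality looks adequate, so the failure likely has another cause.
    return NextAction.RETRY_OR_FAIL
```

Which of the three options behind `RETRY_OR_FAIL` is taken is left to configuration, as the paragraph states.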
[0130] Specifically, when the average pixel value of the first face image is higher than the third threshold, the probability of detection failure due to the quality problem of the first face image is less than the preset value. In other words, if the average pixel value of the first face image is higher than the third threshold, it means that the image quality of the first face image is relatively good. Therefore, it is highly unlikely that the face feature detection will fail due to quality problems. In other words, if the average pixel value of the first face image is higher than the third threshold, but the face feature detection business still fails when using the first face image, the reason for the failure is generally not because the quality of the first face image is poor.
[0131] Correspondingly, when the average pixel value of the first face image is lower than the third threshold, the probability of detection failure due to the quality problem of the first face image is greater than the preset value. In other words, if the average pixel value of the first face image is lower than the third threshold, it means that the image quality of the first face image is relatively poor, and therefore, the face feature detection is likely to fail due to quality problems. In other words, if the average pixel value of the first face image is lower than the third threshold, and the face feature detection using the first face image fails, poor image quality is likely one of the reasons for the detection failure. Therefore, in this case, the second camera needs to be called to re-capture the second image. This second image includes the third face image and is an infrared image, so the imaging quality is not affected by ambient light. Furthermore, a facial feature detection service can be performed based on a third facial image: the third facial image is processed using a second neural network model to output a fourth facial image; the facial feature detection service is then performed based on the fourth facial image; wherein, the feature intensity of the fifth feature in the fourth facial image is greater than or equal to the feature intensity of the sixth feature in the third facial image, the feature intensity of the seventh feature in the fourth facial image is less than the feature intensity of the eighth feature in the third facial image, the fifth feature corresponds to the sixth feature, the seventh feature corresponds to the eighth feature, the fifth feature and the sixth feature are related to the facial feature detection service, and the seventh feature and the eighth feature are not related to the facial feature detection service; the first image is a color image, and the second image is an infrared image.
[0132] Understandably, the execution process of the aforementioned face feature detection is similar to that of S120 and S130, and for the sake of brevity, it will not be repeated here. However, it should be noted that the second neural network model is trained in the same way as the first neural network model, except that the training data for the second neural network model is an infrared image set, while the training data for the first neural network model is a color image set. For specific implementation details, please refer to the description of method 200 below; no specific limitations are made here.
[0133] It is also understood that the average pixel value of the first face image can be calculated in advance by the face detection model or other models. In this case, the electronic device can directly obtain the average pixel value of the pre-generated first face image after the confidence of the detection result is lower than the second threshold or the face feature detection service fails, thereby reducing latency; or, the electronic device can calculate the average pixel value of the first face image after the confidence of the detection result is lower than the second threshold or the face feature detection service fails. This application does not limit this.
[0134] It is also understood that the embodiments of this application only use the average pixel value of the first face image to characterize the image quality of the first face image as an example for illustration, but this application is not limited to this. That is to say, other parameters used to characterize the image quality can also be calculated to determine whether to activate the second camera. For example, parameters used to characterize the image quality can be calculated based on the brightness, contrast, and sharpness of the first face image. The specific method is not limited in this application.
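As one possible illustration of such quality parameters, the sketch below computes brightness, contrast, and a crude sharpness score for a grayscale face image. The specific definitions (mean, population standard deviation, mean absolute horizontal gradient) and the name `quality_metrics` are assumptions chosen for simplicity; the application does not prescribe them.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def quality_metrics(gray: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (brightness, contrast, sharpness) for a grayscale image given as rows of pixels."""
    pixels = [p for row in gray for p in row]
    brightness = fmean(pixels)   # average pixel value
    contrast = pstdev(pixels)    # spread of pixel values
    # Mean absolute difference between horizontally adjacent pixels,
    # used here as a simple stand-in for a sharpness measure.
    diffs = [abs(row[i + 1] - row[i]) for row in gray for i in range(len(row) - 1)]
    sharpness = fmean(diffs) if diffs else 0.0
    return brightness, contrast, sharpness
```

A uniformly gray image scores zero contrast and zero sharpness, while an image with strong vertical edges scores high on both.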
[0135] In summary, in the face feature detection method provided in this application embodiment, if face feature detection fails based on a first face image captured by a first camera, and the failure may be due to the average pixel value of the first face image being lower than a third threshold, then a second camera is used to capture a second image, and face features are detected based on a third face image in the second image. That is, in this solution, a failure of face feature detection using the first face image is not immediately treated as final. Instead, it is first determined whether the average pixel value of the first face image is lower than the third threshold. If so, the failure is likely due to image quality issues with the first image. For example, in backlit scenes, the imaging quality of color images is relatively poor. Switching to the second camera to capture infrared images can improve the detection success rate, as the imaging quality of infrared images is almost unaffected by lighting conditions. Therefore, in the above solution, the first camera can be used first to acquire images and perform face feature detection, thereby minimizing additional power consumption. If detection fails and the failure may be due to image quality issues, the second camera is then used to acquire images for face feature detection, thereby improving the detection success rate.
[0136] In addition, after acquiring the first face image using the first camera, the first neural network model is used to preprocess the first face image before performing the face feature detection service. Since this preprocessing process can specifically improve the image quality for the face feature detection service, it enhances the quality of the color image under imperfect lighting conditions to a certain extent (or enhances the performance of the color camera under imperfect lighting conditions), which is equivalent to reducing the shooting of infrared images (or reducing the use of the infrared camera), thus indirectly reducing unnecessary power consumption.
[0137] The embodiments of this application can be applied to various scenarios for detecting facial features. After successfully performing the facial feature detection service, subsequent operations can be performed based on the detection results.
[0138] For example, in the scenario shown in Figure 4, the facial feature detection process refers specifically to the facial recognition process. If the detection is successful, the electronic device automatically switches from the locked state (as shown in Figure 4(a)) to the unlocked state (as shown in Figure 4(b)).
[0139] For example, in the scenario shown in Figure 2, performing facial feature detection refers to detecting whether the user's gaze point is within the display screen of the electronic device. If the detection is successful, the electronic device controls the on / off state of its display screen based on the detection result. Specifically, if the user's gaze point is within the display screen, the display screen remains on and the screen-off countdown is reset; if the user's gaze point is outside the display screen, the screen-off countdown continues, and the device automatically switches from the screen-on state to the screen-off state after the countdown expires.
To more clearly describe how the solution provided in this application embodiment is applied in a real-world scenario, an electronic device is described below in conjunction with this scenario. This electronic device includes modules for executing the corresponding methods in the above embodiments. These modules can be software, hardware, or a combination of software and hardware. As shown in Figure 14, the electronic device includes a system activation module, an RGB image acquisition and preprocessing module, an RGB image gaze recognition module, an IR image acquisition and preprocessing module, and an IR image gaze recognition module. The system activation module is used to execute the screen-off countdown and to decide whether to enable the ToF camera. The RGB image acquisition and preprocessing module is used to enable the RGB camera to capture RGB images, perform facial landmark detection on the captured RGB images, crop the face and eye images, and perform adaptive image equalization. The RGB image gaze recognition module is used to perform eye opening / closing detection and gaze detection based on the face images, and to make gaze recognition decisions.
The IR image acquisition and preprocessing module is used to enable the IR camera to capture IR images, perform facial landmark detection on the captured IR images, crop the face and eye images, and perform adaptive image equalization. The IR image gaze recognition module is used to perform eye opening / closing detection and gaze detection based on the face images, and to make gaze recognition decisions. The following section uses the flowchart shown in Figure 15 to illustrate the execution flow of each module in the above-mentioned gaze recognition scenario.
[0140] S1. Determine if the screen-off countdown is less than t.
[0141] For example, an electronic device can start a screen-off countdown after the user completes an operation. When the screen-off countdown is less than t, the system activation module activates the system, that is, triggers the execution of step S2.
[0142] It is understandable that 't' here can be a pre-configured threshold used to activate the entire feature that keeps the screen on while the user is gazing at it. It is also understandable that the value of 't' and of the other thresholds introduced below can be determined through extensive experimentation with specific electronic devices in different scenarios; this solution does not impose specific restrictions on this, and this point will not be repeated hereafter.
[0143] S2, Obtain ambient light intensity.
[0144] S3. Determine if the ambient light smoothing value is less than p.
[0145] For example, when the screen-off countdown is less than t, the system activation module triggers the electronic device to acquire the current ambient light intensity using the ambient light sensor. For convenience, this embodiment uses the ambient light smoothing value (a sliding average of the ambient light readings) to characterize the ambient light intensity. It is then determined whether the ambient light smoothing value is less than p, where p corresponds to a light intensity value, as collected by the ambient light sensor, that reflects a dark environment. In other words, when the ambient light intensity is less than p, the RGB camera can hardly acquire any image usable for subsequent algorithm analysis. Therefore, when the ambient light smoothing value is less than p, the ToF camera is triggered to acquire an infrared (IR) image (i.e., steps S4 to S12 are triggered); when the ambient light smoothing value is greater than or equal to p, the RGB camera is triggered to acquire an RGB image (i.e., steps S14 to S22 are triggered). Steps S4 to S12 are explained below as an example.
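Steps S2 and S3 can be sketched as a sliding average followed by a threshold comparison. The class name `AmbientLightSmoother`, the window size, and the string labels `"tof"` and `"rgb"` are hypothetical; the text only states that a sliding average is compared with p.

```python
from collections import deque


class AmbientLightSmoother:
    """Sliding average over the most recent ambient light sensor readings."""

    def __init__(self, window: int = 5) -> None:
        # deque with maxlen drops the oldest reading automatically.
        self._samples = deque(maxlen=window)

    def update(self, reading: float) -> float:
        self._samples.append(reading)
        return sum(self._samples) / len(self._samples)


def select_camera(smoothed: float, p: float) -> str:
    # Below p the scene is treated as dark and the ToF (IR) branch S4 to S12 is used;
    # otherwise the RGB branch S14 to S22 is used.
    return "tof" if smoothed < p else "rgb"
```

Smoothing avoids switching cameras on a single noisy sensor reading; a larger window reacts more slowly to real lighting changes.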
[0146] S4. Enable ToF camera.
[0147] S5. Acquire IR image.
[0148] For example, if the ambient light smoothing value is less than p, it indicates that the current scene is dark, and the process directly proceeds to the ToF-related steps. Specifically, the IR image acquisition and preprocessing module enables the ToF camera and uses the ToF camera to capture IR images.
[0149] S6, IR face detection.
[0150] S7. Determine if a face has been detected.
[0151] For example, after acquiring the IR image, face detection can be performed based on the IR image. For instance, the IR image acquisition and preprocessing module inputs the IR image into an IR face detection model to detect whether the IR image contains a face image. This IR face detection model is a neural network model trained using an IR face dataset. If no face is detected, the process is restarted. If a face is detected, step S8 is continued.
[0152] S8. Extract face image + eyes image.
[0153] For example, after the IR image acquisition and preprocessing module detects a face in the IR image, it extracts the face image and the eyes image from the IR image. For a specific implementation, please refer to the methods corresponding to Figure 7 and Figure 8, which will not be elaborated here.
[0154] S9, IR image adaptive equalization.
[0155] For example, the IR image acquisition and preprocessing module performs adaptive equalization processing on the cropped face image (optionally, it can also perform adaptive equalization processing on the cropped eye image). For the specific implementation method, please refer to the description of step S120 in method 100, which will not be repeated here.
[0156] S10, IR gaze recognition.
[0157] For example, after the IR image acquisition and preprocessing module performs adaptive equalization processing on the face image, it outputs the processed image to the IR image gaze recognition module. The IR image gaze recognition module performs gaze recognition based on the processed image. For details, please refer to the method corresponding to Figure 11, which will not be elaborated here.
[0158] S11. Determine whether the recognition confidence level is lower than the confidence level threshold.
[0159] For example, after completing gaze recognition, the electronic device can determine whether the recognition confidence level is lower than a confidence level threshold. In this embodiment, the confidence level is a value that measures how reliable the recognition result is, and the specific calculation method is not limited in this application. As an example, the recognition confidence level can be calculated based on the current light intensity, IR image sharpness, IR image brightness, IR image contrast, etc.
[0160] If the recognition confidence level is lower than the confidence level threshold, the current process is reset. If the recognition confidence level is greater than or equal to the confidence level threshold, then proceed to step S12.
[0161] It is understood that when this process is combined with the above method 100, the confidence threshold may correspond to the second threshold in method 100.
[0162] S12. Determine whether the human eye is looking at the screen.
[0163] For example, the IR image gaze recognition module can determine whether the human eye is looking at the screen (i.e., whether the user's gaze point is within the display screen) based on the gaze recognition result. If the user is not looking, the current process is reset. If the user is looking, S13 is executed.
[0164] S13: Keep screen on; Reset screen-off countdown; Turn off camera.
[0165] For example, if the user is currently looking at the screen, the electronic device keeps the screen on and resets the screen-off countdown. Additionally, the IR camera is turned off.
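The interaction between the screen-off countdown, the activation condition in S1, and the reset in S13 can be sketched as follows. The class `ScreenOffCountdown` and its method names are hypothetical, and treating an expired countdown as "do not activate" is an assumption of this sketch.

```python
class ScreenOffCountdown:
    """Countdown that turns the screen off unless it is reset by a gaze result."""

    def __init__(self, timeout_s: float, t: float) -> None:
        self.timeout_s = timeout_s    # full countdown length
        self.t = t                    # activation threshold from S1
        self.remaining_s = timeout_s

    def tick(self, elapsed_s: float) -> None:
        self.remaining_s = max(0.0, self.remaining_s - elapsed_s)

    def should_activate_gaze_check(self) -> bool:
        # S1: gaze detection starts only once the countdown drops below t
        # (and, by assumption here, has not yet expired).
        return 0 < self.remaining_s < self.t

    def on_gaze_result(self, looking_at_screen: bool) -> None:
        # S13: keep the screen on by resetting the countdown.
        if looking_at_screen:
            self.remaining_s = self.timeout_s

    @property
    def screen_on(self) -> bool:
        return self.remaining_s > 0
```

Because the check only runs in the last t seconds of the countdown, the cameras stay off for most of the time the screen is lit.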
[0166] It is understandable that S14 to S22 are similar to S4 to S12. The main difference is that in S14 to S22, the RGB camera is used to capture RGB images, and face detection, image cropping, adaptive equalization processing, etc. are performed based on the RGB images. The specific process will not be described in detail.
[0167] In addition, when using RGB images for gaze recognition, if the recognition confidence level is determined to be lower than the confidence level threshold in S21, S23 is executed.
[0168] S23. Determine whether the average pixel intensity of the face image is lower than the pixel threshold.
[0169] For example, after the RGB image gaze recognition module completes gaze recognition, if the recognition confidence level is lower than the confidence level threshold, it calculates the average pixel intensity of the face image. If the average pixel intensity of the face image is lower than the pixel threshold, it indicates that the image quality of the captured RGB image may be poor. In this case, the ToF camera can be enabled to recapture the IR face image and perform gaze recognition, i.e., steps S4 to S12 are executed, which will not be repeated here.
[0170] It is understood that when this process is combined with the above-described method 100, the pixel threshold may correspond to the third threshold in method 100.
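The branch structure of steps S3 to S23 can be summarized in a short sketch. Everything named here is hypothetical: `GazeResult`, `gaze_flow`, and the pipeline callables, each of which stands in for image capture, face detection, cropping, adaptive equalization, and gaze recognition and returns `None` when no face is detected. Returning `"reset"` when the RGB face image is bright enough in S23 is an assumption, since the text does not state that branch.

```python
from typing import Callable, NamedTuple, Optional


class GazeResult(NamedTuple):
    looking: bool       # whether the gaze point is within the display screen
    confidence: float   # recognition confidence level


Pipeline = Callable[[], Optional[GazeResult]]


def gaze_flow(
    ambient_smoothed: float,
    p: float,
    rgb_pipeline: Pipeline,
    ir_pipeline: Pipeline,
    rgb_face_mean: Callable[[], float],  # average pixel intensity of the RGB face image
    confidence_threshold: float,
    pixel_threshold: float,
) -> str:
    def ir_branch() -> str:
        result = ir_pipeline()                                           # S4 to S10
        if result is None or result.confidence < confidence_threshold:   # S7, S11
            return "reset"
        return "keep_screen_on" if result.looking else "reset"           # S12, S13

    if ambient_smoothed < p:                                             # S3: dark scene
        return ir_branch()
    result = rgb_pipeline()                                              # S14 to S20
    if result is None:
        return "reset"
    if result.confidence < confidence_threshold:                         # S21
        if rgb_face_mean() < pixel_threshold:                            # S23: dark face image
            return ir_branch()
        return "reset"
    return "keep_screen_on" if result.looking else "reset"               # S22, S13
```

The ToF camera is reached by only two paths, a dark scene or a low-confidence result on a dark face image, which mirrors the power-saving intent of the flow.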
[0171] In summary, the process illustrated in Figure 15 determines the camera used to capture facial images based on factors such as ambient light intensity and the reliability of gaze recognition, minimizing power consumption while ensuring high recognition accuracy. Furthermore, adaptive equalization processing of the facial image before gaze recognition further improves the accuracy of gaze recognition.
[0172] This application also provides a training system for a neural network model. This training system can be used to train any model involved in the above embodiments of this application, and in particular to train an adaptive equalization network model (such as the first neural network model or the second neural network model in method 100).
[0173] It is important to note that in machine learning, training a model means learning (determining) ideal values for all weights and biases using labeled samples. During training, a machine learning algorithm examines multiple samples and attempts to find a model that minimizes the loss.
[0174] The general process of training a model is as follows. (1) Model (prediction function): takes one or more features as input and returns a prediction (y') as output. For simplicity, consider a model that takes a single feature x_1 as input and returns a prediction according to the following formula, where b is the bias and w_1 is the weight of feature x_1: $y' = b + w_1 x_1$. (2) Calculate the loss: use the loss function to calculate the loss under the current parameters (bias, weight). (3) Calculate the parameter update: examine the value of the loss function and generate new values for parameters such as the bias and weight so as to reduce the loss.
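The three steps above can be illustrated with a tiny gradient-descent loop on the one-feature linear model. The mean squared error loss, the learning rate, the step count, and the function name `train_linear` are assumptions for illustration only; the text does not fix a particular loss or update rule.

```python
from typing import List, Tuple


def train_linear(
    samples: List[Tuple[float, float]],  # (x1, y) pairs
    lr: float = 0.05,
    steps: int = 2000,
) -> Tuple[float, float]:
    """Fit y' = b + w1 * x1 by gradient descent on the mean squared error."""
    b, w1 = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        # (1) predict and (2) measure the loss through its gradient:
        # d/db and d/dw1 of mean((y' - y)^2).
        grad_b = sum(2 * ((b + w1 * x) - y) for x, y in samples) / n
        grad_w = sum(2 * ((b + w1 * x) - y) * x for x, y in samples) / n
        # (3) move the parameters in the direction that reduces the loss.
        b -= lr * grad_b
        w1 -= lr * grad_w
    return b, w1
```

For samples generated from y = 1 + 2x, the loop recovers a bias close to 1 and a weight close to 2.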
[0175] Neural networks can employ backpropagation (BP) to correct the parameters of the initial model during training, thereby reducing the reconstruction error loss. Specifically, the input signal is propagated forward to the output, producing an error loss; the error loss information is then propagated backward to update the parameters of the initial model so that the error loss converges. The backpropagation algorithm is thus a backward pass driven by the error loss, aimed at obtaining the optimal parameters of the model, such as the weight matrices.
[0176] Figure 16 shows a schematic diagram of a system architecture 1600. In Figure 16, the data acquisition device 1660 is used to generate or collect training data. For the neural network training method of this application, the training data may include multiple sample images. These sample images are obtained by augmenting the same original image with features unrelated to the face feature detection service, and the original image includes a face image. It should be noted that different models are trained with different training data. For example, when training the first neural network model described in method 100 (where the first image is a color image), the sample images in the training data are color images; when training the second neural network model described in method 100 (where the second image is an infrared image), the sample images in the training data are infrared images.
[0177] After acquiring the training data, the data acquisition device 1660 stores the training data into the database 1630, and the training device 1620 trains the target model / rule 1601 based on the training data maintained in the database 1630.
[0178] The specific method by which the training device 1620 obtains the target model / rule 1601 based on the training data will be described in detail in Method 200 later; it will not be elaborated here. The target model / rule 1601 can be used for adaptive equalization processing of face images, that is, enhancing features relevant to face feature detection and weakening features irrelevant to face feature detection. In other words, by inputting the image to be processed into the target model / rule 1601, a face image specifically improved for face feature detection can be obtained.
[0179] It should be noted that in practical applications, the training data maintained in database 1630 may not all come from the data acquisition device 1660; it may also be received from other devices. Furthermore, it should be noted that training device 1620 may not necessarily train the target model / rule 1601 entirely based on the training data maintained in database 1630; it may also acquire training data from the cloud or other sources for model training. The above description should not be construed as limiting the embodiments of this application.
[0180] The target model / rule 1601 trained using training device 1620 can be applied to different systems or devices, such as the execution device 1610 shown in Figure 16. The execution device 1610 can be, for example, the electronic device described in method 100, or it can be a server, a cloud device, etc. In Figure 16, the execution device 1610 is configured with an input / output (I / O) interface 1612 for data interaction with external devices. The user can input data to the I / O interface 1612 through the client device 1640. The input data may include a face image to be processed, such as the first face image or the third face image in method 100.
[0181] During the preprocessing of input data by the execution device 1610, or during the calculation module 1611 of the execution device 1610 performing calculations and other related processes, the execution device 1610 can call data, code, etc. in the data storage system 1650 for corresponding processing, or store the data, instructions, etc. obtained from the corresponding processing into the data storage system 1650.
[0182] Finally, the I / O interface 1612 returns the processing result, such as the second face image obtained after processing the first face image, to the client device 1640, thereby providing it to the user. It is understood that the client device 1640 and the execution device 1610 can be the same device; if both are electronic devices, then the I / O interface 1612 may not be necessary.
[0183] It is worth noting that the training device 1620 can generate corresponding target models / rules 1601 based on different training data for different objectives or tasks. The corresponding target models / rules 1601 can be used to achieve the above objectives or complete the above tasks, thereby providing the user with the required results.
[0184] It is worth noting that Figure 16 is merely a schematic diagram of a system architecture provided in an embodiment of this application. The positional relationships between the devices, components, modules, etc., shown in the diagram do not constitute any limitation. For example, in Figure 16, the data storage system 1650 is an external memory relative to the execution device 1610. In other cases, the data storage system 1650 may also be placed within the execution device 1610.
[0185] The following uses method 200 shown in Figure 17 to provide an exemplary description of the neural network model training method provided in the embodiments of this application. Method 200 can be applied to the system architecture 1600 shown in Figure 16, and can be used to train the first neural network model or the second neural network model in method 100. The specific steps of method 200 are explained in detail below.
[0186] S210, Obtain training data.
[0187] For example, the first step is to obtain training data, which is used to train the model parameters. Referring to Figure 18, training data can be obtained by augmenting the original image. For example, multiple sample images can be obtained by augmenting the same original image with features unrelated to the face feature detection service. These multiple sample images constitute the training data, and the original image includes a face image. When the face feature detection service is gaze recognition, the augmented features can include various lighting conditions, irrelevant accessories, makeup, facial defects, high-frequency signal interference, etc.
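As a rough illustration of S210, the sketch below derives several sample images from one original grayscale image by varying brightness and adding pixel noise, two simple stand-ins for "lighting conditions" and "high-frequency signal interference". The function name `augment`, the gain values, and the noise range are hypothetical; real augmentation for this task would likely be much richer.

```python
import random
from typing import List, Sequence

Image = List[List[int]]


def augment(
    original: Sequence[Sequence[int]],
    gains: Sequence[float] = (0.4, 1.0, 1.6),  # darker, unchanged, brighter
    noise: int = 8,                            # maximum absolute pixel noise
    seed: int = 0,
) -> List[Image]:
    """Return one augmented sample per gain, all derived from the same original image."""
    rng = random.Random(seed)
    samples = []
    for gain in gains:
        sample = [
            # Scale brightness, add noise, then clamp to the valid 0..255 range.
            [min(255, max(0, round(p * gain + rng.randint(-noise, noise)))) for p in row]
            for row in original
        ]
        samples.append(sample)
    return samples
```

All samples share the same underlying face content, which is what later allows the feature vectors of samples from one original image to be pulled toward each other.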
[0188] S220. Extract multiple feature vectors from the training data.
[0189] For example, after obtaining training data by augmenting the original image, the training data can be input into a neural network structure to extract feature vectors. Referring to Figure 18, multiple feature vectors can be extracted from the multiple sample images in the training data through the encoder in the neural network structure, and these multiple feature vectors correspond one-to-one with the multiple sample images.
[0190] S230. Generate multiple reconstructed images based on the multiple feature vectors.
[0191] For example, after obtaining multiple feature vectors, multiple reconstructed images are generated based on these feature vectors. Referring to Figure 18, the multiple feature vectors are input into the decoder of the neural network structure to regenerate multiple face images, i.e., multiple reconstructed images. These multiple reconstructed images correspond one-to-one with the multiple feature vectors, and therefore also correspond one-to-one with the multiple sample images mentioned above. It should be noted that during the process of generating the reconstructed images, the reconstructed images should be made as close as possible to the corresponding sample images.
[0192] S240. Perform face feature detection based on the multiple feature vectors to obtain detection results.
[0193] For example, multiple feature vectors can be input into a face feature detection model to perform face feature detection. This face feature detection model is, for example, a gaze recognition model. In this case, the detection result is used to indicate whether the user's gaze point is located within the display screen of the electronic device. Specific implementation processes can be found in the foregoing embodiments and will not be repeated here.
[0194] S250, calculate the first loss function value, the second loss function value, and the third loss function value respectively.
[0195] Before introducing the corresponding solution for this step, the loss function is first introduced. During the training of a deep neural network, because the output of the deep neural network should be as close as possible to the value that is actually desired, the current network's predicted value can be compared with the actual target value, and the weight vector of each layer of the neural network can then be updated based on the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring the parameters of each layer in the deep neural network). For example, if the network's predicted value is too high, the weight vector is adjusted to make it predict lower, and this adjustment continues until the deep neural network can predict the actual target value or a value very close to it. Therefore, it is necessary to predefine "how to compare the difference between the predicted value and the target value", which is the loss function or objective function. These are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so the training of the deep neural network becomes the process of minimizing this loss as much as possible. Generally, the smaller the loss, the higher the training quality of the deep neural network; the larger the loss, the lower the training quality. Similarly, the smaller the loss fluctuation, the more stable the training; the larger the loss fluctuation, the less stable the training.
[0196] The following provides an exemplary description of the three methods for calculating the loss function values in the embodiments of this application.
[0197] For example, multiple sample images are compared with their corresponding reconstructed images to obtain the first loss function value (corresponding to ... in Figure 18). The first loss function value is used to characterize the difference between the sample image (i.e., the input image of the encoder) and the reconstructed image corresponding to the sample image (i.e., the output image of the decoder). The smaller the first loss function value, the more similar the sample image is to the reconstructed image.
[0198] For example, the average vector distance between multiple feature vectors is compared with the target distance to obtain the second loss function value (corresponding to ... in Figure 18). The second loss function value is used to characterize the difference between different feature vectors. The smaller the second loss function value, the smaller the vector distance between sample images derived from the same original image.
[0199] For example, the detection result is compared with the target result to obtain the third loss function value (corresponding to ... in Figure 18). The third loss function value is used to characterize the difference between the actual detection result and the target detection result. The smaller the third loss function value, the closer the detection result is to the target result, and the better the feature vector characterizes the features related to face feature detection.
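As a non-limiting illustration, the three loss function values described above could be computed as in the following minimal sketch. The specific formulas are assumptions, because the embodiments do not fix them: mean squared error for the first loss, Euclidean distance and a target distance of 0 for the second loss, and binary cross-entropy for the third loss. Images are assumed to be flattened lists of pixel values in the range 0 to 1, and all function names are hypothetical.

```python
import math
from itertools import combinations


def reconstruction_loss(samples, reconstructions):
    """First loss (assumed MSE): sample images versus their reconstructed images."""
    total, count = 0.0, 0
    for image, rebuilt in zip(samples, reconstructions):
        for p, q in zip(image, rebuilt):
            total += (p - q) ** 2
            count += 1
    return total / count


def distance_loss(features, target_distance=0.0):
    """Second loss: average pairwise feature distance compared with the target distance."""
    pairs = list(combinations(features, 2))
    average = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return abs(average - target_distance)


def detection_loss(predicted_prob, target_label, eps=1e-7):
    """Third loss (assumed binary cross-entropy): detection result versus target result."""
    p = min(max(predicted_prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target_label * math.log(p) + (1 - target_label) * math.log(1.0 - p))


if __name__ == "__main__":
    samples = [[0.0, 1.0], [1.0, 0.0]]
    rebuilt = [[0.0, 0.5], [1.0, 0.0]]
    features = [[0.0, 0.0], [3.0, 4.0]]
    print(reconstruction_loss(samples, rebuilt))
    print(distance_loss(features))
    print(detection_loss(0.9, 1))
```

With a target distance of 0, a smaller second loss means that the feature vectors of sample images derived from the same original image lie closer together, which matches the description above.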
[0200] S260, adjust the parameters of the neural network model based on the first loss function value, the second loss function value, and the third loss function value.
[0201] As an example, the parameters of the neural network model are adjusted with the aim of reducing the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value.
[0202] The adversarial loss function refers to a loss function used in generative adversarial networks (GANs). A GAN is a machine learning architecture that primarily consists of two types of networks: a generator (G) and a discriminator (D). G is responsible for generating images; that is, after inputting a random code z, it outputs a fake image G(z) automatically generated by the neural network. D receives the image output by G as input and determines whether the image is real or fake, outputting 1 for real and 0 for fake. For example, a GAN can be simply viewed as a game between two networks. D trains a binary classification neural network using data from real and fake images. G can fabricate a "fake image" based on a string of random numbers, and then uses this fabricated image to deceive D. D is responsible for identifying whether it is a real or fake image and assigning a score.
[0203] As an example, the adversarial loss function value of the first loss function value is equal to the difference between 1 and the first loss function value. In other words, the larger the first loss function value, the smaller the adversarial loss function value.
[0204] Therefore, the first loss function value is used to train the parameters of the decoder, while the other three loss function values (the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value) are used to train the encoder. The first loss function value is used to preserve all information of the input image in the feature vector so that reconstruction can be achieved. The third loss function value is used to ensure that features relevant to face feature detection are well represented. The second loss function value and the adversarial loss function value together form adversarial learning against the first loss function value: they prevent all features from being preserved and tend to remove features irrelevant to face feature detection.
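As a non-limiting illustration of how the loss function values could be assigned to the decoder and the encoder, the following minimal sketch derives the adversarial loss function value as the difference between 1 and the first loss function value, as stated above. Summing the encoder terms with equal weights, and assuming the first loss function value is normalized to the range 0 to 1, are illustrative assumptions; the function names are hypothetical.

```python
def adversarial_loss(first_loss):
    """Adversarial loss of the first loss: 1 minus the first loss value."""
    return 1.0 - first_loss


def training_objectives(first_loss, second_loss, third_loss):
    """Split the loss values between decoder and encoder.

    The decoder minimizes the first (reconstruction) loss. The encoder
    minimizes the adversarial loss plus the second and third losses.
    Equal weighting of the encoder terms is an assumption.
    """
    decoder_objective = first_loss
    encoder_objective = adversarial_loss(first_loss) + second_loss + third_loss
    return {"decoder": decoder_objective, "encoder": encoder_objective}


if __name__ == "__main__":
    print(training_objectives(first_loss=0.25, second_loss=0.5, third_loss=0.25))
```

Because the encoder objective decreases as the first loss function value increases, the encoder is discouraged from preserving every detail of the input image, while the third loss function value keeps the features needed for face feature detection.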
[0205] Finally, after the above self-supervised training and adversarial training, the encoder and decoder constitute an adaptive equalization network model. The image reconstructed by this model can retain features related to face feature detection.
[0206] Corresponding to the face feature detection methods given in the above method embodiments, this application also provides a corresponding face feature detection system. This system is deployed in an electronic device to execute the methods provided in the above method embodiments. Therefore, details not described in detail can be found in the above method embodiments, and for the sake of brevity, will not be repeated here.
[0207] Figure 19 illustrates a system corresponding to an electronic device according to an embodiment of the present application. This system can adopt a layered architecture, event-driven architecture, microkernel architecture, microservice architecture, or cloud architecture, etc. This embodiment uses the Android system with a layered architecture as an example to describe the software system of the electronic device.
[0208] See Figure 19. A layered architecture divides software into several layers, each with a clear role and function. Layers communicate with each other through software interfaces. In some embodiments, the Android system is structured from top to bottom as the application layer, the application framework layer, the hardware abstraction layer (HAL), and the driver layer. In addition to this, a hardware layer is also included.
[0209] The application layer can include multiple applications, such as dialer applications, gallery applications, etc. In this embodiment, the application layer also includes a face feature detection software development kit (SDK), such as a gaze recognition SDK. The electronic device's system and third-party applications installed on the electronic device can detect face features, such as identifying the location of the user's gaze point, by calling the face feature detection SDK.
[0210] The framework layer provides application programming interfaces (APIs) and programming frameworks for applications in the application layer. The framework layer includes some predefined functions. In this embodiment, the framework layer may include a camera service interface, a face feature detection service interface, and an image pixel analysis service interface. The camera service interface provides APIs and programming frameworks for using a camera. The face feature detection service interface provides APIs and programming frameworks for using face feature detection models (such as gaze recognition models) and adaptive equalization models (such as the first neural network model in method 100). The image pixel analysis service interface provides APIs and programming frameworks for algorithms used to calculate the average pixel intensity of face images. It is understood that the face feature detection service interface and the image pixel analysis service interface can be two separate interfaces or combined into one; this application does not limit this. For ease of description, this embodiment uses the example of the face feature detection service interface and the image pixel analysis service interface being two different interfaces for illustration.
[0211] The hardware abstraction layer (HAL) is an interface layer located between the framework layer and the driver layer, providing a virtual hardware platform for the operating system. In this embodiment, the HAL may include a camera hardware abstraction layer and a face feature detection process. The camera hardware abstraction layer can provide virtual hardware for one or more camera devices. This embodiment uses two camera devices as an example, such as camera device 1 (e.g., an RGB camera) and camera device 2 (e.g., a ToF camera). The calculation process for detecting face features using a face feature detection model is executed within this face feature detection process; for example, the calculation process for identifying the location of the user's gaze point using a gaze recognition model is executed within this face feature detection process.
[0212] The driver layer is the layer between hardware and software. It includes drivers for various hardware components. The driver layer can include camera device drivers. Camera device drivers are used to drive the camera's sensor to acquire images and to drive the image signal processor to preprocess the images.
[0213] The hardware layer includes sensors, and optionally also includes a data buffer. The sensors include a first camera (such as an RGB camera) and a second camera (such as a ToF camera). The functions of the first and second cameras are described in the above method embodiments and will not be repeated here. Optionally, the sensors may also include a light sensor for detecting ambient light intensity.
[0214] Optionally, the data captured by the camera can be stored in a data buffer. Upper-level processes or applications can then retrieve the image data from the data buffer.
[0215] The following is a brief description of the interaction process of the face feature detection method in the embodiments of this application, based on the above system structure. For details not described in detail, please refer to the above method embodiments.
[0216] When the electronic device determines that facial features are to be detected (for example, in the scenario shown in Figure 2, when the screen-off countdown is less than a preset value), the facial feature detection service is called through the facial feature detection SDK.
[0217] On one hand, the facial feature detection service can call the camera service in the framework layer to acquire image frames containing the user's facial image. Specifically, the camera service can send a command to start the first camera by calling camera device 1 (the first camera) in the camera hardware abstraction layer. The camera hardware abstraction layer sends this command to the camera device driver in the driver layer, and the camera device driver starts the first camera according to the command. After the first camera is turned on, it collects light signals, which are then processed by the image signal processor to generate a color image in the form of an electrical signal.
[0218] On the other hand, the face feature detection service can create a face feature detection process and initialize a face feature detection model (such as a face image acquisition model, gaze recognition model, and adaptive equalization model).
[0219] The color image generated by the image signal processor can be stored in a data buffer. After the face feature detection process is created and initialized, the color image data stored in the data buffer can be input into the face feature detection process. In the face feature detection process, the input image can first be processed using an adaptive equalization model, and then the processed image can be detected using a face feature detection model (such as a gaze recognition model) to perform face feature detection (e.g., detecting whether the user's gaze point is within the display screen of the electronic device). The detection results can then be returned to the application-layer face feature detection SDK via camera service and face feature detection service.
[0220] In addition, if the reliability of the face feature detection result is lower than a preset value, the image pixel analysis service can obtain the color image from the data buffer and determine whether the average pixel intensity of the color image is lower than a preset value. If so, the image pixel analysis service calls the camera service in the framework layer through the face feature detection service to re-acquire an image frame containing the user's face image. Specifically, the camera service can send a command to start the second camera by calling camera device 2 (such as a ToF camera) in the camera hardware abstraction layer. The camera hardware abstraction layer sends this command to the camera device driver in the driver layer, and the camera device driver starts the second camera according to the command. After the second camera is turned on, it collects light signals, which are then processed by the image signal processor to generate an infrared image in the form of an electrical signal.
[0221] The infrared image generated by the image signal processor can be stored in a data buffer. After the face feature detection process is created and initialized, the infrared image data stored in the data buffer can be input into the face feature detection process. In the face feature detection process, the infrared image can be processed sequentially using an adaptive equalization model and a face feature detection model (such as a gaze recognition model) to determine whether the user's gaze point is within the display screen. The recognition result can then be returned to the application-layer face feature detection SDK via the camera service and the face feature detection service.
[0222] It should be noted that the embodiments of this application are only illustrated using the Android system. In other operating systems (such as the iOS system), the solution of this application can also be implemented as long as the functions implemented by each functional module are similar to those in the embodiments of this application.
[0223] Corresponding to the methods given in the above embodiments, this application also provides a hardware architecture for a corresponding electronic device.
[0224] For example, Figure 20 shows a detailed architectural diagram of an electronic device 1000 to which this application applies.
[0225] As shown in Figure 20, the electronic device 1000 may include a processor 1010, an internal memory 1021, one or more cameras 1093 (the multiple cameras can be represented by 1 to N, where N is a positive integer greater than 1), one or more displays 1094 (the multiple displays can be represented by 1 to N), and a sensor module 1080, wherein the sensor module includes an ambient light sensor 1080L.
[0226] Optionally, the electronic device 1000 may also include an external memory interface 1020, a universal serial bus (USB) interface 1030, a charging management module 1040, a power management module 1041, a battery 1042, an audio module 1070, a speaker 1070A, a receiver 1070B, a microphone 1070C, a headphone jack 1070D, buttons 1090, a motor 1091, an indicator 1092, and a subscriber identification module (SIM) card interface 1095, etc. In addition to the ambient light sensor, the sensor module 1080 may also include a pressure sensor 1080A, a gyroscope sensor 1080B, a barometric pressure sensor 1080C, a magnetic sensor 1080D, an accelerometer sensor 1080E, a distance sensor 1080F, a proximity light sensor 1080G, a fingerprint sensor 1080H, a temperature sensor 1080J, a touch sensor 1080K, a bone conduction sensor 1080M, etc.
[0227] It is understood that the structures illustrated in the embodiments of this application do not constitute a specific limitation on the electronic device 1000. In other embodiments of this application, the electronic device 1000 may include more or fewer components than illustrated, or combine some components, or split some components, or have different component arrangements. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
[0228] The processor 1010 may include one or more processing units, such as an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and / or a neural network processing unit (NPU).
[0229] The processor 1010 may also include a memory for storing instructions and data. In some embodiments, the memory in the processor 1010 is a cache memory. This memory can store instructions or data that the processor 1010 has just used or that are used repeatedly. If the processor 1010 needs to use the instruction or data again, it can retrieve it directly from the memory. This avoids repeated accesses, reduces the waiting time of the processor 1010, and thus improves the efficiency of the system.
[0230] Internal memory 1021 can be used to store computer executable program code, which includes instructions. Processor 1010 executes various functional applications and data processing of electronic device 1000 by running the instructions stored in internal memory 1021. In one example, internal memory 1021 stores a computer program, which, when called by the processor, enables electronic device 1000 to perform the steps in any of the method embodiments described above.
[0231] Electronic device 1000 implements display functions through GPU, display screen 1094, and application processor.
[0232] Camera 1093 is used to capture still images or videos. An object generates an optical image through the lens, and the optical image is projected onto a photosensitive element. The photosensitive element can be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the light signal into an electrical signal, which is then transmitted to an ISP for conversion into a digital image signal. The ISP outputs the digital image signal to a DSP for processing. The DSP converts the digital image signal into image signals in standard formats such as RGB and YUV. In this embodiment, electronic device 1000 may include two cameras, one of which can be used to capture color images, such as an RGB camera, and the other of which can be used to capture infrared images, such as a ToF camera. It is understood that electronic device 1000 may also include more than two cameras.
[0233] The display screen 1094 is used to display images, videos, etc. The display screen 1094 includes a display panel. The display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini-LED, a micro-LED, a micro-OLED, a quantum dot light-emitting diode (QLED), etc. In some embodiments, the electronic device 1000 may include one or N display screens 1094, where N is a positive integer greater than 1.
[0234] The ambient light sensor 1080L is used to sense ambient light intensity. The electronic device 1000 can adaptively adjust the brightness of its display screen 1094 based on the sensed ambient light intensity. The ambient light sensor 1080L can also be used to automatically adjust the white balance when taking photos. The ambient light sensor 1080L can also work in conjunction with the proximity sensor 1080G to detect whether the electronic device 1000 is in a pocket, preventing accidental touches.
[0235] In this embodiment, the electronic device being in a screen-on state means that the display panel of the display screen 1094 is lit up, and the electronic device being in a screen-off state means that the display panel of the display screen 1094 is not lit up. When the electronic device is in a screen-on state, the display screen 1094 can display any interface such as the desktop, the negative one screen, the application interface of a certain application, or the lock screen.
[0236] Alternatively, in one implementation, the electronic device 1000 can perform a shooting function through an ISP, a camera 1093, a video codec, a GPU, and an application processor to acquire an image including a face, and detect facial features based on the image.
[0237] Taking the application scenario shown in Figure 2 as an example, in this embodiment, when the screen-off countdown is less than or equal to 4 seconds, the processor 1010 calls the ambient light sensor 1080L to obtain the current ambient light intensity. If the ambient light intensity is greater than or equal to a second threshold, the processor calls the RGB camera in the camera 1093 to capture a color image containing a face, and detects facial features based on the color image, such as detecting whether the user's gaze point is within the display screen 1094. If the confidence level of the detection result is low, the processor 1010 obtains the average pixel intensity of the image. If the average pixel intensity is less than or equal to a preset value, the processor calls the ToF camera in the camera 1093 to capture an infrared image containing a face, and detects facial features based on the infrared image.
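As a non-limiting illustration, the decision flow of this scenario can be sketched as follows. The camera and detection functions are passed in as placeholders, the images are modeled as flat sequences of pixel values, and every name and threshold parameter is hypothetical. The behavior when the ambient light intensity is below the threshold is not described in this passage, so the sketch simply returns None in that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Image = Sequence[float]


@dataclass
class Detection:
    gaze_on_screen: bool
    confidence: float


def average_pixel_intensity(image: Image) -> float:
    """Mean pixel value of a flattened image."""
    return sum(image) / len(image)


def detect_with_fallback(
    ambient_light: float,
    capture_rgb: Callable[[], Image],
    capture_tof: Callable[[], Image],
    detect: Callable[[Image], Detection],
    *,
    light_threshold: float,
    confidence_threshold: float,
    intensity_threshold: float,
) -> Optional[Detection]:
    """Detect on a color image first and retry on an infrared image if needed."""
    if ambient_light < light_threshold:
        return None  # branch not covered by this passage (assumption)
    color_image = capture_rgb()
    result = detect(color_image)
    if result.confidence >= confidence_threshold:
        return result
    # Low confidence: retry with the ToF camera only if the color image is dark.
    if average_pixel_intensity(color_image) <= intensity_threshold:
        return detect(capture_tof())
    return result


if __name__ == "__main__":
    dark, infrared = [20.0, 30.0, 40.0], [120.0, 130.0, 140.0]
    outcome = detect_with_fallback(
        ambient_light=15.0,
        capture_rgb=lambda: dark,
        capture_tof=lambda: infrared,
        detect=lambda img: Detection(True, 0.9 if img is infrared else 0.3),
        light_threshold=10.0,
        confidence_threshold=0.6,
        intensity_threshold=50.0,
    )
    print(outcome)
```

Passing the capture and detection steps as callables keeps the sketch independent of any particular camera or model interface.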
[0238] It is understood that if the units integrated in the above-described device embodiments are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the above-described method embodiments. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or some intermediate form. The computer-readable medium can include at least: any entity or device capable of carrying computer program code to a photographic device / electronic device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0239] This application provides a computer program product that, when run on an electronic device 1000, enables a mobile terminal to execute the steps described in the above-described method embodiments.
[0240] It should be noted that the descriptions of each embodiment in the above embodiments have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0241] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0242] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
[0243] The terms "first," "second," "third," "fourth," and other various terminology (if present) used in the specification, claims, and accompanying drawings of this application are intended to distinguish similar objects and are not necessarily used to describe a particular order or quantity. It should be understood that such data can be interchanged where appropriate so that the embodiments described herein can be implemented in orders other than those illustrated or described herein.
[0244] The terms “comprising” and “having”, and any variations thereof, mean “including, but not limited to”, unless otherwise specifically emphasized, for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such process, method, product, or device.
[0245] In the various embodiments of this application, unless otherwise specified or logically conflicting, the terminology and / or descriptions between different embodiments are consistent and can be referenced mutually. Technical features in different embodiments can be combined to form new embodiments based on their inherent logical relationships. The specific operational methods in the method embodiments of this application can also be applied to the device embodiments or system embodiments.
[0246] In this application, "pre-configuration" may include pre-defined terms, such as protocol definitions. These "pre-defined terms" can be implemented by pre-storing corresponding codes, tables, or other means of indicating relevant information in the device (e.g., including various network elements), and this application does not limit the specific implementation method.
[0247] In the schematic diagrams in the accompanying drawings of this application, dashed lines, arrows, or boxes indicate optional steps or modules, and dotted lines or boxes indicate annotations.
Claims
1. A facial feature detection method, applied to an electronic device, the electronic device having a first camera, characterized in that, The method includes: A first image is obtained by taking a picture using the first camera, and the first image includes a first face image; The first face image is processed based on the first neural network model, and a second face image is output. Perform facial feature detection based on the second facial image; The first feature in the second face image is obtained by enhancing the second feature in the first face image, and the third feature in the second face image is obtained by weakening the fourth feature in the first face image. The first feature corresponds to the second feature, and the third feature corresponds to the fourth feature. The first and second features are related to the face feature detection service, and the first and second features include eye features and facial contour features. The third and fourth features are not related to the face feature detection service, and the third and fourth features do not include eye features and facial contour features. The third and fourth features include one or more of the following: light and shadow features affected by illumination, facial features other than the eyes, biological features other than the facial features, jewelry features, and makeup features.
2. The method according to claim 1, characterized in that, The step of performing facial feature detection based on the second facial image includes: The system detects whether the user's gaze point is within the display screen of the electronic device based on the second facial image. The method further includes: Based on the detection results of the facial feature detection service, the on / off state of the display screen of the electronic device is controlled.
3. The method according to claim 2, characterized in that, The electronic device is currently in a screen-on state, and the step of capturing a first image using the first camera includes: When the countdown timer is less than or equal to the first threshold, the first image is captured using the first camera. The step of controlling the on / off state of the electronic device's display screen based on the detection results of the facial feature detection service includes: If the detection result indicates that the user's gaze point is within the display screen of the electronic device, the display screen of the electronic device is kept on and the screen-off countdown is reset.
4. The method according to claim 2 or 3, characterized in that, The step of detecting whether the user's gaze point is located within the display screen of the electronic device based on the second facial image includes: The left and right eye images are determined based on the second face image; The left eye image and the right eye image are processed based on the eye opening and closing recognition model, and the eye opening and closing result is output. The eye opening and closing result is used to indicate whether the user is in an open eye state. When the user's eyes are open, the system detects whether the user's gaze point is within the display screen of the electronic device based on the second face image, the left eye image, and the right eye image.
5. The method according to claim 4, characterized in that, The step of detecting whether the user's gaze point is located within the display screen of the electronic device based on the second face image, the left eye image, and the right eye image includes: The left-eye features and right-eye features are extracted from the left-eye image and the right-eye image respectively using a human eye feature extraction model; Facial features are extracted from the second face image using a facial feature extraction model; The left eye feature, the right eye feature, and the facial feature are fused to obtain fused data; The fused data is processed using a gaze recognition model, and a gaze result is output, which indicates whether the user's gaze point is located within the display screen of the electronic device.
6. The method according to any one of claims 1 to 3, characterized in that, The electronic device also includes a second camera; If the reliability of the detection result of the face feature detection service is lower than the second threshold or the face feature detection service fails to execute, the average pixel value of the first face image is obtained. If the average pixel value of the first face image is lower than the third threshold, the second image is obtained by taking a picture using the second camera, and the second image includes the third face image; The third face image is processed using a second neural network model to output a fourth face image; The facial feature detection service is performed based on the fourth facial image; Wherein, the feature intensity of the fifth feature in the fourth face image is greater than or equal to the feature intensity of the sixth feature in the third face image, the feature intensity of the seventh feature in the fourth face image is less than the feature intensity of the eighth feature in the third face image, the fifth feature corresponds to the sixth feature, the seventh feature corresponds to the eighth feature, the fifth feature and the sixth feature are related to the face feature detection service, and the seventh feature and the eighth feature are not related to the face feature detection service, the first image is a color image, and the second image is an infrared image.
7. The method according to claim 6, characterized in that, The step of capturing the first image using the first camera includes: Obtain the current ambient light intensity; When the ambient light intensity is greater than or equal to the fourth threshold, the first image is obtained by taking a picture using the first camera.
8. The method according to claim 6 or 7, characterized in that, The first camera is used to acquire two-dimensional images, and the second camera is used to acquire images containing depth information.
9. The method according to claim 8, characterized in that the first camera is a red, green, blue (RGB) camera, and the second camera is a time-of-flight (ToF) camera.
10. The method according to any one of claims 1 to 3, characterized in that, before processing the first face image based on the first neural network model, the method further includes: determining that the first image includes the first face image.
11. The method according to any one of claims 1 to 3, characterized in that the first neural network model is trained based on training data that includes multiple sample images; the multiple sample images are obtained by augmenting the same original image with features unrelated to the face feature detection service; and the original image includes a fifth face image.
12. The method according to claim 11, characterized in that the first neural network model is trained based on the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value; the first loss function value is obtained by comparing the multiple sample images with multiple reconstructed images; the multiple reconstructed images are generated based on multiple feature vectors, which are extracted from the training data; the multiple feature vectors, the multiple sample images, and the multiple reconstructed images correspond one-to-one; the second loss function value is obtained by comparing the average vector distance between the multiple feature vectors with the target distance; and the third loss function value is obtained by comparing the detection result, obtained by performing the face feature detection service based on the multiple feature vectors, with the target result.
13. A method for training a neural network model, characterized in that the method is used to train the first neural network model described in claim 1 and includes: acquiring training data, which includes multiple sample images, wherein the multiple sample images are obtained by augmenting the same original image with features unrelated to the face feature detection service, and the original image includes a face image; extracting multiple feature vectors from the training data; generating multiple reconstructed images based on the multiple feature vectors, wherein the multiple feature vectors, the multiple sample images, and the multiple reconstructed images correspond one-to-one; performing the face feature detection service based on the multiple feature vectors to obtain a detection result; comparing the multiple sample images with the corresponding reconstructed images to obtain the first loss function value; comparing the vector distance between the multiple feature vectors with the target distance to obtain the second loss function value; comparing the detection result with the target result to obtain the third loss function value; and adjusting the parameters of the neural network model based on the first loss function value, the second loss function value, and the third loss function value to obtain the first neural network model.
14. The method according to claim 13, characterized in that the step of adjusting the parameters of the neural network model based on the first loss function value, the second loss function value, and the third loss function value includes: adjusting the parameters of the neural network model to reduce the adversarial loss function value of the first loss function value, the second loss function value, and the third loss function value.
15. The method according to claim 14, characterized in that the adversarial loss function value of the first loss function value is equal to the difference between 1 and the first loss function value.
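The three loss terms of claims 13 to 15 can be sketched as follows. Only the relation "adversarial value = 1 minus the first loss function value" comes from the claims. The rest is assumed for illustration: images are flat lists of pixel values in [0, 1], the first loss is a mean absolute error, the vector distance is Euclidean and averaged over all pairs, the second loss is an absolute difference, the third loss is a squared error on a scalar detection result, and the three terms are summed without weights. All function names are hypothetical.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def reconstruction_loss(samples: Sequence[Vector],
                        reconstructions: Sequence[Vector]) -> float:
    """First loss (assumed form): mean absolute pixel error, in [0, 1]."""
    total, count = 0.0, 0
    for sample, recon in zip(samples, reconstructions):
        for a, b in zip(sample, recon):
            total += abs(a - b)
            count += 1
    if count == 0:
        raise ValueError("no pixels to compare")
    return total / count


def adversarial_value(first_loss: float) -> float:
    """Claim 15: the adversarial value is 1 minus the first loss value."""
    return 1.0 - first_loss


def distance_loss(vectors: Sequence[Vector], target_distance: float) -> float:
    """Second loss (assumed form): |mean pairwise distance - target distance|."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two feature vectors")
    mean_distance = sum(math.dist(u, v) for u, v in pairs) / len(pairs)
    return abs(mean_distance - target_distance)


def detection_loss(detected: float, target: float) -> float:
    """Third loss (assumed form): squared error against the target result."""
    return (detected - target) ** 2


def training_objective(samples: Sequence[Vector],
                       reconstructions: Sequence[Vector],
                       vectors: Sequence[Vector],
                       detected: float,
                       target: float,
                       target_distance: float) -> float:
    """Quantity to reduce during training: an unweighted sum (an assumption)."""
    first = reconstruction_loss(samples, reconstructions)
    return (adversarial_value(first)
            + distance_loss(vectors, target_distance)
            + detection_loss(detected, target))
```

A training loop would compute this objective for each batch of sample images derived from one original image and adjust the model parameters in the direction that reduces it.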
16. An electronic device, characterized in that the electronic device includes a processor and a memory; the memory is configured to store programs that support the electronic device in performing the method according to any one of claims 1 to 12 and to store data involved in implementing that method, or the memory is configured to store programs that support the electronic device in performing the method according to any one of claims 13 to 15 and to store data involved in implementing that method; and the processor is configured to execute the programs stored in the memory.
17. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when executed on a computer, cause the computer to perform the method according to any one of claims 1 to 12, or the method according to any one of claims 13 to 15.