Facial expression recognition

A facial expression recognition system determines facial landmark features using a 3D morphable model (3DMM) and inputs them into a neural network. This addresses the insufficient accuracy and efficiency of existing technologies and achieves more efficient facial expression recognition.

CN116964643B · Active · Publication Date: 2026-06-26 · QUALCOMM INC

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: QUALCOMM INC
Filing Date: 2022-01-21
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing facial expression recognition technologies are insufficiently accurate and efficient, and they do not fully exploit the potential of facial landmark detection.

Method used

Facial landmark features are determined using 3D morphable model (3DMM) techniques and then input into a trained neural network to improve facial expression recognition.

Benefits of technology

It improves the accuracy and efficiency of facial expression recognition, reduces processing time and power consumption, and enhances the expression classification ability of neural networks.


Abstract

Systems and techniques for facial expression recognition are provided. In some examples, a system receives an image frame corresponding to a face of a person. The system also determines landmark feature information associated with landmark features of the face based on a three-dimensional model of the face. The system then inputs the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition. The system further determines a facial expression associated with the face using the neural network.

Description

Technical Field

[0001] This disclosure relates to facial expression recognition. More specifically, this disclosure relates to improving facial expression recognition systems by implementing facial landmark detection techniques in a neural network trained for facial expression recognition.

Background Technology

[0002] Many devices and systems allow a scene to be captured by generating images (or frames) and / or video data (including multiple frames). For example, a camera or a computing device that includes a camera (e.g., a mobile device, such as a mobile phone or smartphone that includes one or more cameras) can capture a sequence of frames of a scene. Image and / or video data can be captured and processed by such devices and systems (e.g., mobile devices, IP cameras, etc.) and can be output for consumption (e.g., displayed on this device and / or other devices). In some cases, image and / or video data can be captured by such devices and systems and output for processing and / or consumption by other devices.

[0003] Images can be processed (e.g., using face or object detection, recognition, segmentation, etc.) to determine any objects or people present in the image, which is useful for many applications. For example, a model can be determined to recognize the facial expressions of a person captured in an image, and this model can be used to facilitate the efficient operation of various applications and systems. Examples of such applications and systems include augmented reality (AR), artificial intelligence (AI), Internet of Things (IoT) devices, security systems (e.g., vehicle security systems), emotion recognition systems, and many other applications and systems.

Summary of the Invention

[0004] This document describes systems and techniques that can be implemented to improve facial expression recognition. According to at least one example, an apparatus for improving facial expression recognition is provided. The example apparatus may include a memory (or multiple memories) and a processor or multiple processors (e.g., implemented in circuitry) coupled to the memory (or multiple memories). The processor(s) are configured to: receive an image frame corresponding to a face of a person; determine landmark feature information associated with landmark features of the face based on a three-dimensional model of the face; input the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition; and use the neural network to determine a facial expression associated with the face.

[0005] Another example apparatus may include: means for receiving an image frame corresponding to a face of a person; means for determining landmark feature information associated with landmark features of the face based on a three-dimensional model of the face; means for inputting the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition; and means for using the neural network to determine a facial expression associated with the face.

[0006] In another example, a method for improving facial expression recognition is provided. The example method may include receiving an image frame corresponding to a face of a person. The method may also include determining landmark feature information associated with landmark features of the face based on a three-dimensional model of the face. The method may include inputting the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition. The method may also include using the neural network to determine a facial expression associated with the face.

[0007] In another example, a non-transitory computer-readable medium is provided for improving facial expression recognition. The example non-transitory computer-readable medium may store instructions that, when executed by one or more processors, cause the one or more processors to: receive an image frame corresponding to a face of a person; determine landmark feature information associated with landmark features of the face based on a three-dimensional model of the face; input the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition; and use the neural network to determine a facial expression associated with the face.

[0008] In some aspects, the landmark feature information may include one or more blendshape coefficients determined based on the three-dimensional model. In some examples, the methods, apparatuses, and computer-readable media described above may include: generating the three-dimensional model of the face; and determining the one or more blendshape coefficients based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame. In one example, the methods, apparatuses, and computer-readable media described above may include inputting the one or more blendshape coefficients into a fully connected layer of the neural network. Furthermore, in some cases, the fully connected layer may concatenate the one or more blendshape coefficients with data output from a convolutional layer of the neural network.
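The concatenation described in paragraph [0008] can be sketched as follows. This is a minimal illustration in plain Python, not the claimed implementation: the function names (`flatten`, `fully_connected`, `fc_with_blendshapes`), the layer sizes, and the weight values are all assumptions chosen for readability. It shows the blendshape coefficients (the shape coefficients described above) being appended to the flattened convolutional output before a single dense layer produces expression scores.

```python
from typing import List, Sequence

def flatten(feature_maps: List[List[List[float]]]) -> List[float]:
    """Flatten C x H x W convolutional activations into one vector."""
    return [v for channel in feature_maps for row in channel for v in row]

def fully_connected(inputs: Sequence[float],
                    weights: List[List[float]],
                    biases: List[float]) -> List[float]:
    """One dense layer: out[j] = sum_i weights[j][i] * inputs[i] + biases[j]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def fc_with_blendshapes(conv_out: List[List[List[float]]],
                        blendshape_coeffs: Sequence[float],
                        weights: List[List[float]],
                        biases: List[float]) -> List[float]:
    """Concatenate conv features with the coefficients, then apply the layer."""
    fused = flatten(conv_out) + list(blendshape_coeffs)
    return fully_connected(fused, weights, biases)

# Toy sizes (assumed): one 2x2 feature map plus two coefficients gives
# six inputs, mapped to three expression scores.
conv_out = [[[1.0, 2.0], [3.0, 4.0]]]
coeffs = [0.5, -0.5]
weights = [
    [1, 0, 0, 0, 0, 0],   # reads only the first conv activation
    [0, 0, 0, 0, 1, 0],   # reads only the first blendshape coefficient
    [1, 1, 1, 1, 2, 2],   # mixes both sources
]
biases = [0.0, 0.0, 1.0]
logits = fc_with_blendshapes(conv_out, coeffs, weights, biases)
```

In a trained network the weights would be learned; the fixed values here only make it easy to see which inputs reach which output.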

[0009] In some aspects, the methods, apparatuses, and computer-readable media described above may include generating a landmark image frame indicating one or more landmark features of the face using the one or more blendshape coefficients. For example, the methods, apparatuses, and computer-readable media described above may include: determining multiple landmark features of the face based on the one or more blendshape coefficients; determining a subset of the multiple landmark features corresponding to key landmark features; and generating the landmark image frame based on forming one or more connections between the subset of the multiple landmark features corresponding to the key landmark features. In one example, the methods, apparatuses, and computer-readable media described above may include determining the subset of the multiple landmark features corresponding to the key landmark features based on determining landmark features relevant to a person's facial expression. Further, the landmark image frame may include a binary image frame that indicates pixels corresponding to the key landmark features using a predetermined pixel value.
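One way to picture the binary landmark image frame of paragraph [0009] is the sketch below. It assumes landmarks are already available as integer pixel coordinates, that the key subset and the connections between landmarks are supplied by the caller, and that the predetermined pixel value is 1. The names `draw_line` and `landmark_image_frame` and the example landmark labels are hypothetical. Lines are rasterized with the standard Bresenham algorithm.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates

def draw_line(img: List[List[int]], p0: Point, p1: Point, value: int = 1) -> None:
    """Rasterize a line from p0 to p1 into img using Bresenham's algorithm."""
    x0, y0 = p0
    x1, y1 = p1
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        img[y0][x0] = value
        if (x0, y0) == (x1, y1):
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy

def landmark_image_frame(landmarks: Dict[str, Point],
                         key_names: Sequence[str],
                         connections: Sequence[Tuple[str, str]],
                         width: int, height: int,
                         value: int = 1) -> List[List[int]]:
    """Build a binary frame marking key landmarks and the lines joining them."""
    img = [[0] * width for _ in range(height)]
    keys = set(key_names)
    for name in keys:
        x, y = landmarks[name]
        img[y][x] = value
    for a, b in connections:
        if a in keys and b in keys:   # ignore connections to non-key landmarks
            draw_line(img, landmarks[a], landmarks[b], value)
    return img

# Hypothetical landmarks: only the mouth corners are treated as key features.
landmarks = {"mouth_left": (1, 3), "mouth_right": (4, 3),
             "nose_tip": (2, 1), "ear": (0, 0)}
frame = landmark_image_frame(
    landmarks,
    key_names=["mouth_left", "mouth_right"],
    connections=[("mouth_left", "mouth_right"), ("nose_tip", "ear")],
    width=6, height=5)
```

Here the nose-to-ear connection is dropped because neither endpoint is in the key subset, so only the mouth line appears in the frame.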

[0010] In some aspects, the methods, apparatuses, and computer-readable media described above may include inputting the landmark image frame into one or more layers of the neural network. For example, the methods, apparatuses, and computer-readable media described above may include: inputting a first version of the landmark image frame into a first layer of the neural network, the first version of the landmark image frame having a first resolution; and inputting a second version of the landmark image frame into a second layer of the neural network occurring after the first layer, the second version of the landmark image frame having a second resolution lower than the first resolution. In one example, the first and second layers of the neural network may be convolutional layers. Furthermore, the neural network may include a pooling layer between the first and second layers. The pooling layer may be configured to: downsample activation data output from the first layer to the second resolution of the second version of the landmark image frame; receive the second version of the landmark image frame; and pass the downsampled activation data output from the first layer and the second version of the landmark image frame to the second layer.
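The pooling stage of paragraph [0010] can be illustrated with the following sketch. Two points are assumptions rather than statements from the text: the pooling is taken to be 2x2 max pooling, and the lower-resolution version of the landmark image frame is produced by applying the same pooling to the higher-resolution version. The names `max_pool_2x2` and `pooling_stage` are hypothetical.

```python
from typing import List

Channel = List[List[float]]

def max_pool_2x2(channel: Channel) -> Channel:
    """Halve the resolution by keeping the maximum of each 2x2 block."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[y][x], channel[y][x + 1],
                 channel[y + 1][x], channel[y + 1][x + 1])
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]

def pooling_stage(first_layer_out: List[Channel],
                  landmark_low: Channel) -> List[Channel]:
    """Downsample the activations and append the low-resolution landmark frame.

    The returned channel stack is what the second layer would receive.
    """
    pooled = [max_pool_2x2(c) for c in first_layer_out]
    if pooled:
        pooled_shape = (len(pooled[0]), len(pooled[0][0]))
        landmark_shape = (len(landmark_low), len(landmark_low[0]))
        if pooled_shape != landmark_shape:
            raise ValueError("landmark frame resolution does not match activations")
    return pooled + [landmark_low]

# One 4x4 activation channel from the first layer (toy values).
activations = [[[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11],
                [12, 13, 14, 15]]]
# First (higher-resolution) version of the binary landmark frame.
landmark_hi = [[0, 0, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 0],
               [0, 0, 0, 0]]
# Second (lower-resolution) version, derived here by max pooling (assumption).
landmark_lo = max_pool_2x2(landmark_hi)
second_layer_input = pooling_stage(activations, landmark_lo)
```

Max pooling a binary frame keeps a landmark pixel whenever any pixel in the block was set, which is why it is a convenient choice for this illustration.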

[0011] In some aspects, the methods, apparatuses, and computer-readable media described above may include training the neural network using a training dataset. The training dataset may include: multiple image frames corresponding to the faces of multiple people, the multiple image frames being labeled with facial expressions associated with the faces of the multiple people; and landmark feature information associated with the multiple image frames.

[0012] In some aspects, the three-dimensional model may include a 3D morphable model (3DMM).

[0013] In some aspects, the methods, apparatus, and computer-readable media described above may include using a camera system to capture image frames corresponding to a human face.

[0014] In some aspects, each of the apparatuses described above is or includes a camera, a mobile device (e.g., a mobile phone or so-called "smartphone" or other mobile device), a smart wearable device, an extended reality device (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a personal computer, a laptop computer, a server computer, a vehicle (e.g., an autonomous vehicle), or other device. In some aspects, the apparatus includes one or more cameras for capturing one or more videos and / or images. In some aspects, the apparatus also includes a display for displaying one or more videos and / or images. In some aspects, the apparatuses described above may include one or more sensors.

[0015] This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used in isolation to determine the scope of the claimed subject matter. The subject matter should be understood by reference to the appropriate portions of the entire specification of this patent, any or all of the drawings, and each claim.

[0016] The foregoing and other features and embodiments will become more apparent when the following description, claims and drawings are taken into account.

Brief Description of the Drawings

[0017] The exemplary embodiments of this application will now be described in detail with reference to the following figures:

[0018] Figure 1 is a block diagram illustrating an example architecture of an image capture and processing system, in accordance with some examples;

[0019] Figure 2 is a block diagram illustrating an example architecture of an expression recognition system, in accordance with some examples;

[0020] Figures 3A and 3B are illustrations of example head models generated from a 3D morphable model (3DMM), in accordance with some examples;

[0021] Figure 3C is an illustration of example landmark features associated with example image frames, in accordance with some examples;

[0022] Figures 3D and 3E are illustrations of example landmark image frames, in accordance with some examples;

[0023] Figure 3F is a block diagram of an example architecture of a landmark feature system, in accordance with some examples;

[0024] Figures 4A, 4B, and 4C are diagrams of example architectures of neural networks trained for facial expression recognition using landmark feature information, in accordance with some examples;

[0025] Figures 5A and 5B are illustrations of example accuracies of neural networks trained for facial expression recognition using landmark feature information, in accordance with some examples;

[0026] Figure 6 is a flowchart illustrating an example of a process for improved facial expression recognition, in accordance with some examples;

[0027] Figure 7 is a diagram illustrating an example of a visual model of a neural network, in accordance with some examples;

[0028] Figure 8A is a diagram illustrating an example of a neural network model including feedforward weights and recurrent weights, in accordance with some examples;

[0029] Figure 8B is a diagram illustrating an example of a neural network model with different connection types, in accordance with some examples;

[0030] Figure 9 is a diagram illustrating a detailed example of a convolutional neural network model, in accordance with some examples;

[0031] Figures 10A, 10B, and 10C are diagrams illustrating a simple example of convolution, in accordance with some examples;

[0032] Figure 11 is a diagram illustrating an example of max pooling applied to a rectified feature map, in accordance with some examples; and

[0033] Figure 12 is a diagram illustrating an example of a system for implementing certain aspects described herein.

Detailed Description

[0034] Certain aspects and embodiments of this disclosure are provided below. Some of these aspects and embodiments may be applied independently, and some of these aspects and embodiments may be combined, as will be apparent to those skilled in the art. In the following description, specific details are set forth for purposes of explanation to provide a thorough understanding of embodiments of this application. However, it will be apparent that various embodiments may be practiced without these specific details. The accompanying drawings and description are not intended to be limiting.

[0035] The following description provides exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of this disclosure. Rather, the following description of exemplary embodiments will provide an enabling description for those skilled in the art to implement the exemplary embodiments. It should be understood that various changes may be made to the function and arrangement of the elements without departing from the spirit and scope of this application as set forth in the appended claims.

[0036] Facial expression recognition and facial landmark detection are two important tasks that can be performed by facial analysis systems. Facial expression recognition involves the automatic classification and / or estimation of facial expressions depicted in an image frame. Facial landmark detection involves locating key (e.g., important or relevant) facial points in an image frame. Key facial points may include the corners of the eyes, the corners of the mouth, the tip of the nose, and other facial points. The location of detected facial landmarks can characterize and / or indicate facial shape (and the shape of one or more facial features such as the nose or mouth). In some cases, facial expression recognition and facial landmark detection enable computer-implemented systems (e.g., machine learning models) to determine and / or infer human characteristics such as behavior, intention, and / or emotion. Although facial expression recognition and facial landmark detection are related, they are often implemented for separate tasks and output different types of information.

[0037] This disclosure describes systems, apparatuses, methods, and computer-readable media (collectively, the "systems and techniques") for improving facial expression recognition. These systems and techniques provide facial expression recognition systems with the ability to utilize facial landmark detection techniques, which enable more accurate and / or efficient facial expression recognition. For example, a facial expression recognition system can use three-dimensional morphable model (3DMM) techniques (e.g., blendshape techniques) to determine information associated with landmark features of image frames (referred to as landmark feature information). The facial expression recognition system can use the landmark feature information as input to a neural network trained to perform facial expression recognition. In some examples, the landmark feature information enables the neural network to more effectively identify regions in image frames that are relevant to and / or important to facial expression recognition. Therefore, utilizing landmark feature information in a neural network trained for facial expression recognition can improve the accuracy of the expression classification output by the neural network (e.g., without increasing processing time and / or power).
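This section does not specify how blendshape-based landmark information is computed. As a hedged illustration only, the sketch below uses a common linear blendshape formulation, in which landmark positions equal a neutral shape plus a weighted sum of per-blendshape offsets, and recovers the weights from observed landmarks by ordinary least squares. The formulation, the function names, and the toy numbers are assumptions, not details taken from the disclosure.

```python
from typing import List

Vector = List[float]

def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))

def synthesize(neutral: Vector, deltas: List[Vector], coeffs: Vector) -> Vector:
    """Linear blendshape model: neutral + sum_k coeffs[k] * deltas[k]."""
    return [n + sum(c * d[i] for c, d in zip(coeffs, deltas))
            for i, n in enumerate(neutral)]

def solve_linear(a: List[Vector], b: Vector) -> Vector:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("blendshape basis is degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_blendshape_coeffs(neutral: Vector, deltas: List[Vector],
                          observed: Vector) -> Vector:
    """Least-squares coefficients that best explain the observed landmarks."""
    residual = [o - n for o, n in zip(observed, neutral)]
    gram = [[dot(di, dj) for dj in deltas] for di in deltas]   # normal equations
    rhs = [dot(di, residual) for di in deltas]
    return solve_linear(gram, rhs)

# Two mouth-corner landmarks stored as [x0, y0, x1, y1] (toy values).
neutral = [0.0, 0.0, 2.0, 0.0]
deltas = [
    [-1.0, 1.0, 1.0, 1.0],    # hypothetical "smile" offset
    [0.0, -1.0, 0.0, -1.0],   # hypothetical "jaw open" offset
]
observed = synthesize(neutral, deltas, [0.5, 0.25])
recovered = fit_blendshape_coeffs(neutral, deltas, observed)
```

A practical system would fit against projected 3D landmarks and would typically constrain or regularize the coefficients; this sketch omits both to keep the comparison between model and observation easy to follow.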

[0038] Further details regarding facial expression recognition are provided herein with reference to the various figures. Figure 1 is a block diagram illustrating the architecture of an image capture and processing system 100. The image capture and processing system 100 includes various components for capturing and processing images of a scene (e.g., an image of scene 110). The image capture and processing system 100 can capture individual images (or photographs) and / or can capture video comprising multiple images (or video frames) in a specific sequence. A lens 115 of the system 100 faces scene 110 and receives light from scene 110. The lens 115 bends the light toward an image sensor 130. The light received by the lens 115 passes through an aperture controlled by one or more control mechanisms 120 and is received by the image sensor 130.

[0039] One or more control mechanisms 120 may control exposure, focus, and / or zoom based on information from image sensor 130 and / or information from image processor 150. One or more control mechanisms 120 may include multiple mechanisms and components; for example, control mechanism 120 may include one or more exposure control mechanisms 125A, one or more focus control mechanisms 125B, and / or one or more zoom control mechanisms 125C. One or more control mechanisms 120 may also include additional control mechanisms besides those illustrated, such as controls for analog gain, flash, HDR, depth of field, and / or other image capture attributes.

[0040] The focus control mechanism 125B of the control mechanism 120 can obtain focus settings. In some examples, the focus control mechanism 125B stores the focus settings in a memory register. Based on the focus settings, the focus control mechanism 125B can adjust the positioning of the lens 115 relative to the image sensor 130. For example, based on the focus settings, the focus control mechanism 125B can adjust the focus by moving the lens 115 closer to or away from the image sensor 130 via an actuated motor or servo system. In some cases, the device 105A may include additional lenses, such as one or more microlenses on each photodiode of the image sensor 130, each bending light received from the lens 115 toward the corresponding photodiode before it reaches the photodiode. The focus settings can be determined via contrast detection autofocus (CDAF), phase detection autofocus (PDAF), or some combination thereof. The focus settings can be determined using the control mechanism 120, the image sensor 130, and / or the image processor 150. The focus settings may be referred to as image capture settings and / or image processing settings.

[0041] The exposure control mechanism 125A of the control mechanism 120 can obtain the exposure settings. In some cases, the exposure control mechanism 125A stores the exposure settings in a memory register. Based on the exposure settings, the exposure control mechanism 125A can control the size of the aperture (e.g., aperture size or f/stop), the duration for which the aperture is open (e.g., exposure time or shutter speed), the sensitivity of the image sensor 130 (e.g., ISO speed or film speed), the analog gain applied by the image sensor 130, or any combination thereof. The exposure settings may be referred to as image capture settings and / or image processing settings.

[0042] The zoom control mechanism 125C of the control mechanism 120 can obtain zoom settings. In some examples, the zoom control mechanism 125C stores the zoom settings in a memory register. Based on the zoom settings, the zoom control mechanism 125C can control the focal length of an assembly of lens elements (a lens assembly) that includes lens 115 and one or more additional lenses. For example, the zoom control mechanism 125C can control the focal length of the lens assembly by actuating one or more motors or servo systems to move one or more lenses relative to each other. The zoom settings may be referred to as image capture settings and / or image processing settings. In some examples, the lens assembly may include a parfocal zoom lens or a varifocal zoom lens. In some examples, the lens assembly may include a focusing lens (which may be lens 115 in some cases) that first receives light from scene 110, and then the light passes through an afocal zoom system between the focusing lens (e.g., lens 115) and image sensor 130 before reaching image sensor 130. In some cases, the afocal zoom system may include two positive (e.g., converging, convex) lenses with equal or similar focal lengths (e.g., within a threshold difference) and a negative (e.g., diverging, concave) lens between them. In some cases, the zoom control mechanism 125C moves one or more of the lenses in the afocal zoom system, such as the negative lens and one or both of the positive lenses.

[0043] Image sensor 130 includes one or more arrays of photodiodes or other photosensitive elements. Each photodiode measures an amount of light that ultimately corresponds to a specific pixel in the image generated by image sensor 130. In some cases, different photodiodes can be covered by different color filters, and can thus measure light matching the color of the filter covering the photodiode. For example, Bayer color filters include red, blue, and green filters, where each pixel of the image is generated based on red light data from at least one photodiode covered by a red filter, blue light data from at least one photodiode covered by a blue filter, and green light data from at least one photodiode covered by a green filter. Other types of color filters may use yellow, magenta, and / or cyan (also known as "emerald") color filters instead of or in addition to red, blue, and / or green filters. Some image sensors may lack color filters entirely, instead using different photodiodes (in some cases, vertically stacked) throughout the pixel array. Different photodiodes throughout the pixel array can have different spectral sensitivity profiles, and thus respond to light of different wavelengths. Monochrome image sensors may also lack color filters and therefore lack color depth.
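The way a color pixel can be formed from red, green, and blue photodiode readings, as described in paragraph [0043], is sketched below. The sketch assumes an RGGB layout for each 2x2 cell of the mosaic and collapses every cell into a single output pixel, which is a deliberate simplification of real demosaicing (practical pipelines interpolate to full resolution). The function name `demosaic_rggb_cells` is hypothetical.

```python
from typing import List, Tuple

def demosaic_rggb_cells(raw: List[List[float]]) -> List[List[Tuple[float, float, float]]]:
    """Collapse each 2x2 RGGB cell of a raw mosaic into one (R, G, B) pixel.

    Assumed cell layout:   R G
                           G B
    """
    out = []
    for y in range(0, len(raw) - 1, 2):
        row = []
        for x in range(0, len(raw[0]) - 1, 2):
            r = raw[y][x]                                  # red-filtered photodiode
            g = (raw[y][x + 1] + raw[y + 1][x]) / 2        # average of two green ones
            b = raw[y + 1][x + 1]                          # blue-filtered photodiode
            row.append((r, g, b))
        out.append(row)
    return out

# A single 2x2 cell of raw photodiode readings (toy values).
raw = [[10, 20],
       [30, 40]]
pixels = demosaic_rggb_cells(raw)
```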

[0044] In some cases, image sensor 130 may alternatively or additionally include an opaque and / or reflective mask to block light from reaching certain photodiodes or portions of certain photodiodes at certain times and / or from certain angles, which can be used for phase detection autofocus (PDAF). Image sensor 130 may also include an analog gain amplifier to amplify the analog signal output by the photodiodes, and / or an analog-to-digital converter (ADC) to convert the analog signal output by the photodiodes (and / or the analog signal amplified by the analog gain amplifier) into a digital signal. In some cases, certain components or functions discussed with respect to one or more of the control mechanisms 120 may alternatively or additionally be included in image sensor 130. Image sensor 130 may be a charge-coupled device (CCD) sensor, an electron-multiplying CCD (EMCCD) sensor, an active pixel sensor (APS), a complementary metal-oxide-semiconductor (CMOS) sensor, an N-type metal-oxide-semiconductor (NMOS) sensor, a hybrid CCD / CMOS sensor (e.g., sCMOS), or some combination thereof.

[0045] Image processor 150 may include one or more processors, such as one or more image signal processors (ISPs) (including ISP 154), one or more host processors (including host processor 152), and / or one or more of any other type of processor 1210. Host processor 152 may be a digital signal processor (DSP) and / or other types of processors. In some embodiments, image processor 150 is a single integrated circuit or chip (e.g., referred to as a system-on-a-chip or SoC) that includes host processor 152 and ISP 154. In some cases, the chip may also include one or more input / output ports (e.g., input / output (I / O) port 156), a central processing unit (CPU), a graphics processing unit (GPU), a broadband modem (e.g., 3G, 4G, or LTE, 5G, etc.), memory, and connectivity components (e.g., Bluetooth™). The I / O port 156 may include any suitable input / output port or interface according to one or more protocols or specifications, such as an Inter-Integrated Circuit 2 (I2C) interface, an Inter-Integrated Circuit 3 (I3C) interface, a Serial Peripheral Interface (SPI) interface, a serial General Purpose Input / Output (GPIO) interface, a Mobile Industry Processor Interface (MIPI) (e.g., a MIPI CSI-2 physical (PHY) layer port or interface), an Advanced High-performance Bus (AHB) bus, any combination thereof, and / or other input / output ports. In one illustrative example, the host processor 152 may use the I2C port to communicate with the image sensor 130, while the ISP 154 may use the MIPI port to communicate with the image sensor 130.

[0046] Image processor 150 can perform a variety of tasks, such as demosaicing, color space conversion, image frame downsampling, pixel interpolation, automatic exposure (AE) control, automatic gain control (AGC), CDAF, PDAF, automatic white balance, merging image frames to form an HDR image, image recognition, object recognition, feature recognition, receiving input, managing output, managing memory, or some combination thereof. Image processor 150 can store image frames and / or processed images in random access memory (RAM) 140 / 1220, read-only memory (ROM) 145 / 1225, cache 1212, memory unit 1215, another storage device 1230, or some combination thereof.

[0047] Various input / output (I / O) devices 160 can be connected to the image processor 150. I / O devices 160 may include a display screen, keyboard, keypad, touchscreen, touchpad, touch-sensitive surface, printer, any other output device 1235, any other input device 1245, or combinations thereof. In some cases, captions can be entered into the image processing device 105B via the physical keyboard or keypad of the I / O device 160, or via the virtual keyboard or keypad of the touchscreen of the I / O device 160. I / O 160 may include one or more ports, jacks, or other connectors that enable wired connections between the device 105B and one or more peripheral devices, through which the device 105B can receive data from and / or send data to one or more peripheral devices. I / O 160 may include one or more wireless transceivers that enable wireless connections between the device 105B and one or more peripheral devices, through which the device 105B can receive data from and / or send data to one or more peripheral devices. Peripheral devices may include any type of I / O device 160 discussed earlier, and can be considered as I / O devices 160 themselves once they are coupled to ports, jacks, wireless transceivers or other wired and / or wireless connectors.

[0048] In some cases, the image capture and processing system 100 may be a single device. In other cases, the image capture and processing system 100 may be two or more separate devices, including an image capture device 105A (e.g., a camera) and an image processing device 105B (e.g., a computing device coupled to the camera). In some embodiments, the image capture device 105A and the image processing device 105B may be coupled together, for example, via one or more wires, cables, or other electrical connectors and / or wirelessly via one or more wireless transceivers. In some embodiments, the image capture device 105A and the image processing device 105B may be disconnected from each other.

[0049] As shown in Figure 1, a vertical dashed line divides the image capture and processing system 100 of Figure 1 into two parts, representing the image capture device 105A and the image processing device 105B, respectively. Image capture device 105A includes a lens 115, a control mechanism 120, and an image sensor 130. Image processing device 105B includes an image processor 150 (including an ISP 154 and a host processor 152), RAM 140, ROM 145, and I / O 160. In some cases, certain components shown in image processing device 105B, such as ISP 154 and / or host processor 152, may be included in image capture device 105A.

[0050] Image capture and processing system 100 may include electronic devices such as mobile or landline handsets (e.g., smartphones, cellular phones, etc.), desktop computers, laptop or notebook computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, video streaming devices, Internet Protocol (IP) cameras, or any other suitable electronic devices. In some examples, image capture and processing system 100 may include one or more wireless transceivers for wireless communication, such as cellular network communication, 802.11 Wi-Fi communication, wireless local area network (WLAN) communication, or combinations thereof. In some embodiments, image capture device 105A and image processing device 105B may be different devices. For example, image capture device 105A may include a camera device, while image processing device 105B may include a computing device, such as a mobile handset, desktop computer, or other computing device.

[0051] Although the image capture and processing system 100 is shown to include certain components, those skilled in the art will understand that the image capture and processing system 100 may include more components than those shown in Figure 1. Components of the image capture and processing system 100 may include software, hardware, or one or more combinations of software and hardware. For example, in some embodiments, components of the image capture and processing system 100 may include and / or be implemented using electronic circuitry or other electronic hardware, which may include one or more programmable electronic circuits (e.g., microprocessors, GPUs, DSPs, CPUs, and / or other suitable electronic circuits), and / or may include and / or be implemented using computer software, firmware, or any combination thereof to perform the various operations described herein. The software and / or firmware may include one or more instructions stored on a computer-readable storage medium and executable by one or more processors of an electronic device implementing the image capture and processing system 100.

[0052] The host processor 152 can configure the image sensor 130 with new parameter settings (e.g., via an external control interface such as I2C, I3C, SPI, GPIO, and / or other interfaces). In one illustrative example, the host processor 152 can update the exposure settings used by the image sensor 130 based on the internal processing results of an exposure control algorithm from past image frames. The host processor 152 can also dynamically configure the parameter settings of the internal pipelines or modules of the ISP 154 to match the settings of one or more input image frames from the image sensor 130, so that the ISP 154 processes the image data correctly. The processing (or pipeline) blocks or modules of the ISP 154 may include modules for lens / sensor noise correction, demosaicing, color conversion, image attribute correction or enhancement / suppression, denoising filters, sharpening filters, etc. The settings of different modules of the ISP 154 can be configured by the host processor 152. Each module may include a large number of adjustable parameter settings. Additionally, since different modules can affect similar aspects of the image, the modules can be interdependent. For example, denoising and texture correction or enhancement may both affect the high-frequency aspects of the image. Therefore, the ISP uses a large number of parameters to generate the final image from the captured raw image.

[0053] Figure 2 is a block diagram illustrating an example of an expression recognition system 200. In some embodiments, the expression recognition system 200 may be implemented by the image capture and processing system 100 shown in Figure 1. For example, the expression recognition system 200 may be implemented by the image processor 150, the image sensor 130, and / or any additional components of the image capture and processing system 100. The expression recognition system 200 may also be implemented by any additional or alternative computing device or system. As shown, the expression recognition system 200 may include one or more engines, including an image frame engine 202, a landmark feature engine 204, and an expression recognition engine 206. As will be explained in more detail below, one or more of the engines of the expression recognition system 200 may correspond to and / or include machine learning models (e.g., deep neural networks) trained to perform facial expression recognition.

[0054] In one example, the image frame engine 202 may receive an image frame 208 captured by an image sensor (e.g., image sensor 130) of the expression recognition system 200. The image frame 208 may be a color image frame (e.g., an RGB image frame), a grayscale image frame, an infrared (IR) image frame, a near-infrared (NIR) image frame, or any other type of image frame. In some examples, receiving the image frame 208 may initiate facial expression recognition processing and / or be part of facial expression recognition processing. For example, in response to receiving the image frame 208, the image frame engine 202 may pass the image frame 208 to the landmark feature engine 204 and / or the expression recognition engine 206 to determine one or more facial expressions (e.g., expression classification 212) associated with the image frame 208. In some cases, the expression recognition system 200 may implement facial expression recognition processing based on one or more facial landmark detection techniques. As used herein, “facial landmark detection” (or simply “landmark detection”) is the task of detecting landmark features within image data corresponding to a human face. Landmark features may include any point, location, and / or region in an image frame associated with all or part of a facial feature. For example, landmark features can indicate and / or be associated with facial features such as the corners of the mouth, the corners of the eyes, the lip borders, the upper curve of the cheek, and the tip of the nose. In some examples, a facial feature may be associated with and / or defined by multiple landmark features (e.g., 10, 20, 30, etc.). In some cases, facial landmark detection may involve detecting key landmark features (e.g., features that are relatively important and / or relevant to the task). While both facial landmark detection and facial expression recognition are techniques used in facial analysis systems, these techniques often involve separate operations and / or produce different results. For example, many existing facial expression recognition processes can be performed without using any facial landmark detection techniques.

[0055] As shown in Figure 2, the landmark feature engine 204 can determine landmark feature information 210 associated with image frame 208. Landmark feature information 210 may include any information indicating and / or based on one or more landmark features of image frame 208. In one example, landmark feature information 210 may include information obtained using a system for generating 3D models (e.g., a 3D deformable model (3DMM) system). As used herein, a 3DMM system may include any type or form of generative model for creating, adjusting, animating, manipulating, and / or modeling faces and / or heads (e.g., human faces and / or heads). As used herein, a model generated by a 3DMM system may be referred to as a 3D head model (or 3DMM). In one example, a 3DMM system may generate a 3D head model displaying a specific facial expression. The 3D head model may be based on image data captured by a camera system and / or computer-generated image data.

[0056] In some examples, the 3DMM system can utilize blend shape tools to deform and / or model regions of a 3D head model. The deformation of the 3D head model caused by the blend shape tool can be adjusted by tuning one or more blend shape coefficients associated with the tool. As used herein, blend shape coefficients can correspond to an approximate semantic parameterization of a full or partial facial expression. For example, blend shape coefficients can correspond to a complete facial expression or a “partial” (e.g., “delta”) facial expression. Examples of partial expressions include raising an eyebrow, closing one eye, moving one side of the face, etc. In one example, a single blend shape coefficient can approximate the linearization effect of a single facial muscle movement.

[0057] In some cases, a 3DMM system can efficiently (e.g., using relatively low processing power) adjust the facial expression of a 3D head model by changing one or more blend shape coefficients associated with it. For example, Figure 3A shows an example 3D head model 302 that can be generated by a 3DMM system. In this example, a user can adjust the facial expression of the 3D head model 302 by adjusting the position of one or more slider controls 306. Each slider control can be coupled with a blend shape (e.g., blend shapes 0-37, as shown in Figure 3A). In this example, the blend shape coefficient of each blend shape is represented as a value between 0 and 6000, where a value of 3000 corresponds to a neutral facial expression, and values of 0 and 6000 correspond to the maximum deviation from a neutral facial expression. Note that the raw output has a value range from -3 to 3; the output shown in Figure 3A is normalized to a numerical range of 0 to 6000 for better resolution. The blend shape coefficients can be represented in any additional way, such as a percentage and / or a floating-point number between 0.0 and 1.0. When the user adjusts one or more of the slider controls 306, the 3DMM system can adjust the blend shape coefficients accordingly, thereby adjusting the facial expression of the 3D head model 302.
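The normalization from the raw range of -3 to 3 to the displayed range of 0 to 6000 can be illustrated with a short sketch. The text does not state how the mapping is computed, so the linear rescaling, the clamping, and the function names below are assumptions made only for illustration.

```python
RAW_MIN, RAW_MAX = -3.0, 3.0        # raw coefficient range stated in the text
SLIDER_MIN, SLIDER_MAX = 0, 6000    # normalized range stated in the text


def raw_to_slider(raw: float) -> int:
    """Map a raw coefficient to a slider value (assumed linear mapping)."""
    clamped = max(RAW_MIN, min(RAW_MAX, raw))
    scale = (SLIDER_MAX - SLIDER_MIN) / (RAW_MAX - RAW_MIN)
    return round(SLIDER_MIN + (clamped - RAW_MIN) * scale)


def slider_to_unit(slider: int) -> float:
    """Alternative representation as a floating-point number in [0.0, 1.0]."""
    return (slider - SLIDER_MIN) / (SLIDER_MAX - SLIDER_MIN)
```

Under this assumed mapping, a raw value of 0 lands on the neutral slider value of 3000, and the two extremes of the raw range land on 0 and 6000.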

[0058] In one example, the blend shape coefficients may include identifier coefficients and / or facial expression coefficients. Identifier coefficients may represent facial features associated with a specific face (e.g., unique to that face). Facial expression coefficients may represent variations in facial features associated with a variety of facial expressions (e.g., generic facial expressions unrelated to a specific face). In an illustrative example, a 3D model of the shape of a human head can be determined using the equation $S = \bar{S} + A_{id}\alpha_{id} + A_{exp}\alpha_{exp}$, where $S$ is the overall 3D shape of the human head, $\bar{S}$ is the average facial shape of a person, $A_{id}$ is the feature vectors (e.g., principal components) determined based on a model trained using 3D facial scans of people with neutral expressions, $\alpha_{id}$ is the shape coefficient associated with neutral facial expressions, $A_{exp}$ is the feature vectors determined by training a model based on the offsets between 3D facial scans of people with various facial expressions and 3D facial scans of people with neutral expressions, and $\alpha_{exp}$ is the facial expression coefficient associated with various common facial expressions.
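As a minimal sketch of how the shape model described above combines its terms, the following computes an overall shape as the average shape plus an identifier offset plus an expression offset. The flattened-list layout of the shapes, the tiny basis matrices, and all function names are hypothetical and chosen only to keep the example small.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a coefficient vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def morph_shape(mean_shape, a_id, alpha_id, a_exp, alpha_exp):
    """Overall shape = average shape + identifier offset + expression offset."""
    id_offset = matvec(a_id, alpha_id)
    exp_offset = matvec(a_exp, alpha_exp)
    return [m + i + e for m, i, e in zip(mean_shape, id_offset, exp_offset)]


# Toy data: two vertices flattened as [x0, y0, z0, x1, y1, z1].
mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
a_id = [[1, 0], [0, 1], [0, 0], [1, 0], [0, 1], [0, 0]]   # two identifier basis vectors
a_exp = [[0.0], [0.5], [0.0], [0.0], [-0.5], [0.0]]       # one expression basis vector

neutral = morph_shape(mean_shape, a_id, [0.0, 0.0], a_exp, [0.0])
expressive = morph_shape(mean_shape, a_id, [0.2, -0.1], a_exp, [1.0])
```

With all coefficients set to zero, the sketch returns the average shape unchanged, which matches the role of the average facial shape in the equation.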

[0059] Based on the identifier coefficients associated with a specific person (e.g., the feature vector $A_{id}$ in the above formula), the 3DMM system can convert a general 3D head model into a 3D head model specific to that person. Furthermore, based on facial expression coefficients associated with specific facial expressions (e.g., $\alpha_{exp}$ in the above formula), the 3DMM system can convert a general 3D head model into a 3D head model that represents a specific facial expression. Figure 3B provides examples of various 3D head models that can be generated and / or transformed by a 3DMM system using various types of blend shape coefficients. For example, model 310 shows a general 3D head model with a neutral expression, model 312 shows model 310 transformed into a facial expression of surprise, model 314 shows model 310 transformed into a facial expression of happiness (e.g., a smile), and model 316 shows model 310 transformed into a facial expression of disgust.

[0060] Returning to Figure 2, landmark feature information 210 may include one or more blend shape coefficients associated with image frame 208. In some cases, these blend shape coefficients may represent the variation between the positions of a person's facial features within image frame 208 and the positions of the corresponding facial features in a 3D head model of the person with a neutral expression. For example, landmark feature engine 204 may detect a person's face within image frame 208 (e.g., based on one or more object detection algorithms, object recognition algorithms, face detection algorithms, face recognition algorithms, or any other recognition and / or detection algorithms). Landmark feature engine 204 may compare image data corresponding to the person's face with a 3D head model of that person's face. Based on the comparison, landmark feature engine 204 may determine a set of blend shape coefficients corresponding to the facial expression associated with the person's face.
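One simple way to picture the comparison described above is as a projection of the offset between observed and neutral feature positions onto expression basis vectors. The sketch below is not the disclosed fitting method: it assumes that 3D feature positions have already been estimated from the image, and that the columns of the hypothetical basis `a_exp` are orthonormal so that a plain projection recovers the coefficients.

```python
def fit_expression_coefficients(observed_shape, neutral_shape, a_exp):
    """Project the offset from the neutral shape onto each basis column.

    Assumes the columns of a_exp are orthonormal (an illustrative assumption).
    """
    offset = [o - n for o, n in zip(observed_shape, neutral_shape)]
    num_coeffs = len(a_exp[0])
    return [
        sum(a_exp[row][k] * offset[row] for row in range(len(offset)))
        for k in range(num_coeffs)
    ]


# Toy data: four coordinates and two orthonormal expression basis vectors.
a_exp = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
neutral_shape = [0.0, 0.0, 0.0, 0.0]
observed_shape = [0.3, -0.7, 0.0, 0.0]

coefficients = fit_expression_coefficients(observed_shape, neutral_shape, a_exp)
```

A practical fitter would also have to handle pose, camera projection, and non-orthogonal bases, none of which this sketch attempts.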

[0061] In one example, the determined blend shape coefficients may represent all or part of the landmark feature information 210. In other examples, the landmark feature information 210 may include landmark image frames generated based on the blend shape coefficients. In some examples, a landmark image frame may indicate and / or represent one or more key landmark features of a person's face within the image frame. To generate the landmark image frame, the landmark feature engine 204 may determine multiple landmark features associated with the image frame 208 based on the blend shape coefficients. For example, the blend shape coefficients may indicate the locations within the image frame 208 corresponding to various landmark features.

[0062] Figure 3C shows an example image frame 318 displaying multiple landmark features that can be determined by the landmark feature engine 204 based on a set of blend shape coefficients. In this example, the landmark features determined by the landmark feature engine 204 are shown as points superimposed on image data corresponding to a person's face within image frame 318. The landmark feature engine 204 can determine any number of landmark features associated with a person's face. For example, the landmark feature engine 204 can determine 100, 200, or 300 landmark features. In some cases, the landmark feature engine 204 can generate landmark image frames by determining a subset of landmark features corresponding to key landmark features. As used herein, key landmark features can correspond to landmark features that are highly correlated with a person's facial expression (e.g., with a correlation above a threshold correlation). In some cases, the landmark feature engine 204 can determine relevant landmark features by determining landmark features that specifically indicate a person's facial expression (e.g., compared to other landmark features). For example, landmark features associated with certain facial features (such as a person's eyes, nose, and / or mouth) may vary more in structure, appearance, and / or location between different facial expressions than landmark features associated with other facial features (such as a person's chin or forehead). Therefore, the landmark feature engine 204 can identify variable (and therefore relevant) landmark features as key landmark features.
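The idea of treating the more variable landmark features as key landmark features can be sketched as follows. The text refers to a correlation above a threshold; this sketch substitutes a simpler stand-in, the positional variance of each landmark across expressions, and the data layout, threshold value, and function name are all hypothetical.

```python
from statistics import pvariance


def select_key_landmarks(positions_by_expression, threshold):
    """Return indices of landmarks whose position varies more than the threshold.

    positions_by_expression holds one list of (x, y) landmark positions
    per facial expression, with landmarks in the same order each time.
    """
    num_landmarks = len(positions_by_expression[0])
    key_indices = []
    for i in range(num_landmarks):
        xs = [frame[i][0] for frame in positions_by_expression]
        ys = [frame[i][1] for frame in positions_by_expression]
        if pvariance(xs) + pvariance(ys) > threshold:
            key_indices.append(i)
    return key_indices


# Toy data: landmark 0 (mouth corner), landmark 1 (chin), landmark 2 (eyebrow).
positions = [
    [(10, 20), (15, 40), (12, 5)],   # neutral
    [(8, 18), (15, 40), (12, 5)],    # happy
    [(10, 22), (15, 41), (12, 2)],   # surprised
]
key_landmarks = select_key_landmarks(positions, threshold=1.0)
```

In this toy data the chin landmark barely moves and is excluded, while the mouth corner and eyebrow landmarks are kept.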

[0063] In some cases, the landmark feature engine 204 may generate landmark image frames based on forming one or more connections between key landmark features. For example, the landmark feature engine 204 may determine lines, curves, boundaries, and / or shapes that define one or more facial features associated with key landmark features. The landmark feature engine 204 may draw these connections to generate landmark image frames. In an exemplary example, the landmark image frame may be a binary image frame (e.g., a black and white image frame) that uses pixels set to a pixel value to indicate the connections between key landmark features. Furthermore, in some examples, the landmark feature engine 204 may generate landmark image frames using a specific type and / or subset of blending shape coefficients. For example, the landmark feature engine 204 may generate landmark image frames using landmark features determined based on facial expression blending shape coefficients (rather than identifier blending shape coefficients). In some cases, facial features associated with identifier blending shape coefficients may be unnecessary for facial expression recognition. Therefore, ignoring identifier blending shape coefficients when generating landmark image frames can simplify and / or speed up facial expression recognition processing. Furthermore, in some examples, the landmark feature engine 204 may consider the rotation (e.g., orientation and / or angle) of a person's face when generating landmark image frames. For example, the landmark feature engine 204 can generate a “rotated” landmark image representing the 3D features of a face. However, in other examples, the landmark feature engine 204 can generate a “frontal” landmark image frame representing the two-dimensional (2D) features of a human face.
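A binary landmark image frame of the kind described above can be sketched by rasterizing straight connections between key landmark features. The straight-line interpolation, the pixel value of 1, and the function name are illustrative assumptions; the text also allows curves, boundaries, and shapes, which this sketch does not cover.

```python
def draw_landmark_frame(height, width, connections):
    """Rasterize line segments between landmark points into a binary frame.

    connections is a list of ((x0, y0), (x1, y1)) pixel-coordinate pairs.
    """
    frame = [[0] * width for _ in range(height)]
    for (x0, y0), (x1, y1) in connections:
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            x = round(x0 + (x1 - x0) * s / steps)
            y = round(y0 + (y1 - y0) * s / steps)
            if 0 <= x < width and 0 <= y < height:
                frame[y][x] = 1   # pixel value marking a connection
    return frame


# Two connected segments forming a corner, drawn on an 8x8 frame.
landmark_frame = draw_landmark_frame(8, 8, [((1, 1), (5, 1)), ((5, 1), (5, 4))])
```

Pixels that lie on a connection are set to 1 and all other pixels remain 0, giving the black and white frame described in the text.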

[0064] Figure 3D and Figure 3E show various examples of landmark image frames that can be generated by the landmark feature engine 204. For example, Figure 3D shows landmark image frames generated based on image frames 320(A) and 320(B). Image frames 320(A) and 320(B) represent facial images of the same person displaying different facial expressions. Landmark image frame 322(A) corresponds to a rotated landmark image frame generated based on the facial expression blend shape coefficients (but not the identifier blend shape coefficients) associated with image frame 320(A). Landmark image frame 324(A) corresponds to a frontal landmark image frame generated based on the facial expression blend shape coefficients (but not the identifier blend shape coefficients) associated with image frame 320(A). Furthermore, landmark image frame 326(A) corresponds to a frontal landmark image frame generated based on both the facial expression blend shape coefficients and the identifier blend shape coefficients associated with image frame 320(A). Landmark image frames 322(B), 324(B), and 326(B) are the corresponding landmark image frames generated based on the blend shape coefficients associated with image frame 320(B). As shown in the figure, the landmark image frames generated based on image frame 320(A) are different from the landmark image frames generated based on image frame 320(B).

[0065] Figure 3E shows landmark image frames generated based on image frames 328(A) and 328(B). Image frames 328(A) and 328(B) represent images of two different people (e.g., people with different average facial shapes). Landmark image frame 330(A) corresponds to a rotated landmark image frame generated based on the facial expression blend shape coefficients (but not the identifier blend shape coefficients) associated with image frame 328(A). Landmark image frame 332(A) corresponds to a frontal landmark image frame generated based on both the facial expression blend shape coefficients and the identifier blend shape coefficients associated with image frame 328(A). Furthermore, landmark image frame 334(A) corresponds to a frontal landmark image frame generated based on the facial expression blend shape coefficients (but not the identifier blend shape coefficients) associated with image frame 328(A). Landmark image frames 330(B), 332(B), and 334(B) are the corresponding landmark image frames generated based on the blend shape coefficients associated with image frame 328(B). As shown in the figure, the landmark image frames generated based on image frame 328(A) are different from the landmark image frames generated based on image frame 328(B).

[0066] Figure 3F is a block diagram of an example landmark feature system 300 configured to determine the landmark feature information 210 shown in Figure 2. For example, the landmark feature system 300 can be configured to determine blend shape coefficients and / or landmark image frames. In one example, all or part of the landmark feature system 300 may correspond to and / or be implemented by the landmark feature engine 204 of the expression recognition system 200. As shown in Figure 3F, the face detection engine 338 of the landmark feature system 300 can receive image frame 336 (e.g., corresponding to image frame 208 in Figure 2). The face detection engine 338 can perform any type or form of object detection, object recognition, face detection, and / or face recognition algorithm to detect one or more faces in image frame 336. In some examples, if the face detection engine 338 determines that image frame 336 includes image data corresponding to a human face, the 3DMM fitter 340 can determine blend shape coefficients associated with the human face. As mentioned above, in some examples, these blend shape coefficients can correspond to landmark feature information 210. In other examples, the landmark image generator 342 of the landmark feature system 300 can generate landmark image frame 344 based on the blend shape coefficients determined by the 3DMM fitter 340. In some cases, the 3DMM fitter 340 can use a machine learning model (e.g., a deep neural network, such as a convolutional neural network (CNN)) to determine the landmark features associated with image frame 336. For example, a CNN can utilize a loss function that compares image frame 336 with a 3D reconstructed version of image frame 336. In an illustrative example, the reconstructed version of image frame 336 can be generated based on an estimated depth map of image frame 336. For example, the 3DMM fitter 340 can use the estimated depth map to ensure that the 3D reconstructed version of image frame 336 is consistent with the (2D) image frame 336. The loss function (which may be referred to as a "shape from shading" loss function) can provide an accurate and / or dense set of landmark features. In some examples, landmark image frame 344 (and optionally the blend shape coefficients) may correspond to landmark feature information 210.

[0067] Figure 4A is a block diagram of a neural network 400(A) trained for facial expression recognition. In some examples, all or part of the neural network 400(A) may correspond to and / or be implemented by the expression recognition system 200 of Figure 2. As will be explained in more detail below, neural network 400(A) may represent an example of the overall architecture and / or framework of a neural network implemented by the disclosed facial expression recognition system. Examples of more specific implementations of the neural network are provided with reference to Figure 4B and Figure 4C.

[0068] In some cases, neural network 400(A) can correspond to a neural network trained on image frames associated with various facial expressions. In this example, neural network 400(A) can be trained to output a classification of the facial expression associated with an input image frame. In illustrative examples, neural network 400(A) can be a deep neural network, such as a convolutional neural network (CNN). Illustrative examples of deep neural networks are described below in connection with Figure 7, Figure 8A, Figure 8B, Figure 9, Figure 10A, Figure 10B, Figure 10C, and Figure 11. Additional examples of neural network 400(A) include, but are not limited to, time-delay neural networks (TDNN), deep feedforward neural networks (DFFNN), recurrent neural networks (RNN), autoencoders (AE), variational AEs (VAE), denoising AEs (DAE), sparse AEs (SAE), Markov chains (MC), perceptrons, or some combination thereof.

[0069] In some cases, neural network 400(A) may include one or more convolutional blocks, such as convolutional blocks 402(1)-402(4). As used herein, a convolutional block may represent a portion of a neural network comprising one or more convolutional layers. A convolutional layer may perform one or more functions (e.g., using one or more filters) that output activations (e.g., as activation data). In an illustrative example, a convolutional layer may implement rectified linear activation units (ReLU). In some examples, a convolutional block may also include one or more other types of layers. For example, a convolutional block may include a pooling layer configured to downsample the activation data output by one or more convolutional layers. Furthermore, in some examples, a convolutional block may include a batch normalization layer configured to normalize the mean and / or standard deviation of the activation data output by one or more convolutional layers. Additionally or alternatively, a convolutional block may include a scaling layer that performs one or more scaling and / or biasing operations to restore the activation data to an appropriate range. In some cases, the batch normalization layer of a convolutional block can perform the scaling and / or biasing operations. In these cases, the convolutional block may not include a separate scaling layer.
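The non-convolutional operations named above (rectified linear activation, normalization of the mean and standard deviation followed by scaling and biasing, and pooling) can be sketched on a one-dimensional list of activations. The ordering of the operations, the window size, and the function names are assumptions for illustration; the convolution itself is omitted to keep the sketch short.

```python
from statistics import fmean, pvariance


def relu(values):
    """Rectified linear activation: negative activations become zero."""
    return [max(0.0, v) for v in values]


def batch_normalize(values, scale=1.0, bias=0.0, eps=1e-5):
    """Normalize to zero mean and unit standard deviation, then scale and bias."""
    mean = fmean(values)
    std = (pvariance(values) + eps) ** 0.5
    return [scale * (v - mean) / std + bias for v in values]


def max_pool_1d(values, window=2):
    """Downsample by keeping the maximum of each non-overlapping window."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


activations = relu([-1.0, 2.0, 0.5, -0.3, 4.0, 1.5])
normalized = batch_normalize(activations, scale=2.0, bias=0.5)
pooled = max_pool_1d(normalized)
```

Here the scale and bias are folded into the normalization function, mirroring the case in which the batch normalization layer performs the scaling and biasing and no separate scaling layer is needed.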

[0070] In an illustrative example, neural network 400(A) may include four convolutional blocks, each comprising three convolutional layers. However, neural network 400(A) may include any number of convolutional blocks and / or convolutional layers. Neural network 400(A) may also include one or more fully connected layers, such as fully connected layer 404. In one example, convolutional blocks 402(1)-402(4) and fully connected layer 404 may be trained to recognize facial expressions associated with image frame 406 input to neural network 400(A). For example, neural network 400(A) may determine expression classification 412 (e.g., corresponding to expression classification 212 in Figure 2). The expression classification 412 may correspond to and / or indicate a facial expression in a set of candidate facial expressions. Each candidate facial expression may correspond to a class of the neural network 400(A). For example, the neural network 400(A) may be trained to determine which candidate facial expression most closely matches and / or corresponds to the facial expression associated with image frame 406. In an illustrative example, the set of candidate facial expressions may include seven facial expressions: neutral, angry, disgusted, fearful, happy, sad, and surprised. Each candidate facial expression may be assigned a unique identifier (e.g., an integer value from 0 to 6). In some examples, the expression classification 412 output by the fully connected layer 404 of the neural network 400(A) may include an identifier (e.g., an integer value) corresponding to the determined and / or selected candidate facial expression. In some examples, the fully connected layer 404 may output a confidence level or probability associated with each expression (e.g., a first probability for an identifier associated with a neutral facial expression, a second probability for an identifier associated with an angry facial expression, etc.). In such examples, expression classification 412 can correspond to the identifier / expression with the highest confidence or probability.

[0071] In some cases, the neural network 400(A) can determine the expression classification 412 based on landmark feature information associated with image frame 406. For example, the neural network 400(A) can receive blend shape coefficients and / or one or more landmark image frames associated with image frame 406 as input to one or more layers. This landmark feature information can enable the neural network 400(A) to determine the expression classification 412 more efficiently and / or accurately. For example, the landmark feature information can indicate to the neural network 400(A) which features in image frame 406 are important and / or relevant to facial expression recognition.

[0072] In one example, one or more convolutional layers of neural network 400(A) may utilize landmark image frames 408(1), 408(2), and / or 408(3). In some cases, these landmark image frames may have different sizes (e.g., resolutions or scales). For example, the size of the landmark image frame utilized by a convolutional block may correspond to the size of the activation data processed by the convolutional layers of that convolutional block. Since the pooling layers of the convolutional blocks of neural network 400(A) can downsample the activation data before passing it to the next layer, utilizing landmark image frames of the corresponding size ensures that the landmark image frames can be processed accurately. Additionally or alternatively, neural network 400(A) may utilize blend shape coefficients 414. For example, the blend shape coefficients 414 may be input to a fully connected layer 404. In some cases, the type of landmark feature information utilized by different layers of neural network 400(A) may be based on the data formats that the different layers are configured to process. For example, convolutional layers can be configured to process image data (e.g., the data format of landmark image frames), while fully connected layers can be configured to process floating-point numbers (e.g., the data format of blend shape coefficients). Furthermore, some implementations of the neural network 400(A) may utilize landmark image frames (instead of blend shape coefficients), while other implementations of the neural network 400(A) may utilize blend shape coefficients (instead of landmark image frames). However, further implementations of the neural network 400(A) may utilize both landmark image frames and blend shape coefficients.

[0073] Figure 4B is a diagram of a neural network 400(B) corresponding to an example implementation of the neural network 400(A) of Figure 4A. Neural network 400(B) represents a “multi-scale” neural network that can utilize one or more landmark image frames of various sizes. As shown, neural network 400(B) includes convolutional blocks 402(1)-402(4) of neural network 400(A) (and an additional convolutional layer 402(5)). In this example, each of convolutional blocks 402(1)-402(4) includes three convolutional layers (shown as unshaded rectangles in Figure 4B), so that, together with convolutional layer 402(5), there are a total of 13 convolutional layers. However, neural network 400(B) can include any appropriate number of convolutional layers. In the illustrative example, convolutional blocks 402(1)-402(4) can perform conv2d convolution operations using kernels of size 3. Convolutional blocks 402(1)-402(4) can also each include a pooling layer (shown as a shaded rectangle in Figure 4B). Furthermore, neural network 400(B) may include the fully connected layer 404 of neural network 400(A).

[0074] In one example, convolutional block 402(1) may receive image frame 406 and landmark image frame 408(1) as input. In some cases, the size of landmark image frame 408(1) may correspond to (e.g., match) the size of image frame 406. In an illustrative example, the size of each image frame may be 56×64 pixels. Each image frame may correspond to a separate channel, resulting in a total input size of 56×64×2. In one example, the convolutional layer of convolutional block 402(1) may output activation data (e.g., feature maps) of size 56×64×31 (e.g., 56×64 pixels for 31 channels). The pooling layer of convolutional block 402(1) may downsample (e.g., reduce its size) the activation data before passing it to convolutional block 402(2). For example, the pooling layer may halve the size of the activation data in each channel. In some cases, downsampling activation data in a convolutional neural network can enable the extraction and / or analysis of various types of features (e.g., coarse-grained features, medium-grained features, and / or fine-grained features). However, downsampling in a neural network 400(B) can lead to the loss of landmark feature information passed between convolutional blocks.

[0075] To account for and / or mitigate the loss of landmark feature information propagated between convolutional blocks, the pooling layer of convolutional block 402(1) may receive landmark image frame 408(2). In one example, landmark image frame 408(2) may be a version of landmark image frame 408(1) that has been downsampled at a rate corresponding to the downsampling rate of the pooling layer of convolutional block 402(1). For example, landmark image frame 408(2) may have a size of 28×32 pixels. The pooling layer may combine landmark image frame 408(2) with downsampled activation data from the convolutional layer. For example, the pooling layer may include landmark image frame 408(2) in separate channels, resulting in 32 data channels of size 28×32 pixels. In other examples, landmark image frame 408(2) may have the same size as landmark image frame 408(1) (e.g., 56×64 pixels). In these examples, the pooling layer may downsample landmark image frame 408(1) using the downsampling rate applied to the activation data input to the pooling layer. For example, the pooling layer can halve the size of the landmark image frame 408(2) to obtain a size of 28×32 pixels. Then, the pooling layer can combine the downsampled landmark image frame 408(2) with downsampled activation data from the convolutional layer (e.g., producing 32 data channels of size 28×32 pixels). Furthermore, in some cases, combining the landmark image frame 408(2) with the activation data can include combining a representative value (e.g., the average value) of the landmark image frame 408(2) with activation data corresponding to 31 channels. After combining the landmark image frame 408(2) with the activation data, the pooling layer can provide the combined data to the convolutional block 402(2).
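The channel arithmetic described above (31 channels of 56×64 activation data pooled to 28×32 and joined by one landmark channel, giving 32 channels) can be sketched as follows. The use of 2×2 max pooling, the reading of 56×64 as 56 rows by 64 columns, the channels-first list layout, and the function names are assumptions for illustration only.

```python
def max_pool_2x2(channel):
    """Halve the height and width of one channel with 2x2 max pooling."""
    height, width = len(channel), len(channel[0])
    return [
        [max(channel[y][x], channel[y][x + 1],
             channel[y + 1][x], channel[y + 1][x + 1])
         for x in range(0, width - 1, 2)]
        for y in range(0, height - 1, 2)
    ]


def pool_and_attach_landmarks(activations, landmark_frame):
    """Downsample every activation channel and append the landmark channel.

    A full-size landmark frame is downsampled at the same rate first; a
    landmark frame that already matches the pooled size is attached as-is.
    """
    pooled = [max_pool_2x2(channel) for channel in activations]
    if len(landmark_frame) != len(pooled[0]):
        landmark_frame = max_pool_2x2(landmark_frame)
    return pooled + [landmark_frame]


# 31 activation channels of 56 rows x 64 columns, plus a full-size landmark frame.
activations = [[[0.0] * 64 for _ in range(56)] for _ in range(31)]
full_landmark = [[0] * 64 for _ in range(56)]
full_landmark[10][20] = 1
combined = pool_and_attach_landmarks(activations, full_landmark)

# The same call with a landmark frame already downsampled to 28 x 32.
small_landmark = [[0] * 32 for _ in range(28)]
small_landmark[3][4] = 1
combined_small = pool_and_attach_landmarks(activations, small_landmark)
```

Max pooling is used for the landmark frame in this sketch because it keeps thin lines visible after downsampling; an implementation could equally use another downsampling method.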

[0076] In some cases, convolutional block 402(2) can generate 63 channels of activation data with a size of 28×32 pixels. The pooling layer of convolutional block 402(2) can downsample this activation data to produce 63 channels of activation data with a size of 14×16 pixels. To account for and / or mitigate the loss of landmark feature information due to downsampling, the pooling layer can combine the downsampled activation data with the landmark image frame 408(3). In one example, the landmark image frame 408(3) can have a size corresponding to the size of each downsampled channel (e.g., 14×16 pixels). In another example, the landmark image frame 408(3) can have a size corresponding to the size of each channel before downsampling (e.g., 28×32 pixels). In this example, the pooling layer can downsample the landmark image frame 408(3) to the size of the downsampled channels (e.g., 14×16 pixels). Therefore, in either example, the total size of the data output by the pooling layer of convolutional block 402(2) can be 14×16×64. Although not shown in Figure 4B, in some examples, one or more additional convolutional blocks of neural network 400(B) may receive a version of landmark image frame 408(1) of an appropriate size (e.g., downsampled). For example, the pooling layer of convolutional block 402(3) may receive a landmark image frame of size 7×8 pixels. The convolutional blocks of neural network 400(B) may utilize any number of full-size and / or downsampled landmark image frames.

[0077] In some examples, the fully connected layer 404 may determine the expression classification 412 based on the output of the final convolutional or pooling layer of the neural network 400(B) (e.g., convolutional layer 402(5)). For example, the fully connected layer 404 may determine a value corresponding to each candidate facial expression (which corresponds to each class of the neural network 400(B)). In an illustrative example, the fully connected layer 404 may use a softmax activation function that determines the probability associated with each class to select the most suitable candidate facial expression. The fully connected layer 404 may output an indication (e.g., a label) associated with the class with the highest probability. The neural network 400(B) may utilize any additional or alternative functions to determine the expression classification 412.
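The softmax-based selection described above can be sketched as follows. The seven candidate expressions come from the earlier description, but the assignment of identifiers 0 to 6 in the listed order, the example values, and the function names are assumptions made for illustration.

```python
import math

# Assumed identifier order: index 0 is neutral, index 6 is surprised.
EXPRESSIONS = ["neutral", "angry", "disgusted", "fearful",
               "happy", "sad", "surprised"]


def softmax(values):
    """Convert per-class values into probabilities that sum to one."""
    peak = max(values)                      # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def classify(values):
    """Return (identifier, label, probability) of the most probable class."""
    probabilities = softmax(values)
    identifier = max(range(len(probabilities)), key=probabilities.__getitem__)
    return identifier, EXPRESSIONS[identifier], probabilities[identifier]


identifier, label, probability = classify([0.1, 0.2, 0.0, -1.0, 3.0, 0.5, 0.3])
```

The returned identifier plays the role of the integer expression classification, and the returned probability is the confidence associated with it.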

[0078] Figure 4C is a diagram of neural network 400(C), which corresponds to another example implementation of neural network 400(A) of Figure 4A. As shown, the architecture of neural network 400(C) can be generally similar to that of neural network 400(B). For example, neural network 400(C) may include convolutional blocks 402(1)-402(4) and convolutional layer 402(5). In an illustrative example, convolutional blocks 402(1)-402(4) may each include three convolutional layers and one pooling layer. However, neural network 400(C) may include any number or combination of convolutional layers and / or pooling layers. In some cases, neural network 400(C) may receive landmark feature information at one or more fully connected layers (instead of at one or more convolutional layers, as discussed in conjunction with neural network 400(B)). For example, neural network 400(C) may receive blend shape coefficients 414 at a fully connected layer 404(A) instead of receiving landmark image frames 408(1)-408(3) at convolutional blocks 402(1)-402(3). In this example, fully connected layer 404(A) may combine the blend shape coefficients 414 with activation data received from convolutional layer 402(5). For example, fully connected layer 404(A) may concatenate (or otherwise combine, e.g., sum) the activation data with one or more blend shape coefficients. In some cases, fully connected layer 404(A) (or additional fully connected layer 404(B)) may determine expression classification 412 based on the concatenated data.
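The concatenation at the fully connected layer can be sketched as follows. This is a minimal sketch under stated assumptions: the tiny activation volume, the coefficient values, the weights, and the helper names `flatten` and `fully_connected` are all hypothetical. The description states only that the coefficients are combined with the activation data from the final convolutional layer.

```python
def flatten(volume):
    """Flatten a [channels][rows][cols] activation volume into one vector."""
    return [value for channel in volume for row in channel for value in row]


def fully_connected(inputs, weights, biases):
    """Compute one weighted sum plus bias per output node."""
    return [
        sum(w * x for w, x in zip(node_weights, inputs)) + bias
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical activation data from the last convolutional layer: 2 channels of 2x2.
activation = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
]
# Hypothetical coefficients standing in for the landmark feature information.
coefficients = [0.9, 0.0, 0.25]

# Concatenate: 8 activation values followed by 3 coefficients.
features = flatten(activation) + coefficients

# Hypothetical weights for 4 expression classes (one row per class).
weights = [[0.01 * (k + 1)] * len(features) for k in range(4)]
biases = [0.0] * 4
class_scores = fully_connected(features, weights, biases)
print(len(features), len(class_scores))  # 11 4
```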

[0079] As discussed above, neural network 400(B) can perform facial expression recognition using one or more landmark image frames, and neural network 400(C) can perform facial expression recognition using blend shape coefficients. Therefore, each neural network can utilize one form of landmark feature information. However, in some cases, the facial expression recognition system described herein can utilize two (or more) types of landmark feature information (e.g., based on the general neural network architecture shown in Figure 4A). Furthermore, in some examples, the neural network described herein can be trained using a supervised training process to perform facial expression recognition. For example, the neural network can be trained on a training dataset comprising multiple image frames labeled with associated facial expressions. Additionally, to utilize landmark feature information, the neural network can be trained on a training dataset comprising multiple landmark image frames and / or blend shape coefficients labeled with associated facial expressions. In some cases, such supervised training processes can enable the neural network to effectively utilize landmark feature information (information that many conventional facial expression recognition systems may not utilize). The disclosed facial expression recognition system can utilize neural networks trained using any additional or alternative types of training processes, including unsupervised and semi-supervised training processes.

[0080] Figure 5A and Figure 5B show example experimental data demonstrating the advantages of training neural networks to utilize landmark feature information for facial expression recognition. For example, graph 502 of Figure 5A shows the accuracy of a facial expression recognition system trained without using landmark feature information. Graph 504 of Figure 5A shows the accuracy of a facial expression recognition system trained using landmark image frames. Both systems utilize rotated image frames (rather than frontal image frames). As shown, the system trained using landmark image frames achieves higher accuracy (e.g., 0.88) than the system trained without landmark feature information (e.g., 0.86). Graph 506 of Figure 5B shows the accuracy of a facial expression recognition system trained without using landmark feature information. Graph 508 of Figure 5B shows the accuracy of a facial expression recognition system trained using blend shape coefficients. Furthermore, graph 510 of Figure 5B shows the accuracy of a facial expression recognition system trained using landmark image frames. All three systems utilize frontal image frames (rather than rotated image frames). As shown, the accuracies of the systems trained using blend shape coefficients and landmark image frames (e.g., 0.88 and 0.87, respectively) are higher than that of the system trained without landmark feature information (e.g., 0.86).

[0081] Figure 6 is a flowchart illustrating an example process 600 for improved facial expression recognition. For clarity, process 600 is described with reference to the landmark feature system 300 of Figure 3F and the neural networks of Figure 4A, Figure 4B, and Figure 4C. The steps or operations outlined herein are examples and can be implemented in any combination thereof, including combinations that exclude, add, or modify certain steps or operations.

[0082] At operation 602, process 600 includes receiving an image frame corresponding to a face of a person. For example, neural network 400(A) may receive image frame 406. At operation 604, process 600 includes determining landmark feature information associated with landmark features of the face based on a 3D model of the face. In one example, the landmark feature information may include one or more blend shape coefficients determined based on the 3D model. For example, a 3DMM fitter 340 of landmark feature system 300 may generate blend shape coefficients 414. In some cases, 3DMM fitter 340 may generate a 3D model (e.g., a 3D deformable model (3DMM)) and then generate blend shape coefficients 414 based on a comparison between the 3D model and image data corresponding to the face within image frame 406. Additionally or alternatively, the landmark feature information may include a landmark image frame indicating one or more landmark features of the face. For example, a landmark image generator 342 of landmark feature system 300 may generate landmark image frame 408(1) using blend shape coefficients 414. In one example, landmark image generator 342 may determine multiple landmark features of the face based on blend shape coefficients 414. Landmark image generator 342 can also determine a subset of the multiple landmark features that corresponds to key landmark features. For example, key landmark features may correspond to landmark features related to human facial expressions. In some cases, landmark image generator 342 may generate landmark image frame 408(1) by forming one or more connections among the subset of landmark features corresponding to key landmark features. In an illustrative example, landmark image frame 408(1) may be a binary image frame that uses a predetermined pixel value to indicate pixels corresponding to key landmark features.
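A binary landmark image frame of the kind described above can be sketched as follows. Everything specific in this sketch is an assumption made for illustration: the landmark names, their pixel coordinates, the frame size, the simple interpolated line drawing, and the helper names `draw_line` and `make_landmark_frame`. In the described system, the landmark positions would come from the fitted 3D model rather than from hard-coded values.

```python
def draw_line(frame, start, end, value=1):
    """Mark the pixels on a straight segment between two (row, col) points."""
    (r0, c0), (r1, c1) = start, end
    steps = max(abs(r1 - r0), abs(c1 - c0), 1)
    for i in range(steps + 1):
        r = round(r0 + (r1 - r0) * i / steps)
        c = round(c0 + (c1 - c0) * i / steps)
        frame[r][c] = value


def make_landmark_frame(height, width, key_landmarks, connections, value=1):
    """Build a binary frame: `value` on key landmarks and their connections, 0 elsewhere."""
    frame = [[0] * width for _ in range(height)]
    for row, col in key_landmarks.values():
        frame[row][col] = value
    for name_a, name_b in connections:
        draw_line(frame, key_landmarks[name_a], key_landmarks[name_b], value)
    return frame


# Hypothetical key landmarks as (row, col) pixel positions in a 12x12 frame.
key_landmarks = {
    "brow_left": (3, 2),
    "brow_right": (3, 9),
    "mouth_left": (8, 2),
    "mouth_right": (8, 9),
}
connections = [("brow_left", "brow_right"), ("mouth_left", "mouth_right")]

landmark_frame = make_landmark_frame(12, 12, key_landmarks, connections)
for row in landmark_frame:
    print("".join(str(pixel) for pixel in row))
```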

[0083] At operation 606, process 600 includes inputting the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition. For example, neural network 400(A) may receive image frame 406 and receive blend shape coefficients 414 and / or landmark image frame 408(1). In one example, a fully connected layer of neural network 400(C) (e.g., an implementation of neural network 400(A)) may receive the blend shape coefficients 414. In this example, the fully connected layer may concatenate the blend shape coefficients 414 with data output from a convolutional layer of neural network 400(C).

[0084] In other examples, the first layer of neural network 400(B) (e.g., another implementation of neural network 400(A)) may receive landmark image frame 408(1). In this example, landmark image frame 408(1) represents a first version of the landmark image frame associated with image frame 406. Furthermore, landmark image frame 408(1) may have a first resolution (e.g., a resolution corresponding to the resolution of image frame 406). In some cases, the second layer of neural network 400(B) may receive landmark image frame 408(2). The second layer may occur after the first layer. Furthermore, landmark image frame 408(2) may correspond to a second version of the landmark image frame having a second resolution lower than the first resolution.

[0085] In some cases, the first and second layers of the neural network 400(B) are convolutional layers. In some examples, the neural network 400(B) may include a pooling layer between the first and second layers. The pooling layer may be configured to downsample the activation data output by the first layer to the second resolution of the second version of the landmark image frame (e.g., landmark image frame 408(2)). The pooling layer may also receive the second version of the landmark image frame and pass the downsampled activation data and the second version of the landmark image frame to the second layer of the neural network 400(B).

[0086] At operation 608, process 600 includes using the neural network to determine a facial expression associated with the face. For example, neural network 400(A) may determine expression classification 412. In one example, expression classification 412 may represent a facial expression from a set of candidate facial expressions that neural network 400(A) is trained to identify. In some cases, using landmark feature information (e.g., blend shape coefficients 414 and / or one or more versions of landmark image frame 408(1)) may enable neural network 400(A) to determine expression classification 412 more accurately and / or efficiently.

[0087] In some examples, process 600 may also include training a neural network 400(A) using a training dataset. The training data may include multiple image frames corresponding to the faces of multiple people. In one example, the multiple image frames may be labeled with facial expressions associated with the faces of multiple people. In some cases, the training data may also include multiple landmark feature information associated with the multiple image frames.

[0088] Figure 7 is a diagram illustrating an example of a visualization model 700 of a neural network. Model 700 can correspond to an example architecture of neural network 400(A) of Figure 4A, neural network 400(B) of Figure 4B, and / or neural network 400(C) of Figure 4C. In this example, model 700 includes an input layer 704, an intermediate layer commonly referred to as a hidden layer 706, and an output layer 708. Each layer includes a number of nodes 702. In this example, each node 702 of the input layer 704 is connected to each node 702 of the hidden layer 706. The connections, which are referred to as synapses in brain models, are called weights 770. The input layer 704 can receive input and can propagate the input to the hidden layer 706. Also in this example, each node 702 of the hidden layer 706 has connections or weights 770 with each node 702 of the output layer 708. In some cases, neural network implementations may include multiple hidden layers. The weighted sum calculated by the hidden layer 706 (or multiple hidden layers) is propagated to the output layer 708, which may present the final output for different uses (e.g., providing classification results, detecting objects, tracking objects, and / or other suitable uses). The outputs (weighted sums) of the different nodes 702 may be referred to as activations (also known as activation data), consistent with brain models.

[0089] Examples of computations that can occur at each layer in the example visualization model 700 are as follows:

[0090] yj = f(Σi (Wij × xi) + b)

[0091] In the above equation, Wij is the weight, xi is the input activation, yj is the output activation, f() is a nonlinear function, and b is the bias term. Using the input image as an example, each connection between a node and its receptive field can learn the weight Wij, and in some cases, the global bias b, allowing each node to learn to analyze its specific local receptive field in the input image. Each node in the hidden layer can have the same weights and biases (called shared weights and shared biases). Various nonlinear functions can be used to achieve different purposes.
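The per-layer computation described by these symbols, where each output activation yj is the nonlinear function f applied to the weighted sum of the input activations xi plus the bias b, can be sketched as follows. The layer sizes, weight values, and the choice of tanh for f() are assumptions made for illustration, and `layer_forward` is a hypothetical helper name.

```python
import math


def layer_forward(x, W, b, f):
    """Compute y[j] = f(sum_i W[i][j] * x[i] + b) for every output node j.

    W[i][j] is the weight on the connection from input node i to output node j.
    """
    num_outputs = len(W[0])
    return [
        f(sum(W[i][j] * x[i] for i in range(len(x))) + b)
        for j in range(num_outputs)
    ]


# Hypothetical layer: 3 input activations, 2 output activations, global bias b.
x = [1.0, 2.0, 3.0]
W = [
    [0.1, -0.2],
    [0.0, 0.5],
    [0.3, -0.4],
]
b = 0.1

y = layer_forward(x, W, b, math.tanh)
print(y)  # [tanh(1.1), tanh(-0.3)]
```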

[0092] Model 700 can be described as a directed, weighted graph. In a directed graph, each connection to or from a node indicates a direction (e.g., entering or leaving the node). In a weighted graph, each connection has a weight. Tools used to develop neural networks can visualize them as directed, weighted graphs for easier understanding and debugging. In some cases, these tools can also be used to train the neural network and output the trained weight values. Executing the neural network then involves using these weights to perform computations on the input data.

[0093] Neural networks with three or more layers (e.g., more than one hidden layer) are sometimes called deep neural networks. For example, deep neural networks can have anywhere from five to over a thousand layers. Compared to shallower neural networks, neural networks with many layers are capable of learning high-level tasks with greater complexity and abstraction. For example, a deep neural network can be taught to recognize objects or scenes in an image. In this example, the pixels of an image can be fed into the input layer of the deep neural network, and the output of the first layer can indicate the presence of low-level features in the image, such as lines and edges. In subsequent layers, these features can be combined to measure the likely presence of higher-level features: lines can be combined into shapes, and shapes can be further combined into sets of shapes. Given this information, a deep neural network can output the probability that the high-level features represent a particular object or scene. For example, a deep neural network can output whether an image contains a cat.

[0094] The learning phase of a neural network is called training the neural network. During training, the neural network is taught to perform a task. During the learning process, the values of the weights (and possibly biases) are determined. The underlying program of the neural network (e.g., the organization of nodes into layers, the connections between the nodes of each layer, and the computation performed by each node) does not need to change during training. Once training is complete, the neural network can perform the task by computing a result using the weight values (and in some cases, bias values) determined during training. For example, a neural network might output the probability that an image contains a specific object, the probability that an audio sequence contains a specific word, a bounding box around an object in an image, or a suggested action. Running the program of a trained neural network is called inference.

[0095] There are several methods for training weights. One method is called supervised learning. In supervised learning, all training samples are labeled, so feeding each training sample into the neural network produces a known result. Another method is called unsupervised learning, where the training samples are not labeled. In unsupervised learning, the goal of training is to find structures or clusters in the data. Semi-supervised learning lies between supervised and unsupervised learning. In semi-supervised learning, a subset of the training data is labeled. Unlabeled data can be used to define cluster boundaries, and labeled data can be used to label clusters.

[0096] Different kinds of neural networks have been developed. Examples of neural networks can be divided into two forms: feedforward and recurrent. Figure 8A illustrates an example of a neural network model 810 that includes feedforward weights 812 between the input layer 804 and the hidden layer 806, and recurrent weights 814 at the output layer 808. In a feedforward neural network, computation is a series of operations performed on the outputs of previous layers, with the final layer generating the output of the neural network. In the example shown in Figure 8A, feedforward is illustrated by the hidden layer 806, whose nodes 802 operate only on the outputs of nodes 802 in the input layer 804. A feedforward neural network has no memory, and the output for a given input is always the same, regardless of any previous inputs given to the neural network. A multilayer perceptron (MLP) is a type of neural network with only feedforward weights.

[0097] Conversely, recurrent neural networks (RNNs) possess internal memory, allowing dependencies on earlier inputs to influence the output. In an RNN, intermediate operations can generate internally stored values that can be used as input for other operations, in combination with the processing of subsequent input data. In the example of Figure 8A, recurrence is shown at the output layer 808, where the output of a node 802 of the output layer 808 is connected back to the input of a node 802 of the output layer 808. These loop connections can be referred to as recurrent weights 814. Long Short-Term Memory (LSTM) is a commonly used variant of recurrent neural networks.
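The difference between a memoryless feedforward computation and a computation with a loop connection can be sketched as follows. This is a deliberately minimal single-node sketch: the weight values, the omission of a nonlinear function, and the names `feedforward_step` and `RecurrentNode` are assumptions made for illustration.

```python
def feedforward_step(x, w):
    """Feedforward: the output depends only on the current input."""
    return w * x


class RecurrentNode:
    """A single node whose previous output is fed back as an extra input."""

    def __init__(self, w_in, w_rec):
        self.w_in = w_in      # weight on the current input
        self.w_rec = w_rec    # weight on the loop connection
        self.state = 0.0      # internally stored value (memory)

    def step(self, x):
        self.state = self.w_in * x + self.w_rec * self.state
        return self.state


inputs = [1.0, 1.0, 1.0]

# The same input always gives the same feedforward output.
ff_outputs = [feedforward_step(x, 2.0) for x in inputs]

# With the loop connection, identical inputs give different outputs over time.
node = RecurrentNode(w_in=2.0, w_rec=0.5)
rnn_outputs = [node.step(x) for x in inputs]

print(ff_outputs)   # [2.0, 2.0, 2.0]
print(rnn_outputs)  # [2.0, 3.0, 3.5]
```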

[0098] Figure 8B is a diagram illustrating an example of a neural network model 820 that includes different connection types. In this example model 820, the input layer 804 and the hidden layer 806 are fully connected layers 822. In a fully connected layer, all output activations consist of weighted input activations (e.g., the outputs of all nodes 802 in the input layer 804 are connected to the inputs of all nodes 802 in the hidden layer 806). Fully connected layers can require significant storage and computation. A multilayer perceptron neural network is a type of fully connected neural network.

[0099] In some applications, some connections between activations can be removed, for example, by setting the weights of these connections to zero, without affecting the accuracy of the output. The result is sparsely connected layers 824, illustrated in Figure 8B by the weights between the hidden layer 806 and the output layer 808. Pooling is another example of a method that can achieve sparsely connected layers 824. In pooling, the outputs of a cluster of nodes can be combined, for example, by finding the maximum, minimum, average, or median value.

[0100] A class of neural networks known as convolutional neural networks (CNNs) is particularly effective for image recognition and classification, such as facial expression recognition and / or classification. For example, a convolutional neural network can learn image categories and output the statistical probability that an input image falls into one of the categories.

[0101] Figure 9 is a diagram of an example convolutional neural network model 900. Model 900 illustrates operations that can be included in a convolutional neural network: convolution, activation, pooling (also known as subsampling), batch normalization, and output generation (e.g., fully connected layers). As an example, the convolutional neural network shown in model 900 is a classification network that provides output predictions 914 for different classes of objects (e.g., dog, cat, boat, bird). Any given convolutional network includes at least one convolutional layer and can have many convolutional layers. Additionally, a pooling layer is not required after each convolutional layer. In some examples, pooling layers may appear after multiple convolutional layers, or pooling layers may not appear at all. The example convolutional network shown in Figure 9 classifies the input image 920 into one of four categories: dog, cat, boat, or bird. In the example shown, when an image of a boat is received as input, the example neural network outputs "boat" with the highest probability (0.94) in the output prediction 914.

[0102] To produce the output prediction 914 shown, the example convolutional neural network performs a first convolution 902 with rectified linear units (ReLU), pooling 904, a second convolution 906 with ReLU, additional pooling 908, and then classification using two fully connected layers 910 and 912. In the first convolution operation with ReLU 902, the input image 920 is convolved to produce one or more output feature maps 922 (including activation data). The first pooling operation 904 produces an additional feature map 924, which serves as the input feature map for the second convolution operation with ReLU 906. The second convolution operation 906 with ReLU produces a second set of output feature maps 926 with activation data. The additional pooling step 908 also produces a feature map 928, which is fed into the first fully connected layer 910. The output of the first fully connected layer 910 is fed into the second fully connected layer 912. The output of the second fully connected layer 912 is the output prediction 914. In convolutional neural networks, the terms "higher layers" and "higher-level layers" refer to layers that are further away from the input image (e.g., in example model 900, the second fully connected layer 912 is the highest layer).

[0103] The example shown in Figure 9 is one example of a convolutional neural network. Other examples may include additional or fewer convolution operations, ReLU operations, pooling operations, and / or fully connected layers. Convolution, non-linearity (ReLU), pooling or subsampling, and classification operations are explained in more detail below.

[0104] When performing image processing functions (e.g., image recognition, object detection, object classification, object tracking, or other suitable functions), convolutional neural networks can operate on the numerical or digital representation of an image. An image can be represented in a computer as a matrix of pixel values. For example, a video frame captured at 1080p consists of an array of pixels, 1920 pixels wide and 1080 pixels high. Each component of an image can be referred to as a channel. For example, a color image has three color channels: red (R), green (G), and blue (B), or luma (Y), red chroma (Cr), and blue chroma (Cb). In this example, the color image can be represented as three two-dimensional matrices, one for each color, where the horizontal and vertical axes indicate the position of the pixels in the image, and values between 0 and 255 indicate the color intensity of the pixel. As another example, a grayscale image has only one channel and can therefore be represented as a single two-dimensional matrix of pixel values. In this example, the pixel values can also be between 0 and 255, where, for example, 0 indicates black and 255 indicates white. In these examples, the upper limit of 255 assumes that each pixel is represented by an 8-bit value. In other examples, pixels can be represented with more bits (e.g., 16 bits, 32 bits, or more), and therefore can have a higher upper limit.
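The channel representation and the relationship between bit depth and the upper limit of a pixel value can be sketched as follows. The tiny 2×2 image and the helper names `split_channels` and `max_pixel_value` are assumptions made for illustration.

```python
def split_channels(image):
    """Split an image stored as image[row][col] = (R, G, B) into 2D matrices."""
    num_channels = len(image[0][0])
    return [
        [[pixel[k] for pixel in row] for row in image]
        for k in range(num_channels)
    ]


def max_pixel_value(bits_per_value):
    """Upper limit of an unsigned pixel value with the given bit depth."""
    return 2 ** bits_per_value - 1


# Hypothetical 2x2 color image with 8-bit values.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
red, green, blue = split_channels(image)

print(red)                  # [[255, 0], [0, 255]]
print(max_pixel_value(8))   # 255
print(max_pixel_value(16))  # 65535
print(1920 * 1080)          # 2073600 pixels per channel in a 1080p frame
```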

[0105] As shown in Figure 9, a convolutional network is a sequence of layers. Each layer of a convolutional neural network transforms one volume of activation data (also referred to as activations) into another volume of activations through a differentiable function. For example, each layer can accept an input 3D volume and transform that input 3D volume into an output 3D volume through a differentiable function. The three types of layers that can be used to build a convolutional neural network architecture include convolutional layers, pooling layers, and one or more fully connected layers. The network also includes an input layer that holds the raw pixel values of the image. For example, an example image might have a width of 32 pixels, a height of 32 pixels, and three color channels (e.g., R, G, and B color channels). Each node in a convolutional layer is connected to a region of nodes (pixels) in the input image. This region is called the receptive field. In some cases, a convolutional layer can compute the output of nodes (also called neurons) connected to local regions of the input, with each node computing a dot product between its weights and the small region it is connected to in the input volume. If 12 filters are used, such a computation can produce a volume of [32×32×12]. ReLU layers can apply element-wise activation functions, such as max(0,x) with a zero threshold, which keeps the volume size constant at [32×32×12]. Pooling layers can perform downsampling operations along the spatial dimensions (width, height), producing data with a reduced volume, for example, a volume of [16×16×12]. Fully connected layers can compute class scores, producing a volume of [1×1×4], where each of the four numbers corresponds to a class score, such as class scores for the four categories of dog, cat, boat, and bird. The CIFAR-10 network is an example of such a network and has ten object categories.
Using such neural networks, the raw image can be transformed layer by layer from raw pixel values to the final class scores. Some layers contain parameters, while others may not. For example, the transformation performed by convolutional and fully connected layers is a function of the activations in the input volume and also a function of the node parameters (weights and biases), while ReLU and pooling layers can implement fixed functions.
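The sequence of volume sizes in this example can be traced with a small sketch. Only shapes are tracked here, not activation values. The helper names are hypothetical, and the sketch assumes that the convolution uses a stride of one with zero-padding so that the width and height are preserved, and that pooling uses a 2×2 window with a stride of two.

```python
def conv_shape(shape, num_filters):
    """Convolution with stride 1 and size-preserving zero-padding (assumed)."""
    height, width, _ = shape
    return (height, width, num_filters)


def relu_shape(shape):
    """An element-wise activation leaves the volume size unchanged."""
    return shape


def pool_shape(shape, window=2):
    """Pooling with a window x window neighborhood and matching stride (assumed)."""
    height, width, depth = shape
    return (height // window, width // window, depth)


def fully_connected_shape(num_classes):
    """A fully connected layer produces one score per class."""
    return (1, 1, num_classes)


trace = [(32, 32, 3)]                        # input: 32x32 pixels, 3 color channels
trace.append(conv_shape(trace[-1], 12))      # 12 filters
trace.append(relu_shape(trace[-1]))
trace.append(pool_shape(trace[-1]))
trace.append(fully_connected_shape(4))       # dog, cat, boat, bird

for shape in trace:
    print(shape)
```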

[0106] Convolution is a mathematical operation used to extract features from an input image. Extractable features include edges, curves, corners, blobs, and ridges. Convolution preserves the spatial relationships between pixels by learning image features from small squares of input data.

[0107] Figure 10A, Figure 10B, and Figure 10C are diagrams illustrating a simplified example of a convolution operation. Figure 10A shows an example input matrix 1010 of pixel values. In this example, the input image represented by input matrix 1010 is five pixels wide by five pixels high. For the purposes of this example, the pixel values are only 0 or 1. In other examples, as mentioned above, the range of pixel values can be 0 to 255. Because there is only one input matrix 1010, the image has only one channel and can be assumed to be a grayscale image.

[0108] Figure 10B shows an example of a filter 1020, which can also be called a kernel or feature detector. Filter 1020 can be used to extract different features from an image, such as edges, curves, corners, etc., by changing the values in the matrix of filter 1020. In this simplified example, the matrix values are 0 or 1. In other examples, the matrix values can be greater than 1, can be negative, and / or can be fractions.

[0109] Figure 10C illustrates the convolution of input matrix 1010 with filter 1020. The convolution operation involves computing a value for each possible position of filter 1020 over input matrix 1010 by multiplying the values of input matrix 1010 by the values of filter 1020 and summing the resulting products. In one example, as shown in Figure 10C, filter 1020 overlaps with the (x, y) positions (0, 0), (1, 0), (2, 0), (0, 1), (1, 1), (2, 1), (0, 2), (1, 2), and (2, 2) of input matrix 1010; this is referred to as the receptive field of the filter. For example, the value of input matrix 1010 at position (0, 0) is multiplied by the value of filter 1020 at position (0, 0) to produce a product of 1 (based on a product of 1 × 1). For each receptive field of filter 1020 in the input matrix, the multiplication is repeated for each position in filter 1020 that overlaps with a position in input matrix 1010. The products are then summed to produce the value of 4 for the filter position shown.

[0110] The value 4 is placed at position (0, 0) in feature map 1030, which can also be called a convolutional feature map or activation map, and includes activation data. As shown in Figure 10C, the position (0, 0) corresponds to the position of the filter. To obtain the value at position (1, 0) in feature map 1030, filter 1020 slides one pixel to the right (referred to as a stride of one) and the multiply-add operation is repeated. To obtain the value at position (0, 1) in feature map 1030, filter 1020 can be moved to overlap positions (0, 1) to (2, 3) in input matrix 1010. Similar operations can be performed to obtain the values at the remaining positions in feature map 1030.
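The sliding multiply-add operation can be sketched as follows. The figures are not reproduced in this text, so the input matrix and filter values below are assumptions: they are chosen to be consistent with the description (five-by-five input of 0s and 1s, three-by-three filter of 0s and 1s, a 1 × 1 product at position (0, 0), and a sum of 4 for the first filter position). The function name `convolve2d` is hypothetical. As is common in convolutional neural networks, the filter is applied without flipping.

```python
def convolve2d(matrix, kernel, stride=1):
    """Slide a square kernel over the matrix, multiplying and summing at each position."""
    k = len(kernel)
    out_rows = (len(matrix) - k) // stride + 1
    out_cols = (len(matrix[0]) - k) // stride + 1
    feature_map = []
    for out_y in range(out_rows):
        row = []
        for out_x in range(out_cols):
            top, left = out_y * stride, out_x * stride
            row.append(sum(
                matrix[top + dy][left + dx] * kernel[dy][dx]
                for dy in range(k)
                for dx in range(k)
            ))
        feature_map.append(row)
    return feature_map


# Assumed example values (the figure contents are not available in the text).
input_matrix = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
filter_matrix = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

# feature_map[y][x] holds the value for (x, y) position of the filter.
feature_map = convolve2d(input_matrix, filter_matrix)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

With these assumed values, the entry at position (0, 0) of the result is 4, which matches the sum described above.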

[0111] In examples with more channels, filter 1020 can be applied to the input matrix 1010 for each channel. For example, a color image can have three channels, and therefore three input matrices. In this example, the convolution of the three input matrices can thus produce three feature maps for each receptive field in the input matrix 1010.

[0112] In practice, filter values (also known as weights) are determined during the training of the neural network. Therefore, the design of a convolutional neural network involves specifying factors such as the number of filters used, the filter sizes, and the network architecture, including the number of layers and the operations performed in each layer.

[0113] The size of a feature map can be controlled by three parameters: depth, stride, and zero-padding. Depth corresponds to the number of filters used in the convolution operation. Applying more filters allows for the extraction of more features, and the neural network may be able to produce more accurate recognitions. However, each additional filter increases the amount of computation required. Additionally, each filter produces a separate feature map, which requires additional storage space. The set of feature maps extracted within the same convolutional network can be viewed as a stack of two-dimensional matrices, sometimes collectively referred to as a single feature map; in this case, the depth of the feature map is the number of two-dimensional matrices.

[0114] The stride (also referred to as the step size) is the number of samples (e.g., pixels) by which the filter matrix moves across the input matrix. In the example of Figure 10C, the filter matrix 1020 moves one pixel at a time, so the stride is equal to 1. As an illustrative example, when the filter stride is 2, the filter moves two pixels between convolution calculations. A larger stride produces a smaller feature map.
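The effect of the stride on the feature map size can be sketched with the standard output-size relationship for a square input and a square filter. The function name `feature_map_size` and the optional `padding` parameter (the number of zero-valued pixels added on each side of the input) are assumptions added for illustration.

```python
def feature_map_size(input_size, filter_size, stride=1, padding=0):
    """Number of filter positions along one dimension of the input."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# 5x5 input with a 3x3 filter, as in the simplified convolution example.
print(feature_map_size(5, 3, stride=1))             # 3
print(feature_map_size(5, 3, stride=2))             # 2
print(feature_map_size(5, 3, stride=1, padding=1))  # 5
```

As the second call shows, a stride of two over the same input produces a smaller feature map than a stride of one.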

[0115] In the example of Figure 10C, some information at the edges of the input matrix 1010 might not be captured well. This is because, in this example, the filter is applied only once (e.g., at position (0, 0)) or twice (e.g., at position (0, 1)) at some edge locations, while the filter is applied more times at locations closer to the center (e.g., nine times at position (2, 2)). In some cases, losing edge information is acceptable. When it is not desired to lose edge information, zero-padding can be applied, that is, the input matrix is increased by the same number of pixels in all directions, and the new positions are assigned values of zero. Zero-padding can also be used to control the size of the feature map. Adding zero-padding can be called wide convolution, while not using zero-padding can be called narrow convolution.
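How often each input position falls under the filter, and how zero-padding changes that, can be checked with a short sketch. The function names `coverage_counts` and `zero_pad` are hypothetical, and the sketch assumes a five-by-five input, a three-by-three filter, and a stride of one.

```python
def coverage_counts(height, width, filter_size, stride=1):
    """Count how many filter positions cover each input position."""
    counts = [[0] * width for _ in range(height)]
    for top in range(0, height - filter_size + 1, stride):
        for left in range(0, width - filter_size + 1, stride):
            for dy in range(filter_size):
                for dx in range(filter_size):
                    counts[top + dy][left + dx] += 1
    return counts


def zero_pad(matrix, pad):
    """Grow the matrix by `pad` zero-valued pixels in every direction."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in matrix]
    padded += [[0] * width for _ in range(pad)]
    return padded


# Without padding: corner pixels are covered once, the center nine times.
counts = coverage_counts(5, 5, 3)
for row in counts:
    print(row)

# With one pixel of zero-padding the input becomes 7x7, and the original
# corner pixel (now at index [1][1]) is covered four times instead of once.
padded_counts = coverage_counts(7, 7, 3)
print(padded_counts[1][1])  # 4
```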

[0116] In some cases, an operation called ReLU is applied to feature maps. ReLU is an abbreviation for Rectified Linear Unit, a type of activation layer. ReLU is a non-linear operation, and its output can be given by the following formula:

[0117] Output = Max(0, Input)

[0118] ReLU is an element-wise operation applied to each pixel. ReLU replaces all negative pixel values in the feature map with zero. Convolution is a linear operation, including element-wise matrix multiplication and addition. ReLU introduces non-linearity into convolutional neural networks, based on the assumption that most of the real-world data that convolutional neural networks learn is non-linear. Other non-linear functions, such as tanh or sigmoid, can be used.
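The element-wise ReLU operation can be sketched as follows. The feature map values are hypothetical.

```python
def relu(feature_map):
    """Apply Output = Max(0, Input) to every element of a 2D feature map."""
    return [[max(0, value) for value in row] for row in feature_map]


# Hypothetical feature map containing negative values.
feature_map = [
    [3, -1, 0],
    [-7, 5, -2],
]
rectified = relu(feature_map)
print(rectified)  # [[3, 0, 0], [0, 5, 0]]
```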

[0119] Convolutional neural networks also include pooling, which can also be called subsampling or downsampling. Pooling reduces the dimensionality of feature maps while retaining the most important information. Various pooling functions can be used, such as finding the maximum or minimum value, finding the average value, and summing.

[0120] Figure 11 is a diagram illustrating an example of max pooling applied to the rectified feature map 1110, that is, a feature map to which ReLU or another non-linear function has been applied. In the example of Figure 11, a spatial neighborhood of 2 pixels wide by 2 pixels high is defined. Within the 2×2 window, the maximum value is taken and placed in the pooled feature map 1140. Furthermore, in this example, for each maximum value calculation, the window is shifted by two pixels (also known as a stride of two), thus reducing the 4×4 rectified feature map 1110 to the 2×2 pooled feature map 1140. In other examples, the average, the sum, or another calculation over the values within the window can be used to produce the pooled feature map 1140. Max pooling is commonly used.
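Pooling with a 2×2 window and a stride of two can be sketched as follows. The 4×4 values are assumptions, since the figure contents are not reproduced in this text, and the names `pool` and `mean` are hypothetical. The reducing function is a parameter so that max, average, or sum pooling can be compared.

```python
def pool(feature_map, window=2, stride=2, reduce=max):
    """Reduce each window x window neighborhood to a single value."""
    row_starts = range(0, len(feature_map) - window + 1, stride)
    col_starts = range(0, len(feature_map[0]) - window + 1, stride)
    return [
        [
            reduce([
                feature_map[r + dy][c + dx]
                for dy in range(window)
                for dx in range(window)
            ])
            for c in col_starts
        ]
        for r in row_starts
    ]


def mean(values):
    return sum(values) / len(values)


# Assumed 4x4 rectified feature map (all values non-negative).
rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool(rectified))               # [[6, 8], [3, 4]]
print(pool(rectified, reduce=mean))  # [[3.25, 5.25], [2.0, 2.0]]
print(pool(rectified, reduce=sum))   # [[13, 21], [8, 8]]
```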

[0121] When pooling is applied, the operation is performed individually on each feature map output by the convolutional layer (or convolutional and ReLU layers). Therefore, the number of pooled feature maps from the pooling layer is the same as the number of feature maps input to the pooling layer.

[0122] Convolutional neural networks can include pooling to progressively reduce the spatial size of the input representation. For example, pooling can make the input representation (such as the feature dimension) smaller and more manageable. As another example, pooling can reduce the number of parameters and computations that the neural network needs to perform. As yet another example, pooling can make the neural network invariant to small transformations, distortions, or translations in the input image. That is, small distortions in the input are unlikely to change the output of the pooling because the maximum (or average, sum, or other operation) is taken over a local neighborhood. As a further example, pooling can help determine an almost scale-invariant representation of an image (referred to as an equivariant representation). This means that objects in an image can be detected regardless of their location within the image.

[0123] As shown in the example of Figure 9, a convolutional neural network can include multiple convolutional layers, each refining the features extracted by the previous layer. Each convolutional layer can be, but does not have to be, followed by pooling. The combined output of these layers represents high-level features of the input image, such as the presence of certain shapes, colors, textures, gradients, etc.

[0124] To transform these feature maps into classifications, a convolutional neural network can include one or more fully connected layers. In some cases, a multilayer perceptron that uses, for example, a softmax activation function can be used after the fully connected layers. Fully connected layers can classify input images into various classes based on training data. For example, the convolutional neural network of Figure 9 is trained to recognize dogs, cats, boats, and birds, and can classify objects in an input image into one of these classes.

[0125] Besides classification, fully connected layers in convolutional neural networks can also provide an inexpensive way (in terms of computation and / or data storage) to learn non-linear combinations of extracted features. Features extracted by convolutional and pooling layers may be well-suited for classification, but combinations of features may be even better.

[0126] In the example of Figure 9, because the output layer uses the softmax activation function, the sum of the output predictions 914 is 1. The softmax function takes a vector of arbitrary real-valued scores and compresses these values into a vector of values between 0 and 1 that sum to 1.
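A minimal sketch of the softmax function follows. The input scores are hypothetical, and subtracting the maximum score before exponentiation is a common numerical-stability choice rather than something stated in this description.

```python
import math


def softmax(scores):
    """Map arbitrary real-valued scores to values between 0 and 1 that sum to 1."""
    shift = max(scores)  # shifting by the maximum does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the four classes (dog, cat, boat, bird).
probs = softmax([2.0, 1.0, -1.0, 0.5])
print(probs, sum(probs))
```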

[0127] As mentioned above, the filter values are determined during training of the convolutional neural network. For example, training can be accomplished using backpropagation. This technique involves, first, initializing all filter values and related parameters with random values and, second, feeding a training image into the neural network. In this step, since the weights are randomly assigned, the output probabilities may also be random. For example, the output vector of the neural network of Figure 9 may be [0.2, 0.4, 0.1, 0.3], representing the probabilities that the training image is a dog, a cat, a boat, or a bird, respectively.

[0128] Next, the total error of the output layer can be calculated, as shown below:

[0129] Total Error = Σ ½ (target probability − output probability)²

[0130] In the above formula, the target probability is a vector representing the expected outcome. For example, for the input image 920 shown in Figure 9, the target probability is [0, 0, 1, 0].
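As a worked example, the following sketch computes the total error for the output vector [0.2, 0.4, 0.1, 0.3] and the target probability [0, 0, 1, 0] given above. It assumes the total error is half the sum of squared differences between target and output probabilities, which is a common choice; `total_error` is an illustrative name.

```python
def total_error(target, output):
    """Half the sum of squared differences between target and output probabilities."""
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))


output = [0.2, 0.4, 0.1, 0.3]   # example output vector (dog, cat, boat, bird)
target = [0, 0, 1, 0]           # the training image is a boat
error = total_error(target, output)
print(round(error, 4))
```

Under this assumption the error is about 0.55, most of which comes from the low probability (0.1) assigned to the correct class.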

[0131] The fourth step is to use backpropagation to calculate the gradient of the error with respect to all weights in the neural network. Gradient descent can then be used to update all filter values, weights, and parameter values to minimize the output error. For example, weights can be adjusted in proportion to their contribution to the total error. When the same training image is input again, the output vector may be closer to the target probability. Backpropagation can be repeated until the output vector is within the expected range of the target probability. The above training steps can be repeated for each image in the training dataset.
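The update loop described above can be sketched on a deliberately simplified model: a single linear layer with a half-squared-error loss, no convolution and no softmax. This is an illustration of gradient descent under those assumptions, not the network of Figure 9; the input, initial weights, and learning rate are hypothetical and fixed (rather than random) so the run is repeatable.

```python
def forward(weights, x):
    """Linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def total_error(target, output):
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))


def gradient_step(weights, x, target, lr):
    """One gradient-descent update of every weight."""
    output = forward(weights, x)
    # d(error) / d(w[i][j]) = -(target[i] - output[i]) * x[j]
    return [[w - lr * (-(t - o) * xj) for w, xj in zip(row, x)]
            for row, t, o in zip(weights, target, output)]


x = [1.0, 0.5]                        # hypothetical input features
target = [0.0, 0.0, 1.0, 0.0]         # target probability
weights = [[0.2, 0.0], [0.4, 0.0], [0.1, 0.0], [0.3, 0.0]]

errors = []
for _ in range(20):
    errors.append(total_error(target, forward(weights, x)))
    weights = gradient_step(weights, x, target, lr=0.1)

print(round(errors[0], 4), round(errors[-1], 4))
```

Each weight moves in proportion to its contribution to the error, so presenting the same input repeatedly drives the output toward the target.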

[0132] During training, parameters such as the number of filters, filter size, and layer organization remain constant. Only the values of the filter matrix and connection weights change during training.

[0133] Research has found that the more convolutional steps a neural network has, the more features it can learn to recognize. For example, in an image classification example, the neural network can learn to detect edges from raw pixels in the first layer, use the edges to detect shapes in the second layer, and use the shapes in the third layer to determine higher-level features, such as facial shapes.

[0134] Figure 12 is a diagram illustrating an example of a system for implementing certain aspects of this technology. Specifically, Figure 12 shows an example of computing system 1200. For example, the computing system can be any computing device constituting an internal computing system, a remote computing system, a camera, or any component thereof, wherein the components of the system communicate with each other using connection 1205. Connection 1205 can be a physical connection using a bus, or it can be a direct connection to processor 1210, such as in a chipset architecture. Connection 1205 can also be a virtual connection, a network connection, or a logical connection.

[0135] In some embodiments, the computing system 1200 is a distributed system, wherein the functions described herein may be distributed across a data center, multiple data centers, a peer-to-peer network, etc. In some embodiments, one or more of the described system components represent a plurality of such components, each performing some or all of the functions of the described components. In some embodiments, the components may be physical or virtual devices.

[0136] Example system 1200 includes at least one processing unit (CPU or processor) 1210 and a connection 1205 that couples various system components to processor 1210, including system memory 1215, such as read-only memory (ROM) 1220 and random access memory (RAM) 1225. Computing system 1200 may include a cache 1212 of high-speed memory that is directly connected to, closely proximate to, or integrated into processor 1210.

[0137] Processor 1210 may include any general-purpose processor and a hardware or software service, such as services 1232, 1234, and 1236 stored in storage device 1230, configured to control processor 1210, as well as a special-purpose processor in which software instructions are incorporated into the actual processor design. Processor 1210 may essentially be a self-contained computing system, including multiple cores or processors, buses, memory controllers, caches, etc. Multi-core processors may be symmetric or asymmetric.

[0138] To enable user interaction, the computing system 1200 includes an input device 1245, which can represent any number of input mechanisms, such as a microphone for voice, a touch-sensitive screen for gesture or graphical input, a keyboard, a mouse, motion input, voice input, etc. The computing system 1200 may also include an output device 1235, which can be one or more of a variety of output mechanisms. In some cases, a multimodal system allows the user to provide multiple types of input / output to communicate with the computing system 1200. The computing system 1200 may include a communication interface 1240, which typically controls and manages user input and system output. The communication interface can use wired and / or wireless transceivers to perform or facilitate the reception and / or transmission of wired or wireless communications, including those using audio jacks / plugs, microphone jacks / plugs, Universal Serial Bus (USB) ports / plugs, Ethernet ports / plugs, fiber optic ports / plugs, proprietary wired ports / plugs, Bluetooth Low Energy (BLE) wireless signal transmission, radio frequency identification (RFID) wireless signal transmission, near field communication (NFC) wireless signal transmission, dedicated short range communication (DSRC) wireless signal transmission, 802.11 Wi-Fi wireless signal transmission, wireless local area network (WLAN) signal transmission, visible light communication (VLC), Worldwide Interoperability for Microwave Access (WiMAX), infrared (IR) communication wireless signal transmission, public switched telephone network (PSTN) signal transmission, integrated services digital network (ISDN) signal transmission, 3G / 4G / 5G / LTE cellular data network wireless signal transmission, ad-hoc network signal transmission, radio wave signal transmission, microwave signal transmission, infrared signal transmission, visible light signal transmission, ultraviolet light signal transmission, wireless signal transmission along the electromagnetic spectrum, or combinations thereof. The communication interface 1240 may also include one or more Global Navigation Satellite System (GNSS) receivers or transceivers for determining the location of the computing system 1200 based on one or more signals received from one or more satellites associated with one or more GNSS systems. GNSS systems include, but are not limited to, the US-based Global Positioning System (GPS), the Russia-based Global Navigation Satellite System (GLONASS), the China-based BeiDou Navigation Satellite System (BDS), and the Europe-based Galileo GNSS. There are no limitations on operating on any particular hardware configuration, and therefore the basic features described here can easily be replaced by improved hardware or firmware configurations as they are developed.

[0139] Storage device 1230 may be a non-volatile and / or non-transitory and / or computer-readable storage device, and may be a hard disk or other type of computer-readable medium that can store computer-accessible data, such as magnetic tape, flash memory cards, solid-state storage devices, digital versatile disks, cassette tapes, floppy disks, flexible disks, hard disks, magnetic tapes, magnetic stripes, any other magnetic storage media, flash memory, memristor memory, any other solid-state storage, compact disc read-only memory (CD-ROM), rewritable compact disc (CD), digital video disc (DVD), Blu-ray disc (BDD), holographic disc, another optical medium, secure digital (SD) cards, micro secure digital (microSD) cards, Memory Stick cards, smart card chips, EMV chips, Subscriber Identity Module (SIM) cards, mini / micro / nano / pico SIM cards, another integrated circuit (IC) chip / card, random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash EPROM, cache memory (L1 / L2 / L3 / L4 / L5 / L#), resistive random access memory (RRAM / ReRAM), phase-change memory (PCM), spin-transfer torque RAM (STT-RAM), another memory chip or cartridge, and / or combinations thereof.

[0140] Storage device 1230 may include software services, servers, etc., which enable the system to perform functions when the code defining this software is executed by processor 1210. In some embodiments, hardware services that perform a particular function may include software components stored in a computer-readable medium that are connected to necessary hardware components such as processor 1210, connection 1205, output device 1235, etc., to perform that function.

[0141] As used herein, the term "computer-readable medium" includes, but is not limited to, portable or non-portable storage devices, optical storage devices, and various other media capable of storing, containing, or carrying instructions and / or data. Computer-readable media may include non-transitory media on which data can be stored, excluding carrier waves and / or transient electronic signals propagated via wireless or wired connections. Examples of non-transitory media may include, but are not limited to, magnetic disks or magnetic tapes, optical storage media such as compact discs (CDs) or digital versatile discs (DVDs), flash memory, memory, or storage devices. A computer-readable medium may have stored thereon code and / or machine-executable instructions, which may represent procedures, functions, subprograms, programs, routines, subroutines, modules, software packages, classes, or any combination of instructions, data structures, or program statements. A code segment can be coupled to another code segment or to hardware circuitry by passing and / or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc., can be passed, forwarded, or transmitted using any suitable means, including memory sharing, message passing, token passing, network transmission, etc.

[0142] In some embodiments, computer-readable storage devices, media, and memories may include wired or wireless signals containing bit streams, etc. However, when referred to, non-transitory computer-readable storage media explicitly excludes media such as energy, carrier signals, electromagnetic waves, and the signals themselves.

[0143] Specific details are provided in the foregoing description to provide a thorough understanding of the embodiments and examples provided herein. However, it will be understood by those skilled in the art that these embodiments can be practiced without these specific details. For clarity, in some instances, the technology may be presented as comprising individual functional blocks, including functional blocks containing devices, device components, or steps or routines in methods embodied in software or in a combination of hardware and software. Additional components may be used in addition to those shown in the figures and / or described herein. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form so as not to obscure the embodiments with unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments.

[0144] Individual embodiments may be described above as a process or method that is depicted as a flowchart, flow diagram, data flow diagram, structure diagram, or block diagram. Although a flowchart may describe operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process terminates when its operations are completed, but it may have additional steps not included in the figure. A process may correspond to a method, function, procedure, subroutine, subprogram, etc. When a process corresponds to a function, its termination may correspond to the function returning to the calling function or the main function.

[0145] The processes and methods described in the examples above can be implemented using computer-executable instructions stored in or otherwise made available from a computer-readable medium. For example, such instructions may include instructions and data that cause or otherwise configure a general-purpose computer, special-purpose computer, or processing device to perform a particular function or group of functions. Some of the computer resources used may be accessible via a network. Computer-executable instructions may be, for example, binary files, intermediate format instructions (e.g., assembly language), firmware, source code, etc. Examples of computer-readable media that may be used to store instructions, information used, and / or information created during the methods according to the described examples include disks or optical discs, flash memory, USB devices with non-volatile memory, network storage devices, etc.

[0146] Devices implementing these disclosed processes and methods may include hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof, and may take any of a variety of form factors. When implemented as software, firmware, middleware, or microcode, program code or code segments (e.g., computer program products) for performing necessary tasks may be stored in a computer-readable or machine-readable medium. One or more processors may perform the necessary tasks. Typical examples of form factors include laptop computers, smartphones, mobile phones, tablet devices or other small form factor personal computers, personal digital assistants, rack-mounted devices, standalone devices, etc. The functions described herein may also be embodied in peripheral devices or add-in cards. By way of further example, these functions may also be implemented on a circuit board among different chips or in different processes executing in a single device.

[0147] Instructions, media for transmitting these instructions, computing resources for executing these instructions, and other structures for supporting these computing resources are example components for providing the functionality described in this disclosure.

[0148] In the foregoing description, various aspects of this application have been described with reference to specific embodiments thereof, but those skilled in the art will recognize that this application is not limited thereto. Therefore, while exemplary embodiments of this application have been described in detail herein, it should be understood that the concepts of the invention can be embodied and employed in other ways, and the appended claims are intended to be construed as including such variations unless limited by the prior art. Various features and aspects of the applications described above may be used alone or in combination. Furthermore, embodiments may be utilized in any number of environments and applications beyond those described herein without departing from the broader spirit and scope of this specification. Therefore, this specification and the accompanying drawings should be considered illustrative rather than restrictive. For illustrative purposes, the methods have been described in a particular order. It should be understood that in other embodiments, the methods may be performed in a different order than that described.

[0149] Those skilled in the art will understand that the less than (“<”) and greater than (“>”) symbols or terms used herein can be replaced with less than or equal to (“≤”) and greater than or equal to (“≥”) symbols, respectively, without departing from the scope of this description.

[0150] When a component is described as being “configured” to perform certain operations, such configuration can be accomplished, for example, by designing electronic circuits or other hardware to perform the operations, by programming programmable electronic circuits (e.g., microprocessors or other suitable electronic circuits), or any combination thereof.

[0151] The phrase “coupled to” means any component that is physically connected directly or indirectly to another component, and / or any component that communicates directly or indirectly to another component (e.g., connected to another component via a wired or wireless connection and / or other suitable communication interface).

[0152] The use of "at least one" and / or "one or more" in the language of a claim or other language set indicates that one or more members of that set (in any combination) satisfy the claim. For example, the claim language stating "at least one of A and B" refers to A, B, or A and B. In another example, the claim language stating "at least one of A, B, and C" refers to A, B, C, or A and B, or A and C, or B and C, or A and B and C. The use of "at least one" and / or "one or more" in the language set does not limit the set to items listed in that set. For example, the claim language stating "at least one of A and B" can refer to A, B, or A and B, and may additionally include items not listed in the set of A and B.

[0153] Various exemplary logic blocks, modules, circuits, and algorithm steps associated with the embodiments disclosed herein can be implemented as electronic hardware, computer software, firmware, or a combination thereof. To clearly illustrate this interchangeability between hardware and software, various exemplary components, modules, circuits, and steps have been generally described above with respect to their functional aspects. Whether these functions are implemented as hardware or software depends on the specific application and the design constraints imposed on the overall system. Those skilled in the art can implement the described functions in different ways for each specific application, but such implementation decisions should not be construed as departing from the scope of this application.

[0154] The techniques described herein can also be implemented as electronic hardware, computer software, firmware, or any combination thereof. These techniques can be implemented in any of a variety of devices, such as general-purpose computers, wireless communication handsets, or multi-purpose integrated circuit devices, including applications in wireless communication handsets and other devices. Any feature described as a module or component can be implemented together in an integrated logic device or separately as a discrete but interoperable logic device. If implemented in software, these techniques can be implemented at least in part by a computer-readable data storage medium comprising program code, including instructions that, when executed, perform one or more of the methods described above. The computer-readable data storage medium can form part of a computer program product, which may include packaging materials. The computer-readable medium can include memory or data storage media, such as random access memory (RAM) (e.g., synchronous dynamic random access memory (SDRAM)), read-only memory (ROM), non-volatile random access memory (NVRAM), electrically erasable programmable read-only memory (EEPROM), FLASH memory, magnetic or optical data storage media, etc. Additionally or alternatively, these technologies can also be implemented, at least in part, through a computer-readable communication medium that carries or communicates program code in the form of instructions or data structures and can be accessed, read, and / or executed by a computer, such as by propagating signals or waves.

[0155] The program code can be executed by a processor, which may include one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuits. Such a processor can be configured to perform any of the techniques described in this disclosure. A general-purpose processor may be a microprocessor, but alternatively, the processor may be any conventional processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. Therefore, the term "processor" as used herein may refer to any of the foregoing structures, any combination of the foregoing structures, or any other structure or apparatus suitable for implementing the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided within a dedicated software module or a hardware module configured for encoding and decoding, or incorporated into a combined video encoder-decoder (CODEC).

[0156] The exemplary aspects of this disclosure include:

[0157] Aspect 1. An apparatus for facial expression recognition, the apparatus comprising: a memory; and one or more processors coupled to the memory, the one or more processors being configured to: receive an image frame corresponding to a face of a person; determine landmark feature information associated with landmark features of the face based on a three-dimensional model of the face; input the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition; and determine a facial expression associated with the face using the neural network.

[0158] Aspect 2. The apparatus according to aspect 1, wherein the landmark feature information includes one or more blend shape coefficients determined based on the three-dimensional model.

[0159] Aspect 3. The apparatus according to aspect 2, wherein the one or more processors are configured to: generate the three-dimensional model of the face; and determine the one or more blend shape coefficients based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame.

[0160] Aspect 4. The apparatus according to either aspect 2 or aspect 3, wherein the one or more processors are configured to input the one or more blend shape coefficients into a fully connected layer of the neural network.

[0161] Aspect 5. The apparatus according to aspect 4, wherein the fully connected layer concatenates the one or more blend shape coefficients with data output from a convolutional layer of the neural network.

[0162] Aspect 6. The apparatus according to any one of aspects 2 to 5, wherein the one or more processors are configured to generate, using the one or more blend shape coefficients, a landmark image frame indicating one or more landmark features of the face.

[0163] Aspect 7. The apparatus of aspect 6, wherein the one or more processors are configured to: determine a plurality of landmark features of the face based on the one or more blend shape coefficients; determine a subset of the plurality of landmark features corresponding to key landmark features; and generate the landmark image frame based on forming one or more connections among the subset of the plurality of landmark features corresponding to key landmark features.

[0164] Aspect 8. The apparatus of aspect 7, wherein one or more processors are configured to determine a subset of multiple landmark features corresponding to key landmark features based on determining landmark features related to human facial expressions.

[0165] Aspect 9. The apparatus according to any one of Aspects 7 to 8, wherein the landmark image frame includes a binary image frame, the binary image frame indicating pixels corresponding to key landmark features using predetermined pixel values.

[0166] Aspect 10. The apparatus according to any one of aspects 6 to 9, wherein one or more processors are configured to input landmark image frames into one or more layers of a neural network.

[0167] Aspect 11. The apparatus of aspect 10, wherein one or more processors are configured to: input a first version of a landmark image frame to a first layer of a neural network, the first version of the landmark image frame having a first resolution; and input a second version of a landmark image frame to a second layer of the neural network occurring after the first layer, the second version of the landmark image frame having a second resolution lower than the first resolution.

[0168] Aspect 12. The apparatus according to aspect 11, wherein the first and second layers of the neural network are convolutional layers.

[0169] Aspect 13. The apparatus according to any one of aspects 11 or 12, wherein the neural network includes a pooling layer located between a first layer and a second layer, the pooling layer being configured to: downsample activation data output from the first layer to a second resolution of a second version of a landmark image frame; receive the second version of the landmark image frame; and pass the downsampled activation data output from the first layer and the second version of the landmark image frame to the second layer.

[0170] Aspect 14. An apparatus according to any one of aspects 1 to 13, wherein one or more processors are configured to train a neural network using a training dataset comprising: multiple image frames corresponding to the faces of multiple persons, the multiple image frames being labeled with facial expressions associated with the faces of the multiple persons; and multiple landmark feature information associated with the multiple image frames.

[0171] Aspect 15. The apparatus according to any one of aspects 1 to 14, wherein the three-dimensional model includes a three-dimensional morphable model (3DMM).

[0172] Aspect 16. The apparatus according to any one of aspects 1 to 15, wherein the apparatus includes a camera system for capturing image frames corresponding to a human face.

[0173] Aspect 17. The apparatus according to any one of aspects 1 to 16, wherein the apparatus includes a mobile device.

[0174] Aspect 18. The apparatus according to any one of aspects 1 to 17, further including a display.

[0175] Aspect 19. A facial expression recognition method, the method comprising: receiving an image frame corresponding to a human face; determining landmark feature information associated with landmark features of the face based on a three-dimensional model of the face; inputting the image frame and the landmark feature information to at least one layer of a neural network trained for facial expression recognition; and using the neural network to determine a facial expression associated with the face.

[0176] Aspect 20. The method according to aspect 19, wherein the landmark feature information includes one or more blend shape coefficients determined based on the three-dimensional model.

[0177] Aspect 21. The method according to aspect 20, further including: generating the three-dimensional model of the face; and determining the one or more blend shape coefficients based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame.

[0178] Aspect 22. The method according to either aspect 20 or 21, further including inputting the one or more blend shape coefficients into a fully connected layer of the neural network.

[0179] Aspect 23. The method according to aspect 22, wherein the fully connected layer concatenates the one or more blend shape coefficients with data output from a convolutional layer of the neural network.

[0180] Aspect 24. The method according to any one of aspects 20 to 23, further including generating, using the one or more blend shape coefficients, a landmark image frame indicating one or more landmark features of the face.

[0181] Aspect 25. The method according to aspect 24, further including: determining a plurality of landmark features of the face based on the one or more blend shape coefficients; determining a subset of the plurality of landmark features corresponding to key landmark features; and generating the landmark image frame based on forming one or more connections among the subset of the plurality of landmark features corresponding to key landmark features.

[0182] Aspect 26. The method according to aspect 25 further includes determining a subset of multiple landmark features corresponding to key landmark features based on determining landmark features related to human facial expressions.

[0183] Aspect 27. The method according to either aspect 25 or 26, wherein the landmark image frame includes a binary image frame, the binary image frame indicating pixels corresponding to key landmark features using predetermined pixel values.

[0184] Aspect 28. The method according to any one of aspects 24 to 27, further including inputting the landmark image frame into one or more layers of the neural network.

[0185] Aspect 29. The method according to aspect 28, further including: inputting a first version of the landmark image frame into a first layer of the neural network, the first version of the landmark image frame having a first resolution; and inputting a second version of the landmark image frame into a second layer of the neural network that occurs after the first layer, the second version of the landmark image frame having a second resolution lower than the first resolution.

[0186] Aspect 30. The method according to aspect 29, wherein the first layer and the second layer of the neural network are convolutional layers.

[0187] Aspect 31. The method according to either aspect 29 or 30, wherein the neural network includes a pooling layer between the first layer and the second layer, the pooling layer being configured to: downsample activation data output by the first layer to the second resolution of the second version of the landmark image frame; receive the second version of the landmark image frame; and pass the downsampled activation data output by the first layer and the second version of the landmark image frame to the second layer.
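The pooling step in this aspect can be sketched as below. The sketch assumes 2x2 max pooling and assumes that the lower-resolution landmark image frame is attached to the pooled activations as an additional channel before the second layer; the pooling type, the channel-stacking approach, and the function names are illustrative assumptions. The lower-resolution landmark frame is produced here by pooling the full-resolution frame, which is likewise only one possible choice.

```python
def max_pool_2x2(channel):
    """Downsample a 2D grid by taking the maximum of each 2x2 block."""
    h, w = len(channel), len(channel[0])
    return [[max(channel[y][x], channel[y][x + 1],
                 channel[y + 1][x], channel[y + 1][x + 1])
             for x in range(0, w - 1, 2)]
            for y in range(0, h - 1, 2)]


def pool_and_attach(activations, landmark_low_res):
    """Pool every activation channel, check that the result matches the
    resolution of the low-resolution landmark frame, and return both
    stacked as channels for the next layer."""
    pooled = [max_pool_2x2(c) for c in activations]
    pooled_shape = (len(pooled[0]), len(pooled[0][0]))
    landmark_shape = (len(landmark_low_res), len(landmark_low_res[0]))
    if pooled_shape != landmark_shape:
        raise ValueError("landmark frame resolution does not match pooled activations")
    return pooled + [landmark_low_res]


# Hypothetical example: one 4x4 activation channel and a 4x4 landmark frame.
activations = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]]
landmark_high_res = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
landmark_low_res = max_pool_2x2(landmark_high_res)
second_layer_input = pool_and_attach(activations, landmark_low_res)
```

Matching the two resolutions before stacking means the second layer sees landmark positions aligned pixel for pixel with the downsampled activations.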

[0188] Aspect 32. The method according to any one of aspects 19 to 31, further including training the neural network using a training dataset, the training dataset comprising: a plurality of image frames corresponding to a plurality of faces of a plurality of people, the plurality of image frames being labeled with facial expressions associated with the plurality of faces; and landmark feature information associated with the plurality of image frames.

[0189] Aspect 33. The method according to any one of aspects 19 to 32, wherein the three-dimensional model includes a three-dimensional morphable model (3DMM).

[0190] Aspect 34. A computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations according to any one of aspects 1 to 33.

[0191] Aspect 35. An apparatus comprising means for performing operations according to any one of aspects 1 to 33.

Claims

1. An apparatus for facial expression recognition, the apparatus comprising: a memory; and one or more processors coupled to the memory, the one or more processors being configured to: receive an image frame corresponding to a face of a person; generate a three-dimensional model of the face; determine, based on the three-dimensional model of the face, landmark feature information indicating landmark features of the face, wherein the landmark feature information includes one or more blend shape coefficients determined based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame; input the image frame and the landmark feature information into at least one layer of a neural network trained for facial expression recognition; and determine, using the neural network, a facial expression associated with the face.

2. The apparatus of claim 1, wherein the one or more processors are configured to input the one or more blend shape coefficients into a fully connected layer of the neural network.

3. The apparatus of claim 2, wherein the fully connected layer concatenates the one or more blend shape coefficients with data output from a convolutional layer of the neural network.

4. The apparatus of claim 1, wherein the one or more processors are configured to generate a landmark image frame indicating one or more landmark features of the face using the one or more blend shape coefficients.

5. The apparatus of claim 4, wherein the one or more processors are configured to: determine a plurality of landmark features of the face based on the one or more blend shape coefficients; determine a subset of the plurality of landmark features corresponding to key landmark features; and generate the landmark image frame by forming one or more connections between the subset of the plurality of landmark features corresponding to the key landmark features.

6. The apparatus of claim 5, wherein the one or more processors are configured to determine the subset of the plurality of landmark features corresponding to the key landmark features based on determining landmark features associated with human facial expressions.

7. The apparatus of claim 6, wherein the landmark image frame comprises a binary image frame, the binary image frame indicating pixels corresponding to the key landmark features using predetermined pixel values.

8. The apparatus of claim 4, wherein the one or more processors are configured to input the landmark image frame into one or more layers of the neural network.

9. The apparatus of claim 8, wherein the one or more processors are configured to: input a first version of the landmark image frame into a first layer of the neural network, the first version of the landmark image frame having a first resolution; and input a second version of the landmark image frame into a second layer of the neural network that occurs after the first layer, the second version of the landmark image frame having a second resolution lower than the first resolution.

10. The apparatus of claim 9, wherein the first layer and the second layer of the neural network are convolutional layers.

11. The apparatus of claim 10, wherein the neural network includes a pooling layer between the first layer and the second layer, the pooling layer being configured to: downsample activation data output from the first layer to the second resolution of the second version of the landmark image frame; receive the second version of the landmark image frame; and pass the downsampled activation data output from the first layer and the second version of the landmark image frame to the second layer.

12. The apparatus of claim 1, wherein the one or more processors are configured to train the neural network using a training dataset, the training dataset comprising: a plurality of image frames corresponding to a plurality of faces of a plurality of people, wherein the plurality of image frames are labeled with facial expressions associated with the plurality of faces; and landmark feature information associated with the plurality of image frames.

13. The apparatus of claim 1, wherein the three-dimensional model comprises a three-dimensional morphable model (3DMM).

14. The apparatus of claim 1, wherein the apparatus comprises a camera system for capturing the image frame corresponding to the face of the person.

15. The apparatus of claim 1, wherein the apparatus comprises a mobile device.

16. The apparatus of claim 1, further comprising a display.

17. A facial expression recognition method, the method comprising: receiving an image frame corresponding to a face of a person; generating a three-dimensional model of the face; determining, based on the three-dimensional model of the face, landmark feature information indicating landmark features of the face, wherein the landmark feature information includes one or more blend shape coefficients determined based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame; inputting the image frame and the landmark feature information into at least one layer of a neural network trained for facial expression recognition; and determining, using the neural network, a facial expression associated with the face.

18. The method of claim 17, wherein: the landmark feature information includes the one or more blend shape coefficients; and inputting the landmark feature information into the at least one layer of the neural network includes inputting the one or more blend shape coefficients into a fully connected layer of the neural network.

19. The method of claim 18, wherein the fully connected layer concatenates the one or more blend shape coefficients with data output from a convolutional layer of the neural network.

20. The method of claim 17, further comprising generating a landmark image frame indicating one or more landmark features of the face using the one or more blend shape coefficients.

21. The method of claim 20, wherein generating the landmark image frame comprises: determining a plurality of landmark features of the face based on the one or more blend shape coefficients; determining a subset of the plurality of landmark features corresponding to key landmark features; and generating the landmark image frame by forming one or more connections between the subset of the plurality of landmark features corresponding to the key landmark features.

22. The method of claim 21, further comprising determining, based on determining landmark features related to a person's facial expression, the subset of the plurality of landmark features corresponding to the key landmark features.

23. The method of claim 22, wherein the landmark image frame comprises a binary image frame, the binary image frame indicating pixels corresponding to the key landmark features using predetermined pixel values.

24. The method of claim 20, wherein: the landmark feature information includes the landmark image frame; and inputting the landmark feature information into the at least one layer of the neural network includes inputting the landmark image frame into one or more layers of the neural network.

25. The method of claim 24, wherein inputting the landmark image frame into the one or more layers of the neural network comprises: inputting a first version of the landmark image frame into a first layer of the neural network, the first version of the landmark image frame having a first resolution; and inputting a second version of the landmark image frame into a second layer of the neural network that occurs after the first layer, the second version of the landmark image frame having a second resolution lower than the first resolution.

26. A non-transitory computer-readable storage medium for facial expression recognition, the non-transitory computer-readable storage medium comprising instructions stored thereon that, when executed by one or more processors, cause the one or more processors to: receive an image frame corresponding to a face of a person; generate a three-dimensional model of the face; determine, based on the three-dimensional model of the face, landmark feature information associated with landmark features of the face, wherein the landmark feature information includes one or more blend shape coefficients determined based on a comparison between the three-dimensional model of the face and image data corresponding to the face within the image frame; input the image frame and the landmark feature information into at least one layer of a neural network trained for facial expression recognition; and determine, using the neural network, a facial expression associated with the face.