[0011]Recent developments in computer vision and artificial intelligence make it possible to detect and track people's behavior from video sequences and to further analyze their mental processes—intentions, interests, attractions, opinions, etc. Advances in visual tracking technology make it possible to track shoppers throughout the retail space and to recognize their engagement and interaction with products. Facial image analysis has matured to the point that faces can be detected and tracked in video images, and the motion of the head and facial features can also be estimated. In particular, the head orientation and eye gaze can be measured to estimate the fine-level interest of the shopper. Changes in facial appearance due to facial expression can also be measured to estimate the internal emotional state of the person. The estimated facial feature locations help to normalize the facial images, so that machine learning-based demographic classification can provide accurate demographic information—gender, age, and ethnicity. The proposed invention aims to solve these problems under realistic scenarios in which people show natural responses toward visual elements belonging to consumer products—such as product displays, product information, packaging, etc. While each individual measurement can be erroneous, measurements accumulated over time will provide reliable information for assessing the collective response to a given visual element.
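For illustration only (not part of the claimed embodiment), the principle that noisy per-instance measurements accumulate into a reliable collective estimate can be sketched as follows. The valence value, noise scale, and sample count below are hypothetical assumptions chosen for the example.

```python
import random
import statistics

# Hypothetical per-shopper valence readings in [-1, 1]. Each individual
# measurement is noisy, but the running mean converges toward the true
# collective response as samples accumulate (law of large numbers).
random.seed(0)
true_valence = 0.3   # assumed ground-truth collective response (illustrative)
noise_scale = 0.5    # assumed per-measurement error bound (illustrative)

readings = [true_valence + random.uniform(-noise_scale, noise_scale)
            for _ in range(2000)]

running_mean = statistics.fmean(readings)
stderr = statistics.stdev(readings) / len(readings) ** 0.5
print(f"estimate = {running_mean:.3f} +/- {2 * stderr:.3f}")
```

The standard error shrinks with the square root of the number of measurements, which is why an aggregate over many shopper observations can be trusted even when any single observation is unreliable.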
[0012]The invention adopts a series of both well-established and novel approaches for facial image processing and analysis to solve these tasks. Body detection and tracking locates shoppers and estimates their movements, so that the system can estimate each shopper's interest in or engagement with products, based on the track of movements. The direction toward which the shopper is facing can also be measured for the same purpose. Face detection and tracking handle the problem of locating faces and establishing correspondences among detected faces that belong to the same person. To accurately locate the facial features, both the two-dimensional (position, size, and orientation) and three-dimensional (yaw and pitch) pose of the face should be estimated. Based on the estimated facial pose, the system normalizes the facial geometry so that facial features—eyes, irises, eyebrows, nose, and mouth—are aligned to standard positions. The estimated positions of the irises relative to the eyes, along with the estimated head orientation, reveal the shopper's direction of attention. The invention also introduces a novel approach to extract facial appearance changes due to facial expressions; a collection of image gradient filters is designed to match the shapes of facial features or transient features. A filter that spans the whole extent of the feature shape extracts shapes more robustly than local edge detectors do, and will especially help to detect weak and fine contours of wrinkles (transient features) that may otherwise be missed by traditional methods. The set of filters is applied to the aligned facial images, and the emotion-sensitive features are extracted. These features are then used to train a learning machine to find the mapping from the appearance changes to facial muscle actions. In an exemplary embodiment, the 32 Action Units from the well-known Facial Action Coding System (FACS, by Ekman & Friesen) are employed.
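As a minimal sketch only (the kernel shapes, sizes, and pooling below are assumptions for illustration, not the invention's actual filter designs), a bank of oriented image gradient filters can be applied to an aligned facial image and pooled into an emotion-sensitive feature vector as follows. The helper names `dog_kernel` and `filter_bank_features` are hypothetical.

```python
import numpy as np

def dog_kernel(size, sigma, theta):
    """Oriented first-derivative-of-Gaussian kernel: responds to a
    contour running perpendicular to direction theta."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)   # rotated coordinates
    yr = -xx * np.sin(theta) + yy * np.cos(theta)
    g = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
    return -xr / sigma**2 * g                      # derivative along xr

def filter_bank_features(img, sizes=(9, 15), thetas=4):
    """Apply oriented gradient filters at several scales/orientations and
    pool the absolute response of each filter into one feature value."""
    feats = []
    for size in sizes:
        for k in range(thetas):
            kern = dog_kernel(size, sigma=size / 4, theta=np.pi * k / thetas)
            # circular 2-D convolution via FFT (adequate for this sketch)
            resp = np.real(np.fft.ifft2(np.fft.fft2(img) *
                                        np.fft.fft2(kern, img.shape)))
            feats.append(np.abs(resp).mean())
    return np.array(feats)

# Synthetic "aligned face" patch containing one horizontal contour,
# standing in for a transient feature such as a wrinkle.
img = np.zeros((64, 64))
img[32:, :] = 1.0
features = filter_bank_features(img)
```

Because the kernel spans the whole contour at its scale, a weak but extended wrinkle accumulates response across its length, whereas a local edge detector sees only low per-pixel contrast; the resulting feature vector would then feed the learning machine described above.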
The recognized facial actions can be translated into six emotion categories: Happiness, Sadness, Surprise, Anger, Disgust, and Fear. These categories are known to reflect more fundamental affective states of the mind: Arousal, Valence, and Stance. The invention assumes that these affective states, if estimated, provide information more directly relevant to the recognition of people's attitudes toward a retail element than do the six emotion categories. For example, the degree of valence directly reveals a positive or negative attitude toward the element. The changes in affective state then render a trajectory in the three-dimensional affect space. Another novel feature of the invention is to find a mapping from the sequence of affective states to the end response. The central motivation behind this approach is that, while the changes in affective state already contain very useful information regarding the response of the person to the visual stimulus, there can still be another level of mental processing that produces a final judgment—such as a purchase, opinion, rating, etc. These are the kinds of consumer feedback ultimately of interest to marketers and retailers, and we refer to such a process as the "end response." The sequence of affective states, along with the shopper's changing level and duration of interest, can also be interpreted in the context of the dynamics of the shopper behavior, because the emotional change at each stage of the shopping process conveys meaningful information about the shopper's response to a product. An additional novel feature of this invention is to model the dynamics of a shopper's attitude toward a product using a graphical Bayesian framework such as the Hidden Markov Model (HMM), to account for the uncertainties in the state transitions and the correlation between the internal states and the measured shopper responses.
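For illustration only, such an HMM over hidden attitude states can be evaluated with the standard forward algorithm. The three states, the transition and observation probabilities, and the coarse observation symbols below are hypothetical assumptions, not values from the invention.

```python
import numpy as np

# Hypothetical 3-state model of a shopper's attitude toward a product.
states = ["neutral", "interested", "engaged"]

# P(next state | current state): attitudes tend to persist (illustrative).
A = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.70, 0.20],
              [0.05, 0.15, 0.80]])

# P(observation | state) over three coarse measured responses:
# 0 = passing by, 1 = gazing at product, 2 = positive expression.
B = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

pi = np.array([0.6, 0.3, 0.1])   # initial state distribution

def forward(obs):
    """HMM forward algorithm: belief P(hidden state | observations so far),
    propagating uncertainty through the state transitions."""
    alpha = pi * B[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        alpha /= alpha.sum()
    return alpha

# A shopper who passes by, gazes twice, then shows positive expressions.
belief = forward([0, 1, 1, 2, 2])
print(dict(zip(states, np.round(belief, 3))))
```

The belief over the hidden attitude states sharpens as observations accumulate: after the positive expressions, the posterior concentrates on the "engaged" state, which is the kind of inference the graphical Bayesian framework supports despite noisy individual measurements.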