SDK Emotion AI Science

Emotion AI 101: All About Emotion Detection and Affectiva’s Emotion Metrics


Artificial emotional intelligence, or Emotion AI, is also known as emotion recognition or emotion detection technology. Humans use many non-verbal cues, such as facial expressions, gestures, body language and tone of voice, to communicate their emotions.

At Affectiva, our vision is to develop Emotion AI that can detect emotion just the way humans do, from multiple channels. Our long-term goal is to develop “Multimodal Emotion AI” that combines analysis of both face and speech as complementary signals to provide richer insight into the human expression of emotion.

Let’s dive into the emotion metrics we offer, how we calculate them and map them to emotions, and how we determine their accuracy.


Affectiva Emotion Metrics

The face provides a rich canvas of emotion. Humans are innately programmed to express and communicate emotion through facial expressions. Our technology scientifically measures and reports the emotions and facial expressions using sophisticated computer vision and machine learning techniques.

When you use the Affectiva SDK in your applications, you will receive facial expression output in the form of seven emotion metrics, 20 facial expression metrics, 13 emojis, and four appearance metrics.


Furthermore, the SDK measures valence and engagement as alternative metrics for the emotional experience. Let’s look at what we mean by engagement and valence.

How do we calculate engagement?

Engagement is defined as a measure of facial muscle activation that illustrates the subject’s expressiveness. The range of values is from 0 to 100. Engagement or expressiveness is a weighted sum of the following facial expressions:

  • Brow raise
  • Brow furrow
  • Nose wrinkle
  • Lip corner depressor
  • Chin raise
  • Lip pucker
  • Lip press
  • Mouth open
  • Lip suck
  • Smile
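As a rough illustration of the weighted sum described above, here is a minimal Python sketch. The actual weights are not published in this post, so the equal weights below are purely an assumption, as are the snake_case expression names and the `engagement` function itself; none of this is part of the SDK.

```python
# Expression names taken from the list above; the snake_case keys are hypothetical.
EXPRESSIONS = [
    "brow_raise", "brow_furrow", "nose_wrinkle", "lip_corner_depressor",
    "chin_raise", "lip_pucker", "lip_press", "mouth_open", "lip_suck", "smile",
]

# ASSUMPTION: equal weights, chosen only for illustration.
WEIGHTS = {name: 1.0 / len(EXPRESSIONS) for name in EXPRESSIONS}


def engagement(scores):
    """Weighted sum of expression scores (each 0-100), clamped to 0-100.

    Expressions missing from `scores` are treated as 0 (not observed).
    """
    total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in EXPRESSIONS)
    return max(0.0, min(100.0, total))


neutral_face = engagement({})
slight_smile = engagement({"smile": 50.0})
```

With equal weights, a single expression can contribute at most one tenth of the range, which is why a real weighting would likely emphasize some expressions over others.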

How do we calculate valence?

Valence is a measure of the positive or negative nature of the recorded person’s experience. The range of values is from -100 to 100. The valence metric is calculated from a set of observed facial expressions.


How do we map facial expressions to emotions?

The emotion predictors use the observed facial expressions as input to calculate the likelihood of an emotion. Our mapping from facial expressions to emotions builds on the EMFACS mappings developed by Friesen & Ekman. A facial expression can have either a positive or a negative effect on the likelihood of an emotion. The following table shows the relationship between the facial expressions and the emotion predictors.


Read more about emotion mapping here


[Table image: relationship between facial expressions and emotion predictors]

Using the Metrics

Emotion, expression and emoji metric scores indicate when users show a specific emotion or expression (e.g., a smile), along with the degree of confidence. The metrics can be thought of as detectors: as the emotion or facial expression occurs and intensifies, the score rises from 0 (no expression) to 100 (expression fully present).

We also expose a composite emotional metric called valence, which gives feedback on the overall experience. Valence values from 0 to 100 indicate a neutral to positive experience, while values from -100 to 0 indicate a negative to neutral experience.
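To make the score ranges concrete, here is a small Python sketch of how an application might interpret them. The function names, the threshold of 50 and the neutral band of 10 are all assumptions made for illustration; the SDK does not prescribe them, and sensible values depend on the use case.

```python
def is_present(score, threshold=50.0):
    """True when a 0-100 detector score meets the chosen threshold.

    ASSUMPTION: the default threshold of 50 is arbitrary.
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("detector scores range from 0 to 100")
    return score >= threshold


def describe_valence(valence, neutral_band=10.0):
    """Label a valence value in [-100, 100] as negative, neutral or positive.

    ASSUMPTION: values within +/- neutral_band of zero are called neutral.
    """
    if not -100.0 <= valence <= 100.0:
        raise ValueError("valence ranges from -100 to 100")
    if valence > neutral_band:
        return "positive"
    if valence < -neutral_band:
        return "negative"
    return "neutral"


smiling = is_present(80.0)
mood = describe_valence(-60.0)
```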

Determining Accuracy

We continuously train and test our emotion-sensing metrics to provide the most reliable and accurate classifiers. Our key emotions achieve accuracy in the high 90th percentile.

Our emotion metrics are trained and tested on very difficult datasets. We sampled our test set, comprising hundreds of thousands of facial frames, from more than 6 million facial videos. This data comes from more than 87 countries and represents real-world, spontaneous facial expressions made under challenging conditions, such as varying lighting, different head movements, and variations in facial features due to ethnicity, age, gender, facial hair and glasses.

How do we measure our accuracy?

Affectiva uses the area under a Receiver Operating Characteristic (ROC) curve to report detector accuracy, as this is the most generalized way to measure it. ROC scores range between 0 and 1: a score of 0.5 corresponds to random guessing, and the closer the value is to 1, the more accurate the classifier. Many facial expressions, such as smile, brow furrow, inner brow raise, brow raise, and nose wrinkle, have an ROC score of over 0.9.

Some more nuanced facial expressions, which are harder for even humans to identify reliably, include lip corner depressor, lip pucker and eye closure. These have an ROC score of over 0.8.

The classifiers for emotions have ROC scores greater than or equal to 0.8, with expressions of joy, disgust, contempt and surprise the most accurately detected. Expressions of anger, sadness and fear tend to be more nuanced and subtle and are therefore harder to detect, resulting in scores at the lower end of the range.
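For readers unfamiliar with the ROC score, the following Python sketch computes the area under the ROC curve for a toy detector. It uses the standard pairwise interpretation (the probability that a positive example outscores a negative one) rather than Affectiva's evaluation pipeline, and the labels and scores are made up for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example receives
    a higher score than a randomly chosen negative one (ties count one half).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up data: 1 = expression present in the frame, 0 = absent.
labels = [1, 1, 1, 0, 0, 0]
scores = [95, 80, 40, 60, 20, 5]
auc = roc_auc(labels, scores)  # 8 of the 9 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of frames, so for large test sets a rank-based implementation from a statistics library would be the practical choice.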

The gender classifier uses the face bounding box tracked over a window of time, if available, to build confidence in its decision. If the confidence level does not meet the threshold within a window of 10 seconds, the gender is reported as unknown. The ROC score of the classifier is 0.95 and the average length of time taken to reach a decision is 3.4 seconds. The ROC score of the glasses classifier is 0.9.
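The windowed decision described above can be sketched roughly as follows. Only the 10-second window comes from this post; the threshold value, the minimum frame count, the averaging rule and the labels are hypothetical stand-ins, since the classifier's internals are not described here.

```python
WINDOW_SECONDS = 10.0  # from the post
THRESHOLD = 0.8        # ASSUMPTION: hypothetical confidence threshold
MIN_FRAMES = 3         # ASSUMPTION: require a few frames before deciding


def decide(observations):
    """Return a label once its mean confidence meets THRESHOLD, else 'unknown'.

    observations: iterable of (timestamp_seconds, label, confidence in 0-1),
    ordered by time, with timestamps measured from the start of tracking.
    """
    totals = {}
    count = 0
    for timestamp, label, confidence in observations:
        if timestamp > WINDOW_SECONDS:
            break  # window expired without a confident decision
        count += 1
        totals[label] = totals.get(label, 0.0) + confidence
        best = max(totals, key=totals.get)
        if count >= MIN_FRAMES and totals[best] / count >= THRESHOLD:
            return best
    return "unknown"


confident = decide([(0.0, "female", 0.9), (0.5, "female", 0.85), (1.0, "female", 0.9)])
uncertain = decide([(float(t), "male", 0.5) for t in range(12)])
```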

At the current level of accuracy, the ethnicity and age classifiers are more useful as a quantitative measure of demographics than to correctly identify the age and ethnicity on an individual basis. We are always looking to diversify the data sources included in training those metrics to improve their accuracy levels.

The emojis are driven by the expression classifiers. Classifiers for Tongue Out, Wink and Eye Widen expressions were introduced to increase the range of emojis supported. These have an ROC score of over 0.8.

Cultural differences

Many scientific studies demonstrate the universality of facial expressions of emotions; however, each culture employs what we call “display rules”: culturally specific rules that govern when people amplify, dampen or altogether mask a facial expression of emotion. The research demonstrating the effect of display rules is extensive, covers the past 50 years, and is widely acknowledged. In Southeast Asia, for example, there are very clear display rules about showing emotion, especially in the presence of strangers (a work meeting, a moderator in a research study, etc.): people tend to dampen their expressions, especially negative ones.

Again, our classifiers are trained against our massive emotion data repository that reflects data from 87 countries. This has hardened our technology to account for cultural differences with high accuracy.


Our SDKs also provide the following metrics about physical appearance:



The age classifier attempts to estimate the age range. Supported ranges: Under 18, from 18 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, and 65 Plus.


The ethnicity classifier attempts to identify the person’s ethnicity. Supported classes: Caucasian, Black African, South Asian, East Asian and Hispanic.



The gender classifier attempts to identify the human perception of gender expression.

In the case of video or live feeds, the Gender, Age and Ethnicity classifiers track a face for a window of time to build confidence in their decision. If the classifier is unable to reach a decision, the classifier value is reported as “Unknown”.


The glasses classifier reports a confidence level for whether the subject in the image is wearing eyeglasses or sunglasses.

Face Tracking and Head Angle Estimation

The SDKs include our latest face tracker, which calculates the following metrics:

Facial Landmarks Estimation

Tracking of the Cartesian coordinates of the facial landmarks.

Head Orientation Estimation

Estimation of the head orientation in 3-D space, expressed as Euler angles (pitch, yaw, roll).

Interocular Distance

The distance between the two outer eye corners.
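Given the landmark coordinates, the interocular distance is a plain Euclidean distance. The sketch below assumes each landmark is an (x, y) pair in pixels; the function name and the sample coordinates are hypothetical and not taken from the SDK.

```python
import math


def interocular_distance(left_outer_eye, right_outer_eye):
    """Euclidean distance between the two outer eye corners, given as (x, y)."""
    dx = right_outer_eye[0] - left_outer_eye[0]
    dy = right_outer_eye[1] - left_outer_eye[1]
    return math.hypot(dx, dy)


# Made-up pixel coordinates for the two outer eye corners.
distance = interocular_distance((100.0, 120.0), (160.0, 200.0))
```

Because the value is in pixels, it grows as the face moves closer to the camera, which makes it a convenient rough proxy for face size in the frame.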

Ready to Get Started?

We know that this 101 post is a lot to digest, and now you are an Emotion AI expert! Detecting emotions with technology is a highly complex challenge, but we’re sure you can imagine the many applications when it is integrated correctly. We also wanted to convey the level of sophistication that goes into accurately mapping facial and vocal expressions to emotions.

Also, in the interest of transparency, we wanted to be sure that you understood not only what features and functionality we have available, but how we arrive at them. You can always reference our science resources section to learn more about our technology, or check out the emotion recognition patents we have been awarded.

