Emotion Recognition for Speech

Considerations for Future Conversational Voice Interfaces


By Taniya Mishra, Director of AI Research and Lead Speech Scientist, Affectiva

2018 appears to be the year of conversational agents, which have taken over from the chatbots that dominated the last couple of years. Google Duplex’s recent splashy launch, and the buzz it generated, certainly reinforces this trend: voice interfaces are on the rise. In late 2017, it was reported that more than 20 million Amazon Echo devices and over seven million Google Homes were in use in our homes and offices, and this number continues to grow every day.

Conversational agents have a ton of positives. They leverage our most used communication modality, speech, thus mitigating the substantial learning curve that often accompanies the adoption of new technologies or gadgets. They support easy language-based interactions that almost anyone can use to converse with and through technology, whether you are a child who doesn’t yet know how to type, a person for whom dialing is difficult due to health reasons, or simply someone engaged in a hands-busy, eyes-busy task.

Given their rise in popularity, it becomes imperative that we build voice interfaces to be:

  • Inclusive and accessible:
They are able to meet the various needs and abilities of the user in different contexts and to support the interaction by anticipating user behavior in different scenarios. They don’t give privileged access to particular demographics. For example, speech recognition systems should work equally well for men and women and for people speaking with different accents; voice systems should cater to kids and the elderly as well as to adults in the 18–65 range.
  • Supportive of natural conversational interactions:
They move beyond single or two-turn transactional interactions, which are the current norm, to support multiple user and system turns.
  • Emotionally aware:
In natural human-human conversations, people respond to both stated and unstated information. Voice interfaces should do the same.
  • Emotionally expressive:
They speak in the right emotional voice: for example, delivering sombre news in a sober voice and happy, exciting news in a happy voice.
  • Contextual:
They take into account environmental context (such as time of day and location), personal context, and global context in order to have knowledge-grounded conversations.
  • Tentative:
They can identify when things are going wrong in the conversation and ask clarifying questions.
  • Personalizable:
They can remember individual preferences and past choices in order to make personalized recommendations.
  • Ubiquitous but ambient:
They are there when you need them but then recede into the background.

One of the goals of building conversational agents is to make for a “comfortable”, natural interaction. While this goal in isolation is laudable, there is potential for malicious use. In the last couple of years, we have already seen how text bots can be used for spreading misinformation. Conversational voice interfaces backed by high-quality speech synthesis could give voice to these bots. Consequently, it is even more essential that privacy and transparency issues are considered from the beginning, rather than retrofitted as an afterthought. My thoughts in this regard, in no particular order:

  • Disclosure: Conversational agents need to disclose to the end-user, at the beginning of the conversation, that they are talking to an AI system. Otherwise the interaction is deceptive and thus disrespectful to the end-user. Some people have objected to the use of filled pauses, uptalk, etc. in Google’s Duplex as being too deceptive and an attempt to pretend to be human. I expect that people would find the use of these aspects of natural conversational speech to be less “creepy” and more “charming” as long as the system adheres to proper disclosure guidelines.
  • Consent: Data from which the underlying models are built need to be collected with consent. I think that a lot of organizations are doing well in this respect, but we need to go further. There need to be easy ways for participants to revoke consent and for their data to be removed from the underlying models.
  • Benefits Considerations: As we build conversational agents with more cognitive (and perhaps even emotional) intelligence than ever before, we need to continue to ask ourselves: “Why are we building this?”, “What are the benefits?”, “Is there potential for misuse?”, and finally, “Should we build it?” Ultimately, as with all sophisticated technological advances, the benefits of building conversational agents with cognitive and emotional intelligence need to be weighed against the risks. And in my opinion, the benefits DO indeed outweigh the risks.

Future conversational agents that embody the naturalness of human-human conversations have the potential to truly transform our modern-day lives. But for them to be used with comfort, ease and peace of mind in our private spaces, we need these conversational agents to be inclusive and accessible, and to maintain respect for the end-user through proper disclosure and consent.

At our Emotion AI Summit: Trust in AI, we explored these and other related themes. If you missed out, the Emotion AI Summit 2018 session content is available for download.
