Emotion recognition is mostly ineffective. Why are companies still investing in it?


Have you ever been angry at your computer? Rosalind Picard knew plenty of people who had. There were the minor cases, of course: a slap on the chassis, say, or shouts of frustration when software was taking too long to load. Then there were the more extreme examples. “I love the story of the chef in New York who threw his computer in a deep-fat fryer,” the MIT computer science professor told Wired in 2012. “There was a guy who fired several shots through the monitor and several through the hard drive. You don’t do that because you’re having a great experience.”

What the world needed to prevent these outbursts, Picard reasoned, weren’t just faster computers, but ones capable of anticipating the build-up of frustration through the analysis of contextual signals: the furious tapping of a mouse, say, or the haptic feedback from a keyboard being prodded harder and harder.
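Picard’s examples can be read as a simple signal-processing problem. As a purely illustrative sketch (the class, thresholds and signal names below are invented for this article, not drawn from Picard’s work), a system might compare the latest input cadence and key-press force against a rolling baseline of the user’s own behaviour:

```python
from collections import deque
from statistics import mean

class FrustrationMonitor:
    """Rolling-window heuristic: flag likely frustration when click rate
    and key-press force both climb well above the user's recent baseline.
    Entirely hypothetical; real systems would need far richer context."""

    def __init__(self, window: int = 20, ratio: float = 1.5):
        self.clicks_per_sec = deque(maxlen=window)
        self.key_force = deque(maxlen=window)
        self.ratio = ratio  # how far above baseline counts as "frustrated"

    def record(self, clicks_per_sec: float, key_force: float) -> None:
        # Log one sample of input activity
        self.clicks_per_sec.append(clicks_per_sec)
        self.key_force.append(key_force)

    def is_frustrated(self) -> bool:
        # Need some history before a baseline is meaningful
        if len(self.clicks_per_sec) < 5:
            return False
        base_clicks = mean(list(self.clicks_per_sec)[:-1])
        base_force = mean(list(self.key_force)[:-1])
        return (self.clicks_per_sec[-1] > self.ratio * base_clicks
                and self.key_force[-1] > self.ratio * base_force)
```

Even this toy version shows the core of Picard’s argument: the signals are contextual and relative to the individual, not absolute readings of an inner state.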

In 1995, she debuted this vision of machine learning in a technical report that grew into her 1997 book Affective Computing. In it, Picard envisioned a future where artificial intelligence would be capable of interpreting not just anger and frustration in the user, but the expression of all types of emotion – leading not only toward more intelligent product design, but to new applications in medicine, learning and the creative arts.

Since then, affective computing has acquired another name: emotion recognition, or ER. Recent years have seen numerous applications for ER emerge, from using facial analysis to spot when drivers are on the brink of dozing at the wheel, to assessing the suitability of job candidates during interviews, to helping call-centre staff obtain early warning of customers who sound particularly irate.



Along the way, though, these systems’ capabilities seemed to inflate. Many vendors weren’t just claiming that they could analyse physical expressions, but that they could use this data to infer an interior emotional state – in other words, know exactly what the user was feeling.

For many academics, this was ludicrous: a smile, for example, can denote deep-seated frustration as easily as it does happiness. A backlash against ER quickly developed. Leading the charge was psychologist Lisa Feldman Barrett, who, in a review article she co-authored in 2019, explained the dangers of drawing such profound conclusions from the curve of a brow or the intonation of someone’s voice.

The following two years saw this debate spill over into mainstream coverage, using the premise that ER was inherently unreliable to question whether it was inflicting undue emotional labour, preventing people from being hired, wrongly identifying people as criminally suspicious, or being used to oppress Uyghurs in Xinjiang. The apotheosis of this debate, however, came in an article in The Atlantic that not only attacked ER as fundamentally flawed, but charted a link between these applications and the controversial theories of psychologist Paul Ekman, who claimed that certain emotions are expressed in universal ways.


Perhaps that should have been the end of it. Indeed, some tech companies, like Microsoft, have walked back prior endorsements of ER, while HireVue has suspended its use of the technology to assess job candidates. But others haven’t. In April, Zoom stepped into the field of emotion recognition software when it announced its intention to incorporate AI models capable of analysing user engagement during video calls, while a US start-up called EmotionTrac claims its facial analysis software can help law firms assess how juries might react to certain arguments (neither company responded to interview requests).

In the face of such excoriation by the academic community, continued interest in these applications seems baffling. But so far, the allure of supposedly “mind reading” technology, and the simple intuition that our emotions are reflected in our faces, are proving to be more persuasive than the nuanced argument that debunks it.

Emotion recognition technology analyses facial expressions to divine insights about an individual, but scientists warn that its promised capabilities are largely overblown. (Photo by sdominick/iStock)

Does emotion recognition work?

It was never meant to be this way, says Picard. Her work at the Affective Computing Lab and Affectiva, a start-up she co-founded in 2009, wasn’t aimed at reading people’s minds, but instead attempted to automate the very human action of looking at someone’s face and guessing how they were reacting to a situation in the moment. Using those signals to infer an interior emotional state, she explains, is a step too far.

“I might see you nod, and I think, ‘Okay, he’s trying to look like he’s paying attention,’” says Picard. “Does it mean that you’re happy with what I said? No. It’s a social gesture.”

Context is key when making these judgements. Just as an interviewee can correctly guess that a journalist’s nod is a sign they are listening, so too does an AI need to know precisely what situation it is being asked to judge an individual’s reactions in.

This was precisely how Picard wanted Affectiva to apply affect recognition in ad testing (she eventually left the firm in 2013 and founded her own start-up, Empatica). Companies including Coca-Cola and Kellogg’s have used its technology to look for outward signs that a commercial is funny or interesting, findings which are then benchmarked against self-reported data from the subject. But using the software outside of those constraints, explains Picard, would see its effectiveness diminish dramatically.

The limits of emotion recognition AI

When taken in context, emotion recognition technology can be of use, says Andrew McStay, professor of digital life at Bangor University. “Even some of the most vociferous critics will kind of agree that, if you understand an expression in context, in relation to a person – who they’re with and what they’re doing, and what’s occurring at that time – there is greater confidence in the result.”

Even so, there’s a ceiling in how effective even this approach can be: the subject might have a muted reaction to an advertisement because they’ve had a bad day, for example. The essential unknowability of how that wider context influences an individual’s outer expressions means emotion recognition AI can only ever provide limited insight – something Picard herself acknowledges.

“In my book, I said you need context, you need the signals, and you can still be wrong,” she says. “Because it’s not the feeling.” 

But even burying ER systems within larger contextual frameworks has its downsides, argues Os Keyes, a researcher into technology ethics at the University of Washington. Even if its contribution is small, an ER component drawing the wrong conclusions from physical expressions of affect can still contaminate that larger system’s decision-making process. “Some human, somewhere, has decided that 65% is the threshold for ‘this is true or not’,” they explain. “If you are at 64%, and emotion recognition brings you to 66%, there is a different outcome.”
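Keyes’ point about thresholds is easy to make concrete. In the hypothetical sketch below (the weights, scores and cut-off are invented for illustration, not taken from any real system), a small ER contribution is enough to push an otherwise identical candidate across a human-chosen cut-off:

```python
def decide(base_score: float, er_score: float,
           er_weight: float = 0.1, threshold: float = 0.65) -> bool:
    """Blend a base assessment with an emotion-recognition score,
    then apply a human-chosen cut-off. All numbers are illustrative."""
    combined = base_score * (1 - er_weight) + er_score * er_weight
    return combined >= threshold

# The same candidate, scored 0.64 on everything else, passes or fails
# depending only on what the ER component happens to report.
print(decide(0.64, 0.9))  # ER reads "engaged" -> True
print(decide(0.64, 0.2))  # ER reads "flat"    -> False
```

Even at a 10% weighting, the ER signal alone determines the outcome for anyone sitting near the threshold, which is exactly where contested decisions tend to live.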

Physical affect also varies according to the nationality of the individual. Even the cultural biases of AI researchers themselves can’t help but influence the output of their systems. “If you're looking at eye movements, or body movement, or facial muscle movements…I’m sure there is something they can tell you,” says Shazeda Ahmed, a researcher at Princeton University and an expert on ER applications in China. “But, it's very hard to design systems like this and not having imposed your cultural presumptions about what emotion is, how you define it.”

As a field, emotion recognition also suffers from a basic problem of semantics. For scientists like Picard, ‘emotion’ is a technical term used in reference to the uncertain analysis of physical affect. Most other people define emotion as an all-encompassing feeling – making it much harder conceptually to untangle physical expression from mental states. “I’m partly to blame for this,” says Picard, conceding that another term, such as ‘affect processing’, might have proven less controversial.

It doesn’t help that there’s no agreement in psychology about what terms like ‘affect’ and ‘emotion’ actually mean. Picard originally approached dozens of psychologists about how best to define the outputs under scrutiny by facial analysis software. What she found was a balkanised field where researchers jealously guarded the dogma of their sub-theories of emotion. “It’s very antagonistic,” says Picard. “You plant a flag, you stand by that flag.”

That lack of consensus might explain why so many start-ups take inspiration from Paul Ekman’s theories about the universality of emotion, explains McStay. In the many technology conferences he’s attended, he says, “there’s a deep focus on technical method rather than psychological method,” focusing disproportionately on the latest advances in computer vision instead of the limitations and ethics of analysing affect.

In that context, “the Ekman model works really well for technologists,” says McStay, insofar as his idea that certain emotions are innate and have common expressions implies that machines can be trained to identify them.

A spectrogram of a human voice used in artificial intelligence training. Some technology companies claim that emotion recognition can play a role in detecting emotions in a human voice - even signs of deception. (Photo by Smith Collection/Gado/Getty Images)

The allure of emotion recognition technology

If people understood that emotion recognition isn’t capable of making a judgement about internal emotional states, says Keyes, the technology would be a lot less popular – even if it was embedded within a larger framework of contextual data. “If I make you a coffee and I tell you that it has 15 ingredients, one of which is rat shit,” they say, “do you feel more comfortable about drinking that coffee?”

Despite this, the allure of emotion recognition remains strong – and not just among technology start-ups. Having spent several years studying emerging ER applications in China alongside Shazeda Ahmed, lawyer and researcher Vidushi Marda is regularly invited to address authorities in India and the EU about the technology and its limitations. In almost every conversation, she’s had to try harder than she expected to convince her audience that ER is flawed.

“The knowledge that these systems don't work isn't compelling enough for governments to not throw money at it,” says Marda. “The allure of power, and computational power, is so high when it comes to emotion recognition that it's almost too interesting to give up, even though there's overwhelming scientific evidence to show that it actually doesn't work.”

This could be because it’s just easier to argue for emotion recognition rather than against it, Marda continues. After all, people do tend to smile when they’re happy or frown when they’re angry, and it’s a simple idea to imagine technology capable of interpreting those signals.

Explaining that physical expressions or intonations of voice do not always correspond with an interior state of mind is a more complicated argument to make. “I think we can talk about Ekman until the cows come home,” says Marda, “but if people still believe that this works, and people still watch shows like The Mentalist, it’s difficult to fully get them on board.”


It’s little wonder, then, that the impassioned arguments of Barrett et al haven’t stopped technology companies trading on the inflated expectations of what ER can achieve. Emotion Logic, for example, has sought $10m from VC firms to bring to market “an AI engine that understands emotions and can not only relate to what a person says, but also what the person really feels,” according to the CEO of its parent company. And while Picard maintains that she was clear with clients during her time at Affectiva that the insights to be gained from audience reactions were inherently limited, the company states on its website that its Emotion AI service is capable of detecting ‘nuanced human emotions’ and ‘complex cognitive states’.

In most places around the world, these companies operate in a regulatory vacuum. What would help, says Picard, are laws requiring fully informed consent before such systems are used, banning their operation in certain circumstances while safeguarding others – most obviously healthcare applications like helping people with autism to interpret basic emotions (though Keyes points out that this latter use has also inflated expectations of what ER can achieve).

A useful template, Picard explains, might be found in the polygraph, a machine whose use is restricted under US law precisely because its effectiveness is highly contingent on the context of its operation. Even then, she says, “we know that there’s even trained polygraph people [that] can still screw it up.”

Getting to that point, however, will require a concerted effort to educate lawmakers about the limitations of ER. “I think even people like me, who would rather do the science than the policy, need to get involved,” says Picard, “because the policymakers are clueless.” 

Reform on this scale will take time. It may not be possible until the public, and not just academics, believe the technology is failing in its goals, explains Nazanin Andalibi, a professor at the University of Michigan. That’s made even more difficult by the invisibility of failures not only to those having their emotions ‘recognised’ but also those using the services, given the lack of transparency on how such models are trained and implemented.  

Meanwhile, the “processes of auditing these technologies remains very difficult,” says Andalibi. This also plays into a groundswell of hype around artificial intelligence in recent years, in which the emergence of voice assistants and powerful language models fosters the illusion that neural networks can achieve almost anything, so long as you feed them the right data.

For their part, Keyes is convinced technologists will never get that far. Developing an AI capable of parsing all the many nuances of human emotion, they say, would effectively mean cracking the problem of general AI, probably just after humanity has developed faster-than-light travel and begun settling distant solar systems.

Instead, in Keyes’ view, we’ve been left with a middling technology: one that demonstrates enough capability in applications with low-enough stakes to convince the right people to invest in further development.  

It is this misunderstanding that seems to lie at the root of our inflated expectations of emotion recognition. “It works just well enough to be plausible, just well enough to be given an extra length of rope,” says Keyes, “and just poorly enough that it will hang us with that length of rope.”

Read more: The fight against facial recognition

