The AV industry is waking up to the possibilities of speech recognition as it becomes an important control tool. But how much more can be achieved if we can understand not just what is being said, but how it’s being said? Anna Mitchell speaks with Rana Gujral, CEO of Behavioral Signals.
You can tell a lot from someone’s voice and we all instinctively detect a wide and nuanced range of emotions when we listen to other people speaking. But as natural as this seems for humans, it’s not something we usually expect from speech recognition systems, even as they become more adept at understanding natural language.
It’s an area of the market that AI developer Behavioral Signals is cornering with its Oliver API. The company was founded in 2016 by academics Alex Potamianos and Shri Narayanan. At the end of 2018 Rana Gujral, an entrepreneur, joined the company as CEO to help commercialise the product and grow the business into targeted vertical markets.
Gujral says the technology allows users to derive intelligent, actionable insights from voice conversations. “We offer three buckets of capabilities. The first is the ability to decipher essential elements of a voice conversation,” he says. “In a conversation between multiple parties our engines can deduce speaking ratios, listening ratios, diarisation [who is talking], who is speaking more, who is speaking less, speech overlap and so on. We can also deduce the age and gender of the participants.
“The second offering is the elements of what’s being said. This is the natural language processing (NLP). We have a high-performing, homegrown speech-to-text engine that delivers against a variety of industry KPIs in addition to advanced transcription capabilities.
“Where we really shine is in our ability to deduce signals from a voice conversation. These are emotional signals such as anger, happiness and sadness, and behavioural signals such as agitation, engagement, empathy and politeness. We take speech processing, NLP speech-to-text capabilities (what we call ‘what’ is being said) and signal detection (‘how’ something is being said) and apply them to essential business outcomes. We’ve gone a step further and used these capabilities to produce a handful of very specialised prediction engines which predict how a person will act in the near future.”
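In concrete terms, the three buckets might combine into a per-call analysis along the following lines. This is a minimal, hypothetical Python sketch: the field names, values and structure are assumptions for illustration only, not the actual Oliver API output.

```python
# Hypothetical illustration of how the three "buckets" could come together for
# one call: who spoke (diarisation and demographics), what was said
# (transcription) and how it was said (emotion/behaviour signals), feeding a
# forward-looking prediction. Not the actual Oliver API schema.
call_analysis = {
    "who": {
        "speakers": ["agent", "customer"],
        "speaking_ratio": {"agent": 0.55, "customer": 0.45},
        "estimated_demographics": {"customer": {"age_band": "35-44", "gender": "female"}},
    },
    "what": {
        "transcript": [
            ("agent", "How can I help you today?"),
            ("customer", "I'm calling about my bill."),
        ],
    },
    "how": {
        "emotions": {"customer": {"anger": 0.15, "happiness": 0.05}},
        "behaviours": {"agent": {"politeness": 0.9, "engagement": 0.8}},
    },
    # Specialised prediction engines combine the layers above into a score.
    "predictions": {"propensity_to_pay": 0.72},
}

# Downstream code can act on any layer, e.g. flag calls with an angry customer.
if call_analysis["how"]["emotions"]["customer"]["anger"] > 0.7:
    print("Escalate: customer anger detected")
else:
    print("No escalation needed")
```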
Gujral explains that 80% of the signal detection is focused on tonality and only 20% on the words being said. “The technology we have is very different to everything else out on the market,” he adds.
Accurate prediction opens up a range of possibilities, some of which are already used in call centre applications: one example is alerting debt collectors to a customer’s propensity to pay. Other applications can be found in healthcare, where the technology can be used to predict a propensity for suicidal behaviour.
In conferencing and business communication the API can be used to powerful effect. “We are working with some leaders in conference equipment,” confirms Gujral, although he won’t be drawn on who. “They’re taking their core value proposition to the next level. They’re thinking ‘We’re selling this conference equipment into workforces and corporations. What other insights can we gather? What statistics can we gather?’
“These companies can start to ascertain whether interactions are productive or not and what the speaking and listening ratios are. One interesting exploration we did was with a company that carried out most of its work remotely. We looked at gender and age ratios in terms of speaking and listening and found a huge gap in speaking ratios between male and female participants, with males dominating the conversations. With that data we can ask ‘Is there a problem here?’”
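As a rough illustration of how speaking ratios and overlap counts of this kind could be derived from diarised segments, consider the short Python sketch below. The segment format and function are assumptions made for illustration, not Behavioral Signals’ actual data model.

```python
# Hypothetical sketch: speaking time per participant, speaking ratios and a
# simple overlap/interruption count computed from diarised speech segments.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarised speaker label
    start: float   # seconds from call start
    end: float

def conversation_metrics(segments: list[Segment]) -> dict:
    """Summarise who spoke how much and how often speech overlapped."""
    talk_time: dict[str, float] = {}
    for seg in segments:
        talk_time[seg.speaker] = talk_time.get(seg.speaker, 0.0) + (seg.end - seg.start)

    total = sum(talk_time.values()) or 1.0
    speaking_ratios = {spk: round(t / total, 2) for spk, t in talk_time.items()}

    # Count overlaps: a segment that starts before the previous speaker finished.
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = sum(
        1 for prev, cur in zip(ordered, ordered[1:])
        if cur.start < prev.end and cur.speaker != prev.speaker
    )
    return {"speaking_ratios": speaking_ratios, "overlap_events": overlaps}

if __name__ == "__main__":
    call = [
        Segment("participant_1", 0.0, 42.0),
        Segment("participant_2", 40.5, 55.0),   # interrupts participant_1
        Segment("participant_3", 55.0, 70.0),
        Segment("participant_1", 70.0, 120.0),
    ]
    print(conversation_metrics(call))
    # e.g. {'speaking_ratios': {'participant_1': 0.76, ...}, 'overlap_events': 1}
```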
Gujral says that conversation analytics can be taken to a much deeper level.
“We can process speech and dissect a conversation by saying: ‘we have three participants in different age groups, two male and one female, x person spoke for this number of minutes and the interruption and listening ratios are as such’,” he outlines.
“But we can also come up with more advanced capability signals and say the general tone was engaged or disengaged, positive or angry. And if you want to dig deeper and say participant one was generally engaged, participant two was mostly disengaged and participant three was angry throughout the call, you can.
“So that’s a lot of data. A lot of interesting stuff. But what do you do with it? That’s where business KPIs start to come into play. Businesses need to ask what is important to them and then the technology can be built into a tool. We apply these capabilities to address use cases such as agent performance, customer satisfaction, outcome prediction and sales enablement.”

Gujral adds that as well as dissecting and analysing discussions after calls, the technology can provide instruction in real time to correct an interaction as it’s happening, with the system issuing alerts to slow down or speed up speech, for example, or warnings that the other party sounds angry or disengaged.
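A real-time coaching layer of that kind could be as simple as checking streaming scores against thresholds, as in the hypothetical sketch below. The score names, thresholds and alert wording are illustrative assumptions rather than the product’s actual outputs.

```python
# Hedged sketch of real-time coaching: per-window behavioural scores from a
# live call are checked against simple thresholds and turned into prompts.
from typing import Iterable, Iterator

def live_alerts(score_windows: Iterable[dict]) -> Iterator[str]:
    """Yield coaching prompts from a stream of per-window analysis scores."""
    for window in score_windows:
        if window.get("speech_rate_wpm", 0) > 180:
            yield "Slow down: you are speaking faster than usual."
        if window.get("customer_anger", 0.0) > 0.7:
            yield "The other party sounds angry; acknowledge their frustration."
        if window.get("customer_engagement", 1.0) < 0.3:
            yield "The other party sounds disengaged; try a direct question."

# Example: a few ten-second windows from a hypothetical call.
stream = [
    {"speech_rate_wpm": 190, "customer_anger": 0.2, "customer_engagement": 0.8},
    {"speech_rate_wpm": 150, "customer_anger": 0.8, "customer_engagement": 0.4},
]
for alert in live_alerts(stream):
    print(alert)
```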
This sort of insight immediately brings privacy considerations to the fore. Gujral says it’s something that Behavioral Signals has had to take seriously because of its access to data. “There’s a lot of responsibility on us in terms of storage and usage and privacy and security measures,” he says. “Our business model typically involves deploying our capability within the private cloud of a client, inside their firewall. That data never leaves the firewall.”
Gujral also notes that Behavioral Signals’ technology alleviates many concerns. “Our technology is focused on tonality, which means that for the most part we’re not processing text,” he explains. “We’re less interested in what is being said and more in how it’s being said. Often we don’t need to know what people are saying and as a result privacy is intact.”
It’s that aspect of the technology that again takes it beyond the applications many voice recognition systems are currently used for. In another use case, Gujral says a large technology company is looking at monitoring for distressed people within certain facilities.
“One example would be geared towards older people, who can be abused in senior homes,” he says. “We can build a device that tells if someone is in distress. If you were to build a device using current technology you would have to listen to everything that’s being said, process it and then trigger alerts. Who is going to be OK with that? Most people wouldn’t be, even if it can provide value. It’s different if you build that capability within a device that is not listening to words (in fact it’s discarding the actual words) and is just listening to tonality and pitch variance, delivering alerts if it senses someone in distress or an abuse situation.”
Machines can outstrip humans in the volume of data they can process and the speed with which they do it. However, all too often we’re reminded of shortcomings in AI when it comes to emotion. Gujral points out that a sarcastic ‘yes’ will be processed as any other sort of ‘yes’ by Amazon’s Alexa or Apple’s Siri. Behavioral Signals is changing all that. According to Gujral, humans currently just have the edge, with around 80% success in detecting emotions. But Behavioral Signals’ technology is now scoring over 70% and is only getting better. Watch this space.