Building Toward Natural Interactions with Voice Assistants

Best practices to reduce friction and improve user experience through voice assistants.

Ben Steele
Sr. Producer

I don’t know what to say.  It seems to me that’s the biggest barrier to having natural conversations with voice assistants.  I’m bad at talking to them.  I stutter over any command I haven’t repeatedly used before to the point where even another human being wouldn’t be able to derive a coherent sentence out of it, let alone a clever AI desperately failing to filter out all of my hums and haws.  I’m just not able to process my side of the interaction in the same way that I would with another person.  Ultimately, the reason is my orientation.

Let’s assume for a moment that I am not a professional in the voice space.  I don’t know what the assistant knows.  I have a limited window into things it can do based on what I’ve needed to get it to do in the past and maybe a few things I’ve derived from commercials.  Even beyond the features, I don’t really know how I should structure my sentences, and if I mess one up, I don’t know whether I should give up and start again or push through.  I certainly don’t know if the conversation is being tailored to some contextual marker that I myself may or may not be considering, and in the back of my mind I’m afraid it’s going to try calling someone on my contact list from a decade ago.  The issue is compounded at the application level, where I don’t really know what phrases are attached to functionality even if my words are understood, so I’m just sitting there trying to shout keywords at it.  Me and my voice assistant, we’re not on the same wavelength.

The threshold for natural conversations with our voice assistants is a long way off, even with recent notable tech innovations.  Don’t get me wrong, we’re able to design a pretty frictionless experience, but what we’re really looking forward to is for our assistants to interact on our level instead of us on theirs, right?  Still, there are plenty of best practices that can reduce friction and improve the perception that you and your voice assistant are in an ongoing contextualized conversation:

Drive the experience, but avoid the spotlight. This kind of depends on whether your voice application is meant for utility or entertainment, but in general your user is looking to get to a point.  We frequently differentiate onboarding sessions from later sessions and then we might add logic to further account for a user’s familiarity.  Paraphrase in a way that makes it clear that even though the words are different, the function is the same.  Store contextual information at a session level and a longer-term user level as applicable, but make it clear that the information is being referenced in context so that the delivery doesn’t come off as a non sequitur.

Over time, remove as many barriers to get to the point as reasonable, but be aware that you’re closing doors when you do so.  Weigh the value of an efficient experience against the constraints it places on the user and offer avenues for them to remove or reconsider those constraints.
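To make that a little more concrete, here is a minimal sketch of how familiarity tiers and the two levels of context might fit together. Everything in it is an assumption made for illustration: the `UserProfile` and `SessionContext` classes, the session-count thresholds, and the timer prompts are hypothetical and do not come from any particular voice platform.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserProfile:
    """Longer-term, per-user state (hypothetical fields)."""
    completed_sessions: int = 0
    preferences: dict = field(default_factory=dict)


@dataclass
class SessionContext:
    """Short-lived state that only applies to the current conversation."""
    last_topic: Optional[str] = None


# The same function, paraphrased at three lengths (illustrative copy).
PROMPTS = {
    "onboarding": "You can start, check, or cancel a timer. What would you like to do?",
    "learning": "Start, check, or cancel a timer?",
    "familiar": "What next?",
}


def familiarity(profile: UserProfile) -> str:
    """Map session count to a tier. The thresholds are arbitrary assumptions."""
    if profile.completed_sessions == 0:
        return "onboarding"
    if profile.completed_sessions < 5:
        return "learning"
    return "familiar"


def build_prompt(profile: UserProfile, session: SessionContext) -> str:
    """Pick the prompt for the user's tier and name any context being reused."""
    prompt = PROMPTS[familiarity(profile)]
    if session.last_topic:
        # Say what is being referenced so it doesn't land as a non sequitur.
        prompt = f"Still on {session.last_topic}. {prompt}"
    return prompt
```

A returning user hears the short prompt, a first-time user hears the full one, and anything carried over from earlier in the session is named out loud before it is used. A stored preference in `UserProfile.preferences` would deserve the same treatment, along with a spoken way to change it.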

Prompts make the cogs turn.  That is to say: Compliance.  You might think that notifications are annoying, suggestion chips are patronizing, and reminders are invasive.  Sometimes they are all of those things. Invocation phrases are generally very clunky as implemented right now.  “Alexa, ask Spotify to play my 2016 Kaylor Swerry mixup” isn’t a very easy phrase to get through.  Make sure your skill’s name is memorable and short. Use Alexa’s device notifications to surface your skill.  I personally get frustrated with very long stretches of spoken dialogue.  Google’s suggestion chips are AMAZING if I just want to cut to the chase but don’t know what to say to move the dialogue forward.

In general, be easier than a mobile app. Then, once you are, wean your users off their screens by facilitating habits around voice.

Personalization can be risky.  A while back we made a voice application that would remember your child’s name and use it verbally when the context called for it.  Parents will love that, right?  Problem is there are a lot of names, with a lot of different spellings, pronounced a lot of different ways.  You’d end up trying to spell to the assistant (red flag here already) only to have it pronounce the name wrong.  Turns out people don’t like such keen insight into the fate they’ve condemned their child to by giving them unusual names.  Jokes aside, our designer came up with the awesome idea of nicknames that the kids could choose from, like “T-Rex” and “Ninja Master”.  Instead of polarizing our users into ‘better than normal experience’ vs ‘worse than normal experience’, he came up with a generalized approach that was more fun for the kids and an all-around better experience for the parent as a result.
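One reason the nickname approach is easier to get right is that the assistant only has to recognize a choice from a short, known list instead of an open-ended name. Here is a rough sketch of that idea using fuzzy matching from Python's standard library. This is not how our application was built; the `match_nickname` helper, the 0.6 cutoff, and the "Captain Rocket" entry are assumptions added for illustration.

```python
import difflib
from typing import Optional

# "T-Rex" and "Ninja Master" are from the article; "Captain Rocket" is a
# hypothetical extra so the list has more than two options.
NICKNAMES = ["T-Rex", "Ninja Master", "Captain Rocket"]


def _normalize(text: str) -> str:
    """Lowercase and drop spaces and punctuation so 't rex' equals 'T-Rex'."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def match_nickname(heard: str, nicknames=NICKNAMES, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest nickname to what was heard, or None if nothing is close."""
    by_key = {_normalize(name): name for name in nicknames}
    hits = difflib.get_close_matches(_normalize(heard), list(by_key), n=1, cutoff=cutoff)
    return by_key[hits[0]] if hits else None
```

Because every option is written and tested in advance, you can also listen to the assistant pronounce each one before a child ever hears it. When nothing matches, the function returns `None`, which is the cue to re-prompt with the list rather than guess.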

Lastly, it’s worth noting that there is a pretty obvious speed bump when designing a voice experience, which is that users and stakeholders aren’t able to visualize it well before it’s actually been implemented because often the wireframe is just a very complex flow chart.  That’s a lot of time and money to invest into coded functionality that could be dramatically altered based on feedback.  Do as much validation as possible, as early as possible.  Grab teammates and run through the script with them.  If you’re using a testing service, give them scripts to run through together.  Listen to Alexa speak your copy before it’s implemented.  Basically anything to avoid major changes to your interaction model late in the game.

Portions of this content originally appeared as part of SoundHound’s Finding Your Brand Voice: 6 Ways to Build a Better VUI Guide.
