Voice Forward, Not Voice Only

Our conversations with technology need to be multimodal — which is why the next generation of voice-driven experiences actually demands a screen.

Steven Hansen
Chief Technology Officer

This article originally appeared on VentureBeat

At this year’s annual I/O developer conference, Google spotlighted the ambitions for its digital assistant as “naturally conversational” and “there when you need it.” But as voice-driven and conversational technologies become more ubiquitous, there are still undeniable moments where we find ourselves thinking, “Why can’t I just read this part?”

In an increasingly digital world, conversations with technology need to be multimodal — which is why the next generation of conversational experiences actually demands a screen. And because of its connection to Android, Google Assistant is set up to provide users with unique multimodal experiences.

The Third Screen

Years before Amazon’s Alexa became your personal assistant, we had chatbots that allowed conversation with technology via keyboard and screen. In 2011, Apple launched Siri on the iPhone (a device I carried with me everywhere) with the promise that I could simply ask Siri for whatever I needed using my voice. But the fact of the matter is, I rarely used those chat-based experiences, and I didn’t use Siri.

Amazon has already introduced two visually driven devices (the Echo Show and Echo Spot), and now Google joins them with a series of hardware integrations set to go on sale in July. Through the newly launched display updates, highly customized visual experiences augment voice for the first time in Google’s ecosystem. While the expectation is that users will still use voice as the primary input method, Google has introduced screen-specific features that come into play when a user accesses an Action on a device with a screen rather than on a voice-only speaker.

Fostering New Behavior

Google’s Assistant is also becoming more accessible and simpler to use through updates that allow for multiple requests, and it’s easy to see how improvements like these will have the biggest impact in the near term.

For example, Amazon’s Alexa was first introduced to my home through a black cylinder-shaped speaker, which was an instant hit for a family that loves listening to music around the house. Sure, I could have asked Siri to play music on my phone, but the dedicated Echo speaker was already in the perfect spot on our kitchen counter. We soon found ourselves asking new questions, testing the limits of the technology’s knowledge and abilities. I played with Siri like that when it was first released as well, but eventually Siri was lost in the abyss of apps I no longer use (Siri isn’t an app, exactly, but the point is I didn’t use it). All the while, our Echo stayed in the kitchen, playing music every day and reminding me of things here and there, simply by always being in the room. Alexa taught me a new behavior: actually talking to technology.

And here’s where it gets more interesting: I talk with Siri more now, too.

This leads me to believe that as conversations with technology become increasingly “natural,” they will continue to ingrain the behavior of talking to technology.

Voice Design 2.0

In the earliest days of building skills for Amazon’s Echo, brands began with voice-only experiences — but they quickly started running into roadblocks, mostly because they came from a background of building primarily screen-based experiences. The industry had to learn how users would interact with the voice assistant version of a “touch” or “click” activity. Companies zeroed in on the voice experience as the entire experience, eliminating any dependence on a screen.

But Skill developers kept bumping up against limitations. The new technology, of course, made many things simpler, but there were also tasks that fit better on a screen: things consumers were used to seeing in front of them that couldn’t really be replaced with voice. Voice does have an advantage of its own, though. A user doesn’t need to hear a list of options if their digital assistant can recommend the option they are most likely to pick. That intelligence combined with voice is part of the magic of a voice experience, and it’s what kept my attention.

When context and information allow, we as developers can provide an intelligent recommendation.
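
As a rough illustration of that idea, here is a minimal Python sketch that speaks a single recommendation when a user's history clearly favors one option and falls back to reading the full list otherwise. The function name `build_prompt`, the `min_share` threshold, and the sample drink names are all hypothetical; this is not code from any assistant platform's SDK.

```python
from collections import Counter
from typing import Sequence


def build_prompt(options: Sequence[str],
                 past_choices: Sequence[str],
                 min_share: float = 0.6) -> str:
    """Return a spoken prompt: one recommendation if history is decisive, else a list."""
    # Only count past choices that are still available today.
    counts = Counter(choice for choice in past_choices if choice in options)
    total = sum(counts.values())
    if total:
        best, hits = counts.most_common(1)[0]
        # Recommend only when one option clearly dominates (assumed threshold).
        if hits / total >= min_share:
            return f"Want your usual, {best}?"
    # Not enough context: fall back to reading out the options.
    return "You can choose " + ", ".join(options) + ". Which would you like?"


menu = ["latte", "mocha", "tea"]
print(build_prompt(menu, ["latte", "latte", "mocha", "latte"]))
print(build_prompt(menu, []))
```

The threshold is the design choice worth debating: set it too low and the assistant guesses wrong, set it too high and users keep hearing lists they don't need.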

Google has recognized some of those same use cases, announcing the Lenovo Smart Display for Google Assistant at CES 2018. At Google I/O, the company took this one step further with a renewed focus on how developers can take advantage of a screen again when it comes to Google Assistant experiences.

Augmenting Through a Screen

When it comes to the screen, Google has given the Action developer full control of how the device displays their content with additional formats and styling capabilities. It won’t make sense to attempt replicating a traditional web experience onto these new screens, because that wouldn’t improve the voice experience. The screen also shouldn’t simply display the same text that the voice is speaking aloud. Instead, the screen can provide specific controls or additional context to the conversation that will allow the user to save time. The Action or Skill controls what (if anything) is on the screen, while the user is actively having a conversation.
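
To make the principle concrete, here is a small Python sketch of a response that adapts to the device. On a speaker everything is spoken; on a display the speech gets shorter and the screen carries extra detail and tap targets instead of echoing the spoken text. The `Response` and `Display` types, the `order_status_response` function, and the order details are invented for illustration and are not the Actions on Google or Alexa APIs.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Display:
    """Hypothetical visual payload shown alongside the spoken reply."""
    title: str
    details: List[str] = field(default_factory=list)
    buttons: List[str] = field(default_factory=list)


@dataclass
class Response:
    """Hypothetical multimodal reply: speech always, display only if available."""
    speech: str
    display: Optional[Display] = None


def order_status_response(has_screen: bool) -> Response:
    if not has_screen:
        # Voice-only device: every detail has to be spoken.
        return Response(
            speech="Your order ships Friday and arrives Monday. "
                   "Say 'track it' for updates."
        )
    # Screen available: keep speech short; let the screen add context and controls.
    return Response(
        speech="Your order ships Friday.",
        display=Display(
            title="Order status",
            details=["Ships: Friday", "Arrives: Monday"],
            buttons=["Track it", "Change address"],
        ),
    )


reply = order_status_response(has_screen=True)
print(reply.speech)
print(reply.display.buttons)
```

In this sketch the voice path and the touch path lead to the same outcome ("Track it" can be spoken or tapped), which keeps the experience voice-first while letting the screen save the user time.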

An example shared on the I/O main stage this year was Google Maps navigation. Sometimes, such as while you’re driving, voice technology should stand on its own, but there are instances where it should be more visually immersive. Whether a user wants to touch and select or speak out loud, the same results should be attainable, with the context deciding which mode fits best.

Screen functionality will also add new opportunities for brands to bring their visual look and feel back into the discussion of a branded voice experience. It will still be a voice-first experience, but there will be features that only having a screen makes possible.

Google originally launched the Assistant app to merge a voice and chat experience into one and is doubling down on allowing users to easily transition between a variety of devices. Microsoft followed suit, building Cortana to accommodate both spoken and typed conversation.

We are now experiencing a new development trajectory, one where all of our digital interfaces can come together in an intelligent, seamless, and contextually relevant way — and this is where the real fun begins.

