With the appearance of voice user interfaces, AI and chatbots, what is the future of graphical user interfaces (GUIs)? Don’t worry: Despite some dark predictions, GUIs will stay around for many years to come.

Let​ ​me​ ​share​ ​my​ ​personal, humble predictions​ ​and​ ​introduce​ ​multi-modal​ ​interfaces as​ ​a​ ​more​ ​human​ ​way​ ​of​ ​communication​ ​between​ ​user​ ​and​ machine.

What Are Our Primary Sensors?

The old wisdom that a picture is worth a thousand words is still true today. Our brain is an incredible image-processing machine. We can understand complex information faster when we see it visually. According to studies, even when we talk with someone else, nonverbal communication represents two thirds of the conversation. According to other studies, we absorb most information through our sight (83% sight, 11% hearing, 3% smell, 2% touch and 1% taste). In short, our eyes are our primary sensors.

Our​ ​ears​ ​are​ ​the​ ​second​-​most​ ​important​ ​sensors​ ​we​ ​have,​ ​and​ ​in​ ​some​ ​situations​, ​voice conversation​ ​is​ ​a​ ​very​ ​effective​ ​communication​ ​channel.​ ​Imagine​ ​for​ ​a​ ​moment​ ​a​ ​simple shopping​ ​experience.​ ​Ordering​ ​your​ ​favorite​ ​pizza​ ​is​ ​much​ ​easier​ ​if​ ​you​ ​pick​ ​up​ ​the​ ​phone and​ ​order​ ​it,​ ​instead​ ​of​ ​going​ ​through​ ​all​ of ​the​ ​different​ ​offers​ ​on​ ​a​ ​website.​ ​But​ ​in​ ​a​ ​more complex​ ​situation,​ ​rely​ing ​just​ ​on​ ​verbal​ ​communication is not enough.​ ​For​ ​example,​ ​would you​ ​buy​ ​a​ ​shoe​ ​without​ ​seeing​ ​it​ ​first?​ ​Of​ ​course​ ​not.

Even traditionally text-based messaging platforms have started introducing visual elements. It’s no coincidence that visual UI snippets were the first thing Facebook implemented when it created its chatbot platform. Some information is just easier to understand when we see it.

Text-only​ ​and​ ​voice-only​ ​interfaces​ ​can​ ​do​ ​a​ ​good​ ​job​ ​in​ ​some​ ​use​ ​cases,​ ​but​ ​today​ ​it’s​ ​clear they​ ​are​ ​not​ ​optimal​ ​for​ ​everything.​ ​As​ ​long​ ​as​ ​visual​ ​image​-​processing​ ​remain​s people’s main​ ​information​ ​source, ​and​ ​we​ ​are​ ​able​ ​to​ ​process​ ​complex​ ​information​ ​faster​ ​visually, the GUI​ ​is​ ​here​ ​to​ ​stay.​ ​On​ ​the​ ​other​ ​hand,​ ​more​ ​traditional​ ​GUI​ ​patterns​ ​can​​not​ ​survive​ ​in their​ ​current​ ​form​ ​either.​ ​So,​ ​instead​ ​of​ ​radical​ ​predictions,​ ​I​ ​suggest​ ​another​ ​idea:​ User interfaces​ ​will​ ​adapt​ ​to​ ​our​ ​sensors​ ​even​ ​more.

Adaptive Multi-Modal Interfaces

Humans have different input and output devices, just like computers. Our eyes and ears are our main input sensors. We are very good at pattern recognition and at processing images. This means we can process complex information faster visually. On the other hand, our reaction time to sound is faster, so voice is a good option for warnings.

We​ ​have​ ​output​ ​devices​, ​too:​ ​we​ ​can​ ​talk,​ ​and​ ​we​ ​can​ ​gesture.​ ​Our​ ​mouth​ ​is​ ​the​ ​most effective​ ​output​ ​device​ ​we​ ​have,​ because ​obviously​ ​most​ ​people​ ​can​ ​talk​ ​faster​ ​than​ they ​type,​ ​write​ ​or make ​signs.

Because humans are good at combining different channels, I predict that machines will follow and that they will use multi-modal interfaces to adapt to humans’ capabilities. These interfaces will use different channels for input and output, and different mediums for different information types (for example, asking short questions versus presenting complex information).

Interfaces will adapt to humans by using the medium and message format that are most convenient to humans in the given situation. Let’s look at some examples, including the ones we explored at UX Studio, as well as some established commercial products.

Chatbots Are Getting More And More Visual

Nuru is a chatbot concept that helps with day-to-day problems in Africa. Having started to design it as a pure chat application, we soon discovered the limits of text-only conversational interfaces.

For​ ​basic​ ​communication​, ​chat​ ​is​ ​more​ ​effective​ ​than​ ​traditional​ user interfaces (​UIs).​ ​In​ ​Africa​, ​for​ ​example, chat​ ​can​ ​be​ ​used​ ​to​ ​boost​ ​local​ ​commerce.​ ​Sellers​ ​and​ ​buyers​ ​can​ ​find​ ​each​ ​other​ ​and negotiate​ ​different​ ​deals.​ ​In​ ​this​ ​case​, ​chat​ ​is​ ​optimal​ ​because​ ​of​ ​the​ ​one-on-one communication.​ ​But​ ​when​ ​it​ ​comes​ ​to​ ​more​ ​sophisticated​ ​interaction,​ ​like​ ​comparing​ ​many different​ ​job​ ​postings,​ ​we​ ​need​ ​a​ ​more​ ​advanced​ UI.​ ​In​ ​this​ ​case​, ​we​ ​added​ ​cards to​ ​the​ ​chat​ ​interface,​ ​which​ ​users​ ​can​ ​swipe​ ​through.

Some other companies, such as China’s Tencent, went even further and let developers build mini-apps that run within its chat app, WeChat. This inspired Western designers to imagine a conversational interface in which every single message could contain a different app, each with its own rich interface. For example, you could play little games together with your chat partner, like we did 15 years ago in MSN Messenger. This is also an attempt to enhance the simple conversational interface that people love with rich UI functions.

Self-Driving Cars With Mixed Interfaces

A​ ​year​ ​ago​, our ​team​ ​imagined​ ​the​ ​interface​ ​of​ a ​self-driving​ ​car ​as​ ​a​ ​pure​ ​exercise​ ​in multi-modal​ ​design.​ ​We​ ​imagined​ ​the​ ​whole​ ​process​ ​and​ ​tried​ ​to​ ​optimize​ the ​interaction​ ​at​ ​each step.

To order a car, you would push a button on your phone. This is the simplest interaction, and it’s enough to order a car. Obviously, there’s no need to talk on the phone if pushing a button will do.

Then, once you enter the car, you would spend some time getting comfortable, placing your belongings and fastening your seatbelt. Following that, verbal communication would be easier, so the car would ask you where to go. It is also faster to say the place than to type the location on a touchscreen. In order for this to work properly, the car would have to understand any ambiguous instruction you give it.


Trust​ ​is​ ​an​ ​important​ ​issue​ ​in​ ​self-driving​ ​cars.​ ​When​ ​we​ ​are​ ​on​ ​the​ ​road,​ ​we​ ​want​ ​to​ ​see​ whether we​ ​are​ ​headed​ in ​the​ ​right​ ​direction​ ​and​ ​whether ​our​ ​self-driving​ ​car​ ​is​ ​aware​ ​of​ ​the​ ​bicycle​ ​in​ ​front of​ ​us.​ Having to ​ask​ ​the​ ​car​ ​every​ ​time​ for ​its​ ​status would be impractical, especially​ if you’re ​travel​ling ​with​ ​others.​ ​A​ ​tablet-like​ ​interface​, ​visible​ ​to​ ​all occupants,​ would ​solve​ ​this​ ​issue.​ ​It​ would ​always​ ​show ​what​ ​the​ ​car​ detects in ​its surroundings​, ​as​ ​well​ ​as​ your ​position​ ​on​ ​the​ ​map.​ ​The​ ​fact​ ​that​ ​it’s​ ​always​ ​there​ would ​build ​trust. And​, ​of​ ​course,​ ​show​ing ​map​ ​information​ would be easier visually ​than​ ​in​ ​any​ ​conversational​ ​form.

In​ ​this​ ​example​, ​you​ could ​order​ ​a​ ​car​ ​using​ ​a touch​screen,​ ​give​ ​voice​ ​commands​, ​receive auditory​ ​feedback,​ ​as​ ​well​ ​as​ ​check​ ​the​ ​status​ ​on​ ​a​ ​screen.​ ​The​ ​car​ ​always​ ​uses​ ​the most​ ​convenient​ ​medium.

Home Entertainment And Digital Assistants

The​ ​Xbox​ ​console​ ​with​ ​the​ ​Kinect​ ​controller​ ​is​ ​another​ ​example​ ​of​ ​a​ ​mixed​ ​interface.​ ​You can​ ​control​ ​its G​UI​ ​with​ ​both​ ​voice​ ​and​ ​hand​ ​gestures.​ In​ ​the​ ​video​ ​below​, ​you​ ​can see​ ​that​ ​the​ ​gesture​-​recognition​ ​technology​ ​is​ ​not​ ​perfect​ ​yet,​ ​but​ ​it​ ​will​ ​certainly​ ​get​ ​better​ ​in the​ ​future.​ ​The​ ​voice​ ​recognition​ ​is​ ​also​ ​a​ ​bit​ ​awkward​ ​because​ ​you​ ​always​ ​have​ ​to​ ​say​ ​the magic​ ​word​, “​Xbox,”​ ​before​ ​every​ ​command.

Despite the technical flaws, it is a good example of how a machine can give continual visual feedback to voice and gesture commands. When you use your hand as a control, you can see a small hand on the screen as a cursor, and as you move it above different content tiles, it always highlights the current one below your cursor, to show which one you are about to activate. When you say the word “Xbox” to give a command, the console displays a command word on each tile in green, so that you know what to say to select an item.
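The command-word overlay described above can be sketched in a few lines. This is only an illustration of the pattern, not how the console is actually implemented: the `Tile` class, its `command_word` field and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

WAKE_WORD = "xbox"  # the wake word mentioned in the article


@dataclass
class Tile:
    title: str          # label the tile normally shows
    command_word: str   # hypothetical word overlaid in green while listening


def visible_hints(tiles: List[Tile], listening: bool) -> List[str]:
    """Command words to overlay on the tiles; empty while the console is idle."""
    return [tile.command_word for tile in tiles] if listening else []


def handle_utterance(tiles: List[Tile], utterance: str) -> Optional[Tile]:
    """Return the tile selected by '<wake word> <command word>', or None."""
    words = utterance.lower().split()
    if not words or words[0] != WAKE_WORD:
        return None  # ignore speech that does not start with the wake word
    spoken = " ".join(words[1:])
    for tile in tiles:
        if tile.command_word.lower() == spoken:
            return tile
    return None
```

The point of the sketch is that the hints on screen and the words the recognizer accepts come from the same list, so the GUI never advertises a command that the voice channel would reject.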

Of course, the goal here is to help you voice-control an interface that was not really designed for voice in the first place. In the future, more accurate voice recognition and language processing will help people say commands in their own words. That is an important and necessary step to make mixed interfaces more mainstream.

Amazon is without a doubt one of the great pioneers of voice interfaces and “no GUI” interfaces. But even it added a screen to its new generation of Echo devices, after an arguably failed attempt to push the GUI into an app on the user’s phone.

The​ ​freedom​ ​that​ ​a voice​ ​UI​ ​gives​ ​you​ ​is​ ​truly​ ​fascinating,​ ​especially​ ​the​ ​first​ ​time​ ​you​ ​try​ ​it. For​ ​example,​ ​standing​ ​in​ ​the​ ​kitchen​ ​and​ ​saying​ ​“play​ ​Red​ ​Hot​ ​Chili​ ​Peppers” ​is​ ​easier​ ​than scrolling​ ​through​ ​Spotify​ ​albums​ ​with​ ​dirty​ ​hands.

But after a while, when you want to use it for more advanced tasks, it just doesn’t work. In one video review, a user pointed out how weird it is that once you start a kitchen timer, you have to ask the device for the status, because no screen exists. Now, with the Echo Show, you can see multiple timers on the same dashboard.

And​ ​what’s​ ​more​ ​important​ ​for​ ​Amazon​ ​than​ ​shopping?​ ​With​ ​the​ ​old​ ​Echo​, ​you​ ​could​ ​add things​ ​to​ ​your​ ​shopping​ ​list,​ ​but​ ​then​ ​you​ ​had​ ​to​ ​open​ ​up​ ​the​ ​mobile​ ​app​ ​to​ ​actually​ ​purchase something.​ Hearing ​Alexa​ ​read ​out​ ​long​ ​product names​ ​and​ ​descriptions​ ​from​ ​the​ ​Amazon​ store was just a terrible experience.​ ​Now​, ​you​ ​can​ ​handle​ ​these​ ​tasks​ ​on​ ​the Echo​ ​easily,​ ​because​ ​it​ ​show​s ​you​ ​products​ ​and​ ​you​ ​can​ ​choose​ ​the​ ​ones​ ​you​ ​like.

Unlike the Xbox with the Kinect, the Echo Show is a voice-first device. Its home screen is not loaded with app icons. But when you issue an initial voice command, the screen shows you all related information. It is very simple: When you need to know more, you just look at the screen. It’s a bit like how people work in the kitchen: We can maintain a basic conversation while we focus on cooking, but when an important or complex question arises, we stop and look at our partner’s face. This is why the Echo Show’s move towards a multi-modal interface feels more natural.

Here’s another design detail. On the home screen, the Echo will display a news headline and highlight a word in the headline in bold, making it the command word you would say if you wanted to hear the full story. In this way, the capabilities of the product are clear, and it’s obvious how you would use it. The Echo effectively sets expectations and gives tips through its visual interface.

One of the main advantages of Google Home, Echo’s main competitor, is that you can ask follow-up questions. After asking, “How many people live in Budapest?” you could also ask, “What’s the weather like there?” Google Home will know that you’re talking about the same place. Context-awareness is a great feature and will be a must-have in future products.
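To make the follow-up idea concrete, here is a minimal sketch of carrying a place over from one question to the next. It is not how Google Home works internally. The `ConversationContext` class is hypothetical, and the `place` argument stands in for whatever entity extraction a real assistant would perform.

```python
import re
from typing import Optional


class ConversationContext:
    """Remembers the last place mentioned so that a follow-up can refer to it."""

    def __init__(self) -> None:
        self.last_place: Optional[str] = None

    def resolve(self, question: str, place: Optional[str] = None) -> str:
        # `place` is assumed to come from an upstream entity-extraction step.
        if place is not None:
            self.last_place = place
            return question
        if self.last_place is None:
            return question  # nothing to refer back to
        # Replace the word "there" with the remembered place.
        return re.sub(
            r"\bthere\b",
            lambda _match: f"in {self.last_place}",
            question,
            flags=re.IGNORECASE,
        )
```

With this sketch, asking about Budapest first and then asking “What’s the weather like there?” yields a question that names Budapest explicitly, which a stateless backend could then answer.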

When​ we’re ​designing​ an ​interface,​ ​if​ ​we​ ​know​ ​the​ ​context​, ​we​ ​can​ ​remove​ ​friction.​ Will​ ​the​ ​product be used​ ​in​ ​the​ ​kitchen​ ​when​ ​the​ ​user’s​ ​hands​ ​are​ ​full?​ ​Use​ ​voice​ ​control;​ ​it’s​ ​easier​ ​than​ ​a touch​​screen.​ ​Will ​they​ ​use​ ​it​ ​on​ ​a​ ​crowded​ train?​ Then ​touch​​ing a screen​ would ​feel ​far​ ​less awkward​ ​than​ ​talking​ ​to​ ​a​ ​voice​ ​assistant​.​ ​Will ​they​ ​need​ ​a​ ​simple​ ​answer to​ ​a​ ​simple​ ​question?​ ​Use​ a ​conversational​ ​interface.​ Will​ ​they​ ​have​ ​to​ ​see​ ​images​ ​or understand​ ​complex​ ​data?​ ​Put​ ​it​ ​on​ ​a​ ​screen.​ To​ ​improve​ ​interaction,​ ​we​ ​can​ ​ask questions​, ​such​ ​as​ ​which​ ​screen​ ​is​ ​closer​ ​to​ ​them​, ​or​ ​which​ ​one​ would ​be​ ​more​ ​convenient​ ​to use​ given​ the ​situation.

One​ ​thing​ ​that​ ​is​ ​still​ ​missing​ ​from​ ​Google​ ​Home​ ​is​ ​​multiuser​ ​support.​ ​Devices​ ​like​ ​this​ ​will be​ ​used​ ​by​ ​many​ ​different​ ​people,​ ​bringing​ ​us​ ​back​ ​to​ ​the​ ​shared​ ​computer​ ​phenomenon of ​the​ ​early​ ​PC​ ​age.​ ​Switching​ ​between​ ​users​ ​seamlessly​ ​will​ ​be​ ​a​ ​tough​ ​challenge. Security​ ​and​ ​UX​ ​are​ ​not​ ​easy​ ​to​ ​align.​ ​Imagine​ ​that​ ​at​ ​one​ ​moment​ ​you are​ ​talk​ing ​to​ ​your​ ​virtual assistant​, ​with​ ​access​ ​to​ ​all​ of ​your​ ​apps​ ​and​ ​data,​ ​then​ ​a​ ​second​ ​later​ ​someone​ ​else​ ​enters​ ​the room​ and does ​the​ ​same.

Both​ ​Amazon​ ​Echo​ ​and​ ​Google​ ​Home​ ​give ​nice​ ​visual​ ​feedback​ ​when​ they are ​listening to​ ​you ​or​ ​searching​ ​for​ ​an​ ​answer.​ ​They​ ​use​ ​LED​ ​animation.​ ​For​ ​multi-modal interfaces,​ ​keep​ing ​the​ ​voice​ ​and​ ​visual​ ​outputs​ ​in​ ​sync is essential; ​otherwise​, ​people​ ​will get​ easily ​confused​.​ ​For​ ​instance,​ ​when​ ​talk​ing ​to​ someone, ​we​ ​can​ ​easily​ ​look​ ​at​ ​their face​ ​to​ ​see​ ​if​ ​they​ ​are​ ​getting​ ​the​ ​message.​ ​We​ would ​probably​ ​want​ ​to​ ​be​ ​able​ ​to​ ​do​ ​the​ ​same when​ ​talking​ ​to​ a ​product.
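One way to keep the two outputs in sync is to drive both from a single state machine, so that the light can never say “listening” while the device is already thinking. The sketch below assumes four states and made-up LED pattern names. Neither is taken from the Echo or Google Home.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


# Hypothetical LED patterns, one per state.
INDICATOR = {
    State.IDLE: "off",
    State.LISTENING: "solid",
    State.THINKING: "spinning",
    State.SPEAKING: "pulsing",
}

# Transitions we allow in this sketch.
ALLOWED = {
    State.IDLE: {State.LISTENING},
    State.LISTENING: {State.THINKING, State.IDLE},
    State.THINKING: {State.SPEAKING, State.IDLE},
    State.SPEAKING: {State.IDLE, State.LISTENING},
}


class Assistant:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.indicator = INDICATOR[State.IDLE]

    def transition(self, new_state: State) -> None:
        """Change the voice state and the visual indicator in one step."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"cannot go from {self.state.value} to {new_state.value}")
        self.state = new_state
        self.indicator = INDICATOR[new_state]
```

Because `transition` is the only place where either value changes, the visual feedback always reflects what the voice channel is doing.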

Healthcare Products

PD Measure is an app to measure pupillary distance for people who wear prescription glasses. It is a good example of syncing and combining visual and voice interfaces.

Any customer needs to know their pupillary distance in order to purchase glasses online. If they don’t know it, they have to go to a retail store and have it measured there. A measurement tool that is available to anyone at home would open up a huge market for online optics.

With​ ​PD​ ​Measure​, ​the​ ​customer​ ​stands​ ​in​ ​front​ ​of​ ​a​ ​mirror​ ​and​ ​takes​ ​a​ ​photo​ ​of​ ​themselves, keeping​ ​their​ ​phone​ ​in​ ​a​ ​particular​ ​position,​ ​following​ ​precise​ ​instructions.​ ​The​ ​app​ ​then automatically​ ​calculates​ their ​pupillary​ ​distance​ ​using​ ​an​ ​advanced​ internal ​algorithm.​ ​It​ ​is​ ​precise enough​ ​to​ ​make​ ​ordering​ ​glasses​ ​online​ ​possible.

PD Measure’s UI is a combination of animated illustrations on the screen, which show you how to hold your phone, and voice instructions, which tell you what to do. The user has to move their hands to the right position, and the app uses its sensors to give feedback when they are there. When the app finally takes the right image, it provides the user with auditory feedback (a bell rings). This way, the user gets used to the confirmation sound and will take each subsequent measurement more efficiently.

During the prototyping phase, we conducted a lot of user tests, and it turned out that people are more likely to follow voice instructions than visual ones.

In​ ​this​ ​example​, ​visual​ ​and​ ​voice​ ​interfaces​ ​work​ ​together:​ The​ ​animated​ ​illustrations​ ​show you​ ​how​ ​to​ ​hold​ the ​phone,​ while ​the​ ​voice​ ​instruction​ ​helps​ ​you​ ​to​ get in ​the​ ​perfect​ ​position.

Examples From Publishing

Back in 2013, a company named Volio experimented with mixed interfaces. One of its flagship clients was Esquire magazine, which created an interactive experience in which people could talk with Esquire’s columnists. As you can see in the video below, this was a series of videos, and you could choose the next one based on the answer you gave to the question in the current video. Of course, you could only choose from a few predefined answers, but the interaction still felt like a live conversation. It also had a good combination of media: voice as input for commands and the screen to display the content.

Many people think of today’s multi-screen world as separate output channels for our content. Mixed interfaces will be much more than that. People will be able to use your app on different devices simultaneously (for example, using Alexa for voice input while seeing the data on their tablet).

Combining voice and GUI in that way is not necessary either. A sports-streaming app we designed recently enables people to comment on a football game and talk with other fans while watching the match live on their smart TV. The two screens perfectly complement each other.

Such​ ​advanced​ ​interfaces​ ​offer​ ​functionality ​available​ ​through​ ​many​ ​different​ ​devices​ ​and media ​simultaneously.​ ​This​ ​is​ ​redundant,​ ​which​ ​programmers​ ​and​ ​designers​ ​don’t​ ​really like.​ ​But​ ​it​ ​also​ ​has​ ​advantages,​ ​because​ ​it​ ​gives​ ​people​ ​backup​ ​options,​ ​in case ​the​ ​main​ ​option​ ​is not​ ​available.​ ​It​ ​also​ ​helps​ ​disabled​ ​people​ ​who​ ​can’t​ ​use​ ​voice​ ​or​ ​visual​ ​interfaces.

How To Choose The Primary Mode?

Having ​discussed​ ​trends​ ​and​ ​some​ ​current​ ​products,​ ​let’s​ ​sum​marize ​when to​ ​use​ ​voice​ ​and​ when to use a ​visual​ ​user​ ​interface.

Visual​ ​user​ ​interfaces​ ​work​ ​better​ ​with:

  • lists​ ​with​ ​many​ ​items​ ​(where ​read​ing ​all items out loud would take too long);
  • complex​ ​information​ ​(graphs,​ ​diagrams and ​data​ ​with​ ​many​ ​attributes);
  • things​ you ​have​ ​to​ ​compare​ ​or​ ​things​ you ​have​ ​to​ ​choose​ ​from;
  • products​ you would ​want​ ​to​ ​see​ before buying;
  • status​ ​information​ that you would ​want​ ​to​ ​quietly​ ​check​ from ​time​ ​to​ ​time​ ​(the time, a​ ​timer, your​ ​speed, a​ ​map, etc.).

Voice​ ​user​ ​interfaces​ ​work​ ​better​ ​for:

  • commands (i.e. any​ ​situation in which you ​know​ ​exactly​ ​what​ you ​want,​ ​so​ you ​can ​skip​ the ​navigation​ ​and​ ​just​ dictate your ​command);
  • user instructions, because people tend to follow voice instructions better than written instructions;
  • audio​ ​feedback​ for ​success​ ​and​ ​error​ ​situations​, ​with​ ​different​ ​signals;
  • warnings​ ​and​ ​notifications (because the reaction​ ​time​ ​to​ ​voice​ ​is​ ​faster);
  • simple questions that need relatively simple answers.
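The two lists above can be read as a small lookup rule. The sketch below encodes them under stated assumptions: the category names are my own shorthand for the bullet points, and the threshold of three items for reading a list aloud is an arbitrary illustration, not a tested value.

```python
# Shorthand labels for the bullet points above (hypothetical names).
VISUAL_KINDS = {"list", "complex_data", "comparison", "product", "status"}
VOICE_KINDS = {"command", "instruction", "feedback", "warning", "simple_question"}

MAX_SPOKEN_ITEMS = 3  # assumption: short lists can still be read aloud


def primary_mode(kind: str, item_count: int = 1) -> str:
    """Return "visual" or "voice" for a piece of content."""
    if kind == "list" and item_count <= MAX_SPOKEN_ITEMS:
        return "voice"  # a handful of items is quick to say
    if kind in VISUAL_KINDS:
        return "visual"
    if kind in VOICE_KINDS:
        return "voice"
    raise ValueError(f"unknown content kind: {kind}")
```

For example, `primary_mode("warning")` picks voice, while `primary_mode("list", item_count=20)` picks the screen. A real product would also weigh the user’s context, such as busy hands or a crowded train.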

What’s Next?

When I asked my designer friends what mixed interfaces they knew about, some of them mentioned the legendary MIT Media Lab video from 1979, “Put That There.” Nostalgia aside, it is shocking that this technology had a working prototype 38 years ago. Is our super-fast progress just an illusion?

Voice​ ​recognition​ ​still​ ​has​ ​some​ ​obvious​ ​challenges​ ​today​, ​and​ ​just​ ​a​ ​few​ ​major​ ​players provide​ ​platforms​ ​for​ ​products​ ​based​ ​on​ ​voice​ ​recognition, including ​apps such as ​WeChat​ and ​hardware​ ​devices​ such as ​the​ ​Amazon​ ​Echo.

A​ ​good​ ​start​ would be to ​develop a ​mini​-​app ​or​ ​bot ​that​ integrates with ​these​ ​systems.​ ​Here​ ​are some ​tips​ ​from​ ​our​ ​own​ ​experience​ of ​working​ ​with​ ​multi-modal​ ​interfaces:

  • Speed​ ​and​ ​accuracy​ ​are​ ​deal​-breakers.
  • Sync​ ​voice​ ​and​ ​visual​ ​interfaces. Always​ ​have​ ​visual​ ​feedback​ ​of ​what’s happening.
  • Show​ ​visual​ ​indicators​ ​when​ ​the​ ​device​ ​is​ ​listening ​or​ ​thinking​ ​about​ ​an​ ​answer.
  • Highlight​ ​voice​-​command​ ​words​ in​ ​the​ ​graphical​ ​interface.
  • Set​ ​the​ ​right​ ​expectations​ with ​users​ ​about​ ​the​ ​interface’s​ ​capabilities​, ​and make​ ​sure​ ​the​ ​product​ ​explains​ ​how​ ​it​ ​works.
  • The​ ​product​ ​should​ ​be​ ​aware​ ​of​ ​the​ ​physical​ ​and​ ​social​ ​context​ ​of​ ​the​ ​device​ ​and​ ​the conversation,​ ​and​ should ​respond​ ​accordingly.
  • Think​ ​about​ ​the​ ​context​ ​of​ ​the​ ​user,​ ​and​ identify ​which​ ​medium​ ​and​ ​device​ would ​reduce​ ​friction​ ​and​ ​make​ ​it​ ​easier​ ​to​ ​perform​ ​a​ ​task.
  • Give​ ​users​ ​options​ ​to​ access ​a​ ​function​ ​through​ alternative ​devices​ ​or​ ​media. This​ ​will​ ​help​ ​in​ ​situations​ ​where something​ ​breaks,​ ​and​ it will ​also​ ​make ​your​ ​product more​ ​accessible​ to ​disabled​ ​people.
  • Don’t ignore ​security​ ​and​ ​privacy​. Enable​ ​people​ ​to​ ​turn​ off components ​(for example, the microphone),​ ​and​ ​build​ ​trust​ ​by​ ​being​ transparent.​ ​Don’t​ ​be​ ​too​ ​pushy,​ ​or​ else ​you​ ​will frighten​ ​everyone​ ​away​ ​(for example,​ ​voice spam​ ​is​ ​very​ ​annoying).
  • Don’t​ ​read out ​long​ ​audio​ ​monologues.​ ​If​ ​it​ ​can​​not​ ​be​ ​summarized​ ​in​ ​a​ ​few​ ​words​, ​display​ ​it ​on​ ​a​ ​screen instead.
  • Take​ ​time​ ​to​ ​understand​ ​the​ ​specifics​ ​of​ each ​platform,​ ​and​ ​choose​ ​the​ ​right​ ​one to​ ​build​ ​on.

Before​ ​starting​ ​out​, ​though,​ ​keep​ ​in​ ​mind​ ​that​, ​compared​ ​to​ ​other​ ​digital​ ​designs,​ ​multi-modal interfaces​ ​are​ ​still​ ​quite​ ​an​ ​unexplored​ ​area.

First,​ ​we​ ​don’t​ ​really​ ​have​ ​a​ ​general​-​purpose​ ​language​ ​or​ ​programming​ ​framework​ ​to describe​ ​mixed​ ​interfaces.​ ​Such​ ​a​ ​language​ ​could​ ​make​ ​it​ ​possible​ ​to​ ​define​ ​voice​ ​and GUI ​elements​ ​in​ ​one​ ​coherent​ ​code​ ​base,​ ​making​ ​it​ ​easier​ ​to​ ​design​ ​and develop​ ​these​ ​interfaces.​ ​It​ ​would​ ​also​ ​support​ ​multiple​ ​output​ ​and​ ​input​ ​options,​ ​enabling​ ​us to​ ​design​ ​omni-channel,​ ​multi-screen​ ​or​ ​multi-device​ ​experiences.
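To illustrate what such a shared description might look like, here is a toy sketch in which each action is declared once and both the GUI buttons and the voice phrases are derived from it. No such standard framework exists as far as I know. `Action` and `MixedInterface` are invented names for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    name: str                 # internal identifier
    button_label: str         # what the GUI shows
    phrases: List[str]        # what the user may say
    run: Callable[[], str]    # what happens either way


class MixedInterface:
    def __init__(self, actions: List[Action]) -> None:
        self.actions = list(actions)

    def gui_buttons(self) -> List[str]:
        """Labels a GUI layer would render as buttons."""
        return [action.button_label for action in self.actions]

    def voice_grammar(self) -> Dict[str, Action]:
        """Lower-cased phrases a speech layer would listen for."""
        return {p.lower(): a for a in self.actions for p in a.phrases}

    def on_tap(self, label: str) -> Optional[str]:
        for action in self.actions:
            if action.button_label == label:
                return action.run()
        return None

    def on_speech(self, utterance: str) -> Optional[str]:
        action = self.voice_grammar().get(utterance.lower().strip())
        return action.run() if action else None
```

Declaring `Action("start_timer", "Start timer", ["start a timer", "set timer"], handler)` once would make the same handler reachable by tapping the button or by saying either phrase, which is the kind of coherence a real framework would need to provide at much larger scale.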

Second, designers have to come up with new design patterns to support the special needs of multi-modal interfaces. (For example, how would you give visual and audio feedback at the same time?)

Although​ ​the​ ​future​ ​looks​ ​exciting,​ ​and​ ​it​ ​will​ ​happen​ ​fast,​ ​we​ ​still​ ​need​ ​to​ ​reach​ ​the​ ​tipping point​ ​in​ ​voice​ ​recognition​ ​and​ ​language​ ​processing​: ​where​ ​the​ ​usability​ ​of​ ​the​ ​voice​ ​medium will​ ​reach​ ​a​ level of ​quality​ ​that​ ​would​ ​indeed​ ​make​ ​it​ ​the​ ​best​ ​option​ ​in​ ​a​ ​range​ ​of​ ​applications. We​ ​will​ ​also​ ​need​ ​better​ ​tools​ ​to​ ​design​ ​and​ ​code​ ​multi-modal​ ​interfaces.

Once we accomplish these goals, nothing will be holding these natural interfaces back, and they will become mainstream.

History Repeats Itself: Be A Part Of It

Humans​ ​have​ ​multiple​ ​senses.​ ​Technology​ ​and​ ​interfaces​ ​that​ ​use​ ​more​ ​than​ ​just​ ​one have​ ​a​ ​better​ ​chance​ ​of​ ​facilitating​ strong ​human-computer​ ​interaction.

A similar multi-modal evolution happened before. Radio and silent movies were combined into movies with sound, which were further enhanced with 3D and so on. I’m positive that this process will happen in the interactive digital world, too. Exciting times, indeed.

