Summoning music by voice still feels gimmicky at best, and intrusive at worst. On the eve of the so-called voice revolution, why do we feel so awkward, and how will we get over it?
Most of us have tried it by now: either with Siri on our iPhones, or on a mate’s Echo. 2.7m British households currently use Amazon or Google’s smart speaker offerings (Kantar, 2017), and the activity they’re used for most is playing music (Alpine, 2017).
“Hey Siri, play The Weeknd.” That phrase was spoken proud and clear by my partner recently, at the Covent Garden Apple Store’s “showcase room”. There stood a lone HomePod in an empty room, removed from the bustle of the Saturday afternoon crowd. In a more mundane scenario, I’ve got into the habit of using my friend’s Google Home Assistant whenever I’m at his place. It always feels a bit odd. A plethora of commentaries exist as to why conversing with a machine is not (yet) in our human nature. But when it comes to music, they’re not the full story.
The problems we face now when conversing with smart assistants are largely divisible into two buckets: logical issues, which are somewhat easier to fix, and organic ones, which are harder to solve with maths and even societal in nature.
The Obvious Bits
Sometimes Alexa just doesn’t understand us. Sometimes she doesn’t “wake up”, and others she wakes up for no good reason. These flaws can be frustrating and even troubling, depending on your view of personal data privacy. And although Amazon and their peers are working flat out to smooth out this experience (Apple just hired Google’s ex-head of AI to, quoting The Verge, “help fix Siri”), the problems when invoking music on the platforms suddenly multiply.
My partner’s request for Starboy, that weekend afternoon in the Apple Store? He was hoping for:
“I’m tryna put you in the worst mood, ah
P1 cleaner than your church shoes, ah”
Because who wouldn’t want to fill a gloriously aseptic room with the atmospheric beats of Abel’s audio world? He got a lesser-known song. Not a big deal, granted, but the type of mild annoyance that would end up grating on you if this was your main mode of music playback. And this is just the tip of that proverbial music-selection iceberg: think what happens when artists, songs, or albums share names and titles. Or when an artist releases eponymous tracks and albums. Recently asking Google for some Black Honey, and hoping for the melting Brit-Americana flavour they deliver on “Headspin”, I was instead showered with death metal. Thanks, AI. There’s also the small but incredibly important task of understanding what the words “best of” mean. The smart assistants in circulation right now have some pretty strange ideas of what constitutes an artist’s best effort, and you can absolutely end up listening to the back catalogue (which might be fine, depending on your mood).
Lastly, there’s the old connectivity problem. “Sorry, I can’t connect to Spotify right now” might be the most blood-boiling utterance to emerge from a machine in the past ten years, if your hands are covered in onion, your eyes are burning, and you just want some Lana to soothe you through your spag bol efforts. Apple mitigate this by only allowing music to be summoned on the HomePod if your streaming service is Apple Music. (You can still play music if you’re a Spotify user, but only through your phone.)
These wildly frustrating issues will be solved, soon. These are the glitches one expects with emerging technology, and the price we pay to taste the future of ambient computing now. I am not concerned about these issues. I am concerned about those outlined below.
The Less Obvious Bits
Talking to machines is still a relatively novel notion. It can feel unnatural for two reasons. Firstly, most of us are not in the habit of giving orders vocally, and certainly not to an entity we’re expecting a reply from. Telling your dog to sit doesn’t count, because you’re not expecting it to validate your request with “OK Dave, I’ll sit right here”. Secondly, we struggle to speak with these machines because the only mental model we have for conversation was built to converse with other humans. Human conversation is at once deliciously simple and devilishly complicated.
Every human who converses with another follows a set of rules the philosopher Paul Grice coined as the “Cooperative Principle” in 1975. His four maxims define what each participant should contribute to a conversation, and what they’re looking to get out of it. Each of us learns, during childhood, the art of “turn-taking”. We know when to jump in, and when to pull back. We know what to contribute to move the conversation forwards, but not commandeer it. We know how to do this so well, we even do it over text. Crucially, each human in a conversation knows when they’re being spoken to, and can jump in when it’s relevant in a natural way. There is no need to call on Dave to partake; Dave knows when his contribution adds to the exchange.
This is a problem conversational designers for AI need to solve, if voice interactions with machines are to truly become natural. In March of this year, Amazon released an Alexa update allowing her to continue the conversation without the user having to re-invoke her with “Hey, Alexa”, provided something was said within five seconds of the last ask (The Verge, 2018). This is a step in the right direction, but it is only made possible by the device continuing to listen for the human voice. Which raises the question: how long are we happy for our devices to be listening?
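Amazon hasn’t published how this “follow-up mode” works under the hood, but the behaviour described above can be sketched in a few lines. Everything here is an assumption for illustration: the function names are hypothetical, and the only detail taken from the reporting is the five-second window after the assistant’s last reply.

```python
FOLLOW_UP_WINDOW = 5.0  # hypothetical: seconds the mic stays open after a reply
WAKE_WORD = "alexa"

def needs_wake_word(last_reply_time, now):
    """A request needs the wake word unless it arrives within the
    follow-up window after the assistant's last reply."""
    if last_reply_time is None:
        return True  # nothing said yet this session
    return (now - last_reply_time) > FOLLOW_UP_WINDOW

# Simulated exchange: timestamps are seconds into the session.
last_reply = None
for t, utterance in [
    (0.0, "alexa play the weeknd"),
    (3.0, "skip this track"),   # within 5s of last reply: no wake word needed
    (12.0, "turn it up"),       # window expired: wake word needed again
]:
    if needs_wake_word(last_reply, t) and not utterance.startswith(WAKE_WORD):
        print(f"{t:>5}s ignored: {utterance!r}")
        continue
    print(f"{t:>5}s handled: {utterance!r}")
    last_reply = t  # reply sent; follow-up window restarts
```

The trade-off the article points at is visible in the sketch: the only way the 3-second follow-up gets handled without a wake word is for the microphone to keep listening through the whole window.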
Now to muddy the waters further, and leave the tech behind for a moment, we must also consider the power of music. Music touches something deeper inside us. It is an extremely personal experience. So much so that I spent my twenties not wanting to listen to the music I really wanted to listen to around those closest to me, because I was worried they’d judge me for it. Music is part of our unique identities, and it speaks volumes about not only who we are, but also what we believe in and aspire to be. It speaks to all the softest, deepest parts of our being. And with musical stigma rife in pop culture (it’s still taboo in many circles to tell your mates you actually quite like Bieber. Purpose was a banger, there, I’ve said it), we’re constantly trying to reconcile what we truly like with what the world tells us it’s OK to like. Couple this with the intimacy of a moment of musical choice, and you’re into some pretty vulnerable territory. If you’re feeling deflated, and you need some Savage Garden in your life, playing that sweet melancholy tells those around you something of how you’re feeling. Music becomes a vector for overt admission of your true state of being.
That is a powerful thing. And what could strip you even more of your built-up armour, in that intimate moment? Saying it out loud. Most of us are simply not used to verbalising our vulnerable moods.
The nuts and bolts will get better. Selection of Bob Dylan’s best hits will become flawless, and the experience will deliver what you’re expecting at a higher rate. This is the easiest and most immediate improvement we’ll see to the world of voice-activated speakers. Smart speakers will, slowly, get better at participating in human conversations. But we’ll need to decide how comfortable we are with them listening to our lives for extended periods of time, for the sake of enjoying the interaction more when we do summon them. And as consumers, we need to voice our decision on this topic, before it is made for us.
Finally, society will become more accepting of outward expressions of our human vulnerability. And this will go a long way to making the voice-activated musical experience less downright weird.
I’ve noticed a divide in my peer group when it comes to talking to their devices: there are those who thank, and those who do not. I’m a thanker. I’d postulate that real acceptance of these devices into our lives and homes will depend on the development of empathy towards them. But will we ever think of them as anything other than an entity we can unplug, and if not, what relationship will we form with them in the decades to come?