How did three quotes in a documentary — totaling 45 seconds' worth of material, each verifiably written by its biographical subject, Anthony Bourdain — set off a firestorm of controversy about ethics in technology and filmmaking unlike anything in recent memory? From outraged Twitter users to film critics, much has been written about Roadrunner director Morgan Neville's dubious use of vocal synthesis AI technology to seemingly enable the late travel chronicler to utter, from beyond the grave, lines he'd only ever written.
The topic touched such a nerve that it's become the focal point of discussion on an otherwise positively reviewed film. Rather than piling onto a well-trodden debate about this filmmaker's decision to use vocal synthesis techniques — and how he (in my opinion) failed to adequately obtain full approvals and disclose his methods in the finished creative product — I'd like to explore the deeper implications of voice synthesis technology.
The technology has arrived — it’s powerful, relatively inexpensive, and doesn’t even require much by way of training data. Much like every AI application, it holds significant potential for good — societal, personal, and economic — but also carries considerable practical risk and ethical weight. The question is not if this technology will be used, but how.
Some voice actors fear this technology might put them out of work, while proponents of the technology see its potential to earn voice actors money while they sleep. Aspiring politicians worry opponents could weaponize deepfakes against them, while yet other politicians salivate at the idea of having a credible version of their voice deliver campaign messages in languages they don’t speak. For every frivolous enablement of a celebrity AI voice on Alexa, there’s a deeply serious medical case where a patient preserves their voice before it’s lost to disease.
I’d like to explore both the significant upsides and risks of this technology, and offer a path forward to shape how, when and why we use AI to synthesize human voices.
First, why is synthetic voice technology such a sensitive topic?
“The voice is a deep reflection of character…voice is the fingerprint of the soul.” The English actor Daniel Day-Lewis uttered these words in describing the magnitude of the challenge of preparing to play the late American President, Abraham Lincoln, on the silver screen.
It is, of course, commonplace for actors of both great and little renown to studiously practice the nuanced inflections of the character they seek to embody on screen, on stage or in an audiobook. The process can take months, even years. We’re quick to applaud dramatic feats of vocal impersonation, much like we’re universally impressed by good comedic impressionists. As long as you’re convincing, we are generally accepting of actors of a given nationality or native language playing characters who differ on these variables (as in the case of an iconic English actor playing an iconic American president). But when we believe actors’ vocal performances fall short, we’re merciless in our critiques. Only the few and the practiced can be vocal chameleons, because every audience member has an innate keenness of vocal interpretation.
Our voices are arguably the most precious tools we have to assert agency in our lives. There’s a sacredness with which we treat voices such that any improper attribution of an utterance concerns us. We never want to “put words in others’ mouths.” We need to hear important information “from the horse’s mouth.” Our reverence for voice as our most unadulterated mode of expression explains the outrage we see when technological meddling seems to disrupt the “natural order.” We react with visceral concern when we see deep fakes, partly because of the uncanny valley effect, but also because we intellectually appreciate the harm that weaponized synthetic media could unleash.
It’s tempting to write off voice synthesis technology, whether we find it creepy or we harbor more intellectual concerns about its danger to society (or both), but first we should explore the potential benefits of this technology.
First, how does voice cloning work?
Plenty of primers and blogs go deep into this subject, so we'll keep it high level. The technology has improved significantly in the last decade with the introduction of better AI tools. Approaches vary, but suffice it to say that through digital signal processing algorithms and deep learning techniques, a small volume of speech data — even seconds or minutes — can train a system to accurately reproduce a human voice across a wide range of possible spoken outputs (or to create a novel voice from scratch).
The quality of the training data and the nuances of the processes used to recreate the voice are what separate the great voice clones from the so-so ones. To modulate a synthetic voice across a spectrum of styles for a given target speaker, the training data must itself span a spectrum of tones and nuances for the model to draw on.
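One core idea here is the "speaker embedding": many clips of a speaker are distilled into a single numeric fingerprint that the synthesizer conditions on. The toy sketch below is an illustration of that idea only, not a real TTS system — real systems learn embeddings from spectrograms with deep networks, while here the per-clip feature vectors are invented by hand and compared with cosine similarity.

```python
# Toy illustration of the "speaker embedding" idea behind voice cloning.
# Real systems learn embeddings from spectrograms with deep networks;
# here we invent per-clip feature vectors, average them into one
# embedding, and compare new clips to it with cosine similarity.
import math

def average_embedding(clips):
    """Average per-clip feature vectors into one speaker embedding."""
    dims = len(clips[0])
    return [sum(clip[d] for clip in clips) / len(clips) for d in range(dims)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical feature vectors for three short clips of the same speaker.
clips = [[0.9, 0.1, 0.4], [0.8, 0.2, 0.5], [1.0, 0.1, 0.3]]
speaker = average_embedding(clips)

same_voice = cosine_similarity(speaker, [0.85, 0.15, 0.45])    # close match
different_voice = cosine_similarity(speaker, [0.1, 0.9, 0.1])  # poor match
print(f"same: {same_voice:.3f}, different: {different_voice:.3f}")
```

More (and more varied) training clips sharpen the embedding, which is why a spectrum of source material matters for tone and nuance.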
Big tech is pouring significant amounts of money into this technology for their assistants, but the startup scene is also thriving, with many well-funded companies advancing their own approaches, from VocaliD to WellSaid Labs. The end result is an ever-lower barrier to entry for creating a high-quality voice clone.
So, who stands to benefit from the advancement of voice synthesis technology? First, the human lens.
Giving voice to the voiceless
Perhaps the least controversial application of voice synthesis technology, and likely its most important, is providing a voice to those who cannot speak or who have a severe vocal impairment. As VocaliD's Rupal Patel details in this interview, those born with cerebral palsy or who develop a condition like ALS or Parkinson's disease can benefit greatly from a synthetic voice. Making this process simple while keeping the output quality high is the focus of initiatives like the Voice Preservation Clinic. And while "banking" your voice accurately is important for some, interestingly, accuracy isn't always the end goal. Many want their synthetic voice to differ from the one they had (younger, or accentuated on one dimension or another), or may seek to artificially "age" it over time through filters. Regardless of the accuracy or creative license taken with a synthesized voice, it's a massive boon for people who want to express their thoughts through speech.
Scaling content accessibility for audio
The sheer volume of written information produced every minute creates a big market for voice-overs: for people who are blind or have low vision, but also for sighted people on the go, driving or listening through headphones. Producing high-quality voice-overs of all that content in a timely fashion is a tall task that synthetic voices can help address.
The appeal of that voiced-over content will fluctuate based on the attributes of the voice delivering it — not only its quality, but also its regionality, its overall style, and other factors that make our response to individual voices so personal and profound. For some, content that sounds like it’s coming from a voice similar to those in their community may land much better than the same content voiced by someone who sounds like they come from a very different background (though this won’t always be the case). As voice acting faces its own reckoning with a lack of diversity and representation, the synthetic voice industry must also ensure that the AI-generated voices we hear better reflect our diversity as people.
Preserving and enhancing legacies
Companies like HereAfter.AI and Project December are betting that our love for family and friends will lead us to preserve digital versions of them that persist after they're gone. We capture people's likenesses in photos and videos and think nothing of displaying them prominently in our most personal spaces; will interactive bots of the deceased become similarly commonplace at some point? The SF Chronicle explored this question recently through the lens of romantic love cut short by tragedy, and looked at how generative language models such as GPT-2 and GPT-3 can produce a staggering level of perceived accuracy when trained on a person's messaging history. Our digital data immortalizes us by default today; it seems inevitable that digital legacies will evolve beyond a collage of memory-jogging photos, videos and messages into something more intentionally vivid that shifts our norms around grieving, if so desired.
Beyond these ethics-oriented applications, synthetic voices hold significant commercial & creative upsides.
Augmenting, rather than replacing, professionals
Voice actors are presumably most at risk from the emergence of this technology, but it can also be used to scale and complement their work. Voice actors are limited by their humanity: one voice can only be in one place at a time, and as a bodily organ, it must not be overworked for fear of lasting damage. Sharing the profits from every usage of their synthesized voices could be an attractive passive revenue stream for these actors. What's more, certain voice acting jobs, especially in gaming, require loud screaming or other contortions that tax the vocal cords, whereas synthetic voices face no such risk. Some jobs will still call for a human performance, while others might be nearly as well served by a synthetic voice, opening more opportunities to earn in parallel.
Enhancing entertainment & education
Whenever we create a voice experience for a client, we ask: should we use a synthetic voice or a human voice-over, or both (in the case of multiple personas)? The downside to the human voice option is always about flexibility — if we change or add content, we need the talent to record it, which takes time and money. This adds up quickly for games or films. For studios and actors alike, synthetic voices are a useful fallback option for edits, rather than a wholesale substitute for voice acting work.
Efficiency in a pinch is nice, but the creative potential of synthetic voices will be far more profound. Wouldn't it be magical if students could engage with a voicebot that actually sounded like Martin Luther King or Margaret Thatcher? Or if they could bring their creative writing to life through convincing voices for their characters? Whether students are on the receiving end of synthetic voices or using them in their own creative process, the ability to make content more engaging can't be ignored.
So, how do we pursue a responsible agenda with synthetic voices, and avoid the risks of unauthorized appropriation, misrepresentation, misinformation, and fraud?
Anticipating the problems
As a society, we've been woefully inadequate at anticipating the consequences of technological innovation, and we shouldn't let this technology's evolution outpace our anticipation of its challenges. The AITHOS Coalition has introduced a guide to ethics in synthetic media, which hopes to imbue the industry with "mindful technology," and the Open Voice Network is advocating for ethical guidelines for voice synthesis as part of its broader agenda. These are critical endeavors to lay out the questions that need to be answered: How might this technology misrepresent an individual, or a whole community or demographic? Should the data used to train models represent the world of today or the world as we'd like it to be? How transparent should voice synthesis companies be about their methods and IP?
Developing intelligent moderating tools
Companies like Facebook, Twitter, and Google are between a rock and a hard place right now as they moderate the mostly textual content on their platforms: many believe they are simply choosing not to do enough, while others liken current moderating practices to a form of state censorship. There are no easy answers to protecting free expression while preventing the spread of false, dangerous, hurtful or deceitful content; many of these decisions are ultimately subjective exercises shaped by fragmented beliefs about what's true. Throw synthetic media into the mix and you've got an even greater challenge. What's the line between satire and slander?
Tech companies will need to find reliable methods of flagging, disclaiming, or removing synthesized content, and of determining whether it was created and posted within the rights-holder's permissions. Tech companies could also monetize this technology through official "verified voices," a digital watermark of synthetic veracity. Heck, I'd probably pay a one-time fee to have Barack Obama's tweets read to me in his voice.
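To make the "verified voices" idea concrete, here is a minimal sketch of the verification handshake, under the assumption that the platform holds a secret signing key. Real audio watermarking embeds the mark in the signal itself; an HMAC over the audio bytes is just the simplest stand-in for that idea, and the key name is hypothetical.

```python
# Sketch of a "verified voice" check: the platform signs the audio bytes
# with a secret key, and later verifies that the signature still matches.
# (Real audio watermarking embeds the mark in the signal itself; an HMAC
# over the raw bytes is a simplified stand-in for that idea.)
import hashlib
import hmac

PLATFORM_KEY = b"hypothetical-platform-secret"  # assumption: key management exists

def sign_audio(audio_bytes: bytes) -> str:
    """Produce a verification tag for a synthesized clip."""
    return hmac.new(PLATFORM_KEY, audio_bytes, hashlib.sha256).hexdigest()

def is_verified(audio_bytes: bytes, signature: str) -> bool:
    """Constant-time check that the clip matches its tag."""
    return hmac.compare_digest(sign_audio(audio_bytes), signature)

clip = b"\x00\x01synthetic-audio-bytes"
tag = sign_audio(clip)
print(is_verified(clip, tag))            # True: untampered clip verifies
print(is_verified(clip + b"\x02", tag))  # False: any modification fails
```

A scheme like this only attests that a clip came from the platform unmodified; deciding whether it was *rightfully* created remains the harder, human problem.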
Developing proper standards for disclosures
When a synthetic voice is used in any piece of media, how should it be labeled for the listener or viewer, and what metadata should accompany it (e.g., proof of licensing for the instance, posthumous estate approvers, etc.)? A common vocabulary and set of practices around this will be critical to avoid mishaps like Roadrunner. So much of our concern around this technology stems from the potential for something fake to pass as true. Synthetic content could be accompanied by a blockchain-like provenance of derivation sources, approvals granted, and rationale for use (e.g., entertainment value, educational value, familial/private use, etc.), creating a "paper trail" that can be demonstrated on demand.
Pairing voice authentication with other factors
Voice biometrics are already helping prove identity, but they're inadequate as a sole means of verification. As part of a multi-factor security system (for example, reading a texted one-time code aloud), voice can contribute to a strong solution. The bigger risk surrounds voice phishing attacks launched through voice-impersonated messages. It's one thing to see an email impersonating your loved one; it's another to hear what sounds like their voice, or even to converse with what sounds like them live. This is where education comes in. We need to raise awareness that this sort of scam will become commonplace, not only for public figures, but also for ordinary citizens.
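The multi-factor pairing described above can be sketched in a few lines: access requires both a voiceprint match above a threshold and the correct one-time code read aloud, so a cloned voice alone is not enough. The voice-match score here is a stand-in for a real biometric system, and all names are hypothetical.

```python
# Sketch of pairing a voice check with a second factor: authentication
# requires BOTH a voiceprint match above a threshold AND the correct
# one-time code read aloud. The voice score is a stand-in for a real
# biometric system; a synthetic voice alone cannot pass.
import secrets

def issue_one_time_code() -> str:
    """Generate a six-digit code to text to the user."""
    return f"{secrets.randbelow(1_000_000):06d}"

def authenticate(voice_match_score: float, spoken_code: str,
                 issued_code: str, threshold: float = 0.9) -> bool:
    return voice_match_score >= threshold and secrets.compare_digest(
        spoken_code, issued_code)

code = "482913"  # in practice, issue_one_time_code() texted to the user
print(authenticate(0.97, "482913", code))  # True: voice and code both pass
print(authenticate(0.97, "111111", code))  # False: wrong code
print(authenticate(0.40, "482913", code))  # False: cloned voice alone fails
```

The design choice is defense in depth: an attacker with a convincing voice clone still needs the victim's phone, and an attacker with the phone still needs the voice.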
Developing a strong global legal framework for responsible use
Should synthetic voices be used in the courtroom to humanize victims? Or to dramatize readings of harassing messages sent by the accused? Will audio recordings be relied upon as evidence?
Should future meme-makers having fun with an open-source voice cloning technology be sued by angry celebrities? Can governments take action against others based solely on audio surveillance when that could be easily manipulated?
These questions represent the tip of the iceberg for scholars to debate. It’s not enough to share opinions that synthesized voices feel “creepy” or “wrong,” or to speculate that someone as “authentic” as Anthony Bourdain would never approve creating a voice likeness. What we should be getting ahead of is how we manage this technology thoughtfully, in service of the most human values of honesty, transparency, individuality, connection, and freedom of expression. To do so, we must elevate the intellectual response over the visceral, and try to build the commercial, legal, cultural and technical infrastructure through which our most personal tools can be extended into the digital realm.