The Secret Behind Authentic Text-To-Speech Voices
These days, quality isn’t something you sacrifice with text-to-speech voices; it’s something you gain. Text-to-speech now sounds so surprisingly real that most people can’t tell the difference between AI-generated speech and an actual human voice. There are a few reasons why, and a few areas where AI-powered text-to-speech particularly shines.
What Makes Text-To-Speech Voices Sound So Un-Naturally… Natural?
Below are a few ways to ensure text-to-speech sounds less machine-like and more life-like.
One of the reasons early text-to-speech sounded robotic is that the software pronounced every single word exactly the same way. When humans talk, they naturally vary how they say words, even the exact same ones. They add inflections, varying tones, and different emphases.
“When you think about the human voice, what makes it natural… is the inconsistencies,” says Matt Hocking, CEO of WellSaid Labs, an AI-powered text-to-speech platform for learning and development companies.
WellSaid Labs worked with hundreds of voice actors, feeding their audio into the WellSaid Labs system. The result: the WellSaid text-to-speech voices sound remarkably similar to the humans they learned from. The AI practiced how to speak from listening to, well, how humans speak—which is in many different ways, even for the exact same words.
Another quality of human speech is that it contains pauses. Humans need air, so they naturally pause to inhale, exhale, swallow, and start again. These pauses create rhythmic, natural-sounding variations. Early text-to-speech overlooked this nuance (robots, after all, don’t typically need to pause for oxygen), but today’s text-to-speech sounds much more life-like because it reproduces these pauses.
In today’s text-to-speech editors, you can further simulate these pauses by adding in commas, dashes, periods, and ellipses, cueing the text-to-speech to take breaks, just as a human would. These punctuation marks function more as sheet music to the TTS than grammar—instructing the text-to-speech to pause, hold, and create natural silences just like humans do.
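This punctuation-as-sheet-music idea also exists in a more explicit form: many TTS engines accept SSML, a W3C markup standard in which pauses are written out as `<break>` tags. As a rough sketch of the mapping, here is how a script's punctuation could be translated into explicit SSML pauses (the durations below are illustrative choices, not any engine's defaults, and `punctuate_to_ssml` is a hypothetical helper, not part of any particular editor):

```python
import re

# Illustrative mapping from punctuation marks to pause lengths.
PAUSE_MS = {",": 250, "-": 400, ".": 600, "...": 900}

def punctuate_to_ssml(text: str) -> str:
    """Wrap a script in SSML, turning punctuation into explicit <break> tags."""
    def to_break(match):
        mark = match.group(0)
        return f'{mark} <break time="{PAUSE_MS[mark]}ms"/>'
    # Match ellipses first so '...' is not consumed as three separate periods.
    pattern = re.compile(r"\.\.\.|[,.\-]")
    return f"<speak>{pattern.sub(to_break, text)}</speak>"

print(punctuate_to_ssml("Wait... think, then speak."))
```

The point is the same either way: whether you type an ellipsis or a `<break>` tag, you are cueing the voice to breathe, just as a human narrator would.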
When you speak, you naturally emphasize certain words through intonation. Today’s text-to-speech does, too. Because the AI learned from humans using intonation, it incorporated those patterns into its own way of speaking. It’s kind of like children learning how to speak from the adults around them, only, in this case, the child is a very sophisticated data tool that can analyze loads of speech, languages, and voices at once.
If you want to call out specific words that might otherwise be unclear to text-to-speech, you can simply note this in the editor. For example, you can put words in quotation marks, capitalize entire words, or capitalize parts of words if you want them emphasized. Today’s text-to-speech reads these cues just as a voice actor would, understanding where to adjust intonation.
Another challenge that early text-to-speech faced was that even the same words are pronounced differently depending on usage. Take the example of ‘read’. The past tense is pronounced ‘red’ while the present tense is pronounced ‘reed’. The text-to-speech of yore may have missed the difference, but today’s text-to-speech captures the subtleties with ease.
If any words or acronyms might be unclear, you can easily add phonetic spelling to the editor to ensure the text-to-speech picks up on the nuance. This is just like how you might help a voice actor. For example, instead of typing ‘COO’, you might spell out ‘C-O-O’ so the reader knows to pronounce the acronym letter by letter rather than blending the letters together.
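The acronym trick above is simple enough to automate in a preprocessing step. The sketch below is a hypothetical helper (not a feature of any particular TTS editor): it rewrites a known list of acronyms with hyphens so the voice reads the letters individually.

```python
# Acronyms we want read letter by letter rather than blended into a word
# (e.g. 'COO' should be spoken "C-O-O", not "coo"). This list is illustrative.
SPELL_OUT = {"COO", "CFO", "FAQ"}

def spell_acronyms(script: str) -> str:
    """Insert hyphens into listed acronyms so TTS pronounces each letter."""
    words = []
    for word in script.split():
        stripped = word.strip(".,!?")       # ignore trailing punctuation
        if stripped in SPELL_OUT:
            word = word.replace(stripped, "-".join(stripped))
        words.append(word)
    return " ".join(words)

print(spell_acronyms("Ask the COO about the FAQ."))
# → Ask the C-O-O about the F-A-Q.
```

A dictionary-based pass like this is deliberately conservative: it only touches acronyms you list, leaving everything else for the TTS voice to pronounce on its own.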
In many cases, text-to-speech platforms like WellSaid Labs handle long words and numbers even better than human actors. For example, try to read the word ‘antidisestablishmentarianism’ in one go. A text-to-speech voice is able to naturally piece the syllables together, creating a natural-sounding pronunciation that might escape most voice actors without a few practice runs.
Variations in pronunciation also occur—not just with words that are pronounced differently in past vs. present tense—but depending on one’s locality or culture. For example, ‘caramel’ can be pronounced either as ‘care-a-mel’ or ‘car-mel’. Similarly, ‘aunt’ can be pronounced as either ‘ant’ or ‘ont’. Adding a different spelling in a text-to-speech editor teaches the AI to swiftly pick up on this, overriding any inherent pronunciations that a voice actor may have.
What The Research Says
Obviously, we’re big fans of text-to-speech. But what do actual listeners say?
In July 2019, text-to-speech platform WellSaid Labs asked participants to listen to a set of randomized recordings created by both synthetic voices and human voice actors. For each file, participants were asked:
“How natural (i.e. human-sounding) is this recording?”
Each recording was then rated on a scale of 1 (bad: completely unnatural speech) to 5 (excellent: completely natural speech).
Human voice actors averaged around 4.5 rather than a perfect 5, likely because some recordings contained background noise or mispronunciations.
In June 2020, WellSaid Labs matched this, with their synthetic TTS ranking just as highly as actual human voice actors. WellSaid Labs even hired a third-party company to verify the results.
So the data (and the AI) speak for themselves: today’s synthetic text-to-speech sounds undeniably, shockingly human-like, and—as is the nature of AI—it’s only getting better.
To hear actual examples of human-sounding TTS, check out comparisons of voice actors to synthetic TTS for everything from complex words to numbers, acronyms, punctuation, and more. We think you’ll be shocked by how hard it is to tell the difference.
Download the eBook Text-To-Speech For L&D Pros: The Next Frontier Of Storytelling to learn how to leverage AI voice generators for your remote learning programs and boost employee engagement. Also, join the webinar to learn how you can update eLearning voiceovers on time and under budget!