Voicing Our Concerns

AI Voice Generators

[01] Can this transform our lives?

I find this area of AI particularly fascinating. As much as I want to err on the side of caution, there’s no doubt that some of the models out there are pretty cool. I couldn’t resist having a play around, and some of the platforms (Murf, Lovo, ElevenLabs and Speechify in particular) have grabbed my attention.

What they all have in common is a text-to-speech generator that lets users transform text into captivating audio. The AI essentially breaks the text down into phonetic components and then uses its training to predict the most authentic way of vocalising it, e.g. by mimicking human-like intonation. It doesn’t stop there though. Tools like ElevenLabs and Murf have large libraries of different voices that you can apply to the text you give them. It’s really quite epic!

I put a monotonous email from my landlord into the text-to-speech generator and chose from an array of different voices: Neil, ‘cheerful upbeat youth’; Shelby, ‘erratic and confident’; Rowan, ‘gruff and raspy’; the list goes on. The one I eventually chose was Sanjay, ‘profound and deep’. Never before has a lengthy message about putting the bins out, paying council tax and reading gas and electric meters been so enthralling. This is where I feel genuine excitement about tools like this and their potential to bring mundane content to life.

Library Interface on ElevenLabs

Extending this to famous voices is also quite wild. I think back to my school days and wonder: if I’d used something like Speechify to turn the pages of my dreary textbook into audio that could grab my attention, dubbed into the voice of my teenage heartthrob, would straight As in Physics and Chemistry inevitably have followed? Think physicist Brian Cox reading to you about the refraction of light and electromagnetic induction (he wasn’t my teenage heartthrob, by the way), or Attenborough explaining the anatomy of plants – not quite Netflix, but a lot better than trawling through AQA Biology.

Or imagine waking up to one of your favourite artists reading out your to-do list or, more narcissistically, giving you a pep talk about how great you are (I tried to see if they had Little Simz’s voice – not yet!). Or sending my mum a birthday message in the sultry tones of Michael Bublé, her favourite artist, which would definitely make me her favourite child. I’ve already seen AI voice generation being used by content creators to make their work more inviting, and I’m sure it’s just the start.

There are also potential risks. Voice actors are understandably worried that their voices will be taken and reused without proper compensation, and the better the AI models become, the harder it will be to distinguish real voices from these ‘deepfake’ ones. Companies such as Speechify dispute this, saying that while AI may learn to mimic Darth Vader or any other voice, these generated voices still struggle to convey real human emotion. But for how long?

In the wrong hands, AI voice and video technology risks causing all types of chaos. In July there was a realistic deepfake scam featuring money expert Martin Lewis, where AI was used to manipulate footage of him and his voice to promote an Elon Musk investment scheme. If you look closely, it was not that convincing, but it was staged as a Zoom video of him being interviewed, so the sketchy quality disguised the out-of-sync mouth movements. I have to say, it was the voice-generating tool that made the whole thing feel pretty real. These technological developments are bound to make the misinformation on the web a hell of a lot harder to wade through.

Another, even more sinister application is scammers using AI voice generators to sound like family members: for example, a caller imitating a daughter’s voice pretends to be in distress and pleads with her mum to transfer money. Banking apps suggesting voice-activated sign-in are surely a terrible idea for this very reason. Some people think I’m paranoid because I don’t speak if I answer the phone to an unknown number, but all it takes is 30 seconds of audio on systems like ElevenLabs for your voice to be cloned. So trust me, say nothing! Who knows where recordings of your voice might go…
