Speaking feels like one of the simplest things the human body does, but it is surprisingly hard for a computer to emulate. Amazon’s Alexa, Apple’s Siri, and Google’s voice assistant are as close to a human voice as any AI has gotten, yet even they sound like computers. Lyrebird, a Montreal-based startup, has developed an AI voice generator that can mimic any person’s voice, complete with a dash of natural human emotion. The system is not perfect yet, but it is an impressive start. We might be walking into a future where voices are as easily faked as photos.
However advanced text-to-speech systems get, they will keep having the same problems as long as they rely on prerecorded words from voice actors. Strung together into sentences, these words sound robotic, with no inspiration, emotion, or impact in the delivery. Then there is the monotony of hearing the same voice over and over again. Some people have even complained that voice assistants only come with a female voice. There are plenty of reasons for that choice, but no single voice is going to suit everyone.
Lyrebird’s voice-imitation algorithm can mimic any person’s voice and even read text with a predefined emotion. The algorithm needs less than a minute of prerecorded audio to regenerate the voice along with the speaker’s intonation. As part of its promotional campaign, the company used recordings of Barack Obama, Hillary Clinton, and Donald Trump to produce imitation audio clips.
Apart from copying voices, the system can generate the same sentence with several different intonations.
The algorithm uses artificial neural networks to recognize the patterns in a person’s speech and then reproduces those patterns in simulated speech. Jose Sotelo, a speech synthesis expert at Lyrebird, explained, “We train our models on a huge dataset with thousands of speakers. Then, for a new speaker, we compress their information in a small key that contains their voice DNA. We use this key to say new sentences.”
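To make the “small key” idea in the quote above more concrete, here is a toy sketch in Python. It is not Lyrebird’s method, which the article does not describe in detail. It assumes each recording has already been turned into a list of per-frame feature vectors (the numbers below are invented), and it replaces the neural network with a simple average. The function names `speaker_key` and `cosine_similarity` are hypothetical. The only point is that recordings of different lengths can be compressed into keys of the same fixed size, and that keys from the same speaker end up closer together than keys from different speakers.

```python
import math

def speaker_key(frames):
    """Compress a variable-length list of feature frames into one
    fixed-size, unit-length vector (a stand-in for a learned 'voice key')."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    # Average each feature dimension over all frames.
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Invented feature frames: two recordings of one speaker, one of another.
alice_short = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]
alice_long = [[1.0, 0.0, 0.1], [0.8, 0.2, 0.0], [0.9, 0.1, 0.1], [1.0, 0.2, 0.0]]
bob = [[0.1, 1.0, 0.3], [0.0, 0.9, 0.4]]

key_a1 = speaker_key(alice_short)
key_a2 = speaker_key(alice_long)
key_b = speaker_key(bob)

same = cosine_similarity(key_a1, key_a2)
different = cosine_similarity(key_a1, key_b)
print(f"same speaker: {same:.3f}, different speakers: {different:.3f}")
```

In a real system the key would come from a trained model and would be fed to a speech generator. This sketch only covers the compression step.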
In its current form the algorithm can do all of this, but its output still has clarity problems and is far from a real human voice. The system requires much less voice data than other systems of its kind, and the best part is that it works in real time. Sotelo said, “We are currently raising funds and growing our engineering team. We are working on improving the quality of the audio to make it less robotic, and we hope to start beta testing soon.”
The company plans to sell the system to developers for use in applications such as audiobook narration, speech synthesis for people with disabilities, and personal AI assistants. In the future, you could have your voice assistant sound like any person you want.
Even before hearing about the brilliant uses of the algorithm, one can imagine the ethical and security problems it might cause. Once the system is refined to the point of perfect imitation, it will be nearly impossible to tell a real human voice from the algorithm impersonating it. The already ambiguous world of truth and lies will become a total jumble in which a fake recording of any person’s speech can fool even security experts. This might be the end of the era in which audio recordings are reliable, and Lyrebird is aware of it.
“We take seriously the potential malicious applications of our technology. We want this technology to be used for good purposes: giving back the voice to people who lost it to sickness, being able to record yourself at different stages in your life and hearing your voice later on, etc. Since this technology could be developed by other groups with malicious purposes, we believe that the right thing to do is to make it public and well-known so we stop relying on audio recordings [as evidence].”
Every problem invites a solution, and some solutions bring new problems. As scary as this might sound, we have all been through a time when photographic evidence stopped being reliable as photo manipulation software advanced. This will be similar, but even when a human can be fooled by an impersonator, a computer may well be able to tell a fabricated voice from a real one. Systems can be developed to detect telltale signs such as the absence of background noise, a faked acoustic space, or similar differences.
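As a rough illustration of one of the signs mentioned above, the absence of background noise, the Python sketch below estimates how quiet a recording gets during its pauses and flags it if the pauses are suspiciously silent. This is a toy heuristic under stated assumptions, not a real detector: the signals are synthetic, and the names `noise_floor` and `looks_too_clean`, the frame size, and the threshold are all invented for this example.

```python
import math
import random

def frame_rms(samples, frame_size):
    """RMS energy of each non-overlapping frame of the signal."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_size]) / frame_size)
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]

def noise_floor(samples, frame_size=100):
    """Estimate the background level as the mean RMS of the quietest
    tenth of frames (at least one frame)."""
    energies = sorted(frame_rms(samples, frame_size))
    if not energies:
        raise ValueError("recording shorter than one frame")
    quietest = energies[:max(1, len(energies) // 10)]
    return sum(quietest) / len(quietest)

def looks_too_clean(samples, threshold=1e-3):
    """Flag recordings whose pauses are near digital silence.
    The threshold is an arbitrary value chosen for this toy data."""
    return noise_floor(samples) < threshold

def toy_recording(noise_sigma, seed=0):
    """Synthetic signal: bursts of a tone separated by pauses, with
    optional Gaussian background noise throughout."""
    rng = random.Random(seed)
    samples = []
    for i in range(2000):
        voiced = (i // 200) % 2 == 0
        tone = 0.5 * math.sin(2 * math.pi * i / 20) if voiced else 0.0
        noise = rng.gauss(0.0, noise_sigma) if noise_sigma > 0 else 0.0
        samples.append(tone + noise)
    return samples

with_room_noise = toy_recording(noise_sigma=0.01)
digitally_silent = toy_recording(noise_sigma=0.0)

print("with room noise flagged:", looks_too_clean(with_room_noise))
print("digitally silent flagged:", looks_too_clean(digitally_silent))
```

A check this simple is easy to defeat by mixing noise into the fake, which is the arms race the article goes on to describe.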
One day, speech synthesis technology may be refined enough to replicate everything, be it speech quality, breathing noises, lip smacking, or any other fine detail, to the point where not even a machine can tell the real voice from the impersonation. On that day, all audio recordings and voice evidence will lose their credibility.