Voice cloning attacks: all you need to know


Deepfakes are scary. We now live in a world where we can't trust our eyes and ears in front of a screen.

That’s a very strong shift in our reality and how we should behave within it.

AI is rising. Vishing is rising.

So in this article, I wanted to dive deep into voice cloning attacks, something we have both seen in the news and used ourselves during our vishing simulations.

Key takeaways

  • Voice cloning is now easy, scalable, and widely available – Tools like ElevenLabs enable realistic voice cloning with just seconds of audio, making the technology accessible to attackers for use in social engineering and vishing.
  • Voice is no longer a reliable identifier – Since people naturally trust familiar voices, attackers using cloned voices can bypass skepticism, making voice-based impersonation particularly effective and dangerous.
  • Employee training is currently the best defense – As technical defenses lag behind, the most practical protection is frequent, realistic training programs that teach people to verify identities through independent channels before taking action.

Voice cloning technology

Despite its futuristic name and the negative bias we have against it as cybersecurity professionals, voice cloning is a genuinely useful technology in many domains.

Podcasters can correct inaccuracies within a recording with simple text editors, without having to hit record again.

Movie directors and sound engineers can fix flawed voice recordings in post-production without recalling the actor and paying the associated fees.

Customer-facing services can scale a single agent's voice across many live audio interactions.

So here is the thing: it’s not science fiction anymore. It’s mature and widely adopted.

The scariest part?

It's very easy to do. There are many plug-and-play solutions, like ElevenLabs, that can clone a voice from a sample of just 15 to 30 seconds of audio.

We’ve been doing this for a year now and have seen the technology improve at a crazy pace.

Why are voice cloning attacks so dangerous?

As we've shown through numerous vishing examples, vishing is already a very serious threat. Add voice cloning to it, and it becomes deadly.

Voice cloning attacks are extremely dangerous because, for all of human history, we have used voice as an authentication factor.

What I mean is this: if someone you know calls you, even from an unknown number, on an unscheduled, out-of-the-blue call, you will recognize them by their voice alone.

And you won’t question their identity.

Voice cloning attacks are so new that we don't imagine for a moment that the voice on the phone could be a synthetic clone controlled by an attacker.

And as I just said above, it is really easy to deploy and use.

This is why voice cloning attacks are so dangerous.

Types of voice cloning attacks

There are different types of voice cloning attacks.

Of course, given our work at Arsen, we first think about vishing and using voice clones in outbound or inbound phone calls to build trust in a social engineering engagement.

But there are many cases where voice cloning is used in an attack context.

For instance, it was used in an attempt to manipulate elections in Slovakia, where a deepfake audio clip was released at a strategic moment to sway voters' opinions.

Back to vishing, though, we see two main usages:

  • Live voice conversion, transforming a caller's voice into someone else's in real time
  • Fully software-operated voices, driven by automated calling systems

The main difference at the moment: a human operator using a voice transformer is more adaptive and realistic, reacting better to the conversation, while automated software is far more scalable and can parallelize calls.

Note that I've written "at the moment": the rate of progress here is incredible, and I expect AI attack frameworks to quickly catch up with human operators in offensive engagements.

How to protect against these attacks

As with any new threat, defensive systems aren't yet mature enough to protect companies against it.

I expect phone systems and VoIP gateways to integrate deepfake filters at some point, along with stricter call filtering based on blacklists and reputation. But between development and adoption, your best defense is to train your people against these new threats.

Building a comprehensive vishing awareness training program will be key.

I’ve written a complete article about this but the key point is to make people adopt new reflexes and processes to detect and mitigate the threat.

Because voice phishing will be mixed with social engineering techniques that trigger emotional reactions, you'll need frequent training and a robust double-verification process: people stop the communication, contact the caller back through a different channel, and perform strong authentication before releasing information or complying with requests, however urgent they seem.
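To make that double-verification reflex concrete, here is a minimal sketch of such a policy in Python. Everything in it is illustrative: the directory data, the function name, and the returned actions are assumptions, not a real product or API; the point is simply that the inbound caller ID is never trusted, and a callback on an independently known number plus a second factor is required before any sensitive action.

```python
# Hypothetical sketch of an out-of-band "callback verification" policy.
# The directory, names, and actions below are illustrative only.

# Trusted internal directory: identity -> independently known phone number.
DIRECTORY = {
    "alice.cfo": "+33-1-00-00-00-01",
}

def verify_caller(claimed_identity: str, inbound_number: str) -> dict:
    """Decide what to do BEFORE honoring any sensitive request from a call."""
    known_number = DIRECTORY.get(claimed_identity)
    if known_number is None:
        # Unknown identity: refuse and escalate, never improvise.
        return {"action": "refuse", "reason": "identity not in directory"}
    # Never trust the inbound number, even if it matches the directory:
    # caller ID is spoofable and the voice itself may be a clone.
    # Hang up, call back on the directory number, and require a second
    # factor (e.g. a code from the company authenticator) before acting.
    return {
        "action": "callback",
        "callback_number": known_number,
        "require_second_factor": True,
    }
```

The design choice worth noting is that the policy never branches on whether the voice "sounds right" or the number "looks familiar": both signals are exactly what a voice cloning attack forges, so the only inputs that matter are the independent directory and the second factor.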

Conclusion

In this article, I wanted to highlight voice cloning attacks not as a hypothetical, future risk but as an already mature and active threat.

Given its ease of use and scalability, it's critical to build strong reflexes against it.

This is what our platform does, allowing you to craft automated, realistic vishing simulations and vishing awareness programs.

You can read more about our vishing protection solutions and request a demonstration if you want to know more about it.

