Can AI dub movies into different languages?

Let's find out.

Background: Recently, we have seen a lot of advancements in tech, and AI in particular, in the video industry. One advancement in particular has blown my mind: AI video translation.

Now, AI can translate audio and video into different languages with perfect lip sync, while retaining characteristics of the original audio like voice, intonation, tone, and bass.

The first thing that came to mind when I saw this was movie dubbing. The film industry, especially in India and the US, spends millions of dollars dubbing movies: it hires different voice actors for each language, trains them on the scenes, and so on.

With AI, it could become easy to dub a movie in hours or days. So I started researching this.

And the experience for viewers would be 100x better. Imagine watching James Bond speak in your regional language, with the same voice and perfect lip sync. It would be surreal.

I even talked to a Hollywood director to see how we could dub one of his movies with Keanu Reeves into Japanese.

After trying different AI tools for months, talking to a few experts, and emailing AI founders at midnight, I concluded that we don't yet have the technology to dub movies completely.

Here are four reasons why (and when) AI fails to dub some scenes.

1. When the voice and the actor on screen don’t match

This is how the AI translates the voice (roughly; a code sketch follows the list below):

i. Separate the audio and video

ii. Translate the audio to the target language.

iii. Find the lips of the actor in every frame of the scene.

iv. Regenerate each frame with the lips moving as per the audio.
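To make those steps concrete, here is a minimal sketch of that pipeline in Python. Every function is a placeholder I wrote to illustrate the flow; none of these names come from a real dubbing tool.

```python
# A rough sketch of the four steps above.
# All of these functions are hypothetical placeholders, not a real tool's API.

def separate_audio_and_video(movie_path):
    """Step i: split the file into video frames and an audio track."""
    raise NotImplementedError  # placeholder

def translate_audio(audio, target_language):
    """Step ii: speech-to-speech translation that keeps the original voice."""
    raise NotImplementedError  # placeholder

def find_lips(frames):
    """Step iii: locate the actor's lips in every frame of the scene."""
    raise NotImplementedError  # placeholder

def regenerate_frames(frames, lip_regions, dubbed_audio):
    """Step iv: redraw each frame so the lips move with the new audio."""
    raise NotImplementedError  # placeholder

def dub_scene(movie_path, target_language):
    frames, audio = separate_audio_and_video(movie_path)
    dubbed_audio = translate_audio(audio, target_language)
    lip_regions = find_lips(frames)
    new_frames = regenerate_frames(frames, lip_regions, dubbed_audio)
    # Note: nothing in this flow ever asks *whose* lips should be moving,
    # which is exactly the failure described next.
    return new_frames, dubbed_audio
```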

Great, right? But it forgets to check one thing.

Does the voice match the actor on screen? Let me explain.

Let's assume two actors in a scene are talking or arguing: Actor A and Actor B.

When Actor A is speaking and he is the only one visible in the scene, the AI can match his lips to the translated audio perfectly. But in many scenes, when Actor A is talking, we actually see Actor B on screen.

This is exactly what happens in the clip from Avengers below. When Cap is talking, we see Tony on screen. Since the AI doesn't check whether the voice belongs to Tony, his lips start moving anyway, and vice versa.

You can also see this in the second clip from Spiderman: ATSV. For most of the scene, only Miguel speaks as he narrates a key concept of the movie. But the AI keeps moving the lips of whichever character is on screen: Gwen, Miguel from the past, even the video of Miguel's daughter.

So, AI translation doesn't work (a sketch of the missing check follows this list):

i. When there is more than one actor in the scene

ii. When the camera keeps moving from one actor to another in an intense dialogue

iii. When a scene is being narrated by a character who is not in the scene
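Here is a minimal sketch of the check that seems to be missing, assuming two hypothetical helpers (identify_speaker and identify_face) that can tell who is talking and whose face is on screen. This is not how any specific tool works today; it just shows the idea.

```python
# Hypothetical gate: only animate a face if it belongs to the character
# who is actually speaking at this moment. Both helpers are assumptions.

def should_regenerate_lips(frame, audio_segment, identify_face, identify_speaker):
    """Return True only if the visible face belongs to the current speaker."""
    speaker = identify_speaker(audio_segment)  # e.g. "cap", "miguel", or None
    face = identify_face(frame)                # whoever is on screen right now
    return face is not None and face == speaker

# With this gate, Tony's lips would stay still while Cap talks off-screen,
# and an off-screen narrator would not trigger lip edits for anyone.
```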

2. Sentence length changes with language, even when the meaning is the same

Different languages have different alphabets.

For example, Hindi has 52 letters, so the same thing can often be written or said more briefly than in English.

This creates a difference in sentence length, which causes inconsistencies in the lip movements.
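Here is a small sketch of what that mismatch looks like in practice, using librosa to compare durations and crudely time-stretch the translated line. The file names are made up, and real dubbing tools likely handle this more carefully.

```python
import librosa

# Made-up file names: the original English line and its Japanese translation.
original, sr = librosa.load("miguel_line_en.wav", sr=None)
translated, _ = librosa.load("miguel_line_ja.wav", sr=sr)

orig_dur = librosa.get_duration(y=original, sr=sr)
trans_dur = librosa.get_duration(y=translated, sr=sr)
print(f"original: {orig_dur:.2f}s, translated: {trans_dur:.2f}s")

# Crude fix: time-stretch the translated audio to match the original length.
# Stretch too far and the voice sounds unnatural, which is why the model
# instead closes the actor's mouth early, as in the ATSV clip below.
fitted = librosa.effects.time_stretch(translated, rate=trans_dur / orig_dur)
```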

You can see this at the beginning of the second clip from Spiderman: ATSV. The English sentence is a bit longer than the Japanese one, so the AI model tried to close Miguel's mouth while it was still moving in the original English. It doesn't completely throw you out of the movie, but you can easily notice it.

You can also notice this in a few shots with Tony from the Avengers clip.

3. Overlapping voices.

The AI is not yet sophisticated enough to separate different voices in a scene when they overlap. You can notice this in the Spiderman: ATSV scene below.

When Gwen and Miles speak immediately after Miguel does, the voice doesn't change. The lips sync perfectly, but the audio is still in Miguel's voice.
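One way to at least catch this would be to flag spans where two speakers overlap, so those lines could be dubbed manually. A rough sketch, assuming the speaker segments come from some diarization step (the timings below are made up):

```python
def overlapping_spans(segments):
    """Yield (start, end) windows where two different speakers talk at once.

    `segments` is a list of (start, end, speaker) tuples, e.g. the output of
    a speaker-diarization step (assumed format, not a specific library's).
    """
    segments = sorted(segments)  # sort by start time
    for i, (start_a, end_a, spk_a) in enumerate(segments):
        for start_b, end_b, spk_b in segments[i + 1:]:
            if start_b >= end_a:
                break  # later segments start even later, so no more overlaps
            if spk_a != spk_b:
                yield max(start_a, start_b), min(end_a, end_b)

# Made-up timings: Gwen and Miles cut in while Miguel is still talking.
segments = [(0.0, 12.0, "miguel"), (10.5, 11.5, "gwen"), (11.0, 12.5, "miles")]
print(list(overlapping_spans(segments)))
# -> [(10.5, 11.5), (11.0, 12.0), (11.0, 11.5)]
```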

4. When the actor is not looking toward the camera.

As I mentioned above, the AI detects the lips in a scene and replaces those frames with newly generated ones.

But sometimes it doesn't detect the lips correctly when the actor is shot from a side angle. You can see this in the last clip, from "Marriage Story".

Strangely, in some frames Scarlett's lips are clearly visible and yet they are not changed.

So I asked an expert, and he said it might be because she is shaking her head violently and the model is unable to pick the right spot to regenerate the lips.
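If that is the cause, one plausible mitigation is a per-frame confidence gate: only regenerate the lip region when the detector is confident it actually found lips. A sketch, with detect_lips as an assumed helper rather than a real API:

```python
def frames_safe_to_edit(frames, detect_lips, min_confidence=0.8):
    """Return indices of frames where replacing the lip region looks safe.

    `detect_lips(frame)` is assumed to return (bounding_box, confidence),
    with bounding_box = None when no lips are found.
    """
    safe = []
    for index, frame in enumerate(frames):
        box, confidence = detect_lips(frame)
        # Side profiles and fast head movement (as in the Marriage Story clip)
        # tend to give low-confidence or missing detections, so those frames
        # are skipped and the original mouth movement shows through.
        if box is not None and confidence >= min_confidence:
            safe.append(index)
    return safe
```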

Again, these AI models are black-box neural networks and are not completely explainable. (That's a topic for a different day.)

There are a few other issues too. For example, the AI sometimes fails to detect the correct number of speakers in a video, which leads to the same voice being used for two different characters.

I believe all the issues above can be fixed with incremental improvements and as the models are trained on more high-quality data.

Mark my words: movie dubbing using AI will become a thing by the end of this decade.

Currently, I am playing with these AI tools to see if they can dub long-form YouTube videos like podcasts. I will make a post on it once I have something to share. Stay tuned. Bye.
