Microsoft has been testing technology it considers too risky to release: its VASA-1 AI can turn a single image of anyone into a talking video, and the demo has prompted widespread apprehension about how it could be used. Microsoft’s deepfake technology is far from the only AI service exploring image or video editing, but its potential for misuse is worrying. Epic Rap Battles of History might have been one of the first YouTube channels to make Leonardo da Vinci rap, but the Microsoft AI face animation tool has now turned the Mona Lisa into a rapper as well. The video is part funny and part horrifying, yet no one can claim the tech isn’t fascinating.

Image: Microsoft AI gives life to rapper Mona Lisa

Microsoft VASA-1 AI: We Always Knew Mona Lisa Had It in Her

The Microsoft AI Mona Lisa rap video is aggressively good. It shows the painting of the Mona Lisa rapping to the audio from a 2011 clip of Anne Hathaway on Conan. The vocals come from an episode in which the lovely and unassuming Princess Diaries star rapped about the paparazzi, her performance styled after Lil Wayne. This music choice for showcasing the Microsoft deepfake technology may or may not have been intended to distract you from thinking too hard about the applications of such tech, but people have gotten over their initial amusement and are once again questioning the intentions behind developing such a tool.

Understanding the Microsoft AI Face Animation Tool

If you’re still confused about what the Microsoft VASA-1 AI does, it generates “lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip.” The generated clip not only moves the mouth of the character in the frame but also produces lifelike head movements and expressions that mimic a real speaker, and that is where the true capability of the AI shines. The AI syncs the movements it generates in the image to the audio, reflecting how an actual speaker might deliver those lines and pushing its capabilities far beyond what a Snapchat filter overlay can do.

“The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos,” states the research abstract for the Microsoft VASA-1 AI. The AI is able to generate 512×512 videos at up to 40 FPS, and the results are promising, to say the least. Unfortunately, the tool is currently best known under the label of Microsoft’s “deepfake technology.”
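To put those figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted in this article (512×512 resolution, up to 40 FPS, and clips of around one minute); the variable names are ours, and the result says nothing about how VASA-1 is actually implemented.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the article.
# Variable names are illustrative; this is not Microsoft's code.
WIDTH, HEIGHT = 512, 512   # output resolution in pixels
FPS = 40                   # "up to 40 FPS"
CLIP_SECONDS = 60          # the article mentions one-minute-long videos

frame_budget_ms = 1000 / FPS                  # time available per frame at 40 FPS
frames_per_clip = FPS * CLIP_SECONDS          # frames in a one-minute clip
pixels_per_second = WIDTH * HEIGHT * FPS      # pixels generated every second

print(f"Time budget per frame: {frame_budget_ms} ms")
print(f"Frames in a one-minute clip: {frames_per_clip}")
print(f"Pixels generated per second: {pixels_per_second:,}")
```

In other words, at the top quoted speed the model has about 25 milliseconds to produce each frame, and a one-minute clip amounts to 2,400 generated frames.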

How Did We Arrive at Microsoft AI’s Mona Lisa Rap Video?

The Microsoft AI’s Mona Lisa rap video is just one example of the content the AI is able to generate. The Microsoft VASA-1 AI announcement showcased the full range of what the tool can do, using very human-like portraits generated by StyleGAN2 or DALL-E-3 as its source images. The AI can work on a diverse range of subjects and generate one-minute-long videos, fusing the chosen audio clip with the video it generates from the image. There are different ways to customize the video too.

Whether you want a demure speaker gazing off-camera or a confident one looking straight into the lens, the Microsoft AI face animation tool can customize the result for you. If you want the speaker to show a different emotion as they narrate, the VASA-1 AI can, very realistically, make them emote differently. These editing liberties don’t just apply to realistic photos either. They can be used on artistic photos, and the tool can also replicate the mannerisms of someone singing or speaking a different language.

The tool’s disentangled design allows the user to alter individual elements of the video independently of one another. The pose and expression editing tools show how you might be able to control specific aspects of the generated video, making it a versatile service that could see a wide range of uses. The cultural impact of Hatsune Miku and MAVE: tells us that we are largely willing to embrace virtual idols just as we do human ones.

The More Worrying Implications of Microsoft’s Deepfake Technology

Aware of how the public might react to the technology, the company put out a disclaimer right at the start: “This is only a research demonstration and there’s no product or API release plan.” The choice to use AI-generated images was also a strategic one, ensuring that no real person could accuse Microsoft of using their likeness in the demonstrations. “We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection,” the announcement states.

At first glance, the now popular clip of Microsoft AI rapper Mona Lisa is a lot of fun. It shows us how people might someday be able to commemorate loved ones who are no longer around or how we could put different bits of pop culture together to create a single masterpiece. But when you look at how realistic the generated videos are, you can envision just how the AI could be put to use for nefarious purposes.

The company’s decision not to release the AI to the world at large stems from its awareness of the potential for misuse. Microsoft’s deepfake technology and other similar video editing tools could blur the line between what is real and what isn’t, making it far harder for everyone to trust what they see. From politicians to your next-door neighbor, anyone could fall victim to such advanced tools, which is why stringent checks need to be in place before any such AI is made easily available.

Despite the drawbacks, discouraging such research is not the answer. If big corporations with the funds and resources to test these tools don’t stay ahead of the AI curve, someone else with the same intellectual prowess will make the same breakthrough. They might not have the resources necessary to monitor its release, however, leading to new AI services that are entirely unregulated. The Microsoft VASA-1 AI tool is an incredible step forward in the potential of artificial intelligence, but it’s one that needs to be met with caution and discretion every step of the way.