Artificial intelligence is getting better and better at generating an image in response to a handful of words, as publicly available AI image generators such as DALL-E 2 and Stable Diffusion show. Now, Meta researchers are taking AI a step further: they're using it to concoct videos from a text prompt.
Meta CEO Mark Zuckerberg posted on Facebook on Thursday about the research, called Make-A-Video, with a 20-second clip that compiled several text prompts that Meta researchers used and the resulting (very short) videos. The prompts include “A teddy bear painting a self portrait,” “A spaceship landing on Mars,” “A baby sloth with a knitted hat trying to figure out a laptop,” and “A robot surfing a wave in the ocean.”
The videos for each prompt are just a few seconds long, and they generally show what the prompt suggests (with the exception of the baby sloth, which doesn't look much like the actual creature), in a fairly low-resolution and somewhat jerky style. Even so, the clips demonstrate a fresh direction AI research is taking as systems become increasingly good at generating images from words. If the technology is eventually released widely, though, it will raise many of the same concerns sparked by text-to-image systems, such as the possibility that it could be used to spread misinformation via video.
A web page for Make-A-Video includes these short clips and others, some of which look fairly realistic, such as a video created in response to the prompt “Clown fish swimming through the coral reef” or one meant to show “A young couple walking in a heavy rain.”
In his Facebook post, Zuckerberg pointed out how tricky it is to generate a moving image from a handful of words.
“It's much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they'll change over time,” he wrote.
A research paper describing the work explains that the project uses a text-to-image AI model to figure out how words correspond with pictures, and an AI technique known as unsupervised learning — in which algorithms pore over data that isn't labeled to discern patterns within it — to look at videos and determine what realistic motion looks like.
Like the massive, popular AI systems that generate images from text, the researchers' text-to-image AI model was trained on internet data, which means it learned “and likely exaggerated social biases, including harmful ones,” the researchers wrote. They did note that they filtered data for “NSFW content and toxic words,” but as datasets can include many millions of images and text, it may not be possible to remove all such content.
Zuckerberg wrote that Meta plans to share the Make-A-Video project as a demo in the future.