Like “Avengers” director Joe Russo, I’m becoming increasingly convinced that fully AI-generated movies and TV shows will be possible within our lifetimes.
A host of AI unveilings over the past few months, in particular OpenAI’s ultra-realistic-sounding text-to-speech engine, have given glimpses into this brave new frontier. But Meta’s announcement today put our AI-generated content future into especially sharp relief — for me at least.
Meta this morning debuted Emu Video, an evolution of the tech giant’s image generation tool, Emu. Given a caption (e.g. “A dog running across a grassy knoll”), an image or a photo paired with a description, Emu Video can generate a four-second animated clip.
Emu Video’s clips can be edited with a complementary AI model called Emu Edit, which was also announced today. Users can describe the modifications they want to make to Emu Edit in natural language — e.g. “the same clip, but in slow motion” — and see the changes reflected in a newly generated video.
Now, video generation tech isn’t new. Meta’s experimented with it before, as has Google. Meanwhile, startups like Runway are already building businesses on it.
But Emu Video’s 512×512, 16-frames-per-second clips are easily among the best I’ve seen in terms of their fidelity — to the point where my untrained eye has a tough time distinguishing them from the real thing.
Well — at least some of them. It seems Emu Video is most successful at animating simple, mostly static scenes (e.g. waterfalls and timelapses of city skylines) that stray from photorealism — that is to say, clips in styles like cubism, anime, “paper cut craft” and steampunk. One clip of the Eiffel Tower at dawn “as a painting,” with the tower reflected in the River Seine beneath it, reminded me of an e-card you might see on American Greetings.
Even in Emu Video’s best work, however, AI-generated weirdness manages to creep in — like bizarre physics (e.g. skateboards that move parallel to the ground) and freaky appendages (toes that curl behind feet and legs that blend into each other). Objects also often appear and fade from view without much logic, like the birds overhead in the aforementioned Eiffel Tower clip.
After much too much time spent browsing Emu Video’s creations (or at least the examples that Meta cherry-picked), I started to notice another obvious tell: subjects in the clips don’t… well, do much. So far as I can tell, Emu Video doesn’t appear to have a strong grasp of action verbs, perhaps a limitation of the model’s underpinning architecture.
For example, a cute anthropomorphized raccoon in an Emu Video clip will hold a guitar, but it won’t strum the guitar — even if the clip’s caption includes the word “strum.” Or two unicorns will “play” chess, but only in the sense that they’ll sit inquisitively in front of a chessboard without moving the pieces.
So clearly there’s work to be done. Still, Emu Video’s more basic b-roll wouldn’t be out of place in a movie or TV show today, I’d say — and the ethical ramifications of this frankly terrify me.
The deepfake risk aside, I fear for animators and artists whose livelihoods depend on crafting the sorts of scenes AI like Emu Video can now approximate. Meta and its generative AI rivals would likely argue that Emu Video, which Meta CEO Mark Zuckerberg says is being integrated into Facebook and Instagram (hopefully with better toxicity filters than Meta’s AI-generated stickers), augments rather than replaces human artists. But I’d say that’s taking the optimistic, if not disingenuous, view — especially where money’s involved.
Earlier this year, Netflix used AI-generated background images in a three-minute animated short. The company claimed that the tech could help with anime’s supposed labor shortage — but conveniently glossed over how low pay and often strenuous working conditions are pushing artists away from the work.
In a similar controversy, the studio behind the credit sequence for Marvel’s “Secret Invasion” admitted to using AI, mainly the text-to-image tool Midjourney, to generate much of the sequence’s artwork. Series director Ali Selim made the case that the use of AI fits with the paranoid themes of the show, but the bulk of the artist community and fans vehemently disagreed.
Actors could be on the chopping block, too. One of the major sticking points in the recent SAG-AFTRA strike was the use of AI to create digital likenesses. Studios ultimately agreed to pay actors for their AI-generated likenesses. But might they reconsider as the tech improves? I think it’s likely.
Adding insult to injury, AI like Emu Video is usually trained on images and videos produced by artists, photographers and filmmakers — and without notifying or compensating those creators. In a whitepaper accompanying the release of Emu Video, Meta says only that the model was trained on a dataset of 34 million “video-text pairs” ranging in length from five to 60 seconds — not where those videos came from, their copyright statuses or whether Meta licensed them.
(Following this article’s publication, a Meta spokesperson told TechCrunch via email that Emu was trained on “data from licensed partners.”)
There have been fits and starts toward industry-wide standards that would allow artists to “opt out” of training or receive payment for AI-generated works to which they contributed. But if Emu Video is any indication, the tech — as so often happens — will soon run far ahead of ethics. Perhaps it already has.