Over the past two years, AI-powered image generators have become commodified, more or less, thanks to the widespread availability of — and decreasing technical barriers around — the tech. They’ve been deployed by practically every major tech player, including Google and Microsoft, as well as countless startups angling to nab a slice of the increasingly lucrative generative AI pie.
That isn’t to suggest they’re consistent yet, performance-wise — far from it. While the quality of image generators has improved, it’s been incremental, sometimes agonizing progress.
But Meta claims to have had a breakthrough.
Today, Meta announced CM3leon (“chameleon” in clumsy leetspeak), an AI model that the company claims achieves state-of-the-art performance for text-to-image generation. CM3leon is also distinguished by being one of the first image generators capable of generating captions for images, laying the groundwork for more capable image-understanding models going forward, Meta says.
“With CM3leon’s capabilities, image generation tools can produce more coherent imagery that better follows the input prompts,” Meta wrote in a blog post shared with TechCrunch earlier this week. “We believe CM3leon’s strong performance across a variety of tasks is a step toward higher-fidelity image generation and understanding.”
Most modern image generators, including OpenAI’s DALL-E 2, Google’s Imagen and Stable Diffusion, rely on a process called diffusion to create art. In diffusion, a model learns how to gradually subtract noise from a starting image made entirely of noise — moving it closer step by step to the target prompt.
The results are impressive. But diffusion is computationally intensive, making it expensive to operate and slow enough that most real-time applications are impractical.
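To make the step-by-step idea concrete, here is a toy sketch of the denoising loop in plain Python. It is not how DALL-E 2, Imagen or Stable Diffusion are implemented: the function name `denoise`, the `steps` parameter and the 1/t schedule are all illustrative assumptions, and the "noise prediction" is simply the true difference from a known target, where a real system would run a large neural network.

```python
import random

def denoise(target, steps=50, seed=0):
    """Toy reverse process: start from pure noise and step toward `target`."""
    rng = random.Random(seed)
    # Start from an "image" (here, a short list of numbers) made entirely of noise.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in range(steps, 0, -1):
        # Assumption: a trained model would predict the noise at this step.
        # This stand-in cheats and uses the true difference from the target.
        predicted_noise = [xi - ti for xi, ti in zip(x, target)]
        # Remove a fraction 1/t of the predicted noise, so the final step
        # (t == 1) removes whatever noise is left.
        x = [xi - ni / t for xi, ni in zip(x, predicted_noise)]
    return x

image = denoise([0.2, 0.8, 0.5], steps=10)
```

In a real diffusion model, each pass through that loop is a full run of the network, which is where the cost and latency come from.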
CM3leon is a transformer model, by contrast, leveraging a mechanism called “attention” to weigh the relevance of input data such as text or images. Attention and the other architectural quirks of transformers can boost model training speed and make models more easily parallelizable. Larger and larger transformers can be trained with significant but not unattainable increases in compute, in other words.
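For readers curious about what "weighing relevance" looks like, below is a minimal sketch of scaled dot-product attention, the textbook form of the mechanism, in plain Python. It is an illustration under that assumption only; Meta's actual CM3leon architecture is not reproduced here, and the function names `softmax` and `attention` are our own.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Score how relevant each input position is to this query.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the relevance-weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Two inputs that match the query equally well get equal weight.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0], [4.0, 0.0]])
```

Note that each query's output is computed independently of the others, which is part of why this kind of model parallelizes well on modern hardware.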
And CM3leon is even more efficient than most transformers, Meta claims, requiring five times less compute and a smaller training data set than previous transformer-based methods.
Interestingly, OpenAI explored transformers as a means of image generation several years ago with a model called Image GPT. But it ultimately abandoned the idea in favor of diffusion — and might soon move on to “consistency.”
To train CM3leon, Meta used a data set of millions of licensed images from Shutterstock. The most capable of several versions of CM3leon that Meta built has 7 billion parameters, over twice as many as DALL-E 2. (Parameters are the parts of the model learned from training data and essentially define the skill of the model on a problem, like generating text — or, in this case, images.)
One key to CM3leon’s stronger performance is a technique called supervised fine-tuning, or SFT for short. SFT has been used to train text-generating models like OpenAI’s ChatGPT to great effect, and Meta theorized that it could be useful when applied to the image domain as well. Indeed, this kind of instruction tuning improved CM3leon’s performance not only on image generation but also on image caption writing, enabling it to answer questions about images and edit images by following text instructions (e.g. “change the color of the sky to bright blue”).
Most image generators struggle with “complex” objects and text prompts that include too many constraints. But CM3leon doesn’t, or at least not as often. In a few cherry-picked examples, Meta had CM3leon generate images using prompts like “A small cactus wearing a straw hat and neon sunglasses in the Sahara desert,” “A close-up photo of a human hand, hand model,” “A raccoon main character in an Anime preparing for an epic battle with a samurai sword” and “A stop sign in a Fantasy style with the text ‘1991.’”
For the sake of comparison, I ran the same prompts through DALL-E 2. Some of the results were close. But the CM3leon images were generally closer to the prompt and more detailed to my eyes, with the signage being the most obvious example. (Until recently, diffusion models handled both text and human anatomy relatively poorly.)
CM3leon can also understand instructions to edit existing images. For example, given the prompt “Generate high quality image of ‘a room that has a sink and a mirror in it’ with bottle at location (199, 130),” the model can generate something visually coherent and, as Meta puts it, “contextually appropriate”: room, sink, mirror, bottle and all. DALL-E 2 utterly fails to pick up on the nuances of prompts like these, at times completely omitting the objects specified in the prompt.
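The prompt quoted above follows a simple template: a scene description, an object and a pair of coordinates. As a rough illustration, a prompt in that shape could be assembled like this. The helper `build_layout_prompt` is hypothetical, and since Meta hasn't released the model, nothing here reflects an actual CM3leon interface; the sketch only builds the text.

```python
def build_layout_prompt(description, obj, x, y):
    """Hypothetical helper: format a prompt in the style Meta quoted."""
    return (
        f"Generate high quality image of '{description}' "
        f"with {obj} at location ({x}, {y})"
    )

prompt = build_layout_prompt(
    "a room that has a sink and a mirror in it", "bottle", 199, 130
)
print(prompt)
```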
And, of course, unlike DALL-E 2, CM3leon can follow a range of prompts to generate short or long captions and answer questions about a particular image. In these areas, the model performed better than even specialized image captioning models (e.g. Flamingo, OpenFlamingo) despite seeing less text in its training data, Meta claims.
But what about bias? Generative AI models like DALL-E 2 have been found to reinforce societal biases, after all, generating images of positions of authority — like “CEO” or “director” — that depict mostly white men. Meta leaves that question unaddressed, saying only that CM3leon “can reflect any biases present in the training data.”
“As the AI industry continues to evolve, generative models like CM3leon are becoming increasingly sophisticated,” the company writes. “While the industry is still in its early stages of understanding and addressing these challenges, we believe that transparency will be key to accelerating progress.”
Meta didn’t say whether, or when, it plans to release CM3leon. Given the controversies swirling around open source art generators, I wouldn’t hold my breath.