An example of an image generated by DALL·E 2 from the prompt “an astronaut riding a horse in a photorealistic style”. Credit: OpenAI
A new method developed by MIT researchers uses multiple models working together to create more complex images with better understanding.
With the introduction of DALL-E, the internet had a collective feel-good moment. The AI-based image generator, whose name nods to the artist Salvador Dalí and the adorable robot WALL-E, uses natural language to produce whatever mysterious and beautiful image your heart desires. Seeing typed-out inputs such as “a smiling gopher holding an ice cream cone” instantly spring to life as an AI-generated image clearly resonates with the world.
It’s no small task to make that smiling gopher and its attributes appear on your screen. DALL-E 2 uses what is called a diffusion model, which tries to encode the entire text into a single description in order to generate an image. But once the text contains many more details, it becomes difficult for a single description to capture them all. And although they are highly flexible, diffusion models sometimes struggle to understand the composition of certain concepts, for example confusing the attributes of different objects or the relationships between them.
This array of generated images, showing “a train on a bridge” and “a river under the bridge”, was generated using a new method developed by researchers at MIT. Credit: Image courtesy of the researchers
To generate more complex images with better understanding, scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) approached the typical single-model setup from a different angle: they composed a set of models that all cooperate to generate the desired image, each capturing a different aspect requested by the input text or labels. To create an image with two components, say, described by two sentences, each model would tackle one particular component of the image.
The seemingly magical models behind image generation work through a series of iterative refinement steps to arrive at the desired image. They start with a “bad” image and gradually refine it until it becomes the chosen image. By composing multiple models together, the method has them jointly refine the image’s appearance at each step, so the result is an image that exhibits all the attributes of each model. Having several models cooperate makes much more creative combinations possible in the generated images.
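For readers who want a concrete picture, here is a minimal sketch of that joint refinement loop – an illustration rather than the authors’ released code. It assumes a diffusers-style scheduler with a timesteps sequence and a step() method, and component models callable as model(x, t, prompt) that each return a noise estimate; those interfaces are simplifications chosen for clarity.

```python
# Minimal sketch (illustrative only): several diffusion models denoise the same
# image in parallel, and their per-step noise predictions are combined so that
# every concept shapes the result.
import torch

@torch.no_grad()
def jointly_refine(models, prompts, scheduler, shape=(1, 3, 64, 64)):
    """Start from pure noise and let all component models refine it together."""
    x = torch.randn(shape)  # the initial "bad" image: pure noise
    for t in scheduler.timesteps:
        # Each model predicts the noise it would remove for its own concept.
        eps_per_concept = [m(x, t, p) for m, p in zip(models, prompts)]
        eps = torch.stack(eps_per_concept).mean(dim=0)  # one joint refinement step
        x = scheduler.step(eps, t, x).prev_sample  # move a little closer to the final image
    return x
```

A plain average is the simplest way to combine the predictions; the AND operator described below uses a weighted combination instead.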
This array of generated images, showing “a river leading to mountains” and “red trees on the side”, was generated using a new method developed by researchers at MIT. Credit: Image courtesy of the researchers
Take, for example, “a red truck” and “a green house”. When descriptions like these get more complicated, a typical model confuses the concepts of red truck and green house; a generator like DALL-E 2 might swap the colors and create a green truck and a red house. The team’s approach can handle this kind of binding of attributes to objects, and, especially when there are multiple sets of things, it can handle each object more accurately.
“The model can effectively capture object positions and relational descriptions, which is difficult for existing image generation models. For example, place an object and a cube in one position and a sphere in another. DALL-E 2 is good at generating natural images but sometimes struggles to understand object relationships,” says Shuang Li, a PhD student at MIT CSAIL and co-lead author of the paper. “Beyond art and creativity, perhaps we could use our model for education. If you want to tell a child to put a cube on top of a sphere, and we say it in language, it might be difficult for them to understand. But our model can generate the image and show them.”
Making Dalí Proud
Composable Diffusion – the team’s model – uses diffusion models alongside compositional operators to combine text descriptions without additional training. The team’s approach captures text details more accurately than the original diffusion model, which directly encodes all the words into a single long sentence. For example, given “a pink sky” AND “a blue mountain on the horizon” AND “cherry blossoms in front of the mountain”, the team’s model produced exactly that image, whereas the original diffusion model made the sky blue and everything in front of the mountains pink.
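One way to realize that AND operator, in the spirit of classifier-free guidance, is to let each description pull a shared noise estimate in its own direction. The sketch below assumes a single noise-prediction network exposed as eps_model(x, t, cond), with cond=None meaning unconditional; the function name and the per-concept weights are illustrative assumptions, not the paper’s exact code.

```python
def compose_and(eps_model, x, t, concepts, weights):
    """Combine per-concept noise predictions so that one denoising step respects
    every description at once, e.g. "a pink sky" AND "a blue mountain on the
    horizon" AND "cherry blossoms in front of the mountain"."""
    eps_uncond = eps_model(x, t, cond=None)  # unconditional baseline
    composed = eps_uncond.clone()
    for cond, w in zip(concepts, weights):
        # Each concept adds the direction that makes it more likely in the image.
        composed = composed + w * (eps_model(x, t, cond=cond) - eps_uncond)
    return composed
```

Plugging a combination like this into the refinement loop sketched earlier gives each description its own guidance weight, the idea being that “pink” can stay attached to the sky rather than spilling onto the mountain.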
The researchers were able to create surprising and surreal images with the text “a dog” and “the sky”. On the left are a dog and clouds separately, labeled “dog” and “sky” below, and on the right are two cloud-like images of dogs with the label “dog AND sky” below. Credit: Image courtesy of the researchers
“The fact that our model is composable means that you can learn different parts of the model, one at a time. You can first learn one object on top of another, then learn one object to the right of another, then learn something to the left of another,” says Yilun Du, co-lead author and PhD student at MIT CSAIL. “Since we can compose them together, you can imagine that our system allows us to gradually learn language, relationships or knowledge, which we think is a pretty interesting direction for future work.”
Although it showed prowess in generating complex, photorealistic images, the model still ran into difficulties because it was trained on a much smaller dataset than models like DALL-E 2, so there were certain objects it simply could not capture.
Now that Composable Diffusion can work on top of generative models such as DALL-E 2, the researchers are ready to explore continual learning as a potential next step. Given that more is usually added on when it comes to object relationships, they want to see whether diffusion models can start to “learn” without forgetting previously acquired knowledge – reaching a point where the model can produce images with both the previous and the new knowledge.
This photo illustration was created from images generated by an MIT system called Composable Diffusion, then arranged in Photoshop. Phrases such as “diffusion model” and “network” were used to generate the pink dots and geometric, angular images. The phrase “a horse AND a field of yellow flowers” is included at the top of the image. The generated images of a horse and a yellow field appear on the left, and the combined imagery of a horse in a field of yellow flowers appears on the right. Credit: Jose-Luis Olivares, MIT, and the researchers
“This research proposes a new method for composing concepts in text-to-image generation, not by concatenating them to form a prompt, but rather by computing scores with respect to each concept and composing them using conjunction and negation operators,” explains Mark Chen, co-creator of DALL-E 2 and a research scientist at OpenAI. “It’s a nice idea that takes advantage of the energy-based interpretation of diffusion models, so that old ideas around compositionality using energy-based models can be applied. The approach is also able to make use of classifier-free guidance, and it is surprising to see that it outperforms the GLIDE baseline on various compositional benchmarks and can qualitatively produce very different types of image generations.”
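The negation operator Chen mentions can be sketched the same way, with the sign flipped: the unwanted concept’s direction is subtracted from the composed noise estimate rather than added. The snippet below reuses the hypothetical eps_model(x, t, cond) interface from the earlier sketch and is, again, an illustration rather than the paper’s code.

```python
def negate_concept(eps_model, x, t, composed_eps, cond, weight=7.5):
    """Apply a NOT operator: push the composed noise estimate away from cond."""
    eps_uncond = eps_model(x, t, cond=None)  # unconditional baseline
    return composed_eps - weight * (eps_model(x, t, cond=cond) - eps_uncond)
```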
“Humans can compose scenes with different elements in a myriad of ways, but this task is challenging for computers,” says Bryan Russell, a researcher at Adobe Systems. “This work proposes an elegant formulation that explicitly composes a set of diffusion models to generate an image from a complex natural language prompt.”
Reference: “Compositional Visual Generation with Composable Diffusion Models” by Nan Liu, Shuang Li, Yilun Du, Antonio Torralba and Joshua B. Tenenbaum, 3 June 2022, Computer Vision and Pattern Recognition. arXiv:2206.01714
Alongside Li and Du, the paper’s co-lead author is Nan Liu, a master’s student in computer science at the University of Illinois at Urbana-Champaign, with MIT professors Antonio Torralba and Joshua B. Tenenbaum as co-authors. They will present the work at the 2022 European Conference on Computer Vision.
The research was supported by Raytheon BBN Technologies Corp., Mitsubishi Electric Research Laboratory and DEVCOM Army Research Laboratory.