Still from a video generated by Google Imagen Video from the prompt “A teddy bear doing the dishes.”
Today, Google announced the development of Imagen Video, a text-to-video artificial intelligence model capable of producing 1280×768 video at 24 frames per second from a written prompt. It is currently in the research phase, but its arrival five months after Google’s Imagen text-to-image model points to the rapid development of video synthesis models.
Just six months after the launch of OpenAI’s DALL-E 2 text-to-image generator, progress in the field of AI diffusion models has accelerated rapidly. Google’s Imagen Video announcement comes less than a week after Meta unveiled its own text-to-video AI tool, Make-A-Video.
According to Google’s research paper, Imagen Video includes several notable stylistic capabilities, such as generating videos in the style of famous painters (Vincent van Gogh’s paintings, for example), generating rotating 3D objects while preserving their structure, and rendering text in a variety of animation styles. Google hopes that general-purpose video generation models can “significantly reduce the difficulty of generating high-quality content.”
Key to Imagen Video’s capabilities is a “cascade” of seven diffusion models that transforms the initial text prompt (such as “a bear doing the dishes”) into a low-resolution video (16 frames at 24×48 pixels, 3 fps), then upscales it to progressively higher resolutions and frame rates at each step. The final output video is 5.3 seconds long.
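To make the cascade idea concrete, here is a minimal, purely illustrative Python sketch of such a pipeline. The base model’s output shape and the final frame count and frame rate follow the figures quoted above; the stage ordering, the upscale factors, and all function names are hypothetical placeholders for this article, not Google’s actual architecture or code.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """Shape metadata for a generated video clip."""
    frames: int   # number of frames
    height: int   # pixels
    width: int    # pixels
    fps: float    # frames per second

def base_model(prompt: str) -> Clip:
    # Base text-to-video diffusion model: prompt -> 16 frames at 24x48, 3 fps
    # (figures quoted in the article; the actual diffusion sampling is elided).
    return Clip(frames=16, height=24, width=48, fps=3.0)

def temporal_sr(clip: Clip, factor: int) -> Clip:
    # Temporal super-resolution (TSR): fill in new frames, raising the frame rate.
    return Clip(clip.frames * factor, clip.height, clip.width, clip.fps * factor)

def spatial_sr(clip: Clip, factor: int) -> Clip:
    # Spatial super-resolution (SSR): upsample every frame to a higher resolution.
    return Clip(clip.frames, clip.height * factor, clip.width * factor, clip.fps)

def generate(prompt: str) -> Clip:
    clip = base_model(prompt)
    # Base model plus three TSR and three SSR stages = seven models in total.
    for factor in (2, 2, 2):             # TSR: 16 -> 128 frames, 3 -> 24 fps
        clip = temporal_sr(clip, factor)
    for factor in (2, 4, 4):             # SSR factors are illustrative only;
        clip = spatial_sr(clip, factor)  # the real cascade ends at 1280x768
    return clip

if __name__ == "__main__":
    clip = generate("a bear doing the dishes")
    print(clip, f"-> {clip.frames / clip.fps:.1f} s")  # 128 frames @ 24 fps = 5.3 s
```

In the cascade Google describes, each upscaling stage is itself a diffusion model conditioned on the previous stage’s output, rather than a simple resampling step, which is what lets detail be added as the resolution and frame rate grow.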
The sample videos featured on Imagen Video’s website range from the mundane (“Melting ice cream dripping down the cone”) to the fantastical (“Flying through an intense battle between pirate ships in a stormy ocean”). They contain obvious artifacts, but show more fluidity and detail than previous text-to-video models such as CogVideo, which debuted five months ago.

Another Google-adjacent text-to-video model, called Phenaki, also officially debuted today; it can create longer videos from detailed prompts. This, along with DreamFusion, which can create 3D models from text prompts, shows that competitive development of diffusion models continues at a rapid pace, with the number of AI papers on arXiv growing exponentially, at a rate that makes it difficult for some researchers to keep up with the latest developments.
The training data for Google Imagen Video comes from the publicly available LAION-400M image-text dataset plus “14 million video-text pairs and 60 million image-text pairs,” according to Google. As a result, although Google filtered out “problematic data,” the training set may still contain sexually explicit and violent content, as well as social stereotypes and cultural biases. The company is also concerned that its tool could be used “to generate fake, hateful, explicit or harmful content.”
Therefore, we are unlikely to see a public release anytime soon: “We have decided not to release the Imagen Video model or its source code until these concerns are addressed,” Google says.