Tom and Jerry Videos
Generate your own one-minute Tom and Jerry episodes with a 5B-parameter diffusion model
Paper: One-Minute Video Generation with Test-Time Training
Link: https://test-time-training.github.io/video-dit/
Researchers from NVIDIA, Stanford University, UCSD, UC Berkeley, and UT Austin set out to generate one-minute cartoon videos. They introduce Test-Time Training (TTT) layers that enhance the ability of pre-trained Diffusion Transformers to generate longer, more complex videos from text.
Hmm... What’s the background?
This paper tackles the challenge of generating long, complex, multi-scene videos with dynamic motion from text prompts. Current state-of-the-art video Transformers struggle here because self-attention becomes prohibitively expensive over long contexts, while the hidden states of alternative RNN layers such as Mamba are not expressive enough to capture complex, long-range stories. TTT layers sidestep this trade-off by making the hidden state itself a small neural network, updated via gradient steps on a self-supervised loss even at test time. As a proof of concept, the authors curate a dataset of approximately 7 hours of Tom and Jerry cartoons with human-annotated storyboards, emphasizing complex, long-range stories with dynamic motion rather than just visual and physical realism.
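To make the core idea concrete, here is a minimal PyTorch sketch of a TTT layer in the spirit of the TTT literature the paper builds on. The hidden state is a small two-layer MLP whose weights take one gradient step on a self-supervised reconstruction loss per token; the class name, projection setup, and hyperparameters are illustrative, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayerSketch(nn.Module):
    """Illustrative TTT layer: the hidden state is itself a small MLP
    (w1, w2) that takes a gradient step per token at test time."""

    def __init__(self, dim: int, hidden: int = 256, inner_lr: float = 0.1):
        super().__init__()
        # Three learned views of each token, as in the TTT literature:
        # a training view (K), a label view (V), and a test view (Q).
        self.proj_k = nn.Linear(dim, dim, bias=False)
        self.proj_v = nn.Linear(dim, dim, bias=False)
        self.proj_q = nn.Linear(dim, dim, bias=False)
        self.inner_lr = inner_lr
        # Initial weights of the inner MLP -- the "hidden state".
        self.w1_init = nn.Parameter(torch.randn(dim, hidden) * 0.02)
        self.w2_init = nn.Parameter(torch.randn(hidden, dim) * 0.02)

    def _inner(self, x, w1, w2):
        # f(x; W): the expressive hidden state is a two-layer MLP.
        return F.gelu(x @ w1) @ w2

    def forward(self, x):  # x: (seq_len, dim); batch dim omitted for clarity
        w1, w2 = self.w1_init.clone(), self.w2_init.clone()
        outputs = []
        for t in range(x.shape[0]):
            k, v, q = self.proj_k(x[t]), self.proj_v(x[t]), self.proj_q(x[t])
            # Inner self-supervised loss: reconstruct the label view from
            # the training view, then take one gradient step on the MLP.
            loss = F.mse_loss(self._inner(k, w1, w2), v)
            g1, g2 = torch.autograd.grad(loss, (w1, w2), create_graph=True)
            w1, w2 = w1 - self.inner_lr * g1, w2 - self.inner_lr * g2
            # Output: apply the freshly updated hidden state to the test view.
            outputs.append(self._inner(q, w1, w2))
        return torch.stack(outputs)
```

A real implementation updates on mini-batches of tokens rather than one at a time for efficiency; `create_graph=True` lets the outer diffusion training backpropagate through these inner updates.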
So what is proposed in the research paper?
Here are the main insights:
Integration of Test-Time Training (TTT) layers, inserted behind learned gates, into a pre-trained Diffusion Transformer (CogVideo-X 5B)
To handle the non-causal nature of Diffusion Transformers, they apply the TTT layers bi-directionally, scanning the token sequence in both directions (a rough sketch follows this list)
The paper also details a multi-stage fine-tuning recipe that extends the context length to one minute, starting with 3-second segments and progressively increasing the length (see the schedule sketched below)
In human evaluation, TTT-MLP significantly outperforms baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention at generating coherent videos that tell complex stories, leading by 34 Elo points on average (see the Elo sketch below)
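One plausible way to realize the gated, bi-directional integration from the first two bullets is to run a causal TTT scan over the token sequence in both directions and fold the result back into the pre-trained residual stream through a learned gate. The wrapper below is a guess at the mechanics, not the paper’s code; in particular, whether the two directions share weights and how the gate is parameterized are assumptions here.

```python
import torch
import torch.nn as nn

class GatedBiDirTTT(nn.Module):
    """Hypothetical wrapper: scans the sequence with an inner TTT layer
    in both directions, then mixes the result into the residual stream
    through a learned gate that starts small, so the pre-trained
    Diffusion Transformer is barely perturbed early in fine-tuning."""

    def __init__(self, dim: int, ttt_layer: nn.Module):
        super().__init__()
        self.ttt = ttt_layer
        self.norm = nn.LayerNorm(dim)
        self.alpha = nn.Parameter(torch.full((dim,), 0.1))  # gate parameter

    def forward(self, x):  # x: (seq_len, dim)
        h = self.norm(x)
        fwd = self.ttt(h)                  # left-to-right scan
        bwd = self.ttt(h.flip(0)).flip(0)  # right-to-left scan
        # tanh gate keeps the new branch's contribution bounded and small.
        return x + torch.tanh(self.alpha) * (fwd + bwd)
```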
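The multi-stage recipe can be pictured as a simple curriculum loop. Only the 3-second starting point and the one-minute target come from the summary above; the intermediate lengths, step counts, and helper callables below are placeholders.

```python
import itertools

# Hypothetical sketch of the multi-stage fine-tuning recipe: the model
# is fine-tuned on progressively longer segments, shortest first.
STAGES = [  # (segment length in seconds, optimization steps) -- illustrative
    (3, 5000),
    (9, 3000),
    (18, 2000),
    (30, 1000),
    (63, 500),  # final stage: roughly one-minute videos
]

def fine_tune_in_stages(train_step, make_loader, stages=STAGES):
    """Run one fine-tuning stage per context length, shortest first."""
    for seconds, steps in stages:
        loader = make_loader(seconds)  # yields batches of clips this long
        for batch in itertools.islice(loader, steps):
            train_step(batch)

# Stand-in callables, just to exercise the control flow:
fine_tune_in_stages(
    train_step=lambda batch: None,
    make_loader=lambda s: itertools.repeat({"clip_seconds": s}),
)
```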
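For context on the metric: in an Elo-style evaluation each method carries a rating, and every pairwise human preference nudges the ratings. This is the generic Elo formula, not anything specific to the paper:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Update both ratings after one pairwise human judgment."""
    delta = k * ((1.0 if a_won else 0.0) - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta

# A 34-point lead implies roughly a 55% expected win rate per comparison:
print(round(elo_expected(1034.0, 1000.0), 3))  # -> 0.549
```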
What’s next?
Future work could explore alternative strategies for integrating TTT layers into pre-trained models, beyond bi-direction and learned gates, to further enhance generation quality and accelerate fine-tuning.
The authors acknowledge current limitations: video artifacts such as temporal inconsistency, unnatural motion, and aesthetic issues remain, and although TTT-MLP is more efficient than full attention for long videos, its wall-clock time is still worse than that of baselines like Gated DeltaNet and Mamba 2.
Learned something new? Consider sharing it!