Technical Paper: Sora - Creating video from text
Blog: Video generation models as world simulators
These are just short notes / excerpts from the technical paper for quick lookup.
Sora is quite a breakthrough. It can understand and simulate the physical world, generating high-definition videos up to 60 seconds long while maintaining quality and scene continuity and following the user’s prompt.
Key papers Sora is built upon:
Diffusion Transformer (DiT), Latent Diffusion Models, DALL-E 3 Image Recaptioning
Sora, being a diffusion transformer, borrows from ViT the idea of splitting visual input into patches, which then serve as the transformer's input tokens.
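To make the patch idea concrete, here is a minimal sketch of ViT-style patching on a toy 2D image, in plain Python with no dependencies. It only illustrates the general technique of cutting a grid into non-overlapping patches and flattening each one into a token; it is not Sora's actual pipeline, which these notes do not detail. The function name `patchify` and the parameter `patch_size` are hypothetical names chosen for this example.

```python
def patchify(image, patch_size):
    """Split a 2D grid (a list of rows) into non-overlapping square patches.

    Patches are returned in row-major order, each flattened into a 1D list
    so that one patch corresponds to one token. Illustrative helper only.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch_size x patch_size window into a single token.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 single-channel toy "image" holding the values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch_size=2)

print(len(tokens))  # 4 patches, each of length 4
print(tokens[0])    # [0, 1, 4, 5]
```

In a real model each flattened patch would then be linearly projected to an embedding and given a position encoding before entering the transformer; those steps are left out here to keep the sketch short.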