generative model

Paper Summary #12 - Image Recaptioning in DALL-E 3

Technical Paper: Improving Image Generation with Better Captions

OpenAI's Sora is built upon the image captioning model described in some detail in the DALL-E 3 technical report. In text-image datasets, captions typically omit background details and common-sense relationships, e.g. a sink in a kitchen or stop signs along the road. They also omit the position and count of objects, their color and size, and any text present in the image.
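Below is a minimal sketch of the recaptioning idea: swap the sparse original alt-text for detailed synthetic captions produced by a captioning model before training the image generator. The names `Example`, `recaption_dataset`, and the dummy captioner are illustrative assumptions, not the actual pipeline or any OpenAI API.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass
class Example:
    image_path: str
    caption: str  # original alt-text, often missing background, counts, colors, visible text


def recaption_dataset(examples: List[Example],
                      captioner: Callable[[str], str]) -> List[Example]:
    """Replace each original caption with a synthetic, highly descriptive one
    produced by `captioner` (a stand-in for the captioning model in the report)."""
    return [replace(ex, caption=captioner(ex.image_path)) for ex in examples]


# Usage with a dummy captioner; a real one would be an image-to-text model
# trained to describe background objects, counts, positions, colors, and text.
data = [Example("kitchen_001.jpg", "a sink")]
detailed = recaption_dataset(data, lambda path: f"detailed caption for {path}")
print(detailed[0].caption)
```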

Paper Summary #11 - Sora

Technical Paper: Sora - Creating video from text
Blog: Video generation models as world simulators

These are short notes / excerpts from the technical paper for quick lookup. Sora is quite a breakthrough: it is able to understand and simulate the physical world, generating high-definition videos up to 60 seconds long while maintaining visual quality, continuing scenes, and following the user's prompt.

Key papers Sora is built upon:
- Diffusion Transformer (DiT)
- Latent Diffusion Models
- DALL-E 3 Image Recaptioning

Sora, being a diffusion transformer, borrows from ViT the idea of operating on patches.
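A minimal sketch (not Sora's actual code) of what ViT-style patchification looks like when applied to a latent video: the encoder's output is cut into spacetime patches that become the token sequence for the diffusion transformer. The tensor shapes and patch sizes here are illustrative assumptions.

```python
import torch


def video_to_spacetime_patches(latents: torch.Tensor,
                               patch_t: int = 2,
                               patch_h: int = 2,
                               patch_w: int = 2) -> torch.Tensor:
    """latents: (B, C, T, H, W) latent video from an encoder (latent diffusion).
    Returns: (B, N, D) patch tokens, with N = (T/pt) * (H/ph) * (W/pw)."""
    B, C, T, H, W = latents.shape
    assert T % patch_t == 0 and H % patch_h == 0 and W % patch_w == 0
    # split time, height, and width into (grid, patch) pairs
    x = latents.reshape(B, C,
                        T // patch_t, patch_t,
                        H // patch_h, patch_h,
                        W // patch_w, patch_w)
    # group the spacetime grid positions together, then flatten each patch
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # (B, T', H', W', C, pt, ph, pw)
    x = x.flatten(1, 3).flatten(2)         # (B, N, C * pt * ph * pw)
    return x


# Usage: a 16-frame, 32x32 latent with 4 channels becomes 2048 tokens of dim 32.
tokens = video_to_spacetime_patches(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 32])
```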