Machine Learning

Paper Summary #12 - Image Recaptioning in DALL-E 3

Technical Paper: Improving Image Generation with Better Captions
OpenAI’s Sora is built upon the image captioning model that was described in quite some detail in the DALL-E 3 technical report. In general, the captions in text-image datasets omit background details and common-sense relationships, e.g., a sink in a kitchen or stop signs along the road. They also omit the position and count of objects in the picture, the color and size of objects, and any text present in the image.

Paper Summary #11 - Sora

Technical Paper: Sora - Creating video from text
Blog: Video generation models as world simulators
These are just short notes / excerpts from the technical paper for quick lookup. Sora is quite a breakthrough. It is able to understand and simulate the physical world, generating up to 60-second-long high-definition videos while maintaining quality and scene continuity and following the user’s prompt. Key papers Sora is built upon: Diffusion Transformer (DiT), Latent Diffusion Models, and DALL-E 3 Image Recaptioning. Sora (being a diffusion transformer) uses the idea from ViT of using patches.
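As a rough illustration of the patch idea (this is not code from the Sora report; the tensor layout and patch sizes below are my own assumptions for the example), a minimal PyTorch sketch of flattening a latent video tensor into a sequence of spacetime patch tokens might look like this:

```python
import torch

def video_to_spacetime_patches(latent, pt=2, ph=4, pw=4):
    """Flatten a latent video into a sequence of spacetime patch tokens.

    latent: (C, T, H, W) tensor, e.g. the output of a video VAE encoder (assumed layout).
    pt, ph, pw: patch sizes along time, height, width (assumed values).
    Returns: (num_patches, C * pt * ph * pw) token sequence.
    """
    C, T, H, W = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # split each axis into (num_patches_along_axis, patch_size)
    x = latent.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # group patch-index dims first, patch-content dims last:
    # (T/pt, H/ph, W/pw, C, pt, ph, pw)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)
    return x.reshape(-1, C * pt * ph * pw)

# Example: a 16-frame, 32x32 latent with 4 channels -> 512 patch tokens of dim 128
tokens = video_to_spacetime_patches(torch.randn(4, 16, 32, 32))
print(tokens.shape)  # torch.Size([512, 128])
```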

Paper Summary #10 - Gemini 1.5 Pro

Technical Paper: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Blog: Our next-generation model: Gemini 1.5
These are just short notes / excerpts from the technical paper for quick lookup. Actual implementation details are missing from the technical paper anyway. Gemini 1.5 Pro is a sparse MoE Transformer-based model (which seems to be a trend these days, after the GPT-4 rumors and Mixtral). It supports a multimodal context length of up to 10M tokens, which is almost a day’s worth (22 hours) of audio recordings, more than ten times the entirety of the 1,440-page book “War and Peace”, the entire Flax codebase (41,070 lines of code), or three hours of video at one frame per second.

Solving Substitution Ciphers using Markov Chain Monte Carlo (MCMC)

I was reading about Markov Chain Monte Carlo (MCMC) recently and discovered a very famous application: using it to decrypt substitution ciphers. This blog post is meant to serve as notes on how the problem can be framed as a Markov chain and how a simple yet smart Monte Carlo sampling approach can solve it very efficiently. I won’t be explaining what a Markov process or MCMC is, but I’ll add some references for that at the end of the post.
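As a hedged sketch of the sampling idea (the post’s actual code may differ; the bigram table and function names here are my own), a Metropolis-style sampler proposes swapping two letters of the current key and accepts the proposal based on how English-like the resulting plaintext is:

```python
import math
import random
import string

def log_score(text, log_bigram):
    """Log-likelihood of text under reference English bigram statistics."""
    return sum(log_bigram.get(a + b, math.log(1e-8))
               for a, b in zip(text, text[1:]))

def decrypt(ciphertext, key):
    """Apply a substitution key (cipher letter -> plain letter); assumes lowercase text."""
    return "".join(key.get(c, c) for c in ciphertext)

def mcmc_decipher(ciphertext, log_bigram, iters=10_000):
    letters = list(string.ascii_lowercase)
    perm = letters[:]                        # current guess of the key
    random.shuffle(perm)
    key = dict(zip(letters, perm))
    best = cur = log_score(decrypt(ciphertext, key), log_bigram)
    best_key = dict(key)
    for _ in range(iters):
        a, b = random.sample(letters, 2)     # propose: swap two mappings
        key[a], key[b] = key[b], key[a]
        new = log_score(decrypt(ciphertext, key), log_bigram)
        # accept with probability min(1, p_new / p_cur)
        if new > cur or random.random() < math.exp(new - cur):
            cur = new
            if cur > best:
                best, best_key = cur, dict(key)
        else:
            key[a], key[b] = key[b], key[a]  # reject: undo the swap
    return best_key
```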

Flash Attention in Pytorch

A simplified implementation of FlashAttention in PyTorch. I have implemented the forward-pass and backward-pass algorithms from the paper and have shown that they are equivalent to the standard attention formulation in Transformers. I also include some code for benchmarking. Note that this is for educational purposes only, as I haven’t implemented any of the CUDA and SRAM memory tricks described in the paper.
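This is not the repository’s actual code, but a minimal sketch (block size and variable names are my own assumptions) of the core idea the implementation follows: a tiled forward pass with an online softmax, with the paper’s CUDA/SRAM tricks deliberately omitted.

```python
import torch

def flash_attention_forward(Q, K, V, block_size=64):
    """Tiled attention forward pass with an online softmax (no CUDA/SRAM tricks).

    Q, K, V: (seq_len, d) tensors. Returns the same output as
    softmax(Q @ K.T / sqrt(d)) @ V, computed block by block over K/V.
    """
    n, d = Q.shape
    scale = d ** -0.5
    O = torch.zeros_like(Q)
    row_max = torch.full((n, 1), float("-inf"))   # running max per query row
    row_sum = torch.zeros(n, 1)                   # running softmax denominator
    for start in range(0, n, block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = (Q @ Kb.T) * scale                    # (n, block) scores
        block_max = S.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        P = torch.exp(S - new_max)                # block probabilities, rescaled
        correction = torch.exp(row_max - new_max) # rescale previous accumulators
        row_sum = row_sum * correction + P.sum(dim=-1, keepdim=True)
        O = O * correction + P @ Vb
        row_max = new_max
    return O / row_sum

# Sanity check against standard attention
Q, K, V = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax(Q @ K.T / 32 ** 0.5, dim=-1) @ V
print(torch.allclose(flash_attention_forward(Q, K, V), ref, atol=1e-5))
```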

Paper Summary #9 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Paper: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia
I have also released an annotated version of the paper. If you are interested, you can find it here. Sophia is probably one of the most interesting papers I have read recently, and I really liked how well it was written. This post is basically the notes I made while reading the paper, which is why it is not exactly a blog post and most of it is copied verbatim from the paper.

Paper Summary #8 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Link: https://arxiv.org/abs/2205.14135
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Code: https://github.com/HazyResearch/flash-attention
I have also released an annotated version of the paper. If you are interested, you can find it here.
[Update] - I implemented a simplified version of FlashAttention (without the CUDA and SRAM memory optimizations) in PyTorch. Check it out on Github.
I finished reading the FlashAttention paper recently and thought that it would be good to have a technical write-up of the paper, so that it can help me understand the concept well.

Academic Log | October-December 2022

A collection of academic papers/blogs/talks/projects that I read/watched/explored during these months. I also include any small (or large) personal projects that I did and any such related ML/non-ML work.
Personal Projects
- Paper re-implementation - Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability by Cohen et al., 2021 - [Github]
- Paper re-implementation - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by Frankle et al., 2018 - [Github]
- Paper re-implementation - An Empirical Model of Large-Batch Training by OpenAI, 2018 - [Github]
Annotated Papers
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
- Modeling Language Usage and Listener Engagement in Podcasts
- Which Algorithmic Choices Matter at Which Batch Sizes?

Academic Log | August/September 2022

A collection of academic papers/blogs/talks/projects that I read/watched/explored during these months. I also include any small (or large) personal projects that I did and any such related ML/non-ML work.
Personal Projects
- VAE-Implementation - A simple implementation of Autoencoder and Variational Autoencoder - [Github]
- MinHash-Implemenation - A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford - [Github]
- Paper re-implementation - Sentence VAE paper, “Generating Sentences from a Continuous Space” by Bowman et al.

Paper Summary #7 - Efficient Transformers: A Survey

Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
I had wanted to summarize this paper for a long time because of the immense amount of information in it. Thanks to the Cohere For AI community for having a session on this paper, which made me revisit it.
What?
This is a survey paper on the various memory-efficiency-based improvements to the original Transformer architecture by Vaswani et al.