deep learning

Flash Attention in PyTorch

A simplified implementation of FlashAttention in PyTorch. I have implemented the forward-pass and backward-pass algorithms from the paper and verified that they are equivalent to the standard attention formulation used in Transformers. I also include some benchmarking code. Note that this is for educational purposes only, as I haven't implemented any of the CUDA kernels or SRAM-level memory optimizations described in the paper.
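For reference, here is a minimal sketch of the underlying idea (my own simplified code, not the repository's): attention is computed block by block with online-softmax rescaling, and the result is checked against the naive formulation. Tensor shapes and the block size are illustrative.

```python
import torch

def standard_attention(q, k, v):
    # Naive attention: materializes the full (N x N) score matrix.
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

def tiled_attention(q, k, v, block_size=64):
    # Processes K/V in blocks, keeping a running max and running sum
    # so the softmax is never formed over the full sequence at once.
    scale = q.shape[-1] ** 0.5
    n = q.shape[0]
    out = torch.zeros_like(q)
    row_max = torch.full((n, 1), float("-inf"))
    row_sum = torch.zeros(n, 1)
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = q @ kb.transpose(-2, -1) / scale            # (n, block) scores
        block_max = s.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, block_max)
        p = torch.exp(s - new_max)                      # rescaled block weights
        correction = torch.exp(row_max - new_max)       # rescale old accumulators
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ vb
        row_max = new_max
    return out / row_sum

q, k, v = (torch.randn(256, 32) for _ in range(3))
assert torch.allclose(standard_attention(q, k, v), tiled_attention(q, k, v), atol=1e-5)
```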

Paper Summary #9 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Paper: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia
I have also released an annotated version of the paper. If you are interested, you can find it here. Sophia is probably one of the most interesting papers I have read recently, and I really liked how well it was written. This post is essentially the notes I made while reading the paper, so it is less a polished blog post than a set of reading notes, and much of it is quoted verbatim from the paper.
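Roughly, the update has the following shape (my own simplified sketch, not the authors' code; function names and hyperparameter values are illustrative): an EMA of gradients is preconditioned by an EMA of a stochastic diagonal-Hessian estimate, and the resulting per-coordinate step is clipped before being applied.

```python
import torch

def hutchinson_diag_hessian(loss, param):
    # Stochastic estimate of the Hessian diagonal: u * (H u) with u ~ N(0, I)
    # (the Sophia-H flavor; Sophia-G uses a Gauss-Newton-Bartlett estimator instead).
    g, = torch.autograd.grad(loss, param, create_graph=True)
    u = torch.randn_like(param)
    hvp, = torch.autograd.grad(g, param, grad_outputs=u)
    return u * hvp

@torch.no_grad()
def sophia_like_step(param, grad, m, h, lr=1e-4, beta1=0.96, gamma=0.01, eps=1e-12):
    # m: EMA of gradients; h: EMA of diagonal-Hessian estimates
    # (the paper refreshes h only every k steps to amortize its cost).
    m.mul_(beta1).add_(grad, alpha=1 - beta1)
    update = (m / torch.clamp(gamma * h, min=eps)).clamp_(-1.0, 1.0)
    param.add_(update, alpha=-lr)
```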

Paper Summary #8 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Link: https://arxiv.org/abs/2205.14135
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Code: https://github.com/HazyResearch/flash-attention
I have also released an annotated version of the paper. If you are interested, you can find it here.
[Update] - I implemented a simplified version of FlashAttention (without the CUDA and SRAM memory optimizations) in PyTorch. Check it out on Github.
I recently finished reading the FlashAttention paper and thought that a technical write-up would help me understand the concepts better.
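To get a feel for why the IO-awareness matters, here is a quick back-of-the-envelope calculation (my own illustrative numbers, not from the paper): standard attention materializes an N x N score matrix in GPU high-bandwidth memory, which grows quadratically with sequence length.

```python
# Memory for the attention score matrix of a single head, in fp32:
seq_len = 8192
bytes_per_float = 4
score_matrix_bytes = seq_len * seq_len * bytes_per_float
print(f"{score_matrix_bytes / 2**20:.0f} MiB per head just for the scores")  # 256 MiB
```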

Academic Log | October-December 2022

A collection of academic papers/blogs/talks/projects that I read/watched/explored during this period. I also include any small (or large) personal projects that I did and any related ML/non-ML work.
Personal Projects
- Paper re-implementation - Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability by Cohen et al., 2021 - [Github]
- Paper re-implementation - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks by Frankle et al., 2018 - [Github]
- Paper re-implementation - An Empirical Model of Large-Batch Training by OpenAI, 2018 - [Github]
Annotated Papers
- The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
- Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
- Modeling Language Usage and Listener Engagement in Podcasts
- Which Algorithmic Choices Matter at Which Batch Sizes?

Academic Log | August/September 2022

A collection of academic papers/blogs/talks/projects that I read/watched/explored during these months. I also include any small (or large) personal projects that I did and any related ML/non-ML work.
Personal Projects
- VAE-Implementation - A simple implementation of Autoencoder and Variational Autoencoder - [Github]
- MinHash-Implemenation - A simple MinHash implementation based on the explanation in the Mining of Massive Datasets course by Stanford - [Github]
- Paper re-implementation - Sentence VAE paper, "Generating Sentences from a Continuous Space" by Bowman et al.

Paper Summary #7 - Efficient Transformers: A Survey

Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
I had wanted to summarize this paper for a long time because of the immense amount of information it contains. Thanks to the Cohere For AI community for hosting a session on this paper, which made me revisit it.
What?
This is a survey of the various memory-efficiency-focused improvements on the original Transformer architecture by Vaswani et al.
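As a toy illustration of the kind of method such surveys cover (my own example, not taken from the paper): many efficient variants replace full attention with a sparse pattern, e.g. restricting each token to a local window so the score computation shrinks from N x N to roughly N x w.

```python
import torch

def local_attention_mask(n, window=4):
    # Each position may attend only to positions within +/- window of itself,
    # one of the simple sparsity patterns used by efficient-attention variants.
    idx = torch.arange(n)
    return (idx[None, :] - idx[:, None]).abs() <= window  # (n, n) boolean mask

mask = local_attention_mask(8, window=2)
print(mask.int())
```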

Academic Log | June/July 2022

A collection of academic papers/blogs/talks/projects that I read/watched/explored during these months. I also include any small (or large) personal projects that I did and any related ML/non-ML work.
Personal Projects
- Paper re-implementation - "Extracting Training Data from Large Language Models" by Carlini et al., 2021 - [Github]
Annotated Papers
- Learning Backward Compatible Embeddings
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
- Tracing Knowledge in Language Models Back to the Training Data
Papers I read
- On the Unreasonable Effectiveness of Feature propagation in Learning on Graphs with Missing Node Features
- PaLM: Scaling Language Modeling with Pathways
- Hierarchical Text-Conditional Image Generation with CLIP Latents
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- Unified Contrastive Learning in Image-Text-Label Space
- Improving Passage Retrieval with Zero-Shot Question Generation
- Exploring Dual Encoder Architectures for Question Answering
- Efficient Fine-Tuning of BERT Models on the Edge
- Fine-Tuning Transformers: Vocabulary Transfer
- Manipulating SGD with Data Ordering Attacks
- Differentially Private Fine-tuning of Language Models
- Extracting Training Data from Large Language Models
- Learning Backward Compatible Embeddings
- Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
- Agreement-on-the-Line: Predicting the Performance of Neural Networks under Distribution Shift
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
- Tracing Knowledge in Language Models Back to the Training Data
Blogs I read
- Domain Adaptation with Generative Pseudo-Labeling (GPL)
- Making Deep Learning Go Brrrr From First Principles
- Introduction to TorchScript
- Nonlinear Computation in Deep Linear Networks
Talks I watched
- How GPU Computing Works

Paper Summary #6 - Language Models are Unsupervised Multitask Learners

Paper: Language Models are Unsupervised Multitask Learners
Link: https://bit.ly/3vgaVJc
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Code: https://github.com/openai/gpt-2
I also made an annotated version of the paper, which you can find here.
What?
The paper demonstrates that language models begin to learn NLP tasks like question answering, machine translation, reading comprehension and summarization without any explicit supervision. The results are obtained after training the model on WebText, a new dataset of millions of web pages.
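As a quick illustration of this zero-shot framing (my own snippet using the Hugging Face transformers port of GPT-2, not the paper's code), the task is specified purely through the prompt; for summarization the paper appends a "TL;DR:" cue and samples with top-k = 2.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Zero-shot summarization: no fine-tuning, the task is induced by the prompt alone.
article = ("The research team trained a large language model on millions of web "
           "pages and evaluated it on downstream NLP tasks without any fine-tuning.")
inputs = tokenizer(article + "\nTL;DR:", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_k=2)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:]))
```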

Paper Summary #5 - XLNet: Generalized Autoregressive Pretraining for Language Understanding

Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Link: https://arxiv.org/pdf/1906.08237.pdf
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Code: https://github.com/zihangdai/xlnet
What?
The paper proposes XLNet, a generalized autoregressive pretraining method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and that overcomes the limitations of BERT thanks to its autoregressive formulation. XLNet incorporates Transformer-XL as the underlying model. It outperforms BERT on 20 NLP tasks, including question answering, natural language inference, sentiment analysis and document ranking.
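To make the permutation idea concrete, here is a toy sketch (my own illustration, not XLNet's two-stream attention implementation): for a sampled factorization order, each token may attend only to tokens that appear earlier in that order, which lets context come from either side of the token in the original sequence.

```python
import torch

def permutation_mask(perm):
    # perm[i] is the position of token i in the sampled factorization order.
    # Token i may attend to token j only if j comes earlier in that order.
    order = torch.as_tensor(perm)
    return order[None, :] < order[:, None]   # (n, n) boolean: row i attends to col j

# Original positions 0..4 and one sampled factorization order:
perm = torch.randperm(5)
print(perm, permutation_mask(perm).int(), sep="\n")
```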

Paper Summary #4 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper: BERT - Pre-training of Deep Bidirectional Transformers for Language Understanding
Link: https://bit.ly/3bdTUra
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Code: https://bit.ly/3vRXlM7
What?
The paper proposes BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can then be fine-tuned with just one additional output layer to create task-specific models.
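As a minimal sketch of that fine-tuning recipe (my own snippet using the Hugging Face transformers port of BERT, with the standard public checkpoint name assumed): a single linear layer on top of the pooled [CLS] representation gives a task-specific classifier.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(bert.config.hidden_size, 2)  # the one additional task layer

inputs = tokenizer("a minimal example sentence", return_tensors="pt")
outputs = bert(**inputs)
logits = classifier(outputs.pooler_output)  # (1, 2) class scores
# Fine-tuning would backprop a task loss through both the classifier and BERT.
```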