transformers

Paper Summary #9 - Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Paper: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia

I have also released an annotated version of the paper. If you are interested, you can find it here. Sophia is probably one of the most interesting papers I have read recently, and I really liked how well it is written. This post is essentially the notes I made while reading the paper, which is why it is not exactly a blog post and most of it is copied verbatim from the paper.
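For context, the heart of Sophia is a diagonal-Hessian-preconditioned update with elementwise clipping. Below is a minimal sketch of that update rule in PyTorch-style code, under my own naming (the `sophia_step` helper, `gamma`, and the abstract `hess_estimate` argument are illustrative); it is not the authors' implementation, and weight decay plus the Hessian estimator itself are omitted.

```python
import torch

def sophia_step(param, grad, m, h, lr, hess_estimate=None,
                beta1=0.96, beta2=0.99, gamma=0.01, eps=1e-12):
    """One simplified Sophia-style update on a single parameter tensor.

    m: running EMA of gradients; h: running EMA of diagonal Hessian estimates.
    hess_estimate: a fresh stochastic diagonal-Hessian estimate (the paper uses
    Hutchinson or Gauss-Newton-Bartlett estimators, refreshed every k steps).
    """
    m.mul_(beta1).add_(grad, alpha=1 - beta1)                # EMA of gradients
    if hess_estimate is not None:                            # refreshed every k steps
        h.mul_(beta2).add_(hess_estimate, alpha=1 - beta2)   # EMA of Hessian diagonal
    # Newton-like step with the diagonal pre-conditioner, clipped elementwise to [-1, 1]
    update = (m / torch.clamp(gamma * h, min=eps)).clamp(-1.0, 1.0)
    param.add_(update, alpha=-lr)
```

The elementwise clipping is what keeps the step safe when the Hessian estimate is small or noisy; without it, the pre-conditioned step could blow up along flat directions.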

Paper Summary #8 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Link: https://arxiv.org/abs/2205.14135
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Code: https://github.com/HazyResearch/flash-attention

I have also released an annotated version of the paper. If you are interested, you can find it here. [Update] - I implemented a simplified version of FlashAttention (without the CUDA and SRAM memory optimizations) in PyTorch. Check it out on GitHub. I finished reading the FlashAttention paper recently and thought it would be good to have a technical write-up of the paper, to help me understand the concepts well.
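To give a flavour of the idea, here is a minimal sketch of the tiled, online-softmax attention that FlashAttention is built around, written in plain PyTorch with none of the paper's CUDA/SRAM tiling; the `tiled_attention` function, block size, and shapes are my own assumptions, not the official kernel.

```python
import torch

def tiled_attention(q, k, v, block_size=64):
    """Compute softmax(q @ k^T / sqrt(d)) @ v one key/value block at a time.

    Keeps running softmax statistics (row max and row sum) instead of ever
    materializing the full seq_len x seq_len attention matrix.
    Shapes: q, k, v are (seq_len, d).
    """
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"), device=q.device, dtype=q.dtype)
    row_sum = torch.zeros(seq_len, 1, device=q.device, dtype=q.dtype)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]            # (B, d)
        v_blk = v[start:start + block_size]            # (B, d)
        scores = (q @ k_blk.T) * scale                 # (seq_len, B)

        blk_max = scores.max(dim=-1, keepdim=True).values
        new_max = torch.maximum(row_max, blk_max)
        correction = torch.exp(row_max - new_max)      # rescale what we accumulated so far
        p = torch.exp(scores - new_max)                # unnormalized block probabilities
        out = out * correction + p @ v_blk
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum                               # normalize at the very end
```

The result is numerically identical to standard attention; the point of the actual kernel is that each block fits in fast on-chip SRAM, so the quadratic attention matrix never has to be written to slow HBM.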

Paper Summary #7 - Efficient Transformers: A Survey

Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

I have wanted to summarize this paper for a long time because of the immense amount of information it contains. Thanks to the Cohere For AI community for hosting a session on this paper, which made me revisit it.

What? This is a survey of the various memory-efficiency-focused improvements to the original Transformer architecture by Vaswani et al.

Paper Summary #1 - Attention Is All You Need

Paper: Attention Is All You Need
Link: https://bit.ly/3aklLFY
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Code: https://github.com/tensorflow/tensor2tensor

What? Proposes the Transformer, a simple new architecture for sequence transduction that relies entirely on an attention mechanism, without any recurrence or convolution. The model achieved SOTA (at the time) on the WMT 2014 English-to-French translation task with a score of 41.
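For reference, the building block the whole architecture rests on is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A minimal PyTorch sketch of that equation is below (the function name and the boolean mask convention are mine, not from the paper's code).

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    q, k, v: tensors of shape (..., seq_len, d_k);
    mask: optional boolean tensor broadcastable to the score matrix, True = keep.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention distribution per query
    return weights @ v
```

Multi-head attention simply runs several of these in parallel on learned projections of Q, K, and V and concatenates the results.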