Paper: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia
I have also released an annotated version of the paper. If you are interested, you can find it here.
Sophia is probably one of the most interesting papers I have read recently, and I really liked how well it is written. This post is essentially the notes I made while reading the paper, which is why it is not exactly a blog post and why much of it is copied verbatim from the paper.
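As a quick reference while reading the notes, the core update in Sophia can be sketched as follows. This is a minimal, simplified sketch based on the paper's algorithm: the function and variable names, default hyperparameter values, and the per-step Hessian update are illustrative assumptions (the paper refreshes the diagonal Hessian estimate only every k steps, and offers two estimators for it).

```python
import numpy as np

def sophia_step(theta, m, h, grad, hess_diag_est, lr=1e-4,
                beta1=0.96, beta2=0.99, gamma=0.01, eps=1e-12):
    """One simplified Sophia update on a flat parameter vector.

    m: EMA of gradients; h: EMA of a diagonal Hessian estimate.
    Hyperparameter names/values here are illustrative, not the paper's tuned ones.
    """
    m = beta1 * m + (1 - beta1) * grad            # momentum (gradient EMA)
    h = beta2 * h + (1 - beta2) * hess_diag_est   # diagonal Hessian EMA
    # Per-coordinate clipped Newton-like step: each coordinate's
    # preconditioned step is capped, so the parameter move is at most lr.
    update = np.clip(m / np.maximum(gamma * h, eps), -1.0, 1.0)
    theta = theta - lr * update
    return theta, m, h
```

The per-coordinate clipping is the key trick: it caps the step size wherever the Hessian estimate is tiny or unreliable, which is what makes the second-order information safe to use at scale.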
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Link: https://arxiv.org/abs/2205.14135
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Code: https://github.com/HazyResearch/flash-attention
I have also released an annotated version of the paper. If you are interested, you can find it here.
[Update] - I implemented a simplified version of FlashAttention (without the CUDA and SRAM memory optimizations) in PyTorch. Check it out on GitHub.
I finished reading the FlashAttention paper recently and thought that a technical write-up would help me understand the concepts better.
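To give a flavor of what a simplified implementation looks like, here is a minimal numpy sketch of the core idea: computing exact attention tile by tile with an online softmax, so the full N×N attention matrix is never materialized. The tile size and function names are illustrative assumptions; the real kernel derives the block size from SRAM capacity and fuses everything into one CUDA kernel.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention softmax(Q K^T / sqrt(d)) V, materializing the full N x N matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def flash_attention(Q, K, V, block_size=4):
    """Tiled attention with an online softmax; only O(N * d) memory beyond the tiles.

    block_size is an illustrative tile width, not the paper's SRAM-derived value.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running row-wise max of the scores
    l = np.zeros(N)           # running row-wise softmax normalizer
    for j in range(0, N, block_size):
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        S = Q @ Kj.T / np.sqrt(d)              # scores for this key/value tile
        m_new = np.maximum(m, S.max(axis=-1))  # updated running max
        alpha = np.exp(m - m_new)              # rescale factor for old accumulators
        P = np.exp(S - m_new[:, None])
        l = alpha * l + P.sum(axis=-1)
        O = alpha[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]
```

Because each tile's contribution is rescaled by the running max, the result is exactly the softmax attention output, not an approximation — that is the "exact attention" in the paper's title.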
Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
I had wanted to summarize this paper for a long time because of the immense amount of information it contains. Thanks to the Cohere For AI community for hosting a session on this paper, which prompted me to revisit it.
What? This is a survey of the various efficiency-oriented improvements (chiefly in memory and compute) to the original Transformer architecture by Vaswani et al.
Paper: Attention Is All You Need
Link: https://bit.ly/3aklLFY
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Code: https://github.com/tensorflow/tensor2tensor
What? Proposes the Transformer, a simple new architecture for sequence transduction that relies entirely on attention mechanisms, dispensing with recurrence and convolutions. The model achieved SOTA (at the time) on the WMT 2014 English-to-French translation task with a BLEU score of 41.8.
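The single operation the whole architecture is built around is scaled dot-product attention, softmax(QKᵀ/√d_k)V. A minimal numpy sketch of that operation (single head, no masking or batching, which the full model adds on top):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as defined in the paper."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of the values
```

The √d_k scaling keeps the dot products from growing with dimension and pushing the softmax into regions with vanishing gradients; multi-head attention simply runs several of these in parallel on learned projections of Q, K, and V.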