Paper: Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training
Link: https://arxiv.org/abs/2305.14342
Authors: Hong Liu, Zhiyuan Li, David Hall, Percy Liang, Tengyu Ma
Code: https://github.com/Liuhong99/Sophia
I have also released an annotated version of the paper. If you are interested, you can find it here.
Sophia is probably one of the most interesting papers I have read recently, and I really liked how well it was written. This post is basically the notes I made while reading the paper, which is why it is not exactly a blog post and most of it is copied verbatim from the paper.
Paper: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Link: https://arxiv.org/abs/2205.14135
Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Code: https://github.com/HazyResearch/flash-attention
I have also released an annotated version of the paper. If you are interested, you can find it here.
[Update] - I implemented a simplified version of FlashAttention (without the CUDA and SRAM memory optimizations) in PyTorch. Check it out on Github.
I finished reading the FlashAttention paper recently and thought it would be good to write a technical summary of it, to help me understand the concept better.
Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler
I have wanted to summarize this paper for a long time because of the immense amount of information it contains. Thanks to the Cohere For AI community for hosting a session on this paper, which made me revisit it.
What? This is a survey paper on the various memory-efficiency improvements to the original Transformer architecture by Vaswani et al.
Introduction Gboard, the Google keyboard, is a virtual keyboard for smartphones that supports more than 900 language varieties and has over 1 billion installs. In addition to decoding noisy signals from input modalities such as tap and word-gesture typing, Gboard provides auto-correction, word completion, and next-word prediction features.
Next-word prediction facilitates text entry and plays an important role in improving the user experience. Based on a small amount of user-generated preceding text, language models (LMs) can predict the most probable next word or phrase.
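As a toy illustration of the idea (not Gboard's actual model, which is far larger and trained on-device), here is a minimal bigram next-word predictor; the corpus and function names are purely hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real keyboard LM is trained on vastly more text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigram frequencies: counts[w1][w2] = how often w2 follows w1.
counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    counts[w1][w2] += 1

def predict_next(word, k=3):
    """Return the k most frequent words observed after `word`."""
    return [w for w, _ in counts[word].most_common(k)]

print(predict_next("the"))  # "cat" is the most frequent follower of "the"
```

A real mobile keyboard replaces the bigram table with a compact neural LM, but the interface is the same: preceding text in, ranked candidate words out.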
In my last post, I covered a high-level overview of Federated Learning, its applications, advantages & challenges.
We also went through a high-level overview of how Federated Optimization algorithms work. But from a mathematical sense, how is Federated Learning training actually performed? That’s what we will be looking at in this post.
In Communication-Efficient Learning of Deep Networks from Decentralized Data, a paper by Google (3637 citations!!!), the authors proposed a federated optimization algorithm called FedAvg and compared it with a naive baseline, FedSGD.
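To make the idea concrete, here is a minimal, hypothetical sketch of FedAvg on a toy one-parameter regression problem: each client runs a few local gradient steps starting from the global weight, and the server averages the returned weights in proportion to client dataset sizes. All names, data, and hyperparameters are illustrative, not the paper's setup.

```python
import random

random.seed(0)

# Three hypothetical clients, each holding samples from y = 2 * x,
# with unequal dataset sizes.
def make_client(n):
    xs = [random.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x for x in xs]
    return xs, ys

clients = [make_client(n) for n in (50, 100, 150)]

def local_train(w, xs, ys, epochs=5, lr=0.5):
    """A few epochs of full-batch gradient descent on one client's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def fedavg_round(w_global):
    """One FedAvg round: clients train locally from the global weight;
    the server averages the results, weighted by client dataset size."""
    total = sum(len(xs) for xs, _ in clients)
    return sum(len(xs) * local_train(w_global, xs, ys)
               for xs, ys in clients) / total

w = 0.0
for _ in range(10):
    w = fedavg_round(w)
print(w)  # converges toward the true weight 2.0
```

FedSGD is the special case where each client takes a single gradient step per round; FedAvg's multiple local epochs are what cut the communication cost.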
Motivation Privacy-preserving Machine Learning has always been exciting to me. Since my B.Tech. thesis, which involved PPML (SMPC + Computer Vision), I hadn't gotten a chance to work on it. So, after about 2 years, I have started reading about it again and am sharing what I learn with the community.
Federated Learning is a domain that I had somewhat avoided during my thesis. I had some idea about the topic but never got into it much.
Paper: Language Models are Unsupervised Multitask Learners
Link: https://bit.ly/3vgaVJc
Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
Code: https://github.com/openai/gpt-2
I also made an annotated version of the paper, which you can find here.
What? The paper demonstrates that language models begin to learn NLP tasks like question answering, machine translation, reading comprehension and summarization without any explicit supervision. The results shown are obtained after training the model on a new dataset of millions of web pages called WebText.
Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Link: https://arxiv.org/pdf/1906.08237.pdf
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Code: https://github.com/zihangdai/xlnet
What? The paper proposes XLNet, a generalized autoregressive pretraining method that learns bidirectional contexts by training over all permutations of the factorization order, and that overcomes the limitations of BERT thanks to its autoregressive formulation. XLNet incorporates Transformer-XL as the underlying model. It outperforms BERT on 20 NLP tasks, including question answering, natural language inference, sentiment analysis, and document ranking.
Paper: BERT - Pre-training of Deep Bidirectional Transformers for Language Understanding
Link: https://bit.ly/3bdTUra
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Code: https://bit.ly/3vRXlM7
What? The paper proposes BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can then be fine-tuned with just one additional layer to create the final task-specific models.
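To make the fine-tuning recipe concrete, here is a hypothetical toy sketch: random feature vectors stand in for a frozen pre-trained encoder's pooled output, and only a single new logistic layer is trained on the downstream labels. Everything here (dimensions, data, learning rate) is illustrative, not the paper's actual setup.

```python
import math
import random

random.seed(0)

# Stand-in for frozen encoder outputs: each "sentence" is just a random
# 4-dim vector; the downstream labels follow a simple hidden rule.
DIM = 4
vectors = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(200)]
labels = [1 if sum(v) > 0 else 0 for v in vectors]

# The one additional task-specific layer: a logistic classifier,
# trained with plain SGD while the "encoder" stays frozen.
w, b = [0.0] * DIM, 0.0
lr = 0.1
for _ in range(100):  # epochs
    for v, y in zip(vectors, labels):
        z = sum(wi * xi for wi, xi in zip(w, v)) + b
        err = 1 / (1 + math.exp(-z)) - y  # dLoss/dlogit for cross-entropy
        w = [wi - lr * err * xi for wi, xi in zip(w, v)]
        b -= lr * err

def predict(v):
    return 1 if sum(wi * xi for wi, xi in zip(w, v)) + b > 0 else 0

acc = sum(predict(v) == y for v, y in zip(vectors, labels)) / len(labels)
print(acc)  # the single added layer learns the toy task almost perfectly
```

The point of the sketch is the division of labor: all the representational work is done during pre-training, so the task-specific addition can be this small.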
Paper: Improving Language Understanding by Generative Pre-Training
Link: https://bit.ly/3xITvGP
Blog: https://openai.com/blog/language-unsupervised/
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
Code: https://bit.ly/3gUFrUX
What? The paper proposes a semi-supervised technique that achieves better performance on a wide variety of tasks, such as textual entailment, question answering, semantic similarity, and text classification, using a single task-agnostic model. The model overcomes the constraint of the small amounts of annotated data available for these specific tasks through unsupervised generative pre-training of a language model on a large, diverse text corpus, followed by supervised discriminative fine-tuning on each specific task.