A simplified implementation of FlashAttention in PyTorch. I have implemented the forward-pass and backward-pass algorithms from the paper and verified that they are equivalent to the standard attention formulation used in Transformers. I also include some benchmarking code.
Note that this is for educational purposes only, as I haven't implemented any of the CUDA kernels or SRAM tiling tricks described in the paper.
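The core idea of the forward pass can be sketched in a few lines of PyTorch: process K and V in blocks while keeping a running row-wise max and softmax normalizer, so the full N×N score matrix is never materialized. This is a minimal single-head sketch of the online-softmax recurrence, not the repo's actual code; the function names and block size are illustrative.

```python
import torch

def standard_attention(q, k, v):
    # Reference: softmax(Q K^T / sqrt(d)) V, materializing all scores.
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def tiled_attention(q, k, v, block_size=16):
    # FlashAttention-style forward: iterate over K/V blocks, keeping a
    # running row max `m` and running denominator `l` so the softmax is
    # computed online, one block of scores at a time.
    n, d = q.shape
    scale = d ** -0.5
    o = torch.zeros_like(q)
    m = torch.full((n, 1), float('-inf'))  # running row-wise max
    l = torch.zeros(n, 1)                  # running softmax denominator
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        s = (q @ kb.T) * scale                               # block scores
        m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
        p = torch.exp(s - m_new)                             # block numerators
        correction = torch.exp(m - m_new)                    # rescale old state
        l = l * correction + p.sum(dim=-1, keepdim=True)
        o = o * correction + p @ vb
        m = m_new
    return o / l
```

Both functions produce the same output up to floating-point error, which is exactly the equivalence check mentioned above.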
I reimplemented [Stanislav Fort's project](https://twitter.com/stanislavfort/status/1481263565998805002?s=20) in PyTorch. The GitHub repo has a notebook that generates adversarial images to 'fool' the ConvNeXt model's image classification capabilities. ConvNeXt came out earlier this year (2022) from Meta AI. FGSM (Fast Gradient Sign Method) is a simple and effective white-box attack whose goal is misclassification: noise is added to the input image, not randomly, but in the direction of the sign of the gradient of the cost function with respect to the input.
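The attack described above can be sketched generically in PyTorch. This is a minimal illustration, not the notebook's code: `model` stands in for any classifier (ConvNeXt in the project), and `eps` is the L-infinity perturbation budget.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps):
    # FGSM: take one step of size eps in the direction of the sign of
    # the gradient of the loss w.r.t. the input, which locally increases
    # the loss the fastest under an L-infinity budget.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixels in the valid [0, 1] range
```

The perturbation is bounded by `eps` per pixel, so for small `eps` the adversarial image is visually indistinguishable from the original while the model's prediction can flip.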
Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding
Link: https://arxiv.org/pdf/1906.08237.pdf
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
Code: https://github.com/zihangdai/xlnet
What? The paper proposes XLNet, a generalized autoregressive pretraining method that learns bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order, and that overcomes the limitations of BERT thanks to its autoregressive formulation. XLNet incorporates Transformer-XL as the underlying model. It outperforms BERT on 20 NLP tasks such as question answering, natural language inference, sentiment analysis, and document ranking.
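One way to picture the permutation objective: a sampled factorization order induces an attention mask in which each token may attend only to the tokens that precede it in that order, so over many sampled orders every token eventually sees context on both sides. A small sketch (illustrative, not the paper's two-stream implementation):

```python
import torch

def permutation_mask(perm):
    # For a factorization order `perm` (a permutation of 0..n-1), the
    # token at position perm[i] may attend only to the tokens at
    # positions perm[0..i-1]. mask[q, k] = True means "q can see k".
    n = len(perm)
    rank = torch.empty(n, dtype=torch.long)
    rank[perm] = torch.arange(n)  # position -> its index in the order
    return rank.unsqueeze(1) > rank.unsqueeze(0)
```

With the identity permutation this reduces to the usual left-to-right causal mask; other permutations let a token condition on positions to its right.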
Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Link: https://bit.ly/3bdTUra
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Code: https://bit.ly/3vRXlM7
What? The paper proposes BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The pre-trained BERT model can then be fine-tuned with just one additional output layer to create the final task-specific models.
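The "one additional layer" step can be sketched as a single linear classifier on top of the encoder's [CLS] representation. This is a hypothetical stand-in, not the paper's code: `encoder`, `hidden_size`, and `num_labels` are illustrative, and a real setup would plug in a pretrained BERT encoder.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    # BERT-style fine-tuning sketch: the only new parameters are a
    # single linear layer mapping the [CLS] hidden state to task logits.
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder                        # pretrained model (stand-in here)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, x):
        hidden = self.encoder(x)   # (batch, seq_len, hidden_size)
        cls = hidden[:, 0]         # representation of the [CLS] token
        return self.classifier(cls)  # task-specific logits
```

During fine-tuning, both the encoder and the new head are typically updated end-to-end on the downstream task.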
Paper: Improving Language Understanding by Generative Pre-Training
Link: https://bit.ly/3xITvGP
Blog: https://openai.com/blog/language-unsupervised/
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
Code: https://bit.ly/3gUFrUX
What? The paper proposes a semi-supervised technique that shows better performance on a wide variety of tasks like textual entailment, question answering, semantic similarity, and text classification using a single task-agnostic model. The model overcomes the constraint of having only small amounts of annotated data for these specific tasks by performing unsupervised generative pre-training of a language model on a large, diverse text corpus, followed by supervised discriminative fine-tuning on each specific task.
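During fine-tuning, the paper combines the supervised task loss with an auxiliary language-modeling loss, which it reports improves generalization. A sketch of that combined objective (function and argument names are illustrative; the paper uses a weight of 0.5 for the auxiliary term):

```python
import torch
import torch.nn.functional as F

def finetune_loss(task_logits, labels, lm_logits, tokens, lm_weight=0.5):
    # Combined fine-tuning objective: supervised task loss plus a
    # weighted auxiliary next-token language-modeling loss.
    task_loss = F.cross_entropy(task_logits, labels)
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.shape[-1]),  # predictions
        tokens[:, 1:].reshape(-1),                           # next tokens
    )
    return task_loss + lm_weight * lm_loss
```

The same pretrained Transformer produces both `lm_logits` (from its language-modeling head) and `task_logits` (from a small task-specific head), so the auxiliary term keeps the model close to its pretraining behavior while it adapts to the task.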