efficient-ml

Flash Attention in PyTorch

A simplified implementation of FlashAttention in PyTorch. I implement the forward-pass and backward-pass algorithms from the paper and show that they are numerically equivalent to the standard attention formulation in Transformers. I also include some benchmarking code. Note that this is for educational purposes only, as I have not implemented the CUDA kernels or the SRAM-aware memory tricks described in the paper.
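For reference, the core idea of the forward pass is tiling plus an online softmax: process K/V in blocks while keeping a running row-wise max and softmax denominator, rescaling the accumulated output as each block arrives. Below is a minimal sketch of that idea, not the repo's actual code; the function name, block sizes, and single-head (seq_len, head_dim) layout are illustrative assumptions.

```python
# Minimal sketch of FlashAttention's tiled forward pass with online softmax.
# Assumes single-head Q, K, V of shape (seq_len, head_dim); block sizes are illustrative.
import math
import torch

def flash_attention_forward(Q, K, V, block_q=64, block_k=64):
    seq_len, head_dim = Q.shape
    scale = 1.0 / math.sqrt(head_dim)

    O = torch.zeros_like(Q)
    # Running row-wise max and softmax denominator, one entry per query row.
    m = torch.full((seq_len,), float("-inf"), device=Q.device, dtype=Q.dtype)
    l = torch.zeros(seq_len, device=Q.device, dtype=Q.dtype)

    for q_start in range(0, seq_len, block_q):
        q_end = min(q_start + block_q, seq_len)
        Qi, mi, li, Oi = Q[q_start:q_end], m[q_start:q_end], l[q_start:q_end], O[q_start:q_end]

        for k_start in range(0, seq_len, block_k):
            k_end = min(k_start + block_k, seq_len)
            Kj, Vj = K[k_start:k_end], V[k_start:k_end]

            Sij = (Qi @ Kj.T) * scale              # score block (Bq, Bk)
            mij = Sij.max(dim=-1).values           # block row max
            Pij = torch.exp(Sij - mij[:, None])    # unnormalized block softmax
            lij = Pij.sum(dim=-1)

            # Online softmax: merge the block statistics with the running ones.
            mi_new = torch.maximum(mi, mij)
            alpha = torch.exp(mi - mi_new)         # rescale old accumulator
            beta = torch.exp(mij - mi_new)         # rescale new block
            li = alpha * li + beta * lij
            Oi = Oi * alpha[:, None] + (Pij * beta[:, None]) @ Vj
            mi = mi_new

        O[q_start:q_end] = Oi / li[:, None]        # final normalization per row
        m[q_start:q_end], l[q_start:q_end] = mi, li

    return O
```

The output should match `torch.softmax(Q @ K.T / math.sqrt(head_dim), dim=-1) @ V` up to floating-point tolerance, which is the equivalence check mentioned above; the real speedups come from fusing these loops into a CUDA kernel that keeps the blocks in SRAM, which this sketch deliberately omits.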

Paper Summary #7 - Efficient Transformers: A Survey

Paper: Efficient Transformers: A Survey
Link: https://arxiv.org/abs/2009.06732
Authors: Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

I have wanted to summarize this paper for a long time because of the immense amount of information it contains. Thanks to the Cohere For AI community for hosting a session on this paper, which prompted me to revisit it.

What?

This is a survey of the various memory-efficiency improvements proposed on top of the original Transformer architecture by Vaswani et al.