MLSys | Shreyansh Singh

Jun 21, 2026	Decompose-K: From torch.compile to Hand-Tuned Triton Kernels for Skinny Large‑K Matmuls
Jun 01, 2026	KV Cache Compaction and Compression: From Attention Sinks to Learned Memory
May 17, 2026	Paper Summary #17 - Engram
May 15, 2026	Paper Summary #15 - Hyper-Connections and mHC
Feb 19, 2026	Deep dive into CUDA Scan Kernels: Hierarchical and Single-Pass Variants
Apr 04, 2025	Notes from GTC'25: CUDA Techniques to Maximize Compute and Instruction Throughput
Mar 23, 2025	Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 2
Mar 23, 2025	Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 1
Mar 02, 2025	Faster Cross-Encoder Inference: Unleashing torch.compile for speed
Mar 26, 2023	Paper Summary #8 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Oct 10, 2022	Paper Summary #7 - Efficient Transformers: A Survey