- mlsys
- transformer
- paper-summaries
- MLSys
- LLMs
- PPML
-
Paper Summary #17 - Engram
A technical explainer for DeepSeek's Engram layers: conditional memory, hashed n-gram lookup, context-aware gating, sparse-capacity allocation, and the implementation path inside Transformer blocks.
-
Paper Summary #16 - Canon Layers
A deep dive into Canon Layers: why sequence models need cheap horizontal token flow, how residual causal depthwise convolution implements it, and where Canon-A/B/C/D fit inside Transformer and linear-model blocks.
-
Paper Summary #15 - Hyper-Connections and mHC
From residual-stream basics to manifold-constrained mixing: why widening the residual path helps, why unconstrained products destabilize depth, and how Sinkhorn-Knopp turns HC into conservative feature routing.
-
Deep dive into CUDA Scan Kernels: Hierarchical and Single-Pass Variants
A guided tour of hierarchical and single-pass CUDA scan kernels with coarsening and warp-level optimizations.
-
Paper Summary #14 - Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
My notes from the Physics of Language Models series of papers.
-
Understanding Multi-Head Latent Attention (MLA)
A mathematical and code deep dive into one of DeepSeek's key innovations: Multi-Head Latent Attention (MLA).
-
Deriving the Gradient for the Backward Pass of Layer Normalization
Understanding the math behind Layer Normalization and deriving the gradients for the backward pass.
-
Notes from GTC'25: CUDA Techniques to Maximize Compute and Instruction Throughput
My notes from the talk on maximizing compute and instruction throughput at NVIDIA GTC 2025.
-
Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 2
Second part of my notes from the talk on maximizing memory bandwidth at NVIDIA GTC 2025.
-
Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 1
First part of my notes from the talk on maximizing memory bandwidth at NVIDIA GTC 2025.