Shreyansh Singh
  • mlsys
  • transformer
  • paper-summaries
  • MLSys
  • LLMs
  • PPML
  • Paper Summary #17 - Engram

    A technical explainer for DeepSeek's Engram layers: conditional memory, hashed n-gram lookup, context-aware gating, sparse-capacity allocation, and the implementation path inside Transformer blocks.

    25 min read   ·   May 17, 2026

    2026   ·   llms   transformers   engram   memory   sparsity   ·   LLMs   MLSys

  • Paper Summary #16 - Canon Layers

    A deep dive into Canon Layers: why sequence models need cheap horizontal token flow, how residual causal depthwise convolution implements it, and where Canon-A/B/C/D fit inside Transformer and linear-model blocks.

    26 min read   ·   May 16, 2026

    2026   ·   llms   transformers   canon-layers   paper-summaries   ·   LLMs

  • Paper Summary #15 - Hyper-Connections and mHC

    From residual-stream basics to manifold-constrained mixing: why widening the residual path helps, why unconstrained products destabilize depth, and how Sinkhorn-Knopp turns HC into conservative feature routing.

    35 min read   ·   May 15, 2026

    2026   ·   llms   transformers   residual-connections   hyper-connections   mhc   paper-summaries   ·   LLMs   MLSys

  • Deep dive into CUDA Scan Kernels: Hierarchical and Single-Pass Variants

    A guided tour of hierarchical and single-pass CUDA scan kernels with coarsening and warp-level optimizations.

    38 min read   ·   February 19, 2026

    2026   ·   cuda   gpu   scan   prefix-sum   mlsys   ·   CUDA   MLSys

  • Paper Summary #14 - Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

    My notes on Part 3.1 (Knowledge Storage and Extraction) of the Physics of Language Models series of papers.

    22 min read   ·   January 17, 2026

    2026   ·   transformer   knowledge   paper-summaries   ·   LLMs

  • Understanding Multi-Head Latent Attention (MLA)

    A mathematical and code deep dive into one of the key innovations from DeepSeek: Multi-Head Latent Attention (MLA).

    14 min read   ·   November 08, 2025

    2025   ·   attention   mla   ·   LLMs

  • Deriving the Gradient for the Backward Pass of Layer Normalization

    Understanding the math behind Layer Normalization and deriving the gradients for the backward pass.

    8 min read   ·   June 04, 2025

    2025   ·   ml   math   ·   ML

  • Notes from GTC'25: CUDA Techniques to Maximize Compute and Instruction Throughput

    My notes from the talk on maximizing compute and instruction throughput at NVIDIA GTC 2025.

    32 min read   ·   April 04, 2025

    2025   ·   cuda   mlsys   ·   MLSys

  • Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 2

    Second part of my notes from the talk on maximizing memory bandwidth at NVIDIA GTC 2025.

    33 min read   ·   March 23, 2025

    2025   ·   cuda   mlsys   ·   MLSys

  • Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 1

    First part of my notes from the talk on maximizing memory bandwidth at NVIDIA GTC 2025.

    33 min read   ·   March 23, 2025

    2025   ·   cuda   mlsys   ·   MLSys

© Copyright 2026 Shreyansh Singh.