Feb 19, 2026 Deep dive into CUDA Scan Kernels: Hierarchical and Single-Pass Variants
Apr 04, 2025 Notes from GTC'25: CUDA Techniques to Maximize Compute and Instruction Throughput
Mar 23, 2025 Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 2
Mar 23, 2025 Notes from GTC'25: CUDA Techniques to Maximize Memory Bandwidth and Hide Latency - Part 1
Mar 02, 2025 Faster Cross-Encoder Inference: Unleashing torch.compile for speed
Mar 26, 2023 Paper Summary #8 - FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Oct 10, 2022 Paper Summary #7 - Efficient Transformers: A Survey