Sanghun Cho

AI Research Engineer

About Me

I’m interested in all technologies that make AI fast & realistic (Distributed Training, GPU Kernels, Compilers, NPUs, etc.). My role is to accelerate training and inference to improve generative AI performance across a variety of fields. I’ve worked on optimizing training costs in T2I and Bio, and recently made major contributions to popular acceleration frameworks (e.g. FlashAttention, Optimum Quanto) by developing custom GPU kernels for LLMs.

Skills: Collective Communication, Computer Architecture, CUDA, Parallel Computing, PyTorch, Triton

Projects

DeepSeek Architecture Training Optimization

Present

Latency & Memory Optimization for DeepSeek Architecture Training

  • Multi-head Latent Attention (MLA) BWD Kernel
    • Responsible for adding MLA BWD support to FlashAttention-3 (Pull Request)
    • 1.1x-1.2x speedup (vs. FA3 + explicit padding & unpadding)
  • Custom Non-GEMM Kernels for DeepGEMM FP8 Grouped GEMM
    • Responsible for implementing custom non-GEMM kernels (activation & weight grouped quantization, permutation + alignment fusion, etc.) for training
    • Grouped Quantization
      • Quantized each group’s activations & weights in a single fused launch (rowwise & columnwise)
      • 1.7x-36x speedup (vs. TransformerEngine)
    • Permutation + Alignment Fusion
      • DeepGEMM’s FP8 Grouped GEMM requires each group’s activations to be 128-aligned
      • 3x speedup (vs. TransformerEngine permutation + alignment)
  • Keywords: CUDA, PyTorch, Training Acceleration
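
The alignment requirement above can be sketched in plain Python (helper names are mine, not DeepGEMM’s API): each group’s row count is rounded up to a multiple of 128 so every expert group starts on an aligned boundary, and the fused kernel permutes tokens directly into those padded slots.

```python
# Sketch of the 128-alignment step that DeepGEMM's FP8 grouped GEMM
# expects. Names are illustrative, not the actual kernel interface.

ALIGN = 128

def round_up(n: int, align: int = ALIGN) -> int:
    """Smallest multiple of `align` that is >= n."""
    return (n + align - 1) // align * align

def aligned_group_offsets(group_counts):
    """Start offset of each group after padding every group to ALIGN rows."""
    offsets, total = [], 0
    for count in group_counts:
        offsets.append(total)
        total += round_up(count)
    return offsets, total

# Example: three expert groups receiving 3, 130, and 257 tokens.
offsets, padded_rows = aligned_group_offsets([3, 130, 257])
print(offsets, padded_rows)  # [0, 128, 384] 768
```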

Sequence Packing for Distributed Training in Megatron-LM

Apr 2025

Implemented sequence packing correctly (and efficiently) for distributed training (e.g. CP, DP) in Megatron-LM

  • Sequence packing + Context Parallelism (CP)
    • Responsible for implementing exact sequence packing for CP
    • Needed to slice the input sequence into chunks properly for Striped Attention
      • Striped Attention handles computation imbalance caused by causal attention
      • TransformerEngine’s fused attention layer uses Striped Attention
    • Fixed RoPE issue in sequence packing + context parallelism
      • Megatron-LM previously didn’t use TransformerEngine’s RoPE kernel, which handles sequence packing + CP
  • Sequence packing + Data Parallelism (DP)
    • Responsible for implementing minibatch reordering to mitigate computation imbalance
      • Sequence packing + DP causes computation imbalance since the sum of squared sequence lengths differs across DP ranks
    • Built a custom data sampler that sorts the global batch & distributes it across DP ranks
    • 1.06x-1.1x speedup at long sequence lengths
  • Keywords: PyTorch, Training Acceleration
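
The balancing objective can be illustrated with a small sketch (a greedy longest-first heuristic, not the project’s exact sorting sampler): attention cost grows roughly with the sum of squared sequence lengths, so that sum is what gets equalized across DP ranks.

```python
import heapq

def balance_across_ranks(seq_lens, num_ranks):
    """Assign sequences to DP ranks so the per-rank sum of squared
    lengths (a proxy for attention FLOPs) stays balanced.
    Greedy heuristic: take sequences longest-first and always give
    the next one to the currently lightest rank."""
    heap = [(0, rank) for rank in range(num_ranks)]  # (load, rank)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for length in sorted(seq_lens, reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(length)
        heapq.heappush(heap, (load + length * length, rank))
    return assignment

ranks = balance_across_ranks([4096, 512, 512, 3800, 1024, 2048], 2)
loads = [sum(l * l for l in r) for r in ranks]
```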

vLLM INT4 KV Cache Attention Backend

Feb 2025

INT4 KV cache quantization for long-context LLM inference with minimal memory usage

  • Responsible for adding an INT4 KV cache attention backend to vLLM to reduce memory usage for long-context LLM inference
  • Developed a Triton attention kernel for prefix KV caching (also used in chunked prefill)
    • Used inline assembly for fast INT4 -> FP16 (or BF16) dequantization
  • 1.1x-1.63x speedup, 3.87x memory saving (vs. xFormers backend)
  • Keywords: CUDA (especially PTX Assembly), Inference Acceleration, PyTorch, Triton
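
A minimal pure-Python sketch of the INT4 scheme (per-tensor asymmetric quantization here for brevity; a real KV cache would use per-token or per-head scales, and the kernel does the unpacking step with inline PTX):

```python
def quantize_int4(values):
    """Asymmetric INT4 quantization: map floats onto codes in [0, 15]
    with a scale and zero-point, then pack two 4-bit codes per byte
    (low nibble first). Assumes an even number of values."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0
    zero = lo
    codes = [min(15, max(0, round((v - zero) / scale))) for v in values]
    packed = bytes(codes[i] | (codes[i + 1] << 4)
                   for i in range(0, len(codes), 2))
    return packed, scale, zero

def dequantize_int4(packed, scale, zero):
    """Unpack nibbles and map back to floats (the fast path for this
    is the inline-assembly INT4 -> FP16/BF16 conversion)."""
    out = []
    for byte in packed:
        out.append((byte & 0xF) * scale + zero)
        out.append((byte >> 4) * scale + zero)
    return out

vals = [0.0, 1.5, -2.0, 3.0]
packed, s, z = quantize_int4(vals)     # 4 floats -> 2 bytes
restored = dequantize_int4(packed, s, z)
```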

Fused Attention Z-loss

Oct 2024

Kernel fusion for attention z-loss to mitigate training instability without hurting model quality

  • Responsible for accelerating attention z-loss implementation used for stable training of LLM (e.g. PaLM, ST-MoE)
  • Achieved low latency & minimal memory usage even at long sequence lengths
    • Leveraged fused attention backward op to optimize computation
  • 7x-14x speedup, 63x-2900x memory saving
  • Keywords: PyTorch, Training Acceleration, Triton
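
For reference, z-loss penalizes the squared log-partition function of each logit row; a naive (unfused) version looks like the sketch below, whereas the fused version computes log Z inside the attention backward without materializing the score matrix. The coefficient is a typical value from the literature, not necessarily the project’s setting.

```python
import math

def z_loss(logit_rows, coeff=1e-4):
    """Z-loss as used in PaLM / ST-MoE: penalize (log Z)^2 where
    log Z = logsumexp(logits) per row, nudging logits to stay
    normalized without changing the softmax output itself."""
    total = 0.0
    for row in logit_rows:
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z ** 2
    return coeff * total / len(logit_rows)

loss = z_loss([[0.0, 0.0], [1.0, 2.0, 3.0]])
```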

Marlin w. Scaled Zero-point

Jun 2024

Github Repository

AWQ-compatible W4A16 GEMM implementation based on Marlin kernel for weight-only quantization

  • Responsible for adding zero-point in Marlin kernel for AWQ
    • Used scaled zero-point to remove redundant computation for dequantization
  • Integrated the new kernel into quantization libraries (e.g. huggingface/optimum-quanto)
  • Speedup: 2.2x (vs. PyTorch native), 1.3x (vs. AWQ kernel)
  • Keywords: CUDA, Inference Acceleration, PyTorch
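
The scaled zero-point trick can be shown in a few lines (names illustrative): pre-computing sz = s * z offline turns the per-element subtract-then-multiply into a single fused multiply-add at dequantization time.

```python
def dequant_naive(q, scale, zero):
    """AWQ-style dequantization: w = s * (q - z).
    One subtract + one multiply per element."""
    return [scale * (x - zero) for x in q]

def dequant_scaled_zp(q, scale, scaled_zero):
    """With the zero-point pre-scaled offline (sz = s * z),
    dequantization becomes a single multiply-add: w = s * q - sz,
    removing the redundant subtract from the hot loop."""
    return [scale * x - scaled_zero for x in q]

q = [0, 3, 7, 15]          # 4-bit codes
s, z = 0.25, 8             # scale and zero-point
assert dequant_naive(q, s, z) == dequant_scaled_zp(q, s, s * z)
```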

Speeding up ALiBi by 3-5x with a hardware-efficient implementation

  • Responsible for implementing ALiBi inside FlashAttention-2 in a hardware-efficient manner
  • Dramatically reduced the number of memory accesses required to apply ALiBi, making it 3-5x faster
  • Keywords: CUDA (especially CUTLASS), PyTorch, Training Acceleration
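
A sketch of the idea (illustrative helpers, not the FlashAttention-2 code): ALiBi’s bias is a deterministic function of the head slope and the query/key positions, so a kernel can compute it in registers on the fly instead of reading a materialized (L x L) bias tensor, which is where the memory-access savings come from.

```python
def alibi_slopes(num_heads):
    """Geometric ALiBi slopes: head h gets 2^(-8 * (h + 1) / H),
    per the ALiBi paper (for H a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(slope, q_idx, k_idx):
    """Bias for one (query, key) pair, computable on the fly inside
    the attention kernel; <= 0 for causal positions k <= q."""
    return slope * (k_idx - q_idx)

slopes = alibi_slopes(8)          # [0.5, 0.25, ..., 2**-8]
bias = alibi_bias(slopes[0], 10, 7)
```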

Solvent

Jul 2023

Github Repository

A Framework for Protein Folding

  • Responsible for optimizing AI training (especially developing & applying custom kernels)
  • xFormers memory-efficient attention w. bias-related optimizations via CUDA profiling
    • Multiple attention biases were previously non-contiguous in memory due to permutation and had mismatched shapes, causing broadcasts during reduction
    • Through CUDA profiling, discovered unnecessary memory rearrangements
    • Reordered memory rearrangements and reduction operations, reducing memory accesses by 256x
  • Operation fusion w. Triton
    • LayerNorm with chunking to exploit shapes that run more efficiently
    • Linear + Activation (e.g. Sigmoid)
    • Improved computation speed by reducing memory accesses through kernel fusion
  • Improved training speed & memory footprint by 30%
  • Keywords: CUDA, PyTorch, Training Acceleration, Triton
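
The fusion pattern can be illustrated in plain Python (numerics only; the actual win is that the Triton kernel applies the activation in the same pass as the matmul, so the pre-activation tensor never round-trips through global memory):

```python
import math

def linear(x, w, b):
    """Plain matmul + bias for a single row vector x: out_j = x . w[:, j] + b_j."""
    return [sum(xi * wij for xi, wij in zip(x, col)) + bj
            for col, bj in zip(zip(*w), b)]

def sigmoid(v):
    return [1 / (1 + math.exp(-vi)) for vi in v]

def fused_linear_sigmoid(x, w, b):
    """Fused form: apply sigmoid in the same loop that produces each
    output element, so no intermediate vector is written out. In a
    Triton kernel this removes one global-memory round trip."""
    out = []
    for col, bj in zip(zip(*w), b):
        acc = sum(xi * wij for xi, wij in zip(x, col)) + bj
        out.append(1 / (1 + math.exp(-acc)))
    return out

x = [1.0, -2.0]
w = [[0.5, 1.0], [0.25, -0.5]]   # 2x2 weight
b = [0.0, 0.1]
assert fused_linear_sigmoid(x, w, b) == sigmoid(linear(x, w, b))
```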

minDALL-E on Conceptual Captions

Dec 2021

Github Repository

1.3B Text-to-image Generation Model Trained on 14 Million Image-text Pairs for Non-commercial Purposes

  • Responsible for optimizing AI training (computation, communication, etc.)
  • Computation optimization (especially for attention) via CUDA profiling
  • Communication optimization (especially for distributed training) w. ZeRO-3 (or FSDP) + PowerSGD
    • Reproduced distributed training technique mentioned in DALL-E
    • Reduced memory footprint for training with FSDP & solved inter-node network bandwidth bottleneck by introducing PowerSGD
    • Presentation: 코끼리를 GPU에 넣는 법 (How to Fit an Elephant into a GPU), if(kakao)2022, Dec 9, 2022
  • Keywords: Collective Communication, CUDA, PyTorch, Training Acceleration
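
PowerSGD’s compression idea, sketched at rank 1 without error feedback (illustrative only, not the project’s training code): each rank communicates the small factors P = Gq and Q = GᵀP instead of the full gradient G, cutting all-reduce volume from m*n to roughly m + n numbers per matrix.

```python
import math

def powersgd_rank1(grad, q):
    """One local PowerSGD step at rank 1 on an m x n gradient matrix.
    In distributed training, only the small vectors p (length m) and
    q_new (length n) would be all-reduced, not `grad` itself."""
    m, n = len(grad), len(grad[0])
    p = [sum(grad[i][j] * q[j] for j in range(n)) for i in range(m)]
    norm = math.sqrt(sum(v * v for v in p)) or 1.0
    p = [v / norm for v in p]                       # orthonormalize P
    q_new = [sum(grad[i][j] * p[i] for i in range(m)) for j in range(n)]
    # Low-rank reconstruction G_hat = P Q^T used in place of G.
    approx = [[p[i] * q_new[j] for j in range(n)] for i in range(m)]
    return approx, q_new

# A rank-1 gradient is reconstructed exactly in a single step.
grad = [[2.0, 4.0], [1.0, 2.0]]
approx, _ = powersgd_rank1(grad, [1.0, 0.0])
```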

Experience

Kakao Brain (now Kakao)

AI Research Engineer

Aug 2021 - Present

www.kakaocorp.com

Large-Scale Generative AI Company

  • AI Training Performance Optimization
    • Responsible for optimizing the performance of AI training in various fields (e.g. T2I, Bio, LLM, etc.)
    • Improved speed and memory footprint in distributed training by profiling workloads to resolve computational bottlenecks & tuning the collective communication essential to training
  • GPU Kernel Development
    • Responsible for developing CUDA & Triton kernels to efficiently use GPUs
    • Improved speed and memory footprint through operation fusion & further optimized attention operations to maximize computational efficiency

Education

Korea Advanced Institute of Science and Technology (KAIST)

Master of Science, School of Electrical Engineering

2019 - 2021

Graduate Research Assistant in Computer System and Network Lab (CSNL)

  • Optimized main bottlenecks of distributed deep learning training system (e.g. collective communication)
  • Profiled various workloads running on multi-GPU system (e.g. CUDA Unified Memory)
  • GPA: 3.42/4.3
  • Thesis: Communication Optimization for Deep Learning in Distributed Processing Environments

Sungkyunkwan University (SKKU)

Bachelor of Science, Department of Computer Science

2013 - 2019

Undergraduate Research Assistant in Advanced Research on Compilers and Systems (ARCS)

  • Optimized CPU-GPU communication
  • Studied GPGPU platform (e.g. CUDA)
  • GPA: 4.15/4.5
  • Thesis: Multithreaded Double Queuing for GPGPU

Publications

Logical/Physical Topology-Aware Collective Communication in Deep Learning Training, Sanghun Cho, Hyojun Son, John Kim, IEEE Symposium on High-Performance Computer Architecture (HPCA), Mar 24, 2023

Bandwidth Bottleneck in Network-on-Chip for High-Throughput Processors, Jiho Kim, Sanghun Cho, Minsoo Rhu, Ali Bakhoda, Tor M. Aamodt, John Kim, ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Oct 5, 2020

Multithreaded Double Queuing for Balanced CPU-GPU Memory Copying, Sanghun Cho, Jaewan Hong, Jungsik Choi, Hwansoo Han, ACM/SIGAPP Symposium on Applied Computing (SAC), Apr 11, 2019

Automatic Memory Pinning Management for Fast Data Transfer on GPU Computing, Jaewan Hong, Sanghun Cho, Hwansoo Han, Korea Software Conference (KSC), Dec 1, 2018