Senior Director of ML Systems & AI Infrastructure
I work on CUDA kernel optimization, LLM inference and training systems, and large-scale distributed performance engineering, spanning execution planning, memory hierarchy tuning, benchmarking, and system-level scaling for modern generative AI workloads.
- CUDA kernel optimization and Tensor Core programming
- LLM inference optimization and speculative decoding
- Large-scale distributed training and performance analysis
- Benchmarking, profiling, and bottleneck isolation across kernel, runtime, and cluster layers (see the timing-harness sketch after this list)
- Bridging model frameworks, runtime systems, and hardware execution
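As a flavor of the kernel-level benchmarking listed above, here is a minimal CUDA-event timing harness of the kind used to separate kernel time from launch and host overhead. It is an illustrative sketch only, not code from any project named below; the SAXPY kernel, sizes, and iteration count are placeholders.

```cuda
// Minimal kernel timing harness (illustrative; kernel and sizes are placeholders).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));   // buffers left uninitialized:
    cudaMalloc(&y, n * sizeof(float));   // fine for a timing-only run

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    dim3 block(256), grid((n + 255) / 256);
    saxpy<<<grid, block>>>(n, 2.0f, x, y);          // warm-up launch
    cudaDeviceSynchronize();

    const int iters = 100;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // SAXPY moves 2 reads + 1 write of n floats per launch.
    double gbytes = 3.0 * n * sizeof(float) * iters / 1e9;
    printf("avg %.3f ms/iter, ~%.1f GB/s\n", ms / iters, gbytes / (ms / 1e3));

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(x); cudaFree(y);
    return 0;
}
```

Compile with `nvcc -O3`; the same structure carries over to multi-stream runs and profiler-driven analysis.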
- Improved speculative decoding performance in SGLang through metadata replay and kernel fusion, delivering end-to-end throughput gains and major reductions in CPU and kernel overhead (a small fusion sketch follows this list)
- Contributed to FlashAttention benchmarking and validation work across modern GPU architectures
- Contributed examples and kernel patterns to CUTLASS and CuTe DSL for Hopper and Blackwell-class architectures
- Led performance validation and scaling analysis for large-scale training and inference workloads across multi-node GPU clusters
- Built and refined benchmarking methodology across kernels, collectives, and end-to-end training and inference paths
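The fusion mentioned above is SGLang-specific; the sketch below only illustrates the general idea on a deliberately simple elementwise case, where bias-add and ReLU share one kernel instead of two, saving a launch and a round trip through global memory. Names and shapes are placeholders, not SGLang code.

```cuda
// Illustrative elementwise fusion: bias-add + ReLU in a single pass.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void bias_relu_fused(int n, int cols, const float* in,
                                const float* bias, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] + bias[i % cols];   // bias-add
        out[i] = v > 0.0f ? v : 0.0f;       // ReLU, fused into the same pass
    }
}

int main() {
    const int rows = 4096, cols = 4096, n = rows * cols;
    float *in, *bias, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&bias, cols * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    bias_relu_fused<<<(n + 255) / 256, 256>>>(n, cols, in, bias, out);
    cudaDeviceSynchronize();
    printf("fused bias+ReLU over %d elements: %s\n", n,
           cudaGetLastError() == cudaSuccess ? "ok" : "error");

    cudaFree(in); cudaFree(bias); cudaFree(out);
    return 0;
}
```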
- sglang: work on speculative decoding, metadata replay optimization, and inference-path performance improvements.
- flash-attention: benchmarking, validation, and optimization work for modern accelerator platforms.
- cutlass / CuTe DSL: examples and kernel patterns for modern GPU architectures, including Hopper and Blackwell-oriented programming models.
- dgxc-benchmarking: performance engineering across NCCL, training frameworks, cluster scaling, and production-facing benchmark workflows (see the all-reduce sketch after this list).
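For the collective side, a minimal single-node sketch, assuming NCCL is installed and one process drives all visible GPUs: an in-place float all-reduce issued per device inside one NCCL group, following the pattern of NCCL's single-process example. The buffer size is a placeholder; real collective benchmarking sweeps message sizes and reports bus bandwidth.

```cuda
// Single-process, multi-GPU all-reduce sketch (assumes NCCL is available).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    const size_t count = 1 << 26;                  // elements per GPU (placeholder)

    std::vector<ncclComm_t> comms(ndev);
    std::vector<float*> bufs(ndev);
    std::vector<cudaStream_t> streams(ndev);

    ncclCommInitAll(comms.data(), ndev, nullptr);  // one rank per device 0..ndev-1
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&bufs[d], count * sizeof(float));
        cudaStreamCreate(&streams[d]);
    }

    // Issue the collective for every device inside a single NCCL group.
    ncclGroupStart();
    for (int d = 0; d < ndev; ++d)
        ncclAllReduce(bufs[d], bufs[d], count, ncclFloat, ncclSum,
                      comms[d], streams[d]);
    ncclGroupEnd();

    // Wait for completion on every device, then clean up.
    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        cudaStreamSynchronize(streams[d]);
        cudaStreamDestroy(streams[d]);
        cudaFree(bufs[d]);
        ncclCommDestroy(comms[d]);
    }
    printf("all-reduce of %zu floats across %d GPUs completed\n", count, ndev);
    return 0;
}
```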
Pinned repositories that best represent this work, for example:
- sglang — inference-path optimization and speculative decoding work
- flash-attention — benchmarking and optimization work
- cutlass — CUDA kernel examples and DSL-based kernel work
- dgxc-benchmarking — benchmark methodology and platform validation
- a personal notes or benchmark repo — performance studies, repros, and writeups
- a CUDA or LLM systems repo — focused examples or performance experiments
- Performance-focused engineering work
- Benchmark harnesses and reproducible experiments
- Profiling-driven optimization
- Notes on GPU architecture, Tensor Core programming, and distributed systems (see the WMMA tile example after this list)
- Practical writeups that connect low-level optimization to end-to-end impact
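On the Tensor Core side, a minimal WMMA sketch: one warp computing a single 16x16x16 half-precision tile with a float accumulator (compile for sm_70 or newer). It is a teaching-sized example, not a tuned GEMM; the tuned kernel patterns live in CUTLASS and the CuTe DSL.

```cuda
// One-warp WMMA tile: C = A * B with half inputs and float accumulation.
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;

    wmma::fill_fragment(fc, 0.0f);            // C = 0
    wmma::load_matrix_sync(fa, a, 16);        // load A tile, leading dim 16
    wmma::load_matrix_sync(fb, b, 16);        // load B tile, leading dim 16
    wmma::mma_sync(fc, fa, fb, fc);           // C += A * B on Tensor Cores
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b;
    float *c;
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMalloc(&c, 16 * 16 * sizeof(float));

    wmma_tile<<<1, 32>>>(a, b, c);            // exactly one warp
    cudaDeviceSynchronize();
    printf("wmma 16x16x16 tile: %s\n",
           cudaGetLastError() == cudaSuccess ? "ok" : "error");

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```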
I lead and contribute across the stack, from kernels and runtime behavior to large-scale AI platform performance. My work has included training and inference optimization, cluster-scale benchmarking, and upstream contributions to widely used open-source AI systems.
Johnson Li
Senior Director of ML Systems & AI Infrastructure
CUDA kernels, LLM inference and training systems, distributed performance engineering, and benchmarking for modern AI workloads.
Focused on turning low-level optimization into measurable end-to-end gains across kernels, runtimes, and clusters.