Johnson

Senior Director of ML Systems & AI Infrastructure

I work on CUDA kernel optimization, LLM inference and training systems, and large-scale distributed performance engineering, spanning execution planning, memory hierarchy tuning, benchmarking, and system-level scaling for modern generative AI workloads.

What I focus on

  • CUDA kernel optimization and Tensor Core programming
  • LLM inference optimization and speculative decoding
  • Large-scale distributed training and performance analysis
  • Benchmarking, profiling, and bottleneck isolation across kernel, runtime, and cluster layers
  • Bridging model frameworks, runtime systems, and hardware execution
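Kernel-level optimization of the kind listed above revolves around tiling: moving fixed-size blocks of operands through the memory hierarchy so the arithmetic units stay fed. A toy Python sketch of the blocking loop structure (the real work is CUDA with shared-memory staging and Tensor Core MMA instructions; this only shows the tiling pattern):

```python
def blocked_matmul(A, B, tile=2):
    """Multiply dense matrices A (m x k) and B (k x n) tile by tile.

    Each (it, jt, kt) triple processes one tile-sized block, mimicking
    how a CUDA thread block stages operand tiles before accumulating.
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for it in range(0, m, tile):          # tile of C's rows
        for jt in range(0, n, tile):      # tile of C's columns
            for kt in range(0, k, tile):  # reduction-dimension tile
                for i in range(it, min(it + tile, m)):
                    for j in range(jt, min(jt + tile, n)):
                        acc = C[i][j]
                        for p in range(kt, min(kt + tile, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C
```

The result is independent of the tile size; on a GPU the tile size is what determines shared-memory footprint and register pressure.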

Selected impact

  • Improved speculative decoding performance in SGLang through metadata replay and kernel fusion, delivering end-to-end throughput gains and major reductions in CPU and kernel overhead
  • Contributed to FlashAttention benchmarking and validation work across modern GPU architectures
  • Contributed examples and kernel patterns to CUTLASS and CuTe DSL for Hopper and Blackwell-class architectures
  • Led performance validation and scaling analysis for large-scale training and inference workloads across multi-node GPU clusters
  • Built and refined benchmarking methodology across kernels, collectives, and end-to-end training and inference paths
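The benchmarking methodology mentioned above typically follows a warmup-then-measure, percentile-reporting pattern. A minimal illustrative harness (names are hypothetical, not any specific internal tool):

```python
import time
import statistics

def bench(fn, warmup=5, iters=50):
    """Time fn() after warmup iterations; report latency percentiles in ms.

    Warmup absorbs one-time costs (JIT compilation, cache fills, lazy
    initialization) so the measured iterations reflect steady state.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    samples.sort()
    return {
        "p50": statistics.median(samples),
        "p99": samples[min(iters - 1, int(iters * 0.99))],
        "mean": statistics.fmean(samples),
    }
```

For GPU kernels the same shape applies, with device-side timing (CUDA events) and explicit synchronization replacing the wall-clock timer.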

Open source and systems work

SGLang

Work on speculative decoding, metadata replay optimization, and inference-path performance improvements.
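Speculative decoding reduces to a propose-then-verify loop: a cheap draft model emits several tokens, the target model scores all of them in one batched pass, and tokens are accepted up to the first disagreement. A toy greedy-acceptance sketch (not SGLang's implementation, which uses tree speculation and probabilistic acceptance; the toy `draft_next`/`target_next` functions stand in for model decoding):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One speculation round with greedy acceptance.

    draft_next / target_next map a token sequence to the next token
    (stand-ins for argmax decoding of real draft/target models).
    On the first mismatch the target's token is emitted instead, so
    every round makes at least one token of progress.
    """
    # Draft proposes k tokens autoregressively (cheap model).
    proposed = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)
    # Target verifies: in a real system this is ONE batched forward
    # pass over all k positions, not k sequential model calls.
    accepted = []
    ctx = list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want != t:
            accepted.append(want)  # correction token from the target
            break
        accepted.append(t)
        ctx.append(t)
    return accepted
```

The CPU-side metadata replay work targets the bookkeeping around this loop: rebuilding attention metadata for each speculation round is expensive, and replaying cached metadata keeps the GPU from idling between rounds.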

FlashAttention

Benchmarking, validation, and optimization work for modern accelerator platforms.
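The core recurrence FlashAttention is built on is the online (streaming) softmax: exact attention over arbitrarily long key/value sequences in fixed memory, rescaling a running max and normalizer as each block of scores arrives. A single-query, pure-Python sketch of that recurrence (the real kernels apply it per tile, fused with the matmuls):

```python
import math

def online_softmax_attention(scores, values):
    """Streaming softmax(scores) . values for a single query.

    Keeps only a running max (m), normalizer (l), and weighted sum
    (acc), rescaling the old state whenever a new maximum appears,
    so the full score vector is never materialized.
    """
    m = float("-inf")  # running max, for numerical stability
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running sum of exp(score - m) * value
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new)  # exp(-inf) == 0.0 on first step
        w = math.exp(s - m_new)
        l = l * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / l
```

Because the rescaling is exact, the result matches the naive two-pass softmax to floating-point precision, which is exactly what validation work checks across tile sizes and architectures.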

CUTLASS / CuTe DSL

Examples and kernel patterns for modern GPU architectures, including Hopper and Blackwell-oriented programming models.
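CuTe's central abstraction is the layout, a (shape, stride) pair mapping logical coordinates to linear offsets; hierarchical layouts of this form describe everything from a thread's register fragment to a full tile. A flat Python sketch of the coordinate-to-offset map (the actual CuTe DSL composes nested layouts algebraically, which this does not attempt):

```python
from itertools import product

def layout_offset(coord, stride):
    """Map a logical coordinate to a linear offset: dot(coord, stride).

    A 4x8 row-major tile has stride (8, 1); the same tile stored
    column-major has stride (1, 4). Only the strides change; the
    logical shape, and all code indexing it, stays the same.
    """
    return sum(c * s for c, s in zip(coord, stride))

def layout_offsets(shape, stride):
    """All linear offsets of a layout, in logical coordinate order."""
    return [layout_offset(c, stride)
            for c in product(*(range(n) for n in shape))]
```

Separating logical shape from physical stride is what lets one kernel body target swizzled shared-memory banks, interleaved fragments, or plain row-major global memory.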

Distributed AI systems

Performance engineering across NCCL, training frameworks, cluster scaling, and production-facing benchmark workflows.
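A first-order model is useful when sanity-checking collective numbers: ring allreduce puts 2(p-1)/p of the buffer on the wire per rank, so measured times are compared against t ≈ 2(p-1)/p · n/B plus per-step latency. A sketch of that standard alpha-beta cost model (an illustrative assumption of the textbook form, not any NCCL internal):

```python
def ring_allreduce_time(n_bytes, p, link_bw, latency=0.0):
    """Alpha-beta estimate of ring allreduce time in seconds.

    Reduce-scatter and allgather each take (p-1) steps sending n/p
    bytes, so bytes on the wire per rank total 2*(p-1)/p * n.
    link_bw is in bytes/sec; latency is the per-step alpha cost.
    """
    steps = 2 * (p - 1)
    bytes_on_wire = 2 * (p - 1) / p * n_bytes
    return steps * latency + bytes_on_wire / link_bw
```

For a 1 GB buffer across 8 GPUs on 100 GB/s links with zero latency this predicts 17.5 ms; a measured run well above the model points at topology, congestion, or overlap issues rather than the collective itself.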

Featured repositories

Pinned repositories representing this work, for example:

  • sglang — inference-path optimization and speculative decoding work
  • flash-attention — benchmarking and optimization work
  • cutlass — CUDA kernel examples and DSL-based kernel work
  • dgxc-benchmarking — benchmark methodology and platform validation
  • a personal notes or benchmark repo — performance studies, repros, and writeups
  • a CUDA or LLM systems repo — focused examples or performance experiments

What you will find in my repos

  • Performance-focused engineering work
  • Benchmark harnesses and reproducible experiments
  • Profiling-driven optimization
  • Notes on GPU architecture, Tensor Core programming, and distributed systems
  • Practical writeups that connect low-level optimization to end-to-end impact

Background

I lead and contribute across the stack, from kernels and runtime behavior to large-scale AI platform performance. My work has included training and inference optimization, cluster-scale benchmarking, and upstream contributions to widely used open-source AI systems.

Contact


Optional shorter version

Johnson Li, Senior Director of ML Systems & AI Infrastructure

CUDA kernels, LLM inference and training systems, distributed performance engineering, and benchmarking for modern AI workloads.

Focused on turning low-level optimization into measurable end-to-end gains across kernels, runtimes, and clusters.
