SGEMM Optimization: From Naive to Tensor Core

Hand-written, progressively optimized CUDA matrix multiplication — the "Hello World" of HPC. Five kernel variants demonstrate core GPU optimization techniques, from a naive triple loop to a guarded Tensor Core WMMA path with explicit mixed-precision benchmarking.

Performance

The exact GFLOPS you see will depend on GPU model, CUDA version, and problem size. The benchmark now reports two Tensor Core views:

Tensor Core (WMMA end-to-end): includes FP32→FP16 conversion and safe fallback for non-WMMA-compatible dimensions.
Tensor Core (WMMA compute-only): times only the WMMA compute path and is shown only when M, K, and N are multiples of 16.

Verification tolerances are centralized in code:

Standard FP32 kernels: rtol=1e-3, atol=1e-4
Tensor Core mixed-precision path: rtol=5e-2, atol=1e-2

The default benchmark set includes:

aligned square cases: 512x512x512, 1024x1024x1024
one aligned non-square case: 256x384x640
one unaligned edge case: 511x513x1025 to exercise safe Tensor Core fallback

Note: the printed theoretical peak and roofline numbers are approximate analytical references, not exact hardware limits.

Optimization Roadmap

  Naive  ->  Tiled  ->  Bank-Free  ->  Double Buffer  ->  Tensor Core (WMMA)

Stage	What Changes	Why It Helps
Naive → Tiled	Load tiles into shared memory	Data reuse reduces global memory traffic by TILE_SIZE×
Tiled → Bank-Free	Pad shared memory `[32][33]`	Eliminates 32-way bank conflicts on column access
Bank-Free → Double Buffer	Two shared-memory buffers	Restructures tile staging and buffering to reduce memory stalls
→ Tensor Core	WMMA API `mma_sync`	Dedicated matrix units, ~8× peak over CUDA cores

Build & Run

Recommended path: CMake

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
./build/bin/sgemm_benchmark
./build/bin/sgemm_benchmark --dims 256 384 640
./build/bin/sgemm_benchmark -a

Quick local path: Makefile

make GPU_ARCH=sm_86
make benchmark
make test

Project Structure

sgemm-optimization/
├── src/
│   ├── kernels/
│   │   ├── naive_sgemm.cuh              # Naive: basic triple loop
│   │   ├── tiled_sgemm.cuh              # Tiled: shared memory blocking
│   │   ├── bank_conflict_free_sgemm.cuh # Bank conflict elimination
│   │   ├── double_buffer_sgemm.cuh      # Double buffer pipeline
│   │   └── tensor_core_sgemm.cuh        # Tensor Core (WMMA API)
│   ├── utils/
│   │   ├── cuda_utils.cuh               # CUDA error checking & utilities
│   │   ├── benchmark.cuh                # Benchmark framework (CUDA Events)
│   │   └── verify.cuh                   # Correctness verification (vs cuBLAS)
│   └── main.cu                          # Entry point
├── tests/
│   └── test_sgemm.cu                    # Google Test property tests
├── roofline_data_*.csv                  # Roofline analysis data
├── CMakeLists.txt                       # CMake build (recommended)
└── Makefile                             # Make build (quick start)

Testing

Google Test coverage includes:

Property	What It Verifies
Numerical correctness	Standard kernels match cuBLAS across square and non-square cases
Tensor Core fast path	WMMA path is validated on `16`-aligned dimensions
Tensor Core fallback	Non-aligned dimensions safely fall back to an FP32 kernel
Small/edge inputs	Includes `1x1x1` and unaligned edge cases
Error detection	Verification helpers stay consistent with benchmark tolerances

cmake --build build --target test_sgemm
ctest --test-dir build

# or
make test

GPU Architecture Reference

GPU Family	Architecture	Compute Capability	Build Flag
Tesla V100	Volta	sm_70	`GPU_ARCH=sm_70`
RTX 2080	Turing	sm_75	`GPU_ARCH=sm_75`
RTX 3090 / A100	Ampere	sm_80 / sm_86	`GPU_ARCH=sm_86`
RTX 4090 / L40	Ada Lovelace	sm_89	`GPU_ARCH=sm_89`
H100	Hopper	sm_90	`GPU_ARCH=sm_90`

Engineering Quality

Build: CMake 3.18+ is the primary build system; Makefile remains available for quick local use
Code style: clang-format enforced via CI
CI: GitHub Actions runs format checks and a containerized CUDA compile-only build; GPU runtime tests are still local / dedicated-runner only
Testing: Google Test verification against cuBLAS, including Tensor Core fallback and edge-size coverage

References

CUDA C++ Programming Guide
How to Optimize a CUDA Matmul Kernel — Simon Boehm
CUTLASS — NVIDIA's high-performance GEMM library
cuBLAS Documentation
Roofline Model

License

MIT License

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
.kiro/specs/sgemm-optimization		.kiro/specs/sgemm-optimization
.vscode		.vscode
build		build
changelog		changelog
src		src
tests		tests
.editorconfig		.editorconfig
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
README.zh-CN.md		README.zh-CN.md
_config.yml		_config.yml
index.md		index.md
roofline_data_1024.csv		roofline_data_1024.csv
roofline_data_2048.csv		roofline_data_2048.csv
roofline_data_4096.csv		roofline_data_4096.csv
roofline_data_512.csv		roofline_data_512.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SGEMM Optimization: From Naive to Tensor Core

Performance

Optimization Roadmap

Build & Run

Project Structure

Testing

GPU Architecture Reference

Engineering Quality

References

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

SGEMM Optimization: From Naive to Tensor Core

Performance

Optimization Roadmap

Build & Run

Project Structure

Testing

GPU Architecture Reference

Engineering Quality

References

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages