A heterogeneous Software-in-the-Loop (SIL) execution engine that resolves the HFT trilemma — ultra-low latency, deterministic risk management, and deep learning inference — through compiler-driven fusion on commodity hardware.
SHAURYA (hft.shaurya) is an ultra-low latency, LLVM-optimized C++20 High-Frequency Trading (HFT) framework that bridges Python-based AI model development with deterministic bare-metal execution performance. It is a heterogeneous Software-in-the-Loop (SIL) system, meaning the entire pipeline — from live market data ingestion to order routing — is treated as a single unified computational graph and optimized statically at compile time via LLVM/Clang's Link-Time Optimization (LTO) and AVX2 SIMD vectorization.
Unlike traditional approaches that force a choice between FPGA rigidity and software jitter, SHAURYA occupies a unique middle ground: the flexibility and iteration speed of high-level language development (Python/Keras), fused with the determinism and throughput of bare-metal C++ execution.
Designed for:
- 📈 Quantitative Researchers
- 🏢 Proprietary Trading Engineers
- ⚙️ Systems & Compiler Enthusiasts
- 🎓 HPC & Finance Students
SHAURYA's execution pipeline is built on five tightly integrated layers, each engineered to eliminate non-determinism at the specific abstraction level it targets:
Compile-time optimization stack:
Source C++20
└─ LLVM/Clang
├─ -flto → Link-Time Optimization (whole-program inlining, DCE)
├─ -march=native → AVX2/AVX-512 SIMD vectorization
├─ -O3 → Aggressive loop unrolling, constant folding
└─ -ffast-math → Fused multiply-add, relaxed IEEE754 for throughput
Layer mapping:
| Layer | Language | Role |
|---|---|---|
| Data Gateway | Python (async) | Live FIX feed ingestion, GIL-isolated |
| Execution Core | C++20 (bare-metal) | Zero-copy parsing, AI inference, risk validation |
| Risk Firewall | RTL-in-C++ (Verilog-like) | Branchless hardware-grade safety gates |
| Compiler Toolchain | LLVM/Clang + lld | LTO, SIMD fusion, cross-module inlining |
| Telemetry | Python Tkinter | Detached process, no critical path impact |
Benchmarked over a 50-stock basket (1,000 ticks) across two deployment environments:
Bare-Metal (Windows 11, Intel i7-12700K @ 4.9 GHz, 32 GB DDR4-3600):
| Metric | Value |
|---|---|
| Minimum End-to-End Latency | 1.75 µs |
| Mean Pipeline Latency | 4.22 µs |
| 90th Percentile | 6.3 µs |
| 99th Percentile | 40.9 µs |
| Peak (OS DPC spike) | 127 µs |
| Speedup vs. Naïve Baseline | 81x |
AWS EC2 (c6i.large, Ubuntu 22.04, Nitro Hypervisor):
| Metric | Value |
|---|---|
| Minimum Latency | 2.09 µs |
| Mean Latency | 36.77 µs |
| 90th Percentile | 52.3 µs |
| 99th Percentile | 118 µs |
| Hypervisor Tax (socket stage) | ~31.6 µs |
Pipeline stage breakdown (bare-metal vs. cloud):
| Stage | Bare Metal | AWS |
|---|---|---|
| Socket receive | 0.8 µs | 32.4 µs |
| FIX parsing + AI inference | 3.1 µs | 3.8 µs |
| Risk validation | 0.3 µs | 0.4 µs |
The socket receive stage accounts for >97% of the hypervisor overhead — the C++ execution core itself scales near-identically between environments. Replacing static AI compilation with TensorFlow's dynamic runtime alone introduces a roughly 20× latency penalty (see the ablation study below).
Traditional FIX parsers construct std::string objects and call std::stof for numeric conversion — both involve heap allocation with non-deterministic latency due to memory fragmentation and locale-dependent formatting. SHAURYA's parser operates directly on the raw socket receive buffer using:
- Strict pointer arithmetic to identify FIX tag delimiters (SOH, `0x01`) without copying
- Branchless finite state machine (`WAITING_FOR_TAG` → `READING_TAG` → `READING_VALUE`) with states chosen to maximize branch predictor hit rate
- A custom `fast_atof` implementation — an ASCII-to-float converter using pre-computed power-of-ten divisors and integer arithmetic, bypassing locale lookups and exception paths entirely
- POD struct output with `alignas(64)` cache-line alignment for immediate L1 access by the downstream ring buffer
Total parsing time for a 200-byte FIX message: < 500 ns on bare metal.
The inter-thread data transport uses a Single-Producer Single-Consumer (SPSC) lock-free ring buffer — a pre-allocated fixed-size circular array that never allocates memory during operation.
Key micro-architectural details:
- `std::memory_order_release` on producer commit index writes — establishes a happens-before relationship without sequential-consistency overhead
- `std::memory_order_acquire` on consumer commit index reads — pairs with release semantics to synchronize without a full memory fence
- 56-byte padding between producer and consumer indices (or `alignas(64)` placement in separate structs) — eliminates false sharing, the phenomenon where independent cache-line writes by different CPU cores trigger MESI protocol invalidations, causing expensive L1 cache flushes on the peer thread
- Bitwise wrap-around (`index & (capacity - 1)` where capacity is a power of two) — avoids modulo division in the hot loop
Throughput: > 50 million messages/second on a single core. Push/pop latency: < 20 ns when buffer is not full.
Conventional inference frameworks (TensorFlow, PyTorch) use dynamic computational graphs, heap-allocated tensor objects, and garbage-collected runtimes — all sources of non-deterministic latency on the order of tens to hundreds of microseconds, making them incompatible with any sub-100 µs HFT pipeline.
SHAURYA's approach:
1. Offline training (Python/Keras): A feed-forward neural network (input: 20 lagged mid-prices → hidden: 128 ReLU → hidden: 64 ReLU → output: buy/sell/hold signal) is trained offline with TensorFlow. Architecture is intentionally overparameterized to stress-test the inference engine.
2. Static export via frugally-deep:
The trained .h5 model is converted to fdeep_model.json — a human-readable JSON file containing the full network topology, weight matrices (as flat float arrays), and bias vectors, with zero Python runtime dependencies.
3. Header-only C++ inference (Eigen + frugally-deep): At engine startup, the JSON is parsed once by the frugally-deep header-only library, which pre-allocates all tensors. The computational graph is then mapped to the Eigen C++ linear algebra library, which uses expression templates for lazy evaluation — avoiding temporary allocations — and emits AVX2-vectorized BLAS kernels for matrix multiplications.
4. Compile-time optimization by LLVM: Since the network architecture is fully known at compile time, LLVM/Clang performs:
- Loop unrolling across all layer computations
- Cross-translation-unit inlining (via LTO) of all Eigen expression templates
- Branchless ReLU via bitwise masking (e.g. `x & -(x > 0)` on integer lanes, or an equivalent sign-based mask for floats) — no branch misprediction penalty
- FMA (Fused Multiply-Add) fusion — bias addition merged into the matrix multiply kernel to minimize memory traffic
AI inference latency: < 1.2 µs on bare metal. Zero Python GIL. Zero TensorFlow runtime. Zero garbage collection.
Regulatory frameworks (MiFID II, SEC Rule 15c3-5) mandate unconditional pre-trade risk controls: kill switches, price clamps, fat-finger protection, and rate limiters. Hardware FPGA implementations guarantee nanosecond-level determinism but are inflexible and expensive to modify. Traditional software if-else chains are vulnerable to branch misprediction — when the CPU speculatively executes the wrong branch and must flush and refill the pipeline (10–20 cycle penalty per misprediction).
SHAURYA implements a Register-Transfer Level (RTL)-in-software firewall: all risk rules are expressed as branchless bitwise operations and arithmetic masking, directly analogous to combinational logic gates in hardware.
Example evaluations:

```cpp
// Price clamp (no branch)
uint32_t valid = (price < max_price) & (price > min_price);

// Kill switch (no branch)
allow = allow & !kill_signal;

// Fat-finger protection (no branch)
uint32_t notional_ok = (quantity * price) < max_notional;
```

The entire risk evaluation tree — covering price clamps, rate limits, position limits, and kill switches — is transpiled at compile time from a Verilog-like hardware description into a single, cache-aligned C++ evaluation sequence that resides in the L1 instruction cache throughout execution. Deterministic risk validation latency: < 0.5 µs per evaluation, with zero variance from branch misprediction.
SHAURYA exclusively targets LLVM/Clang (not GCC) for the following reasons:
- Intermediate Representation (IR): LLVM operates on a well-defined SSA-form IR, enabling optimization passes that are impossible at the source level
- Link-Time Optimization (LTO) via libLTO: Bitcode from all translation units is merged at link time, allowing the compiler to perform whole-program analysis — dead code elimination, cross-module inlining, and loop transformations across what would otherwise be opaque ABI boundaries
- `-march=native`: generates code targeting the exact CPU microarchitecture of the build machine, enabling AVX2/AVX-512 packed floating-point instructions
- `-ffast-math`: enables reassociation, contraction (FMA), and reciprocal approximations — safe for HFT workloads where strict IEEE 754 compliance is not required
- `lld` linker: LLVM's native linker, significantly faster than GNU `ld` for LTO workloads
SHAURYA targets both Windows 11 bare-metal and Linux (AWS EC2 Ubuntu) from a single codebase. All OS-specific syscalls are wrapped in preprocessor directives:
| Feature | Windows | Linux |
|---|---|---|
| Sockets | WinSock2 / `SOCKET` | POSIX / `int fd` |
| Thread affinity | `SetThreadAffinityMask` | `pthread_setaffinity_np` |
| Real-time scheduling | `REALTIME_PRIORITY_CLASS` | `SCHED_FIFO`, priority 99 |
| High-res timer | `QueryPerformanceCounter` | `clock_gettime(CLOCK_MONOTONIC)` |
| Socket close | `closesocket()` | `close()` |
The dashboard runs as a completely separate OS process, communicating with the C++ engine via a named pipe or local UDP socket. This decoupling is critical — GUI rendering can take several milliseconds and must never block the execution thread.
Features:
- Real-time latency histograms and 99th percentile tail display
- Risk violation counters and firewall gate status
- Headless mode for cloud deployments: writes PNG chart output, compatible with X11 forwarding via `ssh -X`
- Per-run metrics file (`SHAURYA_Metrics.txt`) with per-tick latency records for post-hoc statistical analysis
```sh
pip install hft.shaurya==0.3.0
```

Requirements:

- LLVM/Clang (≥ 14 recommended)
- `lld` linker
- C++17/C++20 compatible compiler
- Eigen (header-only, bundled)
- frugally-deep (header-only, bundled)
```sh
# Windows
clang++ -O3 -flto -march=native -ffast-math -std=c++20 \
        -fuse-ld=lld src/*.cpp -o bin/Shaurya.exe

# Linux
clang++ -O3 -flto -march=native -ffast-math -std=c++20 \
        -fuse-ld=lld src/*.cpp -o bin/Shaurya -lpthread -lnuma
```

```sh
python gateway.py
# or
python -m hft.shaurya.gateway
```

The Python layer connects to a live data feed (e.g., Yahoo Finance), formats ticks as FIX protocol messages, and streams them to `127.0.0.1:5000` via TCP. It runs in its own OS process — its GIL, GC pauses, and interpreter overhead are entirely isolated from the C++ measurement perimeter.
```sh
bin/Shaurya.exe   # Windows
./bin/Shaurya     # Linux
```

Startup sequence:

- Parse `fdeep_model.json` → pre-allocate all neural network tensors
- Warm L1/L2 instruction cache with inference dry runs
- Initialize SPSC ring buffer with `alignas(64)` allocation
- Pin execution thread to an isolated CPU core
- Begin live tick processing loop

```sh
python benchmark_50.py --ticks 20
```

Injects 20 FIX messages per stock across a 50-stock basket. Outputs per-stock average, min, max, and 90th/99th-percentile latency, plus aggregate statistics, to both console and CSV. Fully automated — no human interaction required during the run.
After Ctrl+C shutdown, SHAURYA_Metrics.txt contains:
- Per-tick raw latency (µs)
- Average, 99th percentile, and tail latency
- Message throughput statistics
SHAURYA's performance is fundamentally governed by cache residency. Key decisions:

- `alignas(64)` on all hot data structures — ensures each struct occupies exactly one or more full cache lines, preventing false sharing and structure-internal cache-line splits
- NUMA-aware allocation (Linux) — `numa_node_of_cpu()` is used to allocate the ring buffer and neural network weights from the memory node local to the execution core, avoiding 50–100 ns remote memory access penalties on multi-socket servers
- Software prefetching — `_mm_prefetch` (Windows) / `__builtin_prefetch` (Linux) issues prefetch hints ~64 bytes ahead of the current neural network weight access, hiding DRAM latency by overlapping memory fetches with computation. The difference between cold and warm inference is up to 8 µs, making cache warming a mandatory startup step
- Two-thread model: network receive thread (producer) and execution thread (consumer), each pinned to a dedicated logical core sharing the same L2/L3 but not hyper-thread siblings
- Real-time scheduling: `SCHED_FIFO` priority 99 (Linux), `REALTIME_PRIORITY_CLASS` + `THREAD_PRIORITY_TIME_CRITICAL` (Windows) — reduces maximum latency spikes by ~40% compared to default scheduling
- Socket configuration: `TCP_NODELAY` (disables the Nagle algorithm), `SO_RCVBUF` set to 8 MB — prevents TCP batching from introducing artificial latency on the loopback interface
Keras (.h5)
└─ frugally-deep converter
└─ fdeep_model.json (human-readable, ~500 KB for 128-64-1 net)
└─ frugally-deep header-only parser (one-time at startup)
└─ Eigen computational graph
└─ LLVM LTO + AVX2 → optimized binary
→ < 1.2 µs inference, no GC, no Python
Weight updates (without architectural changes) only require replacing fdeep_model.json — no recompilation. Architectural changes require recompilation, which completes in seconds.
Disabling individual optimizations on the same bare-metal hardware:
| Configuration Removed | Avg Latency | Penalty |
|---|---|---|
| Full SHAURYA (baseline) | 4.22 µs | — |
| Static AI → TensorFlow | 89.6 µs | +2,023% |
| Zero-copy parser → std::string + stof | 28.4 µs | +573% |
| Lock-free SPSC → mutex queue | 12.8 µs | +203% |
| Branchless risk → if-statements | 6.1 µs | +45% |
Static AI compilation is the single largest contributor — TensorFlow's dynamic memory allocator introduces GC pauses that reach 4.1 ms worst-case.
```
-O3            # Aggressive optimization (loop unrolling, constant propagation)
-flto          # Link-Time Optimization (whole-program analysis)
-march=native  # Target CPU's exact instruction set (AVX2, FMA, etc.)
-ffast-math    # Enable FMA fusion, reciprocal approximations
-fuse-ld=lld   # Use LLVM's lld linker for LTO compatibility
```

OS tuning for low-latency runs:

```sh
# Disable CPU power management
# BIOS: Disable C-states, SpeedStep, Turbo (or lock to max frequency)

# Windows
# Power plan: "Ultimate Performance"
# Set processor affinity to isolate cores 2-3 from OS interrupts

# Linux
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sudo systemctl stop irqbalance
# Add to kernel boot params:
# isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3
```

Risk gate configuration:

```cpp
RiskGate fatFinger(max_notional = 1'000'000);
RiskGate priceClamp(max_slippage_bps = 50);
RiskGate rateLimit(max_orders_per_second = 1000);
// Kill switch is always active and unconditional
```

Bare-metal Windows/Linux workstation:
- Sub-5 µs average latency achievable
- Minimum latency floor: ~1.75 µs (cache-primed, no OS interrupts)
- Suitable for: tick-by-tick strategies, momentum, market making research
AWS EC2 shared instance:
- Average ~36 µs, 99th percentile ~118 µs
- Hypervisor imposes a ~31.6 µs per-packet tax at the socket receive stage
- Suitable for: strategies with action intervals ≥ 100 ms, cloud-based backtesting
AWS dedicated host / bare-metal cloud:
- Hypervisor tax reduces to ~8–10 µs
- Suitable for: latency-sensitive cloud strategies without co-location budget
1. OS Scheduler Non-Determinism: Neither Windows 11 nor standard Linux is a real-time operating system. Kernel timer interrupts and Deferred Procedure Calls (DPCs, Windows) can cause latency spikes of 127 µs (bare-metal) to 692 µs (AWS). Eliminating them requires PREEMPT_RT Linux, VxWorks, or bare-metal hypervisor deployment.
2. Kernel-Space Networking: SHAURYA uses standard POSIX/WinSock2 sockets. Each `recv()` call incurs a user-kernel context switch (~300–500 ns) plus a data copy from kernel to user space. Kernel bypass via DPDK would reduce receive latency to 300–500 ns but requires dedicated CPU cores and custom driver setup.
3. Static Model Architecture: The neural network architecture is fixed at compile time. Weight updates require only replacing the JSON file; layer changes require recompilation. JIT compilation via LLVM ORC APIs is a future direction.
4. Single-Threaded Pipeline: The current execution model pins the entire hot path to a single core. Parallelizing FIX parsing and inference across cores is future work.
5. No Real Exchange Connectivity: The current release targets research and simulation. Production connectivity (co-location, FIX drop-copy, exchange-specific protocols) is explicitly out of scope.
- DPDK kernel-bypass networking integration (`#ifdef USE_DPDK`)
- Native FPGA backend — synthesizable Verilog output from the RTL firewall C++ source
- PREEMPT_RT Linux ultra-low-latency build
- GPU co-processor support (CUDA) for larger model architectures (LSTM, CNN-LSTM)
- JIT model reloading via LLVM ORC APIs
- Advanced order routing simulator with realistic fill modeling
- Hardware-in-the-Loop (HIL) extension: real FPGA risk gate via PCIe
High latency spikes:
- Verify CPU frequency scaling is disabled (`scaling_governor` set to `performance`)
- Ensure LTO is enabled (`-flto` + `lld` linker)
- Confirm AVX2 availability: `grep avx2 /proc/cpuinfo`
- Disable Windows Defender real-time scanning and background services during benchmarks
Model not loading:
- Validate `fdeep_model.json` with frugally-deep's built-in checker
- Ensure the JSON path matches the compile-time configured default
- Check weight precision — frugally-deep expects `float32` by default
Build failures:
- Confirm Clang version: `clang++ --version` (≥ 14 recommended)
- Ensure `lld` is installed: `ld.lld --version`
- On Windows, verify LLVM is on `PATH` and not shadowed by the MSVC toolchain
- Rebuild with the `-v` flag for verbose compilation diagnostics
SHAURYA is intended exclusively for:
- Academic research
- Systems engineering education
- HPC and compiler experimentation
It is not financial advice and not production-certified trading infrastructure. Users assume full responsibility for trading decisions, regulatory compliance, and capital risk. The RTL firewall has not undergone formal verification and should not be relied upon as the sole risk control in a live trading environment.
