Skip to content

harshitsinghcode/shaurya

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

26 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

⚔️ SHAURYA v0.3.2: Scalable High-Frequency Architecture for Ultra-Low Response Yield Access

A heterogeneous Software-in-the-Loop (SIL) execution engine that resolves the HFT trilemma — ultra-low latency, deterministic risk management, and deep learning inference — through compiler-driven fusion on commodity hardware.


Language Standard Python Compiler LTO SIMD Vectorization Min Latency Avg Latency P99 Latency AI Inference Risk Validation Throughput Speedup Architecture Determinism Lock-Free SPSC False Sharing Zero-Copy Branchless RTL AI Framework Inference Engine Model Format No GIL No GC Platform Windows Linux Cloud Cross-Platform Real-Time Priority X11 PyPI PyPI Downloads GitHub Stars GitHub Release Code Size

🧠 Introduction

SHAURYA (hft.shaurya) is an ultra-low latency, LLVM-optimized C++20 High-Frequency Trading (HFT) framework that bridges Python-based AI model development with deterministic bare-metal execution performance. It is a heterogeneous Software-in-the-Loop (SIL) system, meaning the entire pipeline — from live market data ingestion to order routing — is treated as a single unified computational graph and optimized statically at compile time via LLVM/Clang's Link-Time Optimization (LTO) and AVX2 SIMD vectorization.

Unlike traditional approaches that force a choice between FPGA rigidity and software jitter, SHAURYA occupies a unique middle ground: the flexibility and iteration speed of high-level language development (Python/Keras), fused with the determinism and throughput of bare-metal C++ execution.

Designed for:

  • 📈 Quantitative Researchers
  • 🏢 Proprietary Trading Engineers
  • ⚙️ Systems & Compiler Enthusiasts
  • 🎓 HPC & Finance Students

🏗️ Architecture Overview

SHAURYA's execution pipeline is built on five tightly integrated layers, each engineered to eliminate non-determinism at the specific abstraction level it targets:

Compile-time optimization stack:

Source C++20
    └─ LLVM/Clang
        ├─ -flto         → Link-Time Optimization (whole-program inlining, DCE)
        ├─ -march=native → AVX2/AVX-512 SIMD vectorization
        ├─ -O3           → Aggressive loop unrolling, constant folding
        └─ -ffast-math   → Fused multiply-add, relaxed IEEE754 for throughput

Layer mapping:

Layer Language Role
Data Gateway Python (async) Live FIX feed ingestion, GIL-isolated
Execution Core C++20 (bare-metal) Zero-copy parsing, AI inference, risk validation
Risk Firewall RTL-in-C++ (Verilog-like) Branchless hardware-grade safety gates
Compiler Toolchain LLVM/Clang + lld LTO, SIMD fusion, cross-module inlining
Telemetry Python Tkinter Detached process, no critical path impact

⚡ Performance

Benchmarked over a 50-stock basket (1,000 ticks) across two deployment environments:

Bare-Metal (Windows 11, Intel i7-12700K @ 4.9 GHz, 32 GB DDR4-3600):

Metric Value
Minimum End-to-End Latency 1.75 µs
Mean Pipeline Latency 4.22 µs
90th Percentile 6.3 µs
99th Percentile 40.9 µs
Peak (OS DPC spike) 127 µs
Speedup vs. Naïve Baseline 81x

AWS EC2 (c6i.large, Ubuntu 22.04, Nitro Hypervisor):

Metric Value
Minimum Latency 2.09 µs
Mean Latency 36.77 µs
90th Percentile 52.3 µs
99th Percentile 118 µs
Hypervisor Tax (socket stage) ~31.6 µs

Pipeline stage breakdown (bare-metal vs. cloud):

Stage Bare Metal AWS
Socket receive 0.8 µs 32.4 µs
FIX parsing + AI inference 3.1 µs 3.8 µs
Risk validation 0.3 µs 0.4 µs

The socket receive stage accounts for >97% of the hypervisor overhead — the C++ execution core itself scales near-identically between environments. Static AI compilation alone provides a 20× latency penalty when replaced with TensorFlow's dynamic runtime.


🚀 Key Features

✅ Zero-Copy FIX Parser

Traditional FIX parsers construct std::string objects and call std::stof for numeric conversion — both involve heap allocation with non-deterministic latency due to memory fragmentation and locale-dependent formatting. SHAURYA's parser operates directly on the raw socket receive buffer using:

  • Strict pointer arithmetic to identify FIX tag delimiters (SOH, 0x01) without copying
  • Branchless finite state machine (WAITING_FOR_TAG → READING_TAG → READING_VALUE) with states chosen to maximize branch predictor hit rate
  • fast_atof implementation — a custom ASCII-to-float converter using pre-computed power-of-ten divisors and integer arithmetic, bypassing locale lookups and exception paths entirely
  • POD struct output with alignas(64) cache-line alignment for immediate L1 access by the downstream ring buffer

Total parsing time for a 200-byte FIX message: < 500 ns on bare metal.

✅ Lock-Free SPSC Ring Buffer

The inter-thread data transport uses a Single-Producer Single-Consumer (SPSC) lock-free ring buffer — a pre-allocated fixed-size circular array that never allocates memory during operation.

Key micro-architectural details:

  • std::memory_order_release on producer commit index writes — establishes a happens-before relationship without sequential consistency overhead
  • std::memory_order_acquire on consumer commit index reads — pairs with release semantics to synchronize without a full memory fence
  • 56-byte padding between producer and consumer indices (or alignas(64) placement in separate structs) — eliminates false sharing, the phenomenon where independent cache-line writes by different CPU cores trigger MESI protocol invalidations, causing expensive L1 cache flushes on the peer thread
  • Bitwise wrap-around (index & (capacity - 1) where capacity is a power of two) — avoids modulo division in the hot loop

Throughput: > 50 million messages/second on a single core. Push/pop latency: < 20 ns when buffer is not full.

✅ Static Deep Learning Alpha Engine

Conventional inference frameworks (TensorFlow, PyTorch) use dynamic computational graphs, heap-allocated tensor objects, and garbage-collected runtimes — all sources of non-deterministic latency on the order of tens to hundreds of microseconds, making them incompatible with any sub-100 µs HFT pipeline.

SHAURYA's approach:

1. Offline training (Python/Keras): A feed-forward neural network (input: 20 lagged mid-prices → hidden: 128 ReLU → hidden: 64 ReLU → output: buy/sell/hold signal) is trained offline with TensorFlow. Architecture is intentionally overparameterized to stress-test the inference engine.

2. Static export via frugally-deep: The trained .h5 model is converted to fdeep_model.json — a human-readable JSON file containing the full network topology, weight matrices (as flat float arrays), and bias vectors, with zero Python runtime dependencies.

3. Header-only C++ inference (Eigen + frugally-deep): At engine startup, the JSON is parsed once by the frugally-deep header-only library, which pre-allocates all tensors. The computational graph is then mapped to the Eigen C++ linear algebra library, which uses expression templates for lazy evaluation — avoiding temporary allocations — and emits AVX2-vectorized BLAS kernels for matrix multiplications.

4. Compile-time optimization by LLVM: Since the network architecture is fully known at compile time, LLVM/Clang performs:

  • Loop unrolling across all layer computations
  • Cross-translation-unit inlining (via LTO) of all Eigen expression templates
  • Branchless ReLU via bitwise masking: x & (x > 0) — no branch misprediction penalty
  • FMA (Fused Multiply-Add) fusion — bias addition merged into the matrix multiply kernel to minimize memory traffic

AI inference latency: < 1.2 µs on bare metal. Zero Python GIL. Zero TensorFlow runtime. Zero garbage collection.

✅ Simulated FPGA RTL Risk Firewall

Regulatory frameworks (MiFID II, SEC Rule 15c3-5) mandate unconditional pre-trade risk controls: kill switches, price clamps, fat-finger protection, and rate limiters. Hardware FPGA implementations guarantee nanosecond-level determinism but are inflexible and expensive to modify. Traditional software if-else chains are vulnerable to branch misprediction — when the CPU speculatively executes the wrong branch and must flush and refill the pipeline (10–20 cycle penalty per misprediction).

SHAURYA implements a Register-Transfer Level (RTL)-in-software firewall: all risk rules are expressed as branchless bitwise operations and arithmetic masking, directly analogous to combinational logic gates in hardware.

Example evaluations:

// Price clamp (no branch)
uint32_t valid = (price < max_price) & (price > min_price);

// Kill switch (no branch)
allow = allow & (!kill_signal);

// Fat-finger protection
uint32_t notional_ok = (quantity * price) < max_notional;

The entire risk evaluation tree — covering price clamps, rate limits, position limits, and kill switches — is transpiled from Verilog-like hardware description at compile time into a single, cache-aligned C++ evaluation sequence. It resides in the L1 instruction cache throughout execution. Deterministic risk validation latency: < 0.5 µs per evaluation, with zero variance from branch misprediction.

✅ LLVM/Clang Compiler Infrastructure

SHAURYA exclusively targets LLVM/Clang (not GCC) for the following reasons:

  • Intermediate Representation (IR): LLVM operates on a well-defined SSA-form IR, enabling optimization passes that are impossible at the source level
  • Link-Time Optimization (LTO) via libLTO: Bitcode from all translation units is merged at link time, allowing the compiler to perform whole-program analysis — dead code elimination, cross-module inlining, and loop transformations across what would otherwise be opaque ABI boundaries
  • -march=native: Generates code targeting the exact CPU microarchitecture of the build machine, enabling AVX2/AVX-512 packed floating-point instructions
  • -ffast-math: Enables reassociation, contraction (FMA), and reciprocal approximations — safe for HFT workloads where IEEE 754 strict compliance is not required
  • lld linker: LLVM's native linker, significantly faster than GNU ld for LTO workloads

✅ Cross-Platform Deployment

SHAURYA targets both Windows 11 bare-metal and Linux (AWS EC2 Ubuntu) from a single codebase. All OS-specific syscalls are wrapped in preprocessor directives:

Feature Windows Linux
Sockets WinSock2 / SOCKET POSIX / int fd
Thread affinity SetThreadAffinityMask pthread_setaffinity_np
Real-time scheduling REALTIME_PRIORITY_CLASS SCHED_FIFO, priority 99
High-res timer QueryPerformanceCounter clock_gettime(CLOCK_MONOTONIC)
Socket close closesocket() close()

✅ Telemetry & Observation Plane

The dashboard runs as a completely separate OS process, communicating with the C++ engine via a named pipe or local UDP socket. This decoupling is critical — GUI rendering can take several milliseconds and must never block the execution thread.

Features:

  • Real-time latency histograms and 99th percentile tail display
  • Risk violation counters and firewall gate status
  • Headless mode for cloud deployments: writes PNG chart output, compatible with X11 forwarding via ssh -X
  • Per-run metrics file (SHAURYA_Metrics.txt) with per-tick latency records for post-hoc statistical analysis

📦 Installation

Python Gateway

pip install hft.shaurya==0.3.0

C++ Core — Prerequisites

  • LLVM/Clang (≥ 14 recommended)
  • lld linker
  • C++17/C++20 compatible compiler
  • Eigen (header-only, bundled)
  • frugally-deep (header-only, bundled)

Build

# Windows
clang++ -O3 -flto -march=native -ffast-math -std=c++20 \
    -fuse-ld=lld src/*.cpp -o bin/Shaurya.exe

# Linux
clang++ -O3 -flto -march=native -ffast-math -std=c++20 \
    -fuse-ld=lld src/*.cpp -o bin/Shaurya -lpthread -lnuma

🔨 Usage

Step 1 — Start Market Gateway

python gateway.py
# or
python -m hft.shaurya.gateway

The Python layer connects to a live data feed (e.g., Yahoo Finance), formats ticks as FIX protocol messages, and streams them to 127.0.0.1:5000 via TCP. It runs in its own OS process — its GIL, GC pauses, and interpreter overhead are entirely isolated from the C++ measurement perimeter.

Step 2 — Launch LLVM C++ Core

bin/Shaurya.exe        # Windows
./bin/Shaurya          # Linux

Startup sequence:

  1. Parse fdeep_model.json → pre-allocate all neural network tensors
  2. Warm L1/L2 instruction cache with inference dry runs
  3. Initialize SPSC ring buffer with alignas(64) allocation
  4. Pin execution thread to isolated CPU core
  5. Begin live tick processing loop

Step 3 — Benchmark (50 Stocks)

python benchmark_50.py --ticks 20

Injects 20 FIX messages per stock across a 50-stock basket. Outputs per-stock average, min, max, 90th/99th percentile latency, and aggregate statistics to both console and CSV. Fully automated — no human interaction required during the run.

Step 4 — Review Metrics

After Ctrl+C shutdown, SHAURYA_Metrics.txt contains:

  • Per-tick raw latency (µs)
  • Average, 99th percentile, and tail latency
  • Message throughput statistics

🔬 Technical Deep Dive

Memory Hierarchy & Cache Management

SHAURYA's performance is fundamentally governed by cache residency. Key decisions:

  • alignas(64) on all hot data structures — ensures each struct occupies exactly one or more full cache lines, preventing false sharing and structure-internal cache-line splits
  • NUMA-aware allocation (Linux) — numa_node_of_cpu() is used to allocate ring buffer and neural network weights from the memory node local to the execution core, avoiding 50–100 ns remote memory access penalties on multi-socket servers
  • Software prefetching_mm_prefetch (Windows) / __builtin_prefetch (Linux) issues prefetch hints ~64 bytes ahead of the current neural network weight access, hiding DRAM latency by overlapping memory fetches with computation. The difference between cold and warm inference is up to 8 µs — making cache warming a mandatory startup step

Thread Model & OS Interaction

  • Two-thread model: Network receive thread (producer) and execution thread (consumer), each pinned to a dedicated logical core sharing the same L2/L3 but not hyper-threads
  • Real-time scheduling: SCHED_FIFO priority 99 (Linux), REALTIME_PRIORITY_CLASS + THREAD_PRIORITY_TIME_CRITICAL (Windows) — reduces maximum latency spikes by ~40% compared to POSIX default scheduling
  • Socket configuration: TCP_NODELAY (disables Nagle algorithm), SO_RCVBUF set to 8 MB — prevents TCP batching from introducing artificial latency on the loopback interface

AI Model Static Compilation Pipeline

Keras (.h5)
    └─ frugally-deep converter
        └─ fdeep_model.json  (human-readable, ~500 KB for 128-64-1 net)
            └─ frugally-deep header-only parser (one-time at startup)
                └─ Eigen computational graph
                    └─ LLVM LTO + AVX2 → optimized binary
                        → < 1.2 µs inference, no GC, no Python

Weight updates (without architectural changes) only require replacing fdeep_model.json — no recompilation. Architectural changes require recompilation, which completes in seconds.

Ablation Study Results

Disabling individual optimizations on the same bare-metal hardware:

Configuration Removed Avg Latency Penalty
Full SHAURYA (baseline) 4.22 µs
Static AI → TensorFlow 89.6 µs +2,023%
Zero-copy parser → std::string + stof 28.4 µs +573%
Lock-free SPSC → mutex queue 12.8 µs +203%
Branchless risk → if-statements 6.1 µs +45%

Static AI compilation is the single largest contributor — TensorFlow's dynamic memory allocator introduces GC pauses that reach 4.1 ms worst-case.


⚙️ Configuration

Compiler Flags

-O3           # Aggressive optimization (loop unrolling, constant propagation)
-flto         # Link-Time Optimization (whole-program analysis)
-march=native # Target CPU's exact instruction set (AVX2, FMA, etc.)
-ffast-math   # Enable FMA fusion, reciprocal approximations
-fuse-ld=lld  # Use LLVM's lld linker for LTO compatibility

System Tuning (Bare-Metal)

# Disable CPU power management
# BIOS: Disable C-states, SpeedStep, Turbo (or lock to max frequency)

# Windows
# Power plan: "Ultimate Performance"
# Set processor affinity to isolate cores 2-3 from OS interrupts

# Linux
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
sudo systemctl stop irqbalance
# Add to kernel boot params:
# isolcpus=2,3 nohz_full=2,3 rcu_nocbs=2,3

Risk Gate Configuration

RiskGate fatFinger(max_notional = 1'000'000);
RiskGate priceClamp(max_slippage_bps = 50);
RiskGate rateLimit(max_orders_per_second = 1000);
// Kill switch is always active and unconditional

🎯 Deployment Scenarios

Bare-metal Windows/Linux workstation:

  • Sub-5 µs average latency achievable
  • Minimum latency floor: ~1.75 µs (cache-primed, no OS interrupts)
  • Suitable for: tick-by-tick strategies, momentum, market making research

AWS EC2 shared instance:

  • Average ~36 µs, 99th percentile ~118 µs
  • Hypervisor imposes a ~31.6 µs per-packet tax at the socket receive stage
  • Suitable for: strategies with action intervals ≥ 100 ms, cloud-based backtesting

AWS dedicated host / bare-metal cloud:

  • Hypervisor tax reduces to ~8–10 µs
  • Suitable for: latency-sensitive cloud strategies without co-location budget

🚧 Limitations

1. OS Scheduler Non-Determinism Neither Windows 11 nor standard Linux are real-time operating systems. Kernel timer interrupts and Deferred Procedure Calls (DPCs, Windows) can cause latency spikes of 127 µs (bare-metal) to 692 µs (AWS). Elimination requires PREEMPT_RT Linux, VxWorks, or bare-metal hypervisor deployment.

2. Kernel-Space Networking SHAURYA uses standard POSIX/WinSock2 sockets. Each recv() call incurs a user-kernel context switch (~300–500 ns) plus a data copy from kernel to user space. Kernel bypass via DPDK would reduce receive latency to 300–500 ns but requires dedicated CPU cores and custom driver setup.

3. Static Model Architecture The neural network architecture is fixed at compile time. Weight updates require only replacing the JSON file; layer changes require recompilation. JIT compilation via LLVM ORC APIs is a future direction.

4. Single-Threaded Pipeline The current execution model pins the entire hot path to a single core. Parallelizing FIX parsing and inference across cores is future work.

5. No Real Exchange Connectivity The current release targets research and simulation. Production connectivity (co-location, FIX drop-copy, exchange-specific protocols) is explicitly out of scope.


🚀 Roadmap

  • DPDK kernel-bypass networking integration (#ifdef USE_DPDK)
  • Native FPGA backend — synthesizable Verilog output from RTL firewall C++ source
  • PREEMPT_RT Linux ultra-low-latency build
  • GPU co-processor support (CUDA) for larger model architectures (LSTM, CNN-LSTM)
  • JIT model reloading via LLVM ORC APIs
  • Advanced order routing simulator with realistic fill modeling
  • Hardware-in-the-Loop (HIL) extension: real FPGA risk gate via PCIe

🧪 Troubleshooting

High latency spikes:

  • Verify CPU frequency scaling is disabled (scaling_governor = performance)
  • Ensure LTO is enabled (-flto + lld linker)
  • Confirm AVX2 availability: grep avx2 /proc/cpuinfo
  • Disable Windows Defender real-time scanning and background services during benchmarks

Model not loading:

  • Validate fdeep_model.json with frugally-deep's built-in checker
  • Ensure the JSON path matches the compile-time configured default
  • Check weight precision — frugally-deep expects float32 by default

Build failures:

  • Confirm Clang version: clang++ --version (≥ 14 recommended)
  • Ensure lld is installed: ld.lld --version
  • On Windows, verify LLVM is on PATH and not shadowed by MSVC toolchain
  • Rebuild with -v flag for verbose compilation diagnostics

☢️ Disclaimer

SHAURYA is intended exclusively for:

  • Academic research
  • Systems engineering education
  • HPC and compiler experimentation

It is not financial advice and not production-certified trading infrastructure. Users assume full responsibility for trading decisions, regulatory compliance, and capital risk. The RTL firewall has not undergone formal verification and should not be relied upon as the sole risk control in a live trading environment.


🔗 Links

ad astra per aspera 🛩️

About

A compiler‑driven, LLVM‑optimized C++20 HFT engine with zero‑copy FIX parsing, lock‑free SPSC queues, static AVX2 deep learning inference, and a branchless RTL risk firewall – delivering 1.75 µs minimum latency (81× faster than naive) on commodity hardware.

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors