Shaurya is a high-frequency trading (HFT) market data feed handler engineered for sub-microsecond latency. By leveraging Zero-Copy parsing, Lock-Free concurrency, and Stack-based memory management, it bypasses the performance bottlenecks of standard software architectures to process financial data with deterministic speed.
Shaurya was benchmarked using high-resolution hardware timers (QueryPerformanceCounter).
| Implementation Approach | Average Latency | Min Latency | Why it's Slow/Fast? |
|---|---|---|---|
| Python Script | ~45.0 µs | ~30.0 µs | Interpreter overhead & Garbage Collection pauses. |
| Standard C++ (std::string) | ~5.0 µs | ~3.5 µs | Frequent Heap Allocations (malloc) & deep memory copying. |
| SHAURYA (Zero-Copy) | 1.88 µs* | 0.3 µs | Zero-Copy pointer arithmetic & Lock-Free queues. |
The Result: Shaurya achieves a minimum internal reaction time of 300 nanoseconds, roughly 100x lower than the Python script's ~30 µs minimum. On average latency (1.88 µs vs. ~45 µs) it is about 24x faster.
*Measured in Pure Mock Environment
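The minimum and average columns above come from timing each message individually and aggregating the samples. The sketch below shows one way such a harness could look. It is not the project's benchmark code: `LatencyStats` and `measure` are hypothetical names, and `std::chrono::steady_clock` is used as a portable stand-in for `QueryPerformanceCounter`.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical accumulator for per-message latency samples, in nanoseconds.
struct LatencyStats {
    std::uint64_t count;
    std::uint64_t min_ns;
    std::uint64_t max_ns;
    std::uint64_t total_ns;

    LatencyStats()
        : count(0),
          min_ns(std::numeric_limits<std::uint64_t>::max()),
          max_ns(0),
          total_ns(0) {}

    void record(std::uint64_t ns) {
        ++count;
        total_ns += ns;
        if (ns < min_ns) min_ns = ns;
        if (ns > max_ns) max_ns = ns;
    }

    // Mean of all recorded samples; 0 if nothing was recorded.
    double average_ns() const {
        return count ? static_cast<double>(total_ns) / static_cast<double>(count)
                     : 0.0;
    }
};

// Times `iterations` calls of work() one by one and records each duration.
// steady_clock is an assumption here; the README's numbers were taken with
// QueryPerformanceCounter.
template <typename Work>
LatencyStats measure(Work work, std::size_t iterations) {
    typedef std::chrono::steady_clock Clock;
    LatencyStats stats;
    for (std::size_t i = 0; i < iterations; ++i) {
        const Clock::time_point start = Clock::now();
        work();
        const Clock::time_point stop = Clock::now();
        stats.record(static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count()));
    }
    return stats;
}
```

Recording the minimum and maximum alongside the mean matters because a few slow outliers, for example from scheduler preemption, can raise the average while the best-case path stays unchanged.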
Shaurya was subjected to a 30-minute stress test aggregating live ticks from Binance, Coinbase, and Bitstamp simultaneously.
- Test Duration: 30 Minutes
- Total Messages: 21,862 (Live Volatility Bursts)
- Outcome: The engine successfully normalized fragmented liquidity streams in real time. Average latency increased under OS scheduler load (the cores were not isolated), but the minimum latency remained at 0.3 µs, indicating that the core parsing path stays just as fast during bursts of crypto market volatility.
Instead of copying network packets into new std::string objects (which typically triggers a heap allocation and a byte-for-byte copy), Shaurya uses a custom StringViewLite class. This creates a lightweight "view", essentially a pointer and a length, over the raw socket buffer, allowing the engine to parse prices without copying the payload.
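To make the idea concrete, here is a minimal sketch of a non-owning view and a parser that reads a price out of it. Only the class name `StringViewLite` comes from this README. Its members, the `field` and `parse_fixed` helpers, the comma-separated message layout, and the fixed-point price representation are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical non-owning view: a pointer and a length, no allocation.
// The underlying buffer must outlive the view.
class StringViewLite {
public:
    StringViewLite() : data_(nullptr), size_(0) {}
    StringViewLite(const char* data, std::size_t size)
        : data_(data), size_(size) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }
    bool empty() const { return size_ == 0; }

    // Sub-view clamped to the bounds of this view; copies no bytes.
    StringViewLite substr(std::size_t pos, std::size_t count) const {
        if (pos > size_) pos = size_;
        if (count > size_ - pos) count = size_ - pos;
        return StringViewLite(data_ + pos, count);
    }

    // Index of the first c at or after `from`, or size() if absent.
    std::size_t find(char c, std::size_t from = 0) const {
        for (std::size_t i = from; i < size_; ++i)
            if (data_[i] == c) return i;
        return size_;
    }

    bool equals(const char* s) const {
        return std::strlen(s) == size_ && std::memcmp(data_, s, size_) == 0;
    }

private:
    const char* data_;
    std::size_t size_;
};

// Returns the index-th delimiter-separated field as a view into msg,
// or an empty view if msg has too few fields.
StringViewLite field(StringViewLite msg, std::size_t index, char delim) {
    std::size_t start = 0;
    for (std::size_t i = 0; i < index; ++i) {
        const std::size_t next = msg.find(delim, start);
        if (next == msg.size()) return StringViewLite();
        start = next + 1;
    }
    const std::size_t end = msg.find(delim, start);
    return msg.substr(start, end - start);
}

// Parses an unsigned decimal such as "43250.75" into an integer scaled by
// 10^decimals (extra fractional digits are truncated). Returns false on
// empty or malformed input. Overflow is not checked in this sketch.
bool parse_fixed(StringViewLite s, int decimals, std::int64_t& out) {
    std::int64_t value = 0;
    int frac_digits = 0;
    bool seen_dot = false;
    bool seen_digit = false;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char c = s.data()[i];
        if (c == '.') {
            if (seen_dot) return false;
            seen_dot = true;
        } else if (c >= '0' && c <= '9') {
            seen_digit = true;
            if (seen_dot) {
                if (frac_digits == decimals) continue;  // truncate extras
                ++frac_digits;
            }
            value = value * 10 + (c - '0');
        } else {
            return false;
        }
    }
    if (!seen_digit) return false;
    for (; frac_digits < decimals; ++frac_digits) value *= 10;
    out = value;
    return true;
}
```

With a buffer holding `BTCUSD,43250.75,0.5`, calling `field(msg, 1, ',')` yields a view that points seven bytes into the original buffer, and `parse_fixed(view, 2, ticks)` sets `ticks` to 4325075. No std::string is constructed along the way.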
Traditional systems often share data between threads with mutex locks (std::mutex), which under contention can put a thread to sleep and force an expensive context switch. Shaurya implements a Single-Producer Single-Consumer Ring Buffer using std::atomic operations. This allows the Network Thread to push data and the Strategy Thread to read data simultaneously, without either thread ever blocking on a lock.
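A minimal sketch of such a queue is shown below. It illustrates the general SPSC technique rather than Shaurya's actual implementation: the `SpscRing` name, the power-of-two capacity rule, and the choice to leave one slot unused are assumptions.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical SPSC ring: exactly one thread may call push() and exactly one
// other thread may call pop(). Capacity must be a power of two, and one slot
// is left unused so that "full" and "empty" can be told apart.
template <typename T, std::size_t Capacity>
class SpscRing {
    static_assert(Capacity >= 2 && (Capacity & (Capacity - 1)) == 0,
                  "Capacity must be a power of two");

public:
    SpscRing() : head_(0), tail_(0) {}

    // Producer side. Returns false instead of waiting when the ring is full.
    bool push(const T& item) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) & (Capacity - 1);
        if (next == head_.load(std::memory_order_acquire)) return false;
        slots_[tail] = item;
        // Release: the slot write above becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Returns false instead of waiting when the ring is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = slots_[head];
        // Release: the slot read above completes before the slot is reused.
        head_.store((head + 1) & (Capacity - 1), std::memory_order_release);
        return true;
    }

private:
    T slots_[Capacity];
    std::atomic<std::size_t> head_;  // next slot to read; written by consumer
    std::atomic<std::size_t> tail_;  // next slot to write; written by producer
};
```

Because each index is written by only one thread, plain acquire/release loads and stores are enough and no compare-and-swap loop is needed. When the ring is full or empty the call returns `false` immediately, so the caller decides whether to spin, drop, or retry.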
Critical data structures are aligned to 64-byte cache lines (alignas(64)). This prevents False Sharing, where threads writing to different variables that happen to sit on the same CPU cache line keep invalidating each other's cached copy, drastically reducing performance on multi-core systems.
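The effect of alignas(64) on layout can be checked at compile time. The sketch below uses hypothetical struct names and assumes a 64-byte cache line, which is typical for x86-64 but not guaranteed on every CPU.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; adjust for the target CPU if it differs.
constexpr std::size_t kCacheLine = 64;

// Unpadded: the two counters are adjacent, so on typical platforms they land
// on the same cache line and two writer threads would contend for it.
struct SharedCounters {
    std::atomic<std::uint64_t> produced;
    std::atomic<std::uint64_t> consumed;
};

// Padded: each counter starts on its own 64-byte boundary.
struct PaddedCounters {
    alignas(kCacheLine) std::atomic<std::uint64_t> produced;
    alignas(kCacheLine) std::atomic<std::uint64_t> consumed;
};

static_assert(alignof(PaddedCounters) == kCacheLine,
              "struct must be cache-line aligned");
static_assert(sizeof(PaddedCounters) == 2 * kCacheLine,
              "each counter must occupy its own cache line");

// Returns true if both addresses fall within the same kCacheLine-sized block.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

The cost is memory: the padded struct is 128 bytes while its two counters need only 16, which is usually an acceptable trade for a handful of hot variables. C++17 also offers `std::hardware_destructive_interference_size` in `<new>` as a portable hint for this constant, although compiler support for it varies.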
- OS: Windows (required for winsock2 and QueryPerformanceCounter)
- Compiler: G++ (MinGW) supporting C++11 or higher.
- Build the System: build.bat
- Start Data Source: python bridge.py
- Start Shaurya Engine: bin\Shaurya.exe
Upon completion, the engine generates a Shaurya_Metrics.txt report detailing the nanosecond-level performance of the run.
If you are new to High-Frequency Trading systems, these concepts explain the "Why" behind Shaurya's architecture:
- Latency vs. Jitter: Understand why "Average Speed" is useless in HFT.
- Zero-Copy Networking: How avoiding memory copies saves microseconds.
- Lock-Free Programming: An introduction to Atomics and Ring Buffers.
- False Sharing: The hidden killer of multi-threaded performance.
Developed by yours truly 🛩️!
