Performance Benchmarks
- Author: Rohit Goswami
- Date: 2026-04-06
Comparison of rsx-rs (Rust) vs the original RADSex (C++11) across three dataset scales. All timings are the median of 3 runs on the same machine.
1.1 Test environment
- CPU: AMD Threadripper (single-socket; benchmarks use 4 threads for process)
- OS: Arch Linux 6.19.11
- C++ compiler: GCC 15.2.1, -O2 -std=c++11
- Rust compiler: rustc 1.85+, --release (LTO thin)
- C++ radsex: v1.2.0 (with <cstdint> fix for GCC 15)
- Rust rsx-rs: v0.1.0
1.2 Dataset scales
| Scale | Individuals | Markers | Reads/ind | Description |
|---|---|---|---|---|
| Small | 10 | 1,000 | 500 | Quick validation |
| Medium | 20 | 10,000 | 2,000 | Typical small study |
| Large | 40 | 100,000 | 5,000 | Realistic RAD-seq size |
Synthetic data are generated with benchmarks/generate_data.py: 10% male-biased
markers, 10% female-biased, 80% common. FASTQ files are gzip-compressed.
1.3 Results
1.3.1 Per-command speedup (C++ time / Rust time)
| Command | Small (1K) | Medium (10K) | Large (100K) | Average |
|---|---|---|---|---|
| process | 1.8x | 1.5x | – | 1.7x |
| freq | 4.0x | 2.5x | 2.6x | 3.0x |
| depth | 5.3x | 5.0x | 3.6x | 4.6x |
| distrib | 5.0x | 2.7x | 2.4x | 3.4x |
| signif | 5.0x | 1.6x | 1.4x | 2.7x |
| subset | 5.3x | 1.9x | 1.4x | 2.9x |
| map | 2.0x | 1.8x | – | 1.9x |
The signif and subset speedups at large scale are lower than before: two-pass streaming reads the file twice to apply the Bonferroni correction. The trade-off is bounded O(n_individuals) memory instead of O(n_markers) accumulation.
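The two-pass idea can be sketched as follows. This is a minimal illustration, not the rsx-rs implementation: the function name `significant_indices` and the `make_pass` closure (which stands in for re-reading the markers table and yields one p-value per marker) are hypothetical, and the sketch assumes every marker counts as one test.

```rust
/// Two-pass Bonferroni filter over a re-readable stream of p-values.
/// `make_pass` is a hypothetical stand-in for opening the markers table
/// again; it must yield the same p-values in the same order on each call.
fn significant_indices<I, F>(make_pass: F, alpha: f64) -> Vec<usize>
where
    F: Fn() -> I,
    I: Iterator<Item = f64>,
{
    // Pass 1: count the tests only; nothing is stored per marker.
    let n_tests = make_pass().count();
    if n_tests == 0 {
        return Vec::new();
    }

    // Bonferroni-corrected threshold.
    let threshold = alpha / n_tests as f64;

    // Pass 2: stream again and keep only the indices that pass.
    make_pass()
        .enumerate()
        .filter(|(_, p)| *p < threshold)
        .map(|(i, _)| i)
        .collect()
}
```

The sketch collects the significant indices for convenience; a streaming implementation would write each passing marker to the output as it is found, so memory does not grow with the number of markers.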
1.3.2 Per-scale average speedup
| Scale | Average speedup |
|---|---|
| Small | 4.1x |
| Medium | 2.4x |
| Large | 2.3x |
1.3.3 Overall
Rust is 2.0x faster across all 19 benchmarks (1.558 s total for C++ vs 0.780 s for Rust). All commands operate in bounded memory (< 500 MB for any input size).
1.3.4 Microbenchmarks (criterion)
| Operation | Time | Notes |
|---|---|---|
| chi_squared_yates | 4.6 ns | Pure f64 arithmetic |
| p_association (erfc) | 108 ns | libm::erfc, was 562 ns (statrs) |
| bitset popcount (40) | 2.3 ns | 1 u64 word |
| bitset popcount (200) | 5.0 ns | 4 u64 words |
| bitset popcount (1000) | 18 ns | 16 u64 words |
| fast_parse_u16 | 4.5 ns | Integer field parsing |
| Cg float format | 193 ns | C++ %g compatible output |
1.4 Raw timing data (seconds, median of 3 runs)
| Scale | Command | C++ (s) | Rust (s) | Speedup |
|---|---|---|---|---|
| small | process | 0.018 | 0.005 | 3.6x |
| small | freq | 0.016 | 0.004 | 4.0x |
| small | depth | 0.017 | 0.004 | 4.3x |
| small | distrib | 0.016 | 0.003 | 5.3x |
| small | signif | 0.016 | 0.005 | 3.2x |
| small | subset | 0.015 | 0.004 | 4.0x |
| small | map | 0.049 | 0.023 | 2.1x |
| medium | process | 0.066 | 0.026 | 2.5x |
| medium | freq | 0.014 | 0.007 | 2.0x |
| medium | depth | 0.026 | 0.010 | 2.6x |
| medium | distrib | 0.031 | 0.008 | 3.9x |
| medium | signif | 0.021 | 0.011 | 1.9x |
| medium | subset | 0.026 | 0.013 | 2.0x |
| medium | map | 0.431 | 0.263 | 1.6x |
| large | freq | 0.129 | 0.067 | 1.9x |
| large | depth | 0.302 | 0.119 | 2.5x |
| large | distrib | 0.189 | 0.076 | 2.5x |
| large | signif | 0.271 | 0.123 | 2.2x |
| large | subset | 0.198 | 0.086 | 2.3x |
1.5 Why Rust is faster
1.5.1 Inline for_each parser (biggest win)
The C++ version uses a producer-consumer pattern: one thread parses the
markers table into a std::queue<Marker> protected by a mutex, another
thread processes markers from the queue with busy-wait polling
(sleep(10us) when empty).
rsx-rs originally copied this pattern using crossbeam channels, but
profiling showed that ~50% of CPU time was spent on channel overhead and marker
cloning. The fix is an inline for_each callback that reuses a single Marker
struct, with zero allocation per marker. The marker is passed by reference to
the callback and reset in place between iterations.
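The callback pattern described above can be sketched like this. It is a simplified illustration under stated assumptions: the `Marker` fields, the `reset` method, and the `for_each_marker` function are hypothetical, and the table is assumed to be tab-separated with an ID column followed by per-individual depths and `#`-prefixed header lines.

```rust
/// Hypothetical marker record; the real rsx-rs `Marker` may differ.
#[derive(Default)]
struct Marker {
    id: String,
    depths: Vec<u16>,
}

impl Marker {
    /// Clear the contents but keep the allocated capacity for reuse.
    fn reset(&mut self) {
        self.id.clear();
        self.depths.clear();
    }
}

/// Parse a tab-separated table and call `f` once per data line with a
/// reference to a single, reused `Marker`. No channel, no clone.
fn for_each_marker<F: FnMut(&Marker)>(table: &str, mut f: F) {
    let mut marker = Marker::default();
    for line in table.lines() {
        // Skip blank lines and `#` header lines (assumed format).
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        marker.reset();
        let mut fields = line.split('\t');
        if let Some(id) = fields.next() {
            marker.id.push_str(id);
        }
        for field in fields {
            // Malformed depths are treated as 0 in this sketch.
            marker.depths.push(field.parse().unwrap_or(0));
        }
        f(&marker);
    }
}
```

Once the buffers have grown to fit the widest line, `reset` followed by `push_str` and `push` reuses the same memory, so the steady-state loop performs no per-marker allocation.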
1.5.2 Bitset popcount for group counts (eliminated HashMap entirely)
The C++ version stores group counts in std::unordered_map<string, uint>, hashing
group name strings for every marker field. rsx-rs replaces this with a
BitsetRow (1 bit per individual) and pre-computed GroupMask bitmasks.
Group count = popcount(marker_bits & group_mask): a single CPU
instruction per 64 individuals, with zero hashing.
For 200 individuals: 5.0 ns per group count vs ~200 ns for a HashMap lookup.
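A minimal sketch of the popcount approach follows. The type name `BitsetRow` comes from the text, but its layout and methods here (`new`, `set`, `count_in`) are assumptions for illustration, and the group mask is modelled as another `BitsetRow` rather than a separate GroupMask type.

```rust
/// One bit per individual, packed into 64-bit words.
/// Layout and methods are illustrative, not the rsx-rs definition.
struct BitsetRow {
    words: Vec<u64>,
}

impl BitsetRow {
    /// Allocate enough words to hold `n_individuals` bits, all cleared.
    fn new(n_individuals: usize) -> Self {
        Self { words: vec![0; (n_individuals + 63) / 64] }
    }

    /// Mark individual `i` as carrying the marker (or belonging to a group).
    fn set(&mut self, i: usize) {
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Count individuals present in both `self` and `mask`:
    /// one AND plus one popcount per 64 individuals, no hashing.
    fn count_in(&self, mask: &BitsetRow) -> u32 {
        self.words
            .iter()
            .zip(&mask.words)
            .map(|(a, b)| (a & b).count_ones())
            .sum()
    }
}
```

With this layout, 40, 200, and 1000 individuals occupy 1, 4, and 16 words respectively, which matches the word counts listed in the microbenchmark table.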
1.5.3 erfc identity for chi-squared CDF (SymPy-derived, 5.2x faster)
The chi-squared p-value for df=1 simplifies exactly to
p = erfc(sqrt(chi2/2)). This is proven via the identity
P(1/2, x) = erf(sqrt(x)) (DLMF 8.2.1). The derivation script is in
scripts/sympy/chi2_cdf_derivation.py.
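For reference, the simplification can be written out in two steps. Here F denotes the CDF of a chi-squared variable with one degree of freedom and P(a, z) the regularized lower incomplete gamma function; these symbols are introduced only for this sketch.

```latex
\begin{align}
F(x) &= P\!\left(\tfrac{1}{2}, \tfrac{x}{2}\right)
      = \operatorname{erf}\!\left(\sqrt{x/2}\right), \\
p &= 1 - F(\chi^2)
   = 1 - \operatorname{erf}\!\left(\sqrt{\chi^2/2}\right)
   = \operatorname{erfc}\!\left(\sqrt{\chi^2/2}\right).
\end{align}
```

As a sanity check, the familiar critical value chi2 = 3.841 gives erfc(sqrt(1.9205)) ≈ 0.05.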
This replaces the full regularized gamma function (statrs crate, 562 ns)
with a single libm::erfc call (108 ns). A Sollya-generated minimax
polynomial (scripts/sollya/erfc_minimax.sollya) can further replace
libm::erfc for GPU kernels.
1.5.4 mmap I/O (zero-copy file access)
The markers table is memory-mapped via memmap2, eliminating buffered
read syscalls. The kernel manages page faults transparently. Combined
with the inline for_each parser, this achieves near-memcpy throughput.
1.5.5 Rayon for process command
The process command reads all FASTQ files and counts sequences. C++ uses
a manual thread pool with mutex-protected work-stealing. rsx-rs uses
rayon’s work-stealing thread pool with per-thread AHashMap accumulators,
merged at the end. This avoids mutex contention during file processing.
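The accumulate-then-merge pattern can be illustrated with the standard library alone. Note the substitutions: this sketch uses std::thread::scope with fixed chunks instead of rayon's work-stealing pool, std's HashMap instead of AHashMap, and in-memory string slices instead of FASTQ files. The function name `count_sequences` is hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

/// Count sequence occurrences across "files" (here: in-memory lists of
/// reads). Each thread fills its own map, so no lock is held during
/// counting; the partial maps are merged once at the end.
fn count_sequences(files: &[Vec<&str>], n_threads: usize) -> HashMap<String, u32> {
    let n_threads = n_threads.max(1);
    // Ceiling division; at least 1 so `chunks` never receives 0.
    let chunk = ((files.len() + n_threads - 1) / n_threads).max(1);

    let partials: Vec<HashMap<String, u32>> = thread::scope(|s| {
        // Spawn every worker first, then join, so they run concurrently.
        let handles: Vec<_> = files
            .chunks(chunk)
            .map(|part| {
                s.spawn(move || {
                    let mut local: HashMap<String, u32> = HashMap::new();
                    for file in part {
                        for seq in file {
                            *local.entry(seq.to_string()).or_insert(0) += 1;
                        }
                    }
                    local
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Single-threaded merge of the per-thread accumulators.
    let mut merged = HashMap::new();
    for local in partials {
        for (seq, n) in local {
            *merged.entry(seq).or_insert(0) += n;
        }
    }
    merged
}
```

The merge step is sequential, but it touches each distinct sequence once per thread rather than once per read, which is why contention disappears from the hot loop.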
1.5.6 minimap2 for map command
rsx-rs uses minimap2 (via Rust bindings) instead of BWA-MEM. minimap2 is
the modern successor with better index construction and alignment speed for
short reads. The sr (short-read) preset targets the same short-read use case
as BWA-MEM, although the resulting alignments are not identical.
1.5.7 Zero-copy field parsing
The table parser uses push_str into pre-allocated strings rather than
creating new String objects per field. Combined with fast_parse_u16 for
integer fields (matching the C++ fast_stoi approach), this eliminates
allocation in the inner parsing loop.
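Both techniques can be sketched briefly. The name `fast_parse_u16` appears in the text, but the signature and error handling below are assumptions; `fill_field` is a hypothetical helper showing the push_str reuse.

```rust
/// Parse an unsigned decimal field directly from bytes, without
/// allocating. Returns None for empty input, non-digit bytes, or values
/// above u16::MAX. Signature is an assumption, not the rsx-rs one.
fn fast_parse_u16(field: &[u8]) -> Option<u16> {
    if field.is_empty() {
        return None;
    }
    let mut value: u32 = 0;
    for &b in field {
        // Bytes below b'0' wrap to large values, so one comparison
        // rejects everything outside '0'..='9'.
        let digit = b.wrapping_sub(b'0');
        if digit > 9 {
            return None;
        }
        value = value * 10 + u32::from(digit);
        if value > u32::from(u16::MAX) {
            return None;
        }
    }
    Some(value as u16)
}

/// Overwrite a pre-allocated buffer with a new field value.
/// `clear` keeps the capacity, so `push_str` only copies bytes.
fn fill_field(buf: &mut String, field: &str) {
    buf.clear();
    buf.push_str(field);
}
```

As long as the buffer's capacity covers the longest field, `fill_field` never touches the allocator, which is the property the inner parsing loop relies on.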
1.6 Reproducing
# Generate test data
python3 benchmarks/generate_data.py
# Build C++ radsex (requires patching for GCC 15)
cd /path/to/radsex && make -j4
# Build Rust rsx-rs
cd /path/to/rsx-rs && cargo build --release
# Run benchmarks
bash benchmarks/run_benchmarks.sh
# Print summary
python3 benchmarks/plot_benchmarks.py
1.7 Golden file compatibility
Despite the performance differences, rsx-rs produces byte-identical output to C++ radsex for all downstream commands (freq, depth, distrib,
signif, subset) when groups are specified explicitly with -G M,F. The
Cg float formatter matches C++’s default %g output (6 significant
digits). The map command uses minimap2 instead of BWA-MEM so alignment
results differ, but the output format is identical.
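To make the %g rule concrete, here is a rough sketch of a formatter that follows the C convention for 6 significant digits. It is not the Cg formatter from rsx-rs: the names `format_g` and `strip_trailing_zeros` are hypothetical, and the sign of NaN is ignored.

```rust
/// Drop trailing zeros after a decimal point, and the point itself if
/// nothing remains after it. Strings without a '.' are left untouched.
fn strip_trailing_zeros(s: &str) -> &str {
    if s.contains('.') {
        s.trim_end_matches('0').trim_end_matches('.')
    } else {
        s
    }
}

/// Format like C's default "%g": 6 significant digits, scientific
/// notation when the decimal exponent is < -4 or >= 6, trailing zeros
/// removed. Hypothetical sketch, not the rsx-rs implementation.
fn format_g(x: f64) -> String {
    if x.is_nan() {
        return "nan".to_string();
    }
    if x.is_infinite() {
        return if x < 0.0 { "-inf" } else { "inf" }.to_string();
    }

    // Round to 6 significant digits first; the exponent must be taken
    // after rounding, as the C rule requires.
    let sci = format!("{:.5e}", x); // e.g. "1.23457e6" or "1.23400e-5"
    let (mantissa, exp) = sci.split_once('e').expect("LowerExp output contains 'e'");
    let exp: i32 = exp.parse().expect("exponent is an integer");

    if exp < -4 || exp >= 6 {
        // C prints a sign and at least two exponent digits.
        let sign = if exp < 0 { '-' } else { '+' };
        format!("{}e{}{:02}", strip_trailing_zeros(mantissa), sign, exp.abs())
    } else {
        // Fixed notation with (6 - 1 - exp) digits after the point.
        let precision = (5 - exp) as usize;
        strip_trailing_zeros(&format!("{:.*}", precision, x)).to_string()
    }
}
```

For example, `format_g(0.05)` yields `0.05` and `format_g(1234567.0)` yields `1.23457e+06`, the same strings C's `printf("%g", ...)` produces for those inputs.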