HPC Design: Towards 10x¶
- Author: Rohit Goswami
- Date: 2026-04-06
1 HPC Design: Mixed-Precision Bitset Architecture¶
1.1 Problem statement¶
At production scale (5M markers x 200 individuals, 5GB raw data), the bottleneck stack is:
- I/O: sequential TSV parsing at ~100 MB/s (single-threaded buffered read)
- Hashing: AHashMap lookups for group counts (~17% of CPU)
- Memory: 2 GB dense u16 matrix, most entries are zero (~75% sparse)
- Precision: f64 everywhere, but only the gamma CDF needs it
1.2 Key insight: the depth matrix is a sparse boolean¶
For distrib/signif/subset/freq, we only care whether depth >= mindepth.
This reduces each entry from u16 (16 bits) to 1 bit. A marker’s group count
becomes popcount(marker_bits & group_mask) – one CPU instruction per 64
individuals.
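The thresholding and popcount steps above can be sketched in a few lines; the helper names (`pack_presence`, `group_count`) are illustrative, not taken from the codebase:

```rust
/// Pack per-individual presence flags (depth >= min_depth) into u64 words.
/// 200 individuals -> 4 words = 32 bytes per marker.
fn pack_presence(depths: &[u16], min_depth: u16) -> Vec<u64> {
    let n_words = (depths.len() + 63) / 64;
    let mut words = vec![0u64; n_words];
    for (i, &d) in depths.iter().enumerate() {
        if d >= min_depth {
            words[i / 64] |= 1u64 << (i % 64);
        }
    }
    words
}

/// Group count = popcount of the AND between marker bits and a group mask.
fn group_count(marker: &[u64], mask: &[u64]) -> u32 {
    marker.iter().zip(mask).map(|(m, g)| (m & g).count_ones()).sum()
}

fn main() {
    let depths = [0u16, 5, 12, 3, 9, 1]; // hypothetical marker row
    let bits = pack_presence(&depths, 5); // individuals 1, 2, 4 pass
    let mask = pack_presence(&[1, 1, 0, 0, 1, 1], 1); // group = {0, 1, 4, 5}
    println!("group count = {}", group_count(&bits, &mask)); // prints 2
}
```

`count_ones()` compiles to a single POPCNT instruction on x86-64, so each 64-individual block of a marker costs one AND plus one popcount.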
| Representation | Bytes/marker (200 ind) | Total (5M markers) |
|---|---|---|
| Dense u16 | 400 | 2.0 GB |
| Sparse CSR | 300 | 1.5 GB |
| Bitset | 32 | 160 MB |
1.3 Mixed precision strategy¶
| Value | Type | Why |
|---|---|---|
| Depths | u16 | Max sequencing depth 65535 |
| Presence bit | u1 | depth >= mindepth |
| Group counts | u8 | Max 255 individuals per group |
| Chi-squared | f32 | Values 0-400, 7 sig figs sufficient |
| P-value (CDF) | f64 | Gamma function needs >7 sig figs |
| Corrected p | f64 | p * n_markers spans 30 decades |
| Bias | f32 | Range [-1, 1], ratio of small integers |
The hot loop becomes integer arithmetic (popcount on u64) with a single f64 CDF call per marker at the end. ~10x less floating-point work than the current all-f64 path.
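As a sketch of the f32 part of that loop, assuming a plain 2x2 Pearson chi-squared without continuity correction (the exact test variant is not specified here; `chi2_f32` is a hypothetical helper):

```rust
/// Pearson chi-squared for a 2x2 table built from u8 group counts.
/// Counts stay integer; the statistic is computed in f32. Only the
/// final p-value (via the gamma CDF, not shown) needs f64.
fn chi2_f32(present_a: u8, present_b: u8, total_a: u8, total_b: u8) -> f32 {
    let a = present_a as f32;                 // group A, marker present
    let b = present_b as f32;                 // group B, marker present
    let c = (total_a - present_a) as f32;     // group A, marker absent
    let d = (total_b - present_b) as f32;     // group B, marker absent
    let n = a + b + c + d;
    let num = n * (a * d - b * c) * (a * d - b * c);
    let den = (a + b) * (c + d) * (a + c) * (b + d);
    if den == 0.0 { 0.0 } else { num / den }
}

fn main() {
    // Hypothetical marker: present in 40/50 males, 5/50 females.
    println!("chi2 = {}", chi2_f32(40, 5, 50, 50)); // ~49.49
}
```

The statistic is bounded by n (here at most 100, and 0-400 at production group sizes), well within f32's ~7 significant digits.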
1.4 Architecture layers¶
Layer 1: I/O (mmap + parallel chunk parse)
- mmap the TSV file
- Split into N chunks at newline boundaries
- Each rayon thread parses its chunk
- Uses stringzilla SIMD for delimiter scanning
- Uses ryu for float output

Layer 2: Representation (bitset matrix)
- During parse: threshold u16 depths to 1-bit
- Pack into u64 words per marker row
- Pre-compute group masks (u64 words with 1s for group members)
- Total memory: 32 bytes/marker for 200 individuals

Layer 3: Compute (popcount + batch stats)
- Group count = popcount(marker_u64[i] & group_mask[i]) for each word
- Chi-squared in f32 from u8 counts
- P-value via f64 gamma CDF (statrs)
- Bonferroni correction in f64

Layer 4 (optional): GPU offload
- Upload bitset matrix (160 MB) to GPU
- 1 thread per marker: popcount + chi-squared + CDF
- Download results
- Only faster when Layer 1 I/O is saturated

Layer 5 (optional): MPI distribution
- MPI_Scatter file chunks across nodes
- Each rank parses + computes locally
- MPI_Reduce to merge results
- Feature-gated behind the "mpi" cargo feature
1.5 Parallel chunk parsing (fastMatMR pattern)¶
The markers table is line-oriented: each line is one marker. Lines are independent (no cross-marker state). This enables embarrassingly parallel parsing:
1. mmap the file (zero-copy, kernel manages paging)
2. Find N evenly-spaced split points
3. Walk each split point forward to the next \n (boundary alignment)
4. Each rayon thread parses its chunk into a local Vec<BitsetRow>
5. Concatenate results (no merge needed – rows are independent)
The stringzilla crate provides SIMD-accelerated find for locating
delimiters (\t, \n) in each chunk, replacing byte-by-byte scanning.
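The boundary-alignment step (2 and 3 above) can be sketched with std only, leaving out mmap and stringzilla; `chunk_at_newlines` is an illustrative helper, not the project's API:

```rust
/// Split a byte buffer into ~n chunks whose boundaries fall on '\n',
/// so each chunk holds only whole lines and can be parsed independently
/// (e.g. by one rayon thread each).
fn chunk_at_newlines(data: &[u8], n: usize) -> Vec<(usize, usize)> {
    let mut chunks = Vec::new();
    let mut start = 0;
    for i in 1..=n {
        // Evenly-spaced candidate split point ...
        let mut end = std::cmp::max(data.len() * i / n, start);
        // ... walked forward so no line is cut in half.
        while end < data.len() && (end == 0 || data[end - 1] != b'\n') {
            end += 1;
        }
        if end > start {
            chunks.push((start, end));
        }
        start = end;
    }
    chunks
}

fn main() {
    let data = b"m1\t3\t0\nm2\t7\t2\nm3\t0\t9\n";
    for (s, e) in chunk_at_newlines(data, 2) {
        print!("chunk: {:?}\n", String::from_utf8_lossy(&data[s..e]));
    }
}
```

Because every chunk boundary lands just after a newline, no cross-chunk state or merge step is needed; results are simply concatenated in chunk order.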
1.6 Expected performance stack¶
Target: 10x over C++ radsex on the large benchmark.
1.7 MPI design¶
Feature-gated: cargo build --features mpi
Only the process command benefits from MPI (distributes FASTQ file
processing across nodes). Downstream commands operate on a single markers
table which fits in memory on one node.
```rust
#[cfg(feature = "mpi")]
fn process_mpi(params: &ProcessParams) {
    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let rank = world.rank();
    let size = world.size();

    // Rank 0 distributes the file list
    let my_files = scatter_files(&input_files, rank, size);

    // Each rank processes its files locally with rayon
    let local_counts = process_files_parallel(&my_files);

    // Reduce: merge all counts onto rank 0
    let global_counts = reduce_counts(local_counts, &world);

    if rank == 0 {
        write_output(global_counts, &params.output_file_path);
    }
}
```
1.8 Streaming architecture¶
All commands operate in bounded memory regardless of input size:
1.8.1 Two-pass streaming (signif, subset, map)¶
Bonferroni correction requires knowing n_markers before output. Two passes over the mmap’d table:

1. Pass 1: count n_markers (fast, no sequence parsing)
2. Pass 2: compute p-values with the known threshold, write directly
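The two-pass shape can be sketched with the "file" replaced by an in-memory slice; `significant` is a hypothetical helper, and in the real pipeline each pass re-reads the mmap'd table instead:

```rust
/// Two-pass Bonferroni: pass 1 counts markers, pass 2 applies the
/// corrected threshold alpha / n_markers while emitting results.
fn significant(p_values: &[f64], alpha: f64) -> Vec<usize> {
    // Pass 1: count markers (cheap; no sequence parsing needed).
    let n_markers = p_values.len() as f64;
    // Pass 2: emit indices passing the corrected threshold.
    let threshold = alpha / n_markers;
    p_values
        .iter()
        .enumerate()
        .filter(|(_, &p)| p < threshold)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let ps = [0.04, 1e-9, 0.5, 2e-8];
    // alpha = 0.05, n = 4 -> corrected threshold 0.0125
    println!("{:?}", significant(&ps, 0.05)); // [1, 3]
}
```

Memory stays bounded because neither pass materializes the table: pass 1 only counts lines, and pass 2 writes each result as it is computed.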
1.8.2 External sort-merge (merge, depth streaming)¶
For datasets exceeding RAM: chunked sort with lz4 temp files + k-way merge via BinaryHeap. The depth command sorts only non-zero entries (sparse optimization, 3.3x I/O reduction for 70% sparse data).
Proof: scripts/sympy/sparse_median_proof.py
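The k-way merge can be sketched over in-memory sorted runs (temp-file I/O and lz4 compression omitted; `kway_merge` is illustrative):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// K-way merge of pre-sorted runs, the in-memory analogue of merging
/// sorted temp files. The heap holds (next value, run index, position);
/// Reverse turns std's max-heap into a min-heap.
fn kway_merge(runs: &[Vec<u32>]) -> Vec<u32> {
    let mut heap = BinaryHeap::new();
    for (r, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, r, 0usize)));
        }
    }
    let mut out = Vec::new();
    while let Some(Reverse((v, r, i))) = heap.pop() {
        out.push(v);
        // Refill from the run we just consumed, if it has more items.
        if let Some(&next) = runs[r].get(i + 1) {
            heap.push(Reverse((next, r, i + 1)));
        }
    }
    out
}

fn main() {
    let runs = vec![vec![1, 4, 9], vec![2, 3, 10], vec![5]];
    println!("{:?}", kway_merge(&runs)); // [1, 2, 3, 4, 5, 9, 10]
}
```

With k runs the heap never holds more than k entries, so the merge uses O(k) memory plus one buffered reader per temp file in the on-disk version.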
1.8.3 Streaming Tucker via Gram eigendecomposition (pca)¶
For an (m x n) depth matrix where m >> n (75M markers, 200 individuals):
- Stream rows, accumulate X^T X (n x n = 200 x 200 = 320 KB)
- Eigendecompose X^T X (Jacobi rotation, instant for 200x200)
- Eigenvectors = right singular vectors of X = Tucker mode-2 factors
This is mathematically exact (not an approximation):

- SVD: X = U S V^T
- Gram: X^T X = V S^2 V^T
- Therefore: eigvecs(X^T X) = V = Tucker U_2
Proof: scripts/sympy/tucker_covariance_proof.py
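The streaming accumulation step can be sketched as follows (the eigendecomposition itself is omitted; `accumulate_gram` is an illustrative helper):

```rust
/// Stream rows of an (m x n) matrix and accumulate the n x n Gram
/// matrix X^T X without ever holding X in memory. For n = 200 the
/// accumulator is 200 * 200 * 8 bytes = 320 KB regardless of m.
fn accumulate_gram(rows: impl Iterator<Item = Vec<f64>>, n: usize) -> Vec<f64> {
    let mut gram = vec![0.0; n * n]; // row-major n x n accumulator
    for row in rows {
        // Each row contributes its outer product row^T * row.
        for i in 0..n {
            for j in 0..n {
                gram[i * n + j] += row[i] * row[j];
            }
        }
    }
    gram
}

fn main() {
    // Tiny example: X is 3 x 2; the eigenvectors of the resulting
    // 2 x 2 Gram matrix are the right singular vectors of X.
    let rows = vec![vec![1.0, 2.0], vec![0.0, 1.0], vec![3.0, 0.0]];
    let g = accumulate_gram(rows.into_iter(), 2);
    println!("{:?}", g); // [10.0, 2.0, 2.0, 5.0], i.e. [[10, 2], [2, 5]]
}
```

Only the n x n accumulator and one row are ever resident, so the pass is I/O-bound for m >> n.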
On real-world RAD-seq data (5 populations, 147 individuals, 467K markers):
- PC1 separates populations, not sexes
- Sex signal (122/467K = 0.03% of markers) is invisible in unsupervised PCA
- 3 components explain 92% of variance (low effective rank)

This negative finding justifies targeted chi-squared/FST approaches.
1.9 Precision validation¶
Formal analyses showing mathematical equivalences:
- Mixed-precision (f32 chi-squared, f64 CDF): rsx-core/tests/test_precision.rs
- erfc identity for chi-squared CDF: scripts/sympy/chi2_cdf_derivation.py
- Sparse median equivalence: scripts/sympy/sparse_median_proof.py
- Gram eigendecomposition = Tucker HOSVD: scripts/sympy/tucker_covariance_proof.py
- Sollya minimax erfc polynomial: scripts/sollya/erfc_minimax.sollya