HPC Design: Towards 10x

Author: Rohit Goswami

Date: 2026-04-06

1 HPC Design: Mixed-Precision Bitset Architecture

1.1 Problem statement

At production scale (5M markers x 200 individuals, 5GB raw data), the bottleneck stack is:

  1. I/O: sequential TSV parsing at ~100 MB/s (single-threaded buffered read)

  2. Hashing: AHashMap lookups for group counts (~17% of CPU)

  3. Memory: 2 GB dense u16 matrix, most entries are zero (~75% sparse)

  4. Precision: f64 everywhere, but only the gamma CDF needs it

1.2 Key insight: the depth matrix is a sparse boolean

For distrib/signif/subset/freq, we only care whether depth >= mindepth. This reduces each entry from u16 (16 bits) to 1 bit. A marker’s group count becomes popcount(marker_bits & group_mask) – one CPU instruction per 64 individuals.
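
The popcount trick can be sketched as follows (a minimal sketch: the `group_count` function name and the explicit word layout are illustrative, not an existing API; 200 individuals pack into four u64 words):

```rust
/// Count how many individuals in a group carry a marker at
/// depth >= mindepth. With 200 individuals packed into four u64
/// words, this is four AND + four POPCNT instructions per group.
fn group_count(marker_bits: &[u64], group_mask: &[u64]) -> u32 {
    marker_bits
        .iter()
        .zip(group_mask)
        .map(|(m, g)| (m & g).count_ones())
        .sum()
}
```

For example, a marker present in individuals 0, 1, and 64 intersected with a group mask covering individuals 0 and 1 yields a count of 2.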

  Representation   Bytes/marker (200 ind)   Total (5M markers)
  Dense u16        400                      2.0 GB
  Sparse CSR       300                      1.5 GB
  Bitset           32                       160 MB

1.3 Mixed precision strategy

  Value           Type   Why
  Depths          u16    Max sequencing depth 65535
  Presence bit    u1     depth >= mindepth
  Group counts    u8     Max 255 individuals per group
  Chi-squared     f32    Values 0-400, 7 sig figs sufficient
  P-value (CDF)   f64    Gamma function needs >7 sig figs
  Corrected p     f64    p * n_markers spans 30 decades
  Bias            f32    Range [-1, 1], ratio of small integers

The hot loop becomes integer arithmetic (popcount on u64) with a single f64 CDF call per marker at the end. ~10x less floating-point work than the current all-f64 path.
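
A sketch of the precision split for the statistic itself (the `chi2_f32` name is hypothetical; the formula is the standard 2x2 contingency chi-squared without continuity correction, computed entirely in f32 from u8 counts):

```rust
/// 2x2 chi-squared statistic from u8 presence counts, all in f32.
/// a = group-1 with marker, b = group-1 without,
/// c = group-2 with marker, d = group-2 without.
fn chi2_f32(a: u8, b: u8, c: u8, d: u8) -> f32 {
    let (a, b, c, d) = (a as f32, b as f32, c as f32, d as f32);
    let n = a + b + c + d;
    let num = n * (a * d - b * c) * (a * d - b * c);
    let den = (a + b) * (c + d) * (a + c) * (b + d);
    if den == 0.0 { 0.0 } else { num / den }
}
// Only the final p-value promotes to f64: one gamma-CDF call per
// marker (statrs, per the text), fed with chi2 as f64.
```

With counts in 0..=200 the statistic stays below ~400, comfortably within f32's ~7 significant digits.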

1.4 Architecture layers

Layer 1: I/O (mmap + parallel chunk parse)
  mmap the TSV file
  Split into N chunks at newline boundaries
  Each rayon thread parses its chunk
  Uses stringzilla SIMD for delimiter scanning
  Uses ryu for float output

Layer 2: Representation (bitset matrix)
  During parse: threshold u16 depths to 1-bit
  Pack into u64 words per marker row
  Pre-compute group masks (u64 words with 1s for group members)
  Total memory: 32 bytes/marker for 200 individuals

Layer 3: Compute (popcount + batch stats)
  Group count = popcount(marker_u64[i] & group_mask[i]) for each word
  Chi-squared in f32 from u8 counts
  P-value via f64 gamma CDF (statrs)
  Bonferroni correction in f64

Layer 4 (optional): GPU offload
  Upload bitset matrix (160 MB) to GPU
  1 thread per marker: popcount + chi-squared + CDF
  Download results
  Only faster when Layer 1 I/O is saturated

Layer 5 (optional): MPI distribution
  MPI_Scatter file chunks across nodes
  Each rank parses + computes locally
  MPI_Reduce to merge results
  Feature-gated behind "mpi" cargo feature

1.5 Parallel chunk parsing (fastMatMR pattern)

The markers table is line-oriented: each line is one marker. Lines are independent (no cross-marker state). This enables embarrassingly parallel parsing:

  1. mmap the file (zero-copy, kernel manages paging)

  2. Find N evenly-spaced split points

  3. Walk each split point forward to the next \n (boundary alignment)

  4. Each rayon thread parses its chunk into a local Vec<BitsetRow>

  5. Concatenate results (no merge needed – rows are independent)

The stringzilla crate provides SIMD-accelerated find for locating delimiters (\t, \n) in each chunk, replacing byte-by-byte scanning.
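
The boundary-alignment step (3) can be sketched in safe std Rust over a byte slice standing in for the mmap (the `chunk_bounds` name is illustrative; the real version would use stringzilla's SIMD find instead of the byte loop):

```rust
/// Split a byte buffer (e.g. an mmap'd TSV) into up to `n` chunks
/// whose boundaries fall just after a '\n', so every chunk holds
/// whole lines and can be parsed independently.
fn chunk_bounds(data: &[u8], n: usize) -> Vec<(usize, usize)> {
    let mut bounds = Vec::with_capacity(n);
    let mut start = 0;
    for i in 1..n {
        let mut split = data.len() * i / n; // evenly-spaced candidate
        if split <= start {
            continue; // a previous boundary already passed this point
        }
        // Walk forward to the next newline (boundary alignment).
        while split < data.len() && data[split] != b'\n' {
            split += 1;
        }
        let end = (split + 1).min(data.len());
        bounds.push((start, end));
        start = end;
    }
    bounds.push((start, data.len()));
    bounds
}
```

Each `(start, end)` range can then be handed to a rayon thread (e.g. via `par_iter` over the bounds) with no cross-chunk state.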

1.6 Expected performance stack

Target: 10x over C++ radsex on the large benchmark.

1.7 MPI design

Feature-gated: cargo build --features mpi

Only the process command benefits from MPI (distributes FASTQ file processing across nodes). Downstream commands operate on a single markers table which fits in memory on one node.

#[cfg(feature = "mpi")]
fn process_mpi(params: &ProcessParams) {
    use mpi::traits::*; // Communicator trait for rank()/size()

    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let rank = world.rank();
    let size = world.size();

    // Rank 0 distributes file list
    let my_files = scatter_files(&input_files, rank, size);

    // Each rank processes its files locally with rayon
    let local_counts = process_files_parallel(&my_files);

    // Reduce: merge all counts to rank 0
    let global_counts = reduce_counts(local_counts, &world);

    if rank == 0 {
        write_output(global_counts, &params.output_file_path);
    }
}

1.8 Streaming architecture

All commands operate in bounded memory regardless of input size:

1.8.1 Two-pass streaming (signif, subset, map)

Bonferroni correction requires knowing n_markers before output. Two passes over the mmap’d table:

  1. Pass 1: count n_markers (fast, no sequence parsing)

  2. Pass 2: compute p-values with known threshold, write directly
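
The two passes can be sketched as follows (a stand-in: `significant_markers` and the `p_value` closure are illustrative, with an in-memory string playing the role of the mmap'd table):

```rust
/// Two-pass Bonferroni: pass 1 counts markers (a cheap line count,
/// no sequence parsing), pass 2 applies the now-known threshold
/// while writing results directly.
fn significant_markers(
    table: &str,
    alpha: f64,
    p_value: impl Fn(&str) -> f64,
) -> Vec<String> {
    // Pass 1: n_markers.
    let n_markers = table.lines().count() as f64;
    let threshold = alpha / n_markers; // Bonferroni-corrected cutoff
    // Pass 2: compute p-values against the known threshold.
    table
        .lines()
        .filter(|line| p_value(line) < threshold)
        .map(str::to_owned)
        .collect()
}
```

Because the table is mmap'd, the second pass re-reads pages that are likely still in the kernel cache, so the extra pass costs far less than a second disk scan.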

1.8.2 External sort-merge (merge, depth streaming)

For datasets exceeding RAM: chunked sort with lz4 temp files + k-way merge via BinaryHeap. The depth command sorts only non-zero entries (sparse optimization, 3.3x I/O reduction for 70% sparse data).

Proof: scripts/sympy/sparse_median_proof.py
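
The k-way merge step can be sketched with std's BinaryHeap (a minimal sketch: in-memory sorted runs stand in for the lz4-compressed temp files, and `kway_merge` is an illustrative name):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k sorted runs into one sorted output. The heap holds
/// (next value, run index, position), so each pop yields the
/// globally smallest remaining value in O(log k).
fn kway_merge(runs: &[Vec<u32>]) -> Vec<u32> {
    let mut heap: BinaryHeap<Reverse<(u32, usize, usize)>> = runs
        .iter()
        .enumerate()
        .filter(|(_, r)| !r.is_empty())
        .map(|(i, r)| Reverse((r[0], i, 0)))
        .collect();
    let mut out = Vec::new();
    while let Some(Reverse((v, run, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(pos + 1) {
            heap.push(Reverse((next, run, pos + 1)));
        }
    }
    out
}
```

In the real pipeline each run would be a streaming decoder over a temp file rather than a `Vec`, but the heap discipline is the same.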

1.8.3 Streaming Tucker via Gram eigendecomposition (pca)

For an (m x n) depth matrix where m >> n (75M markers, 200 individuals):

  • Stream rows, accumulate X^T X (n x n = 200 x 200 = 320 KB)

  • Eigendecompose X^T X (Jacobi rotation, instant for 200x200)

  • Eigenvectors = right singular vectors of X = Tucker mode-2 factors

This is mathematically exact (not an approximation):

  SVD:        X = U S V^T
  Gram:       X^T X = V S^2 V^T
  Therefore:  eigvecs(X^T X) = V = Tucker U_2

Proof: scripts/sympy/tucker_covariance_proof.py
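
The streaming accumulation can be sketched in pure std Rust (the `gram_accumulate` name and `Vec<Vec<f64>>` layout are illustrative; the eigendecomposition step is omitted):

```rust
/// Stream rows of X (m >> n) and accumulate the n x n Gram matrix
/// X^T X without ever holding X in memory. Eigendecomposing the
/// result yields the right singular vectors V, i.e. the Tucker
/// mode-2 factors, per the Gram identity X^T X = V S^2 V^T.
fn gram_accumulate(rows: impl Iterator<Item = Vec<f64>>, n: usize) -> Vec<Vec<f64>> {
    let mut xtx = vec![vec![0.0; n]; n];
    for row in rows {
        // Rank-1 update: xtx += row * row^T.
        for i in 0..n {
            for j in 0..n {
                xtx[i][j] += row[i] * row[j];
            }
        }
    }
    xtx
}
```

Peak memory is one row plus the n x n accumulator, so 75M markers cost no more than 200 x 200 f64s plus the current row.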

On real-world RAD-seq data (5 populations, 147 individuals, 467K markers):

  • PC1 separates populations, not sexes

  • Sex signal (122/467K = 0.03% of markers) is invisible in unsupervised PCA

  • 3 components explain 92% of variance (low effective rank)

  • This negative finding justifies targeted chi-squared/FST approaches

1.9 Precision validation

Formal analyses showing mathematical equivalences:

  • Mixed-precision (f32 chi-squared, f64 CDF): rsx-core/tests/test_precision.rs

  • erfc identity for chi-squared CDF: scripts/sympy/chi2_cdf_derivation.py

  • Sparse median equivalence: scripts/sympy/sparse_median_proof.py

  • Gram eigendecomposition = Tucker HOSVD: scripts/sympy/tucker_covariance_proof.py

  • Sollya minimax erfc polynomial: scripts/sollya/erfc_minimax.sollya