HPC Design: Towards 10x

Author: Rohit Goswami

Date: 2026-04-06

1 HPC Design: Mixed-Precision Bitset Architecture

1.1 Problem statement

At production scale (5M markers x 200 individuals, 5GB raw data), the bottleneck stack is:

  1. I/O: sequential TSV parsing at ~100 MB/s (single-threaded buffered read)

  2. Hashing: AHashMap lookups for group counts (~17% of CPU)

  3. Memory: 2 GB dense u16 matrix, most entries are zero (~75% sparse)

  4. Precision: f64 everywhere, but only the gamma CDF needs it

1.2 Key insight: the depth matrix is a sparse boolean

For distrib/signif/subset/freq, we only care whether depth >= mindepth. This reduces each entry from u16 (16 bits) to 1 bit. A marker’s group count becomes popcount(marker_bits & group_mask) – one CPU instruction per 64 individuals.
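
The popcount trick can be sketched as follows (a minimal sketch: the `group_count` function name and the explicit word layout are illustrative, not an existing API; 200 individuals pack into four u64 words):

```rust
/// Count how many individuals in a group carry a marker at
/// depth >= mindepth. With 200 individuals packed into four u64
/// words, this is four AND + four POPCNT instructions per group.
fn group_count(marker_bits: &[u64], group_mask: &[u64]) -> u32 {
    marker_bits
        .iter()
        .zip(group_mask)
        .map(|(m, g)| (m & g).count_ones())
        .sum()
}
```

For example, a marker present in individuals 0, 1, and 64 intersected with a group mask covering individuals 0 and 1 yields a count of 2.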

  Representation   Bytes/marker (200 ind)   Total (5M markers)
  Dense u16        400                      2.0 GB
  Sparse CSR       300                      1.5 GB
  Bitset           32                       160 MB

1.3 Mixed precision strategy

  Value           Type   Why
  Depths          u16    Max sequencing depth 65535
  Presence bit    u1     depth >= mindepth
  Group counts    u8     Max 255 individuals per group
  Chi-squared     f32    Values 0-400, 7 sig figs sufficient
  P-value (CDF)   f64    Gamma function needs >7 sig figs
  Corrected p     f64    p * n_markers spans 30 decades
  Bias            f32    Range [-1, 1], ratio of small integers

The hot loop becomes integer arithmetic (popcount on u64) with a single f64 CDF call per marker at the end. ~10x less floating-point work than the current all-f64 path.
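
A sketch of the precision split for the statistic itself (the `chi2_f32` name is hypothetical; the formula is the standard 2x2 contingency chi-squared without continuity correction, computed entirely in f32 from u8 counts):

```rust
/// 2x2 chi-squared statistic from u8 presence counts, all in f32.
/// a = group-1 with marker, b = group-1 without,
/// c = group-2 with marker, d = group-2 without.
fn chi2_f32(a: u8, b: u8, c: u8, d: u8) -> f32 {
    let (a, b, c, d) = (a as f32, b as f32, c as f32, d as f32);
    let n = a + b + c + d;
    let num = n * (a * d - b * c) * (a * d - b * c);
    let den = (a + b) * (c + d) * (a + c) * (b + d);
    if den == 0.0 { 0.0 } else { num / den }
}
// Only the final p-value promotes to f64: one gamma-CDF call per
// marker (statrs, per the text), fed with chi2 as f64.
```

With counts in 0..=200 the statistic stays below ~400, comfortably within f32's ~7 significant digits.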

1.4 Architecture layers

Layer 1: I/O (mmap + parallel chunk parse)
  mmap the TSV file
  Split into N chunks at newline boundaries
  Each rayon thread parses its chunk
  Uses stringzilla SIMD for delimiter scanning
  Uses ryu for float output

Layer 2: Representation (bitset matrix)
  During parse: threshold u16 depths to 1-bit
  Pack into u64 words per marker row
  Pre-compute group masks (u64 words with 1s for group members)
  Total memory: 32 bytes/marker for 200 individuals

Layer 3: Compute (popcount + batch stats)
  Group count = popcount(marker_u64[i] & group_mask[i]) for each word
  Chi-squared in f32 from u8 counts
  P-value via f64 gamma CDF (statrs)
  Bonferroni correction in f64

Layer 4 (optional): GPU offload
  Upload bitset matrix (160 MB) to GPU
  1 thread per marker: popcount + chi-squared + CDF
  Download results
  Only faster when Layer 1 I/O is saturated

Layer 5 (optional): MPI distribution
  MPI_Scatter file chunks across nodes
  Each rank parses + computes locally
  MPI_Reduce to merge results
  Feature-gated behind "mpi" cargo feature

1.5 Parallel chunk parsing (fastMatMR pattern)

The markers table is line-oriented: each line is one marker. Lines are independent (no cross-marker state). This enables embarrassingly parallel parsing:

  1. mmap the file (zero-copy, kernel manages paging)

  2. Find N evenly-spaced split points

  3. Walk each split point forward to the next \n (boundary alignment)

  4. Each rayon thread parses its chunk into a local Vec<BitsetRow>

  5. Concatenate results (no merge needed – rows are independent)

The stringzilla crate provides SIMD-accelerated find for locating delimiters (\t, \n) in each chunk, replacing byte-by-byte scanning.
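
The boundary-alignment step (3) can be sketched in safe std Rust over a byte slice standing in for the mmap (the `chunk_bounds` name is illustrative; the real version would use stringzilla's SIMD find instead of the byte loop):

```rust
/// Split a byte buffer (e.g. an mmap'd TSV) into up to `n` chunks
/// whose boundaries fall just after a '\n', so every chunk holds
/// whole lines and can be parsed independently.
fn chunk_bounds(data: &[u8], n: usize) -> Vec<(usize, usize)> {
    let mut bounds = Vec::with_capacity(n);
    let mut start = 0;
    for i in 1..n {
        let mut split = data.len() * i / n; // evenly-spaced candidate
        if split <= start {
            continue; // a previous boundary already passed this point
        }
        // Walk forward to the next newline (boundary alignment).
        while split < data.len() && data[split] != b'\n' {
            split += 1;
        }
        let end = (split + 1).min(data.len());
        bounds.push((start, end));
        start = end;
    }
    bounds.push((start, data.len()));
    bounds
}
```

Each `(start, end)` range can then be handed to a rayon thread (e.g. via `par_iter` over the bounds) with no cross-chunk state.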

1.6 Expected performance stack

Target: 10x over C++ radsex on the large benchmark.

1.7 MPI design

Feature-gated: cargo build --features mpi

Only the process command benefits from MPI (distributes FASTQ file processing across nodes). Downstream commands operate on a single markers table which fits in memory on one node.

#[cfg(feature = "mpi")]
fn process_mpi(params: &ProcessParams) {
    use mpi::traits::*; // Communicator trait for rank()/size()

    let universe = mpi::initialize().unwrap();
    let world = universe.world();
    let rank = world.rank();
    let size = world.size();

    // Rank 0 distributes file list
    let my_files = scatter_files(&input_files, rank, size);

    // Each rank processes its files locally with rayon
    let local_counts = process_files_parallel(&my_files);

    // Reduce: merge all counts to rank 0
    let global_counts = reduce_counts(local_counts, &world);

    if rank == 0 {
        write_output(global_counts, &params.output_file_path);
    }
}

1.8 Streaming architecture

All commands operate in bounded memory regardless of input size:

1.8.1 Two-pass streaming (signif, subset, map)

Bonferroni correction requires knowing n_markers before output. Two passes over the mmap’d table:

  1. Pass 1: count n_markers (fast, no sequence parsing)

  2. Pass 2: compute p-values with known threshold, write directly
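
The two passes can be sketched as follows (a stand-in: `significant_markers` and the `p_value` closure are illustrative, with an in-memory string playing the role of the mmap'd table):

```rust
/// Two-pass Bonferroni: pass 1 counts markers (a cheap line count,
/// no sequence parsing), pass 2 applies the now-known threshold
/// while writing results directly.
fn significant_markers(
    table: &str,
    alpha: f64,
    p_value: impl Fn(&str) -> f64,
) -> Vec<String> {
    // Pass 1: n_markers.
    let n_markers = table.lines().count() as f64;
    let threshold = alpha / n_markers; // Bonferroni-corrected cutoff
    // Pass 2: compute p-values against the known threshold.
    table
        .lines()
        .filter(|line| p_value(line) < threshold)
        .map(str::to_owned)
        .collect()
}
```

Because the table is mmap'd, the second pass re-reads pages that are likely still in the kernel cache, so the extra pass costs far less than a second disk scan.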

1.8.2 External sort-merge (merge, depth streaming)

For datasets exceeding RAM: chunked sort with lz4 temp files + k-way merge via BinaryHeap. The depth command sorts only non-zero entries (sparse optimization, 3.3x I/O reduction for 70% sparse data).

Proof: scripts/sympy/sparse_median_proof.py
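
The k-way merge step can be sketched with std's BinaryHeap (a minimal sketch: in-memory sorted runs stand in for the lz4-compressed temp files, and `kway_merge` is an illustrative name):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge k sorted runs into one sorted output. The heap holds
/// (next value, run index, position), so each pop yields the
/// globally smallest remaining value in O(log k).
fn kway_merge(runs: &[Vec<u32>]) -> Vec<u32> {
    let mut heap: BinaryHeap<Reverse<(u32, usize, usize)>> = runs
        .iter()
        .enumerate()
        .filter(|(_, r)| !r.is_empty())
        .map(|(i, r)| Reverse((r[0], i, 0)))
        .collect();
    let mut out = Vec::new();
    while let Some(Reverse((v, run, pos))) = heap.pop() {
        out.push(v);
        if let Some(&next) = runs[run].get(pos + 1) {
            heap.push(Reverse((next, run, pos + 1)));
        }
    }
    out
}
```

In the real pipeline each run would be a streaming decoder over a temp file rather than a `Vec`, but the heap discipline is the same.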

1.8.3 Streaming Tucker via Gram eigendecomposition (pca)

For an (m x n) depth matrix where m >> n (75M markers, 200 individuals):

  • Stream rows, accumulate X^T X (n x n = 200 x 200 = 320 KB)

  • Eigendecompose X^T X (Jacobi rotation, instant for 200x200)

  • Eigenvectors = right singular vectors of X = Tucker mode-2 factors

This is mathematically exact (not an approximation):

  SVD:        X = U S V^T
  Gram:       X^T X = V S^2 V^T
  Therefore:  eigvecs(X^T X) = V = Tucker U_2

Proof: scripts/sympy/tucker_covariance_proof.py
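
The streaming accumulation can be sketched in pure std Rust (the `gram_accumulate` name and `Vec<Vec<f64>>` layout are illustrative; the eigendecomposition step is omitted):

```rust
/// Stream rows of X (m >> n) and accumulate the n x n Gram matrix
/// X^T X without ever holding X in memory. Eigendecomposing the
/// result yields the right singular vectors V, i.e. the Tucker
/// mode-2 factors, per the Gram identity X^T X = V S^2 V^T.
fn gram_accumulate(rows: impl Iterator<Item = Vec<f64>>, n: usize) -> Vec<Vec<f64>> {
    let mut xtx = vec![vec![0.0; n]; n];
    for row in rows {
        // Rank-1 update: xtx += row * row^T.
        for i in 0..n {
            for j in 0..n {
                xtx[i][j] += row[i] * row[j];
            }
        }
    }
    xtx
}
```

Peak memory is one row plus the n x n accumulator, so 75M markers cost no more than 200 x 200 f64s plus the current row.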

On real-world RAD-seq data (5 populations, 147 individuals, 467K markers):

  • PC1 separates populations, not sexes

  • Sex signal (122/467K = 0.03% of markers) is invisible in unsupervised PCA

  • 3 components explain 92% of variance (low effective rank)

  • This negative finding justifies targeted chi-squared/FST approaches

1.9 Precision validation

Formal analyses showing mathematical equivalences:

  • Mixed-precision (f32 chi-squared, f64 CDF): rsx-core/tests/test_precision.rs

  • erfc identity for chi-squared CDF: scripts/sympy/chi2_cdf_derivation.py

  • Sparse median equivalence: scripts/sympy/sparse_median_proof.py

  • Gram eigendecomposition = Tucker HOSVD: scripts/sympy/tucker_covariance_proof.py

  • Sollya minimax erfc polynomial: scripts/sollya/erfc_minimax.sollya