Architecture

Author:

Rohit Goswami

1 Architecture

rsx-rs follows the rgpot pattern: a Rust core library with auto-generated C headers, wrapped by a thin CLI binary.

1.1 Crate structure

rsx-rs/
  rsx-core/        Rust library (staticlib + cdylib + lib)
    src/
      lib.rs          Module declarations
      status.rs       C API error handling (status codes, thread-local errors)
      stats.rs        Chi-squared, Bonferroni, group bias, Cg formatter
      popmap.rs       Population map (individual -> group)
      marker.rs       Marker struct (sequence + depths + p-values)
      markers_table.rs  Streaming parser (crossbeam channels)
      io/
        seq_reader.rs   FASTQ/FASTA reader (needletail)
        table_io.rs     TSV read/write, fast integer parsing
      commands/
        process.rs    Parallel file ingestion (rayon + ahash)
        distrib.rs    Marker distribution between groups
        signif.rs     Significant marker extraction (two-pass Bonferroni)
        depth.rs      Per-individual read statistics
        freq.rs       Marker frequency distribution
        map.rs        Genome alignment (minimap2)
        subset.rs     Filtered marker extraction
      c_api/
        types.rs      Opaque handles, constructors, destructors
        commands.rs   extern "C" wrappers for each command
  rsx-cli/         Binary crate (clap subcommands)

1.2 C API contract

Following the metatensor pattern:

  1. All public types are #[repr(C)] – binary-compatible with C

  2. All public functions are extern "C" with catch_unwind for panic safety

  3. Status code first: every function returns rsx_status_t

  4. Thread-local errors: details via rsx_last_error()

  5. Naming: rsx_ prefix, SCREAMING_SNAKE_CASE enums, _t suffix structs
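A minimal sketch of this contract, assuming illustrative names (RsxStatus, rsx_example_divide, and the error store behind rsx_last_error are made up for this example and are not the actual rsx.h surface):

```rust
use std::cell::RefCell;
use std::panic::catch_unwind;

// Illustrative status enum following the contract: #[repr(C)], SCREAMING_SNAKE_CASE.
#[allow(non_camel_case_types)]
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum RsxStatus {
    RSX_SUCCESS = 0,
    RSX_INVALID_ARGUMENT = 1,
    RSX_INTERNAL_PANIC = 2,
}

thread_local! {
    // Per-thread error detail, the kind of storage rsx_last_error() would read.
    static LAST_ERROR: RefCell<String> = RefCell::new(String::new());
}

fn set_last_error(msg: &str) {
    LAST_ERROR.with(|e| *e.borrow_mut() = msg.to_string());
}

// Status-code-first wrapper: the body runs under catch_unwind so a panic
// can never unwind across the FFI boundary.
#[no_mangle]
pub extern "C" fn rsx_example_divide(num: i64, den: i64, out: *mut i64) -> RsxStatus {
    let result = catch_unwind(|| {
        if out.is_null() || den == 0 {
            set_last_error("null output pointer or division by zero");
            return RsxStatus::RSX_INVALID_ARGUMENT;
        }
        unsafe { *out = num / den };
        RsxStatus::RSX_SUCCESS
    });
    result.unwrap_or_else(|_| {
        set_last_error("panic inside rsx_example_divide");
        RsxStatus::RSX_INTERNAL_PANIC
    })
}

fn main() {
    let mut v: i64 = 0;
    let status = rsx_example_divide(10, 2, &mut v);
    println!("status = {:?}, v = {}", status, v);
}
```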

The C header is auto-generated by cbindgen:

cargo build --manifest-path rsx-core/Cargo.toml --features gen-header
# Produces rsx-core/include/rsx.h

1.3 Data flow

FASTQ files --[process]--> markers.tsv --[distrib/signif/freq/depth/subset]--> analysis.tsv
                                       --[map + genome.fa]--> aligned.tsv

The process command is the only one that reads raw sequence files. All other commands consume the markers depth table (TSV).

1.4 Streaming parser

The markers table parser uses a producer-consumer pattern:

  • Producer thread: reads the TSV file in 64KB chunks, parses fields character-by-character (matching the C++ approach), and sends Marker batches through a crossbeam::channel::bounded channel

  • Consumer: receives batches via recv() (blocking, no polling)

This replaces the C++ std::queue + std::mutex + sleep(10us) pattern.
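The pattern can be sketched as follows; std's sync_channel stands in for crossbeam::channel::bounded so the sketch has no dependencies, and the Marker struct is a toy version, but the backpressure behaviour is the same (the producer blocks when the buffer is full, the consumer blocks in recv with no polling):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Toy Marker standing in for the real struct (sequence + depths).
struct Marker {
    seq: String,
    depths: Vec<u32>,
}

fn run_pipeline() -> usize {
    // Bounded channel of Marker batches, capacity 4.
    let (tx, rx) = sync_channel::<Vec<Marker>>(4);

    let producer = thread::spawn(move || {
        for batch_id in 0..3u32 {
            // In rsx-core each batch would come from parsing a 64KB TSV chunk.
            let batch: Vec<Marker> = (0..2u32)
                .map(|i| Marker {
                    seq: format!("ACGT{}{}", batch_id, i),
                    depths: vec![batch_id, i],
                })
                .collect();
            tx.send(batch).unwrap(); // blocks if the channel is full
        }
        // Dropping tx closes the channel; recv() then returns Err.
    });

    let mut total = 0;
    while let Ok(batch) = rx.recv() {
        // Blocking receive: no queue + mutex + sleep loop.
        total += batch.len();
    }
    producer.join().unwrap();
    total
}

fn main() {
    println!("consumed {} markers", run_pipeline());
}
```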

1.5 Parallelism

process uses rayon::par_iter() over input files. Each file is processed independently (count sequences with needletail, accumulate in ahash::AHashMap). For >= 8 individuals, a DashMap enables lock-free concurrent insertion (no sequential merge phase). Sequences are stored as 2-bit packed DNA (4x key compression: 100bp -> 26 bytes).
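The 2-bit packing can be illustrated as below (an illustrative encoding with A=00, C=01, G=10, T=11, not necessarily the exact rsx-core layout; the ~26 bytes quoted above presumably includes length metadata on top of the 25 packed bytes):

```rust
// Pack a DNA string at four bases per byte.
fn pack_dna(seq: &str) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, base) in seq.bytes().enumerate() {
        let code = match base {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => 0, // the real reader would reject ambiguous bases
        };
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

// Recover the sequence; the original length must be stored alongside
// the packed bytes, since 4 bases share each byte.
fn unpack_dna(packed: &[u8], len: usize) -> String {
    (0..len)
        .map(|i| {
            let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            ['A', 'C', 'G', 'T'][code as usize]
        })
        .collect()
}

fn main() {
    let packed = pack_dna("ACGTTGCA");
    println!("8 bases -> {} bytes -> {}", packed.len(), unpack_dna(&packed, 8));
}
```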

1.6 External sort-merge (merge command)

The merge command handles datasets that exceed available RAM (75M+ unique sequences across hundreds of samples). Algorithm:

  1. Stream all input files, buffering up to N entries in memory

  2. When buffer full: sort by 2-bit packed sequence, write lz4 temp file

  3. K-way merge sorted temp files using BinaryHeap (min-heap)

  4. Coalesce entries with equal sequence keys, merge depth columns

  5. Write TSV output with unified sample columns

Memory: O(buffer_size), not O(total_sequences). Default buffer: 2M entries (~400MB). Temp files: lz4-compressed sorted chunks on disk.
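Step 3 and 4 can be sketched as a k-way merge with coalescing; here each sorted run is an in-memory Vec standing in for an lz4 temp file, and entries are simplified to (sequence_key, depth) pairs rather than full depth columns:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// K-way merge of sorted runs via a min-heap (BinaryHeap over Reverse).
// Heap entries are (key, depth, run_index, position_in_run).
fn kway_merge(runs: Vec<Vec<(u64, u32)>>) -> Vec<(u64, u32)> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&(key, depth)) = run.first() {
            heap.push(Reverse((key, depth, run_idx, 0usize)));
        }
    }
    let mut out: Vec<(u64, u32)> = Vec::new();
    while let Some(Reverse((key, depth, run_idx, pos))) = heap.pop() {
        match out.last_mut() {
            // Coalesce entries with equal sequence keys.
            Some(last) if last.0 == key => last.1 += depth,
            _ => out.push((key, depth)),
        }
        // Refill the heap from the run we just consumed from.
        if let Some(&(k, d)) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((k, d, run_idx, pos + 1)));
        }
    }
    out
}

fn main() {
    let merged = kway_merge(vec![vec![(1, 2), (3, 1)], vec![(1, 5), (2, 4)]]);
    println!("{:?}", merged);
}
```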

1.7 Bounded-memory streaming (all commands)

All commands operate in bounded memory regardless of input size.

1.7.1 Two-pass streaming (signif, subset, map)

Commands requiring Bonferroni correction use two passes over the mmap’d table:

  1. Pass 1: count n_markers (fast path, no sequence parsing)

  2. Compute corrected_threshold = threshold / n_markers

  3. Pass 2: re-mmap the same file (served from the kernel page cache, so no extra disk I/O), compute p-values, write qualifying markers directly to output

Memory: O(n_individuals) for bitset masks + output buffer. No Vec<Marker> accumulation.

The map command builds the minimap2 index between passes (~100-300MB), then aligns and writes in pass 2.
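A minimal sketch of the two-pass Bonferroni scheme, with an in-memory p-value slice standing in for the mmap'd table (significant_markers is an illustrative name, and the real pass 2 computes p-values from depth counts rather than reading them directly):

```rust
// Two passes over the table: pass 1 only counts rows, then the corrected
// threshold is derived, then pass 2 streams again and emits qualifying rows.
fn significant_markers(p_values: &[f64], threshold: f64) -> Vec<usize> {
    // Pass 1: count n_markers (fast path, no sequence parsing).
    let n_markers = p_values.len();
    // Bonferroni-corrected per-test threshold.
    let corrected = threshold / n_markers as f64;
    // Pass 2: keep indices of markers below the corrected threshold.
    p_values
        .iter()
        .enumerate()
        .filter(|&(_, &p)| p < corrected)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // 3 markers, alpha = 0.05 -> per-test threshold 0.05 / 3.
    println!("{:?}", significant_markers(&[0.001, 0.2, 0.004], 0.05));
}
```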

1.7.2 External sort for depth

The depth command computes exact per-individual median via external sort when the input exceeds 2GB:

  1. Stream markers, emit (individual_idx, depth) pairs to buffer

  2. When buffer full: sort by (individual_idx, depth), write lz4 temp

  3. K-way merge: depths per individual are contiguous and sorted

  4. Pick middle element for exact median

Memory: O(buffer_size), default 200MB. Auto-detected by file size.
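The final steps can be sketched as below: after the k-way merge each individual's depths arrive contiguous and already sorted, so the exact median falls out of a single linear scan (the pair layout and function name are illustrative; the even-count case here averages the two middle elements, which is one common convention):

```rust
// Input: (individual_idx, depth) pairs sorted by (individual_idx, depth),
// exactly the order the merged temp files produce.
fn per_individual_medians(sorted_pairs: &[(u32, u32)]) -> Vec<(u32, f64)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < sorted_pairs.len() {
        let idx = sorted_pairs[i].0;
        let start = i;
        // Advance over this individual's contiguous, sorted run of depths.
        while i < sorted_pairs.len() && sorted_pairs[i].0 == idx {
            i += 1;
        }
        let depths = &sorted_pairs[start..i];
        let n = depths.len();
        let median = if n % 2 == 1 {
            depths[n / 2].1 as f64
        } else {
            (depths[n / 2 - 1].1 as f64 + depths[n / 2].1 as f64) / 2.0
        };
        out.push((idx, median));
    }
    out
}

fn main() {
    let pairs = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 4)];
    println!("{:?}", per_individual_medians(&pairs));
}
```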

1.7.3 Streaming accumulators (distrib, freq)

distrib and freq use fixed-size accumulators (O(n_individuals^2) and O(n_individuals) respectively) that do not grow with n_markers. These are inherently streaming.
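A freq-style accumulator can be sketched as a fixed-size presence histogram (an illustration of the O(n_individuals) shape, not the actual rsx-core frequency logic):

```rust
// Histogram over "number of individuals in which a marker is present":
// n_individuals + 1 buckets, allocated once, regardless of how many
// markers are streamed through it.
fn presence_histogram(markers: &[Vec<u32>], n_individuals: usize) -> Vec<u64> {
    let mut hist = vec![0u64; n_individuals + 1];
    for depths in markers {
        // A marker is "present" in an individual if its depth is nonzero.
        let present = depths.iter().filter(|&&d| d > 0).count();
        hist[present] += 1;
    }
    hist
}

fn main() {
    let markers = [vec![1, 0, 2], vec![0, 0, 0], vec![1, 1, 1]];
    println!("{:?}", presence_histogram(&markers, 3));
}
```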

1.8 Statistics

  • Chi-squared with Yates continuity correction (exact port of C++ stats.cpp)

  • P-value via erfc(sqrt(chi2/2)) (SymPy-derived identity, replaces gamma CDF)

  • Sollya degree-40 minimax polynomial available for GPU (fast_erfc_poly)

  • Bonferroni correction: p_corrected = min(1.0, p * n_markers)

  • Float formatting via Cg struct: matches C++ %g (6 significant digits)
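The chi-squared and Bonferroni steps can be sketched in a few lines (a sketch of the standard Yates-corrected 2x2 formula, not a copy of stats.rs; the p-value step via erfc(sqrt(chi2/2)) is not reimplemented here since erfc is not in Rust's std):

```rust
// Yates-corrected chi-squared for a 2x2 contingency table [[a, b], [c, d]]:
// chi2 = N * (max(|ad - bc| - N/2, 0))^2 / ((a+b)(c+d)(a+c)(b+d))
fn yates_chi2(a: f64, b: f64, c: f64, d: f64) -> f64 {
    let n = a + b + c + d;
    // The continuity correction is clamped at zero so it can't overshoot.
    let diff = ((a * d - b * c).abs() - n / 2.0).max(0.0);
    n * diff * diff / ((a + b) * (c + d) * (a + c) * (b + d))
}

// Bonferroni correction exactly as stated above: clamp p * n_markers at 1.0.
fn bonferroni(p: f64, n_markers: usize) -> f64 {
    (p * n_markers as f64).min(1.0)
}

fn main() {
    let chi2 = yates_chi2(10.0, 20.0, 30.0, 40.0);
    println!("chi2 = {}, corrected p(0.01, 5 tests) = {}", chi2, bonferroni(0.01, 5));
}
```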