Architecture

Author:

Rohit Goswami

1 Architecture

rsx-rs follows the rgpot pattern: a Rust core library with auto-generated C headers, wrapped by a thin CLI binary.

1.1 Crate structure

rsx-rs/
  rsx-core/        Rust library (staticlib + cdylib + lib)
    src/
      lib.rs          Module declarations
      status.rs       C API error handling (status codes, thread-local errors)
      stats.rs        Chi-squared, Bonferroni, group bias, Cg formatter
      popmap.rs       Population map (individual -> group)
      marker.rs       Marker struct (sequence + depths + p-values)
      markers_table.rs  Streaming parser (crossbeam channels)
      io/
        seq_reader.rs   FASTQ/FASTA reader (needletail)
        table_io.rs     TSV read/write, fast integer parsing
      commands/
        process.rs    Parallel file ingestion (rayon + ahash)
        distrib.rs    Marker distribution between groups
        signif.rs     Significant marker extraction (two-pass Bonferroni)
        depth.rs      Per-individual read statistics
        freq.rs       Marker frequency distribution
        map.rs        Genome alignment (minimap2)
        subset.rs     Filtered marker extraction
      c_api/
        types.rs      Opaque handles, constructors, destructors
        commands.rs   extern "C" wrappers for each command
  rsx-cli/         Binary crate (clap subcommands)

1.2 C API contract

Following the metatensor pattern:

  1. All public types are #[repr(C)] – binary-compatible with C

  2. All public functions are extern "C" with catch_unwind for panic safety

  3. Status code first: every function returns rsx_status_t

  4. Thread-local errors: details via rsx_last_error()

  5. Naming: rsx_ prefix, SCREAMING_SNAKE_CASE enums, _t suffix structs
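A minimal sketch of this contract, assuming illustrative names (RsxStatus, rsx_example_divide, and the error store behind rsx_last_error are made up for this example and are not the actual rsx.h surface):

```rust
use std::cell::RefCell;
use std::panic::catch_unwind;

// Illustrative status enum following the contract: #[repr(C)], SCREAMING_SNAKE_CASE.
#[allow(non_camel_case_types)]
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum RsxStatus {
    RSX_SUCCESS = 0,
    RSX_INVALID_ARGUMENT = 1,
    RSX_INTERNAL_PANIC = 2,
}

thread_local! {
    // Per-thread error detail, the kind of storage rsx_last_error() would read.
    static LAST_ERROR: RefCell<String> = RefCell::new(String::new());
}

fn set_last_error(msg: &str) {
    LAST_ERROR.with(|e| *e.borrow_mut() = msg.to_string());
}

// Status-code-first wrapper: the body runs under catch_unwind so a panic
// can never unwind across the FFI boundary.
#[no_mangle]
pub extern "C" fn rsx_example_divide(num: i64, den: i64, out: *mut i64) -> RsxStatus {
    let result = catch_unwind(|| {
        if out.is_null() || den == 0 {
            set_last_error("null output pointer or division by zero");
            return RsxStatus::RSX_INVALID_ARGUMENT;
        }
        unsafe { *out = num / den };
        RsxStatus::RSX_SUCCESS
    });
    result.unwrap_or_else(|_| {
        set_last_error("panic inside rsx_example_divide");
        RsxStatus::RSX_INTERNAL_PANIC
    })
}

fn main() {
    let mut v: i64 = 0;
    let status = rsx_example_divide(10, 2, &mut v);
    println!("status = {:?}, v = {}", status, v);
}
```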

The C header is auto-generated by cbindgen:

cargo build --manifest-path rsx-core/Cargo.toml --features gen-header
# Produces rsx-core/include/rsx.h

1.3 Data flow

FASTQ files --[process]--> markers.tsv --[distrib/signif/freq/depth/subset]--> analysis.tsv
                                       --[map + genome.fa]--> aligned.tsv

The process command is the only one that reads raw sequence files. All other commands consume the markers depth table (TSV).

1.4 Streaming parser

The markers table parser uses a producer-consumer pattern:

  • Producer thread: reads the TSV file in 64KB chunks, parses fields character-by-character (matching the C++ approach), and sends Marker batches through a crossbeam::channel::bounded channel

  • Consumer: receives batches via recv() (blocking, no polling)

This replaces the C++ std::queue + std::mutex + sleep(10us) pattern.
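The pattern can be sketched as follows; std's sync_channel stands in for crossbeam::channel::bounded so the sketch has no dependencies, and the Marker struct is a toy version, but the backpressure behaviour is the same (the producer blocks when the buffer is full, the consumer blocks in recv with no polling):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Toy Marker standing in for the real struct (sequence + depths).
struct Marker {
    seq: String,
    depths: Vec<u32>,
}

fn run_pipeline() -> usize {
    // Bounded channel of Marker batches, capacity 4.
    let (tx, rx) = sync_channel::<Vec<Marker>>(4);

    let producer = thread::spawn(move || {
        for batch_id in 0..3u32 {
            // In rsx-core each batch would come from parsing a 64KB TSV chunk.
            let batch: Vec<Marker> = (0..2u32)
                .map(|i| Marker {
                    seq: format!("ACGT{}{}", batch_id, i),
                    depths: vec![batch_id, i],
                })
                .collect();
            tx.send(batch).unwrap(); // blocks if the channel is full
        }
        // Dropping tx closes the channel; recv() then returns Err.
    });

    let mut total = 0;
    while let Ok(batch) = rx.recv() {
        // Blocking receive: no queue + mutex + sleep loop.
        total += batch.len();
    }
    producer.join().unwrap();
    total
}

fn main() {
    println!("consumed {} markers", run_pipeline());
}
```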

1.5 Parallelism

process uses rayon::par_iter() over input files. Each file is processed independently (count sequences with needletail, accumulate in ahash::AHashMap). For >= 8 individuals, a DashMap enables lock-free concurrent insertion (no sequential merge phase). Sequences are stored as 2-bit packed DNA (4x key compression: 100bp -> 26 bytes).
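The 2-bit packing can be illustrated as below (an illustrative encoding with A=00, C=01, G=10, T=11, not necessarily the exact rsx-core layout; the ~26 bytes quoted above presumably includes length metadata on top of the 25 packed bytes):

```rust
// Pack a DNA string at four bases per byte.
fn pack_dna(seq: &str) -> Vec<u8> {
    let mut out = vec![0u8; (seq.len() + 3) / 4];
    for (i, base) in seq.bytes().enumerate() {
        let code = match base {
            b'A' => 0u8,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => 0, // the real reader would reject ambiguous bases
        };
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

// Recover the sequence; the original length must be stored alongside
// the packed bytes, since 4 bases share each byte.
fn unpack_dna(packed: &[u8], len: usize) -> String {
    (0..len)
        .map(|i| {
            let code = (packed[i / 4] >> ((i % 4) * 2)) & 0b11;
            ['A', 'C', 'G', 'T'][code as usize]
        })
        .collect()
}

fn main() {
    let packed = pack_dna("ACGTTGCA");
    println!("8 bases -> {} bytes -> {}", packed.len(), unpack_dna(&packed, 8));
}
```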

1.6 External sort-merge (merge command)

The merge command handles datasets that exceed available RAM (75M+ unique sequences across hundreds of samples). Algorithm:

  1. Stream all input files, buffering up to N entries in memory

  2. When buffer full: sort by 2-bit packed sequence, write lz4 temp file

  3. K-way merge sorted temp files using BinaryHeap (min-heap)

  4. Coalesce entries with equal sequence keys, merge depth columns

  5. Write TSV output with unified sample columns

Memory: O(buffer_size), not O(total_sequences). Default buffer: 2M entries (~400MB). Temp files: lz4-compressed sorted chunks on disk.
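Step 3 and 4 can be sketched as a k-way merge with coalescing; here each sorted run is an in-memory Vec standing in for an lz4 temp file, and entries are simplified to (sequence_key, depth) pairs rather than full depth columns:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// K-way merge of sorted runs via a min-heap (BinaryHeap over Reverse).
// Heap entries are (key, depth, run_index, position_in_run).
fn kway_merge(runs: Vec<Vec<(u64, u32)>>) -> Vec<(u64, u32)> {
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&(key, depth)) = run.first() {
            heap.push(Reverse((key, depth, run_idx, 0usize)));
        }
    }
    let mut out: Vec<(u64, u32)> = Vec::new();
    while let Some(Reverse((key, depth, run_idx, pos))) = heap.pop() {
        match out.last_mut() {
            // Coalesce entries with equal sequence keys.
            Some(last) if last.0 == key => last.1 += depth,
            _ => out.push((key, depth)),
        }
        // Refill the heap from the run we just consumed from.
        if let Some(&(k, d)) = runs[run_idx].get(pos + 1) {
            heap.push(Reverse((k, d, run_idx, pos + 1)));
        }
    }
    out
}

fn main() {
    let merged = kway_merge(vec![vec![(1, 2), (3, 1)], vec![(1, 5), (2, 4)]]);
    println!("{:?}", merged);
}
```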

1.7 Bounded-memory streaming (all commands)

All commands operate in bounded memory regardless of input size.

1.7.1 Two-pass streaming (signif, subset, map)

Commands requiring Bonferroni correction use two passes over the mmap’d table:

  1. Pass 1: count n_markers (fast path, no sequence parsing)

  2. Compute corrected_threshold = threshold / n_markers

  3. Pass 2: re-mmap the same file (served from the kernel page cache, so no extra disk I/O), compute p-values, write qualifying markers directly to output

Memory: O(n_individuals) for bitset masks + output buffer. No Vec<Marker> accumulation.

The map command builds the minimap2 index between passes (~100-300MB), then aligns and writes in pass 2.
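A minimal sketch of the two-pass Bonferroni scheme, with an in-memory p-value slice standing in for the mmap'd table (significant_markers is an illustrative name, and the real pass 2 computes p-values from depth counts rather than reading them directly):

```rust
// Two passes over the table: pass 1 only counts rows, then the corrected
// threshold is derived, then pass 2 streams again and emits qualifying rows.
fn significant_markers(p_values: &[f64], threshold: f64) -> Vec<usize> {
    // Pass 1: count n_markers (fast path, no sequence parsing).
    let n_markers = p_values.len();
    // Bonferroni-corrected per-test threshold.
    let corrected = threshold / n_markers as f64;
    // Pass 2: keep indices of markers below the corrected threshold.
    p_values
        .iter()
        .enumerate()
        .filter(|&(_, &p)| p < corrected)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    // 3 markers, alpha = 0.05 -> per-test threshold 0.05 / 3.
    println!("{:?}", significant_markers(&[0.001, 0.2, 0.004], 0.05));
}
```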

1.7.2 External sort for depth

The depth command computes exact per-individual median via external sort when the input exceeds 2GB:

  1. Stream markers, emit (individual_idx, depth) pairs to buffer

  2. When buffer full: sort by (individual_idx, depth), write lz4 temp

  3. K-way merge: depths per individual are contiguous and sorted

  4. Pick middle element for exact median

Memory: O(buffer_size), default 200MB. Auto-detected by file size.
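The final steps can be sketched as below: after the k-way merge each individual's depths arrive contiguous and already sorted, so the exact median falls out of a single linear scan (the pair layout and function name are illustrative; the even-count case here averages the two middle elements, which is one common convention):

```rust
// Input: (individual_idx, depth) pairs sorted by (individual_idx, depth),
// exactly the order the merged temp files produce.
fn per_individual_medians(sorted_pairs: &[(u32, u32)]) -> Vec<(u32, f64)> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < sorted_pairs.len() {
        let idx = sorted_pairs[i].0;
        let start = i;
        // Advance over this individual's contiguous, sorted run of depths.
        while i < sorted_pairs.len() && sorted_pairs[i].0 == idx {
            i += 1;
        }
        let depths = &sorted_pairs[start..i];
        let n = depths.len();
        let median = if n % 2 == 1 {
            depths[n / 2].1 as f64
        } else {
            (depths[n / 2 - 1].1 as f64 + depths[n / 2].1 as f64) / 2.0
        };
        out.push((idx, median));
    }
    out
}

fn main() {
    let pairs = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 4)];
    println!("{:?}", per_individual_medians(&pairs));
}
```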

1.7.3 Streaming accumulators (distrib, freq)

distrib and freq use fixed-size accumulators (O(n_individuals^2) and O(n_individuals) respectively) that do not grow with n_markers. These are inherently streaming.
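A freq-style accumulator can be sketched as a fixed-size presence histogram (an illustration of the O(n_individuals) shape, not the actual rsx-core frequency logic):

```rust
// Histogram over "number of individuals in which a marker is present":
// n_individuals + 1 buckets, allocated once, regardless of how many
// markers are streamed through it.
fn presence_histogram(markers: &[Vec<u32>], n_individuals: usize) -> Vec<u64> {
    let mut hist = vec![0u64; n_individuals + 1];
    for depths in markers {
        // A marker is "present" in an individual if its depth is nonzero.
        let present = depths.iter().filter(|&&d| d > 0).count();
        hist[present] += 1;
    }
    hist
}

fn main() {
    let markers = [vec![1, 0, 2], vec![0, 0, 0], vec![1, 1, 1]];
    println!("{:?}", presence_histogram(&markers, 3));
}
```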

1.8 Statistics

  • Chi-squared with Yates continuity correction (exact port of C++ stats.cpp)

  • P-value via erfc(sqrt(chi2/2)) (SymPy-derived identity, replaces gamma CDF)

  • Sollya degree-40 minimax polynomial available for GPU (fast_erfc_poly)

  • Bonferroni correction: p_corrected = min(1.0, p * n_markers)

  • Float formatting via Cg struct: matches C++ %g (6 significant digits)
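The chi-squared and Bonferroni steps can be sketched in a few lines (a sketch of the standard Yates-corrected 2x2 formula, not a copy of stats.rs; the p-value step via erfc(sqrt(chi2/2)) is not reimplemented here since erfc is not in Rust's std):

```rust
// Yates-corrected chi-squared for a 2x2 contingency table [[a, b], [c, d]]:
// chi2 = N * (max(|ad - bc| - N/2, 0))^2 / ((a+b)(c+d)(a+c)(b+d))
fn yates_chi2(a: f64, b: f64, c: f64, d: f64) -> f64 {
    let n = a + b + c + d;
    // The continuity correction is clamped at zero so it can't overshoot.
    let diff = ((a * d - b * c).abs() - n / 2.0).max(0.0);
    n * diff * diff / ((a + b) * (c + d) * (a + c) * (b + d))
}

// Bonferroni correction exactly as stated above: clamp p * n_markers at 1.0.
fn bonferroni(p: f64, n_markers: usize) -> f64 {
    (p * n_markers as f64).min(1.0)
}

fn main() {
    let chi2 = yates_chi2(10.0, 20.0, 30.0, 40.0);
    println!("chi2 = {}, corrected p(0.01, 5 tests) = {}", chi2, bonferroni(0.01, 5));
}
```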