Command Reference

Author:

Rohit Goswami

1 Command Reference

All commands match the original C++ RADSex CLI interface exactly.

1.1 process

Build a marker depth table from demultiplexed reads.

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth]

Flag

Default

Description

-i

Directory of FASTQ/FASTA files (gzip supported)

-o

Output markers depth table

-T

1

Number of threads

-d

1

Minimum depth to retain a marker

Supported extensions: .fq, .fastq, .fasta, .fa, .fna (optionally .gz). Individual names are derived from filenames.

1.2 distrib

Compute marker distribution between two groups.

rsx distrib -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] [-C]

Flag

Default

Description

-t

Markers depth table from process

-p

Population map (individualtgroup)

-o

Output distribution table

-d

1

Minimum depth

-G

auto

Groups to compare (comma-separated)

-S

0.05

Significance threshold

-C

false

Disable Bonferroni correction

1.3 signif

Extract markers significantly associated with a group.

rsx signif -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
  [--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] [--bayes] [-a]

Flag

Default

Description

--correction

bonferroni

Multiple testing: bonferroni, fdr, none

--test

chisq

Statistical test: chisq, fisher, gtest

--bayes

Add Bayes Factor + posterior P(sex-linked) cols

-a

Output FASTA instead of table

Two-pass streaming: pass 1 counts markers, pass 2 writes directly. O(nindividuals) memory. FDR correction collects p-values first (uses more memory but still bounded).

1.4 process

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth] [-k kmer_size]

Optional -k / --kmer-dedup: group markers by canonical k-mer of the given size before output. Collapses sequencing error variants, reducing the number of markers tested (more statistical power). The k-mer deduplication keeps the representative marker with highest total depth from each group.

1.5 freq

Compute marker frequency distribution.

rsx freq -t <table.tsv> -o <output.tsv> [-d 5]

1.6 depth

Compute retained read statistics per individual.

rsx depth -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-f 0.75]

Flag

Default

Description

-f

0.75

Min fraction of individuals for a marker to count

Auto-detects large files (> 2GB) and switches to streaming mode with external sort for exact median. Sparse optimization: only non-zero depths are sorted (3.3x reduction for typical RAD-seq data). See scripts/sympy/sparse_median_proof.py for the mathematical proof.

1.7 map

Align markers to a reference genome.

rsx map -t <table.tsv> -p <popmap.tsv> -g <genome.fa> -o <output.tsv> \
  [-d 5] [-G M,F] [-q 20] [-Q 0.1] [-S 0.05] [-C]

Flag

Default

Description

-g

Reference genome (FASTA)

-q

20

Minimum mapping quality

-Q

0.1

Minimum marker frequency

Uses minimap2 with sr (short-read) preset. Two-pass streaming: pass 1 counts markers, pass 2 aligns and writes directly. Only available when built with the map feature (default on, not on Windows).

Note: the minimap2 genome index can be large (e.g. 21GB for salmon genomes). This is intrinsic to the aligner, not rsx.

1.8 subset

Extract a filtered subset of markers.

rsx subset -t <table.tsv> -p <popmap.tsv> -o <output.tsv> \
  [-d 5] [-G M,F] [-m min_g1] [-M max_g1] [-n min_g2] [-N max_g2] \
  [-i min_ind] [-I max_ind] [-S 0.05] [-C] [-a]

Two-pass streaming like signif. O(nindividuals) memory.

1.9 merge

Merge multiple marker depth tables by sequence identity.

rsx merge -o <output.tsv> <table1.tsv> <table2.tsv> [table3.tsv ...] [-B 2000000]

Flag

Default

Description

-o

Output merged markers table

-B

2,000,000

Entries to buffer before flushing to disk

Uses external sort-merge with bounded memory (~500MB). Input files are positional arguments (supports shell glob). Sequences are deduplicated across files using 2-bit DNA encoding. Depths from the same individual appearing in multiple files are summed.

The algorithm:

  1. Reads all files streaming, buffers entries in RAM

  2. When buffer full: sorts by packed sequence, writes lz4 temp file

  3. K-way merges sorted temp files, coalesces equal sequences

  4. Writes merged TSV output

Handles 75M+ unique sequences across hundreds of samples without OOM. Adjust -B to trade memory for fewer temp files (higher) or less RAM (lower).

Optional Parquet output (requires --features parquet-io):

rsx merge -o <output.parquet> --output-parquet <table1.tsv> <table2.tsv>

Parquet uses ZSTD compression with nullable u16 depth columns. Sparse zeros stored as null for efficient RLE encoding.

1.10 pca

Streaming PCA of the depth matrix (Tucker mode-2 decomposition).

rsx pca -t <table.tsv> -o <output_dir/> [-d 5] [-r 10]

Flag

Default

Description

-t

Markers depth table

-o

Output directory

-d

1

Minimum depth

-r

all

Number of principal components to output

Computes principal components by streaming the Gram matrix XTX (nindividualsx nindividuals) and eigendecomposing it. This is mathematically equivalent to Tucker HOSVD mode-2 decomposition.

Memory: O(nindividuals2) = 320KB for 200 individuals, regardless of the number of markers. Works on arbitrarily large tables.

Output files:

  • eigenvalues.tsv: component, eigenvalue, variancefraction, cumulative

  • loadings.tsv: individual x component loading matrix

  • summary.txt: total variance, top components

Use cases:

  • Sex signal detection: tests whether PC1 separates M/F. On real-world RAD-seq data, the answer is no – sex-linked markers are only 0.03% of the depth matrix, drowned by library size and population effects. This negative finding justifies targeted methods (FST, chi-squared).

  • Sample QC: outlier individuals on PC2-3 indicate sequencing problems or batch effects.

  • Population structure: on combined multi-population tables, PC1 separates populations, not sexes. Confirms per-population analysis is the correct design for sex detection.

  • Rank estimation: eigenvalue spectrum reveals effective dimensionality. For RAD-seq depth data, typically 3-5 components explain > 90% of variance (low-rank, driven by restriction site conservation patterns).

See scripts/sympy/tucker_covariance_proof.py for the mathematical proof that this gives exact Tucker mode-2 factors.