Command Reference ¶

Author:

Rohit Goswami

1 Command Reference ¶

All commands match the original C++ RADSex CLI interface exactly.

1.1 process ¶

Build a marker depth table from demultiplexed reads.

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth]

Flag	Default	Description
`-i`		Directory of FASTQ/FASTA files (gzip supported)
`-o`		Output markers depth table
`-T`	1	Number of threads
`-d`	1	Minimum depth to retain a marker

Supported extensions: .fq, .fastq, .fasta, .fa, .fna (optionally .gz). Individual names are derived from filenames.

1.2 distrib ¶

Compute marker distribution between two groups.

rsx distrib -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] [-C]

Flag	Default	Description
`-t`		Markers depth table from `process`
`-p`		Population map (individualtgroup)
`-o`		Output distribution table
`-d`	1	Minimum depth
`-G`	auto	Groups to compare (comma-separated)
`-S`	0.05	Significance threshold
`-C`	false	Disable Bonferroni correction

1.3 signif ¶

Extract markers significantly associated with a group.

rsx signif -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
  [--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] [--bayes] [-a]

Flag	Default	Description
`--correction`	bonferroni	Multiple testing: bonferroni, fdr, none
`--test`	chisq	Statistical test: chisq, fisher, gtest
`--bayes`		Add Bayes Factor + posterior P(sex-linked) cols
`-a`		Output FASTA instead of table

Two-pass streaming: pass 1 counts markers, pass 2 writes directly. O(n_individuals) memory. FDR correction collects p-values first (uses more memory but still bounded).

1.4 process ¶

rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth] [-k kmer_size]

Optional -k / --kmer-dedup: group markers by canonical k-mer of the given size before output. Collapses sequencing error variants, reducing the number of markers tested (more statistical power). The k-mer deduplication keeps the representative marker with highest total depth from each group.

1.5 freq ¶

Compute marker frequency distribution.

rsx freq -t <table.tsv> -o <output.tsv> [-d 5]

1.6 depth ¶

Compute retained read statistics per individual.

rsx depth -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-f 0.75]

Flag	Default	Description
`-f`	0.75	Min fraction of individuals for a marker to count

Auto-detects large files (> 2GB) and switches to streaming mode with external sort for exact median. Sparse optimization: only non-zero depths are sorted (3.3x reduction for typical RAD-seq data). See scripts/sympy/sparse_median_proof.py for the mathematical proof.

1.7 map ¶

Align markers to a reference genome.

rsx map -t <table.tsv> -p <popmap.tsv> -g <genome.fa> -o <output.tsv> \
  [-d 5] [-G M,F] [-q 20] [-Q 0.1] [-S 0.05] [-C]

Flag	Default	Description
`-g`		Reference genome (FASTA)
`-q`	20	Minimum mapping quality
`-Q`	0.1	Minimum marker frequency

Uses minimap2 with sr (short-read) preset. Two-pass streaming: pass 1 counts markers, pass 2 aligns and writes directly. Only available when built with the map feature (default on, not on Windows).

Note: the minimap2 genome index can be large (e.g. 21GB for salmon genomes). This is intrinsic to the aligner, not rsx.

1.8 subset ¶

Extract a filtered subset of markers.

rsx subset -t <table.tsv> -p <popmap.tsv> -o <output.tsv> \
  [-d 5] [-G M,F] [-m min_g1] [-M max_g1] [-n min_g2] [-N max_g2] \
  [-i min_ind] [-I max_ind] [-S 0.05] [-C] [-a]

Two-pass streaming like signif. O(n_individuals) memory.

1.9 merge ¶

Merge multiple marker depth tables by sequence identity.

rsx merge -o <output.tsv> <table1.tsv> <table2.tsv> [table3.tsv ...] [-B 2000000]

Flag	Default	Description
`-o`		Output merged markers table
`-B`	2,000,000	Entries to buffer before flushing to disk

Uses external sort-merge with bounded memory (~500MB). Input files are positional arguments (supports shell glob). Sequences are deduplicated across files using 2-bit DNA encoding. Depths from the same individual appearing in multiple files are summed.

The algorithm:

Reads all files streaming, buffers entries in RAM
When buffer full: sorts by packed sequence, writes lz4 temp file
K-way merges sorted temp files, coalesces equal sequences
Writes merged TSV output

Handles 75M+ unique sequences across hundreds of samples without OOM. Adjust -B to trade memory for fewer temp files (higher) or less RAM (lower).

Optional Parquet output (requires --features parquet-io):

rsx merge -o <output.parquet> --output-parquet <table1.tsv> <table2.tsv>

Parquet uses ZSTD compression with nullable u16 depth columns. Sparse zeros stored as null for efficient RLE encoding.

1.10 pca ¶

Streaming PCA of the depth matrix (Tucker mode-2 decomposition).

rsx pca -t <table.tsv> -o <output_dir/> [-d 5] [-r 10]

Flag	Default	Description
`-t`		Markers depth table
`-o`		Output directory
`-d`	1	Minimum depth
`-r`	all	Number of principal components to output

Computes principal components by streaming the Gram matrix X^TX (n_individualsx n_individuals) and eigendecomposing it. This is mathematically equivalent to Tucker HOSVD mode-2 decomposition.

Memory: O(n_individuals²) = 320KB for 200 individuals, regardless of the number of markers. Works on arbitrarily large tables.

Output files:

eigenvalues.tsv: component, eigenvalue, variance_fraction, cumulative
loadings.tsv: individual x component loading matrix
summary.txt: total variance, top components

Use cases:

Sex signal detection: tests whether PC1 separates M/F. On real-world RAD-seq data, the answer is no – sex-linked markers are only 0.03% of the depth matrix, drowned by library size and population effects. This negative finding justifies targeted methods (FST, chi-squared).
Sample QC: outlier individuals on PC2-3 indicate sequencing problems or batch effects.
Population structure: on combined multi-population tables, PC1 separates populations, not sexes. Confirms per-population analysis is the correct design for sex detection.
Rank estimation: eigenvalue spectrum reveals effective dimensionality. For RAD-seq depth data, typically 3-5 components explain > 90% of variance (low-rank, driven by restriction site conservation patterns).

See scripts/sympy/tucker_covariance_proof.py for the mathematical proof that this gives exact Tucker mode-2 factors.