Command Reference¶
- Author:
Rohit Goswami
1 Command Reference¶
All commands match the original C++ RADSex CLI interface exactly.
1.1 process¶
Build a marker depth table from demultiplexed reads.
rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth]
Flag |
Default |
Description |
|---|---|---|
|
Directory of FASTQ/FASTA files (gzip supported) |
|
|
Output markers depth table |
|
|
1 |
Number of threads |
|
1 |
Minimum depth to retain a marker |
Supported extensions: .fq, .fastq, .fasta, .fa, .fna (optionally .gz).
Individual names are derived from filenames.
1.2 distrib¶
Compute marker distribution between two groups.
rsx distrib -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] [-C]
Flag |
Default |
Description |
|---|---|---|
|
Markers depth table from |
|
|
Population map (individualtgroup) |
|
|
Output distribution table |
|
|
1 |
Minimum depth |
|
auto |
Groups to compare (comma-separated) |
|
0.05 |
Significance threshold |
|
false |
Disable Bonferroni correction |
1.3 signif¶
Extract markers significantly associated with a group.
rsx signif -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-d 5] [-G M,F] [-S 0.05] \
[--correction bonferroni|fdr|none] [--test chisq|fisher|gtest] [--bayes] [-a]
Flag |
Default |
Description |
|---|---|---|
|
bonferroni |
Multiple testing: bonferroni, fdr, none |
|
chisq |
Statistical test: chisq, fisher, gtest |
|
Add Bayes Factor + posterior P(sex-linked) cols |
|
|
Output FASTA instead of table |
Two-pass streaming: pass 1 counts markers, pass 2 writes directly. O(nindividuals) memory. FDR correction collects p-values first (uses more memory but still bounded).
1.4 process¶
rsx process -i <input_dir> -o <output.tsv> [-T threads] [-d min_depth] [-k kmer_size]
Optional -k / --kmer-dedup: group markers by canonical k-mer of the
given size before output. Collapses sequencing error variants, reducing
the number of markers tested (more statistical power). The k-mer
deduplication keeps the representative marker with highest total depth
from each group.
1.5 freq¶
Compute marker frequency distribution.
rsx freq -t <table.tsv> -o <output.tsv> [-d 5]
1.6 depth¶
Compute retained read statistics per individual.
rsx depth -t <table.tsv> -p <popmap.tsv> -o <output.tsv> [-f 0.75]
Flag |
Default |
Description |
|---|---|---|
|
0.75 |
Min fraction of individuals for a marker to count |
Auto-detects large files (> 2GB) and switches to streaming mode with
external sort for exact median. Sparse optimization: only non-zero
depths are sorted (3.3x reduction for typical RAD-seq data). See
scripts/sympy/sparse_median_proof.py for the mathematical proof.
1.7 map¶
Align markers to a reference genome.
rsx map -t <table.tsv> -p <popmap.tsv> -g <genome.fa> -o <output.tsv> \
[-d 5] [-G M,F] [-q 20] [-Q 0.1] [-S 0.05] [-C]
Flag |
Default |
Description |
|---|---|---|
|
Reference genome (FASTA) |
|
|
20 |
Minimum mapping quality |
|
0.1 |
Minimum marker frequency |
Uses minimap2 with sr (short-read) preset. Two-pass streaming:
pass 1 counts markers, pass 2 aligns and writes directly. Only
available when built with the map feature (default on, not on Windows).
Note: the minimap2 genome index can be large (e.g. 21GB for salmon genomes). This is intrinsic to the aligner, not rsx.
1.8 subset¶
Extract a filtered subset of markers.
rsx subset -t <table.tsv> -p <popmap.tsv> -o <output.tsv> \
[-d 5] [-G M,F] [-m min_g1] [-M max_g1] [-n min_g2] [-N max_g2] \
[-i min_ind] [-I max_ind] [-S 0.05] [-C] [-a]
Two-pass streaming like signif. O(nindividuals) memory.
1.9 merge¶
Merge multiple marker depth tables by sequence identity.
rsx merge -o <output.tsv> <table1.tsv> <table2.tsv> [table3.tsv ...] [-B 2000000]
Flag |
Default |
Description |
|---|---|---|
|
Output merged markers table |
|
|
2,000,000 |
Entries to buffer before flushing to disk |
Uses external sort-merge with bounded memory (~500MB). Input files are positional arguments (supports shell glob). Sequences are deduplicated across files using 2-bit DNA encoding. Depths from the same individual appearing in multiple files are summed.
The algorithm:
Reads all files streaming, buffers entries in RAM
When buffer full: sorts by packed sequence, writes lz4 temp file
K-way merges sorted temp files, coalesces equal sequences
Writes merged TSV output
Handles 75M+ unique sequences across hundreds of samples without OOM.
Adjust -B to trade memory for fewer temp files (higher) or less RAM
(lower).
Optional Parquet output (requires --features parquet-io):
rsx merge -o <output.parquet> --output-parquet <table1.tsv> <table2.tsv>
Parquet uses ZSTD compression with nullable u16 depth columns. Sparse zeros stored as null for efficient RLE encoding.
1.10 pca¶
Streaming PCA of the depth matrix (Tucker mode-2 decomposition).
rsx pca -t <table.tsv> -o <output_dir/> [-d 5] [-r 10]
Flag |
Default |
Description |
|---|---|---|
|
Markers depth table |
|
|
Output directory |
|
|
1 |
Minimum depth |
|
all |
Number of principal components to output |
Computes principal components by streaming the Gram matrix XTX (nindividualsx nindividuals) and eigendecomposing it. This is mathematically equivalent to Tucker HOSVD mode-2 decomposition.
Memory: O(nindividuals2) = 320KB for 200 individuals, regardless of the number of markers. Works on arbitrarily large tables.
Output files:
eigenvalues.tsv: component, eigenvalue, variancefraction, cumulativeloadings.tsv: individual x component loading matrixsummary.txt: total variance, top components
Use cases:
Sex signal detection: tests whether PC1 separates M/F. On real-world RAD-seq data, the answer is no – sex-linked markers are only 0.03% of the depth matrix, drowned by library size and population effects. This negative finding justifies targeted methods (FST, chi-squared).
Sample QC: outlier individuals on PC2-3 indicate sequencing problems or batch effects.
Population structure: on combined multi-population tables, PC1 separates populations, not sexes. Confirms per-population analysis is the correct design for sex detection.
Rank estimation: eigenvalue spectrum reveals effective dimensionality. For RAD-seq depth data, typically 3-5 components explain > 90% of variance (low-rank, driven by restriction site conservation patterns).
See scripts/sympy/tucker_covariance_proof.py for the mathematical
proof that this gives exact Tucker mode-2 factors.