Usage¶
Preparation¶
STRique works on raw nanopore read data in either single or bulk fast5 format. Batches of single reads in tar archives work as well. Before starting the repeat detection the raw data folder must be indexed to enable extraction of single reads. The index file contains relative paths to the reads and must be saved in the indexed directory.
python3 scripts/STRique.py index [OPTIONS] input positional arguments: input Input batch or directory of batches optional arguments: --recursive Recursively scan input --out_prefix OUT_PREFIX Prefix for file paths in output --tmp_prefix TMP_PREFIX Prefix for temporary data
The command to recursively index the raw data archive could look similar to the following. Indexing is only required once after the sequencing is completed.
python3 ~/src/STRique/scripts/STRique.py index \ --recursive ~/my_data > ~/my_data/reads.fofn
The index file reads.fofn contains relative paths to the raw files and must therefore be saved in the right location. You can configure an --out_prefix to be prepended to each entry in the index. This is useful to index sub-directories while storing indices at a central location e.g.:
python3 ~/src/STRique/scripts/STRique.py index \ --recursive --out_prefix my_sample ~/my_data/my_sample > ~/my_data/my_sample.fofn
Configuration¶
Targeted repeats are configured in a tab-separated (.tsv) file with columns
chr begin end name repeat prefix suffix
The file must have the header line and can contain as many repeats as present/targeted by enrichment. For the hexanucleotide repeat at the c9orf72 locus the (truncated) config for hg19 alignments looks like this (A complete example file is in the config folder of the STRique repository):
chr9 27573527 27573544 c9orf72 GGCCCC ...GCCCCGACCACGCCCC TAGCGCGCGACTCCTG...
STRique will only consider aligned reads where the mapping including soft-clipping at least partially covers one of the configured targets. The longer the prefix/ suffix sequences, the more reliable the signal alignment at the cost of a longer runtime. A good estimate is 150 Bp for prefix and suffix. Repeat, prefix and suffix sequence are always on template strand.
Repeat Counting¶
The repeat detection requires an indexed raw data archive and the alignment of the reads:
python3 scripts/STRique.py count [OPTIONS] f5 model repeat positional arguments: f5Index Fast5 index model pore model repeat repeat region config file optional arguments: --out OUT output file name, if not given print to stdout --algn ALGN alignment in sam format, if not given read from stdin --mod_model MOD_MODEL Base modification pore model --config CONFIG Config file with HMM transition probabilities --t T Number of processes to use in parallel --log_level Detailed output {error,warning,info,debug}
The command to detect repeat lengths could look similar to:
cat ~/my_data.hg19.sam | python3 ~/src/STRique/scripts/STRique.py count \ ~/my_data/reads.fofn ~/src/STRique/models/r9_4_450bps.model \ ~/src/STRique/configs/repeat_config.tsv \ > ~/my_data.hg19.strique.tsv
For Docker users the container needs to mount the host file system to access the raw data. To process the same dataset as above run:
docker run -it --mount type=bind,source=${HOME},target=/host \ giesselmann/strique:v0.3.0 cat /host/my_data.hg19.sam | python3 scripts/STRique.py count \ /host/my_data/reads.fofn models/r9_4_450bps.model \ configs/repeat_config.tsv > /host/my_data.hg19.strique.tsv
Please note that changes made to the Docker filesystem are not persistent and will be lost after exiting the container. Make sure to write the output to a file on the host.