CLI Documentation¶
ccanalyser¶
An end-to-end solution for processing Capture-C, Tri-C and Tiled-C data.
ccanalyser [OPTIONS] COMMAND [ARGS]...
Options
- --version¶
Show the version and exit.
alignments¶
Alignment annotation, identification and deduplication.
ccanalyser alignments [OPTIONS] COMMAND [ARGS]...
annotate¶
Annotates a bed file with other bed files using bedtools intersect.
Whilst bedtools intersect allows either interval names or counts to be used for annotating intervals, this command can annotate intervals with both interval names and counts at the same time. As the pipeline allows for empty bed files, this command has built-in support for blank/malformed bed files and will return default N/A values.
Prior to interval annotation, the bed file to be intersected is validated and duplicate entries/multimapping reads are removed to ensure consistent annotations and prevent issues with reporter identification.
ccanalyser alignments annotate [OPTIONS] SLICES
Options
- -a, --actions <actions>¶
Determines whether overlaps are counted (count) or the name of the overlapping interval is reported (get).
- Options
get | count
- -b, --bed_files <bed_files>¶
Bed file(s) to intersect with slices
- -n, --names <names>¶
Names to use as column names for the output tsv file.
- -f, --overlap_fractions <overlap_fractions>¶
The minimum overlap required for an intersection between two intervals to be reported.
- -o, --output <output>¶
Path for the annotated slices to be output.
- --duplicates <duplicates>¶
Method to use for reconciling duplicate slices (i.e. multimapping). Currently only ‘remove’ is supported.
- Options
remove
- -p, --n_cores <n_cores>¶
Intersections are performed per chromosome; this sets the number of cores used.
- --invalid_bed_action <invalid_bed_action>¶
Method to deal with invalid bed files, e.g. blank or incorrectly formatted. Setting this to ‘ignore’ will report default N/A values (either ‘.’ or 0) for invalid files.
- Options
ignore | error
Arguments
- SLICES¶
Required argument
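The combined name-and-count annotation described for annotate can be illustrated with a small pure-Python sketch. It does not call bedtools and is not the command's implementation: the function names and the tuple-based interval representation are assumptions made for illustration. The defaults mirror the documented N/A values (‘.’ for names, 0 for counts).

```python
from typing import Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Fraction of interval `a` covered by interval `b` (0.0 if none)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    length = a[2] - a[1]
    return max(overlap, 0) / length if length > 0 else 0.0


def annotate_slices(slices, features, min_fraction=1e-9):
    """Report the first overlapping feature name and the overlap count."""
    annotated = []
    for s in slices:
        hits = [f for f in features if overlap_fraction(s, f) >= min_fraction]
        name = hits[0][3] if hits else "."  # default N/A value for names
        annotated.append((s[3], name, len(hits)))  # count is 0 without hits
    return annotated
```

Passing an empty `features` list reproduces the behaviour described for blank bed files: every slice receives the default values.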
deduplicate¶
Identifies and removes duplicated aligned fragments.
PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. Unlike fastq deduplicate, this command removes fragments with identical genomic coordinates.
Non-combined (pe) and combined (flashed) reads are treated slightly differently due to the increased confidence that the ligation junction has been captured for the flashed reads.
ccanalyser alignments deduplicate [OPTIONS] COMMAND [ARGS]...
identify¶
ccanalyser alignments deduplicate identify [OPTIONS] FRAGMENTS_FN
Options
- -o, --output <output>¶
Path for outputting fragments with duplicated coordinates in json format.
- --buffer <buffer>¶
Number of fragments to process at one time in order to preserve memory.
- --read_type <read_type>¶
Indicates if the fragments have been combined (flashed) or not (pe).
- Options
flashed | pe
Arguments
- FRAGMENTS_FN¶
Required argument
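As a rough illustration of coordinate-based duplicate identification, the sketch below flags every fragment whose coordinates have already been seen. The function name, the string summary of coordinates and the choice to keep the first occurrence are all assumptions; the different handling of flashed and pe reads is not modelled.

```python
def identify_duplicate_fragments(fragments):
    """Return ids of fragments whose coordinates were already seen.

    `fragments` is an iterable of (fragment_id, coordinates) pairs, where
    `coordinates` is a hashable summary such as
    "chr1:100-200|chr1:5000-5100" (a hypothetical format).
    """
    seen = set()
    duplicates = []
    for fragment_id, coordinates in fragments:
        if coordinates in seen:
            duplicates.append(fragment_id)  # later copy of a known fragment
        else:
            seen.add(coordinates)  # first occurrence is kept
    return duplicates
```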
remove¶
Removes duplicated aligned fragments.
Parses a tsv file containing aligned read slices and outputs only slices from unique fragments. Duplicated parental read ids determined by the “identify” subcommand are located within the slices tsv file and removed.
Outputs statistics for the number of unique slices and the number of duplicate slices identified.
ccanalyser alignments deduplicate remove [OPTIONS] SLICES_FN
Options
- -d, --duplicated_ids <duplicated_ids>¶
Path to duplicated fragment ids determined by the ‘identify’ subcommand.
- -o, --output <output>¶
Path for outputting deduplicated slices in tsv format.
- --buffer <buffer>¶
Number of fragments to process at one time, in order to preserve memory.
- --stats_prefix <stats_prefix>¶
Output prefix for deduplication statistics
- --sample_name <sample_name>¶
Name of sample being analysed e.g. DOX_treated_1. Required for correct statistics.
- --read_type <read_type>¶
Indicates if the fragments have been combined (flashed) or not (pe). Required for correct statistics.
- Options
flashed | pe
Arguments
- SLICES_FN¶
Required argument
filter¶
Removes unwanted aligned slices and identifies reporters.
Parses a BAM file and merges this with a supplied annotation to identify unwanted slices. Filtering can be tuned for Capture-C, Tri-C and Tiled-C data to ensure optimal filtering.
ccanalyser alignments filter [OPTIONS] {capture|tri|tiled}
Options
- -b, --bam <bam>¶
Required Bam file to process
- -a, --annotations <annotations>¶
Required Annotations for the bam file that must contain the required columns, see description.
- --custom_filtering <custom_filtering>¶
Custom filtering to be used. This must be supplied as a path to a yaml file.
- -o, --output_prefix <output_prefix>¶
Output prefix for the filtered output file(s)
- --stats_prefix <stats_prefix>¶
Output prefix for stats file(s)
- --sample_name <sample_name>¶
Name of sample e.g. DOX_treated_1
- --read_type <read_type>¶
Type of read
- Options
flashed | pe
- --gzip, --no-gzip¶
Determines if files are gzipped or not
Arguments
- METHOD¶
Required argument
fastq¶
Fastq splitting, deduplication and digestion.
ccanalyser fastq [OPTIONS] COMMAND [ARGS]...
deduplicate¶
Identifies PCR duplicate fragments from Fastq files.
PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. These commands attempt to identify and remove duplicate reads/fragments from fastq file(s) to speed up downstream analysis.
ccanalyser fastq deduplicate [OPTIONS] COMMAND [ARGS]...
identify¶
Identifies fragments with duplicated sequences.
Merges the hashed dictionaries (in json format) generated by the “parse” subcommand and identifies reads with exactly the same sequence (i.e. that share an identical hash). Duplicated read identifiers (hashed) are output in json format. The “remove” subcommand uses this dictionary to remove duplicates from fastq files.
ccanalyser fastq deduplicate identify [OPTIONS] [INPUT_FILES]...
Options
- -o, --output <output>¶
Required Output file
Arguments
- INPUT_FILES¶
Optional argument(s)
parse¶
Parses fastq file(s) into an easy-to-deduplicate format.
This command parses one or more fastq files and generates a dictionary containing hashed read identifiers together with hashed concatenated sequences. The hash dictionary is output in json format and the identify subcommand can be used to determine which read identifiers have duplicate sequences.
ccanalyser fastq deduplicate parse [OPTIONS] INPUT_FILES...
Options
- -o, --output <output>¶
File to store hashed sequence identifiers
- --read_buffer <read_buffer>¶
Number of reads to process before writing to file
Arguments
- INPUT_FILES¶
Required argument(s)
remove¶
Removes fragments with duplicated sequences from fastq files.
Parses input fastq files and removes any duplicates from the fastq file(s) that are present in the json file supplied. This json dictionary should be produced by the “identify” subcommand.
Statistics for the number of duplicated and unique reads are also provided.
ccanalyser fastq deduplicate remove [OPTIONS] [INPUT_FILES]...
Options
- -o, --output_prefix <output_prefix>¶
Output prefix for deduplicated fastq file(s)
- -d, --duplicated_ids <duplicated_ids>¶
Path to duplicate ids, identified by the identify subcommand
- --read_buffer <read_buffer>¶
Number of reads to process before writing to file
- --gzip, --no-gzip¶
Determines if files are gzipped or not
- --compression_level <compression_level>¶
Level of compression for output files
- --sample_name <sample_name>¶
Name of sample e.g. DOX_treated_1
- --stats_prefix <stats_prefix>¶
Output prefix for stats file
Arguments
- INPUT_FILES¶
Optional argument(s)
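The parse, identify and remove steps described above can be sketched end to end in a few lines. This is a simplified illustration rather than the tool's implementation: the function names are hypothetical, MD5 is an assumed stand-in for whichever hash the tool uses, and reads are held in memory instead of being streamed from fastq and json files.

```python
import hashlib


def _digest(text):
    """Hash a string to a short fixed-length hexadecimal key."""
    return hashlib.md5(text.encode()).hexdigest()


def parse_reads(reads):
    """Map hashed read name -> hashed concatenated sequence ("parse")."""
    return {_digest(name): _digest("".join(seqs)) for name, seqs in reads}


def identify_duplicates(*parsed):
    """Merge parsed dictionaries and collect ids sharing a sequence hash."""
    seen, duplicates = set(), set()
    for dictionary in parsed:
        for name_hash, seq_hash in dictionary.items():
            if seq_hash in seen:
                duplicates.add(name_hash)  # sequence already observed
            else:
                seen.add(seq_hash)
    return duplicates


def remove_duplicates(reads, duplicates):
    """Keep only reads whose hashed name is not flagged ("remove")."""
    return [(name, seqs) for name, seqs in reads if _digest(name) not in duplicates]
```

Storing hashes rather than full names and sequences keeps the intermediate dictionaries small, which is presumably why the commands exchange hashed json files.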
digest¶
Performs in silico digestion of one or a pair of fastq files.
ccanalyser fastq digest [OPTIONS] INPUT_FASTQ...
Options
- -r, --restriction_enzyme <restriction_enzyme>¶
Required Restriction enzyme name or sequence to use for in silico digestion.
- -m, --mode <mode>¶
Required Digestion mode. Combined (Flashed) or non-combined (PE) read pairs.
- Options
flashed | pe
- -o, --output_file <output_file>¶
- -p, --n_cores <n_cores>¶
- --minimum_slice_length <minimum_slice_length>¶
- --keep_cutsite <keep_cutsite>¶
- --compression_level <compression_level>¶
Level of compression for output files (1=low, 9=high)
- --read_buffer <read_buffer>¶
Number of reads to process before writing to file to conserve memory.
- --stats_prefix <stats_prefix>¶
Output prefix for stats file
- --sample_name <sample_name>¶
Name of sample e.g. DOX_treated_1. Required for correct statistics.
Arguments
- INPUT_FASTQ¶
Required argument(s)
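In silico digestion of a single read amounts to cutting the sequence at every occurrence of the recognition site and discarding slices that are too short. The sketch below assumes a literal recognition sequence (GATC is used only as an example) and an illustrative minimum length of 18; neither the function name nor these defaults come from ccanalyser, and the behaviour of --keep_cutsite is not modelled.

```python
def digest_read(sequence, recognition_site="GATC", minimum_slice_length=18):
    """Split a read at every recognition site and drop short slices."""
    slices = sequence.upper().split(recognition_site.upper())
    return [s for s in slices if len(s) >= minimum_slice_length]
```

A read without a recognition site is returned as a single slice, provided it meets the minimum length.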
split¶
Splits fastq file(s) into equal chunks of n reads.
ccanalyser fastq split [OPTIONS] INPUT_FILES...
Options
- -m, --method <method>¶
Method to use for splitting
- Options
python | unix
- -o, --output_prefix <output_prefix>¶
Output prefix for split fastq file(s)
- --compression_level <compression_level>¶
Level of compression for output files
- -n, --n_reads <n_reads>¶
Number of reads per fastq file
- --gzip, --no-gzip¶
Determines if files are gzipped or not
Arguments
- INPUT_FILES¶
Required argument(s)
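Splitting into chunks of n reads relies on each fastq record spanning exactly four lines. A minimal sketch of that idea, using a hypothetical generator that works on any iterable of lines (file handling and compression are left out):

```python
from itertools import islice


def split_fastq_records(lines, n_reads):
    """Yield lists of fastq lines, each holding at most `n_reads` reads."""
    iterator = iter(lines)
    lines_per_chunk = 4 * n_reads  # a fastq record spans four lines
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk
```

The final chunk holds fewer reads when the total is not a multiple of `n_reads`.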
genome¶
Genome-wide digestion methods.
ccanalyser genome [OPTIONS] COMMAND [ARGS]...
digest¶
Performs in silico digestion of a genome in fasta format.
Digests the supplied genome fasta file and generates a bed file containing the locations of all restriction fragments produced by the supplied restriction enzyme.
A log file recording the number of restriction fragments for the supplied genome is also generated.
ccanalyser genome digest [OPTIONS] INPUT_FASTA
Options
- -r, --recognition_site <recognition_site>¶
Required Restriction enzyme name or recognition sequence
- -l, --logfile <logfile>¶
Path for digestion log file
- -o, --output_file <output_file>¶
Output file path
- --remove_cutsite <remove_cutsite>¶
Exclude the recognition sequence from the output
- --sort¶
Sorts the output bed file by chromosome and start coord.
Arguments
- INPUT_FASTA¶
Required argument
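To make the bed output concrete, the sketch below derives restriction fragment intervals for one sequence. It assumes a literal recognition sequence and, as a simplification, starts each downstream fragment at the first base of the site; the function name and the handling of the cut site are illustrative only and may differ from what the command does (see --remove_cutsite).

```python
import re


def digest_to_bed(chrom, sequence, recognition_site="GATC"):
    """Return (chrom, start, end) restriction fragments for one sequence."""
    pattern = re.escape(recognition_site.upper())
    cut_positions = [m.start() for m in re.finditer(pattern, sequence.upper())]
    boundaries = [0] + cut_positions + [len(sequence)]
    return [
        (chrom, start, end)
        for start, end in zip(boundaries, boundaries[1:])
        if end > start  # skip empty fragments, e.g. a site at position 0
    ]
```

Coordinates are zero-based and half-open, matching the bed convention.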
pipeline¶
Runs the data processing pipeline
ccanalyser pipeline [OPTIONS] {make|plot|show|clone|touch}
[PIPELINE_OPTIONS]...
Options
- -h, --help¶
- --version¶
Show the version and exit.
Arguments
- MODE¶
Required argument
- PIPELINE_OPTIONS¶
Optional argument(s)
reporters¶
Reporter counting, storing, comparison, pileups and heatmaps.
ccanalyser reporters [OPTIONS] COMMAND [ARGS]...
count¶
Determines the number of captured restriction fragment interactions genome-wide.
Parses a reporter slices tsv and counts the number of unique restriction fragment interaction combinations that occur within each fragment.
Options to ignore unwanted counts e.g. excluded regions or capture fragments are provided. In addition the number of reporter fragments can be subsampled if required.
ccanalyser reporters count [OPTIONS] REPORTERS
Options
- -o, --output <output>¶
Name of output file
- --remove_exclusions¶
Prevents analysis of fragments marked as proximity exclusions
- --remove_capture¶
Prevents analysis of capture fragment interactions
- --subsample <subsample>¶
Subsamples reporters before analysis of interactions
Arguments
- REPORTERS¶
Required argument
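Counting interaction combinations can be pictured as grouping slices by their parent fragment and tallying every unordered pair of restriction fragments seen within each group. The sketch below makes that concrete; the (parent read id, restriction fragment id) input format and the function name are assumptions, and the exclusion, capture and subsampling options are not modelled.

```python
from collections import Counter
from itertools import combinations


def count_interactions(slices):
    """Count unique restriction fragment pairs seen within each fragment.

    `slices` is an iterable of (parent_read_id, restriction_fragment_id).
    """
    by_fragment = {}
    for read_id, rf_id in slices:
        by_fragment.setdefault(read_id, set()).add(rf_id)

    counts = Counter()
    for rf_ids in by_fragment.values():
        # Sorting makes (10, 42) and (42, 10) the same key.
        for pair in combinations(sorted(rf_ids), 2):
            counts[pair] += 1
    return counts
```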
differential¶
Identifies differential interactions between conditions.
Parses a union bedgraph containing reporter counts from at least two conditions with two or more replicates for a single capture probe and outputs differential interaction results. Following filtering to ensure that the number of interactions is above the required threshold (--threshold_count), diffxpy is used to run a Wald test after fitting a negative binomial model to the interaction counts. Options to filter results by a minimum mean value (--threshold_mean) and/or a maximum q-value (--threshold_q) are also provided.
Notes:
Currently both the capture viewpoints and the name of the probe being analysed must be provided in order to correctly extract cis interactions.
If a N_SAMPLE * METADATA design matrix has not been supplied, the script assumes that the standard replicate naming structure has been followed i.e. SAMPLE_CONDITION_REPLICATE_(1|2).fastq.gz.
ccanalyser reporters differential [OPTIONS] UNION_BEDGRAPH
Options
- -n, --capture_name <capture_name>¶
Required Name of capture probe, must be present in viewpoint file.
- -c, --capture_viewpoints <capture_viewpoints>¶
Required Path to capture viewpoints bed file
- -o, --output_prefix <output_prefix>¶
Output prefix for pairwise statistical comparisons
- --design_matrix <design_matrix>¶
Path to tsv file containing sample annotations (N_SAMPLES * N_INFO_COLUMNS)
- --grouping_col <grouping_col>¶
Column to use for grouping replicates
- --threshold_count <threshold_count>¶
Minimum count required to be considered for analysis
- --threshold_q <threshold_q>¶
Upper threshold of q-value required for output.
- --threshold_mean <threshold_mean>¶
Minimum mean count required for output.
Arguments
- UNION_BEDGRAPH¶
Required argument
pileup¶
Extracts reporters from a capture experiment and generates a bedgraph file.
Identifies reporters for a single probe (if a probe name is supplied) or all capture probes present in a capture experiment HDF5 file.
The bedgraph generated can be normalised by the number of cis interactions for inter experiment comparisons and/or binned into even genomic windows.
ccanalyser reporters pileup [OPTIONS] COOLER_FN
Options
- -n, --capture_names <capture_names>¶
Capture to extract and convert to bedgraph, if not provided will transform all.
- -o, --output_prefix <output_prefix>¶
Output prefix for bedgraphs
- --normalise¶
Normalised bedgraph (Correct for number of cis reads)
- --binsize <binsize>¶
Binsize to use for converting bedgraph to evenly sized genomic bins
- --gzip¶
Compress output using gzip
- --scale_factor <scale_factor>¶
Scale factor to use for bedgraph normalisation
- --sparse, --dense¶
Produce bedgraph containing just positive bins (sparse) or all bins (dense)
Arguments
- COOLER_FN¶
Required argument
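One plausible reading of the cis normalisation described above is to divide each bedgraph count by the number of cis interactions and multiply by a scale factor. The sketch below follows that reading; the exact formula, the function name and the default scale factor of one million are assumptions rather than documented behaviour.

```python
def normalise_bedgraph(rows, n_cis, scale_factor=1_000_000):
    """Scale each bedgraph count by scale_factor / n_cis."""
    if n_cis <= 0:
        raise ValueError("n_cis must be positive")
    return [
        (chrom, start, end, count * scale_factor / n_cis)
        for chrom, start, end, count in rows
    ]
```

Normalising in this way expresses counts per `scale_factor` cis interactions, which makes experiments of different sequencing depth comparable.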
plot¶
Plots a heatmap of reporter interactions.
Parses an HDF5 file containing the result of a capture experiment (binned into even genomic windows) and plots a heatmap of interactions over a specified genomic range. If a capture probe name is not supplied the script will plot all probes present in the file.
Heatmaps can also be normalised (--normalisation) using either:
- n_interactions: The number of cis interactions.
- n_rf_n_interactions: Normalised to the number of restriction fragments making up both genomic bins and by the number of cis interactions.
- ice: ICE normalisation followed by correction for the number of cis interactions.
ccanalyser reporters plot [OPTIONS] COOLER_FN
Options
- -c, --coordinates <coordinates>¶
Coordinates in the format chr1:1000-2000 or a path to a .bed file with coordinates
- -r, --resolution <resolution>¶
Required Resolution at which to plot. Must be present within the cooler file.
- -n, --capture_names <capture_names>¶
Capture names to plot. If not supplied will plot all
- --normalisation <normalisation>¶
Normalisation method for heatmap
- Options
n_interactions | n_rf_n_interactions | ice
- --cmap <cmap>¶
Colour map to use for plotting
- --vmax <vmax>¶
Vmax for plotting
- --vmin <vmin>¶
Vmin for plotting
- -o, --output_prefix <output_prefix>¶
Output prefix for plot
- --remove_capture¶
Removes the capture probe bins from the matrix
Arguments
- COOLER_FN¶
Required argument
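The coordinate format accepted by --coordinates (for example chr1:1000-2000) can be parsed with a short regular expression. The helper below is a hypothetical sketch of such a parser, not the function used by ccanalyser, and it does not cover the alternative of supplying a .bed file.

```python
import re


def parse_coordinates(coordinates):
    """Parse a string such as "chr1:1000-2000" into (chrom, start, end)."""
    match = re.fullmatch(r"([^:]+):(\d+)-(\d+)", coordinates.strip())
    if match is None:
        raise ValueError(f"Not in chrom:start-end format: {coordinates!r}")
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start >= end:
        raise ValueError("start must be smaller than end")
    return chrom, start, end
```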
store¶
Store reporter counts.
These commands store and manipulate reporter restriction fragment interaction counts as cooler formatted groups in HDF5 files.
See subcommands for details.
ccanalyser reporters store [OPTIONS] COMMAND [ARGS]...
bins¶
Converts a cooler group containing restriction fragments to constant genomic windows.
Parses a cooler group and aggregates restriction fragment interaction counts into genomic bins of a specified size. If the normalise option is selected, columns containing normalised counts are added to the pixels table of the output.
ccanalyser reporters store bins [OPTIONS] COOLER_FN
Options
- -b, --binsizes <binsizes>¶
Binsizes to use for windowing
- --normalise¶
Enables normalisation of interaction counts during windowing
- --overlap_fraction <overlap_fraction>¶
Minimum overlap between genomic bins and restriction fragments for overlap
- -p, --n_cores <n_cores>¶
Number of cores used for binning
- --scale_factor <scale_factor>¶
Scaling factor used for normalisation
- --conversion_tables <conversion_tables>¶
Pickle file containing pre-computed fragment -> bin conversions.
- -o, --output <output>¶
Name of output file. (Cooler formatted hdf5 file)
Arguments
- COOLER_FN¶
Required argument
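Aggregating fragment-level counts into fixed-width windows can be sketched as follows. For simplicity each fragment is assigned to the bin containing its midpoint, which is an assumption made here; the command itself exposes --overlap_fraction, so its assignment rule is likely different. The function name and input format are also illustrative.

```python
from collections import defaultdict


def bin_counts(fragment_counts, binsize):
    """Sum fragment-level counts into fixed-width genomic bins.

    `fragment_counts` holds (chrom, start, end, count) tuples; each fragment
    is assigned to the bin containing its midpoint.
    """
    binned = defaultdict(int)
    for chrom, start, end, count in fragment_counts:
        bin_start = ((start + end) // 2) // binsize * binsize
        binned[(chrom, bin_start, bin_start + binsize)] += count
    return dict(binned)
```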
fragments¶
Stores restriction fragment interaction combinations at the restriction fragment level.
Parses reporter restriction fragment interaction counts produced by “ccanalyser reporters count” and generates a cooler formatted group in an HDF5 file. See https://cooler.readthedocs.io/en/latest/ for further details.
ccanalyser reporters store fragments [OPTIONS] COUNTS
Options
- -f, --fragment_map <fragment_map>¶
Required Path to digested genome bed file
- -c, --capture_viewpoints <capture_viewpoints>¶
Required Path to capture viewpoints file
- -n, --capture_name <capture_name>¶
Required Name of capture viewpoint to store
- -g, --genome <genome>¶
Name of genome
- --suffix <suffix>¶
Suffix to append after the capture name for the output file
- -o, --output <output>¶
Name of output file. (Cooler formatted hdf5 file)
Arguments
- COUNTS¶
Required argument
merge¶
Merges ccanalyser cooler files together.
Produces a unified cooler with both restriction fragment and genomic bins whilst reducing the storage space required by hard linking the “bins” tables to prevent duplication.
ccanalyser reporters store merge [OPTIONS] COOLERS...
Options
- -o, --output <output>¶
Output file name
Arguments
- COOLERS¶
Required argument(s)