CLI Documentation

ccanalyser

An end-to-end solution for processing Capture-C, Tri-C and Tiled-C data.

ccanalyser [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

alignments

Alignment annotation, identification and deduplication.

ccanalyser alignments [OPTIONS] COMMAND [ARGS]...

annotate

Annotates a bed file with other bed files using bedtools intersect.

Whilst bedtools intersect allows either interval names or counts to be used for annotating intervals, this command can annotate intervals with both interval names and counts at the same time. As the pipeline allows for empty bed files, this command has built-in support for blank/malformed bed files and will return default N/A values for them.

Prior to interval annotation, the bed file to be intersected is validated and duplicate entries/multimapping reads are removed to ensure consistent annotations and prevent issues with reporter identification.
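
The behaviour described above can be pictured with a minimal sketch. It is not the ccanalyser implementation (which relies on bedtools intersect); the function name, tuple layout and slice names are assumptions made for illustration only. It reports both the names and the count of overlapping intervals, and falls back to the default values ('.' or 0) when the annotation bed file is empty.

```python
from collections import defaultdict


def annotate(slices, features, default_name=".", default_count=0):
    """Annotate each slice with the names and the count of overlapping features.

    slices and features are lists of (chrom, start, end, name) tuples.
    An empty feature list yields the default values for every slice.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in features:
        by_chrom[chrom].append((start, end, name))

    annotated = {}
    for chrom, start, end, slice_name in slices:
        # Half-open interval overlap: [start, end) intersects [f_start, f_end)
        hits = [n for f_start, f_end, n in by_chrom[chrom]
                if f_start < end and start < f_end]
        annotated[slice_name] = {
            "name": ",".join(hits) if hits else default_name,
            "count": len(hits) if hits else default_count,
        }
    return annotated


# Hypothetical slices and a single hypothetical capture probe
slices = [("chr1", 100, 200, "read1|0"), ("chr1", 500, 600, "read1|1")]
probes = [("chr1", 150, 250, "probe_A")]

result = annotate(slices, probes)
empty = annotate(slices, [])  # mimics an empty bed file
```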

ccanalyser alignments annotate [OPTIONS] SLICES

Options

-a, --actions <actions>

Determines whether the overlaps are counted (count) or just the name of the overlapping interval is reported (get).

Options

get | count

-b, --bed_files <bed_files>

Bed file(s) to intersect with slices

-n, --names <names>

Names to use as column names for the output tsv file.

-f, --overlap_fractions <overlap_fractions>

The minimum overlap required for an intersection between two intervals to be reported.

-o, --output <output>

Path for the annotated slices to be output.

--duplicates <duplicates>

Method to use for reconciling duplicate slices (i.e. multimapping). Currently only ‘remove’ is supported.

Options

remove

-p, --n_cores <n_cores>

Intersections are performed per chromosome; this option sets the number of cores used.

--invalid_bed_action <invalid_bed_action>

Method to deal with invalid bed files, e.g. blank or incorrectly formatted. Setting this to ‘ignore’ will report default N/A values (either ‘.’ or 0) for invalid files.

Options

ignore | error

Arguments

SLICES

Required argument

deduplicate

Identifies and removes duplicated aligned fragments.

PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. Unlike fastq deduplicate, this command removes fragments with identical genomic coordinates.

Non-combined (pe) and combined (flashed) reads are treated slightly differently due to the increased confidence that the ligation junction has been captured for the flashed reads.
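
As a rough illustration of coordinate-based deduplication, the sketch below flags every fragment whose coordinates have already been seen. It is only a sketch under stated assumptions: the coordinate string encoding and the function name are invented here, and the different treatment of flashed and pe reads mentioned above is not modelled.

```python
def identify_coordinate_duplicates(fragments):
    """Return ids of fragments whose coordinates were already seen.

    fragments: iterable of (fragment_id, coordinates), where coordinates is a
    string such as "chr1:100-200|chr1:5000-5100" (a made-up encoding).
    """
    seen = set()
    duplicates = []
    for fragment_id, coordinates in fragments:
        if coordinates in seen:
            duplicates.append(fragment_id)
        else:
            seen.add(coordinates)
    return duplicates


fragments = [
    ("frag1", "chr1:100-200|chr1:5000-5100"),
    ("frag2", "chr1:100-200|chr1:5000-5100"),  # same coordinates as frag1
    ("frag3", "chr2:300-400|chr2:900-950"),
]
duplicated_ids = identify_coordinate_duplicates(fragments)
```

The first fragment with a given set of coordinates is kept; only later copies are reported as duplicates.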

ccanalyser alignments deduplicate [OPTIONS] COMMAND [ARGS]...

identify

ccanalyser alignments deduplicate identify [OPTIONS] FRAGMENTS_FN

Options

-o, --output <output>

Path for outputting fragments with duplicated coordinates in json format.

--buffer <buffer>

Number of fragments to process at one time in order to preserve memory.

--read_type <read_type>

Indicates if the fragments have been combined (flashed) or not (pe).

Options

flashed | pe

Arguments

FRAGMENTS_FN

Required argument

remove

Removes duplicated aligned fragments.

Parses a tsv file containing aligned read slices and outputs only slices from unique fragments. Duplicated parental read IDs determined by the “identify” subcommand are located within the slices tsv file and the corresponding slices are removed.

Outputs statistics for the number of unique slices and the number of duplicate slices identified.
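
The removal step and its statistics can be sketched as follows. This is an illustrative example only: the column names (slice_name, parent_read, ...) and the function name are assumptions, not the actual layout of the slices tsv file.

```python
import csv
import io


def remove_duplicate_slices(slices_tsv, duplicated_ids):
    """Drop slices whose parent read id is in duplicated_ids and tally the result."""
    reader = csv.DictReader(io.StringIO(slices_tsv), delimiter="\t")
    kept, n_removed = [], 0
    for row in reader:
        if row["parent_read"] in duplicated_ids:
            n_removed += 1
        else:
            kept.append(row)
    stats = {"unique": len(kept), "duplicate": n_removed}
    return kept, stats


# Hypothetical slices table with one slice per fragment
slices_tsv = (
    "slice_name\tparent_read\tchrom\tstart\tend\n"
    "frag1|0\tfrag1\tchr1\t100\t200\n"
    "frag2|0\tfrag2\tchr1\t100\t200\n"
    "frag3|0\tfrag3\tchr2\t300\t400\n"
)
kept, stats = remove_duplicate_slices(slices_tsv, {"frag2"})
```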

ccanalyser alignments deduplicate remove [OPTIONS] SLICES_FN

Options

-d, --duplicated_ids <duplicated_ids>

Path to duplicated fragment ids determined by the ‘identify’ subcommand.

-o, --output <output>

Path for outputting deduplicated slices in tsv format.

--buffer <buffer>

Number of fragments to process at one time, in order to preserve memory.

--stats_prefix <stats_prefix>

Output prefix for deduplication statistics

--sample_name <sample_name>

Name of sample being analysed e.g. DOX_treated_1. Required for correct statistics.

--read_type <read_type>

Indicates if the fragments have been combined (flashed) or not (pe). Required for correct statistics.

Options

flashed | pe

Arguments

SLICES_FN

Required argument

filter

Removes unwanted aligned slices and identifies reporters.

Parses a BAM file and merges this with a supplied annotation to identify unwanted slices. Filtering can be tuned for Capture-C, Tri-C and Tiled-C data to ensure optimal filtering.

ccanalyser alignments filter [OPTIONS] {capture|tri|tiled}

Options

-b, --bam <bam>

Required Bam file to process

-a, --annotations <annotations>

Required Annotations for the bam file that must contain the required columns, see description.

--custom_filtering <custom_filtering>

Custom filtering to be used. This must be supplied as a path to a yaml file.

-o, --output_prefix <output_prefix>

Output prefix for deduplicated fastq file(s)

--stats_prefix <stats_prefix>

Output prefix for stats file(s)

--sample_name <sample_name>

Name of sample e.g. DOX_treated_1

--read_type <read_type>

Type of read

Options

flashed | pe

--gzip, --no-gzip

Determines if files are gzipped or not

Arguments

METHOD

Required argument

fastq

Fastq splitting, deduplication and digestion.

ccanalyser fastq [OPTIONS] COMMAND [ARGS]...

deduplicate

Identifies PCR duplicate fragments from Fastq files.

PCR duplicates are very commonly present in Capture-C/Tri-C/Tiled-C data and must be removed for accurate analysis. These commands attempt to identify and remove duplicate reads/fragments from fastq file(s) to speed up downstream analysis.

ccanalyser fastq deduplicate [OPTIONS] COMMAND [ARGS]...

identify

Identifies fragments with duplicated sequences.

Merges the hashed dictionaries (in json format) generated by the “parse” subcommand and identifies reads with exactly the same sequence (i.e. reads that share an identical hash). Duplicated read identifiers (hashed) are output in json format. The “remove” subcommand uses this dictionary to remove duplicates from fastq files.
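
A minimal sketch of the merging and identification step is given below. It assumes that each json document maps a hashed read identifier to a hashed sequence; that layout, the function name and the example hash values are assumptions made for illustration.

```python
import json


def identify_duplicates(json_documents):
    """Return hashed read ids whose sequence hash has already been seen."""
    seen_sequences = set()
    duplicated = []
    for document in json_documents:
        for read_hash, sequence_hash in json.loads(document).items():
            if sequence_hash in seen_sequences:
                duplicated.append(read_hash)
            else:
                seen_sequences.add(sequence_hash)
    return duplicated


# Two hypothetical "parse" outputs; reads 101 and 201 share sequence hash 9001
part_1 = json.dumps({"101": 9001, "102": 9002})
part_2 = json.dumps({"201": 9001, "202": 9003})
duplicated_ids = identify_duplicates([part_1, part_2])
```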

ccanalyser fastq deduplicate identify [OPTIONS] [INPUT_FILES]...

Options

-o, --output <output>

Required Output file

Arguments

INPUT_FILES

Optional argument(s)

parse

Parses fastq file(s) into an easy-to-deduplicate format.

This command parses one or more fastq files and generates a dictionary containing hashed read identifiers together with hashed concatenated sequences. The hash dictionary is output in json format and the identify subcommand can be used to determine which read identifiers have duplicate sequences.
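
The idea of hashing read identifiers together with concatenated sequences can be sketched as below. The hash function (truncated SHA-1 from hashlib), the helper names and the in-memory fastq text are all assumptions; the hashing scheme actually used by ccanalyser is not stated here.

```python
import hashlib
import io
import json


def stable_hash(text):
    """Short, reproducible hash (Python's built-in hash() is salted per run)."""
    return hashlib.sha1(text.encode()).hexdigest()[:16]


def read_fastq(handle):
    """Yield (name, sequence) for each four-line fastq record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header[1:].split()[0], sequence


def parse_pair(fq_1, fq_2):
    """Map hashed read name -> hash of the concatenated read 1 + read 2 sequence."""
    hashed = {}
    for (name, seq_1), (_, seq_2) in zip(read_fastq(fq_1), read_fastq(fq_2)):
        hashed[stable_hash(name)] = stable_hash(seq_1 + seq_2)
    return hashed


# Two hypothetical read pairs with identical sequences
r1 = io.StringIO("@read1 1\nACGT\n+\nIIII\n@read2 1\nACGT\n+\nIIII\n")
r2 = io.StringIO("@read1 2\nTTAA\n+\nIIII\n@read2 2\nTTAA\n+\nIIII\n")
hashed = parse_pair(r1, r2)
as_json = json.dumps(hashed)
```

Both read pairs receive different identifier hashes but the same sequence hash, which is what a later identification step would look for.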

ccanalyser fastq deduplicate parse [OPTIONS] INPUT_FILES...

Options

-o, --output <output>

File to store hashed sequence identifiers

--read_buffer <read_buffer>

Number of reads to process before writing to file

Arguments

INPUT_FILES

Required argument(s)

remove

Removes fragments with duplicated sequences from fastq files.

Parses input fastq files and removes any duplicates from the fastq file(s) that are present in the json file supplied. This json dictionary should be produced by the “identify” subcommand.

Statistics for the number of duplicated and unique reads are also provided.
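
A rough sketch of the removal step follows. It assumes the duplicate set contains hashed read names produced with the same (hypothetical) hash function as the parsing step; the function names and statistics keys are illustrative only.

```python
import hashlib


def stable_hash(text):
    """Short, reproducible hash of a read name (an assumed scheme)."""
    return hashlib.sha1(text.encode()).hexdigest()[:16]


def remove_duplicates(fastq_text, duplicated_ids):
    """Return fastq text without duplicated records, plus simple statistics."""
    lines = fastq_text.splitlines()
    kept, n_removed = [], 0
    for i in range(0, len(lines), 4):  # fastq records span four lines
        record = lines[i:i + 4]
        name = record[0][1:].split()[0]
        if stable_hash(name) in duplicated_ids:
            n_removed += 1
        else:
            kept.extend(record)
    stats = {"unique": len(kept) // 4, "duplicate": n_removed}
    return "".join(line + "\n" for line in kept), stats


fastq = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\nIIII\n"
deduplicated, stats = remove_duplicates(fastq, {stable_hash("read2")})
```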

ccanalyser fastq deduplicate remove [OPTIONS] [INPUT_FILES]...

Options

-o, --output_prefix <output_prefix>

Output prefix for deduplicated fastq file(s)

-d, --duplicated_ids <duplicated_ids>

Path to duplicate ids, identified by the identify subcommand

--read_buffer <read_buffer>

Number of reads to process before writing to file

--gzip, --no-gzip

Determines if files are gzipped or not

--compression_level <compression_level>

Level of compression for output files

--sample_name <sample_name>

Name of sample e.g. DOX_treated_1

--stats_prefix <stats_prefix>

Output prefix for stats file

Arguments

INPUT_FILES

Optional argument(s)

digest

Performs in silico digestion of one or a pair of fastq files.
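
In silico digestion of a single read can be pictured with the sketch below. It is an assumption-laden illustration: the read is cut at the start of every recognition site, the parameter names merely mirror the --keep_cutsite and --minimum_slice_length options, and the default values shown are arbitrary rather than those of ccanalyser.

```python
import re


def digest_read(sequence, site="GATC", keep_cutsite=False, minimum_slice_length=4):
    """Split a read at every occurrence of the recognition site."""
    starts = [m.start() for m in re.finditer(re.escape(site), sequence)]
    boundaries = sorted(set([0] + starts + [len(sequence)]))
    slices = []
    for left, right in zip(boundaries[:-1], boundaries[1:]):
        piece = sequence[left:right]
        # Slices that begin at a cut position start with the recognition site
        if not keep_cutsite and left in starts:
            piece = piece[len(site):]
        if len(piece) >= minimum_slice_length:
            slices.append(piece)
    return slices


read = "AAAAAGATCCCCCCGATCTT"
without_site = digest_read(read)
with_site = digest_read(read, keep_cutsite=True)
```

With these settings the trailing "TT" slice is discarded when the cut site is removed because it falls below the minimum slice length.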

ccanalyser fastq digest [OPTIONS] INPUT_FASTQ...

Options

-r, --restriction_enzyme <restriction_enzyme>

Required Restriction enzyme name or sequence to use for in silico digestion.

-m, --mode <mode>

Required Digestion mode. Combined (Flashed) or non-combined (PE) read pairs.

Options

flashed | pe

-o, --output_file <output_file>
-p, --n_cores <n_cores>
--minimum_slice_length <minimum_slice_length>
--keep_cutsite <keep_cutsite>
--compression_level <compression_level>

Level of compression for output files (1=low, 9=high)

--read_buffer <read_buffer>

Number of reads to process before writing to file to conserve memory.

--stats_prefix <stats_prefix>

Output prefix for stats file

--sample_name <sample_name>

Name of sample e.g. DOX_treated_1. Required for correct statistics.

Arguments

INPUT_FASTQ

Required argument(s)

split

Splits fastq file(s) into equal chunks of n reads.
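
The chunking itself can be sketched in a few lines. This is not the ccanalyser code path for either the python or unix method; the function name is hypothetical, and the final chunk is simply allowed to be smaller when the total number of reads is not a multiple of n.

```python
from itertools import islice


def split_fastq(lines, n_reads):
    """Yield lists of fastq lines, each holding at most n_reads records."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, 4 * n_reads))  # four lines per record
        if not chunk:
            break
        yield chunk


# Five hypothetical reads flattened into a list of fastq lines
records = [f"@read{i}\nACGT\n+\nIIII".split("\n") for i in range(5)]
lines = [line for record in records for line in record]
chunks = list(split_fastq(lines, n_reads=2))
```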

ccanalyser fastq split [OPTIONS] INPUT_FILES...

Options

-m, --method <method>

Method to use for splitting

Options

python | unix

-o, --output_prefix <output_prefix>

Output prefix for deduplicated fastq file(s)

--compression_level <compression_level>

Level of compression for output files

-n, --n_reads <n_reads>

Number of reads per fastq file

--gzip, --no-gzip

Determines if files are gzipped or not

Arguments

INPUT_FILES

Required argument(s)

genome

Genome-wide methods, e.g. digestion.

ccanalyser genome [OPTIONS] COMMAND [ARGS]...

digest

Performs in silico digestion of a genome in fasta format.

Digests the supplied genome fasta file and generates a bed file containing the locations of all restriction fragments produced by the supplied restriction enzyme.

A log file recording the number of restriction fragments for the supplied genome is also generated.
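
To illustrate the kind of output described above, the following sketch digests a small in-memory fasta and returns bed-like rows. The cut position (start of each recognition site), the fragment naming scheme and the function name are assumptions, not the behaviour of ccanalyser genome digest.

```python
import re


def digest_genome(fasta_text, site="GATC"):
    """Return bed-like rows (chrom, start, end, name) of restriction fragments."""
    chromosomes, name = {}, None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            name = line[1:].split()[0]
            chromosomes[name] = []
        elif name is not None:
            chromosomes[name].append(line.strip().upper())

    rows = []
    for chrom, parts in chromosomes.items():
        sequence = "".join(parts)
        cuts = [m.start() for m in re.finditer(re.escape(site), sequence)]
        boundaries = sorted(set([0] + cuts + [len(sequence)]))
        for start, end in zip(boundaries[:-1], boundaries[1:]):
            rows.append((chrom, start, end, f"fragment_{len(rows)}"))
    return rows


# Hypothetical two-chromosome genome; chr1 is wrapped over two lines
fasta = ">chr1\nAAAAGATCCC\nCCGATCTT\n>chr2\nGGGG\n"
fragments = digest_genome(fasta)
```

The number of rows returned corresponds to the fragment count that a digestion log would record.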

ccanalyser genome digest [OPTIONS] INPUT_FASTA

Options

-r, --recognition_site <recognition_site>

Required Recognition enzyme or sequence

-l, --logfile <logfile>

Path for digestion log file

-o, --output_file <output_file>

Output file path

--remove_cutsite <remove_cutsite>

Exclude the recognition sequence from the output

--sort

Sorts the output bed file by chromosome and start coordinate.

Arguments

INPUT_FASTA

Required argument

pipeline

Runs the data processing pipeline

ccanalyser pipeline [OPTIONS] {make|plot|show|clone|touch}
                    [PIPELINE_OPTIONS]...

Options

-h, --help
--version

Show the version and exit.

Arguments

MODE

Required argument

PIPELINE_OPTIONS

Optional argument(s)

reporters

Reporter counting, storing, comparison, pileups and heatmaps.

ccanalyser reporters [OPTIONS] COMMAND [ARGS]...

count

Determines the number of captured restriction fragment interactions genome-wide.

Parses a reporter slices tsv and counts the number of unique restriction fragment interaction combinations that occur within each fragment.

Options to ignore unwanted counts, e.g. excluded regions or capture fragments, are provided. In addition, the number of reporter fragments can be subsampled if required.
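
The counting step can be sketched as follows, assuming each slice is reduced to a (parent read, restriction fragment id) pair. The input layout and function name are assumptions; exclusion handling and subsampling are left out.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_interactions(slices):
    """Count unique restriction fragment pairs observed within each fragment.

    slices: iterable of (parent_read, restriction_fragment_id) tuples.
    """
    by_read = defaultdict(set)
    for parent_read, rf_id in slices:
        by_read[parent_read].add(rf_id)  # a set keeps each fragment id once per read

    counts = Counter()
    for rf_ids in by_read.values():
        counts.update(combinations(sorted(rf_ids), 2))
    return counts


slices = [
    ("read1", 10), ("read1", 42), ("read1", 42),
    ("read2", 42), ("read2", 10),
    ("read3", 10), ("read3", 42), ("read3", 77),
]
counts = count_interactions(slices)
```

Sorting the fragment ids before pairing means (10, 42) and (42, 10) are counted as the same interaction.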

ccanalyser reporters count [OPTIONS] REPORTERS

Options

-o, --output <output>

Name of output file

--remove_exclusions

Prevents analysis of fragments marked as proximity exclusions

--remove_capture

Prevents analysis of capture fragment interactions

--subsample <subsample>

Subsamples reporters before analysis of interactions

Arguments

REPORTERS

Required argument

differential

Identifies differential interactions between conditions.

Parses a union bedgraph containing reporter counts from at least two conditions with two or more replicates for a single capture probe and outputs differential interaction results. Following filtering to ensure that the number of interactions is above the required threshold (--threshold_count), diffxpy is used to run a Wald test after fitting a negative binomial model to the interaction counts. Options to filter results by a minimum mean value (--threshold_mean) and/or a maximum q-value (--threshold_q) are also provided.

Notes:

Currently both the capture viewpoints and the name of the probe being analysed must be provided in order to correctly extract cis interactions.

If a N_SAMPLE * METADATA design matrix has not been supplied, the script assumes that the standard replicate naming structure has been followed i.e. SAMPLE_CONDITION_REPLICATE_(1|2).fastq.gz.

ccanalyser reporters differential [OPTIONS] UNION_BEDGRAPH

Options

-n, --capture_name <capture_name>

Required Name of capture probe, must be present in viewpoint file.

-c, --capture_viewpoints <capture_viewpoints>

Required Path to capture viewpoints bed file

-o, --output_prefix <output_prefix>

Output prefix for pairwise statistical comparisons

--design_matrix <design_matrix>

Path to tsv file containing sample annotations (N_SAMPLES * N_INFO_COLUMNS)

--grouping_col <grouping_col>

Column to use for grouping replicates

--threshold_count <threshold_count>

Minimum count required to be considered for analysis

--threshold_q <threshold_q>

Upper threshold of q-value required for output.

--threshold_mean <threshold_mean>

Minimum mean count required for output.

Arguments

UNION_BEDGRAPH

Required argument

pileup

Extracts reporters from a capture experiment and generates a bedgraph file.

Identifies reporters for a single probe (if a probe name is supplied) or all capture probes present in a capture experiment HDF5 file.

The bedgraph generated can be normalised by the number of cis interactions for inter experiment comparisons and/or binned into even genomic windows.
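
One plausible form of cis normalisation is sketched below: each count is divided by the number of cis interactions and multiplied by a scale factor. This formula, the default scale factor and the function name are assumptions for illustration; the exact calculation performed by ccanalyser is not specified here.

```python
def normalise_bedgraph(rows, n_cis, scale_factor=1_000_000):
    """Scale bedgraph counts by the number of cis interactions.

    rows: list of (chrom, start, end, count) tuples.
    """
    return [
        (chrom, start, end, count / n_cis * scale_factor)
        for chrom, start, end, count in rows
    ]


# Hypothetical bedgraph rows for an experiment with 20 cis interactions
rows = [("chr1", 0, 100, 5), ("chr1", 100, 250, 15)]
normalised = normalise_bedgraph(rows, n_cis=20)
```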

ccanalyser reporters pileup [OPTIONS] COOLER_FN

Options

-n, --capture_names <capture_names>

Capture to extract and convert to bedgraph. If not provided, all captures are converted.

-o, --output_prefix <output_prefix>

Output prefix for bedgraphs

--normalise

Normalise the bedgraph (correct for the number of cis reads)

--binsize <binsize>

Binsize to use for converting bedgraph to evenly sized genomic bins

--gzip

Compress output using gzip

--scale_factor <scale_factor>

Scale factor to use for bedgraph normalisation

--sparse, --dense

Produce bedgraph containing just positive bins (sparse) or all bins (dense)

Arguments

COOLER_FN

Required argument

plot

Plots a heatmap of reporter interactions.

Parses an HDF5 file containing the result of a capture experiment (binned into even genomic windows) and plots a heatmap of interactions over a specified genomic range. If a capture probe name is not supplied, the script will plot all probes present in the file.

Heatmaps can also be normalised (--normalisation) using one of:

  • n_interactions: The number of cis interactions.

  • n_rf_n_interactions: Normalised to the number of restriction fragments making up both genomic bins and by the number of cis interactions.

  • ice: ICE normalisation followed by correction for the number of cis interactions.

ccanalyser reporters plot [OPTIONS] COOLER_FN

Options

-c, --coordinates <coordinates>

Coordinates in the format chr1:1000-2000 or a path to a .bed file with coordinates

-r, --resolution <resolution>

Required Resolution at which to plot. Must be present within the cooler file.

-n, --capture_names <capture_names>

Capture names to plot. If not supplied, all are plotted.

--normalisation <normalisation>

Normalisation method for heatmap

Options

n_interactions | n_rf_n_interactions | ice

--cmap <cmap>

Colour map to use for plotting

--vmax <vmax>

Vmax for plotting

--vmin <vmin>

Vmin for plotting

-o, --output_prefix <output_prefix>

Output prefix for plot

--remove_capture

Removes the capture probe bins from the matrix

Arguments

COOLER_FN

Required argument

store

Store reporter counts.

These commands store and manipulate reporter restriction fragment interaction counts as cooler formatted groups in HDF5 files.

See subcommands for details.

ccanalyser reporters store [OPTIONS] COMMAND [ARGS]...

bins

Converts a cooler group containing restriction fragments to constant genomic windows.

Parses a cooler group and aggregates restriction fragment interaction counts into genomic bins of a specified size. If the normalise option is selected, columns containing normalised counts are added to the pixels table of the output.
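
The aggregation into genomic bins can be pictured with the sketch below. For simplicity it assigns each restriction fragment to the bin containing its midpoint, which is an assumption made here and differs from the overlap fraction approach exposed by --overlap_fraction; all names are hypothetical and only one chromosome is considered.

```python
from collections import Counter


def bin_pixels(fragments, pixels, binsize):
    """Sum restriction fragment interaction counts into fixed-size bins.

    fragments: {rf_id: (start, end)} on a single chromosome.
    pixels: list of (rf_id_1, rf_id_2, count).
    """
    def to_bin(rf_id):
        start, end = fragments[rf_id]
        return ((start + end) // 2) // binsize  # bin holding the midpoint

    binned = Counter()
    for rf_1, rf_2, count in pixels:
        key = tuple(sorted((to_bin(rf_1), to_bin(rf_2))))
        binned[key] += count
    return binned


fragments = {0: (0, 400), 1: (400, 900), 2: (900, 2100), 3: (2100, 2600)}
pixels = [(0, 2, 3), (1, 2, 2), (0, 3, 1)]
binned = bin_pixels(fragments, pixels, binsize=1000)
```

Fragments 0 and 1 fall into the same 1 kb bin, so their interactions with fragment 2 are summed into a single binned pixel.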

ccanalyser reporters store bins [OPTIONS] COOLER_FN

Options

-b, --binsizes <binsizes>

Binsizes to use for windowing

--normalise

Enables normalisation of interaction counts during windowing

--overlap_fraction <overlap_fraction>

Minimum overlap between genomic bins and restriction fragments for overlap

-p, --n_cores <n_cores>

Number of cores used for binning

--scale_factor <scale_factor>

Scaling factor used for normalisation

--conversion_tables <conversion_tables>

Pickle file containing pre-computed fragment -> bin conversions.

-o, --output <output>

Name of output file. (Cooler formatted hdf5 file)

Arguments

COOLER_FN

Required argument

fragments

Stores restriction fragment interaction combinations at the restriction fragment level.

Parses reporter restriction fragment interaction counts produced by “ccanalyser reporters count” and generates a cooler formatted group in an HDF5 file. See https://cooler.readthedocs.io/en/latest/ for further details.

ccanalyser reporters store fragments [OPTIONS] COUNTS

Options

-f, --fragment_map <fragment_map>

Required Path to digested genome bed file

-c, --capture_viewpoints <capture_viewpoints>

Required Path to capture viewpoints file

-n, --capture_name <capture_name>

Required Name of capture viewpoint to store

-g, --genome <genome>

Name of genome

--suffix <suffix>

Suffix to append after the capture name for the output file

-o, --output <output>

Name of output file. (Cooler formatted hdf5 file)

Arguments

COUNTS

Required argument

merge

Merges ccanalyser cooler files together.

Produces a unified cooler with both restriction fragment and genomic bins whilst reducing the storage space required by hard linking the “bins” tables to prevent duplication.

ccanalyser reporters store merge [OPTIONS] COOLERS...

Options

-o, --output <output>

Output file name

Arguments

COOLERS

Required argument(s)