Pipeline

This pipeline processes data from Capture-C/NG Capture-C/Tri-C and Tiled-C sequencing protocols designed to identify 3D interactions in the genome from a specified viewpoint.

It takes Illumina paired-end sequencing reads in fastq format (gzip compression is preferred) as input and performs the following steps:

  1. Identifies all restriction fragments in the genome

  2. Quality control of raw reads (fastqc, multiqc)

  3. Splits fastqs into smaller files to enable fast parallel processing.

  4. Removal of PCR duplicates based on exact sequence matches from fastq files

  5. Trimming of reads to remove adaptor sequence (trim_galore)

  6. Combining overlapping read pairs (FLASh)

  7. In silico digestion of reads in fastq files

  8. Alignment of fastq files with a user-specified aligner (i.e. bowtie/bowtie2; BWA is not supported)

  9. Analysis of alignment statistics (picard CollectAlignmentSummaryMetrics, multiqc)

  10. Annotation of mapped reads with overlaps of capture probes, exclusion regions, blacklist, restriction fragments

  11. Removal of non-reporter slices and identification of reporters

  12. Removal of PCR duplicates (exact coordinate matches)

  13. Storage of reporters in cooler format <https://cooler.readthedocs.io/en/latest/datamodel.html>

  14. Generation of bedgraphs/BigWigs.

  15. Collation of run statistics and generation of a run report

Optional:

  • Generation of a UCSC track hub for visualisation.

  • Differential interaction identification.

  • Generation of subtraction bedgraphs for between-condition comparisons.

  • Plotting of heatmaps.

@authors: asmith, dsims

ccanalyser.pipeline.pipeline.check_config()

Checks that all essential configuration has been provided.

ccanalyser.pipeline.pipeline.modify_pipeline_params_dict()

Modifies P.PARAMS dictionary.

  • Selects the correct conda environment.

  • Ensures the correct queue manager is selected.

  • Corrects the name of a UCSC hub by removing spaces and incorrect characters.
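
As a rough illustration of the hub-name correction, the sketch below replaces whitespace with underscores and drops any character outside a conservative allowed set. The helper name sanitise_hub_name and the allowed character set are assumptions made for this example, not the pipeline's actual implementation.

```python
import re


def sanitise_hub_name(name: str) -> str:
    """Hypothetical helper: make a hub name safe to use in paths and URLs."""
    # Replace each run of whitespace with a single underscore.
    name = re.sub(r"\s+", "_", name.strip())
    # Drop anything that is not a letter, digit, underscore or hyphen
    # (the allowed set is an assumption for this sketch).
    return re.sub(r"[^A-Za-z0-9_\-]", "", name)


print(sanitise_hub_name("My Hub (v2)!"))
```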

ccanalyser.pipeline.pipeline.set_up_chromsizes()

Ensures that genome chromsizes are present.

If chromsizes are not provided this function attempts to download them from UCSC. The P.PARAMS dictionary is updated with the location of the chromsizes.

ccanalyser.pipeline.pipeline.check_user_supplied_paths()
ccanalyser.pipeline.pipeline.genome_digest(infile, outfile)

In silico digestion of the genome to identify restriction fragment coordinates.

Runs ccanalyser genome digest.
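
The idea behind an in silico digest can be sketched in a few lines of Python. This is not the code behind ccanalyser genome digest; the function name digest_sequence, the default GATC recognition site and the choice to cut at the start of the site are all assumptions for illustration.

```python
import re
from typing import Iterator, Tuple


def digest_sequence(seq: str, site: str = "GATC") -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) coordinates of restriction fragments."""
    start = 0
    for match in re.finditer(re.escape(site), seq.upper()):
        # Assumption: the enzyme cuts immediately before the recognition site.
        cut = match.start()
        if cut > start:
            yield (start, cut)
        start = cut
    # Emit the final fragment running to the end of the sequence.
    if start < len(seq):
        yield (start, len(seq))


for fragment in digest_sequence("AAGATCTTGATCCC"):
    print(fragment)
```

Each fragment begins at a recognition site (or at the start of the sequence), so consecutive fragments tile the sequence without gaps or overlaps.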

ccanalyser.pipeline.pipeline.fastq_qc(infile, outfile)

Runs fastqc on the input files to generate fastq statistics.

ccanalyser.pipeline.pipeline.fastq_multiqc(infile, outfile)

Collates fastqc reports into a single report using multiqc

ccanalyser.pipeline.pipeline.fastq_split(infiles, outfile)

Splits the input fastq files into chunks for parallel processing

Runs ccanalyser fastq split.
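
A minimal sketch of chunking, assuming well-formed four-line fastq records. The function name split_fastq_lines and the reads_per_chunk parameter are hypothetical and do not reflect the options of ccanalyser fastq split.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_fastq_lines(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of fastq lines holding at most reads_per_chunk records each."""
    iterator = iter(lines)
    while True:
        # A fastq record is four lines: header, sequence, separator, qualities.
        chunk = list(islice(iterator, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk


example = []
for i in range(5):
    example += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]

for part_no, chunk in enumerate(split_fastq_lines(example, reads_per_chunk=2)):
    print(part_no, len(chunk) // 4, "reads")
```

For paired-end data, both mates would need to be split with the same chunk size so that partitions stay in register.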

ccanalyser.pipeline.pipeline.fastq_duplicates_parse(infiles, outfile, sample_name, part_no)

Parses fastq files into json format for sequence deduplication.

Runs ccanalyser fastq deduplicate parse

ccanalyser.pipeline.pipeline.fastq_duplicates_identify(infiles, outfile)

Identifies duplicate sequences from parsed fastq files in json format.

Runs ccanalyser fastq deduplicate identify
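
Exact-sequence deduplication can be sketched as follows. The pipeline works on parsed json files across partitions; this example keeps everything in memory, and the function name identify_duplicates and the (name, read1, read2) input layout are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple


def identify_duplicates(pairs: Iterable[Tuple[str, str, str]]) -> Set[str]:
    """Return names of read pairs whose (read1, read2) sequences were already seen."""
    seen = set()
    duplicates = set()
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:
            # A later pair with identical sequences is treated as a PCR duplicate.
            duplicates.add(name)
        else:
            seen.add(key)
    return duplicates


pairs = [
    ("r1", "ACGT", "TTTT"),
    ("r2", "ACGT", "TTTT"),  # same sequences as r1
    ("r3", "ACGT", "GGGG"),  # read 2 differs, so this pair is kept
]
print(identify_duplicates(pairs))
```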

ccanalyser.pipeline.pipeline.fastq_duplicates_remove(infiles, outfile)

Removes duplicate read fragments identified from parsed fastq files.

ccanalyser.pipeline.pipeline.stats_deduplication_collate(infiles, outfile)

Combines deduplication statistics from fastq file partitions.

ccanalyser.pipeline.pipeline.fastq_trim(infiles, outfile)

Trim adaptor sequences from fastq files using trim_galore

ccanalyser.pipeline.pipeline.stats_trim_collate(infiles, outfile)

Extracts and collates adapter trimming statistics from trim_galore output

ccanalyser.pipeline.pipeline.fastq_flash(infiles, outfile)

Combine overlapping paired-end reads using FLASh

ccanalyser.pipeline.pipeline.fastq_digest_combined(infile, outfile)

In silico restriction enzyme digest of combined (flashed) read pairs

ccanalyser.pipeline.pipeline.fastq_digest_non_combined(infiles, outfile)

In silico restriction enzyme digest of non-combined (non-flashed) read pairs

ccanalyser.pipeline.pipeline.stats_digestion_collate(infiles, outfile)

Aggregates in silico digestion statistics from fastq file partitions.

ccanalyser.pipeline.pipeline.fastq_preprocessing()
ccanalyser.pipeline.pipeline.fastq_alignment(infile, outfile)

Aligns in silico digested fastq files to the genome.

ccanalyser.pipeline.pipeline.alignments_merge(infiles, outfile)

Combines bam files (by flashed/non-flashed status and sample).

This task simply provides an input for picard CollectAlignmentSummaryMetrics and is only used to provide overall mapping statistics. Fastq partitions are not combined at this stage.

ccanalyser.pipeline.pipeline.alignments_index(infile, outfile)

Indexes all bam files (both partitioned and merged)

ccanalyser.pipeline.pipeline.pre_annotation()
ccanalyser.pipeline.pipeline.annotate_make_exclusion_bed(outfile)

Generates exclusion window around each capture site
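
As a sketch of what an exclusion window looks like, the helper below widens a capture site by a fixed number of base pairs on each side and clamps the result to the chromosome. The function name make_exclusion_window, the 1000 bp default and the tuple layout are assumptions for this example.

```python
from typing import Optional, Tuple


def make_exclusion_window(
    chrom: str,
    start: int,
    end: int,
    name: str,
    window: int = 1000,  # assumed default, for illustration only
    chrom_size: Optional[int] = None,
) -> Tuple[str, int, int, str]:
    """Return a BED-style interval extended by `window` bp on both sides."""
    new_start = max(0, start - window)  # never run off the chromosome start
    new_end = end + window
    if chrom_size is not None:
        new_end = min(new_end, chrom_size)  # nor past the chromosome end
    return (chrom, new_start, new_end, name)


print(make_exclusion_window("chr1", 500, 800, "viewpoint_a"))
```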

ccanalyser.pipeline.pipeline.annotate_sort_viewpoints(outfile)

Sorts the capture oligos for bedtools intersect with the -sorted option

ccanalyser.pipeline.pipeline.annotate_sort_blacklist(outfile)

Sorts the blacklist for bedtools intersect with the -sorted option

ccanalyser.pipeline.pipeline.annotate_alignments(infile, outfile)

Annotates mapped read slices.

Slices are annotated with:
  • capture name

  • capture count

  • exclusion name

  • exclusion count

  • blacklist count

  • restriction fragment number

ccanalyser.pipeline.pipeline.post_annotation()

Runs the pipeline until just prior to identification of reporters

ccanalyser.pipeline.pipeline.alignments_filter(infiles, outfile)

Filters slices and outputs reporter slices for each capture site

ccanalyser.pipeline.pipeline.reporters_collate(infiles, outfile, *grouping_args)

Concatenates identified reporters

ccanalyser.pipeline.pipeline.alignments_deduplicate_fragments(infile, outfile, read_type)

Identifies duplicate fragments with the same coordinates and order.
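
Coordinate-based deduplication can be illustrated as below, treating a fragment as the ordered list of its slice coordinates. The function name deduplicate_fragments and the dictionary input are assumptions; the pipeline's own data structures are not shown in this reference.

```python
from typing import Dict, List, Tuple

Slice = Tuple[str, int, int]  # (chrom, start, end)


def deduplicate_fragments(fragments: Dict[str, List[Slice]]) -> List[str]:
    """Return names of fragments with a unique ordered set of slice coordinates."""
    seen = set()
    unique = []
    for name, slices in fragments.items():
        # Order is part of the key: the same slices in a different order differ.
        key = tuple(slices)
        if key not in seen:
            seen.add(key)
            unique.append(name)
    return unique


fragments = {
    "a": [("chr1", 1, 10), ("chr1", 50, 60)],
    "b": [("chr1", 1, 10), ("chr1", 50, 60)],  # duplicate of "a"
    "c": [("chr1", 50, 60), ("chr1", 1, 10)],  # same slices, different order
}
print(deduplicate_fragments(fragments))
```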

ccanalyser.pipeline.pipeline.alignments_deduplicate_slices(infile, outfile, sample_name, read_type, capture_oligo)

Removes reporters with duplicate coordinates

ccanalyser.pipeline.pipeline.alignments_deduplicate_collate(infiles, outfile, *grouping_args)

Final collation of reporters by sample and capture probe

ccanalyser.pipeline.pipeline.stats_alignment_filtering_collate(infiles, outfile)

Combination of all reporter identification and filtering statistics

ccanalyser.pipeline.pipeline.post_ccanalyser_analysis()

Reporters have been identified, deduplicated and collated by sample/capture probe

ccanalyser.pipeline.pipeline.reporters_count(infile, outfile)

Counts the number of interactions identified between reporter restriction fragments

ccanalyser.pipeline.pipeline.reporters_store_restriction_fragment(infile, outfile, sample_name, capture_name)

Stores restriction fragment interaction counts in cooler format

ccanalyser.pipeline.pipeline.generate_bin_conversion_tables(outfile)

Converts restriction fragments to genomic bins.

Binning restriction fragments into genomic bins takes a substantial amount of time and memory. To avoid repeatedly performing the same action, bin conversion tables are calculated once for each required resolution and then stored as a pickle file.
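
The compute-once-and-cache pattern described above might look like the following sketch. Assigning each fragment to a bin by its midpoint is an assumption made for this example (the pipeline may use a different overlap rule), and the names fragments_to_bins and load_or_build_table are hypothetical.

```python
import pickle
from pathlib import Path
from typing import Dict, Iterable, Tuple

Fragment = Tuple[int, str, int, int]  # (fragment id, chrom, start, end)


def fragments_to_bins(fragments: Iterable[Fragment], binsize: int) -> Dict[int, Tuple[str, int]]:
    """Map each fragment id to (chrom, bin index) using the fragment midpoint."""
    table = {}
    for frag_id, chrom, start, end in fragments:
        midpoint = (start + end) // 2
        table[frag_id] = (chrom, midpoint // binsize)
    return table


def load_or_build_table(fragments: Iterable[Fragment], binsize: int, cache_path: str):
    """Reuse a pickled conversion table if present, otherwise build and store it."""
    cache = Path(cache_path)
    if cache.exists():
        with cache.open("rb") as handle:
            return pickle.load(handle)
    table = fragments_to_bins(fragments, binsize)
    with cache.open("wb") as handle:
        pickle.dump(table, handle)
    return table
```

One cache file per resolution keeps the expensive conversion out of every downstream binning task.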

ccanalyser.pipeline.pipeline.reporters_store_binned(infile, outfile, capture_name)

Converts a cooler file of restriction fragments to even genomic bins.

ccanalyser.pipeline.pipeline.reporters_store_merged(infiles, outfile, sample_name)

Combines cooler files together

ccanalyser.pipeline.pipeline.pipeline_merge_stats(infiles, outfile)

Generates a summary statistics file for the pipeline run.

ccanalyser.pipeline.pipeline.pipeline_make_report(infile, outfile)

Runs a Jupyter notebook for reporting and plotting pipeline statistics

ccanalyser.pipeline.pipeline.reporters_make_bedgraph(infile, outfile, sample_name)

Extract reporters in bedgraph format from stored interactions

ccanalyser.pipeline.pipeline.reporters_make_bedgraph_normalised(infile, outfile, sample_name)

Extract reporters in bedgraph format from stored interactions.

In addition to generating a bedgraph this task also normalises the counts by the number of cis interactions identified to enable cross sample comparisons.
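
The normalisation can be sketched as a simple rescaling of each bedgraph count by the number of cis interactions. The scale factor of 1,000,000 and the function name normalise_by_cis are assumptions for illustration; this reference does not state the constant the pipeline uses.

```python
from typing import List, Tuple

BedgraphRow = Tuple[str, int, int, float]  # (chrom, start, end, count)


def normalise_by_cis(rows: List[BedgraphRow], n_cis: int, scale: int = 1_000_000) -> List[BedgraphRow]:
    """Scale counts to interactions per `scale` cis interactions."""
    if n_cis <= 0:
        raise ValueError("n_cis must be a positive number of cis interactions")
    return [(chrom, start, end, count * scale / n_cis) for chrom, start, end, count in rows]


rows = [("chr1", 0, 100, 5), ("chr1", 100, 200, 20)]
print(normalise_by_cis(rows, n_cis=1000))
```

Because every sample is expressed relative to its own cis total, samples sequenced to different depths become directly comparable.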

ccanalyser.pipeline.pipeline.reporters_make_union_bedgraph(infiles, outfile, normalisation_type, capture_name)

Collates bedgraphs by capture probe into a single file for comparison.

See bedtools unionbedg for more details.

ccanalyser.pipeline.pipeline.reporters_make_comparison_bedgraph(infile, outfile, viewpoint)
ccanalyser.pipeline.pipeline.reporters_make_bigwig(infile, outfile)

Uses UCSC tools bedGraphToBigWig to generate bigWigs for each bedgraph

ccanalyser.pipeline.pipeline.viewpoints_to_bigbed(infile, outfile)
ccanalyser.pipeline.pipeline.hub_make(infiles, outfile)

Creates a UCSC hub from the pipeline output

ccanalyser.pipeline.pipeline.identify_differential_interactions(infile, outfile, capture_name)
ccanalyser.pipeline.pipeline.reporters_plot_heatmap(infile, outfile)

Plots a heatmap over a specified region

ccanalyser.pipeline.pipeline.full(outfile)