Pipeline¶
This pipeline processes data from Capture-C/NG Capture-C/Tri-C and Tiled-C sequencing protocols designed to identify 3D interactions in the genome from a specified viewpoint.
It takes Illumina paired-end sequencing reads in fastq format (gzip compression is preferred) as input and performs the following steps:
Identifies all restriction fragments in the genome
Quality control of raw reads (fastqc, multiqc)
Splits fastqs into smaller files to enable fast parallel processing.
Removal of PCR duplicates based on exact sequence matches from fastq files
Trimming of reads to remove adaptor sequence (trim_galore)
Combining overlapping read pairs (FLASh)
In silico digestion of reads in fastq files
Alignment of fastq files with a user-specified aligner (i.e. bowtie/bowtie2; BWA is not supported)
Analysis of alignment statistics (picard CollectAlignmentSummaryMetrics, multiqc)
Annotation of mapped reads with overlaps of capture probes, exclusion regions, blacklist, restriction fragments
Removal of non-reporter slices and identification of reporters
Removal of PCR duplicates (exact coordinate matches)
Storage of reporters in cooler format <https://cooler.readthedocs.io/en/latest/datamodel.html>
Generation of bedgraphs/BigWigs.
Collation of run statistics and generation of a run report
Optional:
Generation of a UCSC track hub for visualisation.
Differential interaction identification.
Generation of subtraction bedgraphs for between condition comparisons
Plotting of heatmaps.
@authors: asmith, dsims
- ccanalyser.pipeline.pipeline.check_config()¶
Checks that all essential configuration has been provided.
- ccanalyser.pipeline.pipeline.modify_pipeline_params_dict()¶
Modifies P.PARAMS dictionary.
Selects the correct conda environment.
Ensures the correct queue manager is selected.
Corrects the name of a UCSC hub by removing spaces and incorrect characters.
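The hub name correction described above can be pictured with a minimal sketch. The function name sanitise_hub_name and the exact set of permitted characters are assumptions made for illustration; they are not taken from the ccanalyser source.

```python
import re


def sanitise_hub_name(name: str) -> str:
    """Return a hub name with whitespace replaced and unsafe characters removed.

    Assumed rule (hypothetical): runs of whitespace become a single
    underscore, then anything other than letters, digits, underscores
    and hyphens is dropped.
    """
    name = re.sub(r"\s+", "_", name.strip())
    return re.sub(r"[^A-Za-z0-9_\-]", "", name)


print(sanitise_hub_name("My Capture-C hub (v2)!"))
```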
- ccanalyser.pipeline.pipeline.set_up_chromsizes()¶
Ensures that genome chromsizes are present.
If chromsizes are not provided this function attempts to download them from UCSC. The P.PARAMS dictionary is updated with the location of the chromsizes.
- ccanalyser.pipeline.pipeline.check_user_supplied_paths()¶
- ccanalyser.pipeline.pipeline.genome_digest(infile, outfile)¶
In silico digestion of the genome to identify restriction fragment coordinates.
Runs ccanalyser genome digest.
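To illustrate what an in silico digest produces, the sketch below splits a sequence at each occurrence of a recognition site and reports 0-based, half-open fragment coordinates. It is not the ccanalyser genome digest implementation: the function name digest_sequence is hypothetical, GATC (the DpnII recognition site) is used only as an example, and cutting at the first base of the site is a simplification.

```python
import re
from typing import Iterator, Tuple


def digest_sequence(chrom: str, seq: str, site: str = "GATC") -> Iterator[Tuple[str, int, int]]:
    """Yield (chrom, start, end) for each fragment between recognition sites.

    Simplifying assumption: the cut is placed at the first base of every
    recognition site, so each fragment after the first begins with the site.
    """
    start = 0
    for match in re.finditer(re.escape(site), seq.upper()):
        cut = match.start()
        if cut > start:
            yield chrom, start, cut
        start = cut
    if start < len(seq):
        yield chrom, start, len(seq)


for fragment in digest_sequence("chr_example", "AAGATCTTTGATCCC"):
    print(*fragment, sep="\t")
```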
- ccanalyser.pipeline.pipeline.fastq_qc(infile, outfile)¶
Runs fastqc on the input files to generate fastq statistics.
- ccanalyser.pipeline.pipeline.fastq_multiqc(infile, outfile)¶
Collates fastqc reports into a single report using multiqc.
- ccanalyser.pipeline.pipeline.fastq_split(infiles, outfile)¶
Splits the input fastq files into chunks for parallel processing
Runs ccanalyser fastq split.
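The chunking idea can be sketched as follows, assuming standard four-line fastq records. The helper chunk_fastq and its reads_per_chunk parameter are hypothetical names; the real task delegates to ccanalyser fastq split, whose options are not shown here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of fastq lines, each holding at most reads_per_chunk records.

    Assumes every record occupies exactly four lines.
    """
    iterator = iter(lines)
    lines_per_chunk = reads_per_chunk * 4
    while True:
        chunk = list(islice(iterator, lines_per_chunk))
        if not chunk:
            return
        yield chunk


records = []
for i in range(3):
    records += [f"@read{i}", "ACGT", "+", "IIII"]

for part_no, chunk in enumerate(chunk_fastq(records, reads_per_chunk=2)):
    print(part_no, len(chunk) // 4, "reads")
```

Each chunk could then be written to its own file and processed independently, which is what allows the later tasks to run in parallel.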
- ccanalyser.pipeline.pipeline.fastq_duplicates_parse(infiles, outfile, sample_name, part_no)¶
Parses fastq files into json format for sequence deduplication.
- ccanalyser.pipeline.pipeline.fastq_duplicates_identify(infiles, outfile)¶
Identifies duplicate sequences from parsed fastq files in json format.
- ccanalyser.pipeline.pipeline.fastq_duplicates_remove(infiles, outfile)¶
Removes duplicate read fragments identified from parsed fastq files.
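The parse, identify and remove tasks above together implement deduplication by exact sequence match. A compact in-memory sketch of the same idea is given below; the function name deduplicate_pairs and the tuple layout are assumptions, and the pipeline itself works on partitioned files with a json intermediate rather than on a single list.

```python
from typing import Iterable, List, Tuple

ReadPair = Tuple[str, str, str]  # (read name, read 1 sequence, read 2 sequence)


def deduplicate_pairs(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep the first read pair seen for every distinct (seq1, seq2) combination."""
    seen = set()
    unique = []
    for name, seq1, seq2 in pairs:
        key = (seq1, seq2)
        if key in seen:
            continue  # exact sequence duplicate, treated as a PCR duplicate
        seen.add(key)
        unique.append((name, seq1, seq2))
    return unique


pairs = [
    ("read1", "ACGT", "TTGA"),
    ("read2", "ACGT", "TTGA"),  # duplicate of read1
    ("read3", "ACGT", "TTGC"),
]
print([name for name, _, _ in deduplicate_pairs(pairs)])
```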
- ccanalyser.pipeline.pipeline.stats_deduplication_collate(infiles, outfile)¶
Combines deduplication statistics from fastq file partitions.
- ccanalyser.pipeline.pipeline.fastq_trim(infiles, outfile)¶
Trim adaptor sequences from fastq files using trim_galore
- ccanalyser.pipeline.pipeline.stats_trim_collate(infiles, outfile)¶
Extracts and collates adapter trimming statistics from trim_galore output
- ccanalyser.pipeline.pipeline.fastq_flash(infiles, outfile)¶
Combine overlapping paired-end reads using FLASh
- ccanalyser.pipeline.pipeline.fastq_digest_combined(infile, outfile)¶
In silico restriction enzyme digest of combined (flashed) read pairs
- ccanalyser.pipeline.pipeline.fastq_digest_non_combined(infiles, outfile)¶
In silico restriction enzyme digest of non-combined (non-flashed) read pairs
- ccanalyser.pipeline.pipeline.stats_digestion_collate(infiles, outfile)¶
Aggregates in silico digestion statistics from fastq file partitions.
- ccanalyser.pipeline.pipeline.fastq_preprocessing()¶
- ccanalyser.pipeline.pipeline.fastq_alignment(infile, outfile)¶
Aligns in silico digested fastq files to the genome.
- ccanalyser.pipeline.pipeline.alignments_merge(infiles, outfile)¶
Combines bam files (by flashed/non-flashed status and sample).
This task simply provides an input for picard CollectAlignmentSummaryMetrics and is only used to provide overall mapping statistics. Fastq partitions are not combined at this stage.
- ccanalyser.pipeline.pipeline.alignments_index(infile, outfile)¶
Indexes all bam files (both partitioned and merged)
- ccanalyser.pipeline.pipeline.pre_annotation()¶
- ccanalyser.pipeline.pipeline.annotate_make_exclusion_bed(outfile)¶
Generates exclusion window around each capture site
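As a rough picture of an exclusion window, the sketch below extends each capture site by a fixed distance on both sides and clips the start at zero. The function name, the tuple layout and the exclusion_size default are hypothetical; the actual task may derive its windows differently.

```python
from typing import Iterable, List, Tuple

Viewpoint = Tuple[str, int, int, str]  # (chrom, start, end, name)


def make_exclusion_windows(viewpoints: Iterable[Viewpoint], exclusion_size: int = 1000) -> List[Viewpoint]:
    """Extend every capture site by exclusion_size bases on both sides."""
    windows = []
    for chrom, start, end, name in viewpoints:
        # Clip at zero so windows never start before the chromosome begins.
        windows.append((chrom, max(0, start - exclusion_size), end + exclusion_size, name))
    return windows


viewpoints = [("chr1", 500, 700, "geneA"), ("chr2", 5000, 5200, "geneB")]
for window in make_exclusion_windows(viewpoints):
    print(*window, sep="\t")
```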
- ccanalyser.pipeline.pipeline.annotate_sort_viewpoints(outfile)¶
Sorts the capture oligos for bedtools intersect with the -sorted option
- ccanalyser.pipeline.pipeline.annotate_sort_blacklist(outfile)¶
Sorts the blacklist regions for bedtools intersect with the -sorted option
- ccanalyser.pipeline.pipeline.annotate_alignments(infile, outfile)¶
Annotates mapped read slices.
- Slices are annotated with:
capture name
capture count
exclusion name
exclusion count
blacklist count
restriction fragment number
- ccanalyser.pipeline.pipeline.post_annotation()¶
Runs the pipeline until just prior to identification of reporters
- ccanalyser.pipeline.pipeline.alignments_filter(infiles, outfile)¶
Filters slices and outputs reporter slices for each capture site
- ccanalyser.pipeline.pipeline.reporters_collate(infiles, outfile, *grouping_args)¶
Concatenates identified reporters
- ccanalyser.pipeline.pipeline.alignments_deduplicate_fragments(infile, outfile, read_type)¶
Identifies duplicate fragments with the same coordinates and order.
- ccanalyser.pipeline.pipeline.alignments_deduplicate_slices(infile, outfile, sample_name, read_type, capture_oligo)¶
Removes reporters with duplicate coordinates
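Coordinate-based deduplication can be sketched as keeping one fragment per distinct ordered list of slice coordinates. The data layout below (a mapping from read name to ordered slices) and the function name unique_fragments are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

Slice = Tuple[str, int, int]  # (chrom, start, end)


def unique_fragments(fragments: Dict[str, List[Slice]]) -> List[str]:
    """Return names of reads whose ordered slice coordinates were not seen before."""
    seen = set()
    kept = []
    for read_name, slices in fragments.items():
        key = tuple(slices)  # order is part of the key, so A|B differs from B|A
        if key not in seen:
            seen.add(key)
            kept.append(read_name)
    return kept


fragments = {
    "read1": [("chr1", 100, 200), ("chr1", 5000, 5100)],
    "read2": [("chr1", 100, 200), ("chr1", 5000, 5100)],  # same coordinates and order
    "read3": [("chr1", 5000, 5100), ("chr1", 100, 200)],  # same coordinates, different order
}
print(unique_fragments(fragments))
```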
- ccanalyser.pipeline.pipeline.alignments_deduplicate_collate(infiles, outfile, *grouping_args)¶
Final collation of reporters by sample and capture probe
- ccanalyser.pipeline.pipeline.stats_alignment_filtering_collate(infiles, outfile)¶
Combination of all reporter identification and filtering statistics
- ccanalyser.pipeline.pipeline.post_ccanalyser_analysis()¶
Reporters have been identified, deduplicated and collated by sample/capture probe
- ccanalyser.pipeline.pipeline.reporters_count(infile, outfile)¶
Counts the number of interactions identified between reporter restriction fragments
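One way to picture the counting step: for every read, take each pair of restriction fragments it touches and increment a counter for that pair. The input layout and the function name count_fragment_pairs are hypothetical, and collapsing repeated fragment ids within a read is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List


def count_fragment_pairs(reads: Iterable[List[int]]) -> Counter:
    """Count co-occurrences of restriction fragment ids across reads.

    Pairs are stored with the smaller id first so (5, 12) and (12, 5)
    are tallied together.
    """
    counts = Counter()
    for frag_ids in reads:
        for a, b in combinations(sorted(set(frag_ids)), 2):
            counts[(a, b)] += 1
    return counts


reads = [[5, 12], [12, 5], [5, 12, 40]]
for pair, n in sorted(count_fragment_pairs(reads).items()):
    print(pair, n)
```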
- ccanalyser.pipeline.pipeline.reporters_store_restriction_fragment(infile, outfile, sample_name, capture_name)¶
Stores restriction fragment interaction counts in cooler format
- ccanalyser.pipeline.pipeline.generate_bin_conversion_tables(outfile)¶
Converts restriction fragments to genomic bins.
Binning restriction fragments into genomic bins takes a substantial amount of time and memory. To avoid repeatedly performing the same action, bin conversion tables are calculated once for each required resolution and then stored as a pickle file.
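The compute-once-then-cache pattern described above might look like the following sketch. The midpoint-based assignment of fragments to bins, the single-chromosome input and the function names are all assumptions; only the use of a pickle file as the cache comes from the description.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Tuple

Fragment = Tuple[int, int, int]  # (fragment id, start, end) on one chromosome


def fragment_to_bin_table(fragments: Iterable[Fragment], binsize: int) -> Dict[int, int]:
    """Map each fragment id to the bin containing its midpoint (assumed rule)."""
    return {frag_id: ((start + end) // 2) // binsize for frag_id, start, end in fragments}


def load_or_build_table(fragments: Iterable[Fragment], binsize: int, cache_path: Path) -> Dict[int, int]:
    """Load the conversion table from cache_path, building and storing it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    table = fragment_to_bin_table(fragments, binsize)
    with cache_path.open("wb") as handle:
        pickle.dump(table, handle)
    return table


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "bins_1000.pkl"
    frags = [(0, 0, 400), (1, 400, 1200), (2, 1200, 2500)]
    first = load_or_build_table(frags, 1000, cache)  # computed and written to disk
    second = load_or_build_table([], 1000, cache)    # read back from the cache
    print(first == second)
```

Keeping one cache file per resolution means a table is rebuilt only when a new bin size is requested.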
- ccanalyser.pipeline.pipeline.reporters_store_binned(infile, outfile, capture_name)¶
Converts a cooler file of restriction fragments to even genomic bins.
- ccanalyser.pipeline.pipeline.reporters_store_merged(infiles, outfile, sample_name)¶
Combines cooler files together
- ccanalyser.pipeline.pipeline.pipeline_merge_stats(infiles, outfile)¶
Generates a summary statistics file for the pipeline run.
- ccanalyser.pipeline.pipeline.pipeline_make_report(infile, outfile)¶
Runs a Jupyter notebook for reporting and plotting pipeline statistics
- ccanalyser.pipeline.pipeline.reporters_make_bedgraph(infile, outfile, sample_name)¶
Extract reporters in bedgraph format from stored interactions
- ccanalyser.pipeline.pipeline.reporters_make_bedgraph_normalised(infile, outfile, sample_name)¶
Extract reporters in bedgraph format from stored interactions.
In addition to generating a bedgraph this task also normalises the counts by the number of cis interactions identified to enable cross sample comparisons.
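The normalisation can be sketched as dividing each count by the total number of cis interactions (those on the viewpoint chromosome) and multiplying by a scale factor. The scale value, the row layout and the function name are hypothetical choices made for this example rather than the pipeline's actual settings.

```python
from typing import Iterable, List, Tuple

BedgraphRow = Tuple[str, int, int, float]  # (chrom, start, end, count)


def normalise_by_cis(rows: Iterable[BedgraphRow], viewpoint_chrom: str, scale: float = 1e6) -> List[BedgraphRow]:
    """Scale counts by the number of interactions on the viewpoint chromosome."""
    rows = list(rows)
    n_cis = sum(count for chrom, _, _, count in rows if chrom == viewpoint_chrom)
    if n_cis == 0:
        raise ValueError("no cis interactions found; cannot normalise")
    return [(chrom, start, end, count / n_cis * scale) for chrom, start, end, count in rows]


rows = [("chr1", 0, 100, 30), ("chr1", 100, 200, 10), ("chr2", 0, 100, 10)]
for row in normalise_by_cis(rows, viewpoint_chrom="chr1", scale=1000):
    print(*row, sep="\t")
```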
- ccanalyser.pipeline.pipeline.reporters_make_union_bedgraph(infiles, outfile, normalisation_type, capture_name)¶
Collates bedgraphs by capture probe into a single file for comparison.
See bedtools unionbedg for more details.
- ccanalyser.pipeline.pipeline.reporters_make_comparison_bedgraph(infile, outfile, viewpoint)¶
- ccanalyser.pipeline.pipeline.reporters_make_bigwig(infile, outfile)¶
Uses UCSC tools bedGraphToBigWig to generate bigWigs for each bedgraph
- ccanalyser.pipeline.pipeline.viewpoints_to_bigbed(infile, outfile)¶
- ccanalyser.pipeline.pipeline.hub_make(infiles, outfile)¶
Creates a UCSC hub from the pipeline output
- ccanalyser.pipeline.pipeline.identify_differential_interactions(infile, outfile, capture_name)¶
- ccanalyser.pipeline.pipeline.reporters_plot_heatmap(infile, outfile)¶
Plots a heatmap over a specified region
- ccanalyser.pipeline.pipeline.full(outfile)¶