Pipeline

The main feature of ccanalyser is its end-to-end data processing pipeline, written using the cgat-core workflow management system. The following diagram illustrates the steps the pipeline performs:

Pipeline flow diagram

This section provides further details on how to run the pipeline. In essence, the pipeline requires a working directory containing correctly named FASTQ files and a config.yml file that provides the pipeline configuration.

Step 1 - Create a working directory

To run the pipeline, first create a working directory for the run:

mkdir RS411_EPZ5676/
cd RS411_EPZ5676/

The pipeline will be executed here and all files will be generated in this directory.

Step 2 - Edit a copy of config.yml

The configuration file config.yml parameterises the pipeline run with user-specific settings. It also provides paths to files that are essential for the run (e.g., bowtie2 indices). These paths do not have to be in the same directory as the pipeline.

Warning

The YAML file must be named config.yml for the pipeline to recognise it and run correctly.

A copy of config.yml can be downloaded from GitHub using:

wget https://raw.githubusercontent.com/sims-lab/capture-c/master/config.yml
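
By default, wget does not overwrite an existing file. It saves a repeat download under a suffixed name such as config.yml.1, which the pipeline would not recognise. A minimal sketch of a guarded download is shown below. The function name fetch_config and the temporary file name are hypothetical and not part of ccanalyser; the URL is the one given above.

```shell
# URL of the template configuration file, as given in this section.
CONFIG_URL="https://raw.githubusercontent.com/sims-lab/capture-c/master/config.yml"

# Hypothetical helper: download config.yml only if the working
# directory does not already contain one, so an edited copy is
# never replaced and no config.yml.1 is created.
fetch_config() {
    if [ -e config.yml ]; then
        echo "config.yml already exists, leaving it untouched"
        return 0
    fi
    # Download to a temporary name first so that a failed download
    # does not leave an empty or partial config.yml behind.
    wget -O config.yml.download "$CONFIG_URL" \
        && mv config.yml.download config.yml
}
```

Calling fetch_config from inside the working directory either downloads a fresh copy or reports that one is already present.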

This YAML file can be edited with any standard text editor, e.g.:

# To use gedit
gedit config.yml

# To use nano
nano config.yml

Step 4 - Running the pipeline

After copying or linking the FASTQ files into the working directory and configuring the copy of config.yml there for the current experiment, the pipeline can be run with:

ccanalyser pipeline
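
Before launching, it can help to confirm that the working directory contains what the pipeline expects. The sketch below is a hypothetical helper, not part of ccanalyser. It assumes that FASTQ files end in .fastq or .fastq.gz, and it does not check the file naming convention the pipeline requires.

```shell
# Hypothetical pre-flight check for a pipeline working directory.
# Usage: check_workdir [DIRECTORY]   (defaults to the current directory)
check_workdir() {
    cw_dir="${1:-.}"

    # The configuration file must be named exactly config.yml.
    if [ ! -f "$cw_dir/config.yml" ]; then
        echo "missing config.yml in $cw_dir" >&2
        return 1
    fi

    # Count FASTQ files; copied files and working symlinks both count.
    # The two extensions are an assumption, adjust as needed.
    cw_count=0
    for cw_file in "$cw_dir"/*.fastq "$cw_dir"/*.fastq.gz; do
        if [ -e "$cw_file" ]; then
            cw_count=$((cw_count + 1))
        fi
    done

    if [ "$cw_count" -eq 0 ]; then
        echo "no FASTQ files found in $cw_dir" >&2
        return 1
    fi

    echo "ok: config.yml and $cw_count FASTQ file(s) found"
}
```

Running check_workdir in the working directory returns a non-zero status if either requirement is not met, so it can be chained in front of the pipeline command with &&.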

Before starting a run, the tasks that the pipeline will perform can be examined with:

# Shows the tasks to be performed
ccanalyser pipeline show

# Plots a directed graph using graphviz
ccanalyser pipeline plot

If you are happy with the tasks to be performed, the full pipeline run can be launched with:

# If using all default settings and using a cluster
ccanalyser pipeline make

# If not using a cluster, run in local mode.
ccanalyser pipeline make --local -p 4

# Keep the run going if the network connection drops
nohup ccanalyser pipeline make &
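
When nohup is started from a terminal without redirection, it appends the command's output to nohup.out in the current directory. A sketch that instead sends the output to a named log file and records the process ID is shown below. The helper run_detached and the file names are hypothetical choices, not part of ccanalyser or cgat-core.

```shell
# Hypothetical helper: run a command under nohup in the background,
# write its stdout and stderr to a log file, and save its process ID
# next to the log.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
    rd_log="$1"
    shift
    nohup "$@" > "$rd_log" 2>&1 &
    # $! holds the process ID of the most recent background job.
    echo "$!" > "${rd_log%.log}.pid"
}

# Example (commented out so the block can be sourced safely):
# run_detached pipeline_run.log ccanalyser pipeline make
# tail -f pipeline_run.log             # follow progress
# kill "$(cat pipeline_run.pid)"       # stop the run if needed
```

Keeping the process ID in a file makes it possible to check on or stop the run after logging back in.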

See the cgat-core documentation on Read the Docs for additional information.

Step 5 - Running the pipeline to a specified stage

Several stopping points are currently built into the pipeline at key stages:

  • fastq_preprocessing - Stops after in silico digestion of FASTQ files.

  • pre_annotation - Stops before aligned slices are ready to be annotated.

  • post_annotation - Stops after aligned slices have been annotated.

  • post_ccanalyser_analysis - Stops after reporters have been identified and duplicate filtered.

  • full - Runs the pipeline until all required tasks are complete.

To run the pipeline until one of these stopping points, use:

# Run until the TASK_NAME stopping point
ccanalyser pipeline make TASK_NAME
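
A mistyped stopping point is easy to catch before the pipeline starts. The sketch below checks a name against the five stopping points listed above and only then calls the pipeline. Both function names are hypothetical, and the check covers only the names documented in this section; the pipeline itself may accept other task names.

```shell
# Hypothetical helper: succeed only for the stopping points
# documented in this section.
is_valid_stage() {
    case "$1" in
        fastq_preprocessing|pre_annotation|post_annotation|post_ccanalyser_analysis|full)
            return 0
            ;;
        *)
            return 1
            ;;
    esac
}

# Hypothetical wrapper: validate the name, then run the pipeline
# up to that stopping point.
run_to_stage() {
    if ! is_valid_stage "$1"; then
        echo "unknown stopping point: $1" >&2
        return 1
    fi
    ccanalyser pipeline make "$1"
}

# Example (commented out so the block can be sourced safely):
# run_to_stage post_annotation
```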

Pipeline outputs