CCanalyser CLI Modules¶

alignments annotate¶

ccanalyser.cli.alignments_annotate.cycle_argument(arg)¶

Allows the same argument to be stated once but repeated for all files.

ccanalyser.cli.alignments_annotate.remove_duplicates_from_bed(bed: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame]) → pybedtools.bedtool.BedTool¶

Simple removal of duplicated entries from bed file.

If a “score” field is present a higher scored entry is prioritised.

Parameters: bed (Union[str, BedTool, pd.DataFrame]) – Bed object to deduplicate
Returns: BedTool with deduplicated names
Return type: BedTool
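
The deduplication rule described above can be illustrated with a minimal standard-library sketch. It is not the ccanalyser implementation: the record layout and the function name `dedup_by_name` are hypothetical, and it assumes that "duplicated" means "sharing the same name field".

```python
from typing import Dict, List, Tuple

# Hypothetical BED-like record: (chrom, start, end, name, score).
BedRecord = Tuple[str, int, int, str, float]


def dedup_by_name(records: List[BedRecord]) -> List[BedRecord]:
    """Keep one record per name, preferring the highest score."""
    best: Dict[str, BedRecord] = {}
    for rec in records:
        name, score = rec[3], rec[4]
        # Replace the stored record only if this one scores strictly higher.
        if name not in best or score > best[name][4]:
            best[name] = rec
    return list(best.values())


records = [
    ("chr1", 100, 200, "read1", 10.0),
    ("chr2", 500, 600, "read1", 30.0),
    ("chr1", 300, 400, "read2", 5.0),
]
deduplicated = dedup_by_name(records)
```

Here the two "read1" entries collapse to the higher-scoring chr2 record, and "read2" is kept unchanged.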

ccanalyser.cli.alignments_annotate.annotate(slices: os.PathLike, actions: Optional[Tuple] = None, bed_files: Optional[Tuple] = None, names: Optional[Tuple] = None, overlap_fractions: Optional[Tuple] = None, output: Optional[os.PathLike] = None, duplicates: str = 'remove', n_cores: int = 8, invalid_bed_action: str = 'error')¶

Annotates a bed file with other bed files using bedtools intersect.

Whilst bedtools intersect allows for interval names and counts to be used for annotating intervals, this command provides the ability to annotate intervals with both interval names and counts at the same time. As the pipeline allows for empty bed files, this command has built-in support for blank/malformed bed files and will return default N/A values for them.

Prior to interval annotation, the bed file to be intersected is validated and duplicate entries/multimapping reads are removed to ensure consistent annotations and prevent issues with reporter identification.

Parameters

slices (os.PathLike) – Input bed file.
actions (Tuple, optional) – Methods to use for annotation. Choose from (get|count). Defaults to None.
bed_files (Tuple, optional) – Bed files to intersect with the bed file to be annotated. Defaults to None.
names (Tuple, optional) – Column names for output tsv file. Defaults to None.
overlap_fractions (Tuple, optional) – Minimum overlap fractions required to call an intersection. Defaults to None.
output (os.PathLike, optional) – Output file path for annotated .tsv file. Defaults to None.
duplicates (str, optional) – Method to deal with multimapping reads/duplicate bed names. Currently, “remove” is the only supported option. Defaults to “remove”.
n_cores (int, optional) – Number of cores to use for intersection. Bed files are split by chromosome for faster intersection. Defaults to 8.
invalid_bed_action (str, optional) – Action to take for invalid bed files. Choose from (ignore|error). Invalid files can be skipped by setting this to “ignore”. Defaults to ‘error’.

Raises

NotImplementedError – Only supported option for duplicate bed names is remove.
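
As a rough illustration of annotating an interval with both names ("get") and counts ("count") at once, the sketch below uses plain Python rather than bedtools. The tuple layout, the function names, the `"."` placeholder for missing values and the default minimum fraction are all assumptions made for this example, not ccanalyser behaviour.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int, str]  # (chrom, start, end, name)


def overlap_fraction(a: Interval, b: Interval) -> float:
    """Fraction of interval `a` covered by `b` (0.0 on different chromosomes)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    return max(overlap, 0) / (a[2] - a[1])


def annotate_slice(slice_: Interval, features: List[Interval],
                   min_fraction: float = 1e-9, na_value: str = "."):
    """Return (joined feature names, number of overlapping features)."""
    hits = [f[3] for f in features if overlap_fraction(slice_, f) >= min_fraction]
    return (",".join(hits) if hits else na_value), len(hits)


slice_ = ("chr1", 100, 200, "read1|0")
features = [
    ("chr1", 150, 250, "probe_A"),  # covers 50% of the slice
    ("chr1", 300, 400, "probe_B"),  # no overlap
    ("chr1", 90, 110, "probe_C"),   # covers 10% of the slice
]
names, count = annotate_slice(slice_, features)
```

Raising `min_fraction` to 0.2 in this example would drop "probe_C", which mirrors the role of the overlap_fractions parameter.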

alignments deduplicate¶

ccanalyser.cli.alignments_deduplicate.identify(fragments_fn: os.PathLike, output: os.PathLike = 'duplicated_ids.json', buffer: int = 1000000.0, read_type: str = 'flashed')¶

Identifies aligned fragments with duplicate coordinates.

Parses a tsv file containing filtered aligned fragments and generates a dictionary containing the hashed parental read id and hashed genomic coordinates of all slices. Duplicated fragments are implicitly removed if they share the same genomic coordinate hash.

For non-combined reads (pe) a genomic coordinate hash is generated from the start of the first slice and the end of the last slice. This is due to the decreased confidence in the quality of the centre of the fragment. The coordinate hash for combined reads (flashed) is generated directly from the fragment coordinates. Only fragments with exactly the same coordinates and slice order are considered to be duplicates.

Identified duplicate fragments are output in json format to be used by the “remove” subcommand.

Parameters

fragments_fn (os.PathLike) – Input fragments.tsv file to process.
output (os.PathLike, optional) – Output path for duplicated parental read ids. Defaults to “duplicated_ids.json”.
buffer (int, optional) – Number of fragments to process in memory. Defaults to 1e6.
read_type (str, optional) – Process combined(flashed) or non-combined reads (pe). Due to the low confidence in the quality of pe reads, duplicates are identified by removing any fragments with matching start and end coordinates. Defaults to “flashed”.
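
The difference between the two coordinate hashes can be sketched as follows. This is an illustrative approximation only: the key format, the use of SHA-256 and the helper names `coordinate_key` and `find_duplicates` are assumptions, not the hashing used by ccanalyser.

```python
import hashlib
from typing import Dict, List, Set, Tuple

Slice = Tuple[str, int, int]  # (chrom, start, end)


def coordinate_key(slices: List[Slice], read_type: str = "flashed") -> str:
    """Build a hashed coordinate key for one fragment."""
    if read_type == "flashed":
        # Combined reads: every slice coordinate, in order, contributes.
        text = "|".join(f"{c}:{s}-{e}" for c, s, e in slices)
    else:
        # Non-combined (pe) reads: only the outermost coordinates are used.
        first, last = slices[0], slices[-1]
        text = f"{first[0]}:{first[1]}|{last[0]}:{last[2]}"
    return hashlib.sha256(text.encode()).hexdigest()


def find_duplicates(fragments: Dict[str, List[Slice]], read_type: str) -> Set[str]:
    """Return ids of fragments whose key was already seen (first one is kept)."""
    seen: Set[str] = set()
    duplicates: Set[str] = set()
    for read_id, slices in fragments.items():
        key = coordinate_key(slices, read_type)
        if key in seen:
            duplicates.add(read_id)
        else:
            seen.add(key)
    return duplicates


fragments = {
    "r1": [("chr1", 100, 200), ("chr1", 5000, 5100)],
    "r2": [("chr1", 100, 200), ("chr1", 5000, 5100)],
    "r3": [("chr1", 100, 180), ("chr1", 5020, 5100)],
}
flashed_duplicates = find_duplicates(fragments, "flashed")
pe_duplicates = find_duplicates(fragments, "pe")
```

With these inputs "r3" is a duplicate only under the pe rule, because its inner slice boundaries differ while its outermost coordinates match "r1".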

ccanalyser.cli.alignments_deduplicate.remove(slices_fn: os.PathLike, duplicated_ids: os.PathLike, output: os.PathLike = 'dedup.slices.tsv.gz', buffer: int = 5000000.0, sample_name: str = '', read_type: str = '', stats_prefix: os.PathLike = '')¶

Removes duplicated aligned fragments.

Parses a tsv file containing aligned read slices and outputs only slices from unique fragments. Duplicated parental read ids determined by the “identify” subcommand are located within the slices tsv file and removed.

Outputs statistics for the number of unique slices and the number of duplicate slices identified.

Parameters

slices_fn (os.PathLike) – Input slices.tsv file.
duplicated_ids (os.PathLike) – Duplicated parental read ids in json format.
output (os.PathLike, optional) – Output file path for deduplicated slices. Defaults to “dedup.slices.tsv.gz”.
buffer (int, optional) – Number of slices to process in memory. Defaults to 5e6.
sample_name (str, optional) – Name of sample being processed e.g. DOX-treated_1 used for statistics. Defaults to “”.
read_type (str, optional) – Process combined(flashed) or non-combined reads (pe) used for statistics. Defaults to “”.
stats_prefix (os.PathLike, optional) – Output path for deduplication statistics. Defaults to “”.

alignments filter¶

ccanalyser.cli.alignments_filter.merge_annotations(df: pandas.core.frame.DataFrame, annotations: os.PathLike) → pandas.core.frame.DataFrame¶

Combines annotations with the parsed bam file output.

Uses pandas outer join on the indexes to merge annotations e.g. number of capture probe overlaps.

Annotation tsv must have the index as the first column and this index must have intersecting keys with the first dataframe’s index.

Parameters

df (pd.DataFrame) – Dataframe to merge with annotations
annotations (os.PathLike) – Filename of .tsv to read and merge with df

Returns

Merged dataframe

Return type

pd.DataFrame

ccanalyser.cli.alignments_filter.filter(bam: os.PathLike, annotations: os.PathLike, custom_filtering: os.PathLike = None, output_prefix: os.PathLike = 'reporters', stats_prefix: os.PathLike = '', method: str = 'capture', sample_name: str = '', read_type: str = '', gzip: bool = False)¶

Removes unwanted aligned slices and identifies reporters.

Parses a BAM file and merges this with a supplied annotation to identify unwanted slices. Filtering can be tuned for Capture-C, Tri-C and Tiled-C data to ensure optimal filtering.

Common filters include:

Removal of unmapped slices
Removal of excluded/blacklisted slices
Removal of non-capture fragments
Removal of multi-capture fragments
Removal of non-reporter fragments
Removal of fragments with duplicated coordinates.

For specific filtering for each of the three methods see:

In addition to outputting valid reporter fragments and slices separated by capture probe, this script also provides statistics on the number of reads/slices filtered at each stage, and the number of cis/trans reporters for each probe.

Notes

Whilst the script is capable of processing any annotations in tsv format, provided that the correct columns are present, it is highly recommended that the “annotate” subcommand is used to generate this file.

Slice filtering is currently hard coded into each filtering class. This will be modified in a future update to enable custom filtering orders.

Parameters

bam (os.PathLike) – Input bam file to analyse
annotations (os.PathLike) – Annotations file generated by slices-annotate
custom_filtering (os.PathLike) – Allows for custom filtering to be performed. A yaml file is used to supply this ordering.
output_prefix (os.PathLike, optional) – Output file prefix. Defaults to “reporters”.
stats_prefix (os.PathLike, optional) – Output stats prefix. Defaults to “”.
method (str, optional) – Analysis method. Choose from (capture|tri|tiled). Defaults to “capture”.
sample_name (str, optional) – Sample being processed e.g. DOX-treated_1. Defaults to “”.
read_type (str, optional) – Process combined(flashed) or non-combined reads (pe) used for statistics. Defaults to “”.
gzip (bool, optional) – Compress output with gzip. Defaults to False.

fastq deduplicate¶

Created on Fri Oct 4 13:47:20 2019 @author: asmith

ccanalyser.cli.fastq_deduplicate.parse(input_files: Tuple, output: os.PathLike = 'out.json', read_buffer: int = 100000.0)¶

Parses fastq file(s) into easy to deduplicate format.

This command parses one or more fastq files and generates a dictionary containing hashed read identifiers together with hashed concatenated sequences. The hash dictionary is output in json format and the identify subcommand can be used to determine which read identifiers have duplicate sequences.

Parameters

input_files (Tuple) – One or more fastq files to process
output (os.PathLike, optional) – Output for parsed read identifiers and sequences. Defaults to “out.json”.
read_buffer (int, optional) – Number of reads to process before outputting to file. Defaults to 1e5.

ccanalyser.cli.fastq_deduplicate.identify(input_files: Tuple, output: os.PathLike = 'duplicates.json')¶

Identifies fragments with duplicated sequences.

Merges the hashed dictionaries (in json format) generated by the “parse” subcommand and identifies reads with exactly the same sequence (i.e. sharing an identical hash). Duplicated read identifiers (hashed) are output in json format. The “remove” subcommand uses this dictionary to remove duplicates from fastq files.

Parameters

input_files (Tuple) – Paths to json files containing dictionaries with hashed read ids as the keys and hashed sequences as the values.
output (os.PathLike, optional) – Duplicate read ids identified. Defaults to “duplicates.json”.
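
The parse and identify steps can be pictured with the sketch below. It is a simplified stand-in, assuming truncated SHA-256 digests and in-memory tuples instead of fastq and json files; the helper names `short_hash`, `parse_reads` and `identify_duplicates` are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def short_hash(text: str) -> str:
    """Hypothetical helper: a truncated SHA-256 digest used as a compact key."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]


def parse_reads(reads: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Map hashed read id -> hash of the concatenated R1 + R2 sequences."""
    return {short_hash(rid): short_hash(r1 + r2) for rid, r1, r2 in reads}


def identify_duplicates(parsed: Dict[str, str]) -> List[str]:
    """Return hashed ids whose sequence hash was already seen."""
    seen = set()
    duplicates = []
    for id_hash, seq_hash in parsed.items():
        if seq_hash in seen:
            duplicates.append(id_hash)
        else:
            seen.add(seq_hash)
    return duplicates


reads = [
    ("read1", "ACGTACGT", "TTGACCAA"),
    ("read2", "ACGTACGT", "TTGACCAA"),  # same sequences as read1
    ("read3", "ACGTACGA", "TTGACCAA"),
]
parsed = parse_reads(reads)
duplicates = identify_duplicates(parsed)
```

Only the first read carrying a given sequence hash is retained, so "read2" is reported as the duplicate here.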

ccanalyser.cli.fastq_deduplicate.remove(input_files: Tuple, duplicated_ids: os.PathLike, read_buffer: int = 100000.0, output_prefix: os.PathLike = '', gzip: bool = False, compression_level: int = 5, sample_name: str = '', stats_prefix: os.PathLike = '')¶

Removes fragments with duplicated sequences from fastq files.

Parses input fastq files and removes any duplicates from the fastq file(s) that are present in the json file supplied. This json dictionary should be produced by the “identify” subcommand.

Statistics for the number of duplicated and unique reads are also provided.

Parameters

input_files (Tuple) – Input fastq files (in the same order as used for the parse command).
duplicated_ids (os.PathLike) – Duplicated read ids from identify command (hashed and in json format).
read_buffer (int, optional) – Number of reads to process before writing to file. Defaults to 1e5.
output_prefix (os.PathLike, optional) – Deduplicated fastq output prefix. Defaults to “”.
gzip (bool, optional) – Determines if output is gzip compressed using pigz. Defaults to False.
compression_level (int, optional) – Level of compression if required (1-9). Defaults to 5.
sample_name (str, optional) – Name of sample processed e.g. DOX-treated_1. Defaults to “”.
stats_prefix (os.PathLike, optional) – Output prefix for statistics. Defaults to “”.

fastq digest¶

ccanalyser.cli.fastq_digest.collate_statistics(statq: multiprocessing.context.BaseContext.Queue, n_subprocesses: int) → pandas.core.frame.DataFrame¶

Collates digestion statistics from supplied statistics queue.

Parameters

statq (Queue) – Queue to use for collating statistics. Final item(s) must be ‘END’
n_subprocesses (int) – Number of digestion processes used. Required to know when to stop acquiring data from the queue.

Returns

Digestion statistics in histogram format. Columns: ‘read_type’, ‘read_number’, ‘unfiltered/filtered’, ‘n_slices’, ‘n_reads’

Return type

pd.DataFrame

ccanalyser.cli.fastq_digest.digest(input_fastq: Tuple, restriction_enzyme: str, mode: str = 'pe', output_file: os.PathLike = 'out.fastq.gz', minimum_slice_length: int = 18, compression_level: int = 5, n_cores: int = 1, read_buffer: int = 100000, stats_prefix: os.PathLike = '', keep_cutsite: bool = False, sample_name: str = '')¶

Performs in silico digestion of one or a pair of fastq files.

Parameters

input_fastq (Tuple) – Input fastq files to process
restriction_enzyme (str) – Restriction enzyme name or site to use for digestion.
mode (str, optional) – Digest combined(flashed) or non-combined(pe). Undigested pe reads are output but flashed are not written. Defaults to “pe”.
output_file (os.PathLike, optional) – Output fastq file path. Defaults to “out.fastq.gz”.
minimum_slice_length (int, optional) – Minimum allowed length for in silico digested reads. Defaults to 18.
compression_level (int, optional) – Compression level for gzip output (1-9). Defaults to 5.
n_cores (int, optional) – Number of digestion processes to use. Defaults to 1.
read_buffer (int, optional) – Number of reads to process before writing to file. Defaults to 100000.
stats_prefix (os.PathLike, optional) – Output prefix for stats file. Defaults to “”.
keep_cutsite (bool, optional) – Determines if cutsite is removed from the output. Defaults to False.
sample_name (str, optional) – Name of sample processed e.g. DOX-treated_1. Defaults to ‘’.
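
A minimal sketch of in silico digestion of a single read is given below. It assumes the recognition site is supplied as a plain sequence and only covers the case where the cut site is removed (keep_cutsite=False); the function name `digest_read` is hypothetical and this is not the ccanalyser implementation.

```python
from typing import List


def digest_read(sequence: str, recognition_site: str,
                minimum_slice_length: int = 18) -> List[str]:
    """Split a read at every recognition site and drop slices that are too short.

    The recognition site itself is discarded from the output slices.
    """
    slices = sequence.upper().split(recognition_site.upper())
    return [s for s in slices if len(s) >= minimum_slice_length]


# Example read with two GATC sites; the middle slice is only 10 bp long.
read = "A" * 20 + "GATC" + "C" * 10 + "GATC" + "T" * 25
slices = digest_read(read, "GATC")
```

The 10 bp middle slice falls below the default minimum length of 18 and is filtered out, leaving the two flanking slices.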

fastq split¶

Created on Wed Jan 8 15:45:09 2020

@author: asmith

Script splits a fastq into specified chunks

ccanalyser.cli.fastq_split.run_unix_split(fn: os.PathLike, n_reads: int, read_number: int, output_prefix: os.PathLike = '', gzip: bool = False, compression_level: int = 5)¶

ccanalyser.cli.fastq_split.split(input_files: Tuple, method: str = 'unix', output_prefix: os.PathLike = 'split', compression_level: int = 5, n_reads: int = 1000000, gzip: bool = True)¶

Splits fastq file(s) into equal chunks of n reads.

Parameters

input_files (Tuple) – Input fastq files to process.
method (str, optional) – Method used to split the fastq files: python or unix (faster, but not guaranteed to maintain read pairings). Defaults to “unix”.
output_prefix (os.PathLike, optional) – Output prefix for split fastq files. Defaults to “split”.
compression_level (int, optional) – Compression level for gzipped output. Defaults to 5.
n_reads (int, optional) – Number of reads in each chunk that the input fastq files are split into. Defaults to 1000000.
gzip (bool, optional) – Gzip compress output files if True. Defaults to True.
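
Splitting into chunks of n reads amounts to grouping lines in multiples of four, since each fastq record spans four lines. The generator below is a hedged, pure-Python sketch of that idea; `chunk_fastq_lines` is a hypothetical name and the example works on an in-memory list rather than files.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq_lines(lines: Iterable[str], n_reads: int) -> Iterator[List[str]]:
    """Yield lists of fastq lines holding at most n_reads records each."""
    iterator = iter(lines)
    while True:
        # One fastq record is four lines: header, sequence, '+', qualities.
        chunk = list(islice(iterator, n_reads * 4))
        if not chunk:
            return
        yield chunk


# Five toy records split into chunks of two reads.
fastq_lines = []
for i in range(5):
    fastq_lines += [f"@read{i}", "ACGT", "+", "IIII"]

chunks = list(chunk_fastq_lines(fastq_lines, n_reads=2))
chunk_sizes = [len(chunk) // 4 for chunk in chunks]
```

The final chunk simply holds whatever records remain, so five reads in chunks of two give sizes 2, 2 and 1.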

genome digest¶

Created on Fri Oct 4 13:47:20 2019 @author: asmith

Script generates a bed file of restriction fragment locations in a given genome.

ccanalyser.cli.genome_digest.parse_chromosomes(fasta: pysam.libcfaidx.FastxFile) → Iterator[pysam.libcfaidx.FastqProxy]¶

Parses a whole genome fasta file and yields chromosome entries.

Parameters: fasta (pysam.FastxFile) – Fasta file to process.
Yields: Iterator[pysam.FastqProxy] – Chromosome entry.

ccanalyser.cli.genome_digest.digest(input_fasta: os.PathLike, recognition_site: str, logfile: os.PathLike = 'genome_digest.log', output_file: os.PathLike = 'genome_digest.bed', remove_cutsite: bool = True, sort=False)¶

Performs in silico digestion of a genome in fasta format.

Digests the supplied genome fasta file and generates a bed file containing the locations of all restriction fragments produced by the supplied restriction enzyme.

A log file recording the number of restriction fragments for the supplied genome is also generated.

Parameters

input_fasta (os.PathLike) – Path to fasta file containing whole genome sequence, split by chromosome
recognition_site (str) – Restriction enzyme name or sequence of the recognition site.
logfile (os.PathLike, optional) – Output path of the digestion logfile. Defaults to genome_digest.log.
output_file (os.PathLike, optional) – Output path for digested chromosome bed file. Defaults to genome_digest.bed.
remove_cutsite (bool, optional) – Determines if restriction site is removed. Defaults to True.
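
How restriction fragment coordinates might be derived from a chromosome sequence is sketched below. It assumes the recognition site is given as a literal sequence and that the site is excluded from the fragments (the remove_cutsite=True case); `digest_chromosome` and the tuple layout are hypothetical and do not reflect the actual ccanalyser code.

```python
import re
from typing import Iterator, Tuple

Fragment = Tuple[str, int, int, int]  # (chrom, start, end, fragment index)


def digest_chromosome(name: str, sequence: str, recognition_site: str) -> Iterator[Fragment]:
    """Yield 0-based, half-open fragment coordinates with cut sites removed."""
    start = 0
    index = 0
    pattern = re.escape(recognition_site.upper())
    for match in re.finditer(pattern, sequence.upper()):
        if match.start() > start:  # skip empty fragments between adjacent sites
            yield (name, start, match.start(), index)
            index += 1
        start = match.end()
    if start < len(sequence):
        yield (name, start, len(sequence), index)


# Toy chromosome: 4 bp, GATC, 5 bp, GATC, 2 bp.
chrom_seq = "AAAA" + "GATC" + "CCCCC" + "GATC" + "TT"
fragments = list(digest_chromosome("chr1", chrom_seq, "GATC"))
```

Each tuple maps directly onto a bed line (chromosome, start, end, name), which is the shape of output described for this command.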

reporters count¶

ccanalyser.cli.reporters_count.count_re_site_combinations(groups: pandas.core.groupby.groupby.GroupBy, column: str = 'restriction_fragment', counts: Optional[collections.defaultdict] = None) → collections.defaultdict¶

Counts the number of unique combinations between groups in a column.

Parameters

groups (pd.core.groupby.GroupBy) – Aggregated dataframe for processing.
column (str, optional) – Column to examine for unique combinations per group. Defaults to “restriction_fragment”.
counts (defaultdict, optional) – defaultdict(int) containing previous counts. Defaults to None.

Returns

defaultdict(int) containing the count of unique interactions.

Return type

defaultdict
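
The counting idea can be sketched without pandas as below. The assumption here is that every unordered pair of distinct restriction fragments seen within one parental fragment counts as one interaction; the function name `count_combinations` and the dict-of-lists input are hypothetical simplifications of the GroupBy input.

```python
from collections import defaultdict
from itertools import combinations
from typing import DefaultDict, Dict, List, Optional, Tuple


def count_combinations(
    groups: Dict[str, List[int]],
    counts: Optional[DefaultDict[Tuple[int, int], int]] = None,
) -> DefaultDict[Tuple[int, int], int]:
    """Count unordered restriction fragment pairs observed within each group."""
    if counts is None:
        counts = defaultdict(int)
    for rf_ids in groups.values():
        # Sort and deduplicate so (5, 9) and (9, 5) are the same key.
        for pair in combinations(sorted(set(rf_ids)), 2):
            counts[pair] += 1
    return counts


groups = {
    "read1": [5, 9],
    "read2": [9, 5, 12],
    "read3": [5],  # a single fragment yields no pairs
}
counts = count_combinations(groups)
```

Passing an existing `counts` object back in lets the tally accumulate across batches, which matches the purpose of the optional counts parameter.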

ccanalyser.cli.reporters_count.count(reporters: os.PathLike, output: os.PathLike = 'counts.tsv', remove_exclusions: bool = False, remove_capture: bool = False, subsample: int = 0)¶

Determines the number of captured restriction fragment interactions genome wide.

Parses a reporter slices tsv and counts the number of unique restriction fragment interaction combinations that occur within each fragment.

Options to ignore unwanted counts e.g. excluded regions or capture fragments are provided. In addition the number of reporter fragments can be subsampled if required.

Parameters

reporters (os.PathLike) – Reporter tsv file path.
output (os.PathLike, optional) – Output file path for interaction counts tsv. Defaults to ‘counts.tsv’.
remove_exclusions (bool, optional) – Removes regions marked as capture proximity exclusions before counting. Defaults to False.
remove_capture (bool, optional) – Removes all capture fragments before counting. Defaults to False.
subsample (int, optional) – Subsamples the fragments by the specified fraction. Defaults to 0 i.e. No subsampling.

reporters differential¶

ccanalyser.cli.reporters_differential.get_chromosome_from_name(df: pandas.core.frame.DataFrame, name: str)¶

ccanalyser.cli.reporters_differential.differential(union_bedgraph: os.PathLike, capture_name: str, capture_viewpoints: os.PathLike, output_prefix: os.PathLike = 'differential', design_matrix: Optional[os.PathLike] = None, grouping_col: str = 'condition', threshold_count: float = 20, threshold_q: float = 0.05, threshold_mean: float = 0)¶

Identifies differential interactions between conditions.

Parses a union bedgraph containing reporter counts from at least two conditions with two or more replicates for a single capture probe and outputs differential interaction results. Following filtering to ensure that the number of interactions is above the required threshold (--threshold_count), diffxpy is used to run a Wald test after fitting a negative binomial model to the interaction counts. Results can additionally be filtered by a minimum mean value (threshold_mean) and/or a maximum q-value (threshold_q).

Notes

Currently both the capture oligos and the name of the probe being analysed must be provided in order to correctly extract cis interactions.

If a N_SAMPLE * METADATA design matrix has not been supplied, the script assumes that the standard replicate naming structure has been followed i.e. SAMPLE_CONDITION_REPLICATE_(1|2).fastq.gz.

Parameters

union_bedgraph (os.PathLike) – Union bedgraph containing all samples to be compared.
capture_name (str) – Name of capture probe. MUST match one probe within the supplied oligos.
capture_viewpoints (os.PathLike) – Capture oligos used for the analysis.
output_prefix (os.PathLike, optional) – Output prefix for differential interactions. Defaults to ‘differential’.
design_matrix (os.PathLike, optional) – Design matrix to use for grouping samples. (N_SAMPLES * METADATA). Defaults to None.
grouping_col (str, optional) – Column to use for grouping. Defaults to ‘condition’.
threshold_count (float, optional) – Minimum number of reported interactions required. Defaults to 20.
threshold_q (float, optional) – Maximum q-value for output. Defaults to 0.05.
threshold_mean (float, optional) – Minimum mean value for output. Defaults to 0.

reporters heatmap¶

ccanalyser.cli.reporters_heatmap.plot_matrix(matrix, figsize=(10, 10), axis_labels=None, cmap=None, vmin=0, vmax=0)¶

ccanalyser.cli.reporters_heatmap.plot(cooler_fn: os.PathLike, coordinates: Union[str, os.PathLike], resolution: int, capture_names: Optional[Tuple] = None, normalisation: Optional[str] = None, cmap: str = 'jet', vmax: float = 1, vmin: float = 0, output_prefix: os.PathLike = '', remove_capture: bool = False)¶

Plots a heatmap of reporter interactions.

Parses an HDF5 file containing the result of a capture experiment (binned into even genomic windows) and plots a heatmap of interactions over a specified genomic range. If a capture probe name is not supplied the script will plot all probes present in the file.

Heatmaps can also be normalised (--normalise) using either:

raw: No normalisation is performed.
n_interactions: The number of cis interactions.
n_rf_n_interactions: Normalised to the number of restriction fragments making up both genomic bins and by the number of cis interactions.
ice: ICE normalisation followed by number of cis interactions correction.

Parameters

cooler_fn (os.PathLike) – Path to capture cooler file containing interactions
coordinates (Union[str, os.PathLike]) – Coordinates for plotting. Either chrX:1000-2000 or bed file. If a bed file, the capture probe name must be contained in the name.
resolution (int) – Genomic resolution to plot. Must be present in the cooler.
capture_names (Tuple, optional) – Capture probes to plot. If None will plot all probes. Defaults to None.
normalisation (str, optional) – Normalisation for heatmap. Choose from (n_interactions|n_rf_n_interactions|ice). Defaults to None.
cmap (str, optional) – Colour map to use for heatmap. Defaults to “jet”.
vmax (float, optional) – Upper threshold for the heatmap colour scale. Defaults to 1.
output_prefix (os.PathLike, optional) – Output prefix for heatmap. Defaults to “”.

reporters pileup¶

ccanalyser.cli.reporters_pileup.bedgraph(cooler_fn: os.PathLike, capture_names: Optional[list] = None, output_prefix: os.PathLike = '', normalise: bool = False, binsize: int = 0, gzip: bool = True, scale_factor: int = 1000000.0, sparse: bool = True)¶

Extracts reporters from a capture experiment and generates a bedgraph file.

Identifies reporters for a single probe (if a probe name is supplied) or all capture probes present in a capture experiment HDF5 file.

The bedgraph generated can be normalised by the number of cis interactions for inter experiment comparisons and/or binned into even genomic windows.

Parameters

cooler_fn (os.PathLike) – Path to hdf5 file containing cooler groups.
capture_names (list, optional) – Name of capture probe to extract. If None, will process all probes present in the file. Defaults to None.
output_prefix (os.PathLike, optional) – Output file prefix for bedgraph. Defaults to “”.
normalise (bool, optional) – Normalise counts using the number of cis interactions. Defaults to False.
binsize (int, optional) – Genomic binsize to use for generating bedgraph. No binning performed if less than 0. Defaults to 0.
gzip (bool, optional) – Compress output bedgraph with gzip. Defaults to True.
scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.
sparse (bool, optional) – Produce bedgraph containing just positive bins (True) or all bins (False). Defaults to True.
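
One plausible reading of the normalise, scale_factor and sparse options is sketched below. The formula (count multiplied by the scale factor and divided by the number of cis interactions) is an assumption based on the parameter descriptions, not a confirmed account of the ccanalyser calculation, and `normalise_bedgraph` is a hypothetical name.

```python
from typing import List, Tuple

BedgraphRow = Tuple[str, int, int, float]  # (chrom, start, end, value)


def normalise_bedgraph(rows: List[BedgraphRow], n_cis: int,
                       scale_factor: int = 1_000_000,
                       sparse: bool = True) -> List[BedgraphRow]:
    """Scale counts to interactions per `scale_factor` cis interactions."""
    normalised = [(c, s, e, count * scale_factor / n_cis) for c, s, e, count in rows]
    if sparse:
        # Keep only bins with a positive value.
        normalised = [row for row in normalised if row[3] > 0]
    return normalised


rows = [("chr1", 0, 1000, 5), ("chr1", 1000, 2000, 0), ("chr1", 2000, 3000, 10)]
sparse_rows = normalise_bedgraph(rows, n_cis=1000)
all_rows = normalise_bedgraph(rows, n_cis=1000, sparse=False)
```

Scaling by a shared factor is what makes values comparable between experiments with different numbers of cis interactions.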

reporters store¶

ccanalyser.cli.reporters_store.fragments(counts: os.PathLike, fragment_map: os.PathLike, output: os.PathLike, capture_name: str, capture_viewpoints: os.PathLike, genome: str = '', suffix: str = '')¶

Stores restriction fragment interaction combinations at the restriction fragment level.

Parses reporter restriction fragment interaction counts produced by “ccanalyser reporters count” and generates a cooler formatted group in an HDF5 file. See https://cooler.readthedocs.io/en/latest/ for further details.

Parameters

counts (os.PathLike) – Path to restriction fragment interactions counts .tsv file.
fragment_map (os.PathLike) – Path to restriction fragment .bed file, generated with genome-digest command.
output (os.PathLike) – Output file path for cooler hdf5 file.
capture_name (str) – Name of capture probe.
capture_viewpoints (os.PathLike) – Path to capture viewpoints bed file.
genome (str, optional) – Name of genome used for alignment e.g. hg19. Defaults to “”.
suffix (str, optional) – Suffix to append to filename. Defaults to “”.

ccanalyser.cli.reporters_store.bins(cooler_fn: os.PathLike, output: os.PathLike, binsizes: Optional[Tuple] = None, normalise: bool = False, n_cores: int = 1, scale_factor: int = 1000000.0, overlap_fraction: float = 1e-09, conversion_tables: Optional[os.PathLike] = None)¶

Converts a cooler group containing restriction fragments to constant genomic windows.

Parses a cooler group and aggregates restriction fragment interaction counts into genomic bins of a specified size. If the normalise option is selected, columns containing normalised counts are added to the pixels table of the output.

Notes

To avoid repeatedly calculating restriction fragment to bin conversions, bin conversion tables (a .pkl file containing a dictionary of ccanalyser.tools.storage.GenomicBinner objects, one per binsize) can be supplied.

Parameters

cooler_fn (os.PathLike) – Path to cooler file. Nested coolers can be specified by STORE_FN.hdf5::/PATH_TO_COOLER
output (os.PathLike) – Path for output binned cooler file.
binsizes (Tuple, optional) – Genomic window sizes to use for binning. Defaults to None.
normalise (bool, optional) – Normalise the number of interactions to total number of cis interactions (True). Defaults to False.
n_cores (int, optional) – Number of cores to use for binning. Performed in parallel by chromosome. Defaults to 1.
scale_factor (int, optional) – Scaling factor to use for normalising interactions. Defaults to 1e6.
overlap_fraction (float, optional) – Minimum fraction to use for defining overlapping bins. Defaults to 1e-9.
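
Aggregating restriction fragment counts into fixed genomic windows can be pictured with the simplified sketch below. It assigns each fragment to a single bin by its midpoint on one chromosome, which is a deliberate simplification and an assumption of this example; the command itself is described as using an overlap fraction, and `bin_pixels` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def bin_pixels(fragments: Dict[int, Tuple[int, int]],
               pixels: List[Tuple[int, int, int]],
               binsize: int) -> Dict[Tuple[int, int], int]:
    """Sum fragment-level pixel counts into (bin1, bin2) pixel counts."""

    def to_bin(rf_id: int) -> int:
        start, end = fragments[rf_id]
        return ((start + end) // 2) // binsize  # bin containing the midpoint

    binned: Dict[Tuple[int, int], int] = defaultdict(int)
    for rf1, rf2, count in pixels:
        b1, b2 = sorted((to_bin(rf1), to_bin(rf2)))
        binned[(b1, b2)] += count
    return dict(binned)


# Restriction fragment id -> (start, end) on a single chromosome.
fragments = {0: (0, 400), 1: (400, 900), 2: (1200, 1800), 3: (2100, 2600)}
pixels = [(0, 2, 3), (1, 2, 4), (2, 3, 1)]  # (fragment 1, fragment 2, count)
binned = bin_pixels(fragments, pixels, binsize=1000)
```

Fragments 0 and 1 both fall in the first 1 kb window, so their interactions with fragment 2 are summed into one binned pixel.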

ccanalyser.cli.reporters_store.merge(coolers: Tuple, output: os.PathLike)¶

Merges ccanalyser cooler files together.

Produces a unified cooler with both restriction fragment and genomic bins whilst reducing the storage space required by hard linking the “bins” tables to prevent duplication.

Parameters

coolers (Tuple) – Cooler files produced by either the fragments or bins subcommands.
output (os.PathLike) – Path for merged cooler file.

tsv aggregate¶

Created on Wed Dec 11 21:49:19 2019

@author: davids, asmith

Join any number of tab delimited text files on a single column. Index column must have the same header name in each file to be joined. Performs an outer join on the index column. Assumes files have headers.

ccanalyser.cli.tsv_aggregate.format_index_var(var)¶

ccanalyser.cli.tsv_aggregate.replace_na(df)¶

ccanalyser.cli.tsv_aggregate.is_compressed(files)¶

ccanalyser.cli.tsv_aggregate.load_tsv(tsv, index=None, header=None)¶

ccanalyser.cli.tsv_aggregate.join_tsvs(fnames, index_col, n_processes=8, header=True)¶

ccanalyser.cli.tsv_aggregate.concat_tsvs(fnames, delayed=False, header=None)¶

ccanalyser.cli.tsv_aggregate.main(input_files, output, index=None, header=None, method=None, n_processes=8, groupby_columns=None, aggregate_method=None, aggregate_columns=None)¶
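
The outer join on a shared index column described for this module can be sketched with the standard library alone. This is not the module's implementation: `read_tsv`, `outer_join` and the `"NA"` fill value are hypothetical, and the example reads from in-memory strings instead of files.

```python
import csv
import io
from typing import Dict, List


def read_tsv(text: str) -> List[Dict[str, str]]:
    """Parse tab-delimited text with a header row into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def outer_join(tables: List[List[Dict[str, str]]], index_col: str,
               fill: str = "NA") -> List[Dict[str, str]]:
    """Outer-join tables on index_col, filling missing values with `fill`."""
    columns = [index_col]
    merged: Dict[str, Dict[str, str]] = {}
    for table in tables:
        for row in table:
            for col in row:
                if col not in columns:
                    columns.append(col)
            merged.setdefault(row[index_col], {}).update(row)
    return [{col: row.get(col, fill) for col in columns} for row in merged.values()]


table_a = read_tsv("name\tcount_a\nread1\t3\nread2\t5\n")
table_b = read_tsv("name\tcount_b\nread2\t7\nread3\t1\n")
joined = outer_join([table_a, table_b], index_col="name")
```

Every index value from either input appears once in the result, with the fill value wherever a file had no matching row.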