CCanalyser tools

ccanalyser.tools.annotate module

class ccanalyser.tools.annotate.BedIntersection(bed1: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], bed2: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], intersection_name: str = 'count', intersection_method: str = 'count', intersection_min_frac: float = 1e-09, intersection_split_chrom: bool = True, n_cores: int = 1, invalid_bed_action='error')

Bases: object

Performs intersection between two named bed files.

Wrapper around bedtools intersect designed to intersect in parallel (by splitting file based on chromosome) and handle malformed bed files.

bed1

Bed file to intersect. Must be named.

Type

Union[str, BedTool, pd.DataFrame]

bed2

Bed file to intersect.

Type

Union[str, BedTool, pd.DataFrame]

intersection_name

Name for intersection.

Type

str

min_frac

Minimum fraction required for intersection

Type

float

n_cores

Number of cores for parallel intersection.

Type

int

invalid_bed_action

Method to deal with missing/malformed bed files ("ignore" | "error").

property intersection: pandas.core.series.Series

Intersects the two bed files and returns a pd.Series.
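
As a rough illustration of the "count" style intersection described above, the sketch below groups bed2 by chromosome and counts, for each named bed1 interval, how many bed2 intervals overlap it. It is a pure-Python stand-in that does not call bedtools; the helper names (split_by_chrom, count_overlaps) and the tuple layout are assumptions made for illustration and are not part of ccanalyser.

```python
from collections import defaultdict


def split_by_chrom(intervals):
    """Group (chrom, start, end, ...) tuples by chromosome."""
    by_chrom = defaultdict(list)
    for interval in intervals:
        by_chrom[interval[0]].append(interval)
    return by_chrom


def count_overlaps(bed1, bed2):
    """Return {name: number of bed2 intervals overlapping that bed1 interval}."""
    bed2_by_chrom = split_by_chrom(bed2)
    counts = {}
    for chrom, start, end, name in bed1:
        # Half-open bed coordinates: two intervals overlap when each one
        # starts before the other ends.
        counts[name] = sum(
            1
            for _, start2, end2 in bed2_by_chrom.get(chrom, [])
            if start2 < end and start < end2
        )
    return counts


bed1 = [("chr1", 100, 200, "slice_a"), ("chr2", 50, 80, "slice_b")]
bed2 = [("chr1", 150, 300), ("chr1", 190, 210), ("chr2", 500, 600)]
result = count_overlaps(bed1, bed2)
print(result)
```

Because each chromosome group is independent of the others, the per-chromosome work could be handed to separate workers, which is the idea behind the chromosome splitting mentioned above.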

ccanalyser.tools.deduplicate module

class ccanalyser.tools.deduplicate.ReadDeduplicationParserProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, hash_seed: int = 42, save_hashed_dict_path: os.PathLike = 'parsed.json')

Bases: multiprocessing.context.Process

Process subclass for parsing fastq file(s) into a hashed {id:sequence} json format.

inq

Input read queue

outq

Output read queue (Not currently used)

hash_seed

Seed for xxhash64 algorithm to ensure consistency

save_hashed_dict_path

Path to save hashed dictionary

run()

Processes fastq reads from multiple files and generates a hashed json dictionary.

Dictionary is hashed and in the format {(read 1 name + read 2 name): (sequence 1 + sequence 2)}.

Output path is specified by save_hashed_dict_path.
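
The hashed dictionary format can be sketched as follows. This is only an approximation: hashlib.blake2b with the seed used as a salt stands in for the seeded xxhash64 named above (so that the example needs nothing beyond the standard library), and the function names and the (name, sequence) tuple layout are hypothetical.

```python
import hashlib
import json


def hash_text(text, seed=42):
    """Stand-in for a seeded xxhash64: blake2b with the seed as a salt."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(text.encode(), digest_size=8, salt=salt).hexdigest()


def hash_read_pair(read1, read2, seed=42):
    """Each read is a (name, sequence) tuple; returns (hashed id, hashed sequence)."""
    name = read1[0] + read2[0]
    sequence = read1[1] + read2[1]
    return hash_text(name, seed), hash_text(sequence, seed)


pairs = [
    (("r1/1", "ACGT"), ("r1/2", "TTGA")),
    (("r2/1", "ACGT"), ("r2/2", "TTGA")),  # same sequences as r1: a duplicate
    (("r3/1", "GGCC"), ("r3/2", "TTGA")),
]
hashed = dict(hash_read_pair(r1, r2) for r1, r2 in pairs)
serialised = json.dumps(hashed)  # what would be written to the json file
print(serialised)
```

Two read pairs with identical sequences produce the same hashed value under different hashed ids, which is what allows duplicates to be identified later without holding the full sequences in memory.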

class ccanalyser.tools.deduplicate.ReadDuplicateRemovalProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, duplicated_ids: set, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, hash_seed: int = 42)

Bases: multiprocessing.context.Process

Process subclass for parsing fastq file(s) and removing identified duplicates.

inq

Input read queue

outq

Output queue for deduplicated reads.

duplicated_ids

Concatenated read ids to remove from input fastq files.

statq

Output queue for statistics.

reads_total

Number of fastq reads processed.

reads_unique

Number of non-duplicated reads output.

hash_seed

Seed for xxhash algorithm. MUST be the same as used by ReadDeduplicationParserProcess.

run()

Performs read deduplication based on sequence.

Unique reads are placed on outq and deduplication statistics are placed on statq.
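
A minimal sketch of the removal step, assuming reads arrive as (read_id, sequence) tuples and that duplicated_ids has already been computed. For brevity the ids here are plain strings; in the class documented above they would be hashed with the same hash_seed as the parser. The function name and the shape of the returned statistics are illustrative only.

```python
def remove_duplicates(reads, duplicated_ids):
    """Drop reads whose id is in duplicated_ids and report simple counts."""
    reads = list(reads)
    unique = [read for read in reads if read[0] not in duplicated_ids]
    stats = {"reads_total": len(reads), "reads_unique": len(unique)}
    return unique, stats


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "GGCC")]
unique, stats = remove_duplicates(reads, duplicated_ids={"r2"})
print(unique, stats)
```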

ccanalyser.tools.digest module

class ccanalyser.tools.digest.DigestedChrom(chrom: pysam.libcfaidx.FastqProxy, cutsite: str, fragment_number_offset: int = 0, fragment_min_len: int = 1)

Bases: object

Performs in silico digestion of fasta files.

Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.

chrom

Chromosome to digest

Type

pysam.FastqProxy

recognition_seq

Sequence of restriction recognition site

Type

str

recognition_len

Length of restriction recognition site

Type

int

recognition_seq

Regular expression for restriction recognition site

Type

re.Pattern

fragment_indexes

Indexes of fragment(s) start and end positions.

Type

List[int]

fragment_number_offset

Starting fragment number.

Type

int

fragment_min_len

Minimum fragment length required to report fragment

Type

int

get_recognition_site_indexes() → List[int]

Gets the start position of all recognition sites present in the sequence.

Notes

Also appends the start and end of the sequence to enable clearer iteration through the indexes.

Returns

Indexes of fragment(s) start and end positions.

Return type

List[int]

property fragments: Iterable[str]

Extracts the coordinates of restriction fragments from the sequence.

Yields

Iterator[Iterable[str]] – Identified restriction fragments in bed format.
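
The digestion logic described above can be approximated with a regular expression search. The sketch below is an assumption-laden illustration rather than the library's implementation: it cuts at the start of each recognition site, emits four-column bed lines, and uses hypothetical function names.

```python
import re


def recognition_site_indexes(sequence, cutsite):
    """Start positions of every cut site, bracketed by 0 and len(sequence)."""
    pattern = re.compile(cutsite, re.IGNORECASE)
    indexes = [match.start() for match in pattern.finditer(sequence)]
    return [0, *indexes, len(sequence)]


def fragments(chrom, sequence, cutsite, fragment_number_offset=0, fragment_min_len=1):
    """Yield tab-separated bed lines: chrom, start, end, fragment number."""
    indexes = recognition_site_indexes(sequence, cutsite)
    number = fragment_number_offset
    # Consecutive index pairs delimit one fragment each.
    for start, end in zip(indexes, indexes[1:]):
        if end - start >= fragment_min_len:
            yield f"{chrom}\t{start}\t{end}\t{number}"
            number += 1


bed_lines = list(fragments("chr1", "AAGATCTTTTGATCCC", "GATC"))
for line in bed_lines:
    print(line)
```

Adding the sequence start and end to the index list means every fragment, including the first and last, is just a pair of neighbouring indexes.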

class ccanalyser.tools.digest.DigestedRead(read: pysam.libcfaidx.FastqProxy, cutsite: str, min_slice_length: int = 18, slice_number_offset: int = 0, allow_undigested: bool = False, read_type: str = 'flashed')

Bases: object

Performs in silico digestion of fastq files.

Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.

read

Read to digest.

Type

pysam.FastqProxy

recognition_seq

Sequence of restriction recognition site.

Type

str

recognition_len

Length of restriction recognition site.

Type

int

recognition_seq

Regular expression for restriction recognition site.

Type

re.Pattern

slices

List of Fastq formatted digested reads (slices).

Type

List[str]

slice_indexes

Indexes of fragment(s) start and end positions.

Type

List[int]

slice_number_offset

Starting fragment number.

Type

int

min_slice_len

Minimum fragment length required to report fragment.

Type

int

has_slices

Recognition site(s) present within sequence.

Type

bool

get_recognition_site_indexes() → List[int]

class ccanalyser.tools.digest.ReadDigestionProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, **digestion_kwargs)

Bases: multiprocessing.context.Process

Process subclass for multiprocessing fastq digestion.

run()

Performs read digestion.

Reads to digest are pulled from inq, digested with the DigestedRead class and the results placed on outq for writing.

If a statq is provided, read digestion stats are placed into this queue for aggregation.
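
The queue-driven loop described above might look roughly like the following. To keep the example runnable in a single process it uses queue.Queue in place of the multiprocessing queues, a None sentinel to signal the end of input, and a plain str.split as a stand-in for DigestedRead; all three are assumptions made for illustration.

```python
import queue


def digestion_worker(inq, outq, statq=None, cutsite="GATC"):
    """Pull chunks of reads from inq until a None sentinel arrives."""
    while True:
        chunk = inq.get()
        if chunk is None:  # assumed termination signal
            outq.put(None)
            break
        digested = []
        for name, sequence in chunk:
            # Stand-in digestion: split on the cut site and drop empty pieces.
            slices = [piece for piece in sequence.split(cutsite) if piece]
            digested.append((name, slices))
            if statq is not None:
                statq.put({"read": name, "n_slices": len(slices)})
        outq.put(digested)


inq, outq, statq = queue.Queue(), queue.Queue(), queue.Queue()
inq.put([("read1", "AAGATCTTTTGATCCC")])
inq.put(None)
digestion_worker(inq, outq, statq)
first = outq.get()
stat = statq.get()
print(first, stat)
```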

ccanalyser.tools.filter module

class ccanalyser.tools.filter.SliceFilter(slices: pandas.core.frame.DataFrame, filter_stages: Optional[dict] = None, sample_name: str = '', read_type: str = '')

Bases: object

Perform slice filtering (inplace) and reporter identification.

The SliceFilter classes e.g. CCSliceFilter, TriCSliceFilter, TiledCSliceFilter perform all of the filtering (inplace) and reporter identification whilst also providing statistics of the numbers of slices/reads removed at each stage.

slices

Annotated slices dataframe.

Type

pd.DataFrame

fragments

Slices dataframe aggregated by parental read.

Type

pd.DataFrame

reporters

Slices identified as reporters.

Type

pd.DataFrame

filter_stages

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type

dict

slice_stats

Provides slice level statistics.

Type

pd.DataFrame

read_stats

Provides statistics of slice filtering at the parental read level.

Type

pd.DataFrame

filter_stats

Provides statistics of read filtering.

Type

pd.DataFrame

property filters: list

A list of the callable filters present within the slice filterer instance.

Returns

All filters present in the class.

Return type

list

property slice_stats: pandas.core.frame.DataFrame

Statistics at the slice level.

Returns

Statistics per slice.

Return type

pd.DataFrame

property filter_stats: pandas.core.frame.DataFrame

Statistics for each filter stage.

Returns

Statistics of the number of slices removed at each stage.

Return type

pd.DataFrame

property read_stats: pandas.core.frame.DataFrame

Gets statistics at a read level.

Aggregates slices by parental read id and calculates stats.

Returns

Statistics of the slices/fragments removed aggregated by read id.

Return type

pd.DataFrame

property fragments: pandas.core.frame.DataFrame

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns

Slices aggregated by parental read name.

Return type

pd.DataFrame

property reporters: pandas.core.frame.DataFrame

Extracts reporter slices from slices dataframe i.e. non-capture slices

Returns

All non-capture slices

Return type

pd.DataFrame

filter_slices(output_slices=False, output_location='.')

Performs slice filtering.

Filters are applied to the slices dataframe in the order specified by filter_stages. Filtering stats aggregated at the slice and fragment level are also printed.

Parameters
  • output_slices (bool, optional) – Determines if slices are to be output to a specified location after each filtering step. Useful for debugging. Defaults to False.

  • output_location (str, optional) – Location to output slices at each stage. Defaults to “.”.
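
One way to picture how filter_stages drives filtering is a dictionary of stage names mapped to method names that are looked up with getattr. The toy class below is a sketch under that assumption; it stores slices as a list of dicts rather than a DataFrame, and the stage names are hypothetical.

```python
class MiniSliceFilter:
    """Toy filter: slices are dicts with 'parent_read' and 'mapped' keys."""

    def __init__(self, slices, filter_stages):
        self.slices = list(slices)
        self.filter_stages = filter_stages
        self.filter_stats = []

    def remove_unmapped_slices(self):
        self.slices = [s for s in self.slices if s["mapped"]]

    def remove_orphan_slices(self):
        counts = {}
        for s in self.slices:
            counts[s["parent_read"]] = counts.get(s["parent_read"], 0) + 1
        self.slices = [s for s in self.slices if counts[s["parent_read"]] > 1]

    def filter_slices(self):
        for stage, method_names in self.filter_stages.items():
            for name in method_names:
                getattr(self, name)()  # look the filter up by its string name
            # Record how many slices remain after each stage.
            self.filter_stats.append((stage, len(self.slices)))


slices = [
    {"parent_read": "a", "mapped": True},
    {"parent_read": "a", "mapped": True},
    {"parent_read": "b", "mapped": True},
    {"parent_read": "b", "mapped": False},
]
stages = {"mapped": ["remove_unmapped_slices"], "has_partner": ["remove_orphan_slices"]}
slice_filter = MiniSliceFilter(slices, stages)
slice_filter.filter_slices()
print(slice_filter.filter_stats)
```

Keeping the stage order in a dictionary makes it possible to reorder or drop filters without editing the class itself.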

get_unfiltered_slices()

Does not modify slices.

remove_unmapped_slices()

Removes slices marked as unmapped (Uncommon)

remove_orphan_slices()

Remove fragments with only one aligned slice (Common)

remove_duplicate_re_frags()

Prevent the same restriction fragment being counted more than once (Uncommon).

Example

--RE_FRAG1-----Capture-----RE_FRAG1--

remove_slices_without_re_frag_assigned()

Removes slices if restriction_fragment column is N/A

remove_duplicate_slices()

Remove all slices if the slice coordinates and slice order are shared.

This method is designed to remove a fragment if it is a PCR duplicate (Common).

Example

Frag 1: chr1:1000-1250 chr1:1500-1750
Frag 2: chr1:1000-1250 chr1:1500-1750
Frag 3: chr1:1050-1275 chr1:1600-1755
Frag 4: chr1:1500-1750 chr1:1000-1250

Frag 2 removed. Frag 1,3,4 retained
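
The example above can be reproduced with a short sketch that keeps the first fragment seen for each ordered tuple of slice coordinates. The function name and the dictionary layout are illustrative; the documented method operates on the slices dataframe.

```python
def remove_duplicate_fragments(fragments):
    """Keep the first fragment for each ordered tuple of slice coordinates."""
    seen = set()
    kept = {}
    for name, coordinates in fragments.items():
        if coordinates not in seen:
            seen.add(coordinates)
            kept[name] = coordinates
    return kept


fragments = {
    "Frag 1": ("chr1:1000-1250", "chr1:1500-1750"),
    "Frag 2": ("chr1:1000-1250", "chr1:1500-1750"),
    "Frag 3": ("chr1:1050-1275", "chr1:1600-1755"),
    "Frag 4": ("chr1:1500-1750", "chr1:1000-1250"),
}
kept = remove_duplicate_fragments(fragments)
print(list(kept))
```

Frag 4 survives because tuples compare by order as well as content, which mirrors the requirement that both coordinates and slice order are shared.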

remove_duplicate_slices_pe()

Removes PCR duplicates from non-flashed (PE) fragments (Common).

Sequence quality is often lower at the 3’ end of reads leading to variance in mapping coordinates. PCR duplicates are removed by checking that the fragment start and end are not duplicated in the dataframe.

remove_excluded_slices()

Removes any slices in the exclusion region (default 1kb) (V. Common)

remove_blacklisted_slices()

Removes slices marked as being within blacklisted regions

class ccanalyser.tools.filter.CCSliceFilter(slices, filter_stages=None, **sample_kwargs)

Bases: ccanalyser.tools.filter.SliceFilter

Perform Capture-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Capture-C data. This class has additional methods to remove common artifacts in Capture-C data i.e. multi-capture fragments, non-reporter fragments, multi-capture reporters. The default filter order is as follows:

  • remove_unmapped_slices

  • remove_orphan_slices

  • remove_multi_capture_fragments

  • remove_excluded_slices

  • remove_blacklisted_slices

  • remove_non_reporter_fragments

  • remove_multicapture_reporters

  • remove_slices_without_re_frag_assigned

  • remove_duplicate_re_frags

  • remove_duplicate_slices

  • remove_duplicate_slices_pe

  • remove_non_reporter_fragments

See the individual methods for further details.

slices

Annotated slices dataframe.

Type

pd.DataFrame

fragments

Slices dataframe aggregated by parental read.

Type

pd.DataFrame

reporters

Slices identified as reporters.

Type

pd.DataFrame

filter_stages

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type

dict

slice_stats

Provides slice level statistics.

Type

pd.DataFrame

read_stats

Provides statistics of slice filtering at the parental read level.

Type

pd.DataFrame

filter_stats

Provides statistics of read filtering.

Type

pd.DataFrame

property fragments: pandas.core.frame.DataFrame

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns

Slices aggregated by parental read name.

Return type

pd.DataFrame

property slice_stats

Statistics at the slice level.

Returns

Statistics per slice.

Return type

pd.DataFrame

property frag_stats: pandas.core.frame.DataFrame

Statistics aggregated at the fragment level.

As this involves slice aggregation it can be rather slow for large datasets. It is recommended to only use this property if it is required.

Returns

Fragment level statistics

Return type

pd.DataFrame

property reporters: pandas.core.frame.DataFrame

Extracts reporter slices from slices dataframe i.e. non-capture slices

Returns

All non-capture slices

Return type

pd.DataFrame

property captures: pandas.core.frame.DataFrame

Extracts capture slices from slices dataframe

i.e. slices that do not have a null capture name

Returns

Capture slices

Return type

pd.DataFrame

property capture_site_stats: pandas.core.series.Series

Extracts the number of unique capture sites.

property merged_captures_and_reporters: pandas.core.frame.DataFrame

Merges captures and reporters sharing the same parental id.

Capture slices and reporter slices with the same parental read id are merged together. The prefixes ‘capture’ and ‘reporter’ are used to identify slices marked as either captures or reporters.

Returns

Merged capture and reporter slices

Return type

pd.DataFrame

property cis_or_trans_stats: pandas.core.frame.DataFrame

Extracts reporter cis/trans statistics from slices.

Returns

Reporter cis/trans statistics

Return type

pd.DataFrame

remove_non_reporter_fragments()

Removes the fragment if it has no reporter slices present (Common)

remove_multi_capture_fragments()

Removes double capture fragments.

All slices (i.e. the entire fragment) are removed if more than one capture probe is present i.e. a double capture (V. Common)

remove_multicapture_reporters(n_adjacent: int = 1)

Deals with an odd situation in which a reporter spanning two adjacent capture sites is not removed.

Example

------Capture 1------/------Capture 2------ ------REP------

In this case the “reporter” slice is not considered either a capture or exclusion.

These cases are dealt with by explicitly removing reporters on restriction fragments adjacent to capture sites.

Parameters

n_adjacent – Number of adjacent restriction fragments to remove
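
A possible reading of the n_adjacent parameter is sketched below. It assumes restriction fragments are numbered consecutively along the chromosome, so that "adjacent" means a fragment number within n_adjacent of a capture fragment; that numbering, the function name, and the dict layout are assumptions and may not match the actual implementation.

```python
def remove_adjacent_reporters(reporters, capture_fragments, n_adjacent=1):
    """Drop reporters on restriction fragments next to a capture fragment."""
    excluded = {
        fragment + offset
        for fragment in capture_fragments
        for offset in range(-n_adjacent, n_adjacent + 1)
    }
    return [r for r in reporters if r["restriction_fragment"] not in excluded]


reporters = [
    {"slice": "s1", "restriction_fragment": 9},
    {"slice": "s2", "restriction_fragment": 11},
    {"slice": "s3", "restriction_fragment": 12},
    {"slice": "s4", "restriction_fragment": 20},
]
kept = remove_adjacent_reporters(reporters, capture_fragments={10}, n_adjacent=1)
print([r["slice"] for r in kept])
```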

class ccanalyser.tools.filter.TriCSliceFilter(slices, filter_stages=None, **sample_kwargs)

Bases: ccanalyser.tools.filter.CCSliceFilter

Perform Tri-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Tri-C data. Whilst the vast majority of filters are inherited from CCSliceFilter, this class has additional methods for Tri-C analysis i.e. remove_slices_with_one_reporter. The default filtering order is:

  • remove_unmapped_slices

  • remove_slices_without_re_frag_assigned

  • remove_orphan_slices

  • remove_multi_capture_fragments

  • remove_blacklisted_slices

  • remove_non_reporter_fragments

  • remove_multicapture_reporters

  • remove_duplicate_re_frags

  • remove_duplicate_slices

  • remove_duplicate_slices_pe

  • remove_non_reporter_fragments

  • remove_slices_with_one_reporter

See the individual methods for further details.

slices

Annotated slices dataframe.

Type

pd.DataFrame

fragments

Slices dataframe aggregated by parental read.

Type

pd.DataFrame

reporters

Slices identified as reporters.

Type

pd.DataFrame

filter_stages

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type

dict

slice_stats

Provides slice level statistics.

Type

pd.DataFrame

read_stats

Provides statistics of slice filtering at the parental read level.

Type

pd.DataFrame

filter_stats

Provides statistics of read filtering.

Type

pd.DataFrame

remove_slices_with_one_reporter()

Removes fragments if they do not contain at least two reporters.

class ccanalyser.tools.filter.TiledCSliceFilter(slices, filter_stages=None, **sample_kwargs)

Bases: ccanalyser.tools.filter.SliceFilter

Perform Tiled-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Tiled-C data. This class has additional methods to remove common artifacts in Tiled-C data i.e. non-capture fragments, multi-capture (with different tiled regions) fragments. A reporter is defined differently in a Tiled-C analysis as a reporter slice can also be a capture slice.

The default filter order is as follows:

  • remove_unmapped_slices

  • remove_orphan_slices

  • remove_blacklisted_slices

  • remove_non_capture_fragments

  • remove_dual_capture_fragments

  • remove_slices_without_re_frag_assigned

  • remove_duplicate_re_frags

  • remove_duplicate_slices

  • remove_duplicate_slices_pe

  • remove_orphan_slices

See the individual methods for further details.

slices

Annotated slices dataframe.

Type

pd.DataFrame

fragments

Slices dataframe aggregated by parental read.

Type

pd.DataFrame

reporters

Slices identified as reporters.

Type

pd.DataFrame

filter_stages

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type

dict

slice_stats

Provides slice level statistics.

Type

pd.DataFrame

read_stats

Provides statistics of slice filtering at the parental read level.

Type

pd.DataFrame

filter_stats

Provides statistics of read filtering.

Type

pd.DataFrame

property fragments: pandas.core.frame.DataFrame

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns

Slices aggregated by parental read name.

Return type

pd.DataFrame

property slice_stats

Statistics at the slice level.

Returns

Statistics per slice.

Return type

pd.DataFrame

property cis_or_trans_stats: pandas.core.frame.DataFrame

Extracts reporter cis/trans statistics from slices.

Unlike Capture-C/Tri-C, reporter slices can also be capture slices, as all slices within the capture region are considered reporters. To extract cis/trans statistics, one capture slice in each fragment is considered to be the “primary capture”; this enables merging of the “primary capture” with the other reporters both inside and outside of the tiled region.

Returns

Reporter cis/trans statistics

Return type

pd.DataFrame

remove_slices_outside_capture()

Removes slices outside of capture region(s)

remove_non_capture_fragments()

Removes fragments without a capture assigned

remove_dual_capture_fragments()

Removes a fragment with multiple different capture sites.

Modified for TiledC filtering as the fragment dataframe is generated slightly differently.

ccanalyser.tools.io module

class ccanalyser.tools.io.FastqReaderProcess(input_files: Union[str, list], outq: multiprocessing.context.BaseContext.Queue, read_buffer: int = 100000, read_counter: Optional[multiprocessing.managers.BaseManager.register.<locals>.temp] = None, n_subprocesses: int = 1, statq: Optional[multiprocessing.context.BaseContext.Queue] = None)

Bases: multiprocessing.context.Process

Reads fastq file(s) in chunks and places them on a queue.

input_files

Input fastq files.

outq

Output queue for chunked reads/read pairs.

statq

(Not currently used) Queue for read statistics if required.

read_buffer

Number of reads to process before placing them on outq

read_counter

(Not currently used) Can be used to sync between multiple readers.

n_subprocesses

Number of processes running concurrently. Used to make sure enough termination signals are used.

run()

Performs reading and chunking of fastq file(s).
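
The chunking behaviour, and the reason n_subprocesses matters, can be sketched as below. The example uses queue.Queue and a None sentinel as stand-ins for the multiprocessing queue and its termination signal; both, along with the function names, are assumptions made for illustration.

```python
import queue
from itertools import islice


def chunk_reads(reads, read_buffer=100000):
    """Yield lists of at most read_buffer reads."""
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, read_buffer))
        if not chunk:
            break
        yield chunk


def fill_queue(reads, outq, read_buffer, n_subprocesses):
    """Place chunks on outq, then one termination signal per consumer."""
    for chunk in chunk_reads(reads, read_buffer):
        outq.put(chunk)
    for _ in range(n_subprocesses):
        outq.put(None)  # each downstream process needs its own signal


outq = queue.Queue()
fill_queue(range(7), outq, read_buffer=3, n_subprocesses=2)
items = [outq.get() for _ in range(outq.qsize())]
print(items)
```

If fewer signals than consumers were sent, some consumers would wait on the queue indefinitely, which is why the number of concurrent processes has to be known to the reader.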

class ccanalyser.tools.io.FastqReadFormatterProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, formatting: Optional[list] = None)

Bases: multiprocessing.context.Process

run()

Method to be run in sub-process; can be overridden in sub-class

class ccanalyser.tools.io.FastqWriterSplitterProcess(inq: multiprocessing.context.BaseContext.Queue, output_prefix: Union[str, list], paired_output: bool = False, gzip=False, compression_level: int = 3, compression_threads: int = 8, n_subprocesses: int = 1, n_workers_terminated: int = 0, n_files_written: int = 0)

Bases: multiprocessing.context.Process

run()

Method to be run in sub-process; can be overridden in sub-class

class ccanalyser.tools.io.FastqWriterProcess(inq: multiprocessing.context.BaseContext.Queue, output: Union[str, list], compression_level: int = 5, n_subprocesses: int = 1)

Bases: multiprocessing.context.Process

run()

Method to be run in sub-process; can be overridden in sub-class

ccanalyser.tools.io.parse_alignment(aln)

Parses reads from a bam file into a list.

Extracts:

  • read name

  • parent reads

  • flashed status

  • slice number

  • mapped status

  • multimapping status

  • chromosome number (e.g. chr10)

  • start (e.g. 1000)

  • end (e.g. 2000)

  • coords (e.g. chr10:1000-2000)

Parameters

aln – pysam.AlignmentFile.

Returns

Containing the attributes extracted.

Return type

list

ccanalyser.tools.io.parse_bam(bam)

Uses the parse_alignment function to convert a bam file to a dataframe.

Extracts:

  • ‘slice_name’

  • ‘parent_read’

  • ‘pe’

  • ‘slice’

  • ‘mapped’

  • ‘multimapped’

  • ‘chrom’

  • ‘start’

  • ‘end’

  • ‘coordinates’

Parameters

bam – File name of bam file to process.

Returns

DataFrame with the columns listed above.

Return type

pd.DataFrame

ccanalyser.tools.pileup module

class ccanalyser.tools.pileup.CoolerBedGraph(cooler_fn: str, sparse: bool = True, only_cis: bool = False)

Bases: object

Generates a bedgraph file from a cooler file created by interactions-store.

cooler

Cooler file to use for bedgraph production

Type

cooler.Cooler

capture_name

Name of capture probe being processed.

Type

str

sparse

Only output bins with interactions.

Type

bool

only_cis

Only output cis interactions.

Type

bool

property bedgraph: pandas.core.frame.DataFrame

Returns: pd.DataFrame: DataFrame in bedgraph format.

property reporters: pandas.core.frame.DataFrame

Interactions with capture fragments/bins.

Returns

DataFrame containing just bins interacting with the capture probe.

Return type

pd.DataFrame

normalise_bedgraph(scale_factor=1000000.0) → pandas.core.frame.DataFrame

Normalises the bedgraph.

Uses the number of cis interactions to normalise the bedgraph counts.

Parameters

scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.

Returns

Normalised bedgraph formatted DataFrame

Return type

pd.DataFrame
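
The normalisation described above presumably amounts to scaling each count by the number of cis interactions. The sketch below assumes the formula count × scale_factor / n_cis_interactions; the exact expression used by the class is not stated in this reference, so treat it as an illustration only.

```python
def normalise_counts(counts, n_cis_interactions, scale_factor=1e6):
    """Scale raw counts to interactions per scale_factor cis interactions."""
    return [count * scale_factor / n_cis_interactions for count in counts]


normalised = normalise_counts([5, 20], n_cis_interactions=1000)
print(normalised)
```

Expressing counts per million cis interactions makes bedgraphs from samples with different sequencing depths comparable.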

to_file(fn: os.PathLike, normalise: bool = False, **normalise_kwargs)

Outputs the bedgraph dataframe to a file.

If normalise is True, will also normalise the counts by the number of cis interactions.

Parameters
  • fn (os.PathLike) – Output file name.

  • normalise (bool, optional) – Normalise the bedgraph before writing to file. Defaults to False.

class ccanalyser.tools.pileup.CoolerBedGraphWindowed(cooler_fn: str, binsize: int = 5000.0, binner: Optional[ccanalyser.tools.storage.CoolerBinner] = None, sparse=True)

Bases: ccanalyser.tools.pileup.CoolerBedGraph

normalise_bedgraph(scale_factor=1000000.0)

Normalises the bedgraph.

Uses the number of cis interactions to normalise the bedgraph counts.

Parameters

scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.

Returns

Normalised bedgraph formatted DataFrame

Return type

pd.DataFrame

property reporters_binned

class ccanalyser.tools.pileup.CCBedgraph(path=None, df=None, capture_name='', capture_chrom='', capture_start='', capture_end='')

Bases: object

property score

property coordinates

to_bedtool()

to_file(path)

ccanalyser.tools.plotting module

class ccanalyser.tools.plotting.CCMatrix(cooler_fn: os.PathLike, binsize: 5000, capture_name: str, remove_capture=False)

Bases: object

get_matrix(coordinates, field='count')

get_matrix_normalised(coordinates, normalisation_method=None, **normalisation_kwargs)

ccanalyser.tools.statistics module

class ccanalyser.tools.statistics.DeduplicationStatistics(sample: str, read_type: str = 'pe', reads_total: int = 0, reads_unique: int = 0)

Bases: object

property df

class ccanalyser.tools.statistics.DigestionStats(read_type, read_number, unfiltered, filtered)

Bases: tuple

filtered

Alias for field number 3

read_number

Alias for field number 1

read_type

Alias for field number 0

unfiltered

Alias for field number 2
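
The "Alias for field number" entries are the docstrings Python generates for namedtuple fields, so DigestionStats most likely behaves like the definition below. The example values are made up, and the meaning attached to each field here is an assumption.

```python
from collections import namedtuple

# Field order matches the documented field numbers 0 to 3.
DigestionStats = namedtuple(
    "DigestionStats", ["read_type", "read_number", "unfiltered", "filtered"]
)

stats = DigestionStats(read_type="flashed", read_number=1, unfiltered=10, filtered=7)
print(stats.filtered, stats[3])
```

Fields can be read either by name or by position, and the tuple can be unpacked like any other.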

ccanalyser.tools.statistics.collate_histogram_data(fnames)

ccanalyser.tools.statistics.collate_read_data(fnames)

ccanalyser.tools.statistics.collate_slice_data(fnames)

ccanalyser.tools.statistics.collate_cis_trans_data(fnames)

ccanalyser.tools.statistics.extract_trimming_stats(fn)

ccanalyser.tools.storage module

ccanalyser.tools.storage.get_capture_coords(viewpoint_file: str, viewpoint_name: str)

ccanalyser.tools.storage.get_capture_bins(bins, viewpoint_chrom, viewpoint_start, viewpoint_end)

ccanalyser.tools.storage.create_cooler_cc(output_prefix: str, bins: pandas.core.frame.DataFrame, pixels: pandas.core.frame.DataFrame, capture_name: str, capture_viewpoints: os.PathLike, capture_bins: Optional[Union[int, list]] = None, suffix=None, **cooler_kwargs) → os.PathLike

Creates a cooler hdf5 file or cooler formatted group within a hdf5 file.

Parameters
  • output_prefix (str) – Output path for hdf5 file. If this already exists, will append a new group to the file.

  • bins (pd.DataFrame) – DataFrame containing the genomic coordinates of all bins in the pixels table.

  • pixels (pd.DataFrame) – DataFrame with columns: bin1_id, bin2_id, count.

  • capture_name (str) – Name of capture probe to store.

  • capture_viewpoints (os.PathLike) – Path to capture viewpoints used for the analysis.

  • capture_bins (Union[int, list], optional) – Bins containing capture viewpoints. Can be determined from viewpoints if not supplied. Defaults to None.

  • suffix (str, optional) – Suffix to append before the .hdf5 file extension. Defaults to None.

Raises

ValueError – Capture name must exactly match the name of a supplied capture viewpoint.

Returns

Path of cooler hdf5 file.

Return type

os.PathLike

class ccanalyser.tools.storage.GenomicBinner(chromsizes: Union[os.PathLike, pandas.core.frame.DataFrame, pandas.core.series.Series], fragments: pandas.core.frame.DataFrame, binsize: int = 5000, n_cores: int = 8, method: Literal[midpoint, overlap] = 'midpoint', min_overlap: float = 0.2)

Bases: object

Provides a conversion table for converting two sets of bins.

chromsizes

Series indexed by chromosome name containing chromosome sizes in bp

Type

pd.Series

fragments

DataFrame containing bins to convert to equal genomic intervals

Type

pd.DataFrame

binsize

Genomic bin size

Type

int

min_overlap

Minimum degree of intersection to define an overlap.

Type

float

n_cores

Number of cores to use for bin intersection.

Type

int

property bins: pandas.core.frame.DataFrame

Equal genomic bins.

Returns

DataFrame in bed format.

Return type

pd.DataFrame

property bin_conversion_table: pandas.core.frame.DataFrame

Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.
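
The two method options ("midpoint" and "overlap") can be illustrated with the following sketch, which maps a single restriction fragment to fixed-size genomic bins. It assumes min_overlap is a fraction of the fragment length; the reference does not say whether the fraction is relative to the fragment or to the bin, and the function names are hypothetical.

```python
def bin_by_midpoint(start, end, binsize=5000):
    """Index of the single bin containing the fragment midpoint."""
    return ((start + end) // 2) // binsize


def bins_by_overlap(start, end, binsize=5000, min_overlap=0.2):
    """Indexes of bins covering at least min_overlap of the fragment."""
    length = end - start
    first, last = start // binsize, (end - 1) // binsize
    selected = []
    for bin_index in range(first, last + 1):
        bin_start, bin_end = bin_index * binsize, (bin_index + 1) * binsize
        overlap = min(end, bin_end) - max(start, bin_start)
        if overlap / length >= min_overlap:
            selected.append(bin_index)
    return selected


print(bin_by_midpoint(4000, 6000), bins_by_overlap(4000, 6000))
print(bin_by_midpoint(4900, 6000), bins_by_overlap(4900, 6000))
```

The midpoint method assigns each fragment to exactly one bin, whereas the overlap method can assign a fragment that straddles a boundary to both neighbouring bins.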

class ccanalyser.tools.storage.CoolerBinner(cooler_fn: os.PathLike, binsize: Optional[int] = None, n_cores: int = 8, binner: Optional[ccanalyser.tools.storage.GenomicBinner] = None)

Bases: object

Bins a cooler file into equal genomic intervals.

cooler

(cooler.Cooler): Cooler instance to bin.

binner

Binner class to generate bin conversion tables

Type

ccanalyser.tools.storage.GenomicBinner

binsize

Genomic bin size

Type

int

scale_factor

Scaling factor for normalising interaction counts.

Type

int

n_cis_interactions

Number of cis interactions with the capture bins.

Type

int

n_cores

Number of cores to use for binning.

Type

int

property bins: pandas.core.frame.DataFrame

Returns: pd.DataFrame: Even genomic bins of a specified binsize.

property bin_conversion_table: pandas.core.frame.DataFrame

Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.

property capture_bins

Returns: pd.DataFrame: Capture bins converted to the new even genomic bin format.

property pixel_conversion_table

Returns: pd.DataFrame: Conversion table to convert old binning scheme to the new.

property pixels

Returns: pd.DataFrame: Pixels (interaction counts) converted to the new binning scheme.

normalise_pixels(n_fragment_correction: bool = True, n_interaction_correction: bool = True, scale_factor: int = 1000000.0)

Normalises pixels (interactions).

Normalises pixels according to the number of restriction fragments per bin and the number of cis interactions. If both normalisation options are selected, will also provide a dual normalised column.

Parameters
  • n_fragment_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of restriction fragments per bin. Defaults to True.

  • n_interaction_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of cis interactions. Defaults to True.

  • scale_factor (int, optional) – Scaling factor for n_interaction_correction. Defaults to 1e6.
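
The three normalised columns described above could be computed along the following lines. This is a sketch only: the column names, the per-pixel n_fragments value, and the exact formulas are assumptions, since the reference gives the intent of each correction but not its arithmetic.

```python
def normalise_pixels(pixels, n_cis_interactions, scale_factor=1e6):
    """Add fragment-corrected, interaction-corrected and dual-corrected counts."""
    normalised = []
    for pixel in pixels:
        row = dict(pixel)
        # Correct for the number of restriction fragments in the bin.
        row["count_n_rf_norm"] = pixel["count"] / pixel["n_fragments"]
        # Correct for the number of cis interactions.
        row["count_n_interactions_norm"] = (
            pixel["count"] * scale_factor / n_cis_interactions
        )
        # Apply both corrections together.
        row["count_n_rf_n_interactions_norm"] = (
            row["count_n_rf_norm"] * scale_factor / n_cis_interactions
        )
        normalised.append(row)
    return normalised


pixels = [{"bin1_id": 0, "bin2_id": 3, "count": 10, "n_fragments": 2}]
rows = normalise_pixels(pixels, n_cis_interactions=1000)
print(rows[0])
```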

to_cooler(store, normalise=False, **normalise_options)

Reduces cooler storage space by linking “bins” table.

All of the cooler “bins” tables containing the genomic coordinates of each bin are identical for all cooler files of the same resolution. As cooler.create_cooler generates a new bins table for each cooler, this leads to a high degree of duplication.

This function hard links the bins tables for a given resolution to reduce the degree of duplication.

Parameters

clr (os.PathLike) – Path to cooler hdf5 produced by the merge command.