CCanalyser tools¶

ccanalyser.tools.annotate module¶

class ccanalyser.tools.annotate.BedIntersection(bed1: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], bed2: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], intersection_name: str = 'count', intersection_method: str = 'count', intersection_min_frac: float = 1e-09, intersection_split_chrom: bool = True, n_cores: int = 1, invalid_bed_action='error')¶

Bases: object

Performs intersection between two named bed files.

Wrapper around bedtools intersect designed to intersect in parallel (by splitting file based on chromosome) and handle malformed bed files.

bed1¶

Bed file to intersect. Must be named.

Type: Union[str, BedTool, pd.DataFrame]

bed2¶

Bed file to intersect.

Type: Union[str, BedTool, pd.DataFrame]

intersection_name¶

Name for intersection.

Type: str

min_frac¶

Minimum fraction required for intersection

Type: float

n_cores¶

Number of cores for parallel intersection.

Type: int

invalid_bed_action¶: Method to deal with missing/malformed bed files ("ignore" | "error").

property intersection: pandas.core.series.Series¶: Intersects the two bed files and returns a pd.Series.
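The idea of a per-chromosome count intersection can be illustrated without bedtools. The following is a minimal sketch in plain Python, not the class's implementation: the function name `count_intersections`, the tuple layout of the intervals, and the interpretation of the minimum fraction as a fraction of the bed1 interval are all assumptions made for illustration.

```python
from collections import defaultdict


def count_intersections(bed1, bed2, min_frac=1e-9):
    """Count the bed2 intervals overlapping each named bed1 interval."""
    # Split bed2 by chromosome so each bed1 interval is only compared
    # against intervals on its own chromosome.
    bed2_by_chrom = defaultdict(list)
    for chrom, start, end in bed2:
        bed2_by_chrom[chrom].append((start, end))

    counts = {}
    for chrom, start, end, name in bed1:
        length = end - start
        n_overlapping = 0
        for other_start, other_end in bed2_by_chrom.get(chrom, []):
            overlap = min(end, other_end) - max(start, other_start)
            # Require the overlap to cover at least min_frac of the bed1 interval.
            if overlap > 0 and overlap / length >= min_frac:
                n_overlapping += 1
        counts[name] = n_overlapping
    return counts


# Hypothetical example data: (chrom, start, end, name) and (chrom, start, end).
slices = [("chr1", 0, 100, "slice_a"), ("chr2", 0, 100, "slice_b")]
features = [("chr1", 50, 150), ("chr1", 90, 200), ("chr2", 200, 300)]
counts = count_intersections(slices, features)
strict_counts = count_intersections(slices, features, min_frac=0.2)
```

Because chromosomes are independent of one another, each per-chromosome group could in principle be handed to a separate worker, which is the motivation for splitting by chromosome.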

ccanalyser.tools.deduplicate module¶

class ccanalyser.tools.deduplicate.ReadDeduplicationParserProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, hash_seed: int = 42, save_hashed_dict_path: os.PathLike = 'parsed.json')¶

Bases: multiprocessing.context.Process

Process subclass for parsing fastq file(s) into a hashed {id:sequence} json format.

inq¶: Input read queue

outq¶: Output read queue (Not currently used)

hash_seed¶: Seed for xxhash64 algorithm to ensure consistency

save_hashed_dict_path¶: Path to save hashed dictionary

run()¶

Processes fastq reads from multiple files and generates a hashed json dictionary.

Dictionary is hashed and in the format {(read 1 name + read 2 name): (sequence 1 + sequence 2)}.

Output path is specified by save_hashed_dict_path.
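As a rough illustration of the hashed {id: sequence} format, the sketch below uses the standard-library `hashlib.blake2b` as a stand-in for xxhash64 (an assumption made so the example has no third-party dependencies). The helper names `seeded_hash`, `hash_read_pairs` and `find_duplicated_ids` are hypothetical and are not part of ccanalyser.

```python
import hashlib


def seeded_hash(text, seed=42):
    """Return a 64-bit hex digest of text that depends on the seed."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(text.encode(), digest_size=8, salt=salt).hexdigest()


def hash_read_pairs(read_pairs, seed=42):
    """Map hash(read 1 name + read 2 name) to hash(sequence 1 + sequence 2)."""
    return {
        seeded_hash(r1_name + r2_name, seed): seeded_hash(r1_seq + r2_seq, seed)
        for r1_name, r1_seq, r2_name, r2_seq in read_pairs
    }


def find_duplicated_ids(hashed):
    """Return the hashed ids whose hashed sequence has already been seen."""
    seen_sequences = set()
    duplicated = set()
    for read_id, sequence in hashed.items():
        if sequence in seen_sequences:
            duplicated.add(read_id)
        seen_sequences.add(sequence)
    return duplicated


# Hypothetical read pairs: (read 1 name, read 1 sequence, read 2 name, read 2 sequence).
pairs = [
    ("read1/1", "ACGT", "read1/2", "TTGA"),
    ("read2/1", "ACGT", "read2/2", "TTGA"),  # same sequences as read1
    ("read3/1", "GGCC", "read3/2", "TTGA"),
]
hashed = hash_read_pairs(pairs)
duplicated_ids = find_duplicated_ids(hashed)
```

Using the same seed in every process matters because the ids are only comparable if they were hashed identically.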

class ccanalyser.tools.deduplicate.ReadDuplicateRemovalProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, duplicated_ids: set, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, hash_seed: int = 42)¶

Bases: multiprocessing.context.Process

Process subclass for parsing fastq file(s) and removing identified duplicates.

inq¶: Input read queue

outq¶: Output queue for deduplicated reads.

duplicated_ids¶: Concatenated read ids to remove from input fastq files.

statq¶: Output queue for statistics.

reads_total¶: Number of fastq reads processed.

reads_unique¶: Number of non-duplicated reads output.

hash_seed¶: Seed for xxhash algorithm. MUST be the same as used by ReadDeduplicationParserProcess.

run()¶

Performs read deduplication based on sequence.

Unique reads are placed on outq and deduplication statistics are placed on statq.
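The removal step and its statistics can be sketched as follows. This is an illustrative simplification: the ids are compared as plain strings rather than hashes, and the function name `remove_duplicates` and the tuple layout are assumptions. The statistic names mirror the reads_total and reads_unique attributes listed above.

```python
def remove_duplicates(read_pairs, duplicated_ids):
    """Drop read pairs whose concatenated ids are flagged as duplicates."""
    unique = []
    reads_total = 0
    for r1_name, r1_seq, r2_name, r2_seq in read_pairs:
        reads_total += 1
        if r1_name + r2_name in duplicated_ids:
            continue  # flagged as a duplicate, do not output
        unique.append((r1_name, r1_seq, r2_name, r2_seq))
    stats = {"reads_total": reads_total, "reads_unique": len(unique)}
    return unique, stats


# Hypothetical input: read2 has been identified as a duplicate of read1.
pairs = [
    ("read1/1", "ACGT", "read1/2", "TTGA"),
    ("read2/1", "ACGT", "read2/2", "TTGA"),
]
unique_pairs, stats = remove_duplicates(pairs, {"read2/1read2/2"})
```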

ccanalyser.tools.digest module¶

class ccanalyser.tools.digest.DigestedChrom(chrom: pysam.libcfaidx.FastqProxy, cutsite: str, fragment_number_offset: int = 0, fragment_min_len: int = 1)¶

Bases: object

Performs in silico digestion of fasta files.

Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.

chrom¶

Chromosome to digest

Type: pysam.FastqProxy

recognition_seq¶

Sequence of restriction recognition site

Type: str

recognition_len¶

Length of restriction recognition site

Type: int

recognition_seq¶

Regular expression for restriction recognition site

Type: re.Pattern

fragment_indexes¶

Indexes of fragment(s) start and end positions.

Type: List[int]

fragment_number_offset¶

Starting fragment number.

Type: int

fragment_min_len¶

Minimum fragment length required to report fragment

Type: int

get_recognition_site_indexes() → List[int]¶

Gets the start position of all recognition sites present in the sequence.

Notes

Also appends the start and end of the sequence to enable clearer iteration through the indexes.

Returns: Indexes of fragment(s) start and end positions.
Return type: List[int]

property fragments: Iterable[str]¶

Extracts the coordinates of restriction fragments from the sequence.

Yields: Iterator[Iterable[str]] – Identified restriction fragments in bed format.
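The approach described above (locate every recognition site, bracket the list with the sequence start and end, then walk consecutive pairs of indexes) can be sketched with the standard `re` module. This is a simplified illustration rather than the class itself: it takes a plain string instead of a pysam object, cuts at the start of each recognition site, and the tab-separated output layout is an assumption.

```python
import re


def get_recognition_site_indexes(sequence, cutsite):
    """Start positions of every recognition site, bracketed by 0 and len(sequence)."""
    pattern = re.compile(re.escape(cutsite.upper()))
    site_starts = [match.start() for match in pattern.finditer(sequence.upper())]
    return [0, *site_starts, len(sequence)]


def fragments(chrom_name, sequence, cutsite, fragment_number_offset=0, fragment_min_len=1):
    """Yield bed style entries (chrom, start, end, fragment number) for each fragment."""
    indexes = get_recognition_site_indexes(sequence, cutsite)
    fragment_number = fragment_number_offset
    # Consecutive pairs of indexes delimit one fragment each.
    for start, end in zip(indexes[:-1], indexes[1:]):
        if end - start >= fragment_min_len:
            yield f"{chrom_name}\t{start}\t{end}\t{fragment_number}"
            fragment_number += 1


# Hypothetical toy chromosome with two GATC sites (at positions 2 and 8).
toy_sequence = "AAGATCTTGATCCC"
all_fragments = list(fragments("chr1", toy_sequence, "GATC"))
long_fragments = list(fragments("chr1", toy_sequence, "GATC", fragment_min_len=3))
```

Bracketing the index list with 0 and the sequence length means the first and last fragments need no special handling in the loop.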

class ccanalyser.tools.digest.DigestedRead(read: pysam.libcfaidx.FastqProxy, cutsite: str, min_slice_length: int = 18, slice_number_offset: int = 0, allow_undigested: bool = False, read_type: str = 'flashed')¶

Bases: object

Performs in silico digestion of fastq files.

Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.

read¶

Read to digest.

Type: pysam.FastqProxy

recognition_seq¶

Sequence of restriction recognition site.

Type: str

recognition_len¶

Length of restriction recognition site.

Type: int

recognition_seq¶

Regular expression for restriction recognition site.

Type: re.Pattern

slices¶

List of Fastq formatted digested reads (slices).

Type: List[str]

slice_indexes¶

Indexes of fragment(s) start and end positions.

Type: List[int]

slice_number_offset¶

Starting fragment number.

Type: int

min_slice_len¶

Minimum fragment length required to report fragment.

Type: int

has_slices¶

Recognition site(s) present within sequence.

Type: bool

get_recognition_site_indexes() → List[int]¶
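To illustrate how a minimum slice length and the allow_undigested option might interact, here is a minimal sketch that digests a single read held as plain strings. The function name `digest_read`, the `name|number` slice naming scheme and the choice to cut at the start of each site are assumptions for illustration, not the behaviour of DigestedRead.

```python
import re


def digest_read(name, sequence, quality, cutsite, min_slice_length=18,
                slice_number_offset=0, allow_undigested=False):
    """Return fastq formatted slices of a read cut at each recognition site."""
    site_starts = [m.start() for m in re.finditer(re.escape(cutsite.upper()), sequence.upper())]
    if not site_starts and not allow_undigested:
        return []  # no recognition site: the read is discarded
    indexes = [0, *site_starts, len(sequence)]
    slices = []
    slice_number = slice_number_offset
    for start, end in zip(indexes[:-1], indexes[1:]):
        # Slices shorter than the minimum length are not reported.
        if end - start >= min_slice_length:
            slices.append(
                f"@{name}|{slice_number}\n{sequence[start:end]}\n+\n{quality[start:end]}\n"
            )
            slice_number += 1
    return slices


# Hypothetical 44 bp read with one GATC site at position 20.
read_sequence = "A" * 20 + "GATC" + "C" * 20
read_quality = "I" * len(read_sequence)
digested = digest_read("read1", read_sequence, read_quality, "GATC")
undigested = digest_read("read2", "A" * 30, "I" * 30, "GATC")
kept_undigested = digest_read("read2", "A" * 30, "I" * 30, "GATC", allow_undigested=True)
```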

class ccanalyser.tools.digest.ReadDigestionProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, **digestion_kwargs)¶

Bases: multiprocessing.context.Process

Process subclass for multiprocessing fastq digestion.

run()¶

Performs read digestion.

Reads to digest are pulled from inq, digested with the DigestedRead class and the results placed on outq for writing.

If a statq is provided, read digestion stats are placed into this queue for aggregation.
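The queue-driven worker loop described here follows a common producer/consumer pattern. The sketch below uses `threading` and `queue.Queue` from the standard library as a stand-in for multiprocessing (an assumption made to keep the example portable), `str.split` as a placeholder for DigestedRead, and `None` as a hypothetical termination signal.

```python
import queue
import threading


def digestion_worker(inq, outq, statq=None, cutsite="GATC"):
    """Pull chunks of reads from inq, digest them and place the results on outq."""
    while True:
        chunk = inq.get()
        if chunk is None:  # hypothetical termination signal
            outq.put(None)  # pass the signal on to the writer
            break
        # Placeholder digestion: split each read sequence at the cut site.
        digested = [read.split(cutsite) for read in chunk]
        outq.put(digested)
        if statq is not None:
            statq.put({"reads": len(chunk), "slices": sum(len(s) for s in digested)})


inq, outq, statq = queue.Queue(), queue.Queue(), queue.Queue()
worker = threading.Thread(target=digestion_worker, args=(inq, outq, statq))
worker.start()
inq.put(["AAGATCTT", "CCCC"])
inq.put(None)
worker.join()
digested_chunk = outq.get()
chunk_stats = statq.get()
```

Keeping statistics on a separate queue lets an aggregator sum them without interfering with the stream of digested reads.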

ccanalyser.tools.filter module¶

class ccanalyser.tools.filter.SliceFilter(slices: pandas.core.frame.DataFrame, filter_stages: Optional[dict] = None, sample_name: str = '', read_type: str = '')¶

Bases: object

Perform slice filtering (inplace) and reporter identification.

The SliceFilter classes e.g. CCSliceFilter, TriCSliceFilter, TiledCSliceFilter perform all of the filtering (inplace) and reporter identification whilst also providing statistics of the numbers of slices/reads removed at each stage.

slices¶

Annotated slices dataframe.

Type: pd.DataFrame

fragments¶

Slices dataframe aggregated by parental read.

Type: pd.DataFrame

reporters¶

Slices identified as reporters.

Type: pd.DataFrame

filter_stages¶

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type: dict

slice_stats¶

Provides slice level statistics.

Type: pd.DataFrame

read_stats¶

Provides statistics of slice filtering at the parental read level.

Type: pd.DataFrame

filter_stats¶

Provides statistics of read filtering.

Type: pd.DataFrame

property filters: list¶

A list of the callable filters present within the slice filterer instance.

Returns: All filters present in the class.
Return type: list

property slice_stats: pandas.core.frame.DataFrame¶

Statistics at the slice level.

Returns: Statistics per slice.
Return type: pd.DataFrame

property filter_stats: pandas.core.frame.DataFrame¶

Statistics for each filter stage.

Returns: Statistics of the number of slices removed at each stage.
Return type: pd.DataFrame

property read_stats: pandas.core.frame.DataFrame¶

Gets statistics at a read level.

Aggregates slices by parental read id and calculates stats.

Returns: Statistics of the slices/fragments removed aggregated by read id.
Return type: pd.DataFrame

property fragments: pandas.core.frame.DataFrame¶

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns: Slices aggregated by parental read name.
Return type: pd.DataFrame

property reporters: pandas.core.frame.DataFrame¶

Extracts reporter slices from slices dataframe i.e. non-capture slices

Returns: All non-capture slices
Return type: pd.DataFrame

filter_slices(output_slices=False, output_location='.')¶

Performs slice filtering.

Filters are applied to the slices dataframe in the order specified by filter_stages. Filtering stats aggregated at the slice and fragment level are also printed.

Parameters

output_slices (bool, optional) – Determines if slices are to be output to a specified location after each filtering step. Useful for debugging. Defaults to False.
output_location (str, optional) – Location to output slices at each stage. Defaults to “.”.
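The way filter_stages maps each stage to a list of method names can be sketched with a toy class that looks the methods up by name and records how many slices survive each stage. This is an illustration only: slices are plain dicts rather than a DataFrame, and the class name `MiniSliceFilter` and the stage names are hypothetical.

```python
from collections import Counter


class MiniSliceFilter:
    """Toy stand-in for a slice filter; slices are plain dicts, not a DataFrame."""

    def __init__(self, slices, filter_stages):
        self.slices = list(slices)
        self.filter_stages = filter_stages
        self.filter_stats = []  # (stage, number of slices remaining)

    def remove_unmapped_slices(self):
        self.slices = [s for s in self.slices if s["mapped"]]

    def remove_orphan_slices(self):
        # A fragment with a single remaining slice carries no interaction.
        n_slices = Counter(s["parent_read"] for s in self.slices)
        self.slices = [s for s in self.slices if n_slices[s["parent_read"]] > 1]

    def filter_slices(self):
        # Apply the filters of each stage in order, looking methods up by name.
        for stage, method_names in self.filter_stages.items():
            for method_name in method_names:
                getattr(self, method_name)()
            self.filter_stats.append((stage, len(self.slices)))


toy_slices = [
    {"parent_read": "read1", "mapped": True},
    {"parent_read": "read1", "mapped": True},
    {"parent_read": "read2", "mapped": True},
    {"parent_read": "read2", "mapped": False},
]
stages = {"mapped": ["remove_unmapped_slices"], "contains_pair": ["remove_orphan_slices"]}
slice_filter = MiniSliceFilter(toy_slices, stages)
slice_filter.filter_slices()
```

Storing the filters as method names keeps the filter order configurable without subclassing.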

get_unfiltered_slices()¶: Does not modify slices.

remove_unmapped_slices()¶: Removes slices marked as unmapped (Uncommon)

remove_orphan_slices()¶: Remove fragments with only one aligned slice (Common)

remove_duplicate_re_frags()¶

Prevent the same restriction fragment being counted more than once (Uncommon).

Example

--RE_FRAG1-----Capture-----RE_FRAG1--

remove_slices_without_re_frag_assigned()¶: Removes slices if restriction_fragment column is N/A

remove_duplicate_slices()¶

Remove all slices if the slice coordinates and slice order are shared.

This method is designed to remove a fragment if it is a PCR duplicate (Common).

Example

Frag 1:  chr1:1000-1250 chr1:1500-1750
Frag 2:  chr1:1000-1250 chr1:1500-1750
Frag 3:  chr1:1050-1275 chr1:1600-1755
Frag 4:  chr1:1500-1750 chr1:1000-1250

Frag 2 removed. Frag 1,3,4 retained
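The example above can be reproduced with a short sketch that treats the ordered tuple of slice coordinates as the fragment's identity. Keeping the first occurrence and the dict-based layout are assumptions made for illustration; the class itself works on the slices dataframe.

```python
def remove_duplicate_fragments(fragments):
    """Keep the first fragment seen for each ordered set of slice coordinates."""
    seen = set()
    kept = {}
    for name, coordinates in fragments.items():
        key = tuple(coordinates)  # order matters, so Frag 4 differs from Frag 1
        if key not in seen:
            seen.add(key)
            kept[name] = coordinates
    return kept


fragments = {
    "Frag 1": ["chr1:1000-1250", "chr1:1500-1750"],
    "Frag 2": ["chr1:1000-1250", "chr1:1500-1750"],
    "Frag 3": ["chr1:1050-1275", "chr1:1600-1755"],
    "Frag 4": ["chr1:1500-1750", "chr1:1000-1250"],
}
deduplicated = remove_duplicate_fragments(fragments)
```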

remove_duplicate_slices_pe()¶

Removes PCR duplicates from non-flashed (PE) fragments (Common).

Sequence quality is often lower at the 3’ end of reads leading to variance in mapping coordinates. PCR duplicates are removed by checking that the fragment start and end are not duplicated in the dataframe.
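A minimal sketch of this idea, assuming that "fragment start and end" means the start of the first slice and the end of the last slice (an interpretation, not something confirmed by the text above), and using plain tuples in place of the dataframe:

```python
def remove_duplicate_fragments_pe(fragments):
    """Keep the first fragment seen for each pair of outer coordinates."""
    seen = set()
    kept = {}
    for name, slices in fragments.items():
        first_chrom, first_start, _ = slices[0]
        last_chrom, _, last_end = slices[-1]
        # Only the outermost coordinates are compared; inner ends may vary.
        key = (first_chrom, first_start, last_chrom, last_end)
        if key not in seen:
            seen.add(key)
            kept[name] = slices
    return kept


# Hypothetical fragments: lists of (chrom, start, end) slices.
pe_fragments = {
    "Frag 1": [("chr1", 1000, 1250), ("chr1", 1500, 1750)],
    "Frag 2": [("chr1", 1000, 1240), ("chr1", 1510, 1750)],  # inner ends differ
    "Frag 3": [("chr1", 1000, 1250), ("chr1", 1500, 1800)],
}
pe_deduplicated = remove_duplicate_fragments_pe(pe_fragments)
```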

remove_excluded_slices()¶: Removes any slices in the exclusion region (default 1kb) (V. Common)

remove_blacklisted_slices()¶: Removes slices marked as being within blacklisted regions

class ccanalyser.tools.filter.CCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶

Bases: ccanalyser.tools.filter.SliceFilter

Perform Capture-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Capture-C data. This class has additional methods to remove common artifacts in Capture-C data i.e. multi-capture fragments, non-reporter fragments, multi-capture reporters. The default filter order is as follows:

remove_unmapped_slices

remove_orphan_slices

remove_multi_capture_fragments

remove_excluded_slices

remove_blacklisted_slices

remove_non_reporter_fragments

remove_multicapture_reporters

remove_slices_without_re_frag_assigned

remove_duplicate_re_frags

remove_duplicate_slices

remove_duplicate_slices_pe

remove_non_reporter_fragments

See the individual methods for further details.

slices¶

Annotated slices dataframe.

Type: pd.DataFrame

fragments¶

Slices dataframe aggregated by parental read.

Type: pd.DataFrame

reporters¶

Slices identified as reporters.

Type: pd.DataFrame

filter_stages¶

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type: dict

slice_stats¶

Provides slice level statistics.

Type: pd.DataFrame

read_stats¶

Provides statistics of slice filtering at the parental read level.

Type: pd.DataFrame

filter_stats¶

Provides statistics of read filtering.

Type: pd.DataFrame

property fragments: pandas.core.frame.DataFrame¶

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns: Slices aggregated by parental read name.
Return type: pd.DataFrame

property slice_stats¶

Statistics at the slice level.

Returns: Statistics per slice.
Return type: pd.DataFrame

property frag_stats: pandas.core.frame.DataFrame¶

Statistics aggregated at the fragment level.

As this involves slice aggregation it can be rather slow for large datasets. It is recommended to only use this property if it is required.

Returns: Fragment level statistics
Return type: pd.DataFrame

property reporters: pandas.core.frame.DataFrame¶

Extracts reporter slices from slices dataframe i.e. non-capture slices

Returns: All non-capture slices
Return type: pd.DataFrame

property captures: pandas.core.frame.DataFrame¶

Extracts capture slices from slices dataframe

i.e. slices that do not have a null capture name

Returns: Capture slices
Return type: pd.DataFrame

property capture_site_stats: pandas.core.series.Series¶: Extracts the number of unique capture sites.

property merged_captures_and_reporters: pandas.core.frame.DataFrame¶

Merges captures and reporters sharing the same parental id.

Capture slices and reporter slices with the same parental read id are merged together. The prefixes ‘capture’ and ‘reporter’ are used to identify slices marked as either captures or reporters.

Returns: Merged capture and reporter slices
Return type: pd.DataFrame

property cis_or_trans_stats: pandas.core.frame.DataFrame¶

Extracts reporter cis/trans statistics from slices.

Returns: Reporter cis/trans statistics
Return type: pd.DataFrame
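Conceptually, cis/trans statistics follow from pairing each reporter with the capture slice that shares its parental read and comparing chromosomes. The sketch below illustrates this with plain dicts; the field names and the function `cis_or_trans_counts` are assumptions, and it presumes at most one capture per fragment (as would be the case after multi-capture fragments are removed).

```python
from collections import Counter


def cis_or_trans_counts(slices):
    """Count reporters on the same (cis) or a different (trans) chromosome to their capture."""
    # Map each parental read to its capture name and chromosome.
    capture_by_read = {
        s["parent_read"]: (s["capture"], s["chrom"])
        for s in slices
        if s["capture"] is not None
    }
    counts = Counter()
    for s in slices:
        if s["capture"] is None and s["parent_read"] in capture_by_read:
            capture_name, capture_chrom = capture_by_read[s["parent_read"]]
            interaction = "cis" if s["chrom"] == capture_chrom else "trans"
            counts[(capture_name, interaction)] += 1
    return counts


toy_slices = [
    {"parent_read": "read1", "capture": "probe1", "chrom": "chr1"},
    {"parent_read": "read1", "capture": None, "chrom": "chr1"},
    {"parent_read": "read1", "capture": None, "chrom": "chr5"},
    {"parent_read": "read2", "capture": "probe1", "chrom": "chr1"},
    {"parent_read": "read2", "capture": None, "chrom": "chr1"},
]
cis_trans = cis_or_trans_counts(toy_slices)
```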

remove_non_reporter_fragments()¶: Removes the fragment if it has no reporter slices present (Common)

remove_multi_capture_fragments()¶

Removes double capture fragments.

All slices (i.e. the entire fragment) are removed if more than one capture probe is present i.e. a double capture (V. Common)
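A minimal sketch of this filter, assuming slices are plain dicts with hypothetical `parent_read` and `capture` fields (the class itself operates on the slices dataframe):

```python
from collections import defaultdict


def remove_multi_capture_fragments(slices):
    """Drop every slice of a fragment that contains more than one capture probe."""
    captures_by_read = defaultdict(set)
    for s in slices:
        if s["capture"] is not None:
            captures_by_read[s["parent_read"]].add(s["capture"])
    # Keep slices whose fragment has at most one distinct capture.
    return [s for s in slices if len(captures_by_read[s["parent_read"]]) <= 1]


toy_slices = [
    {"parent_read": "read1", "capture": "probe1"},
    {"parent_read": "read1", "capture": "probe2"},
    {"parent_read": "read1", "capture": None},
    {"parent_read": "read2", "capture": "probe1"},
    {"parent_read": "read2", "capture": None},
]
single_capture = remove_multi_capture_fragments(toy_slices)
```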

remove_multicapture_reporters(n_adjacent: int = 1)¶

Deals with an odd situation in which a reporter spanning two adjacent capture sites is not removed.

Example

------Capture 1----/------Capture 2------ -----REP-----

In this case the “reporter” slice is not considered either a capture or exclusion.

These cases are dealt with by explicitly removing reporters on restriction fragments adjacent to capture sites.

Parameters: n_adjacent – Number of adjacent restriction fragments to remove
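The adjacency rule can be sketched as follows, assuming restriction fragments are numbered consecutively along the genome so that neighbours differ by one. That numbering, the field names and the decision to exclude reporters on the capture fragment itself are assumptions for illustration.

```python
def remove_multicapture_reporters(slices, capture_fragments, n_adjacent=1):
    """Drop reporters lying within n_adjacent restriction fragments of a capture site."""
    excluded = set()
    for fragment in capture_fragments:
        excluded.update(range(fragment - n_adjacent, fragment + n_adjacent + 1))
    return [
        s for s in slices
        if s["capture"] is not None or s["restriction_fragment"] not in excluded
    ]


toy_slices = [
    {"name": "cap", "capture": "probe1", "restriction_fragment": 10},
    {"name": "rep_adjacent", "capture": None, "restriction_fragment": 11},
    {"name": "rep_far", "capture": None, "restriction_fragment": 20},
]
filtered = remove_multicapture_reporters(toy_slices, capture_fragments=[10])
unfiltered = remove_multicapture_reporters(toy_slices, capture_fragments=[10], n_adjacent=0)
```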

class ccanalyser.tools.filter.TriCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶

Bases: ccanalyser.tools.filter.CCSliceFilter

Perform Tri-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Tri-C data. Whilst the vast majority of filters are inherited from CCSliceFilter, this class has additional methods for Tri-C analysis i.e. remove_slices_with_one_reporter. The default filtering order is:

remove_unmapped_slices

remove_slices_without_re_frag_assigned

remove_orphan_slices

remove_multi_capture_fragments

remove_blacklisted_slices

remove_non_reporter_fragments

remove_multicapture_reporters

remove_duplicate_re_frags

remove_duplicate_slices

remove_duplicate_slices_pe

remove_non_reporter_fragments

remove_slices_with_one_reporter

See the individual methods for further details.

slices¶

Annotated slices dataframe.

Type: pd.DataFrame

fragments¶

Slices dataframe aggregated by parental read.

Type: pd.DataFrame

reporters¶

Slices identified as reporters.

Type: pd.DataFrame

filter_stages¶

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type: dict

slice_stats¶

Provides slice level statistics.

Type: pd.DataFrame

read_stats¶

Provides statistics of slice filtering at the parental read level.

Type: pd.DataFrame

filter_stats¶

Provides statistics of read filtering.

Type: pd.DataFrame

remove_slices_with_one_reporter()¶: Removes fragments if they do not contain at least two reporters.

class ccanalyser.tools.filter.TiledCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶

Bases: ccanalyser.tools.filter.SliceFilter

Perform Tiled-C slice filtering (inplace) and reporter identification.

SliceFilter tuned specifically for Tiled-C data. This class has additional methods to remove common artifacts in Tiled-C data i.e. non-capture fragments, multi-capture (with different tiled regions) fragments. A reporter is defined differently in a Tiled-C analysis as a reporter slice can also be a capture slice.

The default filter order is as follows:

remove_unmapped_slices

remove_orphan_slices

remove_blacklisted_slices

remove_non_capture_fragments

remove_dual_capture_fragments

remove_slices_without_re_frag_assigned

remove_duplicate_re_frags

remove_duplicate_slices

remove_duplicate_slices_pe

remove_orphan_slices

See the individual methods for further details.

slices¶

Annotated slices dataframe.

Type: pd.DataFrame

fragments¶

Slices dataframe aggregated by parental read.

Type: pd.DataFrame

reporters¶

Slices identified as reporters.

Type: pd.DataFrame

filter_stages¶

Dictionary containing stages and a list of class methods (str) required to get to this stage.

Type: dict

slice_stats¶

Provides slice level statistics.

Type: pd.DataFrame

read_stats¶

Provides statistics of slice filtering at the parental read level.

Type: pd.DataFrame

filter_stats¶

Provides statistics of read filtering.

Type: pd.DataFrame

property fragments: pandas.core.frame.DataFrame¶

Summarises slices at the fragment level.

Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.

Returns: Slices aggregated by parental read name.
Return type: pd.DataFrame

property slice_stats¶

Statistics at the slice level.

Returns: Statistics per slice.
Return type: pd.DataFrame

property cis_or_trans_stats: pandas.core.frame.DataFrame¶

Extracts reporter cis/trans statistics from slices.

Unlike Capture-C/Tri-C, reporter slices can also be capture slices, as all slices within the capture region are considered reporters. To extract cis/trans statistics, one capture slice in each fragment is considered to be the “primary capture”; this enables merging of the “primary capture” with the other reporters both inside and outside of the tiled region.

Returns: Reporter cis/trans statistics
Return type: pd.DataFrame

remove_slices_outside_capture()¶: Removes slices outside of capture region(s)

remove_non_capture_fragments()¶: Removes fragments without a capture assigned

remove_dual_capture_fragments()¶

Removes a fragment with multiple different capture sites.

Modified for TiledC filtering as the fragment dataframe is generated slightly differently.

ccanalyser.tools.io module¶

class ccanalyser.tools.io.FastqReaderProcess(input_files: Union[str, list], outq: multiprocessing.context.BaseContext.Queue, read_buffer: int = 100000, read_counter: Optional[multiprocessing.managers.BaseManager.register.<locals>.temp] = None, n_subprocesses: int = 1, statq: Optional[multiprocessing.context.BaseContext.Queue] = None)¶

Bases: multiprocessing.context.Process

Reads fastq file(s) in chunks and places them on a queue.

input_files¶: Input fastq files.

outq¶: Output queue for chunked reads/read pairs.

statq¶: (Not currently used) Queue for read statistics if required.

read_buffer¶: Number of reads to process before placing them on outq

read_counter¶: (Not currently used) Can be used to sync between multiple readers.

n_subprocesses¶: Number of processes running concurrently. Used to make sure enough termination signals are used.

run()¶: Performs reading and chunking of fastq file(s).
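Chunked reading with one termination signal per consumer can be sketched with the standard library alone. In this illustration an in-memory `io.StringIO` stands in for a fastq file, `queue.Queue` for the multiprocessing queue, and `None` is a hypothetical termination signal; the function name `read_fastq_chunks` is also an assumption.

```python
import io
import queue
from itertools import islice


def read_fastq_chunks(handle, outq, read_buffer=2, n_subprocesses=1):
    """Place lists of up to read_buffer fastq records on outq, then signal the end."""
    while True:
        # A fastq record spans four lines: name, sequence, separator, quality.
        lines = list(islice(handle, 4 * read_buffer))
        if not lines:
            break
        records = [
            tuple(line.rstrip("\n") for line in lines[i:i + 4])
            for i in range(0, len(lines), 4)
        ]
        outq.put(records)
    # One termination signal per consumer so that every worker stops.
    for _ in range(n_subprocesses):
        outq.put(None)


fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n")
outq = queue.Queue()
read_fastq_chunks(fastq, outq, read_buffer=2, n_subprocesses=2)
chunks = []
while not outq.empty():
    chunks.append(outq.get())
```

Each consumer stops after taking exactly one signal, which is why the number of signals needs to match the number of concurrent processes.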

class ccanalyser.tools.io.FastqReadFormatterProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, formatting: Optional[list] = None)¶

Bases: multiprocessing.context.Process

run()¶: Method to be run in sub-process; can be overridden in sub-class

class ccanalyser.tools.io.FastqWriterSplitterProcess(inq: multiprocessing.context.BaseContext.Queue, output_prefix: Union[str, list], paired_output: bool = False, gzip=False, compression_level: int = 3, compression_threads: int = 8, n_subprocesses: int = 1, n_workers_terminated: int = 0, n_files_written: int = 0)¶

Bases: multiprocessing.context.Process

run()¶: Method to be run in sub-process; can be overridden in sub-class

class ccanalyser.tools.io.FastqWriterProcess(inq: multiprocessing.context.BaseContext.Queue, output: Union[str, list], compression_level: int = 5, n_subprocesses: int = 1)¶

Bases: multiprocessing.context.Process

run()¶: Method to be run in sub-process; can be overridden in sub-class

ccanalyser.tools.io.parse_alignment(aln)¶

Parses reads from a bam file into a list.

Extracts:

- read name
- parent reads
- flashed status
- slice number
- mapped status
- multimapping status
- chromosome number (e.g. chr10)
- start (e.g. 1000)
- end (e.g. 2000)
- coords (e.g. chr10:1000-2000)

Parameters: aln – pysam.AlignmentFile.
Returns: List containing the attributes extracted.
Return type: list

ccanalyser.tools.io.parse_bam(bam)¶

Uses the parse_alignment function to convert a bam file to a dataframe.

Extracts:

- slice_name
- parent_read
- pe
- slice
- mapped
- multimapped
- chrom
- start
- end
- coordinates

Parameters: bam – File name of bam file to process.
Returns: DataFrame with the columns listed above.
Return type: pd.DataFrame

ccanalyser.tools.pileup module¶

class ccanalyser.tools.pileup.CoolerBedGraph(cooler_fn: str, sparse: bool = True, only_cis: bool = False)¶

Bases: object

Generates a bedgraph file from a cooler file created by interactions-store.

cooler¶

Cooler file to use for bedgraph production

Type: cooler.Cooler

capture_name¶

Name of capture probe being processed.

Type: str

sparse¶

Only output bins with interactions.

Type: bool

only_cis¶

Only output cis interactions.

Type: bool

property bedgraph: pandas.core.frame.DataFrame¶: Returns: pd.DataFrame: DataFrame in bedgraph format.

property reporters: pandas.core.frame.DataFrame¶

Interactions with capture fragments/bins.

Returns: DataFrame containing just bins interacting with the capture probe.
Return type: pd.DataFrame

normalise_bedgraph(scale_factor=1000000.0) → pandas.core.frame.DataFrame¶

Normalises the bedgraph.

Uses the number of cis interactions to normalise the bedgraph counts.

Parameters: scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.
Returns: Normalised bedgraph formatted DataFrame
Return type: pd.DataFrame
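One plausible reading of this normalisation is counts divided by the number of cis interactions and multiplied by the scale factor. The sketch below shows that arithmetic on plain tuples; the exact formula, the function signature and the tuple layout are assumptions rather than the method's confirmed behaviour.

```python
def normalise_bedgraph(bedgraph, n_cis_interactions, scale_factor=1e6):
    """Scale each count to interactions per scale_factor cis interactions."""
    return [
        (chrom, start, end, count / n_cis_interactions * scale_factor)
        for chrom, start, end, count in bedgraph
    ]


# Hypothetical bedgraph rows: (chrom, start, end, count).
raw = [("chr1", 0, 100, 5), ("chr1", 100, 200, 15)]
normalised = normalise_bedgraph(raw, n_cis_interactions=20)
```

Normalising by the cis count makes samples with different sequencing depths comparable.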

to_file(fn: os.PathLike, normalise: bool = False, **normalise_kwargs)¶

Outputs the bedgraph dataframe to a file.

If normalise is True, will also normalise the counts by the number of cis interactions.

Parameters

fn (os.PathLike) – Output file name.
normalise (bool, optional) – Normalise the bedgraph before writing to file. Defaults to False.

class ccanalyser.tools.pileup.CoolerBedGraphWindowed(cooler_fn: str, binsize: int = 5000.0, binner: Optional[ccanalyser.tools.storage.CoolerBinner] = None, sparse=True)¶

Bases: ccanalyser.tools.pileup.CoolerBedGraph

normalise_bedgraph(scale_factor=1000000.0)¶

Normalises the bedgraph.

Uses the number of cis interactions to normalise the bedgraph counts.

Parameters: scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.
Returns: Normalised bedgraph formatted DataFrame
Return type: pd.DataFrame

property reporters_binned¶

class ccanalyser.tools.pileup.CCBedgraph(path=None, df=None, capture_name='', capture_chrom='', capture_start='', capture_end='')¶

Bases: object

property score¶

property coordinates¶

to_bedtool()¶

to_file(path)¶

ccanalyser.tools.plotting module¶

class ccanalyser.tools.plotting.CCMatrix(cooler_fn: os.PathLike, binsize: 5000, capture_name: str, remove_capture=False)¶

Bases: object

get_matrix(coordinates, field='count')¶

get_matrix_normalised(coordinates, normalisation_method=None, **normalisation_kwargs)¶

ccanalyser.tools.statistics module¶

class ccanalyser.tools.statistics.DeduplicationStatistics(sample: str, read_type: str = 'pe', reads_total: int = 0, reads_unique: int = 0)¶

Bases: object

property df¶

class ccanalyser.tools.statistics.DigestionStats(read_type, read_number, unfiltered, filtered)¶

Bases: tuple

filtered¶: Alias for field number 3

read_number¶: Alias for field number 1

read_type¶: Alias for field number 0

unfiltered¶: Alias for field number 2
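The "Alias for field number" entries are what a named tuple provides: each field can be read by name or by position. A minimal sketch using `collections.namedtuple`, with hypothetical values:

```python
from collections import namedtuple

# Field order matches the documented field numbers 0 to 3.
DigestionStats = namedtuple(
    "DigestionStats", ["read_type", "read_number", "unfiltered", "filtered"]
)

stats = DigestionStats(read_type="flashed", read_number=1, unfiltered=4, filtered=3)
by_name = stats.filtered
by_position = stats[3]
```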

ccanalyser.tools.statistics.collate_histogram_data(fnames)¶

ccanalyser.tools.statistics.collate_read_data(fnames)¶

ccanalyser.tools.statistics.collate_slice_data(fnames)¶

ccanalyser.tools.statistics.collate_cis_trans_data(fnames)¶

ccanalyser.tools.statistics.extract_trimming_stats(fn)¶

ccanalyser.tools.storage module¶

ccanalyser.tools.storage.get_capture_coords(viewpoint_file: str, viewpoint_name: str)¶

ccanalyser.tools.storage.get_capture_bins(bins, viewpoint_chrom, viewpoint_start, viewpoint_end)¶

ccanalyser.tools.storage.create_cooler_cc(output_prefix: str, bins: pandas.core.frame.DataFrame, pixels: pandas.core.frame.DataFrame, capture_name: str, capture_viewpoints: os.PathLike, capture_bins: Optional[Union[int, list]] = None, suffix=None, **cooler_kwargs) → os.PathLike¶

Creates a cooler hdf5 file or cooler formatted group within a hdf5 file.

Parameters

output_prefix (str) – Output path for hdf5 file. If this already exists, will append a new group to the file.
bins (pd.DataFrame) – DataFrame containing the genomic coordinates of all bins in the pixels table.
pixels (pd.DataFrame) – DataFrame with columns: bin1_id, bin2_id, count.
capture_name (str) – Name of capture probe to store.
capture_viewpoints (os.PathLike) – Path to capture viewpoints used for the analysis.
capture_bins (Union[int, list], optional) – Bins containing capture viewpoints. Can be determined from viewpoints if not supplied. Defaults to None.
suffix (str, optional) – Suffix to append before the .hdf5 file extension. Defaults to None.

Raises

ValueError – Capture name must exactly match the name of a supplied capture viewpoint.

Returns

Path of cooler hdf5 file.

Return type

os.PathLike
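The pixels table described above (columns bin1_id, bin2_id, count) can be built from pairs of interacting bins by counting each unordered pair once. The sketch below is illustrative only: it uses plain dicts instead of a DataFrame, the function name `make_pixels` is hypothetical, and ordering each pair so that bin1_id <= bin2_id is an assumption based on cooler's usual upper-triangle storage.

```python
from collections import Counter


def make_pixels(interactions):
    """Count interactions between pairs of bin ids, ordering each pair as (low, high)."""
    counts = Counter((min(a, b), max(a, b)) for a, b in interactions)
    return [
        {"bin1_id": bin1, "bin2_id": bin2, "count": n}
        for (bin1, bin2), n in sorted(counts.items())
    ]


# Hypothetical pairs of interacting bin ids.
pixels = make_pixels([(5, 2), (2, 5), (2, 7)])
```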

class ccanalyser.tools.storage.GenomicBinner(chromsizes: Union[os.PathLike, pandas.core.frame.DataFrame, pandas.core.series.Series], fragments: pandas.core.frame.DataFrame, binsize: int = 5000, n_cores: int = 8, method: Literal[midpoint, overlap] = 'midpoint', min_overlap: float = 0.2)¶

Bases: object

Provides a conversion table for converting between two sets of bins.

chromsizes¶

Series indexed by chromosome name containing chromosome sizes in bp

Type: pd.Series

fragments¶

DataFrame containing bins to convert to equal genomic intervals

Type: pd.DataFrame

binsize¶

Genomic bin size

Type: int

min_overlap¶

Minimum degree of intersection to define an overlap.

Type: float

n_cores¶

Number of cores to use for bin intersection.

Type: int

property bins: pandas.core.frame.DataFrame¶

Equal genomic bins.

Returns: DataFrame in bed format.
Return type: pd.DataFrame

property bin_conversion_table: pandas.core.frame.DataFrame¶: Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.
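To illustrate what a conversion table between restriction fragments and equal genomic bins might contain, the sketch below generates fixed-width bins from chromosome sizes and assigns each fragment to the bin holding its midpoint (one reading of the 'midpoint' method). The function names, tuple layouts and sequential bin numbering are assumptions; the class itself works with DataFrames.

```python
def make_bins(chromsizes, binsize=5000):
    """Return (chrom, start, end, bin_id) for equal genomic bins across all chromosomes."""
    bins = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, binsize):
            # The last bin of a chromosome is truncated at the chromosome end.
            bins.append((chrom, start, min(start + binsize, size), len(bins)))
    return bins


def bin_conversion_table(fragments, bins, binsize=5000):
    """Map each fragment id to the id of the bin containing the fragment midpoint."""
    first_bin = {}
    for chrom, _, _, bin_id in bins:
        first_bin.setdefault(chrom, bin_id)  # id of the first bin on each chromosome
    table = {}
    for chrom, start, end, fragment_id in fragments:
        midpoint = (start + end) // 2
        table[fragment_id] = first_bin[chrom] + midpoint // binsize
    return table


# Hypothetical chromosome sizes and restriction fragments.
genomic_bins = make_bins({"chr1": 12000, "chr2": 6000})
restriction_fragments = [
    ("chr1", 4000, 5500, 0),
    ("chr1", 4900, 7000, 1),
    ("chr2", 5200, 5800, 2),
]
conversion = bin_conversion_table(restriction_fragments, genomic_bins)
```

A midpoint rule assigns every fragment to exactly one bin, whereas an overlap rule (with a minimum overlap) may assign a fragment to several.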

class ccanalyser.tools.storage.CoolerBinner(cooler_fn: os.PathLike, binsize: Optional[int] = None, n_cores: int = 8, binner: Optional[ccanalyser.tools.storage.GenomicBinner] = None)¶

Bases: object

Bins a cooler file into equal genomic intervals.

cooler¶: (cooler.Cooler): Cooler instance to bin.

binner¶

Binner class to generate bin conversion tables

Type: ccanalyser.tools.storage.GenomicBinner

binsize¶

Genomic bin size

Type: int

scale_factor¶

Scaling factor for normalising interaction counts.

Type: int

n_cis_interactions¶

Number of cis interactions with the capture bins.

Type: int

n_cores¶

Number of cores to use for binning.

Type: int

property bins: pandas.core.frame.DataFrame¶: Returns: pd.DataFrame: Even genomic bins of a specified binsize.

property bin_conversion_table: pandas.core.frame.DataFrame¶: Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.

property capture_bins¶: Returns: pd.DataFrame: Capture bins converted to the new even genomic bin format.

property pixel_conversion_table¶: Returns: pd.DataFrame: Conversion table to convert old binning scheme to the new.

property pixels¶: Returns: pd.DataFrame: Pixels (interaction counts) converted to the new binning scheme.

normalise_pixels(n_fragment_correction: bool = True, n_interaction_correction: bool = True, scale_factor: int = 1000000.0)¶

Normalises pixels (interactions).

Normalises pixels according to the number of restriction fragments per bin and the number of cis interactions. If both normalisation options are selected, will also provide a dual normalised column.

Parameters

n_fragment_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of restriction fragments per bin. Defaults to True.
n_interaction_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of cis interactions. Defaults to True.
scale_factor (int, optional) – Scaling factor for n_interaction_correction. Defaults to 1e6.

to_cooler(store, normalise=False, **normalise_options)¶

ccanalyser.tools.storage.link_bins(clr: os.PathLike)¶

Reduces cooler storage space by linking “bins” table.

All of the cooler “bins” tables containing the genomic coordinates of each bin are identical for all cooler files of the same resolution. As cooler.create_cooler generates a new bins table for each cooler, this leads to a high degree of duplication.

This function hard links the bins tables for a given resolution to reduce the degree of duplication.

Parameters: clr (os.PathLike) – Path to cooler hdf5 produced by the merge command.