CCanalyser tools¶
ccanalyser.tools.annotate module¶
- class ccanalyser.tools.annotate.BedIntersection(bed1: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], bed2: Union[str, pybedtools.bedtool.BedTool, pandas.core.frame.DataFrame], intersection_name: str = 'count', intersection_method: str = 'count', intersection_min_frac: float = 1e-09, intersection_split_chrom: bool = True, n_cores: int = 1, invalid_bed_action='error')¶
Bases:
object
Performs intersection between two named bed files.
Wrapper around bedtools intersect designed to intersect in parallel (by splitting file based on chromosome) and handle malformed bed files.
- bed1¶
Bed file to intersect. Must be named.
- Type
Union[str, BedTool, pd.DataFrame]
- bed2¶
Bed file to intersect.
- Type
Union[str, BedTool, pd.DataFrame]
- intersection_name¶
Name for intersection.
- Type
str
- min_frac¶
Minimum fraction required for intersection
- Type
float
- n_cores¶
Number of cores for parallel intersection.
- Type
int
- invalid_bed_action¶
Method to deal with missing/malformed bed files (“ignore” | “error”).
- property intersection: pandas.core.series.Series¶
Intersects the two bed files and returns a pd.Series.
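The split-by-chromosome strategy described for this class can be illustrated with a minimal pure-Python sketch. It does not call bedtools or pybedtools, it ignores the minimum-fraction setting, and the helper names (split_by_chrom, count_overlaps, intersect_counts) are hypothetical; only the "count" style of intersection is shown.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_by_chrom(intervals):
    """Group (chrom, start, end, name) tuples by chromosome."""
    groups = defaultdict(list)
    for interval in intervals:
        groups[interval[0]].append(interval)
    return groups


def count_overlaps(bed1_chrom, bed2_chrom):
    """For each named bed1 interval, count the bed2 intervals that overlap it."""
    counts = {}
    for _, start, end, name in bed1_chrom:
        counts[name] = sum(1 for _, s, e, _n in bed2_chrom if s < end and e > start)
    return counts


def intersect_counts(bed1, bed2, n_cores=1):
    """Intersect chromosome by chromosome, one task per chromosome."""
    groups1, groups2 = split_by_chrom(bed1), split_by_chrom(bed2)
    results = {}
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        futures = [
            pool.submit(count_overlaps, intervals, groups2.get(chrom, []))
            for chrom, intervals in groups1.items()
        ]
        for future in futures:
            results.update(future.result())
    return results


bed1 = [("chr1", 100, 200, "slice_a"), ("chr2", 50, 150, "slice_b")]
bed2 = [
    ("chr1", 150, 300, "frag_1"),
    ("chr1", 190, 400, "frag_2"),
    ("chr2", 500, 600, "frag_3"),
]
counts = intersect_counts(bed1, bed2, n_cores=2)
print(counts)
```

The result is keyed by the name column of the first bed file, which is why bed1 must be named.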
ccanalyser.tools.deduplicate module¶
- class ccanalyser.tools.deduplicate.ReadDeduplicationParserProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, hash_seed: int = 42, save_hashed_dict_path: os.PathLike = 'parsed.json')¶
Bases:
multiprocessing.context.Process
Process subclass for parsing fastq file(s) into a hashed {id:sequence} json format.
- inq¶
Input read queue
- outq¶
Output read queue (Not currently used)
- hash_seed¶
Seed for xxhash64 algorithm to ensure consistency
- save_hashed_dict_path¶
Path to save hashed dictionary
- run()¶
Processes fastq reads from multiple files and generates a hashed json dictionary.
Dictionary is hashed and in the format {(read 1 name + read 2 name): (sequence 1 + sequence 2)}.
Output path is specified by save_hashed_dict_path.
- class ccanalyser.tools.deduplicate.ReadDuplicateRemovalProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, duplicated_ids: set, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, hash_seed: int = 42)¶
Bases:
multiprocessing.context.Process
Process subclass for parsing fastq file(s) and removing identified duplicates.
- inq¶
Input read queue
- outq¶
Output queue for deduplicated reads.
- duplicated_ids¶
Concatenated read ids to remove from input fastq files.
- statq¶
Output queue for statistics.
- reads_total¶
Number of fastq reads processed.
- reads_unique¶
Number of non-duplicated reads output.
- hash_seed¶
Seed for xxhash algorithm. MUST be the same as used by ReadDeduplicationParserProcess.
- run()¶
Performs read deduplication based on sequence.
Unique reads are placed on outq and deduplication statistics are placed on statq.
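The two-pass approach implied by these classes (hash every read pair, then drop the ids flagged as duplicates) might look roughly like the sketch below. It uses hashlib.blake2b as a stand-in for xxhash64 so that it runs with the standard library alone, and the intermediate step that turns the hashed dictionary into a set of duplicated ids (find_duplicated_ids) is an assumption, since that step is not documented in this section. All function names are hypothetical.

```python
import hashlib


def hash_text(text, seed=42):
    """Seeded 64-bit hash; blake2b stands in for xxhash64 in this sketch."""
    salt = seed.to_bytes(8, "little")
    return hashlib.blake2b(text.encode(), digest_size=8, salt=salt).hexdigest()


def parse_reads(read_pairs, seed=42):
    """Build {hash(read 1 name + read 2 name): hash(sequence 1 + sequence 2)}."""
    return {
        hash_text(r1[0] + r2[0], seed): hash_text(r1[1] + r2[1], seed)
        for r1, r2 in read_pairs
    }


def find_duplicated_ids(hashed):
    """Assumed step: keep the first id seen per sequence hash, flag the rest."""
    seen, duplicated = set(), set()
    for read_id, sequence_hash in hashed.items():
        if sequence_hash in seen:
            duplicated.add(read_id)
        else:
            seen.add(sequence_hash)
    return duplicated


def remove_duplicates(read_pairs, duplicated_ids, seed=42):
    """Drop flagged pairs; the seed must match the one used for parsing."""
    unique = [
        pair
        for pair in read_pairs
        if hash_text(pair[0][0] + pair[1][0], seed) not in duplicated_ids
    ]
    stats = {"reads_total": len(read_pairs), "reads_unique": len(unique)}
    return unique, stats


# Each read is a (name, sequence) tuple; pairs 1 and 2 share their sequences.
pairs = [
    (("read1/1", "ACGT"), ("read1/2", "TTGA")),
    (("read2/1", "ACGT"), ("read2/2", "TTGA")),
    (("read3/1", "GGCC"), ("read3/2", "TTGA")),
]
hashed = parse_reads(pairs)
duplicated = find_duplicated_ids(hashed)
unique, stats = remove_duplicates(pairs, duplicated)
print(stats)
```

Because ids are compared as hashes, using a different seed in the second pass would silently fail to match any duplicated id, which is why the seed must be shared.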
ccanalyser.tools.digest module¶
- class ccanalyser.tools.digest.DigestedChrom(chrom: pysam.libcfaidx.FastqProxy, cutsite: str, fragment_number_offset: int = 0, fragment_min_len: int = 1)¶
Bases:
object
Performs in silico digestion of fasta files.
Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.
- chrom¶
Chromosome to digest
- Type
pysam.FastqProxy
- recognition_seq¶
Sequence of restriction recognition site
- Type
str
- recognition_len¶
Length of restriction recognition site
- Type
int
- recognition_seq¶
Regular expression for restriction recognition site
- Type
re.Pattern
- fragment_indexes¶
Indexes of fragment(s) start and end positions.
- Type
List[int]
- fragment_number_offset¶
Starting fragment number.
- Type
int
- fragment_min_len¶
Minimum fragment length required to report fragment
- Type
int
- get_recognition_site_indexes() → List[int]¶
Gets the start position of all recognition sites present in the sequence.
Notes
Also appends the start and end of the sequence to enable clearer iteration through the indexes.
- Returns
Indexes of fragment(s) start and end positions.
- Return type
List[int]
- property fragments: Iterable[str]¶
Extracts the coordinates of restriction fragments from the sequence.
- Yields
Iterator[Iterable[str]] – Identified restriction fragments in bed format.
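As a rough illustration of the digestion logic described above, the sketch below finds recognition-site start positions with a regular expression, pads the list with the sequence start and end, and emits bed-style lines. It works on a plain string rather than a pysam record, and the exact cut position and output columns used by the real class are assumptions.

```python
import re


def recognition_site_indexes(sequence, cutsite):
    """Start positions of all recognition sites, padded with sequence start/end."""
    pattern = re.compile(cutsite, re.IGNORECASE)
    sites = [match.start() for match in pattern.finditer(sequence)]
    return [0, *sites, len(sequence)]


def digest(chrom_name, sequence, cutsite, fragment_number_offset=0, fragment_min_len=1):
    """Yield tab-separated bed-style entries: chrom, start, end, fragment number."""
    indexes = recognition_site_indexes(sequence, cutsite)
    fragment_number = fragment_number_offset
    # Consecutive index pairs delimit the fragments.
    for start, end in zip(indexes, indexes[1:]):
        if end - start >= fragment_min_len:
            yield f"{chrom_name}\t{start}\t{end}\t{fragment_number}"
            fragment_number += 1


fragments = list(digest("chr1", "AAGATCTTTTGATCCC", "GATC"))
for fragment in fragments:
    print(fragment)
```

Padding the index list with 0 and the sequence length means the loop never needs special cases for the first and last fragment; a site at position 0 simply produces a zero-length fragment that the minimum-length check discards.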
- class ccanalyser.tools.digest.DigestedRead(read: pysam.libcfaidx.FastqProxy, cutsite: str, min_slice_length: int = 18, slice_number_offset: int = 0, allow_undigested: bool = False, read_type: str = 'flashed')¶
Bases:
object
Performs in silico digestion of fastq files.
Identifies all restriction sites for a supplied restriction enzyme/restriction site and generates bed style entries.
- read¶
Read to digest.
- Type
pysam.FastqProxy
- recognition_seq¶
Sequence of restriction recognition site.
- Type
str
- recognition_len¶
Length of restriction recognition site.
- Type
int
- recognition_seq¶
Regular expression for restriction recognition site.
- Type
re.Pattern
- slices¶
List of Fastq formatted digested reads (slices).
- Type
List[str]
- slice_indexes¶
Indexes of fragment(s) start and end positions.
- Type
List[int]
- slice_number_offset¶
Starting fragment number.
- Type
int
- min_slice_len¶
Minimum fragment length required to report fragment.
- Type
int
- has_slices¶
Recognition site(s) present within sequence.
- Type
bool
- get_recognition_site_indexes() → List[int]¶
- class ccanalyser.tools.digest.ReadDigestionProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, statq: Optional[multiprocessing.context.BaseContext.Queue] = None, **digestion_kwargs)¶
Bases:
multiprocessing.context.Process
Process subclass for multiprocessing fastq digestion.
- run()¶
Performs read digestion.
Reads to digest are pulled from inq, digested with the DigestedRead class and the results placed on outq for writing.
If a statq is provided, read digestion stats are placed into this queue for aggregation.
ccanalyser.tools.filter module¶
- class ccanalyser.tools.filter.SliceFilter(slices: pandas.core.frame.DataFrame, filter_stages: Optional[dict] = None, sample_name: str = '', read_type: str = '')¶
Bases:
object
Perform slice filtering (inplace) and reporter identification.
The SliceFilter classes e.g. CCSliceFilter, TriCSliceFilter, TiledCSliceFilter perform all of the filtering (inplace) and reporter identification whilst also providing statistics of the numbers of slices/reads removed at each stage.
- slices¶
Annotated slices dataframe.
- Type
pd.DataFrame
- fragments¶
Slices dataframe aggregated by parental read.
- Type
pd.DataFrame
- reporters¶
Slices identified as reporters.
- Type
pd.DataFrame
- filter_stages¶
Dictionary containing stages and a list of class methods (str) required to get to this stage.
- Type
dict
- slice_stats¶
Provides slice level statistics.
- Type
pd.DataFrame
- read_stats¶
Provides statistics of slice filtering at the parental read level.
- Type
pd.DataFrame
- filter_stats¶
Provides statistics of read filtering.
- Type
pd.DataFrame
- property filters: list¶
A list of the callable filters present within the slice filterer instance.
- Returns
All filters present in the class.
- Return type
list
- property slice_stats: pandas.core.frame.DataFrame¶
Statistics at the slice level.
- Returns
Statistics per slice.
- Return type
pd.DataFrame
- property filter_stats: pandas.core.frame.DataFrame¶
Statistics for each filter stage.
- Returns
Statistics of the number of slices removed at each stage.
- Return type
pd.DataFrame
- property read_stats: pandas.core.frame.DataFrame¶
Gets statistics at a read level.
Aggregates slices by parental read id and calculates stats.
- Returns
Statistics of the slices/fragments removed aggregated by read id.
- Return type
pd.DataFrame
- property fragments: pandas.core.frame.DataFrame¶
Summarises slices at the fragment level.
Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.
- Returns
Slices aggregated by parental read name.
- Return type
pd.DataFrame
- property reporters: pandas.core.frame.DataFrame¶
Extracts reporter slices from slices dataframe i.e. non-capture slices
- Returns
All non-capture slices
- Return type
pd.DataFrame
- filter_slices(output_slices=False, output_location='.')¶
Performs slice filtering.
Filters are applied to the slices dataframe in the order specified by filter_stages. Filtering stats aggregated at the slice and fragment level are also printed.
- Parameters
output_slices (bool, optional) – Determines if slices are to be output to a specified location after each filtering step. Useful for debugging. Defaults to False.
output_location (str, optional) – Location to output slices at each stage. Defaults to “.”.
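The idea of driving filtering from a filter_stages dictionary of method names can be sketched with a toy class. This is not the SliceFilter implementation: slices are plain dictionaries instead of a DataFrame, and the stage names and the "mapped" and "blacklist" keys are hypothetical.

```python
class MiniSliceFilter:
    """Toy stand-in for a slice filter; slices are plain dicts here."""

    def __init__(self, slices, filter_stages):
        self.slices = list(slices)
        self.filter_stages = filter_stages
        self.filter_stats = []  # (stage name, slices remaining) per stage

    def get_unfiltered_slices(self):
        """Does not modify slices."""

    def remove_unmapped_slices(self):
        self.slices = [s for s in self.slices if s["mapped"]]

    def remove_blacklisted_slices(self):
        self.slices = [s for s in self.slices if not s["blacklist"]]

    def filter_slices(self):
        """Apply each stage's methods in order, recording counts after each stage."""
        for stage, method_names in self.filter_stages.items():
            for method_name in method_names:
                getattr(self, method_name)()  # look the filter up by name
            self.filter_stats.append((stage, len(self.slices)))


slices = [
    {"name": "read1|0", "mapped": True, "blacklist": False},
    {"name": "read1|1", "mapped": False, "blacklist": False},
    {"name": "read2|0", "mapped": True, "blacklist": True},
]
stages = {
    "pre-filtering": ["get_unfiltered_slices"],
    "mapped": ["remove_unmapped_slices"],
    "not_blacklisted": ["remove_blacklisted_slices"],
}
slice_filter = MiniSliceFilter(slices, stages)
slice_filter.filter_slices()
print(slice_filter.filter_stats)
```

Storing method names as strings keeps the stage order configurable without subclassing, at the cost of failing only at run time if a name is misspelled.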
- get_unfiltered_slices()¶
Does not modify slices.
- remove_unmapped_slices()¶
Removes slices marked as unmapped (Uncommon)
- remove_orphan_slices()¶
Remove fragments with only one aligned slice (Common)
- remove_duplicate_re_frags()¶
Prevent the same restriction fragment being counted more than once (Uncommon).
Example
--RE_FRAG1-- --Capture-- --RE_FRAG1--
- remove_slices_without_re_frag_assigned()¶
Removes slices if restriction_fragment column is N/A
- remove_duplicate_slices()¶
Remove all slices if the slice coordinates and slice order are shared.
This method is designed to remove a fragment if it is a PCR duplicate (Common).
Example
Frag 1: chr1:1000-1250 chr1:1500-1750
Frag 2: chr1:1000-1250 chr1:1500-1750
Frag 3: chr1:1050-1275 chr1:1600-1755
Frag 4: chr1:1500-1750 chr1:1000-1250
Frag 2 removed. Frag 1, 3, 4 retained.
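The four-fragment example for remove_duplicate_slices can be reproduced with a short sketch that treats the ordered tuple of slice coordinates as the duplicate key. It uses a plain dictionary rather than the slices DataFrame, and the function name is hypothetical.

```python
def remove_duplicate_fragments(fragments):
    """Keep the first fragment seen for each ordered set of slice coordinates."""
    seen = set()
    retained = {}
    for name, coordinates in fragments.items():
        key = tuple(coordinates)  # order matters, so Frag 4 differs from Frag 1
        if key not in seen:
            seen.add(key)
            retained[name] = coordinates
    return retained


fragments = {
    "Frag 1": ["chr1:1000-1250", "chr1:1500-1750"],
    "Frag 2": ["chr1:1000-1250", "chr1:1500-1750"],
    "Frag 3": ["chr1:1050-1275", "chr1:1600-1755"],
    "Frag 4": ["chr1:1500-1750", "chr1:1000-1250"],
}
retained = remove_duplicate_fragments(fragments)
print(list(retained))
```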
- remove_duplicate_slices_pe()¶
Removes PCR duplicates from non-flashed (PE) fragments (Common).
Sequence quality is often lower at the 3’ end of reads leading to variance in mapping coordinates. PCR duplicates are removed by checking that the fragment start and end are not duplicated in the dataframe.
- remove_excluded_slices()¶
Removes any slices in the exclusion region (default 1kb) (V. Common)
- remove_blacklisted_slices()¶
Removes slices marked as being within blacklisted regions
- class ccanalyser.tools.filter.CCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶
Bases:
ccanalyser.tools.filter.SliceFilter
Perform Capture-C slice filtering (inplace) and reporter identification.
SliceFilter tuned specifically for Capture-C data. This class has additional methods to remove common artifacts in Capture-C data i.e. multi-capture fragments, non-reporter fragments, multi-capture reporters. The default filter order is as follows:
remove_unmapped_slices
remove_orphan_slices
remove_multi_capture_fragments
remove_excluded_slices
remove_blacklisted_slices
remove_non_reporter_fragments
remove_multicapture_reporters
remove_slices_without_re_frag_assigned
remove_duplicate_re_frags
remove_duplicate_slices
remove_duplicate_slices_pe
remove_non_reporter_fragments
See the individual methods for further details.
- slices¶
Annotated slices dataframe.
- Type
pd.DataFrame
- fragments¶
Slices dataframe aggregated by parental read.
- Type
pd.DataFrame
- reporters¶
Slices identified as reporters.
- Type
pd.DataFrame
- filter_stages¶
Dictionary containing stages and a list of class methods (str) required to get to this stage.
- Type
dict
- slice_stats¶
Provides slice level statistics.
- Type
pd.DataFrame
- read_stats¶
Provides statistics of slice filtering at the parental read level.
- Type
pd.DataFrame
- filter_stats¶
Provides statistics of read filtering.
- Type
pd.DataFrame
- property fragments: pandas.core.frame.DataFrame¶
Summarises slices at the fragment level.
Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.
- Returns
Slices aggregated by parental read name.
- Return type
pd.DataFrame
- property slice_stats¶
Statistics at the slice level.
- Returns
Statistics per slice.
- Return type
pd.DataFrame
- property frag_stats: pandas.core.frame.DataFrame¶
Statistics aggregated at the fragment level.
As this involves slice aggregation it can be rather slow for large datasets. It is recommended to use this property only when it is required.
- Returns
Fragment level statistics
- Return type
pd.DataFrame
- property reporters: pandas.core.frame.DataFrame¶
Extracts reporter slices from slices dataframe i.e. non-capture slices
- Returns
All non-capture slices
- Return type
pd.DataFrame
- property captures: pandas.core.frame.DataFrame¶
Extracts capture slices from slices dataframe
i.e. slices that do not have a null capture name
- Returns
Capture slices
- Return type
pd.DataFrame
- property capture_site_stats: pandas.core.series.Series¶
Extracts the number of unique capture sites.
- property merged_captures_and_reporters: pandas.core.frame.DataFrame¶
Merges captures and reporters sharing the same parental id.
Capture slices and reporter slices with the same parental read id are merged together. The prefixes ‘capture’ and ‘reporter’ are used to identify slices marked as either captures or reporters.
- Returns
Merged capture and reporter slices
- Return type
pd.DataFrame
- property cis_or_trans_stats: pandas.core.frame.DataFrame¶
Extracts reporter cis/trans statistics from slices.
- Returns
Reporter cis/trans statistics
- Return type
pd.DataFrame
- remove_non_reporter_fragments()¶
Removes the fragment if it has no reporter slices present (Common)
- remove_multi_capture_fragments()¶
Removes double capture fragments.
All slices (i.e. the entire fragment) are removed if more than one capture probe is present i.e. a double capture (V. Common)
- remove_multicapture_reporters(n_adjacent: int = 1)¶
Deals with an odd situation in which a reporter spanning two adjacent capture sites is not removed.
Example
--Capture 1--/--Capture 2-- --REP--
In this case the “reporter” slice is not considered either a capture or exclusion.
These cases are dealt with by explicitly removing reporters on restriction fragments adjacent to capture sites.
- Parameters
n_adjacent – Number of adjacent restriction fragments to remove
- class ccanalyser.tools.filter.TriCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶
Bases:
ccanalyser.tools.filter.CCSliceFilter
Perform Tri-C slice filtering (inplace) and reporter identification.
SliceFilter tuned specifically for Tri-C data. Whilst the vast majority of filters are inherited from CCSliceFilter, this class has additional methods for Tri-C analysis i.e. remove_slices_with_one_reporter. The default filtering order is:
remove_unmapped_slices
remove_slices_without_re_frag_assigned
remove_orphan_slices
remove_multi_capture_fragments
remove_blacklisted_slices
remove_non_reporter_fragments
remove_multicapture_reporters
remove_duplicate_re_frags
remove_duplicate_slices
remove_duplicate_slices_pe
remove_non_reporter_fragments
remove_slices_with_one_reporter
See the individual methods for further details.
- slices¶
Annotated slices dataframe.
- Type
pd.DataFrame
- fragments¶
Slices dataframe aggregated by parental read.
- Type
pd.DataFrame
- reporters¶
Slices identified as reporters.
- Type
pd.DataFrame
- filter_stages¶
Dictionary containing stages and a list of class methods (str) required to get to this stage.
- Type
dict
- slice_stats¶
Provides slice level statistics.
- Type
pd.DataFrame
- read_stats¶
Provides statistics of slice filtering at the parental read level.
- Type
pd.DataFrame
- filter_stats¶
Provides statistics of read filtering.
- Type
pd.DataFrame
- remove_slices_with_one_reporter()¶
Removes fragments if they do not contain at least two reporters.
- class ccanalyser.tools.filter.TiledCSliceFilter(slices, filter_stages=None, **sample_kwargs)¶
Bases:
ccanalyser.tools.filter.SliceFilter
Perform Tiled-C slice filtering (inplace) and reporter identification.
SliceFilter tuned specifically for Tiled-C data. This class has additional methods to remove common artifacts in Tiled-C data i.e. non-capture fragments, multi-capture (with different tiled regions) fragments. A reporter is defined differently in a Tiled-C analysis as a reporter slice can also be a capture slice.
The default filter order is as follows:
remove_unmapped_slices
remove_orphan_slices
remove_blacklisted_slices
remove_non_capture_fragments
remove_dual_capture_fragments
remove_slices_without_re_frag_assigned
remove_duplicate_re_frags
remove_duplicate_slices
remove_duplicate_slices_pe
remove_orphan_slices
See the individual methods for further details.
- slices¶
Annotated slices dataframe.
- Type
pd.DataFrame
- fragments¶
Slices dataframe aggregated by parental read.
- Type
pd.DataFrame
- reporters¶
Slices identified as reporters.
- Type
pd.DataFrame
- filter_stages¶
Dictionary containing stages and a list of class methods (str) required to get to this stage.
- Type
dict
- slice_stats¶
Provides slice level statistics.
- Type
pd.DataFrame
- read_stats¶
Provides statistics of slice filtering at the parental read level.
- Type
pd.DataFrame
- filter_stats¶
Provides statistics of read filtering.
- Type
pd.DataFrame
- property fragments: pandas.core.frame.DataFrame¶
Summarises slices at the fragment level.
Uses pandas groupby to aggregate slices by their parental read name (shared by all slices from the same fragment). Also determines the number of reporter slices for each fragment.
- Returns
Slices aggregated by parental read name.
- Return type
pd.DataFrame
- property slice_stats¶
Statistics at the slice level.
- Returns
Statistics per slice.
- Return type
pd.DataFrame
- property cis_or_trans_stats: pandas.core.frame.DataFrame¶
Extracts reporter cis/trans statistics from slices.
Unlike Capture-C/Tri-C, reporter slices can also be capture slices, as all slices within the capture region are considered reporters. To extract cis/trans statistics, one capture slice in each fragment is considered to be the “primary capture”; this enables merging of the “primary capture” with the other reporters both inside and outside of the tiled region.
- Returns
Reporter cis/trans statistics
- Return type
pd.DataFrame
- remove_slices_outside_capture()¶
Removes slices outside of capture region(s)
- remove_non_capture_fragments()¶
Removes fragments without a capture assigned
- remove_dual_capture_fragments()¶
Removes a fragment with multiple different capture sites.
Modified for TiledC filtering as the fragment dataframe is generated slightly differently.
ccanalyser.tools.io module¶
- class ccanalyser.tools.io.FastqReaderProcess(input_files: Union[str, list], outq: multiprocessing.context.BaseContext.Queue, read_buffer: int = 100000, read_counter: Optional[multiprocessing.managers.BaseManager.register.<locals>.temp] = None, n_subprocesses: int = 1, statq: Optional[multiprocessing.context.BaseContext.Queue] = None)¶
Bases:
multiprocessing.context.Process
Reads fastq file(s) in chunks and places them on a queue.
- input_files¶
Input fastq files.
- outq¶
Output queue for chunked reads/read pairs.
- statq¶
(Not currently used) Queue for read statistics if required.
- read_buffer¶
Number of reads to process before placing them on outq
- read_counter¶
(Not currently used) Can be used to sync between multiple readers.
- n_subprocesses¶
Number of processes running concurrently. Used to make sure enough termination signals are used.
- run()¶
Performs reading and chunking of fastq file(s).
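The buffering and termination behaviour described by the read_buffer and n_subprocesses attributes can be sketched with the standard library. This version reads from an in-memory file and uses queue.Queue in place of a multiprocessing queue; the "END" sentinel and the function names are assumptions, not the values used by FastqReaderProcess.

```python
import io
import queue
from itertools import islice


def read_fastq(handle):
    """Yield 4-line fastq records as tuples of stripped lines."""
    while True:
        record = tuple(line.rstrip("\n") for line in islice(handle, 4))
        if len(record) < 4:  # end of file (or a truncated final record)
            return
        yield record


def chunk_reads(handle, outq, read_buffer=100000, n_subprocesses=1):
    """Place lists of up to read_buffer records on outq, then one sentinel per worker."""
    buffer = []
    for record in read_fastq(handle):
        buffer.append(record)
        if len(buffer) == read_buffer:
            outq.put(buffer)
            buffer = []
    if buffer:  # flush the final partial chunk
        outq.put(buffer)
    for _ in range(n_subprocesses):
        outq.put("END")  # hypothetical termination signal, one per consumer


fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
)
outq = queue.Queue()
chunk_reads(fastq, outq, read_buffer=2, n_subprocesses=2)

items = []
while not outq.empty():
    items.append(outq.get())
print([len(item) if isinstance(item, list) else item for item in items])
```

Emitting one sentinel per downstream process is what lets every consumer see a termination signal, since each sentinel is consumed by exactly one worker.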
- class ccanalyser.tools.io.FastqReadFormatterProcess(inq: multiprocessing.context.BaseContext.SimpleQueue, outq: multiprocessing.context.BaseContext.SimpleQueue, formatting: Optional[list] = None)¶
Bases:
multiprocessing.context.Process
- run()¶
Method to be run in sub-process; can be overridden in sub-class
- class ccanalyser.tools.io.FastqWriterSplitterProcess(inq: multiprocessing.context.BaseContext.Queue, output_prefix: Union[str, list], paired_output: bool = False, gzip=False, compression_level: int = 3, compression_threads: int = 8, n_subprocesses: int = 1, n_workers_terminated: int = 0, n_files_written: int = 0)¶
Bases:
multiprocessing.context.Process
- run()¶
Method to be run in sub-process; can be overridden in sub-class
- class ccanalyser.tools.io.FastqWriterProcess(inq: multiprocessing.context.BaseContext.Queue, output: Union[str, list], compression_level: int = 5, n_subprocesses: int = 1)¶
Bases:
multiprocessing.context.Process
- run()¶
Method to be run in sub-process; can be overridden in sub-class
- ccanalyser.tools.io.parse_alignment(aln)¶
Parses reads from a bam file into a list.
- Extracts:
- read name
- parent reads
- flashed status
- slice number
- mapped status
- multimapping status
- chromosome number (e.g. chr10)
- start (e.g. 1000)
- end (e.g. 2000)
- coords (e.g. chr10:1000-2000)
- Parameters
aln – pysam.AlignmentFile.
- Returns
Containing the attributes extracted.
- Return type
list
- ccanalyser.tools.io.parse_bam(bam)¶
Uses the parse_alignment function to convert a bam file to a dataframe.
- Extracts:
- ‘slice_name’
- ‘parent_read’
- ‘pe’
- ‘slice’
- ‘mapped’
- ‘multimapped’
- ‘chrom’
- ‘start’
- ‘end’
- ‘coordinates’
- Parameters
bam – File name of bam file to process.
- Returns
DataFrame with the columns listed above.
- Return type
pd.DataFrame
ccanalyser.tools.pileup module¶
- class ccanalyser.tools.pileup.CoolerBedGraph(cooler_fn: str, sparse: bool = True, only_cis: bool = False)¶
Bases:
object
Generates a bedgraph file from a cooler file created by interactions-store.
- cooler¶
Cooler file to use for bedgraph production
- Type
cooler.Cooler
- capture_name¶
Name of capture probe being processed.
- Type
str
- sparse¶
Only output bins with interactions.
- Type
bool
- only_cis¶
Only output cis interactions.
- Type
bool
- property bedgraph: pandas.core.frame.DataFrame¶
Returns: pd.DataFrame: DataFrame in bedgraph format.
- property reporters: pandas.core.frame.DataFrame¶
Interactions with capture fragments/bins.
- Returns
DataFrame containing just bins interacting with the capture probe.
- Return type
pd.DataFrame
- normalise_bedgraph(scale_factor=1000000.0) → pandas.core.frame.DataFrame¶
Normalises the bedgraph.
Uses the number of cis interactions to normalise the bedgraph counts.
- Parameters
scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.
- Returns
Normalised bedgraph formatted DataFrame
- Return type
pd.DataFrame
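A plausible reading of this normalisation is counts divided by the number of cis interactions and multiplied by the scale factor. The sketch below shows that arithmetic on plain tuples; the exact formula used by normalise_bedgraph is an assumption, and the function name is hypothetical.

```python
def normalise_counts(bedgraph, n_cis_interactions, scale_factor=1e6):
    """Scale each count to interactions per scale_factor cis interactions."""
    return [
        (chrom, start, end, count / n_cis_interactions * scale_factor)
        for chrom, start, end, count in bedgraph
    ]


# (chrom, start, end, count) rows; 20 cis interactions in total (made-up values).
bedgraph = [("chr1", 0, 1000, 5), ("chr1", 1000, 2000, 15)]
normalised = normalise_counts(bedgraph, n_cis_interactions=20)
print(normalised)
```

With the default scale factor of 1e6 the values read as counts per million cis interactions, which makes samples of different sequencing depth comparable.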
- to_file(fn: os.PathLike, normalise: bool = False, **normalise_kwargs)¶
Outputs the bedgraph dataframe to a file.
If normalise is True, will also normalise the counts by the number of cis interactions.
- Parameters
fn (os.PathLike) – Output file name.
normalise (bool, optional) – Normalise the bedgraph before writing to file. Defaults to False.
- class ccanalyser.tools.pileup.CoolerBedGraphWindowed(cooler_fn: str, binsize: int = 5000.0, binner: Optional[ccanalyser.tools.storage.CoolerBinner] = None, sparse=True)¶
Bases:
ccanalyser.tools.pileup.CoolerBedGraph
- normalise_bedgraph(scale_factor=1000000.0)¶
Normalises the bedgraph.
Uses the number of cis interactions to normalise the bedgraph counts.
- Parameters
scale_factor (int, optional) – Scaling factor for normalisation. Defaults to 1e6.
- Returns
Normalised bedgraph formatted DataFrame
- Return type
pd.DataFrame
- property reporters_binned¶
ccanalyser.tools.plotting module¶
ccanalyser.tools.statistics module¶
- class ccanalyser.tools.statistics.DeduplicationStatistics(sample: str, read_type: str = 'pe', reads_total: int = 0, reads_unique: int = 0)¶
Bases:
object
- property df¶
- class ccanalyser.tools.statistics.DigestionStats(read_type, read_number, unfiltered, filtered)¶
Bases:
tuple
- filtered¶
Alias for field number 3
- read_number¶
Alias for field number 1
- read_type¶
Alias for field number 0
- unfiltered¶
Alias for field number 2
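The "Alias for field number" entries are what Sphinx generates for a named tuple, so each field is reachable both by name and by position. A minimal equivalent definition, with made-up values, would be:

```python
from collections import namedtuple

# Same field order as the documented class: indexes 0 to 3.
DigestionStats = namedtuple(
    "DigestionStats", ["read_type", "read_number", "unfiltered", "filtered"]
)

# Example values are invented for illustration only.
stats = DigestionStats(read_type="flashed", read_number=1, unfiltered=10, filtered=7)
print(stats.filtered, stats[3])
```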
- ccanalyser.tools.statistics.collate_histogram_data(fnames)¶
- ccanalyser.tools.statistics.collate_read_data(fnames)¶
- ccanalyser.tools.statistics.collate_slice_data(fnames)¶
- ccanalyser.tools.statistics.collate_cis_trans_data(fnames)¶
- ccanalyser.tools.statistics.extract_trimming_stats(fn)¶
ccanalyser.tools.storage module¶
- ccanalyser.tools.storage.get_capture_coords(viewpoint_file: str, viewpoint_name: str)¶
- ccanalyser.tools.storage.get_capture_bins(bins, viewpoint_chrom, viewpoint_start, viewpoint_end)¶
- ccanalyser.tools.storage.create_cooler_cc(output_prefix: str, bins: pandas.core.frame.DataFrame, pixels: pandas.core.frame.DataFrame, capture_name: str, capture_viewpoints: os.PathLike, capture_bins: Optional[Union[int, list]] = None, suffix=None, **cooler_kwargs) → os.PathLike¶
Creates a cooler hdf5 file or cooler formatted group within a hdf5 file.
- Parameters
output_prefix (str) – Output path for hdf5 file. If this already exists, will append a new group to the file.
bins (pd.DataFrame) – DataFrame containing the genomic coordinates of all bins in the pixels table.
pixels (pd.DataFrame) – DataFrame with columns: bin1_id, bin2_id, count.
capture_name (str) – Name of capture probe to store.
capture_viewpoints (os.PathLike) – Path to capture viewpoints used for the analysis.
capture_bins (Union[int, list], optional) – Bins containing capture viewpoints. Can be determined from viewpoints if not supplied. Defaults to None.
suffix (str, optional) – Suffix to append before the .hdf5 file extension. Defaults to None.
- Raises
ValueError – Capture name must exactly match the name of a supplied capture viewpoint.
- Returns
Path of cooler hdf5 file.
- Return type
os.PathLike
- class ccanalyser.tools.storage.GenomicBinner(chromsizes: Union[os.PathLike, pandas.core.frame.DataFrame, pandas.core.series.Series], fragments: pandas.core.frame.DataFrame, binsize: int = 5000, n_cores: int = 8, method: Literal[midpoint, overlap] = 'midpoint', min_overlap: float = 0.2)¶
Bases:
object
Provides a conversion table for converting two sets of bins.
- chromsizes¶
Series indexed by chromosome name containing chromosome sizes in bp
- Type
pd.Series
- fragments¶
DataFrame containing bins to convert to equal genomic intervals
- Type
pd.DataFrame
- binsize¶
Genomic bin size
- Type
int
- min_overlap¶
Minimum degree of intersection to define an overlap.
- Type
float
- n_cores¶
Number of cores to use for bin intersection.
- Type
int
- property bins: pandas.core.frame.DataFrame¶
Equal genomic bins.
- Returns
DataFrame in bed format.
- Return type
pd.DataFrame
- property bin_conversion_table: pandas.core.frame.DataFrame¶
Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.
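To make the two properties concrete, the sketch below builds equal genomic bins from chromosome sizes and assigns each fragment to the bin containing its midpoint, which is one way to read the 'midpoint' method. It uses plain tuples instead of DataFrames, omits the 'overlap' method and min_overlap, and all names are hypothetical.

```python
def make_bins(chromsizes, binsize=5000):
    """Equal genomic bins in bed format; the last bin per chromosome may be shorter."""
    bins = []
    for chrom, size in chromsizes.items():
        for start in range(0, size, binsize):
            bins.append((chrom, start, min(start + binsize, size)))
    return bins


def midpoint_conversion(fragments, binsize=5000):
    """Map each (chrom, start, end, name) fragment to the bin holding its midpoint."""
    table = []
    for chrom, start, end, name in fragments:
        midpoint = (start + end) // 2
        bin_start = midpoint // binsize * binsize
        table.append((name, chrom, bin_start))
    return table


bins = make_bins({"chr1": 12000}, binsize=5000)
fragments = [("chr1", 4000, 5500, "frag_0"), ("chr1", 5500, 11000, "frag_1")]
table = midpoint_conversion(fragments, binsize=5000)
print(bins)
print(table)
```

The midpoint rule assigns every fragment to exactly one bin, so counts are never split or double-counted when fragments straddle a bin boundary.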
- class ccanalyser.tools.storage.CoolerBinner(cooler_fn: os.PathLike, binsize: Optional[int] = None, n_cores: int = 8, binner: Optional[ccanalyser.tools.storage.GenomicBinner] = None)¶
Bases:
object
Bins a cooler file into equal genomic intervals.
- cooler¶
Cooler instance to bin.
- Type
cooler.Cooler
- binner¶
Binner class to generate bin conversion tables
- Type
ccanalyser.tools.storage.GenomicBinner
- binsize¶
Genomic bin size
- Type
int
- scale_factor¶
Scaling factor for normalising interaction counts.
- Type
int
- n_cis_interactions¶
Number of cis interactions with the capture bins.
- Type
int
- n_cores¶
Number of cores to use for binning.
- Type
int
- property bins: pandas.core.frame.DataFrame¶
Returns: pd.DataFrame: Even genomic bins of a specified binsize.
- property bin_conversion_table: pandas.core.frame.DataFrame¶
Returns: pd.DataFrame: Conversion table containing coordinates and ids of intersecting bins.
- property capture_bins¶
Returns: pd.DataFrame: Capture bins converted to the new even genomic bin format.
- property pixel_conversion_table¶
Returns: pd.DataFrame: Conversion table to convert old binning scheme to the new.
- property pixels¶
Returns: pd.DataFrame: Pixels (interaction counts) converted to the new binning scheme.
- normalise_pixels(n_fragment_correction: bool = True, n_interaction_correction: bool = True, scale_factor: int = 1000000.0)¶
Normalises pixels (interactions).
Normalises pixels according to the number of restriction fragments per bin and the number of cis interactions. If both normalisation options are selected, will also provide a dual normalised column.
- Parameters
n_fragment_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of restriction fragments per bin. Defaults to True.
n_interaction_correction (bool, optional) – Updates the pixels DataFrame with counts corrected for the number of cis interactions. Defaults to True.
scale_factor (int, optional) – Scaling factor for n_interaction_correction. Defaults to 1e6.
- to_cooler(store, normalise=False, **normalise_options)¶
- ccanalyser.tools.storage.link_bins(clr: os.PathLike)¶
Reduces cooler storage space by linking “bins” table.
All of the cooler “bins” tables containing the genomic coordinates of each bin are identical for all cooler files of the same resolution. As cooler.create_cooler generates a new bins table for each cooler, this leads to a high degree of duplication.
This function hard links the bins tables for a given resolution to reduce the degree of duplication.
- Parameters
clr (os.PathLike) – Path to cooler hdf5 produced by the merge command.