compact.main

This module contains the high-level workflow of the project and runs the major steps of the CompaCt analysis.

The main functions:
  • between_scoring: rbo scoring between all provided correlation datasets

  • score_comparison: rbo scoring and top hit selection between two correlation datasets

  • mcl_clustering: perform clustering on computed rbo scores

  • process_mcl_result: process raw MCL output to get annotated CompaCt result clusters and subclusters

  • save_results: save CompaCt results to an output folder of choice

  • main: complete rbo and clustering analysis from interaction matrices

compact.main.between_scoring(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)

computes top hit scores between all interaction matrices

first performs rbo scoring for all pairs between samples, then determines pairs that are reciprocal top hits

Args:
nested_tags (dict): dict with nested tag structure for samples

keys: collection-level tags; values: sample-level tags

int_matrices (dict): contains interaction matrices

keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores

mappings (dict): id mappings between collections

keys: tuple with (query,subject) collection-level tags; values: dicts with id mappings from query to subject

p (float, optional): Defaults to 0.90.

“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness

min_search_weight (float, optional): Defaults to 0.99.

determines search depth of ranked lists. lists are searched to the depth at which the computed fraction of the total possible score reaches min_search_weight

th_criterium (str, optional): Defaults to ‘percent’.

criterion used to determine top hits. “percent” counts a pair of proteins as a top hit if both are in the top n percent of each other’s ranked interactor lists

th_percent (int, optional): Defaults to 1.

top percent to consider when using “percent” th_criterium

processes (int, optional): Defaults to 1.

number of cpu cores to use

chunksize (int, optional): Defaults to 1000.

parallelization parameter

Returns:
dict: structure: {(left_tag)(right_tag):top_hits}

top hits (pd series): top hits between two samples
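
The roles of p and min_search_weight can be illustrated with a minimal sketch of rank-biased overlap. It uses the standard truncated form of RBO, in which rank d carries the weight (1 - p) * p**(d - 1). Whether CompaCt uses exactly this variant, and the rule used here to turn min_search_weight into a search depth, are assumptions. The function names truncated_rbo and depth_for_weight are hypothetical and not part of compact.main.

```python
def truncated_rbo(left, right, p=0.9, depth=None):
    """Rank-biased overlap of two ranked lists, truncated at `depth` ranks."""
    if depth is None:
        depth = max(len(left), len(right))
    seen_left, seen_right = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        if d <= len(left):
            seen_left.add(left[d - 1])
        if d <= len(right):
            seen_right.add(right[d - 1])
        # Fraction of shared items among the two top-d prefixes.
        agreement = len(seen_left & seen_right) / d
        score += (1 - p) * p ** (d - 1) * agreement
    return score


def depth_for_weight(p=0.9, min_search_weight=0.99):
    """Smallest depth whose rank weights sum to at least min_search_weight.

    Assumption: the weights are the geometric terms (1 - p) * p**(d - 1),
    which sum to 1 - p**depth over the first `depth` ranks.
    """
    depth = 0
    covered = 0.0
    while covered < min_search_weight:
        depth += 1
        covered = 1 - p ** depth
    return depth


ranked = ["P1", "P2", "P3", "P4", "P5"]
top_agree = ["P1", "P2", "X3", "X4", "X5"]     # shares only the two top ranks
bottom_agree = ["X1", "X2", "X3", "P4", "P5"]  # shares only the two bottom ranks

print(truncated_rbo(ranked, top_agree), truncated_rbo(ranked, bottom_agree))
print(depth_for_weight(0.9, 0.99), depth_for_weight(0.5, 0.99))
```

Under these assumptions, agreement near the top of the lists contributes more than agreement further down, and a lower p concentrates the weight in fewer ranks, so a shallower search depth is enough to reach the same min_search_weight.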

compact.main.score_comparison(left_scores, right_scores, mapping=False, p=0.9, search_depth=None, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)

determines RBO scores and reciprocal top hits between a pair of interaction matrices

Args:
[left|right]_scores (df): symmetric interaction matrix

values: within-sample pairwise interaction scores

mapping (dict, optional): Defaults to False.

id mappings between collections

keys: tuple with (query,subject) collection-level tags; values: dicts with id mappings from query to subject

p (float, optional): Defaults to 0.90.

“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness

search_depth (int, optional): Defaults to None.

number of ranks to consider when computing RBO scores. if None, considers complete ranked lists

th_criterium (str, optional): Defaults to ‘percent’.

criterion used to determine top hits. “percent” counts a pair of proteins as a top hit if both are in the top n percent of each other’s ranked interactor lists

th_percent (int, optional): Defaults to 1.

top percent to consider when using “percent” th_criterium

processes (int, optional): Defaults to 1.

number of cpu cores to use

chunksize (int, optional): Defaults to 1000.

parallelization parameter

Returns:
pd.Series: reciprocal top hits in pd series format

2-level multiindex with id pair, values are scores
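
The “percent” criterion described above can be sketched in plain Python. This is an illustration only: the helper names, the use of dicts of ranked lists instead of pd series, and the rounding rule (at least one entry, rounded up) are assumptions, not CompaCt's implementation.

```python
import math


def top_n_percent(ranked, percent):
    """Return the top `percent` % of a ranked list as a set (at least one entry)."""
    n = max(1, math.ceil(len(ranked) * percent / 100))
    return set(ranked[:n])


def reciprocal_top_hits(left_to_right, right_to_left, th_percent=1):
    """Pairs (a, b) where b is in a's top list and a is in b's top list.

    left_to_right: dict mapping each left id to right ids, ranked best first.
    right_to_left: dict mapping each right id to left ids, ranked best first.
    """
    hits = set()
    for a, ranked_b in left_to_right.items():
        for b in top_n_percent(ranked_b, th_percent):
            if a in top_n_percent(right_to_left.get(b, []), th_percent):
                hits.add((a, b))
    return hits


# Hypothetical rankings between two samples with three proteins each.
left_to_right = {
    "A1": ["B1", "B2", "B3"],
    "A2": ["B1", "B3", "B2"],
    "A3": ["B3", "B1", "B2"],
}
right_to_left = {
    "B1": ["A1", "A2", "A3"],
    "B2": ["A1", "A3", "A2"],
    "B3": ["A3", "A2", "A1"],
}

strict = reciprocal_top_hits(left_to_right, right_to_left, th_percent=1)
relaxed = reciprocal_top_hits(left_to_right, right_to_left, th_percent=50)
print(sorted(strict))
print(sorted(relaxed))
```

With th_percent=1 only the single best entry of each short list is considered, so A2 is dropped because its best hit B1 prefers A1. Raising th_percent widens both top lists and admits more pairs.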

compact.main.mcl_clustering(within_top_hits, between_top_hits, out_folder, include_within=False, wbratio=1, mcl_inflation=2, processes=1)

perform mcl on network from a set of samples

Args:
[within|between]_top_hits (dict):

dict with within/between top hits (pd series) per profile/comparison

out_folder (string):

filepath of output directory

include_within (bool, optional): Defaults to False.

whether to include within profile interaction scores in combined network used as input for MCL clustering

wbratio (int, optional): Defaults to 1.

ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average

mcl_inflation (int, optional): Defaults to 2.

inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.

processes (int, optional): Defaults to 1.

number of cpu cores to use

Returns:
[mcl|network]_outfn (string):

filepath of saved cluster and network results
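
The effect of wbratio and mcl_inflation can be sketched as follows. The rescaling rule is one plausible reading of “normalised based on their average” and is an assumption; the inflation step is the standard MCL operator (raise each column entry to the inflation power, then renormalise the column). The function names rescale_within and inflate are hypothetical.

```python
from statistics import mean


def rescale_within(within_scores, between_scores, wbratio=1):
    """Scale within scores so that mean(within) / mean(between) == wbratio."""
    factor = wbratio * mean(between_scores) / mean(within_scores)
    return [score * factor for score in within_scores]


def inflate(column, inflation=2):
    """MCL inflation of one stochastic column: power, then renormalise."""
    powered = [value ** inflation for value in column]
    total = sum(powered)
    return [value / total for value in powered]


within = [0.2, 0.4, 0.6]      # e.g. interaction scores, mean 0.4
between = [0.05, 0.10, 0.15]  # e.g. rbo scores, mean 0.1
scaled = rescale_within(within, between, wbratio=2)
print(mean(scaled) / mean(between))

column = [0.5, 0.3, 0.2]
print(inflate(column, 2))
print(inflate(column, 4))
```

Higher inflation values push more of each column's weight onto its strongest entry, which is why they tend to produce finer-grained clusters.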

compact.main.process_mcl_result(mcl_outfn, nested_tags, network_outfn, mappings, report_threshold=0.5, filter_clusters=True, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25)

processes and annotates raw mcl results for interpretation

separates clusters per collection and computes cluster metrics

Args:
mcl_outfn (string):

filepath of mcl result

nested_tags (dict of dicts):

nested tag structure for profiles. keys: collection-level tags; values: sample-level tags

network_outfn (string):

filepath of combined network

mappings (dict):

contains id mappings between collections. keys: tuple with (query,subject) collection-level tags; values: dicts with id mappings from query to subject

report_threshold (float, optional): Defaults to 0.5.

when reporting mcl results of multi-sample collections, a protein is counted as a cluster member if it is a member of the cluster in a fraction of the samples >= report_threshold

filter_clusters (bool, optional): Defaults to True.

whether to filter out clusters with fewer than 2 matches or proteins over report_threshold

perf_cluster_annotation (bool, optional): Defaults to False.

whether to perform automatic annotation of clusters using the reference groups

reference_groups (dict, optional): Defaults to None.

only used if perf_cluster_annotation == True. contains the names (keys) and members (values) of the reference groups

reference_tag (string, optional): Defaults to None.

only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference

annot_fraction_threshold (float, optional): Defaults to 0.5.

minimum fraction of a reference group that should be present in the cluster for the cluster to be assigned to that group

annot_filter_mem_threshold (float, optional): Defaults to 0.25.

cluster members scoring below threshold will be ignored in determining overlap of cluster with reference. if value is None no filtering is applied

Returns:
dict: processed mcl clustering results, containing:
‘clust_info’: pd dataframe

table with information of each identified cluster

‘clusts’: dict

clusters with all members together

‘clusts_split’: dict

cluster members, split per collection, samples aggregated

‘edges’: dataframe

network edges part of one of the clusters

‘nodes’: dataframe

network nodes part of one of the clusters
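
The two thresholds report_threshold and annot_fraction_threshold can be illustrated with a small sketch. The data layout (a dict of per-sample member sets) and the helper names are assumptions made for this example; the actual function works on the raw MCL output and the combined network.

```python
def reported_members(sample_clusters, report_threshold=0.5):
    """Proteins present in the cluster in a fraction of samples >= report_threshold.

    sample_clusters: dict mapping sample tag -> set of protein ids that are
    in the cluster for that sample (hypothetical layout).
    """
    n_samples = len(sample_clusters)
    counts = {}
    for members in sample_clusters.values():
        for protein in members:
            counts[protein] = counts.get(protein, 0) + 1
    return {p for p, c in counts.items() if c / n_samples >= report_threshold}


def matching_references(cluster_members, reference_groups, annot_fraction_threshold=0.5):
    """Names of reference groups with enough of their members in the cluster."""
    matches = []
    for name, ref_members in reference_groups.items():
        ref_members = set(ref_members)
        if not ref_members:
            continue  # an empty reference group cannot be matched
        fraction = len(ref_members & cluster_members) / len(ref_members)
        if fraction >= annot_fraction_threshold:
            matches.append(name)
    return matches


# Hypothetical cluster membership in three samples of one collection.
sample_clusters = {
    "rep1": {"P1", "P2", "P3"},
    "rep2": {"P1", "P2"},
    "rep3": {"P1", "P4"},
}
reference_groups = {"complex_I": ["P1", "P2", "P9"], "complex_II": ["P7", "P8"]}

members = reported_members(sample_clusters, report_threshold=0.5)
print(sorted(members))
print(matching_references(members, reference_groups))
```

Here P3 and P4 each occur in only one of three samples and fall below the reporting threshold, while two of the three complex_I members remain, which is enough for the annotation at a threshold of 0.5.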

compact.main.save_results(mcl_res, out_folder, mappings)

process and write human-readable results to file

Args:
mcl_res (dict):

dictionary with processed clustering results

out_folder (string):

path of output directory

mappings (dict):

contains id mappings between collections. keys: tuple with (query,subject) collection-level tags; values: dicts with id mappings from query to subject

compact.main.main(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, include_within=False, wbratio=1, mcl_inflation=2, output_location='.', job_name=None, report_threshold=0.5, filter_clusters=True, save_rthits=False, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25, processes=1, chunksize=1000)

complete rbo and clustering analysis from interaction matrices

Args:
nested_tags (dict): dict with nested tag structure for samples

keys: collection-level tags; values: sample-level tags

int_matrices (dict): contains interaction matrices

keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores

mappings (dict): id mappings between collections

keys: tuple with (query,subject) collection-level tags; values: dicts with id mappings from query to subject

p (float, optional): Defaults to 0.90.

“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness

min_search_weight (float, optional): Defaults to 0.99.

determines search depth of ranked lists. lists are searched to the depth at which the computed fraction of the total possible score reaches min_search_weight

th_criterium (str, optional): Defaults to ‘percent’.

criterion used to determine top hits. “percent” counts a pair of proteins as a top hit if both are in the top n percent of each other’s ranked interactor lists. “best” takes only the single best hit

th_percent (int, optional): Defaults to 1.

top percent to consider when using “percent” th_criterium

include_within (bool, optional): Defaults to False.

whether to include within sample interaction scores in combined network used as input for MCL clustering

wbratio (int, optional): Defaults to 1.

ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average

mcl_inflation (int, optional): Defaults to 2.

inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.

output_location (str, optional): Defaults to current working directory.

filepath of output dir to be created

job_name (str, optional): Defaults to None.

name of this job, used in output dir name. if None, the concatenated collection tags are used

save_rthits (bool, optional): Defaults to False.

whether to save reciprocal top hits to disk

report_threshold (float, optional): Defaults to 0.5.

when reporting mcl results of multi-sample collections, a protein is counted as a cluster member if it is a member of the cluster in a fraction of the samples >= report_threshold

filter_clusters (bool, optional): Defaults to True.

whether to filter out clusters with fewer than 2 matches or proteins over report_threshold

perf_cluster_annotation (bool, optional): Defaults to False.

whether to perform automatic annotation of clusters using the reference groups

reference_groups (dict, optional): Defaults to None.

only used if perf_cluster_annotation == True. contains the names (keys) and members (values) of the reference groups

reference_tag (string, optional): Defaults to None.

only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference

annot_fraction_threshold (float, optional): Defaults to 0.5.

minimum fraction of a reference group that should be present in the cluster for the cluster to be assigned to that group

annot_filter_mem_threshold (float, optional): Defaults to 0.25.

cluster members scoring below threshold will be ignored in determining overlap of cluster with reference. if value is None no filtering is applied

processes (int, optional): Defaults to 1.

number of cpu cores to use

chunksize (int, optional): Defaults to 1000.

parallelization parameter

Returns:
dict: mcl_res, containing:
‘clust_info’: pd dataframe

table with information of each identified cluster

‘clusts’: dict

clusters with all members together

‘clusts_split’: dict

cluster members, split per collection, samples aggregated

‘edges’: dataframe

network edges part of one of the clusters

‘nodes’: dataframe

network nodes part of one of the clusters
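
A call to main might look like the sketch below. The keyword arguments are taken from the signature documented above; the wrapper run_compact, the SETTINGS dict and the job name are hypothetical, and the import is deferred so that the snippet loads even when CompaCt is not installed. The inputs nested_tags, int_matrices and mappings must be prepared as described in Args.

```python
# Keyword arguments taken from the documented signature of compact.main.main.
SETTINGS = {
    "p": 0.9,
    "min_search_weight": 0.99,
    "th_criterium": "percent",
    "th_percent": 1,
    "mcl_inflation": 2,
    "output_location": ".",
    "job_name": "example_job",  # hypothetical job name
    "processes": 1,
}


def run_compact(nested_tags, int_matrices, mappings, **overrides):
    """Run the complete analysis with SETTINGS, optionally overriding some values."""
    # Deferred import: only needed (and only resolved) when the function is called.
    from compact.main import main

    kwargs = {**SETTINGS, **overrides}
    return main(nested_tags, int_matrices, mappings, **kwargs)
```

According to the Returns section, the resulting dict holds the entries ‘clust_info’, ‘clusts’, ‘clusts_split’, ‘edges’ and ‘nodes’.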