compact.main

This module contains the high-level workflow of this project and runs the major steps of the CompaCt analysis.

The main functions:

between_scoring: rbo scoring between all provided correlation datasets
score_comparison: rbo scoring and top hit selection between two correlation datasets
mcl_clustering: perform clustering on computed rbo scores
process_mcl_result: process raw MCL output to get annotated CompaCt result clusters and subclusters
save_results: save CompaCt results to an output folder of choice

compact.main.between_scoring(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)

computes top hit scores between all interaction matrices

first performs rbo scoring for all pairs between samples, then determines pairs that are reciprocal top hits

Args:

nested_tags (dict): dict with nested tag structure for samples. keys: collection-level tags; values: sample-level tags
int_matrices (dict): contains interaction matrices. keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores
mappings (dict): id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
p (float, optional): Defaults to 0.90.: “top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
min_search_weight (float, optional): Defaults to 0.99.: determines search depth of ranked lists. the search goes to the depth at which the computed fraction of the total possible score equals min_search_weight
th_criterium (str, optional): Defaults to ‘percent’.: criterion to determine top hits. “percent” counts a pair of proteins as a hit if both are in the top n percent of each other’s ranked interactor lists
th_percent (int, optional): Defaults to 1.: top percent to consider when using “percent” th_criterium
processes (int, optional): Defaults to 1.: number of cpu cores to use
chunksize (int, optional): Defaults to 1000.: parallelization parameter

Returns:

dict: structure: {(left_tag)(right_tag): top_hits}. top_hits (pd series): top hits between two samples
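
How p and min_search_weight interact can be illustrated with a small sketch. It assumes a simplified truncated-RBO weighting in which rank d contributes (1 - p) * p**(d - 1), so that the first k ranks carry a fraction 1 - p**k of the total possible score. CompaCt's own depth computation may differ, and the helper name search_depth_for_weight is hypothetical.

```python
import math


def search_depth_for_weight(p: float, min_search_weight: float) -> int:
    """Smallest depth k whose cumulative weight 1 - p**k reaches min_search_weight.

    Assumption: geometric rank weights (1 - p) * p**(d - 1). This is an
    illustration only, not CompaCt's internal implementation.
    """
    if not (0 < p < 1) or not (0 < min_search_weight < 1):
        raise ValueError("p and min_search_weight must lie strictly between 0 and 1")
    # Solve 1 - p**k >= min_search_weight for k and round up to a whole rank.
    return math.ceil(math.log(1 - min_search_weight) / math.log(p))


# Lower p concentrates the weight at the top, so fewer ranks need to be searched.
for p in (0.5, 0.9):
    depth = search_depth_for_weight(p, 0.99)
    covered = 1 - p ** depth
    print(f"p={p}: depth={depth}, fraction of total weight covered={covered:.4f}")
```

Under this assumption, the documented defaults (p=0.9, min_search_weight=0.99) correspond to a search depth of 44 ranks.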

compact.main.score_comparison(left_scores, right_scores, mapping=False, p=0.9, search_depth=None, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)

Determine RBO scores and reciprocal top hits between a pair of interaction matrices

Args:

[left|right]_scores (df): symmetric interaction matrix. values: within-sample pairwise interaction scores
mapping (dict, optional): Defaults to False.: id mapping between the two compared collections, from query ids to subject ids
p (float, optional): Defaults to 0.90.: “top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
search_depth (int, optional): Defaults to None.: number of ranks to consider when computing RBO scores. if None, considers complete ranked lists
th_criterium (str, optional): Defaults to ‘percent’.: criterion to determine top hits. “percent” counts a pair of proteins as a hit if both are in the top n percent of each other’s ranked interactor lists
th_percent (int, optional): Defaults to 1.: top percent to consider when using “percent” th_criterium
processes (int, optional): Defaults to 1.: number of cpu cores to use
chunksize (int, optional): Defaults to 1000.: parallelization parameter

Returns:

pd.Series: reciprocal top hits in pd series format: 2-level multiindex with id pair, values are scores
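
The two steps described for score_comparison, RBO scoring followed by reciprocal top hit selection, can be sketched in plain Python. This is a minimal illustration, not the CompaCt implementation: truncated_rbo and reciprocal_top_hits are hypothetical helpers, the RBO variant is the simple truncated form without extrapolation, and rounding the top n percent up to at least one partner is an assumption.

```python
import math


def truncated_rbo(left, right, p=0.9, depth=None):
    """Truncated rank-biased overlap of two ranked lists (no extrapolation)."""
    if depth is None:
        depth = max(len(left), len(right))
    seen_left, seen_right = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        if d <= len(left):
            seen_left.add(left[d - 1])
        if d <= len(right):
            seen_right.add(right[d - 1])
        # Agreement at depth d: shared items in both prefixes, divided by d.
        score += p ** (d - 1) * len(seen_left & seen_right) / d
    return (1 - p) * score


def reciprocal_top_hits(scores, th_percent=1):
    """Keep pairs that are in the top th_percent of each other's ranked partners."""
    by_left, by_right = {}, {}
    for (left_id, right_id), score in scores.items():
        by_left.setdefault(left_id, []).append((score, right_id))
        by_right.setdefault(right_id, []).append((score, left_id))

    def top_set(candidates):
        # Assumption: the top n percent always contains at least one partner.
        n_top = max(1, math.ceil(len(candidates) * th_percent / 100))
        return {partner for _, partner in sorted(candidates, reverse=True)[:n_top]}

    top_left = {k: top_set(v) for k, v in by_left.items()}
    top_right = {k: top_set(v) for k, v in by_right.items()}
    return {
        (l, r): s
        for (l, r), s in scores.items()
        if r in top_left[l] and l in top_right[r]
    }


# Agreement near the top of the lists counts more than agreement near the bottom.
ref = ["a", "b", "c", "d"]
top = truncated_rbo(ref, ["a", "b", "x", "y"])
bottom = truncated_rbo(ref, ["x", "y", "c", "d"])
print(top > bottom)

# Made-up between-sample scores for ids A1..A3 (left) and B1..B2 (right).
scores = {
    ("A1", "B1"): 0.8, ("A1", "B2"): 0.1,
    ("A2", "B1"): 0.3, ("A2", "B2"): 0.6,
    ("A3", "B1"): 0.5, ("A3", "B2"): 0.2,
}
hits = reciprocal_top_hits(scores, th_percent=1)
print(hits)
```

In this toy input, A3 ranks B1 first, but B1 ranks A1 first, so the pair (A3, B1) is not reciprocal and is dropped.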

compact.main.mcl_clustering(within_top_hits, between_top_hits, out_folder, include_within=False, wbratio=1, mcl_inflation=2, processes=1)

perform mcl on network from a set of samples

Args:

[within|between]_top_hits (dict): dict with within/between top hits (pd series) per profile/comparison
out_folder (string): filepath of output directory
include_within (bool, optional): Defaults to False.: whether to include within profile interaction scores in combined network used as input for MCL clustering
wbratio (int, optional): Defaults to 1.: ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average
mcl_inflation (int, optional): Defaults to 2.: inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.
processes (int, optional): Defaults to 1.: number of cpu cores to use

Returns:

[mcl|network]_outfn (string): filepaths of the saved cluster and network results
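
The wbratio description can be read as rescaling one score type so that the ratio of the within and between averages equals wbratio. The sketch below shows that reading on toy edge weights. It is an assumption about the normalisation, not the CompaCt code, and rescale_within is a hypothetical helper.

```python
from statistics import mean


def rescale_within(within, between, wbratio=1.0):
    """Scale within-sample weights so that mean(within) / mean(between) == wbratio.

    Assumption: only the within scores are rescaled; the between (rbo)
    scores are left untouched.
    """
    factor = wbratio * mean(between.values()) / mean(within.values())
    return {edge: weight * factor for edge, weight in within.items()}


# Made-up edge weights: within scores on a much larger scale than rbo scores.
within = {("P1", "P2"): 10.0, ("P2", "P3"): 30.0}
between = {("P1", "Q1"): 0.2, ("P2", "Q2"): 0.6}

scaled = rescale_within(within, between, wbratio=2.0)
ratio = mean(scaled.values()) / mean(between.values())
print(scaled, ratio)
```

After rescaling, both edge types live on a comparable scale, so neither dominates the combined network passed to MCL purely because of its units.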

compact.main.process_mcl_result(mcl_outfn, nested_tags, network_outfn, mappings, report_threshold=0.5, filter_clusters=True, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25)

processes and annotates raw mcl results for interpretation

separates clusters per collection and computes cluster metrics

Args:

mcl_outfn (string): filepath of mcl result
nested_tags (dict of dicts): nested tag structure for profiles. keys: collection-level tags; values: sample-level tags
network_outfn (string): filepath of combined network
mappings (dict): id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
report_threshold (float, optional): Defaults to 0.5.: when reporting mcl results of multi-sample collections, proteins are counted as cluster members if they are a member of the cluster in a fraction of the samples >= report_threshold
filter_clusters (bool, optional): Defaults to True.: whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
perf_cluster_annotation (bool, optional): Defaults to False.: whether to perform automatic annotation of clusters using the reference groups
reference_groups (dict, optional): Defaults to None.: only used if perf_cluster_annotation == True. contains the names(keys) and members(values) of the reference groups
reference_tag (string, optional): Defaults to None.: only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference
annot_fraction_threshold (float, optional): Defaults to 0.5.: minimum fraction of a reference group that should be present in the cluster for that group to be assigned as annotation
annot_filter_mem_threshold (float, optional): Defaults to 0.25.: cluster members scoring below threshold will be ignored in determining overlap of cluster with reference. if value is None no filtering is applied

Returns:

dict: processed mcl clustering results, containing:

‘clust_info’: pd dataframe: table with information of each identified cluster
‘clusts’: dict: clusters with all members together
‘clusts_split’: dict: cluster members, split per collection, samples aggregated
‘edges’: dataframe: network edges part of one of the clusters
‘nodes’: dataframe: network nodes part of one of the clusters
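
The two thresholds documented for process_mcl_result can be illustrated on toy data. The sketch assumes that cluster membership is available per sample as a list of protein ids and that a reference group is a list of member ids. Both helper names are hypothetical, and annot_filter_mem_threshold is not modelled here.

```python
def members_over_threshold(sample_clusters, report_threshold=0.5):
    """Proteins found in the cluster in a fraction of samples >= report_threshold."""
    n_samples = len(sample_clusters)
    counts = {}
    for members in sample_clusters.values():
        for protein in set(members):
            counts[protein] = counts.get(protein, 0) + 1
    return {
        protein: n / n_samples
        for protein, n in counts.items()
        if n / n_samples >= report_threshold
    }


def annotate(cluster_members, reference_groups, annot_fraction_threshold=0.5):
    """Reference groups with at least the given fraction of members in the cluster."""
    cluster = set(cluster_members)
    hits = {}
    for name, ref_members in reference_groups.items():
        ref = set(ref_members)
        if not ref:
            continue  # an empty reference group cannot be matched
        fraction = len(ref & cluster) / len(ref)
        if fraction >= annot_fraction_threshold:
            hits[name] = fraction
    return hits


# Made-up cluster membership in three samples of one collection.
sample_clusters = {
    "rep1": ["prot_a", "prot_b", "prot_x"],
    "rep2": ["prot_a", "prot_b"],
    "rep3": ["prot_a"],
}
members = members_over_threshold(sample_clusters, report_threshold=0.5)

# Made-up reference groups: names as keys, member ids as values.
reference_groups = {
    "group_1": ["prot_a", "prot_b", "prot_c"],
    "group_2": ["prot_y", "prot_z"],
}
annotation = annotate(members, reference_groups, annot_fraction_threshold=0.5)
print(members, annotation)
```

Here prot_x appears in only one of three samples and is dropped, while group_1 has two of its three members in the cluster and is therefore assigned.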

compact.main.save_results(mcl_res, out_folder, mappings)

process and write human-readable results to file

Args:

mcl_res (dict): dictionary with processed clustering results
out_folder (string): path of output directory
mappings (dict): id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject

compact.main.main(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, include_within=False, wbratio=1, mcl_inflation=2, output_location='.', job_name=None, report_threshold=0.5, filter_clusters=True, save_rthits=False, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25, processes=1, chunksize=1000)

complete rbo and clustering analysis from interaction matrices

Args:

nested_tags (dict): dict with nested tag structure for samples. keys: collection-level tags; values: sample-level tags
int_matrices (dict): contains interaction matrices. keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores
mappings (dict): id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
p (float, optional): Defaults to 0.90.: “top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
min_search_weight (float, optional): Defaults to 0.99.: determines search depth of ranked lists. the search goes to the depth at which the computed fraction of the total possible score equals min_search_weight
th_criterium (str, optional): Defaults to ‘percent’.: criterion to determine top hits. “percent” counts a pair of proteins as a hit if both are in the top n percent of each other’s ranked interactor lists; “best” takes only the single best hit
th_percent (int, optional): Defaults to 1.: top percent to consider when using “percent” th_criterium
include_within (bool, optional): Defaults to False.: whether to include within sample interaction scores in combined network used as input for MCL clustering
wbratio (int, optional): Defaults to 1.: ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average
mcl_inflation (int, optional): Defaults to 2.: inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.
output_location (str, optional): Defaults to current working directory.: filepath of output dir to be created
job_name (str, optional): Defaults to concatenated collection tags.: name of this job, used in output dir name
save_rthits (bool, optional): Defaults to False.: whether to save reciprocal top hits to disk
report_threshold (float, optional): Defaults to 0.5.: when reporting mcl results of multi-sample collections, proteins are counted as cluster members if they are a member of the cluster in a fraction of the samples >= report_threshold
filter_clusters (bool, optional): Defaults to True.: whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
perf_cluster_annotation (bool, optional): Defaults to False.: whether to perform automatic annotation of clusters using the reference groups
reference_groups (dict, optional): Defaults to None.: only used if perf_cluster_annotation == True. contains the names(keys) and members(values) of the reference groups
reference_tag (string, optional): Defaults to None.: only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference
annot_fraction_threshold (float, optional): Defaults to 0.5.: minimum fraction of a reference group that should be present in the cluster for that group to be assigned as annotation
annot_filter_mem_threshold (float, optional): Defaults to 0.25.: cluster members scoring below threshold will be ignored in determining overlap of cluster with reference. if value is None no filtering is applied
processes (int, optional): Defaults to 1.: number of cpu cores to use
chunksize (int, optional): Defaults to 1000.: parallelization parameter

Returns:

dict: mcl_res, containing:

‘clust_info’: pd dataframe: table with information of each identified cluster
‘clusts’: dict: clusters with all members together
‘clusts_split’: dict: cluster members, split per collection, samples aggregated
‘edges’: dataframe: network edges part of one of the clusters
‘nodes’: dataframe: network nodes part of one of the clusters
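
Because the three input dicts must agree on their tags, a quick consistency check before calling main can catch mismatches early. The sketch below is an illustration under assumptions: check_inputs is a hypothetical helper that is not part of CompaCt, the sample-level tags are given as lists, and only the keys of int_matrices are inspected so that the example runs without pandas.

```python
def check_inputs(nested_tags, matrix_tags, mappings):
    """Return a list of problems; an empty list means the tags are consistent."""
    problems = []
    sample_tags = [tag for samples in nested_tags.values() for tag in samples]
    for tag in sample_tags:
        if tag not in matrix_tags:
            problems.append(f"no interaction matrix for sample tag {tag!r}")
    for tag in matrix_tags:
        if tag not in sample_tags:
            problems.append(f"interaction matrix {tag!r} is not listed in nested_tags")
    for query, subject in mappings:
        for collection in (query, subject):
            if collection not in nested_tags:
                problems.append(
                    f"mapping {(query, subject)!r} refers to unknown collection {collection!r}"
                )
    return problems


# Made-up tags: two collections, three samples, one id mapping between them.
nested_tags = {"human": ["human_rep1", "human_rep2"], "yeast": ["yeast_rep1"]}
matrix_tags = {"human_rep1", "human_rep2", "yeast_rep1"}  # keys of int_matrices
mappings = {("human", "yeast"): {"H1": "Y1", "H2": "Y2"}}

print(check_inputs(nested_tags, matrix_tags, mappings))  # prints []

# With int_matrices holding one pd dataframe per sample tag, the documented
# signature suggests a call like the following (not executed here):
# mcl_res = compact.main.main(nested_tags, int_matrices, mappings, processes=1)
```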