compact.main
This module contains the high-level workflow of this project and runs the major steps of the CompaCt analysis.
- The main functions:
between_scoring: rbo scoring between all provided correlation datasets
score_comparison: rbo scoring and top hit selection between two correlation datasets
mcl_clustering: perform clustering on computed rbo scores
process_mcl_result: process raw MCL output to get annotated CompaCt result clusters and subclusters
save_results: save CompaCt results to an output folder of choice
- compact.main.between_scoring(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)
computes top hit scores between all interaction matrices
first performs rbo scoring for all pairs between samples, then determines pairs that are reciprocal top hits
- Args:
- nested_tags (dict): dict with nested tag structure for samples
keys: collection-level tags; values: sample-level tags
- int_matrices (dict): contains interaction matrices
keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores
- mappings (dict): id mappings between collections
keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- p (float, optional): Defaults to 0.90.
“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
- min_search_weight (float, optional): Defaults to 0.99.
determines search depth of ranked lists. will search to the depth at which the computed fraction of the total possible score equals min_search_weight
- th_criterium (str, optional): Defaults to ‘percent’.
criterium to determine top hits. “percent” counts proteins as top hits if both are in each other’s top n percent of ranked interactor lists
- th_percent (int, optional): Defaults to 1.
top percent to consider when using “percent” th_criterium
- processes (int, optional): Defaults to 1.
number of cpu cores to use
- chunksize (int, optional): Defaults to 1000.
parallelization parameter
- Returns:
- dict: structure: {(left_tag)(right_tag):top_hits}
top hits (pd series): top hits between two samples
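The “percent” criterium for reciprocal top hits can be illustrated with a small, self-contained sketch. This is not CompaCt's implementation: the helper names `top_fraction` and `reciprocal_top_hits`, the nested-dict score layout, and the rule of rounding the cutoff up while keeping at least one hit are all assumptions made for illustration.

```python
import math


def top_fraction(scores, percent):
    """Return the ids ranked in the top `percent` % of a {id: score} dict."""
    # Assumption: round the cutoff up and always keep at least one hit.
    n_keep = max(1, math.ceil(len(scores) * percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:n_keep])


def reciprocal_top_hits(left_to_right, right_to_left, percent=1):
    """Keep pairs that are in each other's top `percent` % of hits.

    left_to_right: {left_id: {right_id: score}}
    right_to_left: {right_id: {left_id: score}}
    """
    hits = {}
    for left_id, scores in left_to_right.items():
        for right_id in top_fraction(scores, percent):
            back = right_to_left.get(right_id, {})
            if back and left_id in top_fraction(back, percent):
                hits[(left_id, right_id)] = scores[right_id]
    return hits


# Toy scores between two samples with two proteins each (hypothetical ids).
left_to_right = {
    "A1": {"B1": 0.8, "B2": 0.1},
    "A2": {"B1": 0.6, "B2": 0.3},
}
right_to_left = {
    "B1": {"A1": 0.8, "A2": 0.6},
    "B2": {"A1": 0.1, "A2": 0.3},
}
hits = reciprocal_top_hits(left_to_right, right_to_left, percent=50)
print(hits)
```

In this toy case only A1 and B1 rank each other first, so A2 (whose best hit B1 prefers A1) yields no reciprocal pair.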
- compact.main.score_comparison(left_scores, right_scores, mapping=False, p=0.9, search_depth=None, th_criterium='percent', th_percent=1, processes=1, chunksize=1000)
Determine RBO scores and reciprocal top hits between a pair of interaction matrices
- Args:
- [left|right]_scores (df): symmetric interaction matrix
values: within-sample pairwise interaction scores
- mappings (dict): id mappings between collections
keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- p (float, optional): Defaults to 0.90.
“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
- search_depth (int, optional): Defaults to None.
number of ranks to consider when computing RBO scores. if None, considers complete ranked lists
- th_criterium (str, optional): Defaults to ‘percent’.
criterium to determine top hits. “percent” counts proteins as top hits if both are in each other’s top n percent of ranked interactor lists
- th_percent (int, optional): Defaults to 1.
top percent to consider when using “percent” th_criterium
- processes (int, optional): Defaults to 1.
number of cpu cores to use
- chunksize (int, optional): Defaults to 1000.
parallelization parameter
- Returns:
- pd.Series: reciprocal top hits in pd series format
2-level multiindex with id pair, values are scores
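To make the roles of p and search_depth concrete, here is a minimal sketch of base (non-extrapolated) rank-biased overlap on two ranked lists. It assumes each list contains unique ids; the function name `rbo_truncated` is hypothetical, and CompaCt's own scoring may differ (for example in how it treats lists of unequal length or extrapolates beyond the evaluated depth).

```python
def rbo_truncated(left, right, p=0.9, search_depth=None):
    """Base RBO of two ranked lists, summed over the first `search_depth` ranks.

    Assumes each list holds unique ids. If search_depth is None, the full
    shared length of the two lists is used.
    """
    depth = min(len(left), len(right))
    if search_depth is not None:
        depth = min(depth, search_depth)

    seen_left, seen_right = set(), set()
    overlap = 0   # size of the intersection of the two prefixes
    score = 0.0
    for d in range(1, depth + 1):
        l_item, r_item = left[d - 1], right[d - 1]
        if l_item == r_item:
            overlap += 1
        else:
            # Each new item counts if the other list already contained it.
            overlap += (l_item in seen_right) + (r_item in seen_left)
        seen_left.add(l_item)
        seen_right.add(r_item)
        # Agreement at depth d, weighted geometrically by p.
        score += p ** (d - 1) * overlap / d
    return (1 - p) * score


reference = ["a", "b", "c", "d"]
top_swap = ["b", "a", "c", "d"]      # disagreement at the top ranks
bottom_swap = ["a", "b", "d", "c"]   # disagreement at the bottom ranks

print(rbo_truncated(reference, top_swap))
print(rbo_truncated(reference, bottom_swap))
```

A swap near the top of the list lowers the score more than the same swap near the bottom, and lowering p strengthens that effect because the weights decay faster with rank.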
- compact.main.mcl_clustering(within_top_hits, between_top_hits, out_folder, include_within=False, wbratio=1, mcl_inflation=2, processes=1)
perform mcl on network from a set of samples
- Args:
- [within|between]_top_hits (dict):
dict with within/between top hits (pd series) per profile/comparison
- out_folder (string):
filepath of output directory
- include_within (bool, optional): Defaults to False.
whether to include within profile interaction scores in combined network used as input for MCL clustering
- wbratio (int, optional): Defaults to 1.
ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average
- mcl_inflation (int, optional): Defaults to 2.
inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.
- processes (int, optional): Defaults to 1.
number of cpu cores to use
- Returns:
- [mcl|network]_outfn (string):
filepath of saved cluster and network results
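One possible reading of wbratio is sketched here: within scores are rescaled so that the ratio between the average within score and the average between score equals wbratio. Which side is rescaled, and the helper name `rescale_within`, are assumptions for illustration and not taken from the CompaCt source.

```python
from statistics import mean


def rescale_within(within_scores, between_scores, wbratio=1):
    """Scale within scores so that mean(within) / mean(between) == wbratio."""
    factor = wbratio * mean(between_scores) / mean(within_scores)
    return [score * factor for score in within_scores]


# Toy values: interaction scores within a sample and rbo scores between samples.
within = [0.9, 0.7, 0.8]
between = [0.2, 0.4]

scaled = rescale_within(within, between, wbratio=2)
print(mean(scaled) / mean(between))
```

With wbratio=1 both score types contribute edges of comparable average weight to the combined network; larger values give within-sample edges more influence on the clustering.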
- compact.main.process_mcl_result(mcl_outfn, nested_tags, network_outfn, mappings, report_threshold=0.5, filter_clusters=True, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25)
processes and annotates raw mcl results for interpretation
separates clusters per collection and computes cluster metrics
- Args:
- mcl_outfn (string):
filepath of mcl result
- nested_tags (dict of dicts):
nested tag structure for profiles. keys: collection-level tags; values: sample-level tags
- network_outfn (string):
filepath of combined network
- mappings (dict):
containing id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- report_threshold (float, optional):
in reporting mcl results of multi-sample collections, proteins are counted as members if they are a member of the cluster in a fraction of the samples >= report_threshold. Defaults to 0.5.
- filter_clusters (bool, optional): Defaults to True.
whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
- perf_cluster_annotation (bool, optional): Defaults to False.
whether to perform automatic annotation of clusters using reference groups
- reference_groups (dict, optional): Defaults to None.
only used if perf_cluster_annotation == True. contains the names (keys) and members (values) of the reference groups
- reference_tag (string, optional): Defaults to None.
only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference
- annot_fraction_threshold (float, optional): Defaults to 0.5.
minimum fraction of a reference group that should be present in the cluster for that group to be assigned to the cluster
- annot_filter_mem_threshold (float, optional): Defaults to 0.25.
cluster members scoring below this threshold will be ignored in determining overlap of cluster with reference. if value is None, no filtering is applied
- Returns:
- dict: processed mcl clustering results, containing:
- ‘clust_info’: pd dataframe
table with information of each identified cluster
- ‘clusts’: dict
clusters with all members together
- ‘clusts_split’: dict
cluster members, split per collection, samples aggregated
- ‘edges’: dataframe
network edges part of one of the clusters
- ‘nodes’: dataframe
network nodes part of one of the clusters
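The thresholds used in this step can be illustrated with a short sketch. It assumes that a member's “score” is the fraction of samples in which it belongs to the cluster, and that a reference group is assigned when enough of its members are found among the retained cluster members. The helper names `membership_fractions`, `reported_members` and `annotate_cluster` are hypothetical and do not come from CompaCt.

```python
def membership_fractions(sample_members):
    """Fraction of samples in which each protein is a cluster member.

    sample_members: {sample_tag: set of protein ids in the cluster}
    """
    n_samples = len(sample_members)
    counts = {}
    for members in sample_members.values():
        for protein in members:
            counts[protein] = counts.get(protein, 0) + 1
    return {protein: n / n_samples for protein, n in counts.items()}


def reported_members(fractions, report_threshold=0.5):
    """Proteins whose membership fraction is >= report_threshold."""
    return {p: f for p, f in fractions.items() if f >= report_threshold}


def annotate_cluster(fractions, reference_groups,
                     annot_fraction_threshold=0.5,
                     annot_filter_mem_threshold=0.25):
    """Return {reference name: fraction of that reference found in the cluster}."""
    if annot_filter_mem_threshold is None:
        kept = set(fractions)  # no filtering of low-scoring members
    else:
        kept = {p for p, f in fractions.items()
                if f >= annot_filter_mem_threshold}
    matches = {}
    for name, ref_members in reference_groups.items():
        ref_members = set(ref_members)
        if not ref_members:
            continue
        fraction = len(kept & ref_members) / len(ref_members)
        if fraction >= annot_fraction_threshold:
            matches[name] = fraction
    return matches


# One cluster observed in three samples of the same collection (toy ids).
sample_members = {
    "s1": {"P1", "P2", "P3"},
    "s2": {"P1", "P2"},
    "s3": {"P1", "P4"},
}
reference_groups = {
    "complex_X": ["P1", "P2", "P5"],
    "complex_Y": ["P6", "P7", "P8"],
}

fractions = membership_fractions(sample_members)
members = reported_members(fractions, report_threshold=0.5)
matches = annotate_cluster(fractions, reference_groups)
print(sorted(members), matches)
```

Here P3 and P4 occur in only one of three samples and fall below report_threshold, while two of the three members of the hypothetical complex_X are present, which passes annot_fraction_threshold.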
- compact.main.save_results(mcl_res, out_folder, mappings)
process and write human-readable results to file
- Args:
- mcl_res (dict):
dictionary with processed clustering results
- out_folder (string):
path of output directory
- mappings (dict):
containing id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- compact.main.main(nested_tags, int_matrices, mappings, p=0.9, min_search_weight=0.99, th_criterium='percent', th_percent=1, include_within=False, wbratio=1, mcl_inflation=2, output_location='.', job_name=None, report_threshold=0.5, filter_clusters=True, save_rthits=False, perf_cluster_annotation=False, reference_groups=None, reference_tag=None, annot_fraction_threshold=0.5, annot_filter_mem_threshold=0.25, processes=1, chunksize=1000)
complete rbo and clustering analysis from interaction matrices
- Args:
- nested_tags (dict): dict with nested tag structure for samples
keys: collection-level tags; values: sample-level tags
- int_matrices (dict): contains interaction matrices
keys: sample-level tag of interaction matrix; values: pd dataframe with pairwise interaction scores
- mappings (dict): id mappings between collections
keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- p (float, optional): Defaults to 0.90.
“top heaviness” parameter of rbo scoring. lower values for p result in increased top-heaviness
- min_search_weight (float, optional): Defaults to 0.99.
determines search depth of ranked lists. will search to the depth at which the computed fraction of the total possible score equals min_search_weight
- th_criterium (str, optional): Defaults to ‘percent’.
criterium to determine top hits. “percent” counts proteins as top hits if both are in each other’s top n percent of ranked interactor lists. “best” takes only the single best hit
- th_percent (int, optional): Defaults to 1.
top percent to consider when using “percent” th_criterium
- include_within (bool, optional): Defaults to False.
whether to include within sample interaction scores in combined network used as input for MCL clustering
- wbratio (int, optional): Defaults to 1.
ratio of within/between score averages. within (interaction) and between (rbo) scores are normalised based on their average
- mcl_inflation (int, optional): Defaults to 2.
inflation parameter of the mcl clustering algorithm. determines granularity of clustering result.
- output_location (str, optional): Defaults to current working directory.
filepath of output dir to be created
- job_name (str, optional): Defaults to the concatenated collection tags.
name of this job, used in output dir name
- save_rthits (bool, optional): Defaults to False.
whether to save reciprocal top hits to disk
- report_threshold (float, optional): Defaults to 0.5.
in reporting mcl results of multi-sample collections, proteins are counted as members if they are a member of the cluster in a fraction of the samples >= report_threshold
- filter_clusters (bool, optional): Defaults to True.
whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
- perf_cluster_annotation (bool, optional): Defaults to False.
whether to perform automatic annotation of clusters using reference groups
- reference_groups (dict, optional): Defaults to None.
only used if perf_cluster_annotation == True. contains the names (keys) and members (values) of the reference groups
- reference_tag (string, optional): Defaults to None.
only used if perf_cluster_annotation == True. tag of collection on which annotation is to be based. member names in collection should match member names in reference
- annot_fraction_threshold (float, optional): Defaults to 0.5.
minimum fraction of a reference group that should be present in the cluster for that group to be assigned to the cluster
- annot_filter_mem_threshold (float, optional): Defaults to 0.25.
cluster members scoring below this threshold will be ignored in determining overlap of cluster with reference. if value is None, no filtering is applied
- processes (int, optional): Defaults to 1.
number of cpu cores to use
- Returns (dict): mcl_res, contains:
- ‘clust_info’: pd dataframe
table with information of each identified cluster
- ‘clusts’: dict
clusters with all members together
- ‘clusts_split’: dict
cluster members, split per collection, samples aggregated
- ‘edges’: dataframe
network edges part of one of the clusters
- ‘nodes’: dataframe
network nodes part of one of the clusters
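The relation between p, min_search_weight and the resulting search depth can be sketched with the prefix-weight formula from the original rank-biased overlap publication (Webber et al., 2010), which gives the fraction of the total RBO weight carried by the first d ranks. Whether CompaCt derives its search depth in exactly this way is an assumption; the function names `rbo_prefix_weight` and `depth_for_weight` are hypothetical.

```python
import math


def rbo_prefix_weight(depth, p=0.9):
    """Fraction of the total RBO weight carried by the first `depth` ranks."""
    partial = sum(p ** i / i for i in range(1, depth))
    return (1 - p ** (depth - 1)
            + (1 - p) / p * depth * (math.log(1 / (1 - p)) - partial))


def depth_for_weight(min_search_weight=0.99, p=0.9, max_depth=100_000):
    """Smallest depth whose prefix weight reaches min_search_weight."""
    for depth in range(1, max_depth + 1):
        if rbo_prefix_weight(depth, p) >= min_search_weight:
            return depth
    return max_depth  # fallback if the target weight is not reached


for p in (0.5, 0.9):
    depth = depth_for_weight(0.99, p)
    print(p, depth, rbo_prefix_weight(depth, p))
```

With p = 0.9 the first ten ranks already carry roughly 86% of the total weight, and a lower p concentrates the weight even further at the top, so a smaller depth is needed to reach the same min_search_weight.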