compact.MCL_clustering
Creates a combined network from rbo-scored data, clusters it with MCL, and processes the cluster results.
- main functions
create_combined_network: given rbo scores between datasets, create combined network with normalised scores
run_MCL: runs MCL command line tool as subprocess
separate_subclusters: given MCL clusters, separates per collection, using replicates to compute fraction clustered scores
process_annot_MCL_res: processes MCL clusters into annotated CompaCt clusters and subclusters
- compact.MCL_clustering.create_combined_network(between_scores, network_fn, include_within=False, within_scores=None, wbratio=1)
generates network with combined edges from multiple comparisons
- Args:
- [within|between]_scores (list of pd series):
top hit scores for each comparison.
- network_fn (string):
filepath of output network
- include_within (bool, optional): Defaults to False.
Whether to include within scores in combined network
- wbratio (numeric, optional): Defaults to 1.
ratio between the average within and between scores; only used when include_within=True
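The role of wbratio can be illustrated with a minimal sketch. It assumes that the within scores are rescaled so that the ratio of the average within score to the average between score equals wbratio; the actual normalisation in create_combined_network may differ. The helper name rescale_within is hypothetical, and plain lists stand in for the pd series.

```python
from statistics import mean

def rescale_within(between, within, wbratio=1.0):
    """Hypothetical helper: scale within scores so that
    mean(within) / mean(between) equals wbratio."""
    # Assumption: a single multiplicative factor is applied to all within scores.
    factor = wbratio * mean(between) / mean(within)
    return [score * factor for score in within]

# Toy scores, not real data.
between = [0.2, 0.4, 0.6]
within = [0.5, 1.0, 1.5]
scaled = rescale_within(between, within, wbratio=2.0)
```

After rescaling, the average within score is twice the average between score, while the relative ordering of the within scores is unchanged.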
- compact.MCL_clustering.run_MCL(input_fn, output_fn, inflation=2, processes=1)
runs mcl command line tool as subprocess
- Args:
- input_fn (string):
filepath of input network to be clustered
- output_fn (string):
filepath of output result
- inflation (int, optional):
mcl inflation param. Defaults to 2.
- processes (int, optional):
number of processes/threads. Defaults to 1.
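A minimal sketch of how such a subprocess call could look, assuming the network file is in label ("abc") format and that the arguments map onto the mcl options `--abc`, `-I` (inflation), `-te` (expansion threads) and `-o` (output file). The helper names build_mcl_command and run_mcl_sketch are hypothetical; the flags used by run_MCL itself may differ.

```python
import subprocess

def build_mcl_command(input_fn, output_fn, inflation=2, processes=1):
    """Hypothetical helper: assemble the mcl argument list."""
    return [
        "mcl", input_fn,
        "--abc",                 # assumption: network stored as label pairs
        "-I", str(inflation),    # inflation parameter
        "-te", str(processes),   # number of expansion threads
        "-o", output_fn,         # output file
    ]

def run_mcl_sketch(input_fn, output_fn, inflation=2, processes=1):
    """Run mcl and raise CalledProcessError if it exits with an error."""
    cmd = build_mcl_command(input_fn, output_fn, inflation, processes)
    subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with file paths, and check=True surfaces a failed mcl run instead of silently leaving a missing output file.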
- compact.MCL_clustering.separate_subclusters(clusters, nested_tags)
separates clusters into subclusters per collection, pooling samples.
counts the occurrences of each member in a cluster across the samples of a collection
- Args:
- clusters (dict): all clusters
keys: numeric cluster id; values: list of cluster members
- nested_tags (dict of dicts): nested tag structure for profiles
keys: collection-level tags; values: sample-level tags
- Returns:
- dict of dicts: cluster members, separated per collection
member occurrence aggregated over samples per collection
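The pooling step can be sketched as follows. This is an illustration under stated assumptions, not the implementation: cluster members are represented as hypothetical (sample_tag, protein_id) tuples, the inner dicts of nested_tags are keyed by sample-level tag, and the aggregated value is the fraction of a collection's samples in which a protein is clustered.

```python
from collections import Counter

def separate_sketch(clusters, nested_tags):
    """Hypothetical sketch: split clusters per collection and compute,
    per protein, the fraction of samples in which it is a member."""
    # Map each sample-level tag back to its collection-level tag.
    sample_to_coll = {s: c for c, samples in nested_tags.items() for s in samples}
    result = {}
    for cid, members in clusters.items():
        counts = {c: Counter() for c in nested_tags}
        for sample, prot in members:
            counts[sample_to_coll[sample]][prot] += 1
        result[cid] = {
            c: {p: n / len(nested_tags[c]) for p, n in cnt.items()}
            for c, cnt in counts.items()
        }
    return result

# Toy input: collection "A" has two replicates, "B" has one.
nested_tags = {"A": {"A1": {}, "A2": {}}, "B": {"B1": {}}}
clusters = {0: [("A1", "p1"), ("A2", "p1"), ("A1", "p2"), ("B1", "q1")]}
split = separate_sketch(clusters, nested_tags)
```

Here p1 is clustered in both replicates of collection A and p2 in only one, so they end up with fractions 1.0 and 0.5 respectively.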
- compact.MCL_clustering.process_annot_MCL_res(res_fn, nested_tags, network_fn, mappings, report_threshold=0.5, filter_clusters=True)
processes and annotates MCL results
Parses raw MCL cluster results. Separates clusters per collection, aggregating membership over samples within a collection. Generates cluster info table, and optionally produces node and edge tables for network nodes and edges that are part of clusters for further analysis.
- Args:
- res_fn (str):
filepath of MCL output file
- nested_tags (dict of dicts):
nested tag structure for profiles. keys: collection-level tags; values: sample-level tags
- network_fn (str):
filepath, location of combined network file
- mappings (dict): containing id mappings between collections
keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
- report_threshold (float, optional): Defaults to 0.5.
when reporting MCL results for multi-sample collections, a protein is counted as a cluster member if it is a member of that cluster in a fraction of the samples >= report_threshold.
- filter_clusters (bool, optional): default True
whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
- Returns:
- dict: processed mcl clustering results, containing:
- 'clust_info': pd dataframe
table with information on each identified cluster
- 'clusts': dict
clusters with all members together
- 'clusts_split': nested dict
cluster members, split per collection, with samples aggregated. members are dicts with key: member id; value: member weight
- 'best_guess': dict of dicts
contains a dict of subclusters for each collection. subclusters: list of members that pass best guess selection
- 'match_over_threshold': dict of dicts
contains a dict of subclusters for each collection. subclusters: list of members that have a match in another collection that is over the fraction clustered threshold
- 'edges': dataframe
network edges that are part of one of the clusters
- 'nodes': dataframe
network nodes that are part of one of the clusters
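Two of the steps described for process_annot_MCL_res can be sketched in isolation. The sketch assumes the MCL result file is in label output format (one cluster per line, members separated by tabs) and that per-collection membership is available as a dict of member id to fraction of samples clustered. The function names parse_mcl_output and members_over_threshold are hypothetical and not part of the module.

```python
import io

def parse_mcl_output(handle):
    """Hypothetical parser: one cluster per non-empty line,
    members separated by tabs, numbered in file order."""
    rows = (line.strip().split("\t") for line in handle if line.strip())
    return dict(enumerate(rows))

def members_over_threshold(fractions, report_threshold=0.5):
    """Keep members clustered in a fraction of samples >= report_threshold."""
    return [m for m, frac in fractions.items() if frac >= report_threshold]

# Toy MCL output with two clusters, read from memory instead of a file.
clusters = parse_mcl_output(io.StringIO("a\tb\tc\nd\te\n"))
reported = members_over_threshold({"p1": 1.0, "p2": 0.5, "p3": 0.25})
```

With the default threshold of 0.5, p1 and p2 are reported and p3 is dropped, since the comparison is inclusive.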