compact.MCL_clustering

create a combined network from rbo-scored data, cluster it with MCL, and process the cluster results.

main functions
  • create_combined_network: given rbo scores between datasets, create combined network with normalised scores

  • run_MCL: runs MCL command line tool as subprocess

  • separate_subclusters: given MCL clusters, separates per collection, using replicates to compute fraction clustered scores

  • process_annot_MCL_res: processes MCL clusters into annotated CompaCt clusters and subclusters

compact.MCL_clustering.create_combined_network(between_scores, network_fn, include_within=False, within_scores=None, wbratio=1)

generates network with combined edges from multiple comparisons

Args:
[within|between]_scores (list of pd series):

top hit scores for each comparison.

network_fn (string):

filepath of output network

include_within (bool, optional): Defaults to False.

Whether to include within scores in combined network

wbratio (numeric, optional): Defaults to 1.

ratio between the average within score and the average between score. Only used when include_within=True
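The role of wbratio can be illustrated with a minimal sketch. This is not the package's implementation: the helper name combine_scores, the use of plain dicts keyed by edge tuples instead of pd series, and the rescaling rule (scale within scores so that mean(within) / mean(between) equals wbratio) are all assumptions made for illustration, and scores are assumed to be positive.

```python
from statistics import mean


def combine_scores(between_scores, within_scores=None, wbratio=1.0):
    """Hypothetical helper: merge edge scores from several comparisons.

    Each element of between_scores / within_scores is a dict mapping an
    (node_a, node_b) tuple to a score. This stands in for the pd series
    used by the real function.
    """
    combined = {}
    for scores in between_scores:
        combined.update(scores)

    if within_scores:
        within = {}
        for scores in within_scores:
            within.update(scores)
        # Assumed rule: rescale within scores so that their mean is
        # wbratio times the mean of the between scores.
        scale = wbratio * mean(combined.values()) / mean(within.values())
        for edge, score in within.items():
            combined[edge] = score * scale
    return combined


between = [{("a1", "b1"): 0.8, ("a2", "b2"): 0.4}]
within = [{("a1", "a2"): 0.3}]
network = combine_scores(between, within, wbratio=1.0)
for (source, target), weight in sorted(network.items()):
    print(source, target, round(weight, 3), sep="\t")
```

With wbratio=1 the single within edge is scaled up to the average between score; a smaller wbratio would down-weight within edges relative to between edges.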

compact.MCL_clustering.run_MCL(input_fn, output_fn, inflation=2, processes=1)

runs mcl command line tool as subprocess

Args:
input_fn (string):

filepath of input network to be clustered

output_fn (string):

filepath of output result

inflation (int, optional):

mcl inflation param. Defaults to 2.

processes (int, optional):

number of processes/threads. Defaults to 1.
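A minimal sketch of how such a subprocess call could look is given below. The helper names build_mcl_command and run_mcl_sketch are hypothetical, and the choice of flags is an assumption: the sketch presumes the network file is in MCL's label ("abc") format and uses the mcl options --abc, -I (inflation), -te (expansion threads) and -o (output file). The flags that run_MCL actually passes may differ.

```python
import subprocess


def build_mcl_command(input_fn, output_fn, inflation=2, processes=1):
    """Hypothetical helper: assemble the mcl argument list."""
    return [
        "mcl", input_fn,
        "--abc",                 # assumed: input is a label-format edge list
        "-I", str(inflation),    # inflation parameter
        "-te", str(processes),   # number of expansion threads
        "-o", output_fn,         # where mcl writes the clusters
    ]


def run_mcl_sketch(input_fn, output_fn, inflation=2, processes=1):
    """Run mcl and raise CalledProcessError if it exits with an error."""
    cmd = build_mcl_command(input_fn, output_fn, inflation, processes)
    subprocess.run(cmd, check=True)


print(" ".join(build_mcl_command("network.abc", "clusters.txt", inflation=2)))
```

Passing the command as a list avoids shell quoting issues with file paths. The mcl executable has to be available on the PATH for run_mcl_sketch to work.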

compact.MCL_clustering.separate_subclusters(clusters, nested_tags)

separates clusters into subclusters per collection, pooling samples

counts occurrences of each member in a cluster for each sample

Args:
clusters (dict): all clusters

keys: numeric cluster id
values: list of cluster members

nested_tags (dict of dicts): nested tag structure for profiles

keys: collection-level tags
values: sample-level tags

Returns:
dict of dicts: cluster members, separated per collection

member occurrence aggregated over samples per collection
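The pooling of samples can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the function name separate_subclusters_sketch is hypothetical, cluster members are assumed to be strings of the form "<sample tag>::<protein id>", and the fraction clustered score is assumed to be the number of samples in which a protein occurs in the cluster divided by the number of samples in its collection.

```python
from collections import defaultdict


def separate_subclusters_sketch(clusters, nested_tags, sep="::"):
    """Split each cluster per collection and pool its samples."""
    # Map every sample-level tag back to its collection-level tag.
    sample_to_collection = {
        sample: collection
        for collection, samples in nested_tags.items()
        for sample in samples
    }
    result = {}
    for cluster_id, members in clusters.items():
        counts = defaultdict(lambda: defaultdict(int))
        for member in members:
            sample, protein = member.split(sep, 1)
            counts[sample_to_collection[sample]][protein] += 1
        # Convert per-sample counts into a fraction of the collection's samples.
        result[cluster_id] = {
            collection: {p: n / len(nested_tags[collection]) for p, n in prots.items()}
            for collection, prots in counts.items()
        }
    return result


clusters = {0: ["s1::P1", "s2::P1", "s1::P2", "t1::Q1"]}
nested_tags = {"A": {"s1": None, "s2": None}, "B": {"t1": None}}
print(separate_subclusters_sketch(clusters, nested_tags))
```

In this toy input, P1 is clustered in both samples of collection A and P2 in only one of them, so P2 receives half the weight of P1.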

compact.MCL_clustering.process_annot_MCL_res(res_fn, nested_tags, network_fn, mappings, report_threshold=0.5, filter_clusters=True)

processes and annotates MCL results

Parses raw MCL cluster results. Separates clusters per collection, aggregating membership over samples within a collection. Generates cluster info table, and optionally produces node and edge tables for network nodes and edges that are part of clusters for further analysis.
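The parsing step can be sketched as below, assuming MCL was run on label-format input so that its output file holds one cluster per line with tab-separated member labels. The function name parse_mcl_lines and the choice to number clusters in file order starting from 0 are assumptions for illustration.

```python
def parse_mcl_lines(lines):
    """Turn MCL label output into {cluster id: [member, ...]}."""
    clusters = {}
    for line in lines:
        members = line.rstrip("\n").split("\t")
        if members == [""]:
            continue  # skip blank lines
        clusters[len(clusters)] = members
    return clusters


raw = ["s1::P1\ts2::P1\tt1::Q1\n", "s1::P2\tt1::Q2\n"]
for cluster_id, members in parse_mcl_lines(raw).items():
    print(cluster_id, members)
```

The function accepts any iterable of lines, so an open file handle on the MCL result file can be passed directly.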

Args:
res_fn (str):

filepath of MCL output file

nested_tags (dict of dicts):

nested tag structure for profiles
keys: collection-level tags
values: sample-level tags

network_fn (str):

filepath, location of combined network file

mappings (dict): containing id mappings between collections

keys: tuple with (query, subject) collection-level tags
values: dicts with id mappings from query to subject

report_threshold (float, optional): Defaults to 0.5.

when reporting MCL results for multi-sample collections, a protein is counted as a member if it is a member of the cluster in a fraction of the samples >= report_threshold.

filter_clusters (bool, optional): Defaults to True.

whether to filter out clusters with fewer than 2 matches or proteins over report_threshold
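How report_threshold and filter_clusters interact can be sketched as follows. The function name apply_report_threshold is hypothetical, the input is assumed to be per-collection member weights (fraction of samples in which a member was clustered), and only the "fewer than 2 proteins over report_threshold" criterion is modelled; the "matches" criterion depends on the id mappings and is left out.

```python
def apply_report_threshold(clusts_split, report_threshold=0.5, filter_clusters=True):
    """Keep members whose weight reaches the threshold, per collection."""
    reported = {}
    for cluster_id, per_collection in clusts_split.items():
        kept = {
            collection: [m for m, weight in members.items() if weight >= report_threshold]
            for collection, members in per_collection.items()
        }
        n_kept = sum(len(members) for members in kept.values())
        if filter_clusters and n_kept < 2:
            continue  # assumed rule: drop clusters with fewer than 2 reported proteins
        reported[cluster_id] = kept
    return reported


clusts_split = {
    0: {"A": {"P1": 1.0, "P2": 0.25}, "B": {"Q1": 0.5}},
    1: {"A": {"P3": 0.5}, "B": {"Q2": 0.25}},
}
print(apply_report_threshold(clusts_split))
print(apply_report_threshold(clusts_split, filter_clusters=False))
```

In this toy input, cluster 1 retains a single protein at the default threshold, so it is only reported when filter_clusters is False.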

Returns:
dict: processed MCL clustering results, containing:
‘clust_info’: pd dataframe

table with information of each identified cluster

‘clusts’: dict

clusters with all members together

‘clusts_split’: nested dict

cluster members, split per collection, with samples aggregated
members are dicts with key: member id, value: member weight

‘best_guess’: dict of dicts

contains dict of subclusters for each collection
subclusters: list of members that pass best guess selection

‘match_over_threshold’: dict of dicts

contains dict of subclusters for each collection
subclusters: list of members that have a match in another collection that is over the fraction clustered threshold

‘edges’: dataframe

network edges that are part of one of the clusters

‘nodes’: dataframe

network nodes that are part of one of the clusters