compact.MCL_clustering

create combined network from rbo-scored data, cluster it with MCL, and process the cluster results.

main functions

create_combined_network: given rbo scores between datasets, create combined network with normalised scores
run_MCL: runs MCL command line tool as subprocess
separate_subclusters: given MCL clusters, separates per collection, using replicates to compute fraction clustered scores
process_annot_MCL_res: processes MCL clusters into annotated CompaCt clusters and subclusters

compact.MCL_clustering.create_combined_network(between_scores, network_fn, include_within=False, within_scores=None, wbratio=1)

generates network with combined edges from multiple comparisons

Args:

[within|between]_scores (list of pd series): top hit scores for each comparison.
network_fn (string): filepath of output network
include_within (bool, optional): whether to include within scores in combined network. Defaults to False.
wbratio (numeric, optional): ratio between average within and between scores. Only used when include_within=True. Defaults to 1.
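
The role of wbratio can be illustrated with a small sketch. This is not the library's implementation: plain dicts stand in for the pandas Series, the function name combine_scores_sketch is hypothetical, and the rescaling rule (scale within scores so that mean(within) / mean(between) equals wbratio) is an assumption about what "normalised" means here.

```python
from statistics import mean


def combine_scores_sketch(between_scores, within_scores=None, wbratio=1.0):
    """Hypothetical stand-in: merge edge scores into one {(a, b): score} dict.

    between_scores / within_scores: lists of {(node_a, node_b): score} dicts.
    """
    combined = {}
    for scores in between_scores:
        combined.update(scores)
    if not within_scores:
        return combined

    within = {}
    for scores in within_scores:
        within.update(scores)

    # Assumed normalisation: rescale within scores so that
    # mean(within) / mean(between) == wbratio.
    scale = wbratio * mean(list(combined.values())) / mean(list(within.values()))
    combined.update({edge: score * scale for edge, score in within.items()})
    return combined


between = [{("a1", "b1"): 0.4, ("a2", "b2"): 0.6}]
within = [{("a1", "a2"): 0.9, ("b1", "b2"): 0.7}]
network = combine_scores_sketch(between, within, wbratio=2)
```

In this toy input the between scores keep their values, while the within scores are stretched until their average is twice the between average.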

compact.MCL_clustering.run_MCL(input_fn, output_fn, inflation=2, processes=1)

runs mcl command line tool as subprocess

Args:

input_fn (string): filepath of input network to be clustered
output_fn (string): filepath of output result
inflation (int, optional): mcl inflation param. Defaults to 2.
processes (int, optional): number of processes/threads. Defaults to 1.
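
As a rough illustration of running mcl as a subprocess, the sketch below builds a command line from the same four parameters. The helper names are hypothetical, and the choice of flags is an assumption: --abc (label input), -I (inflation), -te (expansion threads) and -o (output file) are standard mcl options, but the exact invocation used by run_MCL may differ.

```python
import subprocess


def build_mcl_command(input_fn, output_fn, inflation=2, processes=1):
    """Hypothetical helper: assemble an mcl command line as a list of arguments."""
    return [
        "mcl", input_fn,
        "--abc",                 # assumed: network stored as "nodeA nodeB weight" lines
        "-I", str(inflation),    # inflation parameter
        "-te", str(processes),   # number of expansion threads
        "-o", output_fn,         # where mcl writes the clusters
    ]


def run_mcl_sketch(input_fn, output_fn, inflation=2, processes=1):
    """Run mcl and raise CalledProcessError if it exits with a non-zero status."""
    cmd = build_mcl_command(input_fn, output_fn, inflation, processes)
    subprocess.run(cmd, check=True)


cmd = build_mcl_command("network.abc", "clusters.txt", inflation=2, processes=4)
```

Passing the command as a list avoids shell quoting problems with file paths that contain spaces.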

compact.MCL_clustering.separate_subclusters(clusters, nested_tags)

separates clusters into subclusters per collection, pooling samples

counts occurrences of each member in a cluster for each sample

Args:

clusters (dict): all clusters. keys: numeric cluster id; values: list of cluster members
nested_tags (dict of dicts): nested tag structure for profiles. keys: collection-level tags; values: sample-level tags

Returns:

dict of dicts: cluster members, separated per collection, with member occurrence aggregated over samples per collection
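
The pooling step can be sketched as follows. Several details are assumptions made for illustration only: member ids are taken to look like "<sample_tag>::<protein_id>", the inner level of nested_tags is assumed to be keyed by sample tag, and the aggregated value is expressed as the fraction of a collection's samples in which the protein occurs. The function name is hypothetical.

```python
from collections import defaultdict


def separate_subclusters_sketch(clusters, nested_tags, sep="::"):
    """Hypothetical sketch: split clusters per collection, pooling samples."""
    # Map every sample-level tag back to its collection-level tag.
    sample_to_collection = {
        sample: collection
        for collection, samples in nested_tags.items()
        for sample in samples
    }
    result = {}
    for cluster_id, members in clusters.items():
        counts = defaultdict(lambda: defaultdict(int))
        for member in members:
            sample, protein = member.split(sep, 1)  # assumed id format
            counts[sample_to_collection[sample]][protein] += 1
        # Turn per-collection counts into the fraction of samples clustered.
        result[cluster_id] = {
            collection: {
                protein: n / len(nested_tags[collection])
                for protein, n in proteins.items()
            }
            for collection, proteins in counts.items()
        }
    return result


nested_tags = {"A": {"A1": {}, "A2": {}}, "B": {"B1": {}}}
clusters = {0: ["A1::P1", "A2::P1", "A1::P2", "B1::Q1"]}
split = separate_subclusters_sketch(clusters, nested_tags)
```

Here P1 is clustered in both replicates of collection A and P2 in only one of them, which is the kind of difference a fraction clustered score is meant to capture.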

compact.MCL_clustering.process_annot_MCL_res(res_fn, nested_tags, network_fn, mappings, report_threshold=0.5, filter_clusters=True)

processes and annotates MCL results

Parses raw MCL cluster results. Separates clusters per collection, aggregating membership over samples within a collection. Generates cluster info table, and optionally produces node and edge tables for network nodes and edges that are part of clusters for further analysis.
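
For orientation, the parsing step might look like the sketch below. When mcl is run in label mode it writes one cluster per line with tab-separated member labels; that res_fn is in this format, and that clusters are numbered from 0 in file order, are assumptions. The function names are hypothetical.

```python
def parse_mcl_lines(lines):
    """Hypothetical parser: one cluster per line, members separated by tabs."""
    clusters = {}
    for line in lines:
        members = [m for m in line.rstrip("\n").split("\t") if m]
        if members:  # skip blank lines
            clusters[len(clusters)] = members
    return clusters


def parse_mcl_file(res_fn):
    """Read an MCL result file and return {cluster_id: [member, ...]}."""
    with open(res_fn) as handle:
        return parse_mcl_lines(handle)


example = ["A1::P1\tA2::P1\tB1::Q1\n", "\n", "A1::P7\tB1::Q7\n"]
parsed = parse_mcl_lines(example)
```

The resulting dict has the same shape as the clusters argument described for separate_subclusters: numeric ids mapped to lists of members.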

Args:

res_fn (str): filepath of MCL output file
nested_tags (dict of dicts): nested tag structure for profiles. keys: collection-level tags; values: sample-level tags
network_fn (str): filepath, location of combined network file
mappings (dict): id mappings between collections. keys: tuple with (query, subject) collection-level tags; values: dicts with id mappings from query to subject
report_threshold (float, optional): when reporting MCL results for multi-sample collections, a protein is counted as a cluster member if it is a member of the cluster in a fraction of the samples >= report_threshold. Defaults to 0.5.
filter_clusters (bool, optional): whether to filter out clusters with fewer than 2 matches or proteins over report_threshold. Defaults to True.
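
How report_threshold and filter_clusters interact can be sketched on a toy input. This is an illustration under assumptions, not the library's logic: the input is taken to be a nested dict of {cluster_id: {collection: {protein: fraction_of_samples}}}, and "fewer than 2 proteins over report_threshold" is interpreted as a count over all collections combined. The function name is hypothetical, and the matching between collections via mappings is left out.

```python
def apply_report_threshold(clusts_split, report_threshold=0.5, filter_clusters=True):
    """Hypothetical sketch: keep members whose fraction clustered passes the threshold."""
    reported = {}
    for cluster_id, per_collection in clusts_split.items():
        kept = {}
        for collection, members in per_collection.items():
            passing = [m for m, frac in members.items() if frac >= report_threshold]
            if passing:
                kept[collection] = passing
        n_members = sum(len(ms) for ms in kept.values())
        # Assumed filter: drop clusters left with fewer than 2 reported members.
        if filter_clusters and n_members < 2:
            continue
        reported[cluster_id] = kept
    return reported


clusts_split = {
    0: {"A": {"P1": 1.0, "P2": 0.5, "P3": 0.25}, "B": {"Q1": 1.0}},
    1: {"A": {"P9": 0.25}, "B": {"Q9": 1.0}},
}
filtered = apply_report_threshold(clusts_split)
unfiltered = apply_report_threshold(clusts_split, filter_clusters=False)
```

With the default threshold of 0.5, P3 and P9 are not reported, and cluster 1 is dropped when filtering is on because only Q9 remains.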

Returns:

dict: processed mcl clustering results, containing:

'clust_info' (pd dataframe): table with information on each identified cluster
'clusts' (dict): clusters with all members together
'clusts_split' (nested dict): cluster members, split per collection, with samples aggregated. members are dicts with key: member id, value: member weight
'best_guess' (dict of dicts): contains a dict of subclusters for each collection. subclusters: list of members that pass best guess selection
'match_over_threshold' (dict of dicts): contains a dict of subclusters for each collection. subclusters: list of members that have a match in another collection that is over the fraction clustered threshold
'edges' (dataframe): network edges that are part of one of the clusters
'nodes' (dataframe): network nodes that are part of one of the clusters