compact.process_data

general functions for processing and parsing data

compact.process_data.parse_settings(settings_fn)

parse input settings tsv file

Args:
settings_fn (string):

path to settings file to be parsed

Raises:
ValueError:

if file content doesn’t match specifications

Returns tuple with (sample_data,mapping_data):
sample_data (dict of dicts):

filenames of interaction data per collection

mapping_data (dict):

filepaths of orthology files per comparison

compact.process_data.get_nested_tags(corr_dict)

fetches collection-replicate structure from parsed sample data

Returns: nested_tags (dict)

nested tag structure for profiles

keys: collection-level tags

values: sample-level tags
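The collection-replicate structure can be pictured with a small stand-in. This is only a sketch: `nested_tags_sketch` is a hypothetical helper, not the CompaCt function. It assumes the parsed sample data is a dict of dicts keyed first by collection-level tag and then by sample-level tag, and that the sample-level tags come back as lists. The tags and filenames are made up.

```python
def nested_tags_sketch(sample_data):
    """Map each collection-level tag to the sample-level tags below it."""
    # Assumption: sample_data looks like {collection_tag: {sample_tag: filename}}.
    return {coll: list(samples) for coll, samples in sample_data.items()}


# Made-up example input, shaped like the sample_data described above.
sample_data = {
    "human": {"rep1": "human_rep1.tsv", "rep2": "human_rep2.tsv"},
    "mouse": {"rep1": "mouse_rep1.tsv"},
}
nested_tags = nested_tags_sketch(sample_data)
```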

compact.process_data.parse_profile(tsv_fn)

parses standard format complexome profile into df

Args:
tsv_fn (str): filepath of file containing:

complexome profile in tab-separated text format, with:

a single header row with fraction ids

a single index column with protein ids

numeric abundance values

Returns:

pd df: complexome profile as dataframe

compact.process_data.parse_profiles(fn_dict, flat_output=True)

parse profiles for all collections,samples in fn_dict

Args:
fn_dict (nested dict):

filenames of interaction data per collection

flat_output (bool, optional). Defaults to True.

if True, will return a flat dictionary without a nested dictionary per collection. If False, will keep the nested structure with a separate nested dict per collection

Returns:

dict of dicts: parsed sample data in nested collection-replicate structure

compact.process_data.flatten_nested_dict(nested_dict)

converts nested dict to flat dict.
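The description above does not say how the flat keys are built. The sketch below shows one possible flattening, purely as an illustration: `flatten_sketch` is a hypothetical helper, and the `(outer, inner)` tuple key is an assumption chosen so that identical sample tags in different collections do not collide. The real function may use a different key scheme.

```python
def flatten_sketch(nested_dict):
    """Flatten {outer: {inner: value}} into {(outer, inner): value}."""
    flat = {}
    for outer_key, inner_dict in nested_dict.items():
        for inner_key, value in inner_dict.items():
            # Hypothetical key scheme: the tuple keeps both levels visible.
            flat[(outer_key, inner_key)] = value
    return flat


nested = {"human": {"rep1": 1, "rep2": 2}, "mouse": {"rep1": 3}}
flat = flatten_sketch(nested)
```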

compact.process_data.parse_mapping(tsv_fn)

parse file with identifier mapping into dict

Args:
tsv_fn (str): filepath of mapping file

each line contains 2 tab-separated ids

Returns:

dict: id mapping as {left:right,..}
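To make the expected file layout concrete, here is a minimal sketch of reading a two-column, tab-separated mapping file into a `{left: right}` dict with the standard library. `parse_mapping_sketch` is a hypothetical stand-in, not the CompaCt implementation. Skipping lines with fewer than two fields is an assumption, and the ids are made up.

```python
import csv
import os
import tempfile


def parse_mapping_sketch(tsv_fn):
    """Read lines of 'left<TAB>right' into a {left: right} dict."""
    mapping = {}
    with open(tsv_fn, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2:
                continue  # assumed behaviour: ignore blank or malformed lines
            mapping[row[0]] = row[1]
    return mapping


# Write a tiny made-up mapping file, parse it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("protA\torthA\nprotB\torthB\n")
mapping = parse_mapping_sketch(tmp.name)
os.remove(tmp.name)
```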

compact.process_data.parse_mappings(fn_dict)

parses all orthology files in provided fn_dict

Args:

fn_dict (dict): filepaths of orthology files per comparison

Returns:

dict of dicts: parsed pairwise orthologies as dicts

compact.process_data.fetch_mapping(left_tag, right_tag, mappings)

fetches correct mapping for left:right from mappings

inverts mapping if only right:left mapping is present

returns None if no mapping is available

Args:
[left|right]_tag (str): collection level tags

for which to fetch mapping

mappings (dict): containing id mappings between collections

keys: tuple with (query,subject) collection-level tags

values: dicts with id mappings from query to subject

Returns:

dict: dict with mapping for given ids
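The lookup order described above (direct mapping, then inverted reverse mapping, then None) can be sketched as follows. Both functions are hypothetical stand-ins written from the description, not the CompaCt source. The inversion here is a plain key/value swap, so if several left ids map to the same right id only one of them survives.

```python
def invert_mapping_sketch(mapping_dict):
    """Swap keys and values of a {left: right} dict."""
    return {right: left for left, right in mapping_dict.items()}


def fetch_mapping_sketch(left_tag, right_tag, mappings):
    """Return the left->right mapping, inverting right->left if needed."""
    if (left_tag, right_tag) in mappings:
        return mappings[(left_tag, right_tag)]
    if (right_tag, left_tag) in mappings:
        return invert_mapping_sketch(mappings[(right_tag, left_tag)])
    return None  # no mapping available in either direction


# Made-up tags and ids for illustration.
mappings = {("human", "mouse"): {"h1": "m1", "h2": "m2"}}
forward = fetch_mapping_sketch("human", "mouse", mappings)
reverse = fetch_mapping_sketch("mouse", "human", mappings)
missing = fetch_mapping_sketch("human", "yeast", mappings)
```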

compact.process_data.invert_mapping(mapping_dict)

inverts given dict

compact.process_data.split_prot_ids(df)

if index contains rows with multiple protein ids, split them

Args:

df (pd df): protein df for which index should be split

Returns:

list of lists: list of row indices, each index is list with one or more ids
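As an illustration of the returned shape, the sketch below splits index labels into lists of ids. It works on a plain list of labels rather than a dataframe to stay dependency-free. `split_ids_sketch` is a hypothetical helper, the comma separator follows the description of `split_match_ids`, and stripping surrounding whitespace is an assumption.

```python
def split_ids_sketch(index_labels, separator=","):
    """Split each index label into a list with one or more ids."""
    return [
        [part.strip() for part in str(label).split(separator)]
        for label in index_labels
    ]


# Made-up index labels: the first row carries two protein ids.
index_labels = ["protA,protB", "protC"]
split_ids = split_ids_sketch(index_labels)
```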

compact.process_data.match_ids(ids_list, target_list)

match list with given ids to target list

to be used when dealing with data that have multiple ids per row that need to be narrowed down to 1. It will pick one of the ids per row that ensures the most matches with ids in target list

CARE: IF MULTIPLE MATCHES WITH TARGET, ARBITRARILY PICKS FIRST

Args:

ids_list (list-of-lists): one list of ids per row

target_list (list): list of ids

Returns:

matched_list: list
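The selection rule can be sketched in a few lines. `match_ids_sketch` is a hypothetical stand-in written from the description above. Falling back to the first id of a row when none of its ids occur in the target list is an assumption, since the description does not cover that case.

```python
def match_ids_sketch(ids_list, target_list):
    """Pick one id per row, preferring ids that occur in target_list."""
    targets = set(target_list)  # set lookup keeps the matching step cheap
    matched = []
    for ids in ids_list:
        hits = [identifier for identifier in ids if identifier in targets]
        # First hit wins when several ids match; assumed fallback is ids[0].
        matched.append(hits[0] if hits else ids[0])
    return matched


# Made-up ids: row 1 has one match, row 2 none, row 3 two matches.
matched_list = match_ids_sketch(
    [["A", "B"], ["C"], ["D", "E"]],
    ["B", "E", "D"],
)
```

In the third row both "D" and "E" are in the target list, and the sketch keeps "D" simply because it comes first, which mirrors the caveat above.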

compact.process_data.split_match_ids(df, target_list)

reduces df with multiple ids per row to 1 id per row

picks ids so that the number of matches with target_list is maximized

Args:
df (pd df): dataframe whose index can have multiple ids per row, in comma-separated strings

target_list (list): list of ids to which the dataframe index should be matched

Returns:

pd df: copy of input df with single id per row

compact.process_data.parse_annotation(annot_fn)

parse annotation file into pd.DataFrame

Args:
annot_fn (string): filepath of annotation file

should be tab-separated file with prot_ids as first col and column headers as first row

Returns:

pd.DataFrame: table with protein annotations

compact.process_data.parse_scores(score_fn, rename_dups=True)

parses matrix-structured scores into dataframe

Args:
score_fn: path/fn of tsv table with scores

matrix structure with index and columns

Returns: pd.DataFrame

compact.process_data.parse_top_hits(top_hit_fn)

parse a top hit file

Args:

top_hit_fn (string): path of top hit file

Returns:

pd.DataFrame: top hits as dataframe

compact.process_data.parse_network(net_fn)

parses combined network as generated by CompaCt

Args:

net_fn (string): path to combined network file

Returns:
pd.DataFrame: combined network edges in table format

columns: left_id,right_id,weight

compact.process_data.write_network(net, out_fn)

Saves combined network as generated by CompaCt to file

Args:

net (pd.DataFrame): network to be saved in df format

out_fn (string): location of file to write to

compact.process_data.filter_network(network, comps)

filter provided comparisons from given network

Args:
network (pd df):

network containing multiple comparisons

comps (list of tuples):

comparisons to include in the output network

comparison (tuple, (str,str)): left and right tags of compared samples

Returns:

pd df: subnetwork containing only provided comparisons

compact.process_data.parse_MCL_result(res_fn)

parses MCL result from file into dict

Args:

res_fn (string): filepath of MCL result

Returns:

dict: containing all MCL clusters
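For orientation, here is a sketch of reading MCL output in its label format, where each line holds one cluster as tab-separated member ids. `parse_mcl_sketch` is a hypothetical stand-in. That CompaCt expects this label format, and the use of running integers as dict keys, are both assumptions. The member ids are made up.

```python
import os
import tempfile


def parse_mcl_sketch(res_fn):
    """Read one cluster per line into {cluster_number: [member ids]}."""
    clusters = {}
    with open(res_fn) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore empty lines
            # Hypothetical key scheme: clusters numbered in file order from 0.
            clusters[len(clusters)] = line.split("\t")
    return clusters


# Write a tiny made-up result file, parse it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\tb\tc\nd\te\n")
clusters = parse_mcl_sketch(tmp.name)
os.remove(tmp.name)
```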

compact.process_data.annotate_df(to_annot, annot_fn)

annotate given dataframe of proteins with annotation file

compact.process_data.rename_duplicates(sorted_list, separator='::')

in sorted list of ids, renames occurrences after first

Args:
sorted_list (list):

list of string identifiers

separator (str, optional): Defaults to “::”.

str that will separate original id and number that will be appended

Returns:

list: list with duplicate ids renamed
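A possible reading of this behaviour is sketched below. `rename_duplicates_sketch` is a hypothetical stand-in, and the numbering scheme (the second occurrence gets `1`, the third `2`, and so on) is an assumption, since the description only says that a number is appended after the separator. The sketch relies on the input being sorted, so that equal ids sit next to each other.

```python
def rename_duplicates_sketch(sorted_list, separator="::"):
    """Append separator + counter to every repeat of an id."""
    renamed = []
    previous = None
    count = 0
    for identifier in sorted_list:
        if identifier == previous:
            count += 1  # assumed numbering: first repeat gets 1
            renamed.append(f"{identifier}{separator}{count}")
        else:
            previous = identifier
            count = 0
            renamed.append(identifier)  # first occurrence stays unchanged
    return renamed


renamed = rename_duplicates_sketch(["A", "A", "A", "B"])
```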

compact.process_data.remove_appendices(id_list, separator='::')

removes trailing appendices from ids

Args:
id_list (list):

ids that might have appendix

separator (str, optional): Defaults to “::”.

everything including and after separator will be stripped from the string

Returns:

list of strings: stripped ids
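The stripping rule can be sketched with a single split per id. `remove_appendices_sketch` is a hypothetical stand-in, not the CompaCt source. It cuts at the first occurrence of the separator, so everything from that point on is dropped.

```python
def remove_appendices_sketch(id_list, separator="::"):
    """Drop the separator and everything after it from each id."""
    return [identifier.split(separator, 1)[0] for identifier in id_list]


# Ids without a separator pass through unchanged.
stripped = remove_appendices_sketch(["A::1", "B", "C::2::x"])
```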

compact.process_data.rename_duplicates_int_matrix(df, separator='::')

renames duplicate row and col ids inplace in current df

Args:
df (pd df):

dataframe from which duplicate column and rows will be renamed

separator (str, optional): Defaults to “::”.

everything including and after separator will be stripped from the string

Returns:

int: number of duplicate ids in df index

compact.process_data.add_tag_df(rowtag, coltag, matrix)

prepends the collection tag to each protein id

applied to both rows and columns of the matrix

Args:

[row|col]tag (str): tag to prepend to ids

matrix (pd df): matrix with index and columns

Returns:

pd df: copy of input df with tagged ids

compact.process_data.add_tag_multiindex(left_tag, right_tag, multiindex)

prepend complexome tag to protein ids in 2-level multiindex

Args:
[left|right]_tag (str):

tag to prepend to ids

multiindex (pd MultiIndex): 2-level multiindex,

tags will be prepended to both levels

Returns:

pd MultiIndex: new index with tagged ids

compact.process_data.split_clustmember_tables(nodes, mappings)

get complexome-specific cluster membership tables from nodes

uses mapping to add mappings to other complexomes

Args:
nodes (pd df):

table of cluster nodes

mappings (dict):

containing id mappings between collections

keys: tuple with (query,subject) collection-level tags

values: dicts with id mappings from query to subject

Returns:
dict: containing separate table of nodes per tag

key (str): tag, value (pd df): node table

compact.process_data.parse_gmt(filename)

parses gmt format file with named reference groups

Args:
filename (string):

filepath of .gmt file

Returns:

dict: dict with group names and members
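In the GMT format each line describes one group: the group name, a description field, and then the members, all tab-separated. The sketch below reads such a file into a `{name: [members]}` dict. `parse_gmt_sketch` is a hypothetical stand-in. Discarding the description field and returning members as lists are assumptions about the output shape, and the group names and ids are made up.

```python
import os
import tempfile


def parse_gmt_sketch(filename):
    """Read GMT lines (name, description, members...) into a dict."""
    groups = {}
    with open(filename) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # need a name, a description and at least one member
            name, _description, *members = fields
            groups[name] = [member for member in members if member]
    return groups


# Write a tiny made-up .gmt file, parse it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".gmt", delete=False) as tmp:
    tmp.write("complexI\tmade-up description\tprotA\tprotB\n")
    tmp.write("complexII\tna\tprotC\n")
groups = parse_gmt_sketch(tmp.name)
os.remove(tmp.name)
```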