compact.process_data
general functions for processing and parsing data
- compact.process_data.parse_settings(settings_fn)
parse input settings TSV file
- Args:
- settings_fn (string):
path to settings file to be parsed
- Raises:
- ValueError:
if file content doesn’t match specifications
- Returns tuple with (sample_data,mapping_data):
- sample_data (dict of dicts):
filenames of interaction data per collection
- mapping_data (dict):
filepaths of orthology files per comparison
- compact.process_data.get_nested_tags(corr_dict)
fetches collection-replicate structure from parsed sample data
- Returns: nested_tags (dict)
nested tag structure for profiles
keys: collection-level tags
values: sample-level tags
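The collection-replicate structure described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `nested_tags_sketch`, the example tags and filenames, and the choice of a list for the sample-level tags are all assumptions made for illustration.

```python
def nested_tags_sketch(sample_data):
    """Map each collection-level tag to a list of its sample-level tags.

    Hypothetical helper: assumes sample_data is a dict of dicts shaped
    like {collection_tag: {sample_tag: filename}}.
    """
    return {collection: list(samples) for collection, samples in sample_data.items()}


# hypothetical parsed sample data (tags and filenames are made up)
sample_data = {
    "HUMAN": {"rep1": "human_rep1.tsv", "rep2": "human_rep2.tsv"},
    "MOUSE": {"rep1": "mouse_rep1.tsv"},
}
nested_tags = nested_tags_sketch(sample_data)
```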
- compact.process_data.parse_profile(tsv_fn)
parses standard format complexome profile into df
- Args:
- tsv_fn (str): filepath of file containing:
complexome profile in tab-separated text format.
single header col with fraction ids
single index row with protein ids
numeric abundance values
- Returns:
pd df: complexome profile as dataframe
- compact.process_data.parse_profiles(fn_dict, flat_output=True)
parse profiles for all collections and samples in fn_dict
- Args:
- fn_dict (nested dict):
filenames of interaction data per collection
- flat_output (bool, optional). Defaults to True.
if True, will return a flat dictionary without a nested dictionary per collection. If False, will keep the nested structure with a separate nested dict per collection
- Returns:
dict: parsed sample data. A flat dict if flat_output is True, otherwise a dict of dicts in nested collection-replicate structure
- compact.process_data.flatten_nested_dict(nested_dict)
converts nested dict to flat dict.
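The entry above does not say how the flat keys are built, so the following is only a sketch under one assumption: the inner (sample-level) keys are unique across collections and are reused as the flat keys. The name `flatten_sketch` and the duplicate-key check are hypothetical.

```python
def flatten_sketch(nested_dict):
    """Flatten {outer: {inner: value}} into {inner: value}.

    Assumption for illustration: inner keys are unique across all outer
    keys. A clash raises ValueError instead of silently overwriting.
    """
    flat = {}
    for outer, inner_dict in nested_dict.items():
        for inner, value in inner_dict.items():
            if inner in flat:
                raise ValueError(f"duplicate key {inner!r} under {outer!r}")
            flat[inner] = value
    return flat


# hypothetical nested input
nested = {"HUMAN": {"h_rep1": 1, "h_rep2": 2}, "MOUSE": {"m_rep1": 3}}
flat = flatten_sketch(nested)
```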
- compact.process_data.parse_mapping(tsv_fn)
parse file with identifier mapping into dict
- Args:
- tsv_fn (str): filepath of mapping file
each line contains 2 tab-separated ids
- Returns:
dict: id mapping as {left:right,..}
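As a rough illustration of the file layout described above (two tab-separated ids per line, read into `{left: right}`), here is a minimal sketch. It reads from an open file handle rather than a path so it can run on in-memory text. The name `parse_mapping_sketch` and the decision to skip lines with fewer than two fields are assumptions, not the library's documented behaviour.

```python
import csv
import io


def parse_mapping_sketch(handle):
    """Read two-column tab-separated lines into a {left: right} dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            # assumption: blank or incomplete lines are ignored
            continue
        mapping[row[0]] = row[1]
    return mapping


# hypothetical mapping file content
example_file = io.StringIO("P1\tQ1\nP2\tQ2\n\n")
mapping = parse_mapping_sketch(example_file)
```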
- compact.process_data.parse_mappings(fn_dict)
parses all orthology files in provided fn_dict
- Args:
fn_dict (dict): filepaths of orthology files per comparison
- Returns:
dict of dicts: parsed pairwise orthologies as dicts
- compact.process_data.fetch_mapping(left_tag, right_tag, mappings)
fetches correct mapping for left:right from mappings
inverts mapping if only right:left mapping is present
returns None if no mapping is available
- Args:
- [left|right]_tag (str): collection level tags
for which to fetch mapping
- mappings (dict): containing id mappings between collections
keys: tuple with (query,subject) collection-level tags
values: dicts with id mappings from query to subject
- Returns:
dict: mapping for the given tags, or None if no mapping is available
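The lookup order described above (direct mapping, then inverted reverse mapping, then None) can be sketched as follows. The helper names `invert_sketch` and `fetch_mapping_sketch` and the example tags are hypothetical. Note that inverting a plain dict keeps only one key per value, so a many-to-one mapping loses entries when inverted.

```python
def invert_sketch(mapping_dict):
    """Swap keys and values; later duplicates of a value overwrite earlier ones."""
    return {right: left for left, right in mapping_dict.items()}


def fetch_mapping_sketch(left_tag, right_tag, mappings):
    """Return the left-to-right id mapping, inverting right-to-left if needed."""
    if (left_tag, right_tag) in mappings:
        return mappings[(left_tag, right_tag)]
    if (right_tag, left_tag) in mappings:
        return invert_sketch(mappings[(right_tag, left_tag)])
    return None  # no mapping available in either direction


# hypothetical mappings keyed by (query, subject) collection tags
mappings = {("HUMAN", "MOUSE"): {"h1": "m1", "h2": "m2"}}
direct = fetch_mapping_sketch("HUMAN", "MOUSE", mappings)
inverted = fetch_mapping_sketch("MOUSE", "HUMAN", mappings)
missing = fetch_mapping_sketch("HUMAN", "YEAST", mappings)
```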
- compact.process_data.invert_mapping(mapping_dict)
inverts given dict
- compact.process_data.split_prot_ids(df)
if index contains rows with multiple protein ids, split them
- Args:
df (pd df): protein df for which index should be split
- Returns:
list of lists: list of row indices, each index is list with one or more ids
- compact.process_data.match_ids(ids_list, target_list)
match list with given ids to target list
to be used when dealing with data that have multiple ids per row that need to be narrowed down to 1. It will pick one of the ids per row that ensures the most matches with ids in target list
CARE: if multiple ids in a row match the target list, the first one is picked arbitrarily
- Args:
- ids_list (list-of-lists): one list of ids per row
- target_list (list): list of ids
- Returns:
matched_list (list): one selected id per row
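The selection rule described above can be sketched in a few lines. This is an illustration rather than the library's code: the name `match_ids_sketch` is hypothetical, and falling back to the first id when no id in a row matches the target list is an assumption the entry does not confirm.

```python
def match_ids_sketch(ids_list, target_list):
    """Pick one id per row, preferring ids that occur in target_list."""
    targets = set(target_list)  # set lookup keeps the membership test cheap
    matched = []
    for ids in ids_list:
        hits = [identifier for identifier in ids if identifier in targets]
        # first hit wins; assumption: fall back to the first id if nothing matches
        matched.append(hits[0] if hits else ids[0])
    return matched


# hypothetical rows with one or more ids each
rows = [["A", "B"], ["C"], ["D", "E"]]
matched = match_ids_sketch(rows, ["B", "E", "D"])
```

In the third row both `D` and `E` occur in the target list, so the first one listed in the row is kept.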
- compact.process_data.split_match_ids(df, target_list)
reduces df with multiple ids per row to 1 id per row
picks ids so that the number of matches with target_list is maximized
- Args:
- df (pd df): index that can have multiple
ids per row, in comma-separated strings
- target_list (list): list of ids to which
dataframe index should be matched
- Returns:
pd df: copy of input df with single id per row
- compact.process_data.parse_annotation(annot_fn)
parse annotation file into pd.DataFrame
- Args:
- annot_fn (string): filepath of annotation file
should be tab-separated file with prot_ids as first col and column headers as first row
- Returns:
pd.DataFrame: table with protein annotations
- compact.process_data.parse_scores(score_fn, rename_dups=True)
parses matrix-structured scores into dataframe
- Args:
- score_fn: path/fn of tsv table with scores
matrix structure with index and columns
- Returns:
pd.DataFrame
- compact.process_data.parse_top_hits(top_hit_fn)
parse a top hit file
- Args:
top_hit_fn (string): path of top hit file
- Returns:
pd.DataFrame: top hits as dataframe
- compact.process_data.parse_network(net_fn)
parses combined network as generated by CompaCt
- Args:
net_fn (string): path to combined network file
- Returns:
- pd.DataFrame: combined network edges in table format
columns: left_id,right_id,weight
- compact.process_data.write_network(net, out_fn)
Saves combined network as generated by CompaCt to file
- Args:
- net (pd.DataFrame): network to be saved in df format
- out_fn (string): location of file to write to
- compact.process_data.filter_network(network, comps)
filters given network, keeping only the provided comparisons
- Args:
- network (pd df):
network containing multiple comparisons
- comps (list of tuples):
comparisons to include in the output network
comparison (tuple, (str,str)): left and right tags of compared samples
- Returns:
pd df: subnetwork containing only provided comparisons
- compact.process_data.parse_MCL_result(res_fn)
parses MCL result from file into dict
- Args:
res_fn (string): filepath of MCL result
- Returns:
dict: containing all MCL clusters
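For orientation, here is a minimal sketch of reading an MCL result, assuming the label-based output format in which each line is one cluster and members are tab-separated. The name `parse_mcl_sketch` and the use of 0-based integer cluster ids as dict keys are assumptions; the entry above does not specify the key type.

```python
import io


def parse_mcl_sketch(handle):
    """Read one cluster per line (tab-separated members) into a dict."""
    clusters = {}
    for line in handle:
        members = [m for m in line.rstrip("\n").split("\t") if m]
        if members:
            # assumption: clusters are keyed by their order of appearance
            clusters[len(clusters)] = members
    return clusters


# hypothetical MCL output with two clusters
example_result = io.StringIO("HUMAN::p1\tMOUSE::q1\tMOUSE::q2\nHUMAN::p3\tMOUSE::q3\n")
clusters = parse_mcl_sketch(example_result)
```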
- compact.process_data.annotate_df(to_annot, annot_fn)
annotate given dataframe of proteins with annotation file
- compact.process_data.rename_duplicates(sorted_list, separator='::')
in sorted list of ids, renames occurrences after first
- Args:
- sorted_list (list):
list of string identifiers
- separator (str, optional): Defaults to “::”.
str that will separate original id and number that will be appended
- Returns:
list: list with duplicate ids renamed
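The renaming scheme described above can be sketched like this. The name `rename_duplicates_sketch` is hypothetical, and the numbering (the second occurrence gets `1`, the third `2`, and so on) is an assumption, since the entry only states that a number is appended after the separator.

```python
def rename_duplicates_sketch(sorted_list, separator="::"):
    """Append separator plus a counter to every occurrence after the first."""
    seen = {}
    renamed = []
    for identifier in sorted_list:
        count = seen.get(identifier, 0)
        if count == 0:
            renamed.append(identifier)  # first occurrence stays unchanged
        else:
            renamed.append(f"{identifier}{separator}{count}")
        seen[identifier] = count + 1
    return renamed


renamed = rename_duplicates_sketch(["A", "A", "A", "B"])
```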
- compact.process_data.remove_appendices(id_list, separator='::')
removes trailing appendices from ids
- Args:
- id_list (list):
ids that might have appendix
- separator (str, optional): Defaults to “::”.
everything including and after separator will be stripped from the string
- Returns:
list of strings: stripped ids
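Stripping everything from the separator onwards can be sketched with `str.split`. The name `remove_appendices_sketch` is hypothetical, and cutting at the first occurrence of the separator is an assumption about ids that contain it more than once.

```python
def remove_appendices_sketch(id_list, separator="::"):
    """Drop the separator and everything after it from each id."""
    # split once at the first separator and keep the part before it
    return [identifier.split(separator, 1)[0] for identifier in id_list]


stripped = remove_appendices_sketch(["A", "A::1", "B::2"])
```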
- compact.process_data.rename_duplicates_int_matrix(df, separator='::')
renames duplicate row and col ids in place in current df
- Args:
- df (pd df):
dataframe from which duplicate column and rows will be renamed
- separator (str, optional): Defaults to “::”.
str that will separate original id and number that will be appended
- Returns:
int: number of duplicate ids in df index
- compact.process_data.add_tag_df(rowtag, coltag, matrix)
prepends the collection tag to each protein id
applied to both rows and columns of the matrix
- Args:
- [row|col]tag (str): tag to prepend to ids
- matrix (pd df): matrix with index and columns
- Returns:
pd df: copy of input df with tagged ids
- compact.process_data.add_tag_multiindex(left_tag, right_tag, multiindex)
prepend complexome tag to protein ids in 2-level multiindex
- Args:
- [left|right]_tag (str):
tag to prepend to ids
- multiindex (pd MultiIndex): 2-level multiindex,
tags will be prepended to both levels
- Returns:
pd MultiIndex: new index with tagged ids
- compact.process_data.split_clustmember_tables(nodes, mappings)
get complexome-specific cluster membership tables from nodes
uses mapping to add mappings to other complexomes
- Args:
- nodes (pd df):
table of cluster nodes
- mappings (dict):
containing id mappings between collections
keys: tuple with (query,subject) collection-level tags
values: dicts with id mappings from query to subject
- Returns:
- dict: containing separate table of nodes per tag
key (str): tag, value (pd df): node table
- compact.process_data.parse_gmt(filename)
parses gmt format file with named reference groups
- Args:
- filename (string):
filepath of .gmt file
- Returns:
dict: dict with group names and members
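A minimal sketch of reading the GMT layout follows: one group per line, with the group name, a description field, and then the members, all tab-separated. The name `parse_gmt_sketch`, storing members as a list, and skipping lines without any members are assumptions made for illustration.

```python
import io


def parse_gmt_sketch(handle):
    """Read GMT lines (name, description, members...) into {name: [members]}."""
    groups = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            # assumption: lines without at least one member are ignored
            continue
        name, _description, *members = fields
        groups[name] = [m for m in members if m]
    return groups


# hypothetical .gmt content with two reference groups
example_gmt = io.StringIO("complexI\tna\tP1\tP2\tP3\ncomplexII\tna\tP4\tP5\n")
groups = parse_gmt_sketch(example_gmt)
```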