psite_annotation.addPeptideAndPsitePositions

psite_annotation.addPeptideAndPsitePositions(df, fastaFile, pspInput=False, returnAllPotentialSites=False, localization_uncertainty=0, context_left=15, context_right=15, retain_other_mods=False, mod_dict=None, return_unique=False, return_sorted=False, organism='human')

Annotate pandas dataframe with positions of the peptide within the protein sequence based on a fasta file.

Adds the following annotation columns to dataframe:

‘Matched proteins’ = subset of the identifiers in the input ‘Proteins’ column for which the peptide was actually found in the protein sequence. If the same peptide occurs multiple times in the same protein sequence, the protein identifier is repeated.
‘Start positions’ = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons, in the same order as the entries in the ‘Matched proteins’ column.
‘End positions’ = end positions of the modified peptide in the protein sequence (see above for details)
‘Site positions’ = position of the modification (see ‘Start positions’ above for details on how the position is counted)
‘Site sequence context’ = amino acids surrounding each of the modified sites, separated by semicolons. By default this is +/- 15 amino acids; the window is controlled by context_left and context_right.
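
How the position columns relate to each other can be illustrated with a small pure-Python sketch. This is not the library's implementation: the helper name locate_sites, the lower-case marking of the modified residue, and the "_" padding at the protein termini are assumptions made for illustration only.

```python
import re


def locate_sites(protein_seq, modified_peptide, context_left=15, context_right=15):
    """Illustrative sketch: return (start, end, site, context) tuples.

    Assumes modified residues are written in lower case, e.g. "AAsPEK".
    """
    plain = modified_peptide.upper()
    results = []
    # The lookahead also finds overlapping occurrences of the peptide.
    for match in re.finditer(f"(?={re.escape(plain)})", protein_seq):
        start = match.start() + 1        # 1-based, methionine is counted
        end = start + len(plain) - 1
        for offset, residue in enumerate(modified_peptide):
            if not residue.islower():
                continue
            site = start + offset        # 1-based position of the modified residue
            idx = site - 1               # 0-based index into the protein string
            left = protein_seq[max(0, idx - context_left):idx]
            right = protein_seq[idx + 1:idx + 1 + context_right]
            # Padding with "_" near the termini is an assumption of this sketch.
            context = left.rjust(context_left, "_") + residue + right.ljust(context_right, "_")
            results.append((start, end, site, context))
    return results


# A toy protein in which the peptide occurs twice.
print(locate_sites("MAASPEKTTSPEKG", "sPEK", context_left=3, context_right=3))
# [(4, 7, 4, 'MAAsPEK'), (10, 13, 10, 'KTTsPEK')]
```

In the annotated dataframe, the values for several matches are joined with semicolons, one entry per protein listed in ‘Matched proteins’.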

Example

Annotate with psite positions as given by PhosphoSitePlus:

import psite_annotation as pa
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)

Annotate a custom modification:

df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True, mod_dict={'R[0.9840]': 'r'})
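
The effect of mod_dict can be pictured as a plain string replacement that turns each modification tag into a single lower-case residue. The sketch below is a hypothetical stand-in, not the library's code: apply_mod_dict is an invented helper, and only the default mapping and the 'R[0.9840]' example come from this page.

```python
def apply_mod_dict(modified_sequence, mod_dict=None):
    """Illustrative sketch: replace modification tags by single letters."""
    if mod_dict is None:
        # Default S, T and Y phosphorylation annotations, as documented.
        mod_dict = {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"}
    for tag, replacement in mod_dict.items():
        modified_sequence = modified_sequence.replace(tag, replacement)
    return modified_sequence


print(apply_mod_dict("AAS(ph)PEK"))                          # AAsPEK
print(apply_mod_dict("AAR[0.9840]PEK", {"R[0.9840]": "r"}))  # AArPEK
```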

Required columns:

Proteins, Modified sequence

Parameters:

df (DataFrame) – pandas dataframe with “Proteins” and “Modified sequence” columns
fastaFile (str) – fasta file containing protein sequences
pspInput (bool) – set to True if the fasta file was obtained from PhosphoSitePlus
returnAllPotentialSites (bool) – return all modifiable positions within the peptide as potential p-sites.
localization_uncertainty (int) – return all modifiable positions within n positions of modified sites as potential p-sites.
context_left (int) – number of amino acids to the left of the modification to include
context_right (int) – number of amino acids to the right of the modification to include
retain_other_mods (bool) – retain other modifications from the modified peptide in the sequence context in lower case
mod_dict (Optional[Dict[str, str]]) – dictionary of modifications to single amino acid replacements, e.g. {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"}. If set to None, uses the default annotations for S, T and Y phosphorylation.
return_unique (bool) – eliminate duplicates from the ‘Site sequence context’ and ‘Site positions’ columns. The entries then no longer line up with each other or with the other columns of the dataframe.
return_sorted (bool) – sort the ‘Site sequence context’ and ‘Site positions’ columns alphabetically. The entries then no longer line up with each other or with the other columns of the dataframe.
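
What return_unique and return_sorted do to a semicolon-separated cell can be sketched as follows. The helper collapse and the example values are hypothetical, and the real function may differ in detail (for example in how it orders unique entries).

```python
def collapse(cell, return_unique=False, return_sorted=False):
    """Illustrative sketch: post-process one semicolon-separated cell."""
    items = cell.split(";")
    if return_unique:
        # dict.fromkeys drops duplicates while keeping first occurrences.
        items = list(dict.fromkeys(items))
    if return_sorted:
        # Plain string sort, so "S10" comes before "S4".
        items = sorted(items)
    return ";".join(items)


print(collapse("S10;S4;S10", return_unique=True))                      # S10;S4
print(collapse("S4;S10;S4", return_unique=True, return_sorted=True))   # S10;S4
```

Because each column is processed on its own, the n-th entry of one column no longer has to belong to the n-th entry of ‘Matched proteins’, which is why both options are off by default.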

Returns:

annotated dataframe

Return type:

pd.DataFrame