psite_annotation.addPeptideAndPsitePositions

psite_annotation.addPeptideAndPsitePositions(df, fastaFile, pspInput=False, returnAllPotentialSites=False, localization_uncertainty=0, context_left=15, context_right=15, retain_other_mods=False, mod_dict=None, return_unique=False, return_sorted=False, organism='human')

Annotate pandas dataframe with positions of the peptide within the protein sequence based on a fasta file.

Adds the following annotation columns to dataframe:

  • ‘Matched proteins’ = subset of the identifiers in the input ‘Proteins’ column whose protein sequence actually contains the peptide. If the same peptide is found multiple times in the same protein sequence, the protein identifier is repeated.

  • ‘Start positions’ = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons, in the same order as the entries in the ‘Matched proteins’ column.

  • ‘End positions’ = end positions of the modified peptide in the protein sequence (see above for details)

  • ‘Site positions’ = position of the modification (see ‘Start positions’ above for details on how the position is counted)

  • ‘Site sequence context’ = amino acids flanking each of the modified sites (15 on either side by default, see context_left and context_right), separated by semicolons
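
The position columns can be illustrated on a toy sequence. The following is a minimal sketch, not the library's implementation: annotate_peptide is a hypothetical helper, the protein and peptide are made up, the modified residue is assumed to be written in lower case already, and the context is simply truncated at the protein termini (how the library treats termini is not described here).

```python
def annotate_peptide(protein_sequence, marked_peptide, context_left=15, context_right=15):
    """Return (start, end, site_positions, contexts) for every occurrence of the peptide.

    Hypothetical helper: `marked_peptide` carries modified residues in lower case,
    e.g. "TLsPEK" for a phosphorylated serine. All positions are 1-based.
    """
    plain = marked_peptide.upper()
    matches = []
    index = protein_sequence.find(plain)
    while index != -1:
        start = index + 1          # 1-based start, initial methionine counted
        end = index + len(plain)   # 1-based, inclusive end
        site_positions, contexts = [], []
        for offset, residue in enumerate(marked_peptide):
            if not residue.islower():
                continue
            site_index = index + offset  # 0-based index of the modified residue
            site_positions.append(site_index + 1)
            left = protein_sequence[max(0, site_index - context_left):site_index]
            right = protein_sequence[site_index + 1:site_index + 1 + context_right]
            contexts.append(left + residue + right)
        matches.append((start, end, site_positions, contexts))
        # Keep searching so that repeated occurrences in the same protein are reported.
        index = protein_sequence.find(plain, index + 1)
    return matches


protein = "MAAASPKKTLSPEK"
result = annotate_peptide(protein, "TLsPEK", context_left=3, context_right=3)
print(result)
```

In this toy case the peptide spans positions 9 to 14, the modified serine sits at position 11, and the context with three flanking residues is KTLsPEK.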

Example

Annotate with psite positions as given by PhosphoSitePlus:

import psite_annotation as pa
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput=True)

Annotate a custom modification:

df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True, mod_dict={'R[0.9840]': 'r'})
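
What a mod_dict expresses can be sketched in plain Python: each modification tag in the modified sequence is replaced by a single lower-case letter that marks the site. This is only an illustration under stated assumptions; apply_mod_dict is a hypothetical helper, not part of psite_annotation, and the peptide strings are made up.

```python
def apply_mod_dict(modified_sequence, mod_dict):
    """Replace every modification tag by its single-letter, lower-case marker."""
    for tag, letter in mod_dict.items():
        modified_sequence = modified_sequence.replace(tag, letter)
    return modified_sequence


# Custom modification, using the tag from the example above.
custom = apply_mod_dict("GAR[0.9840]LK", {"R[0.9840]": "r"})

# Mapping for S, T and Y phosphorylation as written in the mod_dict parameter description.
phospho = apply_mod_dict("AS(ph)PT(ph)K", {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"})

print(custom, phospho)
```

With these inputs, the custom tag yields GArLK and the phosphorylation mapping yields AsPtK.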

Required columns:

Proteins, Modified sequence

Parameters:
  • df (DataFrame) – pandas dataframe with “Proteins” and “Modified sequence” columns

  • fastaFile (str) – fasta file containing protein sequences

  • pspInput (bool) – set to True if the fasta file was obtained from PhosphoSitePlus

  • returnAllPotentialSites (bool) – return all modifiable positions within the peptide as potential p-sites.

  • localization_uncertainty (int) – return all modifiable positions within n positions of modified sites as potential p-sites.

  • context_left (int) – number of amino acids to the left of the modification to include

  • context_right (int) – number of amino acids to the right of the modification to include

  • retain_other_mods (bool) – retain other modifications from the modified peptide in the sequence context in lower case

  • mod_dict (Optional[Dict[str, str]]) – dictionary of modifications to single amino acid replacements, e.g. {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"}. If set to None, uses the default annotations for S, T and Y phosphorylation.

  • return_unique (bool) – eliminate duplicates from the ‘Site sequence context’ and ‘Site positions’ columns, not preserving the order between them and the rest of the data frame

  • return_sorted (bool) – sort the ‘Site sequence context’ and ‘Site positions’ columns alphabetically, not preserving the order between them and the rest of the data frame
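
The site-selection and post-processing options above can be read as follows. This is a hedged sketch of one plausible interpretation, not the library's code: potential_site_offsets, unique_entries and sorted_entries are hypothetical helpers, offsets are 0-based within the peptide, modified residues are assumed to be in lower case, and S, T and Y are assumed to be the modifiable residues.

```python
def potential_site_offsets(marked_peptide, modifiable="STY",
                           localization_uncertainty=0, return_all=False):
    """Return 0-based peptide offsets of residues reported as potential sites."""
    modified = [i for i, aa in enumerate(marked_peptide) if aa.islower()]
    offsets = []
    for i, aa in enumerate(marked_peptide):
        if aa.upper() not in modifiable:
            continue
        # A residue counts as "near" if it lies within n positions of a modified site.
        near = any(abs(i - m) <= localization_uncertainty for m in modified)
        if return_all or near:
            offsets.append(i)
    return offsets


def unique_entries(joined):
    """Drop duplicate entries from a semicolon-separated string."""
    return ";".join(dict.fromkeys(joined.split(";")))


def sorted_entries(joined):
    """Sort the entries of a semicolon-separated string alphabetically."""
    return ";".join(sorted(joined.split(";")))


peptide = "AtSPEYK"  # threonine at offset 1 is modified
only_modified = potential_site_offsets(peptide)
within_one = potential_site_offsets(peptide, localization_uncertainty=1)
all_sites = potential_site_offsets(peptide, return_all=True)

contexts = "KTLsPEK;AAAsPKK;KTLsPEK"
deduplicated = unique_entries(contexts)
ordered = sorted_entries(contexts)
```

Under these assumptions, only offset 1 is reported by default, offsets 1 and 2 with an uncertainty of 1, and offsets 1, 2 and 5 when all potential sites are requested. Deduplicating or sorting the joined strings means their entries no longer line up one-to-one with the other columns.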

Returns:

annotated dataframe

Return type:

pd.DataFrame