psite_annotation.addPeptideAndPsitePositions
- psite_annotation.addPeptideAndPsitePositions(df, fastaFile, pspInput=False, returnAllPotentialSites=False, localization_uncertainty=0, context_left=15, context_right=15, retain_other_mods=False, mod_dict=None, return_unique=False, return_sorted=False, organism='human')
Annotate pandas dataframe with positions of the peptide within the protein sequence based on a fasta file.
Adds the following annotation columns to dataframe:
‘Matched proteins’ = subset of the proteins listed in the input ‘Proteins’ column in which the peptide could actually be found. If the same peptide is found multiple times in the same protein sequence, the protein identifier is repeated.
‘Start positions’ = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons in the same order as they are listed in the ‘Matched proteins’ column.
‘End positions’ = end positions of the modified peptide in the protein sequence (see ‘Start positions’ above for details).
‘Site positions’ = positions of the modifications (see ‘Start positions’ above for details on how the position is counted).
‘Site sequence context’ = amino acids surrounding each of the modified sites, separated by semicolons. By default this is +/- 15 amino acids; the window is set by context_left and context_right.
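To make the 1-based counting and the context window concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name locate_peptide, the site_offset argument, and the use of "_" as padding at the protein termini are assumptions made for illustration only.

```python
def locate_peptide(protein_seq, peptide, site_offset,
                   context_left=15, context_right=15):
    """Return (start, end, site, context) with 1-based positions.

    site_offset is the 0-based index of the modified residue within the
    unmodified peptide sequence. Hypothetical helper, not part of
    psite_annotation.
    """
    idx = protein_seq.find(peptide)  # 0-based index, -1 if not found
    if idx == -1:
        return None

    start = idx + 1                  # 1-based, initiator methionine counted
    end = idx + len(peptide)         # 1-based, inclusive
    site = idx + site_offset + 1     # 1-based position of the modified residue

    # Assumption: pad with "_" so that sites near a terminus still get a
    # window of fixed width.
    padded = "_" * context_left + protein_seq + "_" * context_right
    center = (site - 1) + context_left
    context = padded[center - context_left:center + context_right + 1]
    return start, end, site, context


protein = "MAAASPKKTLR"
result = locate_peptide(protein, "ASPK", site_offset=1,
                        context_left=2, context_right=2)
print(result)
```

With several matching proteins, the same calculation would be repeated per protein and the results joined with semicolons in the order of ‘Matched proteins’.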
Example
Annotate with psite positions as given by PhosphoSitePlus:
import psite_annotation as pa
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True)
Annotate a custom modification:
df = pa.addPeptideAndPsitePositions(df, pa.pspFastaFile, pspInput = True, mod_dict={'R[0.9840]': 'r'})
- Required columns:
Proteins, Modified sequence
- Parameters:
df (DataFrame) – pandas dataframe with “Proteins” and “Modified sequence” columns
fastaFile (str) – fasta file containing protein sequences
pspInput (bool) – set to True if the fasta file was obtained from PhosphoSitePlus
returnAllPotentialSites (bool) – return all modifiable positions within the peptide as potential p-sites
localization_uncertainty (int) – return all modifiable positions within n positions of modified sites as potential p-sites
context_left (int) – number of amino acids to the left of the modification to include
context_right (int) – number of amino acids to the right of the modification to include
retain_other_mods (bool) – retain other modifications from the modified peptide in the sequence context in lower case
mod_dict (Optional[Dict[str, str]]) – dictionary of modifications to single amino acid replacements, e.g. {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"}. If set to None, uses the default annotations for S, T and Y phosphorylation.
return_unique (bool) – eliminate duplicates from the ‘Site sequence context’ and ‘Site positions’ columns, not preserving the order between them and the rest of the data frame
return_sorted (bool) – sort the ‘Site sequence context’ and ‘Site positions’ columns alphabetically, not preserving the order between them and the rest of the data frame
- Returns:
annotated dataframe
- Return type:
pd.DataFrame
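The interplay of mod_dict and localization_uncertainty can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the helpers mark_sites and potential_sites are hypothetical, other modifications are assumed to be written in parentheses or square brackets and are simply stripped, and "S", "T" and "Y" are assumed to be the modifiable residues.

```python
import re

# Default given in the parameter description: S, T and Y phosphorylation.
DEFAULT_MOD_DICT = {"S(ph)": "s", "T(ph)": "t", "Y(ph)": "y"}


def mark_sites(modified_sequence, mod_dict=None):
    """Replace each modification of interest by its lower-case letter.

    Returns the marked sequence and the 0-based offsets of the marked
    residues. Hypothetical helper, not part of psite_annotation.
    """
    if mod_dict is None:
        mod_dict = DEFAULT_MOD_DICT
    marked = modified_sequence.strip("_")
    for mod, replacement in mod_dict.items():
        marked = marked.replace(mod, replacement)
    # Assumption: any remaining "(...)" or "[...]" annotation is another
    # modification and is dropped.
    marked = re.sub(r"\(.*?\)|\[.*?\]", "", marked)
    offsets = [i for i, aa in enumerate(marked) if aa.islower()]
    return marked, offsets


def potential_sites(marked, offsets, n, modifiable="STY"):
    """Collect modifiable residues within n positions of each marked site."""
    found = set(offsets)
    for off in offsets:
        lo, hi = max(0, off - n), min(len(marked), off + n + 1)
        for i in range(lo, hi):
            if marked[i].upper() in modifiable:
                found.add(i)
    return sorted(found)


marked, offsets = mark_sites("AS(ph)PM(ox)TK")
print(marked, offsets)
print(potential_sites(marked, offsets, n=3))

# Custom modification, as in the mod_dict example above
print(mark_sites("AR[0.9840]K", mod_dict={"R[0.9840]": "r"}))
```

The offsets here are relative to the peptide; adding the peptide's start position would turn them into protein-level site positions.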