psite_annotation.annotators.PeptidePositionAnnotator

class psite_annotation.annotators.PeptidePositionAnnotator(annotation_file, pspInput=False, returnAllPotentialSites=False, localization_uncertainty=0, mod_dict={'S(Phospho (STY))': 's', 'S(ph)': 's', 'T(Phospho (STY))': 't', 'T(ph)': 't', 'Y(Phospho (STY))': 'y', 'Y(ph)': 'y', 'pS': 's', 'pT': 't', 'pY': 'y'}, return_unique=False, return_sorted=False, organism='human')

Bases: object

Annotate a pandas dataframe with the positions of each peptide within its protein sequence, based on a FASTA file.

Example

annotator = PeptidePositionAnnotator(<path_to_annotation_file>)
annotator.load_annotations()
df = annotator.annotate(df)

Initialize the input files and options for PeptidePositionAnnotator.

Parameters:

annotation_file (str) – FASTA file containing protein sequences
pspInput (bool) – set to True if the FASTA file was obtained from PhosphoSitePlus
returnAllPotentialSites (bool) – return all modifiable positions within the peptide as potential p-sites
localization_uncertainty (int) – return all modifiable positions within this many positions of a modified site as potential p-sites
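mod_dict (dict) – mapping from modification strings to the lowercase one-letter code of the modified residue, used to capture all modification strings (see the default in the signature above)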
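
The following is a minimal sketch of how the default mapping in the signature and the localization_uncertainty option could be interpreted. It is not the library's implementation: the helper names mark_modified_residues and potential_sites are hypothetical, and the assumption that "modifiable" means S, T or Y is ours.

```python
# Default mapping copied from the class signature above.
MOD_DICT = {
    "S(Phospho (STY))": "s", "S(ph)": "s",
    "T(Phospho (STY))": "t", "T(ph)": "t",
    "Y(Phospho (STY))": "y", "Y(ph)": "y",
    "pS": "s", "pT": "t", "pY": "y",
}


def mark_modified_residues(modified_sequence, mod_dict=MOD_DICT):
    """Hypothetical helper: replace each modification string with its lowercase letter."""
    # Replace longer keys first so a short key never clobbers part of a longer one.
    for key in sorted(mod_dict, key=len, reverse=True):
        modified_sequence = modified_sequence.replace(key, mod_dict[key])
    return modified_sequence


def potential_sites(marked_sequence, localization_uncertainty=0, residues="STY"):
    """Hypothetical helper: 0-based peptide indices of modified residues plus
    modifiable residues (assumed S/T/Y) within n positions of a modified one."""
    modified = [i for i, aa in enumerate(marked_sequence) if aa.islower()]
    sites = set(modified)
    for i in modified:
        lo = max(0, i - localization_uncertainty)
        hi = min(len(marked_sequence), i + localization_uncertainty + 1)
        sites.update(j for j in range(lo, hi) if marked_sequence[j].upper() in residues)
    return sorted(sites)


marked = mark_modified_residues("AAS(ph)PETK")
print(marked)                      # lowercase letter marks the modified residue
print(potential_sites(marked, 3))  # modified site plus nearby S/T/Y residues
```

Both notations in the default mapping, such as S(ph) and pS, collapse to the same lowercase letter in this sketch, which is why a single dictionary can cover several input formats.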

Methods

`annotate`	Adds columns regarding the peptide position within the protein to a pandas dataframe.
`load_annotations`	Reads in protein sequences from fasta file.

annotate(df, inplace=False)

Adds columns regarding the peptide position within the protein to a pandas dataframe.

Adds the following annotation columns to dataframe:

‘Matched proteins’ = subset of the identifiers in the input ‘Proteins’ column in whose sequence the peptide could be found. If the same peptide is found multiple times, the protein identifier is repeated.
‘Start positions’ = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons in the same order as they are listed in the ‘Matched proteins’ column.
‘End positions’ = end positions of the modified peptide in the protein sequence (see above for details)
‘Site positions’ = position of the modification (see ‘Start positions’ above for details on how the position is counted)
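
The position columns described above can be illustrated with a small sketch. It assumes the modified residue is marked in lowercase and that end positions are inclusive; the helper find_positions is hypothetical and not part of psite_annotation.

```python
def find_positions(marked_peptide, protein_sequence):
    """Hypothetical helper: 1-based (start, end, site positions) per occurrence."""
    plain = marked_peptide.upper()
    # Offsets of lowercase (modified) residues inside the peptide.
    offsets = [i for i, aa in enumerate(marked_peptide) if aa.islower()]
    hits = []
    idx = protein_sequence.find(plain)
    while idx != -1:
        start = idx + 1         # 1-based: the first residue (methionine) is position 1
        end = idx + len(plain)  # assumed to be inclusive
        hits.append((start, end, [start + off for off in offsets]))
        idx = protein_sequence.find(plain, idx + 1)  # also catches overlapping repeats
    return hits


hits = find_positions("AAsPETK", "MKAASPETKR")
# Multiple matches would be joined with semicolons, as in the column description.
start_positions = ";".join(str(start) for start, _, _ in hits)
print(hits, start_positions)
```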

Parameters:

df (DataFrame) – pandas dataframe to be annotated; must contain “Proteins” and “Modified sequence” columns
inplace (bool) – add the new columns to df in place

Returns:

annotated dataframe

Return type:

pd.DataFrame

Required columns:

Proteins, Modified sequence

load_annotations()

Reads in protein sequences from fasta file.

Return type:

None
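
For readers unfamiliar with the format, the sketch below shows one way protein sequences can be read from a FASTA file into a dictionary. It is an illustration only: parse_fasta is a hypothetical helper, and using the first whitespace-delimited token of the header as the identifier is an assumption, not necessarily how load_annotations parses headers.

```python
import io


def parse_fasta(handle):
    """Hypothetical helper: map each header's first token to its sequence."""
    sequences = {}
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if identifier is not None:
                sequences[identifier] = "".join(chunks)
            tokens = line[1:].split()
            identifier = tokens[0] if tokens else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if identifier is not None:
        sequences[identifier] = "".join(chunks)
    return sequences


example = io.StringIO(">P1 example protein\nMKAAS\nPETKR\n>P2\nMSKSK\n")
proteins = parse_fasta(example)
print(proteins)
```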