psite_annotation.annotators.PeptidePositionAnnotator

class psite_annotation.annotators.PeptidePositionAnnotator(annotation_file, pspInput=False, returnAllPotentialSites=False, localization_uncertainty=0, mod_dict={'S(Phospho (STY))': 's', 'S(ph)': 's', 'T(Phospho (STY))': 't', 'T(ph)': 't', 'Y(Phospho (STY))': 'y', 'Y(ph)': 'y', 'pS': 's', 'pT': 't', 'pY': 'y'}, return_unique=False, return_sorted=False, organism='human')

Bases: object

Annotates a pandas dataframe with the positions of each peptide within its protein sequence(s), based on a fasta file.

Example

annotator = PeptidePositionAnnotator(<path_to_annotation_file>)
annotator.load_annotations()
df = annotator.annotate(df)

Initialize the input files and options for PeptidePositionAnnotator.

Parameters:
  • annotation_file (str) – fasta file containing protein sequences

  • pspInput (bool) – set to True if the fasta file was obtained from PhosphoSitePlus

  • returnAllPotentialSites (bool) – return all modifiable positions within the peptide as potential p-sites.

  • localization_uncertainty (int) – return all modifiable positions within n positions of modified sites as potential p-sites.

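  • mod_dict (dict) – mapping from each modification string that can occur in the modified sequence to the lowercase letter of the modified residue, e.g. 'S(ph)' to 's' (see the default value in the signature above)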
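The sketch below shows one way to read the mod_dict and localization_uncertainty options. It is not the library's implementation. The helper names normalize and potential_sites are hypothetical, and the "within n positions" rule is an assumption taken from the parameter description above.

```python
# Hypothetical helpers for illustration only; not part of psite_annotation.
DEFAULT_MOD_DICT = {
    "S(Phospho (STY))": "s", "S(ph)": "s",
    "T(Phospho (STY))": "t", "T(ph)": "t",
    "Y(Phospho (STY))": "y", "Y(ph)": "y",
    "pS": "s", "pT": "t", "pY": "y",
}


def normalize(modified_sequence, mod_dict=DEFAULT_MOD_DICT):
    """Replace every modification string with its lowercase residue letter."""
    for pattern, replacement in mod_dict.items():
        modified_sequence = modified_sequence.replace(pattern, replacement)
    return modified_sequence


def potential_sites(normalized, n):
    """Return 0-based indices of S/T/Y residues within n positions of a modified residue."""
    modified = [i for i, aa in enumerate(normalized) if aa in "sty"]
    return [
        i
        for i, aa in enumerate(normalized)
        if aa.upper() in "STY" and any(abs(i - m) <= n for m in modified)
    ]


peptide = normalize("AS(ph)PTKY")  # "AsPTKY"
print(potential_sites(peptide, 0))  # only the modified serine itself
print(potential_sites(peptide, 2))  # also includes the threonine two residues away
```

With n=0 only the reported site is returned. Larger values of n widen the window around each modified residue.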

Methods

annotate

Adds columns regarding the peptide position within the protein to a pandas dataframe.

load_annotations

Reads in protein sequences from fasta file.

annotate(df, inplace=False)

Adds columns regarding the peptide position within the protein to a pandas dataframe.

Adds the following annotation columns to dataframe:

  • ‘Matched proteins’ = subset of the identifiers in the input ‘Proteins’ column whose sequence actually contains the peptide. If the same peptide is found multiple times, the protein identifier will be repeated.

  • ‘Start positions’ = starting positions of the modified peptide in the protein sequence (1-based, methionine is counted). If multiple isoforms/proteins contain the sequence, the starting positions are separated by semicolons in the same order as they are listed in the ‘Matched proteins’ column.

  • ‘End positions’ = end positions of the modified peptide in the protein sequence (see above for details)

  • ‘Site positions’ = position of the modification (see ‘Start positions’ above for details on how the position is counted)
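The 1-based counting described above can be illustrated with a small self-contained sketch. The function peptide_positions is a hypothetical helper, not the annotator's code. It assumes modified residues are already written as lowercase s, t or y, and it reports only the first occurrence of the peptide in a single protein.

```python
# Hypothetical helper for illustration only; not part of psite_annotation.
def peptide_positions(protein_seq, modified_peptide):
    """Return (start, end, site_positions), all 1-based, or None if not found."""
    plain = modified_peptide.upper()
    offset = protein_seq.find(plain)  # 0-based index, -1 if absent
    if offset == -1:
        return None
    start = offset + 1         # convert to 1-based; the initial methionine is position 1
    end = offset + len(plain)  # inclusive 1-based end position
    sites = [start + i for i, aa in enumerate(modified_peptide) if aa in "sty"]
    return start, end, sites


protein = "MAGSPTKYLR"
print(peptide_positions(protein, "GsPTK"))  # (3, 7, [4])
```

Here the peptide GSPTK spans residues 3 to 7 of the protein and the modified serine sits at position 4.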

Parameters:
  • df (DataFrame) – pandas dataframe to annotate; must contain the “Proteins” and “Modified sequence” columns

  • inplace (bool) – if True, add the new columns to df in place

Returns:

annotated dataframe

Return type:

pd.DataFrame

Required columns:

Proteins, Modified sequence

load_annotations()

Reads in protein sequences from fasta file.

Return type:

None
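For orientation, reading protein sequences from a fasta file can be sketched as follows. This is a minimal stand-alone parser under stated assumptions, not the code used by load_annotations. The function name read_fasta is hypothetical, and taking the first whitespace-delimited token of the header as the identifier is an assumption (the library may parse UniProt or PhosphoSitePlus headers differently).

```python
import io


# Hypothetical helper for illustration only; not part of psite_annotation.
def read_fasta(handle):
    """Parse fasta-formatted lines into a dict of {identifier: sequence}."""
    chunks = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            # Assumption: the identifier is the first token of the header line.
            current_id = header.split()[0] if header else ""
            chunks[current_id] = []
        elif current_id is not None:
            chunks[current_id].append(line)  # sequences may span several lines
    return {identifier: "".join(parts) for identifier, parts in chunks.items()}


example = io.StringIO(">P1 example protein\nMAGSP\nTKYLR\n>P2\nMKT\n")
proteins = read_fasta(example)
print(proteins)  # {'P1': 'MAGSPTKYLR', 'P2': 'MKT'}
```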