OpenMS-PEP Google Summer of Code 2018

Percolator features utilizing information from multiple replicate runs

In many use cases, a protein sample is analysed multiple times. Information from such replicate runs can be used to improve peptide/protein identification. Here, we implemented a handful of new features that utilize information from multiple identical MS pipelines:

  • REP:siblingSearches: sum of scores of hits sharing the same peptide in other runs
  • REP:siblingSearchesTop: sum of scores of top hits sharing the same peptide in other runs

  • REP:replicateSpectra: sum of scores of hits sharing the same precursor ion and peptide within a run (not strictly a “replicate runs” feature)
  • REP:siblingExperiments: sum of scores of hits sharing the same precursor ion and peptide in other runs

  • REP:siblingModifications: sum of scores of hits sharing the same peptide with different modification in other runs
  • REP:siblingModificationsTop: sum of scores of top hits sharing the same peptide with different modification in other runs

  • REP:siblingIons: sum of scores of hits sharing the same peptide with different charge in other runs
  • REP:siblingIonsTop: sum of scores of top hits sharing the same peptide with different charge in other runs
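
To make the difference between the plain and the “Top” variant concrete, here is a minimal sketch of how siblingSearches and siblingSearchesTop could be computed. It is not the project's actual implementation: the `Hit` record, its field names, and the `sibling_searches` helper are hypothetical, and the score is assumed to be -log10 of the E-value.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one peptide-spectrum match."""
    run: str        # identifier of the replicate run
    spectrum: str   # identifier of the spectrum within the run
    peptide: str    # peptide sequence
    rank: int       # 1 = top hit for its spectrum
    e_value: float  # search engine E-value


def score(hit):
    # Assumption: the log-transform is -log10(E-value), so larger is better.
    return -math.log10(hit.e_value)


def sibling_searches(hits, top_only=False):
    """For each hit, sum the scores of hits with the same peptide in OTHER runs.

    With top_only=True only rank-1 hits contribute to the sum
    (the "Top" variant); every hit still receives a feature value.
    """
    per_peptide_run = defaultdict(float)  # (peptide, run) -> summed score
    for h in hits:
        if top_only and h.rank != 1:
            continue
        per_peptide_run[(h.peptide, h.run)] += score(h)

    per_peptide = defaultdict(float)      # peptide -> summed score over all runs
    for (peptide, _run), total in per_peptide_run.items():
        per_peptide[peptide] += total

    # Subtract the contribution of the hit's own run to keep only "other runs".
    return [
        per_peptide[h.peptide] - per_peptide_run.get((h.peptide, h.run), 0.0)
        for h in hits
    ]


hits = [
    Hit("run1", "s1", "PEPTIDEK", 1, 1e-3),
    Hit("run2", "s7", "PEPTIDEK", 1, 1e-2),
    Hit("run2", "s8", "PEPTIDEK", 2, 1e-1),
    Hit("run3", "s2", "OTHERK", 1, 1e-4),
]
all_variant = sibling_searches(hits)
top_variant = sibling_searches(hits, top_only=True)
```

In this toy input, the rank-2 hit in run2 adds to the feature value of the run1 hit in the plain variant but not in the “Top” variant, which is the only difference between the two. The other feature pairs could follow the same pattern with a different grouping key (for example, the unmodified sequence or the charge state).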

E-value scores are log-transformed. The pairs siblingSearches/siblingSearchesTop, siblingModifications/siblingModificationsTop, and siblingIons/siblingIonsTop are two versions of the same idea: the plain version sums over all matching hits, the “Top” version only over top hits. Here, we show that the “Top” versions of these features result in superior Percolator performance. To this end, Percolator was run twice:

  • Percolator-Repl-All, with the features siblingSearches, replicateSpectra, siblingExperiments, siblingModifications, and siblingIons
  • Percolator-Repl-Top, with the features siblingSearchesTop, replicateSpectra, siblingExperiments, siblingModificationsTop, and siblingIonsTop

In both cases, the added features result in better (or at least no worse) Percolator performance, judged by the number of correct identifications and the ROC curves. However, the “Top” features perform better than their plain counterparts.

Correct Identifications

Number of correct identifications
plot_correct_identifications()

[Figures: 9 plots of the number of correct identifications; images not preserved]

Proportion of correct identifications
plot_correct_identifications(absolute=False)

[Figures: 9 plots of the proportion of correct identifications; images not preserved]

ROC curves

[Figures: 9 ROC curve plots; images not preserved]