OpenMS-PEP Google Summer of Code 2018

Performance Contribution of Distinct Feature Sets

To determine which new features are worth keeping, I analysed to what degree each group of features contributes to Percolator’s performance improvement. The idea is to ignore a single feature (or a group of features) and measure the drop in performance. For a fair comparison, it is important to account for correlations between features: if two features are strongly correlated, removing only one of them may show little effect because the other still carries much of the same information.

[Figure: pairwise Pearson correlations between features]

Shown are pairwise Pearson correlations (all p-values < 0.001). We can identify two groups of features that are more strongly correlated with each other than with the remaining features:

  1. REP:siblingExperiments and REP:replicateSpectra: Both features are based on PSMs that not only share the same (unmodified) peptide sequence but also the same precursor ion. (Label: “Extra precursor features”)
  2. REP:siblingSearches, REP:siblingModifications and REP:siblingIons: These features are based on PSMs that only share the same (unmodified) peptide sequence. (Label: “Extra peptide features”)
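As a rough illustration of how such groups can be read off a correlation matrix, the sketch below computes pairwise Pearson correlations with NumPy and merges features whose absolute correlation reaches a cutoff. The data are synthetic, and the 0.5 cutoff as well as the helper names `pearson_matrix` and `correlated_groups` are assumptions made for this example; the original analysis may have identified the groups differently, for instance by inspecting the figure.

```python
import numpy as np


def pearson_matrix(columns):
    """Return the feature names and their pairwise Pearson correlation matrix."""
    names = list(columns)
    data = np.array([columns[name] for name in names], dtype=float)
    # np.corrcoef treats each row as one variable by default.
    return names, np.corrcoef(data)


def correlated_groups(names, corr, threshold=0.5):
    """Group features linked by |r| >= threshold (connected components).

    The threshold is a hypothetical choice for this sketch.
    """
    unvisited = set(range(len(names)))
    groups = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        stack = [start]
        while stack:
            i = stack.pop()
            linked = {j for j in unvisited if abs(corr[i, j]) >= threshold}
            unvisited -= linked
            component |= linked
            stack.extend(linked)
        groups.append(sorted(names[k] for k in component))
    return sorted(groups)


# Synthetic stand-in data (NOT the project's data): two independent latent
# signals, each observed through several noisy features.
rng = np.random.default_rng(0)
n_psms = 500
precursor_signal = rng.normal(size=n_psms)
peptide_signal = rng.normal(size=n_psms)


def noisy(signal):
    return signal + 0.3 * rng.normal(size=n_psms)


columns = {
    "REP:siblingExperiments": noisy(precursor_signal),
    "REP:replicateSpectra": noisy(precursor_signal),
    "REP:siblingSearches": noisy(peptide_signal),
    "REP:siblingModifications": noisy(peptide_signal),
    "REP:siblingIons": noisy(peptide_signal),
}

names, corr = pearson_matrix(columns)
groups = correlated_groups(names, corr)
for group in groups:
    print(group)
```

On real feature tables the correlations between the two groups need not be close to zero as they are here, so the cutoff would have to be chosen with the actual matrix in view.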

Percolator was then run in four configurations:

  • Default features (Label: “No extra features”)
  • Default features + Extra peptide features
  • Default features + Extra precursor features
  • Default features + Extra peptide features + Extra precursor features (Label: “All extra features”)
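The four configurations can be written down as plain feature lists, as in the following minimal sketch. The default feature names are hypothetical placeholders (the actual default feature set is not listed here), and `build_configurations` and `select_features` are illustrative helpers, not part of Percolator or OpenMS.

```python
# Feature groups as named in the text.
EXTRA_PRECURSOR = ["REP:siblingExperiments", "REP:replicateSpectra"]
EXTRA_PEPTIDE = [
    "REP:siblingSearches",
    "REP:siblingModifications",
    "REP:siblingIons",
]

# Hypothetical placeholders for the default feature set.
DEFAULT_FEATURES = ["default_feature_1", "default_feature_2"]


def build_configurations(default_features):
    """Map each configuration label to the list of features it uses."""
    base = list(default_features)
    return {
        "No extra features": base,
        "Extra peptide features": base + EXTRA_PEPTIDE,
        "Extra precursor features": base + EXTRA_PRECURSOR,
        "All extra features": base + EXTRA_PEPTIDE + EXTRA_PRECURSOR,
    }


def select_features(psm, features):
    """Keep only the listed features of one PSM (a dict of name -> value)."""
    return {name: psm[name] for name in features}


configs = build_configurations(DEFAULT_FEATURES)

# One made-up PSM row carrying every feature, used only to show the filtering.
all_names = DEFAULT_FEATURES + EXTRA_PEPTIDE + EXTRA_PRECURSOR
example_psm = {name: float(index) for index, name in enumerate(all_names)}

for label, features in configs.items():
    print(label, len(select_features(example_psm, features)))
```

Each configuration would then be written out as its own input table and passed to a separate Percolator run, so that the runs differ only in the features they see.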

The plots below reveal that most of the performance improvement stems from the extra peptide features; the extra precursor features contribute very little.

Number of correct identifications

[Figures: nine plots of the number of correct identifications]

ROC curves

[Figures: nine ROC curve plots]