Posterior Error Probability Estimation for Peptide Search Engine
Results in OpenMS
Tandem mass spectrometry (MS/MS) is the standard technique for
protein identification and quantification. The first step in a
bottom-up shotgun MS/MS experiment typically involves enzymatic
digestion of the protein sample of interest, resulting in a complex
peptide mixture that is then subjected to mass spectrometric analysis.
After searching and scoring the observed MS/MS spectra against a
protein database, it has become common practice to assign
statistical significance to peptide-spectrum matches (PSMs), thereby
bypassing the need for arbitrarily chosen rigid score cutoffs when
determining a list of accepted PSMs. One state-of-the-art method for
postprocessing peptide search engine results is Percolator, which
estimates the probability for a peptide identification to be correct
- expressed as (posterior) error probability (PEP). However,
although in many MS-based proteomic studies a single peptide mixture
is analysed multiple times in replicate experiments, Percolator
processes MS runs individually and is thus blind to information from
sibling experiments. Encouraged by previous work showing that
naïvely combining individually analysed replicate MS runs improves
peptide inference, we here combine evidence across multiple sibling
experiments by designing novel Percolator features, which take into
account different precursor ion charge states and peptide
modifications. We find that leveraging information between replicate
MS runs increases the number of correctly identified PSMs by up to
22%, but has little impact on the peptide level. We make our work
publicly available by integrating the improved version of Percolator
into OpenMS, a popular open source C++ software platform for MS
analysis, replacing its current tool for PEP estimation after
proving Percolator's superiority (pending;
GitHub pull request).
Author
Sophia Mersmann
as part of
M.Sc. Bioinformatics and Theoretical Systems Biology
at Imperial College London
Supervisors
Juliane Liepe
Oliver Alka*
Julianus Pfeuffer*
Timo Sachsenberg*
*from
OpenMS
Affiliations
Imperial College London, UK
Google Summer of Code 2018 (Project)
Date
submitted on
published on
Introduction
OpenMS facilitates rapid development of complex mass
spectrometric workflows.
Tandem mass spectrometry (MS/MS;
Fig. 1) is the de facto standard experimental method for
high-throughput protein identification and quantification
[]. Analysis and interpretation of mass spectrometric (MS) data,
however, typically require a multitude of steps - from processing raw
experimental data through peptide and protein identification to,
possibly, quantification - often involving several software tools and
various file formats
[,,,]. To support researchers working with MS data, there has been a
tremendous effort to develop software applications that allow the
rapid development of easily reproducible mass spectrometric
pipelines by providing quick access to MS data processing and
analysis tools
[,,,,]. One widely used platform for MS data analysis is OpenMS, an
open-source C++ framework that enables users to conveniently develop
complex quantitative mass spectrometric workflows via a drag-and-drop
scheme of application nodes
[].
Peptide identifications reported by search engines often miss a
notion of statistical significance.
A common task in proteomics is to identify proteins in complex
biological mixtures. In a standard shotgun proteomics protocol, a
sample of unknown proteins is first digested by enzymes such as
trypsin. The identification of peptides in the resulting mixture is
then a critical step in mass spectrometric pipelines (Fig. 2). To this end, experimentally produced MS/MS spectra are typically
searched and scored against a database of theoretical spectra
constructed from peptides of known proteins. Since observed mass
spectra are subject to noise, finding an accurate mapping from an
observed to a theoretical spectrum is challenging. Peptide-spectrum
match (PSM) scores generally reflect the similarity of compared
spectra, regardless of the specific search engine used. However,
most PSM scores are not statistically sound. Assigning significance
to peptide identifications is thus crucial to validate and filter
PSMs. To this end, robust statistical models that postprocess search
engine results are needed to control the uncertainty in peptide
identifications.
Statistical validation of peptide identifications in OpenMS
Posterior error probabilities quantify the uncertainty in
individual peptide identifications.
Several statistical measures have been developed to quantify the
confidence in either a single peptide identification or a set of identifications
scored by a peptide search engine
[]. In this work, we focus on situations in which one is interested
in the identification of individual proteins (rather than a set of
proteins); for example, when determining whether a certain protein
is expressed in a certain cell type under a certain set of
conditions. In OpenMS, an in-house tool called
IDPosteriorErrorProbability (IDPEP) is used to assign a confidence
measure to individual PSMs. IDPEP assumes PSM scores to be generated
by a mixture model composed of two distinct distributions, one for
correct and one for incorrect identifications. Described within a
Bayesian framework, the posterior probability that a specific
peptide assignment with score \(s\) is correct (denoted by \(+\))
can then be computed as \[ p(+|s) = \frac{p(s|+)p(+)}{p(s)} =
\frac{p(s|+)p(+)}{p(s|+)p(+) + p(s|-)p(-)}, \] where the likelihoods
\(p(s|+)\) and \(p(s|-)\) are expressed by a Gaussian and Gumbel
distribution, respectively. The model learns its parameters in an
unsupervised fashion via maximum likelihood estimation, computed with
the expectation-maximization (EM) algorithm. The posterior error
probability (PEP) \(p(-|s) = 1 - p(+|s)\) then quantifies the
confidence in a single identified spectrum.
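To make the model concrete, the following minimal Python sketch fits such a two-component mixture (Gaussian for correct, Gumbel for incorrect matches) by EM and returns PEPs. The method-of-moments update for the Gumbel component and the initialisation choices are simplifying assumptions of this sketch, not a description of IDPEP's actual implementation.

```python
import numpy as np
from scipy.stats import norm, gumbel_r

def fit_pep_mixture(scores, n_iter=200):
    """Fit p(s) = pi*N(s; mu, sd) + (1-pi)*Gumbel(s; loc, scale); return PEPs."""
    scores = np.asarray(scores, dtype=float)
    # Crude initialisation: upper half -> "correct", lower half -> "incorrect".
    med = np.median(scores)
    hi = scores[scores >= med]
    mu_c, sd_c = hi.mean(), hi.std() + 1e-6
    loc_i, scale_i = med, scores.std() + 1e-6
    pi_c = 0.5  # prior p(+)
    for _ in range(n_iter):
        # E-step: responsibility of the "correct" component for each PSM score.
        p_c = pi_c * norm.pdf(scores, mu_c, sd_c)
        p_i = (1 - pi_c) * gumbel_r.pdf(scores, loc_i, scale_i)
        r = p_c / (p_c + p_i + 1e-300)
        # M-step: weighted Gaussian MLE ...
        w = r / r.sum()
        mu_c = np.sum(w * scores)
        sd_c = np.sqrt(np.sum(w * (scores - mu_c) ** 2)) + 1e-6
        # ... and weighted method-of-moments estimates for the Gumbel component
        # (mean = loc + gamma*scale, var = (pi^2 / 6) * scale^2).
        v = (1 - r) / (1 - r).sum()
        m = np.sum(v * scores)
        s = np.sqrt(np.sum(v * (scores - m) ** 2))
        scale_i = s * np.sqrt(6) / np.pi + 1e-6
        loc_i = m - np.euler_gamma * scale_i
        pi_c = r.mean()
    # PEP = posterior probability of the "incorrect" component, p(-|s).
    p_c = pi_c * norm.pdf(scores, mu_c, sd_c)
    p_i = (1 - pi_c) * gumbel_r.pdf(scores, loc_i, scale_i)
    return p_i / (p_c + p_i + 1e-300)
```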
OpenMS' tool for PEP estimation leaves room for
improvement.
The described model currently implemented in OpenMS suffers from
several shortcomings. First, it is not guaranteed that PSM scores
follow the chosen parametric distributions. Finding appropriate
default score distributions that are applicable in the general case
proves to be difficult since these depend on the scoring algorithm
used
[,]. Secondly, only a single PSM score retrieved from a peptide search
algorithm is taken into account, ignoring other valuable information
a search engine might return. The added information gained from
incorporating additional PSM scores as well as peptide properties
has been shown to result in improved performance
[,,]. Addressing these shortcomings in OpenMS promises to lead to
improved PEP estimation for peptide search engine results.
PeptideProphet: A family of sophisticated methods for accurate
PEP estimation.
IDPEP is loosely based on a widely used empirical statistical model
to estimate the accuracy of peptide identifications called
PeptideProphet
[]. Having evolved into a whole family of methods, the PeptideProphet
framework already offers more sophisticated approaches that address
the limitations described above for OpenMS
[,]. To overcome restrictive parametric forms of the mixture model,
Nesvizhskii and colleagues developed a variety of more flexible
probability models, all having strengths and limitations of their
own
[,,,]. Although shown to work reasonably well in the general case, some
suffer from potential overfitting
[]
while others are computationally demanding
[]. Furthermore, to incorporate more than a single score into the
model of PeptideProphet, Keller et al. first summarise
multiple quantities into a single discriminant score that preserves
most of the intrinsic information, using linear discriminant analysis
(LDA) in a supervised machine-learning approach
[]. There are, however, several difficulties in realizing this
approach. First, supervised training requires labelled high-quality
MS data of known proteins which is difficult to obtain. Secondly, a
pre-computed "fixed" discriminant function inferred from a specific
dataset might not generalize well to data collected under different
experimental conditions.
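As an illustration of this idea (not of PeptideProphet's actual code), the following sketch collapses several hypothetical per-PSM quantities into a single discriminant score with scikit-learn's LDA; note that it presupposes exactly the kind of labelled training data whose scarcity is discussed above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical per-PSM quantities, e.g. (main score, delta to second-best hit).
X_train = np.array([[3.1, 0.8], [2.7, 0.5], [2.9, 0.9],
                    [0.9, 0.1], [1.2, 0.2], [0.7, 0.05]])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = correct, 0 = incorrect (needs curated data)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
discriminant = lda.decision_function(X_train)  # one summary score per PSM
```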
An alternative state-of-the-art tool for PEP estimation
Semi-supervised learning tailors PEP estimation to the mass
spectrometric data at hand.
An interesting alternative approach to quantifying the confidence in
PSM scores has been proposed by Käll et al. [,,]. The general idea is to establish a semi-supervised machine
learning method that is dynamically trained on data of a particular
MS/MS experiment. This eliminates the need to construct manually
curated training sets, while tailoring the model to each specific
use case individually. Additionally, there is no need to specify
underlying parametric assumptions. To achieve this, a support vector
machine (SVM) called Percolator is iteratively trained on false
matches found in an artificially generated dataset of peptides known
to be incorrect (decoy dataset) and a subset of high-scoring PSMs
considered to be correct. The SVM then learns to discriminate
between correctly and incorrectly identified PSMs. Based on
Percolator scores, PEPs are estimated using non-parametric logistic
regression
[,].
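A minimal sketch of this iterative, semi-supervised scheme (not Percolator itself) is shown below; the feature matrix, the boolean decoy flags, and the simple target-decoy q-value estimate are assumptions of the sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def qvalues_from_scores(scores, is_decoy):
    """Simple target-decoy FDR: #decoys / #targets above each score threshold."""
    order = np.argsort(-scores)
    decoys = np.cumsum(is_decoy[order])
    targets = np.cumsum(~is_decoy[order])
    fdr = decoys / np.maximum(targets, 1)
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(qvals)
    out[order] = qvals
    return out

def percolator_like(X, is_decoy, init_scores, q_threshold=0.01, n_iter=10):
    """X: per-PSM feature matrix; is_decoy: boolean array; init_scores: engine score."""
    scores = init_scores.astype(float).copy()
    for _ in range(n_iter):
        q = qvalues_from_scores(scores, is_decoy)
        positive = (~is_decoy) & (q <= q_threshold)  # high-confidence target PSMs
        train = positive | is_decoy                  # train positives vs. decoys only
        labels = np.where(positive, 1, -1)
        svm = LinearSVC(C=1.0).fit(X[train], labels[train])
        scores = svm.decision_function(X)            # rescore *all* PSMs
    return scores
```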
Enhancing Percolator's performance when a single protein sample
has been analysed multiple times.
In an attempt to increase the number of statistically significant
peptides detected, previous work has mainly focused on combining
search results reported by different engines applied to the same set
of observed mass spectra
[,,,,,]. However, it has become common practice to analyse a protein
sample of interest multiple times via replicate MS runs
[,], thereby generating highly overlapping datasets used to, for
example, reconstruct protein interaction networks
[,]
or characterise proteomes of model organisms
[,,]. Naïvely, replicate datasets can be analysed individually before
combining reported results in the final step of a protocol (e.g. by
considering the union or intersection of peptides detected in
different runs). Even though taking into account replicate runs has
been shown to improve peptide inference
[,,], there has been little work to leverage information
between replicate runs when estimating PEPs
[,]. A notable exception is iProphet, which refines an initial
PeptideProphet analysis by using information available from highly
overlapping datasets
[]. We here expand on Percolator by encoding novel features that
combine evidence across replicate runs for a protein sample that has
been analysed multiple times.
Scope of this work
Improve PEP estimation in OpenMS.
Since IDPEP has not yet been tested thoroughly - even though it is
already fully supported by OpenMS - we first assess IDPEP's
ability to accurately estimate PEPs using a rich dataset of known
ground truth. To ensure our findings are valid across various mass
spectrometric workflows, we evaluate IDPEP in combination with a
range of popular peptide search engines including X!Tandem
[], MS-GF+
[], and Comet
[]. Second, we compare IDPEP with Percolator, with the intent to
replace the former by the latter in OpenMS (in case of superior
performance). Finally, we aim to improve Percolator's performance
when a protein sample has been analysed multiple times by combining
evidence across replicate runs, encoded as novel Percolator
features, which take into account different precursor ion charge
states and peptide modifications.
Material and Methods
Experimental data.
For evaluation purposes, we use MS data produced and published by
the Proteome Informatics Research Group (iPRG) for the iPRG
2016/2017 study "Inferring Proteoforms from Bottom-up Proteomics
Data"
[,]. For the study, different combinations of partially overlapping
oligopeptides (protein epitope signature tags (PrESTs)) were spiked
into a constant background of E. coli proteins after tryptic
digestion, resulting in four distinctive samples: mixture A+B (383
PrESTs), mixture A (192 PrESTs), mixture B (191 PrESTs), and a
"blank" sample containing background proteins only. Mixture A+B
contains partially overlapping peptides whereas mixtures A and B do
not. In addition to raw experimental MS data, iPRG published a
database comprising 5592 entries to search experimental spectra
against. This database contains the 383 target PrESTs present in the
samples, a set of 1000 PrEST-like entrapment sequences absent from
the samples, and other E. coli proteins. Each sample was analysed in
triplicate by liquid chromatography-MS/MS using higher-energy
collisional dissociation (HCD) on a Q Exactive Orbitrap mass
spectrometer. In this study, computational analysis is conducted
using mixtures A+B, A and B; the blank mixture is ignored due to the
lack of true positives.
Peptide search engines.
After conversion of the given raw experimental data to mzML format
using a conversion tool provided in the ProteoWizard Toolkit
[], experimental spectra were searched against the iPRG database using
peptide search engines X!Tandem (release VENGEANCE (2015.12.15))
[], MS-GF+ (version v20180130)
[], and Comet (version 2016013)
[]. All three search engines considered semi-tryptic peptides only
and were constrained to a precursor ion mass tolerance of 15 ppm,
which has been shown to be optimal for Orbitrap instruments
[]. For MS-GF+, the experimental setup was mirrored in the parameter
settings by specifying the instrument (Q Exactive Orbitrap) and
fragmentation method (HCD) used. The remaining parameters were left
at their defaults. All engines were run from within OpenMS (version 2.4).
Decoy database and search strategy.
For each dataset, experimental spectra were searched against the
iPRG database and a decoy database using a separate target-decoy
search strategy. The decoy database was derived from the target
database simply by reversing each protein sequence.
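A minimal sketch of this reversal strategy (the DECOY_ accession prefix is an illustrative convention, not part of the study protocol):

```python
def make_reversed_decoys(target_fasta):
    """target_fasta: dict mapping accession -> protein sequence."""
    return {"DECOY_" + acc: seq[::-1] for acc, seq in target_fasta.items()}

targets = {"P12345": "MKWVTFISLLLLFSSAYSR"}
decoys = make_reversed_decoys(targets)  # {'DECOY_P12345': 'RSYASSFLLLLSIFTVWKM'}
```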
PEP estimation.
IDPEP and Percolator (version 3.2) were both run from within OpenMS
(version 2.4); all parameters were left at their defaults.
Statistics.
For comparison of distributions we used either the
Kolmogorov-Smirnov (KS) test statistic (when one of the
distributions was empirical) or the two-sample KS test (when both
tested distributions were empirical). All statistical tests were
conducted at a significance level of 5%.
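Both kinds of comparison can be computed, for example, with SciPy; the toy data and the distribution choices below are placeholders.

```python
import numpy as np
from scipy.stats import kstest, ks_2samp, gumbel_r

rng = np.random.default_rng(0)
incorrect_scores = rng.gumbel(loc=10.0, scale=2.0, size=500)  # toy data
correct_scores = rng.normal(loc=25.0, scale=4.0, size=500)

# Empirical sample vs. a fitted parametric form (one-sample KS statistic).
loc, scale = gumbel_r.fit(incorrect_scores)
stat, pval = kstest(incorrect_scores, gumbel_r(loc, scale).cdf)

# Two empirical samples (two-sample KS test).
stat2, pval2 = ks_2samp(correct_scores, incorrect_scores)
```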
Results
Percolator outperforms IDPEP
OpenMS' in-house tool IDPEP does not keep pace with the latest
developments in state-of-the-art PEP estimation and has been observed
to behave unreliably in practice. In an attempt to find a suitable
replacement, we compare IDPEP with a modern, widely used method for
PEP estimation, called Percolator
[,,], using experimental data from three protein mixtures (three
technical replicates each) in combination with three popular peptide
search engines (X!Tandem
[], MS-GF+
[], and Comet
[]).
Peptide search engines rank detected PSMs according to a custom
scoring scheme, thereby providing a starting point for PEP
estimation tools.
Although the comparison of search engine performance is not the
focus of this work, we here provide a brief overview of their
performance, which, we believe, makes the subsequent analysis easier
to follow. In a typical database-centred
protein identification protocol, the first step involves searching
experimental spectra against a set of peptides using a preferred
search engine. Scoring schemes employed by search engines to map
observed spectra to peptides are diverse, and so are the resulting
match rankings, usually spanning from promising, high-scoring to
low-scoring, potentially incorrect identifications. Applied to the
iPRG2016 study dataset, we find that Comet and MS-GF+ produce
rankings that are stable across all analysed samples (Fig. 3). In particular, Comet tends to separate correct from incorrect
PSMs more clearly than MS-GF+. X!Tandem, however, exhibits unstable
behaviour; for one replicate of each protein mixture, X!Tandem
assigns scores to PSMs whose distributions, when divided into
correct and incorrect matches, are nearly identical (KS distances
< 0.12;
Fig. 3). Furthermore, regardless of the search engine used, mixture A+B
tends to be the most challenging to score accurately, followed by
the individual mixtures A and B. Search results reported by an
engine are then fed to IDPEP or Percolator, which aim to further
untangle correct from incorrect PSMs by estimating the confidence in
identified PSMs.
Broken assumptions underlying IDPEP contribute to unsuccessful
fitting.
Notably, IDPEP fails to fit its mixture model to any search engine
result for protein mixture A+B. Mixture A+B
differs from the other mixtures by containing partially overlapping
peptides whose pairwise similarity makes accurate peptide-spectrum
mapping harder. However, we find that none of the engine score
distributions meets the parametric assumptions underlying IDPEP -
neither for mixture A+B nor for mixtures A or B (KS test, p-value <
0.0001). That is, the score distributions of correct and incorrect
matches deviate significantly from the assumed Gaussian and Gumbel
forms, respectively. In the particular case of protein mixture A+B, the
score distributions of incorrect identifications deviate more strongly
from the assumed Gumbel distribution than those for the individual
mixtures (difference in KS distance: 0.01 (MS-GF+), 0.02 (Comet), 0.04
(X!Tandem)). In fact, the deviation seems to be pronounced enough to
prevent successful fitting. In contrast, Percolator demonstrates its
ability to handle challenging cases for which IDPEP fails, such as
mixture A+B. Since pinpointing incorrect matches seems to be a
limiting factor, this might be explained by the fact that Percolator
exploits decoy searches, a common technique in the field to learn from
identifications known to be incorrect.
Percolator correctly identifies a higher number of PSMs than IDPEP
while picking up fewer false positives.
For search engines MS-GF+ and Comet (X!Tandem results are discussed
separately), Percolator correctly identifies more PSMs than IDPEP
(Fig. 4, left). Moreover, improved receiver operating characteristic (ROC)
area under the curve (AUC) values (improvement of 0.03-0.05;
Fig. 4, right) ensure that higher numbers of correct identifications do not
come at the cost of increased numbers of false positives. In
particular, we find that Percolator's improvement over IDPEP is
greater for search engine results produced by MS-GF+ than Comet
(average percentage change at constant PEP threshold 0.01: 16.6%
(MS-GF+), 6.3% (Comet)). Since MS-GF+'s main score, as discussed
earlier, discriminates less clearly between correct and incorrect
identifications than Comet's score (Fig. 3), processing MS-GF+ results is more demanding for IDPEP which, in
turn, leaves more room for improvement to Percolator.
A peek under the hood: Percolator's internal scoring scheme
significantly discriminates correct from incorrect PSMs.
Percolator tackles PEP estimation by applying non-parametric logistic
regression to re-evaluated PSMs scored by its SVM. Ideally, correct
and incorrect identifications are assigned positive and negative
scores, respectively. For experimental spectra preprocessed by MS-GF+
and Comet, incorrect identifications clearly peak below zero whereas
the majority of correct identifications lie in the positive range
(Fig. 5). Furthermore, we find that all pairwise empirical distributions of
correct and incorrect identifications are statistically different from
each other (KS test, p-value < 0.0001). Hence, Percolator
effectively untangles correct from incorrect PSMs identified by search
engines MS-GF+ or Comet.
When presented with challenging input data, Percolator's
performance is more stable than IDPEP's.
As noted earlier, X!Tandem's ability to discriminate correct from
incorrect identifications is limited when compared to MS-GF+ and
Comet. Consequently, postprocessing X!Tandem's search results is more
challenging for PEP estimation methods. In fact, IDPEP and Percolator
both exhibit unstable behaviour when presented with data preprocessed
by X!Tandem. However, IDPEP either entirely fails to fit its model to
the data (e.g. mixture A+B, as discussed earlier) or results in fits
that are so conservative that no correct PSMs are found at PEP
thresholds 0.01 or 0.05, both commonly used in practice (Fig. 6, left). Percolator, on the contrary, performs "normally" for two out
of three replicates for each sample, and produces results out of the
ordinary for a single replicate only (Fig. 6, bottom). However, the wealth of detected true positives comes at
the price of a considerably increased number of false positives, with ROC-AUC
values dropping by more than 20% (from 0.83-0.86 to 0.60-0.62;
Fig. 6). Picking up high numbers of false positives when presented with
challenging data might be rooted in Percolator's iterative refinement
of the set of correct PSMs: if the semi-supervised training starts
from a poor set of putatively correct identifications as positive
examples, the refinement procedure tends to pick up more false PSMs
in each round. In practice, however, even though
Percolator performs poorly for one replicate, it still provides the
user with accurate PEPs from the remaining replicates, whereas IDPEP
does not return PEPs that are usable in real-world scenarios.
Leveraging information across multiple sibling experiments improves
peptide identification
After showing Percolator's superiority over IDPEP, the second aim of
this work is to effectively utilise information from multiple MS
experiments that analyse the same protein sample. Since Percolator is
a machine learning approach with an SVM at its core, integrating new
information is straightforward, as it simply requires defining
additional features. Inspired by the work of Shteynberg
et al. on iProphet
[], we here introduce five novel Percolator features that are computed
across multiple replicate experiments of a given protein sample. In
this section, we first motivate and describe each feature, and then
analyse their impact on peptide inference.
Novel Percolator features aggregate information from multiple sibling
experiments
The proposed features build on a common assumption: If a particular
peptide is identified in multiple experiments, it is more likely to be
correct. Based on replicate "sibling" experiments, the introduced
features (hereinafter referred to as "sibling features") can be
divided into two groups: Peptide-based features require sibling PSMs
simply to share a common (modified or unmodified) peptide sequence,
whereas precursor-based features require sibling PSMs to share not
only a common peptide but also a common precursor ion (summarised in
Table 1).
Table 1:
Brief summary of sibling features. A filled dot (●) marks a condition
that is enforced for sibling PSMs; an open dot (○) means the condition
is not enforced (it may still occur).

Sibling PSMs possess...       Sibling   Sibling         Sibling   Precursor   Replicate
                              Scores    Modifications   Ions      Scores      Spectra
...a common peptide              ●            ●            ●          ●           ●
...a common precursor ion        ○            ○            ○          ●           ●
...different modifications       ○            ●            ○          ○           ○
...a different charge state      ○            ○            ●          ○           ○
Peptide-based sibling features track a single peptide across
multiple datasets.
Peptide-based features build on the assumption that multiple
identifications of the same peptide across several experiments are
more likely to be a signal from the underlying truth than to be an
accumulation of random matches coincidentally detecting the same
peptide. For a particular PSM, that intuition is captured by summing
over the engine's main score of top-scoring PSMs referring to the same
peptide found in sibling experiments (Sibling Scores). The
chemical or post-translational modification of a peptide is ignored.
Hence, Sibling Scores penalise peptides that are found in only a few
experiments or assigned low scores in several experiments, while
pulling up peptides with high scores in many
experiments. Furthermore, we define two variations of
Sibling Scores, called Sibling Modifications and
Sibling Ions, that reward peptides with different mass
modifications and ion charges, respectively. Analogous to
Sibling Scores, the two variations are defined as the sum
over the engine's main score of top-scoring hits in sibling
experiments with the additional constraint of showing a different
modification or being identified by precursors with different ion
charge.
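The following sketch illustrates how the three peptide-based features could be computed; the PSM representation (dicts with 'run', 'unmod_seq', 'mod_seq', 'charge', 'score') and the top_hits lookup are assumptions of this sketch. As a further simplification, the modification and charge constraints are checked against the single top-scoring sibling hit rather than against the best hit satisfying each constraint.

```python
def peptide_sibling_features(psm, top_hits, runs):
    """psm: dict with keys 'run', 'unmod_seq', 'mod_seq', 'charge', 'score'.
    top_hits: maps (run, unmod_seq) -> top-scoring PSM for that peptide in that run."""
    sibling_scores = sibling_mods = sibling_ions = 0.0
    for run in runs:
        if run == psm["run"]:
            continue  # only sibling experiments count
        top = top_hits.get((run, psm["unmod_seq"]))
        if top is None:
            continue  # peptide not found in this sibling run
        sibling_scores += top["score"]        # same peptide, modifications ignored
        if top["mod_seq"] != psm["mod_seq"]:
            sibling_mods += top["score"]      # rewards a different modification
        if top["charge"] != psm["charge"]:
            sibling_ions += top["score"]      # rewards a different precursor charge
    return sibling_scores, sibling_mods, sibling_ions
```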
Precursor-based sibling features model multiple identifications of
the same precursor ion.
For this set of features, we assume that a particular precursor ion
observed in multiple experiments is more likely to be correct. Hence,
precursor-based statistics require sibling PSMs not only to share a
common peptide but also to agree on the particular precursor ion that
was matched to that peptide. For any given PSM, the
Precursor Scores feature sums over the engine's main score of all hits
in sibling runs that share the same peptide (modification ignored) and
precursor ion. Note that this feature, as opposed to peptide-based
features, is not constrained to top-scoring hits only but picks up
every sibling PSM detected. Two PSMs are defined to refer to the same
precursor ion if all of the following conditions hold true: (i) the
mass-to-charge ratios are within a 10 ppm mass tolerance, (ii) the
precursor ion's charge is identical, and (iii) the retention times are
within a 1200 s tolerance. Additionally, we introduce another statistic
called Replicate Spectra that applies the same idea within a
single dataset, i.e. Replicate Spectra is computed for each
experiment individually and as such, strictly speaking, does not
belong to the family of sibling features.
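A sketch of the precursor-matching predicate and the Precursor Scores sum, again assuming a hypothetical PSM representation; Replicate Spectra would apply the same predicate to PSMs from the same run instead of sibling runs.

```python
def same_precursor(a, b, ppm_tol=10.0, rt_tol=1200.0):
    """Conditions (i)-(iii) from the text, for PSM dicts with 'mz', 'charge', 'rt'."""
    mz_ok = abs(a["mz"] - b["mz"]) / b["mz"] * 1e6 <= ppm_tol  # (i) 10 ppm m/z
    charge_ok = a["charge"] == b["charge"]                     # (ii) identical charge
    rt_ok = abs(a["rt"] - b["rt"]) <= rt_tol                   # (iii) 1200 s RT window
    return mz_ok and charge_ok and rt_ok

def precursor_scores(psm, all_psms):
    """Sums over *every* matching sibling PSM, not only top-scoring hits."""
    return sum(
        other["score"]
        for other in all_psms
        if other["run"] != psm["run"]                 # sibling runs only
        and other["unmod_seq"] == psm["unmod_seq"]    # same peptide, modification ignored
        and same_precursor(psm, other)
    )
```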
Peptide-based statistics improve Percolator's performance more than
precursor-based statistics
Sibling statistics assign higher scores to correctly identified
peptides than peptides absent from a protein sample.
Features fed to SVMs are most effective when they discriminate target
classes clearly - here, correct and incorrect identifications. In
order to investigate whether the proposed sibling features show that
behaviour, we compare the means of feature distributions for correct
and incorrect identifications across all datasets (Fig. 7). For fair comparison, each feature space has been transformed to
the (0,1)-interval using min-max normalization (for analysis purposes
only). For all statistics, correct peptide identifications are, on
average, assigned higher scores. However, peptide-based features
distinguish correct from incorrect identifications more clearly than
precursor-based features. Sibling Scores and
Sibling Ions in particular seem to be quite effective. Hence, finding
sibling PSMs with precursor ions of different charge state seems to be
more informative about the peptide's presence in a sample than finding
sibling PSMs with peptides of different modifications. Both
precursor-based features, on the other hand, seem to be less
successful at untangling correct and incorrect PSMs (e.g.
Precursor Scores). A major difference between peptide- and
precursor-based features is that peptide-based features take into
account top-scoring PSMs only whereas precursor-based features sum
over all PSMs found in a particular search engine run that meet the
defined criteria. We hypothesize that summing over many low-scoring
PSMs (which are likely to be incorrect) might result in falsely high scores.
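For reference, the normalisation and the mean comparison can be expressed as follows (toy data; per-feature application assumed):

```python
import numpy as np

def min_max(x):
    """Scale a feature vector onto the unit interval (per feature, analysis only)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Toy example: mean separation between correct and incorrect PSMs for one feature.
feature = np.array([0.0, 3.0, 8.0, 10.0])
is_correct = np.array([False, False, True, True])
z = min_max(feature)
separation = z[is_correct].mean() - z[~is_correct].mean()
```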
Leveraging information between multiple technical replicates
increases the number of correct PSMs, but fails to detect higher
numbers of unique peptides.
When compared to Percolator not using information gathered from
sibling experiments, we record a substantially higher number of
peptides and spectra correctly matched, at least for search engines
X!Tandem and Comet; in the case of MS-GF+, the gain is negligible
(data not shown). However, even though the number of correct PSMs is
increased, the unique set of correctly identified peptides remains
unchanged (data not shown). Intuitively, finding more spectra matching
peptides that have already been identified is consistent with the
features' design, as they favour high-scoring peptides.
Most of the described performance improvement stems from
peptide-based features, whereas precursor-based features contribute
very little.
To measure how much the proposed sibling features contribute to
performance, we record the drop in performance when removing a
single feature (or, more precisely, a set of features). To this end,
taking into account correlations between features becomes necessary
since information shared by two correlated features might still be
available to the SVM when only one of those features is dropped. Not
surprisingly, peptide- and precursor-based features are found to form
separate feature groups that correlate more strongly within their
respective group than with features outside it (Fig. 8). Notably, Precursor Scores and Replicate Spectra,
i.e. the precursor-based features, are strongly correlated, possibly
due to the fact that these are essentially based on the same statistic
applied to different but highly overlapping datasets. However,
precursor-based features seem to contribute very little to the
discrimination of correct and incorrect identifications (Table 2). When data were preprocessed with X!Tandem, integrating
precursor-based features even leads to a slight performance decline
of 1.5%. In contrast, we find that leaving out peptide-based features
leads to a severe drop in performance for X!Tandem (14.6%) and Comet
(10.4%). In combination with MS-GF+, however, leveraging information
between experiments using peptide-based features contributes little to
improved performance. These findings highlight the difficulty of
developing statistics that are effective to a similar degree for a
range of search engines.
Table 2:
Contribution of peptide-based and precursor-based feature subsets to
Percolator's improved performance. For each search engine separately,
we record the drop in performance when one of the defined feature
subsets is left out, measured as the percentage change in correct
PSMs at a constant PEP threshold of 0.01.

                          Features dropped
                          Peptide-based   Precursor-based
X!Tandem & Percolator         -14.6%           +1.5%
MS-GF+ & Percolator            -0.3%           -1.2%
Comet & Percolator            -10.4%           -0.3%
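The ablation protocol behind Table 2 can be sketched as follows; train_and_estimate_peps() stands in for a full Percolator run on the given feature columns and is hypothetical.

```python
import numpy as np

def correct_psms_at(peps, is_correct, threshold=0.01):
    """Number of correct PSMs accepted at a fixed PEP threshold."""
    return int(np.sum((peps <= threshold) & is_correct))

def ablation_change(X, feature_groups, dropped, is_correct, train_and_estimate_peps):
    """Percentage change in correct PSMs when one feature subset is left out."""
    all_cols = [c for cols in feature_groups.values() for c in cols]
    keep = [c for c in all_cols if c not in feature_groups[dropped]]
    baseline = correct_psms_at(train_and_estimate_peps(X[:, all_cols]), is_correct)
    ablated = correct_psms_at(train_and_estimate_peps(X[:, keep]), is_correct)
    return 100.0 * (ablated - baseline) / baseline  # e.g. -14.6 for a severe drop
```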
Discussion and Outlook
IDPEP versus Percolator.
In this work, we have conducted an in-depth analysis of IDPEP, the PEP
estimation tool currently implemented in OpenMS
[]. Using a dataset of known ground truth, we have shown that the
parametric assumptions underlying IDPEP's mixture model are not
satisfied when IDPEP is used in combination with three popular peptide search
engines (X!Tandem
[], MS-GF+
[], and Comet
[]). In severe cases, IDPEP's heavy reliance on the engine's ability to
sufficiently distinguish correct from incorrect identifications leads
to unsuccessful or poor fitting, thereby returning a list of PEPs that
is virtually unusable. Furthermore, Percolator
[,,]
has been shown to consistently outperform IDPEP across all analysed
datasets and search engines considered. Encouraged by its robustness
and reliability, Percolator has subsequently been integrated into the
OpenMS framework, replacing IDPEP as the default PEP estimation tool.
We thereby provide easy access to Percolator and enable users within
the OpenMS community to integrate it into their mass spectrometric
workflows.
Replicate MS experiments improve peptide identification.
The second part of this work focused on improving PEP estimation by
leveraging information between replicate runs in cases where a single
protein sample has been analysed multiple times. To this end, we
proposed novel Percolator features based on shared peptides or
precursor ions between PSMs reported by separate database searches
with the same engine. For two out of three search engines, X!Tandem and
Comet, peptide-based features have been shown to greatly increase the
number of PSMs correctly identified. The features Sibling Scores and
Sibling Ions in particular showed high discriminative power (with
respect to correct and incorrect identifications), which is in
accordance with findings for iProphet, a tool using similar sibling
statistics
[]. In contrast, while iProphet reports an increase in correct PSMs at
a constant false discovery rate when accounting for
precursor-identical PSMs, we here find little contribution of those features
towards Percolator's improved performance. However, even though
iProphet's and our sibling statistics are similarly motivated, a
direct comparison is difficult since they are defined and implemented
in different ways and embedded in different frameworks.
Implicit bias towards high-abundance proteins.
Sibling features build on the general assumption that peptides
identified in multiple runs are more likely to be correct. This
assumption holds true when proteins of the same concentration are
spiked into a common background. However, mixtures often contain
proteins whose abundances differ by orders of magnitude. In such
cases, especially in conjunction with data-dependent acquisition of
MS/MS spectra, the overlap of replicate runs is mostly restricted to
high-abundance proteins
[]. Consequently, sibling features might fail to pick up
low-abundance peptides. It is, therefore, essential to take mixture
composition and
the technical approach used into consideration when deciding on a
strategy to deal with multiple replicate MS runs.
Finding optimal Percolator configurations that are effective for a
wide range of mass spectrometric workflows is challenging.
Even though leveraging information between replicate runs has been
shown to improve PSM identification, there is a discrepancy in its
effectiveness between different search engines used as preprocessors.
While X!Tandem and Comet combined with Percolator respond well to the
newly implemented features, there is little gain in the case of
MS-GF+. These findings highlight how the variety of search engines
that can be used upstream of Percolator complicates the problem of
developing a feature set that is universally effective. In view of the
complexity and variability of mass spectrometric workflows, a
one-size-fits-all configuration for Percolator might simply not exist.
However, Percolator can be easily adapted for use with a particular
search engine since integrating information is as simple as defining
new features. To effectively tailor the proposed features to a
particular search engine, one might take into account more than
solely the engine's main score, using carefully designed
engine-specific functions that distinguish correct from incorrect
identifications as clearly as possible by combining the various
pieces of information an engine returns, similar to what has been
done in PeptideProphet
[]. This approach might improve PEP estimation in the case of MS-GF+,
and could be beneficial for other search engines as well.
From peptide to protein inference.
Peptide inference is a critical step in a shotgun proteomics pipeline.
However, in most cases, one is ultimately interested in protein
inference while peptide inference is more of a necessity due to the
shotgun approach used. In a typical mass spectrometric workflow,
protein inference is based on the unique set of peptides identified
from experimental spectra. In agreement with results reported for
iProphet
[], sibling features failed to correctly identify a higher number of
unique peptides. However, even though it is not the common use case,
there are scenarios in which one is interested in identifying the
peptides themselves, e.g. for the identification of expressed sequence tags to
correct genome annotations
[], or for collection of proteomic data in, for example, the Peptide
Atlas database
[].
Extension of the proposed framework to other highly overlapping
datasets generated using approaches different from simple experiment
replication.
Essentially, jointly analysed MS datasets need to fulfil only a
single condition: they must refer to the same protein sample, i.e.
they are assumed to be observed evidence regarding the
same (unknown) truth. This holds true not only for technical
replicates of an experiment, as used in this study, but also for
datasets generated by, for example, analysing a sample multiple times
using a range of mass spectrometers, or by searching a set of observed
spectra from the same experiment against a common peptide database
using various peptide search engines. Regardless of the specific
technique used, the resulting datasets tend to be highly overlapping
while still offering some individual information
[], and as such are suited for the proposed analysis leveraging
information between sibling datasets of PSMs. In the case of
replicates generated by different instruments, Percolator extended
with sibling features can be applied directly. However, when using
different peptide search engines, one needs to be careful about
combining scores retrieved from different engines due to their
heterogeneity
[].