Posterior Error Probability Estimation for Peptide Search Engine
Results in OpenMS
Tandem mass spectrometry (MS/MS) is the standard technique for
protein identification and quantification. The first step in a
bottom-up shotgun MS/MS experiment typically involves enzymatic
digestion of the protein sample of interest, resulting in a complex
peptide mixture that is then subjected to mass spectrometric analysis.
After searching and scoring the observed MS/MS spectra against a
protein database, it has become common practice to assign
statistical significance to peptide-spectrum matches (PSMs), thereby
bypassing the need for arbitrarily chosen rigid score cutoffs when
determining a list of accepted PSMs. One state-of-the-art method for
postprocessing peptide search engine results is Percolator, which
estimates the probability for a peptide identification to be correct
- expressed as (posterior) error probability (PEP). However,
although in many MS-based proteomic studies a single peptide mixture
is analysed multiple times in replicate experiments, Percolator
processes MS runs individually and is thus blind to information from
sibling experiments. Encouraged by previous work showing that
naïvely combining individually analysed replicate MS runs improves
peptide inference, we here combine evidence across multiple sibling
experiments by designing novel Percolator features, which take into
account different precursor ion charge states and peptide
modifications. We find that leveraging information between replicate
MS runs increases the number of correctly identified PSMs by up to
22%, but has little impact on the peptide level. We make our work
publicly available by integrating the improved version of Percolator
into OpenMS, a popular open source C++ software platform for MS
analysis, replacing its current tool for PEP estimation after
proving Percolator's superiority (pending;
GitHub pull request).
Author
Sophia Mersmann
as part of
M.Sc. Bioinformatics and Theoretical Systems Biology
at Imperial College London
Supervisors
Juliane Liepe
Oliver Alka*
Julianus Pfeuffer*
Timo Sachsenberg*
*from
OpenMS
Affiliations
Imperial College London, UK
Google Summer of Code 2018 (Project)
Date
submitted on
published on
Introduction
OpenMS facilitates rapid development of complex mass
spectrometric workflows.
Tandem mass spectrometry (MS/MS;
Fig. 1) is the de facto standard experimental method for
high-throughput protein identification and quantification
[]. Analysis and interpretation of mass spectrometric (MS) data,
however, typically require a multitude of steps - from processing raw
experimental data through peptide and protein identification to,
possibly, quantification - often involving several software tools and
various file formats
[,,,]. To support researchers working with MS data, there has been a
tremendous effort to develop software applications that allow the
rapid development of easily reproducible mass spectrometric
pipelines by providing quick access to MS data processing and
analysis tools
[,,,,]. One widely used platform for MS data analysis is OpenMS, an
open-source C++ framework that enables users to conveniently develop
complex quantitative mass spectrometric workflows via a drag-and-drop
scheme of application nodes
[].
Peptide identifications reported by search engines often miss a
notion of statistical significance.
A common task in proteomics is to identify proteins in complex
biological mixtures. In a standard shotgun proteomics protocol, a
sample of unknown proteins is first digested by enzymes such as
trypsin. The identification of peptides in the resulting mixture is
then a critical step in mass spectrometric pipelines (Fig. 2). To this end, experimentally produced MS/MS spectra are typically
searched and scored against a database of theoretical spectra
constructed from peptides of known proteins. Since observed mass
spectra are subject to noise, finding an accurate mapping from an
observed to a theoretical spectrum is challenging. Peptide-spectrum
match (PSM) scores generally reflect the similarity of compared
spectra, regardless of the specific search engine used. However,
most PSM scores are not statistically sound. Assigning significance
to peptide identifications is thus crucial to validate and filter
PSMs. To this end, robust statistical models that postprocess search
engine results are needed to control the uncertainty in peptide
identifications.
Statistical validation of peptide identifications in OpenMS
Posterior error probabilities quantify the uncertainty in
individual peptide identifications.
Several statistical measures have been developed to quantify the
confidence in either a single peptide identification or a set of identifications
scored by a peptide search engine
[]. In this work, we focus on situations in which one is interested
in the identification of individual proteins (rather than a set of
proteins); for example, when determining whether a certain protein
is expressed in a certain cell type under a certain set of
conditions. In OpenMS, an in-house tool called
IDPosteriorErrorProbability (IDPEP) is used to assign a confidence
measure to individual PSMs. IDPEP assumes PSM scores to be generated
by a mixture model composed of two distinct distributions, one for
correct and one for incorrect identifications. Described within a
Bayesian framework, the posterior probability that a specific
peptide assignment with score \(s\) is correct (denoted by \(+\))
can then be computed as \[ p(+|s) = \frac{p(s|+)p(+)}{p(s)} =
\frac{p(s|+)p(+)}{p(s|+)p(+) + p(s|-)p(-)}, \] where the likelihoods
\(p(s|+)\) and \(p(s|-)\) are expressed by a Gaussian and Gumbel
distribution, respectively. The model learns its parameters in an
unsupervised fashion via maximum likelihood estimation, computed with
the expectation-maximization (EM) algorithm. The posterior error
probability (PEP) \(p(-|s) = 1 - p(+|s)\) then quantifies the
confidence in a single identified spectrum.
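To make the model concrete, the following minimal Python sketch fits such a two-component mixture (Gaussian for correct, Gumbel for incorrect matches) by EM and returns PEPs. The method-of-moments update for the Gumbel component and the initialisation choices are simplifying assumptions of this sketch, not a description of IDPEP's actual implementation.

```python
import numpy as np
from scipy.stats import norm, gumbel_r

def fit_pep_mixture(scores, n_iter=200):
    """Fit p(s) = pi*N(s; mu, sd) + (1-pi)*Gumbel(s; loc, scale); return PEPs."""
    scores = np.asarray(scores, dtype=float)
    # Crude initialisation: upper half -> "correct", lower half -> "incorrect".
    med = np.median(scores)
    hi = scores[scores >= med]
    mu_c, sd_c = hi.mean(), hi.std() + 1e-6
    loc_i, scale_i = med, scores.std() + 1e-6
    pi_c = 0.5  # prior p(+)
    for _ in range(n_iter):
        # E-step: responsibility of the "correct" component for each PSM score.
        p_c = pi_c * norm.pdf(scores, mu_c, sd_c)
        p_i = (1 - pi_c) * gumbel_r.pdf(scores, loc_i, scale_i)
        r = p_c / (p_c + p_i + 1e-300)
        # M-step: weighted Gaussian MLE ...
        w = r / r.sum()
        mu_c = np.sum(w * scores)
        sd_c = np.sqrt(np.sum(w * (scores - mu_c) ** 2)) + 1e-6
        # ... and weighted method-of-moments estimates for the Gumbel component
        # (mean = loc + gamma*scale, var = (pi^2 / 6) * scale^2).
        v = (1 - r) / (1 - r).sum()
        m = np.sum(v * scores)
        s = np.sqrt(np.sum(v * (scores - m) ** 2))
        scale_i = s * np.sqrt(6) / np.pi + 1e-6
        loc_i = m - np.euler_gamma * scale_i
        pi_c = r.mean()
    # PEP = posterior probability of the "incorrect" component, p(-|s).
    p_c = pi_c * norm.pdf(scores, mu_c, sd_c)
    p_i = (1 - pi_c) * gumbel_r.pdf(scores, loc_i, scale_i)
    return p_i / (p_c + p_i + 1e-300)
```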
OpenMS' tool for PEP estimation leaves room for
improvement.
The described model currently implemented in OpenMS suffers from
several shortcomings. First, it is not guaranteed that PSM scores
follow the chosen parametric distributions. Finding appropriate
default score distributions that are applicable in the general case
proves to be difficult since these depend on the scoring algorithm
used
[,]. Secondly, only a single PSM score retrieved from a peptide search
algorithm is taken into account, ignoring other valuable information
a search engine might return. The added information gained from
incorporating additional PSM scores as well as peptide properties
has been shown to result in improved performance
[,,]. Addressing these shortcomings in OpenMS promises to lead to
improved PEP estimation for peptide search engine results.
PeptideProphet: A family of sophisticated methods for accurate
PEP estimation.
IDPEP is loosely based on a widely used empirical statistical model
to estimate the accuracy of peptide identifications called
PeptideProphet
[]. Having evolved into a whole family of methods, the PeptideProphet
framework already offers more sophisticated approaches that address
the limitations described above for OpenMS
[,]. To overcome restrictive parametric forms of the mixture model,
Nesvizhskii and colleagues developed a variety of more flexible
probability models, all having strengths and limitations of their
own
[,,,]. Although shown to work reasonably well in the general case, some
suffer from potential overfitting
[]
while others are computationally demanding
[]. Furthermore, to incorporate more than a single score into the
model of PeptideProphet, Keller et al. first summarise
multiple quantities into a single discriminant score that preserves
most of the intrinsic information, using linear discriminant analysis
(LDA) in a supervised machine-learning approach
[]. There are, however, several difficulties in realizing this
approach. First, supervised training requires labelled high-quality
MS data of known proteins which is difficult to obtain. Secondly, a
pre-computed "fixed" discriminant function inferred from a specific
dataset might not generalize well to data collected under different
experimental conditions.
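As an illustration of this idea (not of PeptideProphet's actual code), the following sketch collapses several hypothetical per-PSM quantities into a single discriminant score with scikit-learn's LDA; note that it presupposes exactly the kind of labelled training data whose scarcity is discussed above.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical per-PSM quantities, e.g. (main score, delta to second-best hit).
X_train = np.array([[3.1, 0.8], [2.7, 0.5], [2.9, 0.9],
                    [0.9, 0.1], [1.2, 0.2], [0.7, 0.05]])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = correct, 0 = incorrect (needs curated data)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
discriminant = lda.decision_function(X_train)  # one summary score per PSM
```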
An alternative state-of-the-art tool for PEP estimation
Semi-supervised learning tailors PEP estimation to the mass
spectrometric data at hand.
An interesting alternative approach to quantifying the confidence in
PSM scores has been proposed by Käll et al. [,,]. The general idea is to establish a semi-supervised machine
learning method that is dynamically trained on data of a particular
MS/MS experiment. This eliminates the need to construct manually
curated training sets, while tailoring the model to each specific
use case individually. Additionally, there is no need to specify
underlying parametric assumptions. To achieve this, a support vector
machine (SVM) called Percolator is iteratively trained on false
matches found in an artificially generated dataset of peptides known
to be incorrect (decoy dataset) and a subset of high-scoring PSMs
considered to be correct. The SVM then learns to discriminate
between correctly and incorrectly identified PSMs. Based on
Percolator scores, PEPs are estimated using non-parametric logistic
regression
[,].
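A minimal sketch of this iterative, semi-supervised scheme (not Percolator itself) is shown below; the feature matrix, the boolean decoy flags, and the simple target-decoy q-value estimate are assumptions of the sketch.

```python
import numpy as np
from sklearn.svm import LinearSVC

def qvalues_from_scores(scores, is_decoy):
    """Simple target-decoy FDR: #decoys / #targets above each score threshold."""
    order = np.argsort(-scores)
    decoys = np.cumsum(is_decoy[order])
    targets = np.cumsum(~is_decoy[order])
    fdr = decoys / np.maximum(targets, 1)
    qvals = np.minimum.accumulate(fdr[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(qvals)
    out[order] = qvals
    return out

def percolator_like(X, is_decoy, init_scores, q_threshold=0.01, n_iter=10):
    """X: per-PSM feature matrix; is_decoy: boolean array; init_scores: engine score."""
    scores = init_scores.astype(float).copy()
    for _ in range(n_iter):
        q = qvalues_from_scores(scores, is_decoy)
        positive = (~is_decoy) & (q <= q_threshold)  # high-confidence target PSMs
        train = positive | is_decoy                  # train positives vs. decoys only
        labels = np.where(positive, 1, -1)
        svm = LinearSVC(C=1.0).fit(X[train], labels[train])
        scores = svm.decision_function(X)            # rescore *all* PSMs
    return scores
```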
Enhancing Percolator's performance when a single protein sample
has been analysed multiple times.
In an attempt to increase the number of statistically significant
peptides detected, previous work has mainly focused on combining
search results reported by different engines applied to the same set
of observed mass spectra
[,,,,,]. However, it has become common practice to analyse a protein
sample of interest multiple times via replicate MS runs
[,], thereby generating highly overlapping datasets used to, for
example, reconstruct protein interaction networks
[,]
or characterise proteomes of model organisms
[,,]. Naïvely, replicate datasets can be analysed individually before
combining reported results in the final step of a protocol (e.g. by
considering the union or intersection of peptides detected in
different runs). Even though taking into account replicate runs has
been shown to improve peptide inference
[,,], there has been little work to leverage information
between replicate runs when estimating PEPs
[,]. A notable exception is iProphet, which refines an initial
PeptideProphet analysis by using information available from highly
overlapping datasets
[]. We here expand on Percolator by encoding novel features that
combine evidence across replicate runs for a protein sample that has
been analysed multiple times.
Scope of this work
Improve PEP estimation in OpenMS.
Since IDPEP has not yet been tested thoroughly - even though it is
already fully supported by OpenMS - we first assess IDPEP's
ability to accurately estimate PEPs using a rich dataset of known
ground truth. To ensure our findings are valid across various mass
spectrometric workflows, we evaluate IDPEP in combination with a
range of popular peptide search engines including X!Tandem
[], MS-GF+
[], and Comet
[]. Second, we compare IDPEP with Percolator, with the intent to
replace the former by the latter in OpenMS (in case of superior
performance). Finally, we aim to improve Percolator's performance
when a protein sample has been analysed multiple times by combining
evidence across replicate runs, encoded as novel Percolator
features, which take into account different precursor ion charge
states and peptide modifications.
Material and Methods
Experimental data.
For evaluation purposes, we use MS data produced and published by
the Proteome Informatics Research Group (iPRG) for the iPRG
2016/2017 study "Inferring Proteoforms from Bottom-up Proteomics
Data"
[,]. For the study, different combinations of partially overlapping
oligopeptides (protein epitope signature tags (PrESTs)) were spiked
into a constant background of E. coli proteins after tryptic
digestion, resulting in four distinctive samples: mixture A+B (383
PrESTs), mixture A (192 PrESTs), mixture B (191 PrESTs), and a
"blank" sample containing background proteins only. Mixture A+B
contains partially overlapping peptides whereas mixtures A and B do
not. In addition to raw experimental MS data, iPRG published a
database comprising 5592 entries to search experimental spectra
against. This database contains the 383 target PrESTs present in the
samples, a set of 1000 PrEST-like entrapment sequences absent from
the samples, and other E. coli proteins. Each sample was analysed in
triplicate by liquid chromatography-MS/MS using higher-energy
collisional dissociation (HCD) on a Q Exactive Orbitrap mass
spectrometer. In this study, computational analysis is conducted
using mixtures A+B, A and B; the blank mixture is ignored due to the
lack of true positives.
Peptide search engines.
After conversion of the given raw experimental data to mzML format
using a conversion tool provided in the ProteoWizard Toolkit
[], experimental spectra were searched against the iPRG database using
peptide search engines X!Tandem (release VENGEANCE (2015.12.15))
[], MS-GF+ (version v20180130)
[], and Comet (version 2016013)
[]. All three search engines considered semi-tryptic peptides only
and were constrained to a precursor ion mass tolerance of 15 ppm,
which has been shown to be optimal for Orbitrap instruments
[]. For MS-GF+, the experimental setup was mirrored in the parameter
settings by specifying the instrument (Q Exactive Orbitrap) and
fragmentation method (HCD) used. The remaining parameters were left
at their defaults. All engines were run from within OpenMS (version 2.4).
Decoy database and search strategy.
For each dataset, experimental spectra were searched against the
iPRG database and a decoy database using a separate target-decoy
search strategy. The decoy database was derived from the target
database simply by reversing each protein sequence.
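A minimal sketch of this reversal strategy (the DECOY_ accession prefix is an illustrative convention, not part of the study protocol):

```python
def make_reversed_decoys(target_fasta):
    """target_fasta: dict mapping accession -> protein sequence."""
    return {"DECOY_" + acc: seq[::-1] for acc, seq in target_fasta.items()}

targets = {"P12345": "MKWVTFISLLLLFSSAYSR"}
decoys = make_reversed_decoys(targets)  # {'DECOY_P12345': 'RSYASSFLLLLSIFTVWKM'}
```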
PEP estimation.
IDPEP and Percolator (version 3.2) were both run from within OpenMS
(version 2.4); all parameters were left at their defaults.
Statistics.
For comparison of distributions we used either the
Kolmogorov-Smirnov (KS) test statistic (when one of the
distributions was empirical) or the two-sample KS test (when both
tested distributions were empirical). All statistical tests were
conducted at a significance level of 5%.
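Both kinds of comparison can be computed, for example, with SciPy; the toy data and the distribution choices below are placeholders.

```python
import numpy as np
from scipy.stats import kstest, ks_2samp, gumbel_r

rng = np.random.default_rng(0)
incorrect_scores = rng.gumbel(loc=10.0, scale=2.0, size=500)  # toy data
correct_scores = rng.normal(loc=25.0, scale=4.0, size=500)

# Empirical sample vs. a fitted parametric form (one-sample KS statistic).
loc, scale = gumbel_r.fit(incorrect_scores)
stat, pval = kstest(incorrect_scores, gumbel_r(loc, scale).cdf)

# Two empirical samples (two-sample KS test).
stat2, pval2 = ks_2samp(correct_scores, incorrect_scores)
```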
Results
Percolator outperforms IDPEP
OpenMS' in-house tool IDPEP does not keep pace with the latest
developments in state-of-the-art PEP estimation and has been observed
to behave unreliably in practice. In an attempt to find a suitable
replacement, we compare IDPEP with a modern, widely used method for
PEP estimation, called Percolator
[,,], using experimental data from three protein mixtures (three
technical replicates each) in combination with three popular peptide
search engines (X!Tandem
[], MS-GF+
[], and Comet
[]).
Peptide search engines rank detected PSMs according to a custom
scoring scheme, thereby providing a starting point for PEP
estimation tools.
Although the comparison of search engine performance is not the
focus of this work, we here provide a brief overview of their
performance, which, we believe, makes the subsequent analysis easier
to follow. In a typical database-centred
protein identification protocol, the first step involves searching
experimental spectra against a set of peptides using a preferred
search engine. Scoring schemes employed by search engines to map
observed spectra to peptides are diverse, and so are the resulting
match rankings, usually spanning from promising, high-scoring to
low-scoring, potentially incorrect identifications. Applied to the
iPRG2016 study dataset, we find that Comet and MS-GF+ produce
rankings that are stable across all analysed samples (Fig. 3). In particular, Comet tends to separate correct from incorrect
PSMs more clearly than MS-GF+. X!Tandem, however, exhibits unstable
behaviour; for one replicate of each protein mixture, X!Tandem
assigns scores to PSMs whose distributions, when divided into
correct and incorrect matches, are nearly identical (KS distances
< 0.12;
Fig. 3). Furthermore, regardless of the search engine used, mixture A+B
tends to be the most challenging to score accurately, followed by
the individual mixtures A and B. Search results reported by an
engine are then fed to IDPEP or Percolator, which aim to further
untangle correct from incorrect PSMs by estimating the confidence in
identified PSMs.
Broken assumptions underlying IDPEP contribute to unsuccessful
fitting.
Notably, IDPEP fails to fit its mixture model to any search engine
result for protein mixture A+B. Mixture A+B
differs from the other mixtures by containing partially overlapping
peptides whose pairwise similarity makes accurate peptide-spectrum
mapping harder. However, we find that none of the engine score
distributions meets the parametric assumptions underlying IDPEP -
neither for mixture A+B nor for mixtures A or B (KS test, p-value <
0.0001). That is, the score distributions of correct and incorrect
matches deviate significantly from the assumed Gaussian and Gumbel
forms, respectively. In the particular case of protein mixture A+B, the
score distributions of incorrect identifications deviate more strongly
from the assumed Gumbel distribution than those for the individual
mixtures (difference in KS distance: 0.01 (MS-GF+), 0.02 (Comet), 0.04
(X!Tandem)). In fact, the deviation seems to be pronounced enough to
prevent successful fitting. In contrast, Percolator demonstrates its
ability to handle challenging cases for which IDPEP fails, such as
mixture A+B. Since pinpointing incorrect matches seems to be a
limiting factor, this might be explained by the fact that Percolator
exploits decoy searches, a common technique in the field to learn from
identifications known to be incorrect.
Percolator correctly identifies a higher number of PSMs than IDPEP
while picking up fewer false positives.
For search engines MS-GF+ and Comet (X!Tandem results are discussed
separately), Percolator correctly identifies more PSMs than IDPEP
(Fig. 4, left). Moreover, improved receiver operating characteristic (ROC)
area under the curve (AUC) values (improvement of 0.03-0.05;
Fig. 4, right) ensure that higher numbers of correct identifications do not
come at the cost of increased numbers of false positives. In
particular, we find that Percolator's improvement over IDPEP is
greater for search engine results produced by MS-GF+ than Comet
(average percentage change at constant PEP threshold 0.01: 16.6%
(MS-GF+), 6.3% (Comet)). Since MS-GF+'s main score, as discussed
earlier, discriminates less clearly between correct and incorrect
identifications than Comet's score (Fig. 3), processing MS-GF+ results is more demanding for IDPEP which, in
turn, leaves more room for improvement to Percolator.
A peek under the hood: Percolator's internal scoring scheme
significantly discriminates correct from incorrect PSMs.
Percolator tackles PEP estimation by applying non-parametric logistic
regression to re-evaluated PSMs scored by its SVM. Ideally, correct
and incorrect identifications are assigned positive and negative
scores, respectively. For experimental spectra preprocessed by MS-GF+
and Comet, incorrect identifications clearly peak below zero whereas
the majority of correct identifications lie in the positive range
(Fig. 5). Furthermore, we find that all pairwise empirical distributions of
correct and incorrect identifications are statistically different from
each other (KS test, p-value < 0.0001). Hence, Percolator
effectively untangles correct from incorrect PSMs identified by search
engines MS-GF+ or Comet.
When presented with challenging input data, Percolator's
performance is more stable than IDPEP's.
As noted earlier, X!Tandem's ability to discriminate correct from
incorrect identifications is limited when compared to MS-GF+ and
Comet. Consequently, postprocessing X!Tandem's search results is more
challenging for PEP estimation methods. In fact, IDPEP and Percolator
both exhibit unstable behaviour when presented with data preprocessed
by X!Tandem. However, IDPEP either entirely fails to fit its model to
the data (e.g. mixture A+B, as discussed earlier) or results in fits
that are so conservative that no correct PSMs are found at PEP
thresholds 0.01 or 0.05, both commonly used in practice (Fig. 6, left). Percolator, on the contrary, performs "normally" for two out
of three replicates for each sample, and produces results out of the
ordinary for a single replicate only (Fig. 6, bottom). However, the wealth of detected true positives comes at
the price of a considerably increased number of false positives, with ROC-AUC
values dropping by more than 20% (from 0.83-0.86 to 0.60-0.62;
Fig. 6). Picking up high numbers of false positives when presented with
challenging data might be rooted in Percolator's iterative refinement
of the set of correct PSMs: if the semi-supervised training starts
from a poor set of putatively correct identifications as positive
examples, the refinement procedure tends to pick up more false PSMs
in each round. In practice, however, even though
Percolator performs poorly for one replicate, it still provides the
user with accurate PEPs from the remaining replicates, whereas IDPEP
does not return PEPs that are usable in real-world scenarios.
Leveraging information across multiple sibling experiments improves
peptide identification
After showing Percolator's superiority over IDPEP, the second aim of
this work is to effectively utilise information from multiple MS
experiments that analyse the same protein sample. Since Percolator is
a machine learning approach with an SVM at its core, integrating new
information is straightforward, as it simply requires defining
additional features. Inspired by the work of Shteynberg
et al. on iProphet
[], we here introduce five novel Percolator features that are computed
across multiple replicate experiments of a given protein sample. In
this section, we first motivate and describe each feature, and then
analyse their impact on peptide inference.
Novel Percolator features aggregate information from multiple sibling
experiments
The proposed features build on a common assumption: If a particular
peptide is identified in multiple experiments, it is more likely to be
correct. Based on replicate "sibling" experiments, the introduced
features (hereinafter referred to as "sibling features") can be
divided into two groups: Peptide-based features require sibling PSMs
simply to share a common (modified or unmodified) peptide sequence,
whereas precursor-based features require sibling PSMs to share not
only a common peptide but also a common precursor ion (summarised in
Table 1).
Table 1:
Brief summary of sibling features. A filled dot (●) marks a condition
that is enforced for sibling PSMs; an open dot (○) means the condition
is not enforced (it may still occur).

Sibling PSMs possess...       Sibling   Sibling         Sibling   Precursor   Replicate
                              Scores    Modifications   Ions      Scores      Spectra
...a common peptide              ●            ●            ●          ●           ●
...a common precursor ion        ○            ○            ○          ●           ●
...different modifications       ○            ●            ○          ○           ○
...a different charge state      ○            ○            ●          ○           ○
Peptide-based sibling features track a single peptide across
multiple datasets.
Peptide-based features build on the assumption that multiple
identifications of the same peptide across several experiments are
more likely to be a signal from the underlying truth than to be an
accumulation of random matches coincidentally detecting the same
peptide. For a particular PSM, that intuition is captured by summing
over the engine's main score of top-scoring PSMs referring to the same
peptide found in sibling experiments (Sibling Scores). The
chemical or post-translational modification of a peptide is ignored.
Hence, Sibling Scores penalise peptides that are found in only a few
experiments or assigned low scores in several experiments, while
pulling up peptides with high scores in many
experiments. Furthermore, we define two variations of
Sibling Scores, called Sibling Modifications and
Sibling Ions, that reward peptides with different mass
modifications and ion charges, respectively. Analogous to
Sibling Scores, the two variations are defined as the sum
over the engine's main score of top-scoring hits in sibling
experiments with the additional constraint of showing a different
modification or being identified by precursors with different ion
charge.
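The following sketch illustrates how the three peptide-based features could be computed; the PSM representation (dicts with 'run', 'unmod_seq', 'mod_seq', 'charge', 'score') and the top_hits lookup are assumptions of this sketch. As a further simplification, the modification and charge constraints are checked against the single top-scoring sibling hit rather than against the best hit satisfying each constraint.

```python
def peptide_sibling_features(psm, top_hits, runs):
    """psm: dict with keys 'run', 'unmod_seq', 'mod_seq', 'charge', 'score'.
    top_hits: maps (run, unmod_seq) -> top-scoring PSM for that peptide in that run."""
    sibling_scores = sibling_mods = sibling_ions = 0.0
    for run in runs:
        if run == psm["run"]:
            continue  # only sibling experiments count
        top = top_hits.get((run, psm["unmod_seq"]))
        if top is None:
            continue  # peptide not found in this sibling run
        sibling_scores += top["score"]        # same peptide, modifications ignored
        if top["mod_seq"] != psm["mod_seq"]:
            sibling_mods += top["score"]      # rewards a different modification
        if top["charge"] != psm["charge"]:
            sibling_ions += top["score"]      # rewards a different precursor charge
    return sibling_scores, sibling_mods, sibling_ions
```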
Precursor-based sibling features model multiple identifications of
the same precursor ion.
For this set of features, we assume that a particular precursor ion
observed in multiple experiments is more likely to be correct. Hence,
precursor-based statistics require sibling PSMs not only to share a
common peptide but also to agree on the particular precursor ion that
was matched to that peptide. For any given PSM, the
Precursor Scores feature sums over the engine's main score of all hits
in sibling runs that share the same peptide (modification ignored) and
precursor ion. Note that this feature, as opposed to peptide-based
features, is not constrained to top-scoring hits only but picks up
every sibling PSM detected. Two PSMs are defined to refer to the same
precursor ion if all of the following conditions hold true: (i) the
mass-to-charge ratios are within a 10 ppm mass tolerance, (ii) the
precursor ion's charge is identical, and (iii) the retention times are
within a 1200 s tolerance. Additionally, we introduce another statistic
called Replicate Spectra that applies the same idea within a
single dataset, i.e. Replicate Spectra is computed for each
experiment individually and as such, strictly speaking, does not
belong to the family of sibling features.
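A sketch of the precursor-matching predicate and the Precursor Scores sum, again assuming a hypothetical PSM representation; Replicate Spectra would apply the same predicate to PSMs from the same run instead of sibling runs.

```python
def same_precursor(a, b, ppm_tol=10.0, rt_tol=1200.0):
    """Conditions (i)-(iii) from the text, for PSM dicts with 'mz', 'charge', 'rt'."""
    mz_ok = abs(a["mz"] - b["mz"]) / b["mz"] * 1e6 <= ppm_tol  # (i) 10 ppm m/z
    charge_ok = a["charge"] == b["charge"]                     # (ii) identical charge
    rt_ok = abs(a["rt"] - b["rt"]) <= rt_tol                   # (iii) 1200 s RT window
    return mz_ok and charge_ok and rt_ok

def precursor_scores(psm, all_psms):
    """Sums over *every* matching sibling PSM, not only top-scoring hits."""
    return sum(
        other["score"]
        for other in all_psms
        if other["run"] != psm["run"]                 # sibling runs only
        and other["unmod_seq"] == psm["unmod_seq"]    # same peptide, modification ignored
        and same_precursor(psm, other)
    )
```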
Peptide-based statistics improve Percolator's performance more than
precursor-based statistics
Sibling statistics assign higher scores to correctly identified
peptides than peptides absent from a protein sample.
Features fed to SVMs are most effective when they discriminate target
classes clearly - here, correct and incorrect identifications. In
order to investigate whether the proposed sibling features show that
behaviour, we compare the means of feature distributions for correct
and incorrect identifications across all datasets (Fig. 7). For fair comparison, each feature space has been transformed to
the (0,1)-interval using min-max normalization (for analysis purposes
only). For all statistics, correct peptide identifications are, on
average, assigned higher scores. However, peptide-based features
distinguish correct from incorrect identifications more clearly than
precursor-based features. Sibling Scores and
Sibling Ions in particular seem to be quite effective. Hence, finding
sibling PSMs with precursor ions of different charge state seems to be
more informative about the peptide's presence in a sample than finding
sibling PSMs with peptides of different modifications. Both
precursor-based features, on the other hand, seem to be less
successful at untangling correct and incorrect PSMs (e.g.
Precursor Scores). A major difference between peptide- and
precursor-based features is that peptide-based features take into
account top-scoring PSMs only whereas precursor-based features sum
over all PSMs found in a particular search engine run that meet the
defined criteria. We hypothesize that summing over many low-scoring
PSMs (which are likely to be incorrect) might result in falsely high scores.
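For reference, the normalisation and the mean comparison can be expressed as follows (toy data; per-feature application assumed):

```python
import numpy as np

def min_max(x):
    """Scale a feature vector onto the unit interval (per feature, analysis only)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Toy example: mean separation between correct and incorrect PSMs for one feature.
feature = np.array([0.0, 3.0, 8.0, 10.0])
is_correct = np.array([False, False, True, True])
z = min_max(feature)
separation = z[is_correct].mean() - z[~is_correct].mean()
```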
Leveraging information between multiple technical replicates
increases the number of correct PSMs, but fails to detect higher
numbers of unique peptides.
When compared to Percolator not using information gathered from
sibling experiments, we record a substantially higher number of
peptides and spectra correctly matched, at least for search engines
X!Tandem and Comet; in the case of MS-GF+, the gain is negligible
(data not shown). However, even though the number of correct PSMs is
increased, the unique set of correctly identified peptides remains
unchanged (data not shown). Intuitively, finding more spectra matching
peptides that have already been identified is consistent with the
features' design, as they favour high-scoring peptides.
Most of the described performance improvement stems from
peptide-based features, whereas precursor-based features contribute
very little.
To measure how much the proposed sibling features contribute to
performance, we record the drop in performance when removing a
single feature (or, more precisely, a set of features). To this end,
taking into account correlations between features becomes necessary
since information shared by two correlated features might still be
available to the SVM when only one of those features is dropped. Not
surprisingly, peptide- and precursor-based features are found to form
separate feature groups that correlate more strongly within their
respective group than with features outside it (Fig. 8). Notably, Precursor Scores and Replicate Spectra,
i.e. the precursor-based features, are strongly correlated, possibly
due to the fact that these are essentially based on the same statistic
applied to different but highly overlapping datasets. However,
precursor-based features seem to contribute very little to the
discrimination of correct and incorrect identifications (Table 2). When data were preprocessed with X!Tandem, integrating
precursor-based features even leads to a slight performance decline
of 1.5%. In contrast, we find that leaving out peptide-based features
leads to a severe drop in performance for X!Tandem (14.6%) and Comet
(10.4%). In combination with MS-GF+, however, leveraging information
between experiments using peptide-based features contributes little to
improved performance. These findings highlight the difficulty of
developing statistics that are effective to a similar degree for a
range of search engines.
Table 2:
Contribution of peptide-based and precursor-based feature subsets to
Percolator's improved performance. For each search engine separately,
we record the drop in performance when one of the defined feature
subsets is left out, measured as the percentage change in correct
PSMs at a constant PEP threshold of 0.01.

                          Features dropped
                          Peptide-based   Precursor-based
X!Tandem & Percolator         -14.6%           +1.5%
MS-GF+ & Percolator            -0.3%           -1.2%
Comet & Percolator            -10.4%           -0.3%
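The ablation protocol behind Table 2 can be sketched as follows; train_and_estimate_peps() stands in for a full Percolator run on the given feature columns and is hypothetical.

```python
import numpy as np

def correct_psms_at(peps, is_correct, threshold=0.01):
    """Number of correct PSMs accepted at a fixed PEP threshold."""
    return int(np.sum((peps <= threshold) & is_correct))

def ablation_change(X, feature_groups, dropped, is_correct, train_and_estimate_peps):
    """Percentage change in correct PSMs when one feature subset is left out."""
    all_cols = [c for cols in feature_groups.values() for c in cols]
    keep = [c for c in all_cols if c not in feature_groups[dropped]]
    baseline = correct_psms_at(train_and_estimate_peps(X[:, all_cols]), is_correct)
    ablated = correct_psms_at(train_and_estimate_peps(X[:, keep]), is_correct)
    return 100.0 * (ablated - baseline) / baseline  # e.g. -14.6 for a severe drop
```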
Discussion and Outlook
IDPEP versus Percolator.
In this work, we have conducted an in-depth analysis of IDPEP, the PEP
estimation tool currently implemented in OpenMS
[]. Using a dataset of known ground truth, we have shown that the
parametric assumptions underlying IDPEP's mixture model are not
satisfied when IDPEP is used in combination with three popular peptide search
engines (X!Tandem
[], MS-GF+
[], and Comet
[]). In severe cases, IDPEP's heavy reliance on the engine's ability to
sufficiently distinguish correct from incorrect identifications leads
to unsuccessful or poor fitting, thereby returning a list of PEPs that
is virtually unusable. Furthermore, Percolator
[,,]
has been shown to consistently outperform IDPEP across all analysed
datasets and search engines considered. Encouraged by its robustness
and reliability, Percolator has subsequently been integrated into the
OpenMS framework, replacing IDPEP as the default PEP estimation tool.
We thereby provide easy access to Percolator and enable users within
the OpenMS community to integrate it into their mass spectrometric
workflows.
Replicate MS experiments improve peptide identification.
The second part of this work focused on improving PEP estimation by
leveraging information between replicate runs in cases where a single
protein sample has been analysed multiple times. To this end, we
proposed novel Percolator features based on shared peptides or
precursor ions between PSMs reported by separate database searches
with the same engine. For two out of three search engines, X!Tandem and
Comet, peptide-based features have been shown to greatly increase the
number of PSMs correctly identified. The features Sibling Scores and
Sibling Ions in particular showed high discriminative power (with
respect to correct and incorrect identifications), which is in
accordance with findings for iProphet, a tool using similar sibling
statistics
[]. In contrast, while iProphet reports an increase in correct PSMs at
a constant false discovery rate when accounting for
precursor-identical PSMs, we here find little contribution of those features
towards Percolator's improved performance. However, even though
iProphet's and our sibling statistics are similarly motivated, a
direct comparison is difficult since they are defined and implemented
in different ways and embedded in different frameworks.
Implicit bias towards high-abundance proteins.
Sibling features build on the general assumption that peptides
identified in multiple runs are more likely to be correct. This
assumption holds true when proteins of the same concentration are
spiked into a common background. However, mixtures often contain
proteins whose abundances differ by orders of magnitude. In such
cases, especially in conjunction with data-dependent acquisition of
MS/MS spectra, the overlap of replicate runs is mostly restricted to
high-abundance proteins
[]. Consequently, sibling features might fail to pick up
low-abundance peptides. It is, therefore, essential to take mixture
composition and
the technical approach used into consideration when deciding on a
strategy to deal with multiple replicate MS runs.
Finding optimal Percolator configurations that are effective for a
wide range of mass spectrometric workflows is challenging.
Even though leveraging information between replicate runs has been
shown to improve PSM identification, there is a discrepancy in its
effectiveness between different search engines used as preprocessors.
While X!Tandem and Comet combined with Percolator respond well to the
newly implemented features, there is little gain in the case of
MS-GF+. These findings highlight how the variety of search engines
that can be used upstream of Percolator complicates the problem of
developing a feature set that is universally effective. In view of the
complexity and variability of mass spectrometric workflows, a
one-size-fits-all configuration for Percolator might simply not exist.
However, Percolator can be easily adapted for use with a particular
search engine since integrating information is as simple as defining
new features. To effectively tailor the proposed features to a
particular search engine, one might take into account more than
solely the engine's main score, using carefully designed
engine-specific functions that distinguish correct from incorrect
identifications as clearly as possible by combining the various
pieces of information an engine returns, similar to what has been
done in PeptideProphet
[]. This approach might improve PEP estimation in the case of MS-GF+,
and could be beneficial for other search engines as well.
From peptide to protein inference.
Peptide inference is a critical step in a shotgun proteomics pipeline.
However, in most cases, one is ultimately interested in protein
inference while peptide inference is more of a necessity due to the
shotgun approach used. In a typical mass spectrometric workflow,
protein inference is based on the unique set of peptides identified
from experimental spectra. In agreement with results reported for
iProphet
[], sibling features failed to correctly identify a higher number of
unique peptides. However, even though it is not the common use case,
there are scenarios in which one is interested in identifying the
peptides themselves, e.g. for the identification of expressed sequence tags to
correct genome annotations
[], or for collection of proteomic data in, for example, the Peptide
Atlas database
[].
Extension of the proposed framework to other highly overlapping
datasets generated using approaches different from simple experiment
replication.
Essentially, jointly analysed MS datasets need to fulfil only a
single condition: they must refer to the same protein sample, i.e.
they are assumed to be observed evidence regarding the
same (unknown) truth. This holds true not only for technical
replicates of an experiment, as used in this study, but also for
datasets generated by, for example, analysing a sample multiple times
using a range of mass spectrometers, or by searching a set of observed
spectra from the same experiment against a common peptide database
using various peptide search engines. Regardless of the specific
technique used, the resulting datasets tend to be highly overlapping
while still offering some individual information
[], and as such are suited for the proposed analysis leveraging
information between sibling datasets of PSMs. In the case of
replicates generated by different instruments, Percolator extended
with sibling features can be applied directly. However, when using
different peptide search engines, one needs to be careful about
combining scores retrieved from different engines due to their
heterogeneity
[].