TABLE OF CONTENTS
Introduction
Material and Methods
Results
Discussion and Outlook

Posterior Error Probability Estimation for Peptide Search Engine Results in OpenMS

Tandem mass spectrometry (MS/MS) is the standard technique for protein identification and quantification. The first step in a bottom-up shotgun MS/MS experiment typically involves enzymatic digestion of the protein sample of interest, resulting in a complex peptide mixture that is then subjected to mass spectrometric analysis. After searching and scoring the observed MS/MS spectra against a protein database, it has become common practice to assign statistical significance to peptide-spectrum matches (PSMs), thereby bypassing the need for arbitrarily chosen rigid score cutoffs when determining a list of accepted PSMs. One state-of-the-art method for postprocessing peptide search engine results is Percolator, which estimates the probability for a peptide identification to be incorrect - expressed as the posterior error probability (PEP). However, although in many MS-based proteomic studies a single peptide mixture is analysed multiple times in replicate experiments, Percolator processes MS runs individually and is thus blind to information from sibling experiments. Encouraged by previous work showing that naïvely combining individually analysed replicate MS runs improves peptide inference, we here combine evidence across multiple sibling experiments by designing novel Percolator features, which take into account different precursor ion charge states and peptide modifications. We find that leveraging information between replicate MS runs increases the number of correctly identified PSMs by up to 22%, but has little impact on the peptide level. We make our work publicly available by integrating the improved version of Percolator into OpenMS, a popular open-source C++ software platform for MS analysis, replacing its current tool for PEP estimation after proving Percolator's superiority (pending; GitHub pull request).

Introduction

OpenMS facilitates rapid development of complex mass spectrometric workflows. Tandem mass spectrometry (MS/MS; Fig. 1) is the de facto standard experimental method for high-throughput protein identification and quantification []. Analysis and interpretation of mass spectrometric (MS) data, however, typically require a multitude of steps - from processing raw experimental data through peptide and protein identification to, possibly, quantification - often using several software tools and various file formats [,,,]. To support researchers working with MS data, there has been a tremendous effort to develop software applications that allow the rapid development of easily reproducible mass spectrometric pipelines by providing quick access to MS data processing and analysis tools [,,,,]. One widely used software platform for MS data analysis is OpenMS, an open-source software platform implemented in C++, which enables users to conveniently develop complex quantitative mass spectrometric workflows via a drag-and-drop scheme of application nodes [].

Figure 1: Schematic illustration of basic principles in tandem mass spectrometry. Tandem mass spectrometry (MS/MS) accomplishes protein sequencing by analysing ion masses twice in succession coupled through an intermediate fragmentation stage. The first step of MS/MS involves ionizing a sample of interest, thereby generating a mixture of ions. The first mass analyser then selects and isolates so-called precursor ions of a specific mass-to-charge ratio (m/z) each representing a single peptide. After fragmentation of the selected precursor ions, the second mass analyser measures m/z values of the fragment ions produced from a precursor ion. The acquired tandem mass spectra represent characteristic "fingerprints" of peptides and can be further analysed to obtain sequence information.

Peptide identifications reported by search engines often miss a notion of statistical significance. A common task in proteomics is to identify proteins in complex biological mixtures. In a standard shotgun proteomics protocol, a sample of unknown proteins is first digested by enzymes such as trypsin. The identification of peptides in the resulting mixture is then a critical step in mass spectrometric pipelines (Fig. 2). To this end, experimentally produced MS/MS spectra are typically searched and scored against a database of theoretical spectra constructed from peptides of known proteins. Since observed mass spectra are subject to noise, finding an accurate mapping from an observed to a theoretical spectrum is challenging. Peptide-spectrum match (PSM) scores generally reflect the similarity of compared spectra, regardless of the specific search engine used. However, most PSM scores are not statistically sound. Assigning significance to peptide identifications is thus crucial to validate and filter PSMs. To this end, robust statistical models that postprocess search engine results are needed to control the uncertainty in peptide identifications.


Statistical validation of peptide identifications in OpenMS

Posterior error probabilities quantify the uncertainty in individual peptide identifications. Several statistical measures have been developed to quantify the confidence in either a single or a set of peptide identifications scored by a peptide search engine []. In this work, we focus on situations in which one is interested in the identification of individual proteins (rather than a set of proteins); for example, when determining whether a certain protein is expressed in a certain cell type under a certain set of conditions. In OpenMS, an in-house tool called IDPosteriorErrorProbability (IDPEP) is used to assign a confidence measure to individual PSMs. IDPEP assumes PSM scores to be generated by a mixture model composed of two distinct distributions, one for correct and one for incorrect identifications. Described within a Bayesian framework, the posterior probability that a specific peptide assignment with score \(s\) is correct (denoted by \(+\)) can then be computed as \[ p(+|s) = \frac{p(s|+)p(+)}{p(s)} = \frac{p(s|+)p(+)}{p(s|+)p(+) + p(s|-)p(-)}, \] where the likelihoods \(p(s|+)\) and \(p(s|-)\) are expressed by a Gaussian and Gumbel distribution, respectively. The model learns its parameters in an unsupervised fashion using maximum likelihood estimation computed by the expectation-maximization algorithm. The posterior error probability (PEP) \(p(-|s) = 1 - p(+|s)\) then quantifies the confidence in a single identified spectrum.
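The Bayes computation above can be sketched in a few lines. Note that the distribution parameters and the prior p(+) below are illustrative placeholders only - in IDPEP they are learned from the data via expectation-maximization, not fixed in advance.

```python
import math

def gaussian_pdf(s, mu, sigma):
    """p(s|+): likelihood of score s for a correct PSM (Gaussian component)."""
    return math.exp(-0.5 * ((s - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def gumbel_pdf(s, mu, beta):
    """p(s|-): likelihood of score s for an incorrect PSM (Gumbel component)."""
    z = (s - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta

def posterior_error_probability(s, p_correct, gauss_params, gumbel_params):
    """PEP = p(-|s) = 1 - p(+|s), computed via Bayes' rule on the two-component mixture."""
    weighted_pos = gaussian_pdf(s, *gauss_params) * p_correct
    weighted_neg = gumbel_pdf(s, *gumbel_params) * (1.0 - p_correct)
    return weighted_neg / (weighted_pos + weighted_neg)
```

With the components fitted, a high-scoring PSM receives a PEP near 0 and a low-scoring one a PEP near 1, which is exactly the per-spectrum confidence measure described above.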

OpenMS' tool for PEP estimation leaves room for improvement. The described model currently implemented in OpenMS suffers from several shortcomings. First, it is not guaranteed that PSM scores follow the chosen parametric distributions. Finding appropriate default score distributions that are applicable in the general case proves to be difficult since these depend on the scoring algorithm used [,]. Second, only a single PSM score retrieved from a peptide search engine is taken into account, ignoring other valuable information a search engine might return. The added information gained from incorporating additional PSM scores as well as peptide properties has been shown to result in improved performance [,,]. Addressing these shortcomings in OpenMS promises to lead to improved PEP estimation for peptide search engine results.

PeptideProphet: A family of sophisticated methods for accurate PEP estimation. IDPEP is loosely based on PeptideProphet, a widely used empirical statistical model to estimate the accuracy of peptide identifications []. Having evolved into a whole family of methods, PeptideProphet already offers more sophisticated variants that address the limitations described for OpenMS [,]. To overcome the restrictive parametric forms of the mixture model, Nesvizhskii and colleagues developed a variety of more flexible probability models, all having strengths and limitations of their own [,,,]. Although shown to work reasonably well in the general case, some suffer from potential overfitting [] while others are computationally demanding []. Furthermore, to incorporate more than a single score into the model of PeptideProphet, Keller et al. first summarise multiple quantities into a single discriminant score, preserving most of the intrinsic information, using linear discriminant analysis in a supervised machine learning approach []. There are, however, several difficulties in realizing this approach. First, supervised training requires labelled high-quality MS data of known proteins, which is difficult to obtain. Second, a pre-computed "fixed" discriminant function inferred from a specific dataset might not generalize well to data collected under different experimental conditions.

An alternative state-of-the-art tool for PEP estimation

Semi-supervised learning tailors PEP estimation to the mass spectrometric data at hand. An interesting alternative approach to quantify the confidence in PSM scores has been proposed by Käll et al. [,,]. The general idea is to establish a semi-supervised machine learning method that is dynamically trained on data of a particular MS/MS experiment. This eliminates the need to construct manually curated training sets, while tailoring the model to each specific use case individually. Additionally, there is no need to make parametric assumptions about the underlying score distributions. To achieve this, a support vector machine (SVM) called Percolator is iteratively trained on false matches found in an artificially generated dataset of peptides known to be incorrect (decoy dataset) and a subset of high-scoring PSMs considered to be correct. The SVM then learns to discriminate between correctly and incorrectly identified PSMs. Based on Percolator scores, PEPs are estimated using non-parametric logistic regression [,].
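The semi-supervised loop can be illustrated with a deliberately simplified sketch. A tiny perceptron stands in for Percolator's linear SVM here, and the feature layout (feature 0 being the engine's main score), the top-fraction threshold, and the iteration count are all assumptions for illustration - this is not Percolator's actual implementation.

```python
def dot(w, x):
    """Inner product of a weight vector and a feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train_linear(positives, negatives, epochs=20, lr=0.1):
    """Tiny perceptron standing in for Percolator's linear SVM (illustration only)."""
    w = [0.0] * len(positives[0])
    for _ in range(epochs):
        for x, y in [(p, 1.0) for p in positives] + [(n, -1.0) for n in negatives]:
            if y * dot(w, x) <= 0.0:  # misclassified: nudge the separating hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
    return w

def percolator_style_training(targets, decoys, iterations=3, top_fraction=0.2):
    """Semi-supervised loop: decoy PSMs serve as known negatives, while the
    positive set is re-chosen each round as the top-scoring target PSMs under
    the current model (bootstrapped from the engine's main score, feature 0)."""
    w = None
    for _ in range(iterations):
        ranked = sorted(targets,
                        key=(lambda x: x[0]) if w is None else (lambda x: dot(w, x)),
                        reverse=True)
        positives = ranked[:max(1, int(top_fraction * len(ranked)))]
        w = train_linear(positives, decoys)
    return w
```

The essential point is that the positive training set is never hand-labelled: it is re-derived from the model's own ranking in each round, while the decoys anchor the negative class.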

Enhancing Percolator's performance when a single protein sample has been analysed multiple times. In an attempt to increase the number of statistically significant peptides detected, previous work has mainly focused on combining search results reported by different engines applied to the same set of observed mass spectra [,,,,,]. However, it has become common practice to analyse a protein sample of interest multiple times via replicate MS runs [,], thereby generating highly overlapping datasets used to, for example, reconstruct protein interaction networks [,] or characterise proteomes of model organisms [,,]. Naïvely, replicate datasets can be analysed individually before combining reported results in the final step of a protocol (e.g. by considering the union or intersection of peptides detected in different runs). Even though taking into account replicate runs has been shown to improve peptide inference [,,], there has been little work to leverage information between replicate runs when estimating PEPs [,]. A notable exception is iProphet, which refines an initial PeptideProphet analysis by using information available from highly overlapping datasets []. We here expand on Percolator by encoding novel features that combine evidence across replicate runs for a protein sample that has been analysed multiple times.

Scope of this work

Improve PEP estimation in OpenMS. Since IDPEP has not yet been tested thoroughly - even though it is already fully supported by OpenMS - we first assess IDPEP's ability to accurately estimate PEPs using a rich dataset of known ground truth. To ensure our findings are valid across various mass spectrometric workflows, we evaluate IDPEP in combination with a range of popular peptide search engines including X!Tandem [], MS-GF+ [], and Comet []. Second, we compare IDPEP with Percolator with the intent to replace the former by the latter in OpenMS (in case of superior performance). Finally, we aim to improve Percolator's performance when a protein sample has been analysed multiple times by combining evidence across replicate runs, encoded as novel Percolator features, which take into account different precursor ion charge states and peptide modifications.

Material and Methods

Experimental data. For evaluation purposes, we use MS data produced and published by The Proteome Informatics Research Group (iPRG) for the iPRG 2016/2017 study "Inferring Proteoforms from Bottom-up Proteomics Data" [,]. For the study, different combinations of partially overlapping oligopeptides (protein epitope signature tags (PrESTs)) were spiked into a constant background of E. coli proteins after tryptic digestion, resulting in four distinctive samples: mixture A+B (383 PrESTs), mixture A (192 PrESTs), mixture B (191 PrESTs), and a "blank" sample containing background proteins only. Mixture A+B contains partially overlapping peptides whereas mixtures A and B do not. In addition to raw experimental MS data, iPRG published a database comprising 5592 entries to search experimental spectra against. The database contains the 383 target PrESTs present in the samples, a set of 1000 PrEST-like entrapment sequences absent from the samples, and other E. coli proteins. Each sample was analysed in triplicate by liquid chromatography-MS/MS using higher-energy collisional dissociation (HCD) on a Q Exactive Orbitrap mass spectrometer. In this study, computational analysis is conducted using mixtures A+B, A and B; the blank mixture is ignored due to the lack of true positives.

Peptide search engines. After conversion of the given raw experimental data to mzML format using a conversion tool provided in the ProteoWizard Toolkit [], experimental spectra were searched against the iPRG dataset using peptide search engines X!Tandem (release VENGEANCE (2015.12.15)) [], MS-GF+ (version v20180130) [], and Comet (version 2016013) []. All three search engines considered semi-tryptic peptides only and were constrained to a precursor ion mass tolerance of 15 ppm, which has been shown to be optimal for Orbitrap machines []. For MS-GF+, the experimental setup was mirrored in the parameter settings by specifying the instrument (Q Exactive Orbitrap) and fragmentation method (HCD) used. The remaining parameters were left at their defaults. All engines were run in OpenMS (version 2.4).

Decoy database and search strategy. For each dataset, experimental spectra were searched against the iPRG database and a decoy dataset in a separate target-decoy search strategy. The decoy dataset was derived from the target database simply by reversing each protein sequence.
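Deriving the decoy database can be sketched in a few lines. The "DECOY_" accession prefix below is a common convention assumed for illustration, not something prescribed by the study.

```python
def reversed_decoy_database(target_db):
    """Derive a decoy database by reversing every target protein sequence.
    Accessions get a prefix so decoy hits remain recognisable after the search."""
    return {"DECOY_" + accession: sequence[::-1]
            for accession, sequence in target_db.items()}
```

In the separate search strategy used here, targets and decoys are searched independently; in a concatenated strategy, the target and decoy dictionaries would simply be merged before the search.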

PEP estimation. IDPEP and Percolator (version 3.2) were both run in OpenMS (version 2.4); all parameters were left at their defaults.

Statistics. For comparison of distributions we used either the Kolmogorov-Smirnov (KS) test statistic (when one of the distributions was parametric) or the two-sample KS test (when both tested distributions were empirical). All statistical tests were conducted at a significance level of 5%.
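For reference, the two-sample KS statistic reduces to the maximum gap between two empirical cumulative distribution functions; a minimal stdlib sketch (the full test additionally converts this distance into a p-value, which is omitted here):

```python
import bisect

def ks_distance(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance: the largest absolute difference
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values, x):
        # fraction of observations <= x
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    # the maximum gap is always attained at one of the observed values
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)
```

Identical samples yield a distance of 0, fully separated samples a distance of 1, which is why the KS distance is a convenient measure of how well two score distributions are untangled.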

Results

Percolator outperforms IDPEP

OpenMS' in-house tool IDPEP does not keep pace with the latest developments in state-of-the-art methods for PEP estimation, and has been observed to behave unreliably in practice. In an attempt to find a suitable replacement, we compare IDPEP with a modern, widely used method for PEP estimation, called Percolator [,,], using experimental data from three protein mixtures (three technical replicates each) in combination with three popular peptide search engines (X!Tandem [], MS-GF+ [], and Comet []).

Peptide search engines rank detected PSMs according to a custom scoring scheme, thereby providing a starting point for PEP estimation tools. Although the comparison of search engine performance is not the focus of this work, we here provide a brief overview of their performances which, we believe, makes it easier for the reader to comprehend the subsequent analysis. In a typical database-centred protein identification protocol, the first step involves searching experimental spectra against a set of peptides using a preferred search engine. Scoring schemes employed by search engines to map observed spectra to peptides are diverse, and so are the resulting match rankings, usually spanning from promising high-scoring matches down to low-scoring, potentially incorrect identifications. Applied to the iPRG2016 study dataset, we find that Comet and MS-GF+ produce rankings that are stable across all analysed samples (Fig. 3). In particular, Comet tends to separate correct from incorrect PSMs more clearly than MS-GF+. X!Tandem, however, exhibits unstable behaviour; for one replicate of each protein mixture, X!Tandem assigns scores to PSMs whose distributions, when divided into correct and incorrect matches, are nearly identical (KS distances < 0.12; Fig. 3). Furthermore, regardless of the search engine used, mixture A+B tends to be the most challenging to score accurately, followed by the individual mixtures A and B. Search results reported by an engine are then fed to IDPEP or Percolator, which aim to further untangle correct from incorrect PSMs by estimating the confidence in identified PSMs.

Broken assumptions underlying IDPEP contribute to unsuccessful fitting. Noticeably, IDPEP fails to fit its proposed mixture model to any search engine result regarding protein mixture A+B. Mixture A+B differs from the other mixtures by containing partially overlapping peptides whose pairwise similarity makes accurate peptide-spectrum mapping harder. However, we find that none of the engine score distributions meets the parametric assumptions underlying IDPEP - neither for mixture A+B nor for mixtures A or B (KS test, p-value < 0.0001). Hence, none of the distributions of correct and incorrect matches resemble a Gaussian or Gumbel distribution, respectively, to a significant degree. In the particular case of protein mixture A+B, the score distributions of incorrect identifications deviate more strongly from the assumed Gumbel distribution than those for the individual mixtures (difference in KS distance: 0.01 (MS-GF+), 0.02 (Comet), 0.04 (X!Tandem)). In fact, the deviation seems to be significant enough to prevent successful fitting. In contrast, Percolator demonstrates its ability to handle challenging cases for which IDPEP fails, such as mixture A+B. Since pinpointing incorrect matches seems to be a limiting factor, this might be explained by the fact that Percolator exploits decoy searches, a common technique in the field to learn from identifications known to be incorrect.

Percolator correctly identifies a higher number of PSMs than IDPEP while picking up fewer false positives. For search engines MS-GF+ and Comet (X!Tandem results are discussed separately), Percolator correctly identifies more PSMs than IDPEP (Fig. 4, left). Moreover, improved receiver operating characteristic (ROC) area under the curve (AUC) values (improvement of 0.03-0.05; Fig. 4, right) ensure that higher numbers of correct identifications do not come at the cost of increased numbers of false positives. In particular, we find that Percolator's improvement over IDPEP is greater for search engine results produced by MS-GF+ than Comet (average percentage change at constant PEP threshold 0.01: 16.6% (MS-GF+), 6.3% (Comet)). Since MS-GF+'s main score, as discussed earlier, discriminates less clearly between correct and incorrect identifications than Comet's score (Fig. 3), processing MS-GF+ results is more demanding for IDPEP which, in turn, leaves more room for improvement to Percolator.

A peek under the hood: Percolator's interior scoring scheme significantly discriminates correct from incorrect PSMs. Percolator tackles PEP estimation by applying non-parametric logistic regression to re-evaluated PSMs scored by its SVM. Ideally, correct and incorrect identifications are assigned positive and negative scores, respectively. For experimental spectra preprocessed by MS-GF+ and Comet, incorrect identifications clearly peak below zero whereas the majority of correct identifications lie in the positive range (Fig. 5). Furthermore, we find that all pairwise empirical distributions of correct and incorrect identifications are statistically different from each other (KS test, p-value < 0.0001). Hence, Percolator effectively untangles correct from incorrect PSMs identified by search engines MS-GF+ or Comet.

When presented with challenging input data, Percolator's performance is more stable than IDPEP's. As noted earlier, X!Tandem's ability to discriminate correct from incorrect identifications is limited when compared to MS-GF+ and Comet. Consequently, postprocessing X!Tandem's search results is more challenging for PEP estimation methods. In fact, IDPEP and Percolator both exhibit unstable behaviour when presented with data preprocessed by X!Tandem. However, IDPEP either entirely fails to fit its model to the data (e.g. mixture A+B, as discussed earlier) or results in fits that are so conservative that no correct PSMs are found at PEP thresholds 0.01 or 0.05, both commonly used in practice (Fig. 6, left). Percolator, on the contrary, performs "normally" for two out of three replicates for each sample, and produces results out of the ordinary for a single replicate only (Fig. 6, bottom). However, the wealth of detected true positives comes at the cost of a considerably increased number of false positives, with ROC-AUC values dropping by more than 20% (from 0.83-0.86 to 0.60-0.62; Fig. 6). Percolator's tendency to pick up high numbers of false positives when presented with challenging data might be rooted in its iterative refinement of correct PSMs: if the employed semi-supervised training starts with a poor set of potentially correct identifications as positive examples, the iterative refinement procedure tends to pick up more false PSMs each round. In practice, however, even though Percolator performs poorly for one replicate, it still provides the user with accurate PEPs from the remaining replicates, whereas IDPEP does not return PEPs that are usable in real-world scenarios.

Leveraging information across multiple sibling experiments improves peptide identification

After showing Percolator's superiority over IDPEP, the second aim of this work is to effectively utilise information from multiple MS experiments that analyse the same protein sample. Since Percolator is a machine learning approach with an SVM at its core, integrating new information is straightforward, as it simply requires defining additional features. Inspired by work by Shteynberg et al. on iProphet [], we here introduce five novel Percolator features that are computed across multiple replicate experiments of a given protein sample. In this section, we first motivate and describe each feature, and then analyse their impact on peptide inference.

Novel Percolator features aggregate information from multiple sibling experiments

The proposed features build on a common assumption: If a particular peptide is identified in multiple experiments, it is more likely to be correct. Based on replicate "sibling" experiments, the introduced features (hereinafter referred to as "sibling features") can be divided into two groups: Peptide-based features require sibling PSMs simply to share a common (modified or unmodified) peptide sequence, whereas precursor-based features enforce sibling PSMs to not only share a common peptide but also a common precursor ion (summarised in Table 1).

Table 1: Brief summary of sibling features. A filled dot (●) marks a condition that sibling PSMs are required to satisfy; an open dot (○) marks a condition that is not enforced - note that an unenforced condition can still occur.

Sibling PSMs possess...    ...a common   ...a common      ...different     ...a different
                           peptide       precursor ion    modifications    charge state
Sibling Scores                 ●              ○                ○                 ○
Sibling Modifications          ●              ○                ●                 ○
Sibling Ions                   ●              ○                ○                 ●
Precursor Scores               ●              ●                ○                 ○
Replicate Spectra              ●              ●                ○                 ○

Peptide-based sibling features track a single peptide across multiple datasets. Peptide-based features build on the assumption that multiple identifications of the same peptide across several experiments are more likely to be a signal from the underlying truth than to be an accumulation of random matches coincidentally detecting the same peptide. For a particular PSM, that intuition is captured by summing over the engine's main score of top-scoring PSMs referring to the same peptide found in sibling experiments (Sibling Scores). The chemical or post-translational modification of a peptide is ignored. Hence, Sibling Scores penalise peptides that are either found in very few experiments only or are assigned low scores in several experiments, while pulling up peptides with high scores in many experiments. Furthermore, we define two variations of Sibling Scores, called Sibling Modifications and Sibling Ions, that reward peptides with different mass modifications and ion charges, respectively. Analogous to Sibling Scores, the two variations are defined as the sum over the engine's main score of top-scoring hits in sibling experiments with the additional constraint of showing a different modification or being identified by precursors with different ion charge.
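A sketch of the Sibling Scores computation follows; the PSM data layout (a dictionary from run identifiers to lists of PSM records with "peptide" and "score" fields) is an assumption made for illustration, not the feature's actual implementation.

```python
def sibling_scores(peptide, run_id, runs):
    """Sum of the engine's main score of the top-scoring PSM matching the same
    (unmodified) peptide sequence in every sibling run. The variations Sibling
    Modifications and Sibling Ions would add one more filter each: a different
    modification, or a different precursor charge state, respectively."""
    total = 0.0
    for sibling_run, psms in runs.items():
        if sibling_run == run_id:
            continue  # only sibling experiments contribute
        hits = [psm["score"] for psm in psms if psm["peptide"] == peptide]
        if hits:
            total += max(hits)  # top-scoring hit per sibling run only
    return total
```

A peptide found with high scores in many sibling runs thus accumulates a large feature value, while a peptide found rarely, or only with low scores, is penalised.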

Precursor-based sibling features model multiple identifications of the same precursor ion. For this set of features, we assume that a particular precursor ion observed in multiple experiments is more likely to be correct. Hence, precursor-based statistics require sibling PSMs not only to share a common peptide but also to agree on the particular precursor ion that was matched to that peptide. For any given PSM, the Precursor Score sums over the engine's main score of all hits in sibling runs that share the same peptide (modification ignored) and precursor ion. Note that this feature, as opposed to peptide-based features, is not constrained to top-scoring hits only but picks up every sibling PSM detected. Two PSMs are defined to refer to the same precursor ion if all of the following conditions hold true: (i) the mass-to-charge ratios are within a 10 ppm mass tolerance, (ii) the precursor ion's charge is identical, and (iii) the retention times are within a 1200 s tolerance. Additionally, we introduce another statistic called Replicate Spectra that applies the same idea within a single dataset, i.e. Replicate Spectra is computed for each experiment individually and as such, strictly speaking, does not belong to the family of sibling features.
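The three precursor-matching conditions translate directly into a predicate; the record field names ("mz", "charge", "rt") are assumed here for illustration.

```python
def same_precursor_ion(psm_a, psm_b, mz_tol_ppm=10.0, rt_tol_s=1200.0):
    """True if two PSMs refer to the same precursor ion: (i) m/z values within
    a ppm tolerance, (ii) identical charge state, and (iii) retention times
    within rt_tol_s seconds of each other."""
    mz_within_tol = abs(psm_a["mz"] - psm_b["mz"]) <= psm_a["mz"] * mz_tol_ppm * 1e-6
    return (mz_within_tol
            and psm_a["charge"] == psm_b["charge"]
            and abs(psm_a["rt"] - psm_b["rt"]) <= rt_tol_s)
```

The Precursor Score for a given PSM would then sum the main scores of all sibling PSMs that share its peptide and satisfy this predicate.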

Peptide-based statistics improve Percolator's performance more than precursor-based statistics

Sibling statistics assign higher scores to correctly identified peptides than to peptides absent from a protein sample. Features fed to SVMs are most effective when they discriminate target classes clearly - here, correct and incorrect identifications. In order to investigate whether the proposed sibling features show that behaviour, we compare the means of feature distributions for correct and incorrect identifications across all datasets (Fig. 7). For fair comparison, each feature space has been transformed to the (0,1)-interval using min-max normalization (for analysis purposes only). For all statistics, correct peptide identifications are, on average, assigned higher scores. However, peptide-based features distinguish correct from incorrect identifications more clearly than precursor-based features. Sibling Scores and Sibling Ions in particular seem to be quite effective; since Sibling Ions rewards different charge states whereas Sibling Modifications rewards different modifications, finding sibling PSMs with precursor ions of different charge state seems to be more informative about a peptide's presence in a sample than finding sibling PSMs with peptides of different modifications. Both precursor-based features, on the other hand, seem to be less successful at untangling correct and incorrect PSMs (e.g. Precursor Scores). A major difference between peptide- and precursor-based features is that peptide-based features take into account top-scoring PSMs only whereas precursor-based features sum over all PSMs found in a particular search engine run that meet the defined criteria. We hypothesize that summing over many low-scoring PSMs (i.e. PSMs likely to be incorrect) might result in misleadingly high scores.
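The normalization used for this comparison is a standard min-max rescaling, fit per feature column over all PSMs before the class means are compared:

```python
def min_max_normalize(values):
    """Rescale a feature column onto the (0, 1) interval so that mean feature
    values of correct and incorrect PSMs become comparable across features."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0.0:
        return [0.0 for _ in values]  # constant feature carries no information
    return [(v - lo) / span for v in values]
```

Rescaling removes the otherwise arbitrary differences in feature magnitude (e.g. summed scores versus counts), so the gap between class means becomes a meaningful measure of a feature's discriminative power.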

Leveraging information between multiple technical replicates increases the number of correct PSMs, but fails to detect higher numbers of unique peptides. When compared to Percolator not using information gathered from sibling experiments, we record a substantially higher number of peptides and spectra correctly matched, at least for search engines X!Tandem and Comet; in the case of MS-GF+, the gain is negligible (data not shown). However, even though the number of correct PSMs is increased, the unique set of correctly identified peptides remains unchanged (data not shown). Intuitively, finding an increased number of peptides that have already been identified is compatible with the features' design as they favour high-scoring peptides.

Most of the described performance improvement stems from peptide-based features, whereas precursor-based features contribute very little. To measure the degree of performance contribution of the proposed sibling features, we record the drop in performance when removing a single feature (or, more precisely, a set of features). To this end, taking into account correlations between features becomes necessary since information shared by two correlated features might still be available to the SVM when only one of those features is dropped. Not surprisingly, peptide- and precursor-based features are found to represent separated feature groups that correlate more strongly within their respective group than with features outside it (Fig. 8). Notably, Precursor Scores and Replicate Spectra, i.e. the precursor-based features, are strongly correlated, possibly due to the fact that these are essentially based on the same statistic applied to different but highly overlapping datasets. However, precursor-based features seem to contribute very little to the discrimination of correct and incorrect identifications (Table 2). When data was preprocessed with X!Tandem, integrating precursor-based features even leads to a slight decline in performance by 1.5%. In contrast, we find that leaving out peptide-based features leads to a severe drop in performance for X!Tandem (14.6%) and Comet (10.4%). In combination with MS-GF+, however, leveraging information between experiments using peptide-based features contributes little to improved performance. These findings highlight the difficulty of developing statistics that are effective to a similar degree for a range of search engines.

Table 2: Contribution of peptide-based and precursor-based feature subsets to Percolator's improved performance. For each search engine separately, we record the drop in performance when one of the defined feature subsets is left out. Drop in performance is measured as the percentage change in correct PSMs at a constant PEP threshold of 0.01.
Features dropped         Peptide-based    Precursor-based
X!Tandem & Percolator        -14.6%            +1.5%
MS-GF+ & Percolator           -0.3%            -1.2%
Comet & Percolator           -10.4%            -0.3%

Discussion and Outlook

IDPEP versus Percolator. In this work, we have conducted an in-depth analysis of IDPEP, the PEP estimation tool currently implemented in OpenMS []. Using a dataset of known ground truth, we have shown that the parametric assumptions underlying IDPEP's mixture model are not satisfied when used in combination with three popular peptide search engines (X!Tandem [], MS-GF+ [], and Comet []). In severe cases, IDPEP's heavy reliance on the engine's ability to sufficiently distinguish correct from incorrect identifications leads to unsuccessful or poor fitting, thereby returning a list of PEPs that is virtually unusable. Furthermore, Percolator [,,] has been shown to consistently outperform IDPEP across all analysed datasets and search engines considered. Encouraged by its robustness and reliability, we have subsequently integrated Percolator into the OpenMS framework, replacing IDPEP as the default PEP estimation tool. We thereby provide easy access to Percolator and enable users within the OpenMS community to easily integrate it into their mass spectrometric workflows.

Replicate MS experiments improve peptide identification. The second part of this work focused on improving PEP estimation by leveraging information between replicate runs in cases where a single protein sample has been analysed multiple times. To this end, we proposed novel Percolator features based on peptides or precursor ions shared between PSMs reported by different database searches of the same engine. For two of the three search engines, X!Tandem and Comet, peptide-based features have been shown to greatly increase the number of correctly identified PSMs. The Sibling Scores and Sibling Ions features in particular showed high discriminative power between correct and incorrect identifications, which is in accordance with findings for iProphet, a tool using similar sibling statistics []. In contrast, while iProphet reports an increase in correct PSMs at a constant false discovery rate when accounting for precursor-identical PSMs, we find little contribution of those features towards Percolator's improved performance. However, even though iProphet's sibling statistics and ours are similarly motivated, a direct comparison is difficult since they are defined and implemented in different ways and embedded in different frameworks.
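To illustrate the idea behind the peptide-based sibling features, the sketch below aggregates, for a given PSM, the best engine score its peptide sequence achieved in each of the other replicate runs. The aggregation by summed per-run maxima and the function name are assumptions for illustration only, not the exact feature definitions used in this work:

```python
# Illustrative sketch of a peptide-based sibling feature: for each PSM,
# aggregate the best search-engine score that the same peptide sequence
# achieved in the other replicate runs. The summed-maxima aggregation is
# a hypothetical choice made for this example.

def sibling_score(psms_by_run, run_id, peptide):
    """Sum the best score the peptide obtained in every sibling run."""
    total = 0.0
    for other_run, psms in psms_by_run.items():
        if other_run == run_id:
            continue  # skip the run the PSM itself comes from
        scores = [score for (pep, score) in psms if pep == peptide]
        if scores:
            total += max(scores)
    return total

# Three replicate searches of the same sample: (peptide, engine score).
runs = {
    "rep1": [("PEPTIDEK", 3.2), ("LLGNVLVK", 1.1)],
    "rep2": [("PEPTIDEK", 2.8), ("AVGDKR", 0.9)],
    "rep3": [("PEPTIDEK", 3.0)],
}
print(sibling_score(runs, "rep1", "PEPTIDEK"))  # 5.8
```

A peptide seen in only one run receives a sibling score of zero, which encodes the assumption that repeated observation across runs is evidence of correctness.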

Implicit bias towards high-abundance proteins. Sibling features build on the general assumption that peptides identified in multiple runs are more likely to be correct. This assumption holds when proteins of the same concentration are spiked into a common background. However, real mixtures often contain proteins whose abundances differ by orders of magnitude. In such cases, especially in conjunction with data-dependent acquisition of MS/MS spectra, the overlap between replicate runs is mostly restricted to high-abundance proteins []. Consequently, sibling features might fail to pick up low-abundance peptides. It is therefore essential to take mixture composition and the technical approach used into consideration when deciding on a strategy for handling multiple replicate MS runs.

Finding optimal Percolator configurations that are effective for a wide range of mass spectrometric workflows is challenging. Even though leveraging information between replicate runs has been shown to improve PSM identification, its effectiveness differs between the search engines used as preprocessors. While X!Tandem and Comet combined with Percolator respond well to the newly implemented features, there is little gain in the case of MS-GF+. These findings highlight how the variety of search engines that can be used upstream of Percolator complicates the development of a universally effective feature set. In view of the complexity and variability of mass spectrometric workflows, a one-size-fits-all configuration for Percolator might simply not exist. However, Percolator can easily be adapted to a particular search engine, since integrating information is as simple as defining new features. To effectively tailor the proposed features to a particular search engine, one might consider taking into account more than the engine's main score alone. Instead, one can use carefully designed engine-specific functions that distinguish correct from incorrect identifications as clearly as possible by combining the various pieces of information an engine returns, similar to what has been done in PeptideProphet []. This approach might improve PEP estimation for MS-GF+, but could be beneficial for other search engines as well.
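Such an engine-specific discriminant could, for instance, be a weighted combination of several scores the engine reports, in the spirit of PeptideProphet's F-score. The sketch below combines an X!Tandem-style hyperscore and delta score; the score set, the log scaling, and the weights are all hypothetical and would in practice be chosen or fitted per engine:

```python
import math

# Sketch of an engine-specific discriminant: instead of feeding only the
# engine's main score into the sibling features, combine several reported
# scores into one value that separates correct from incorrect PSMs more
# cleanly. Weights and score choices below are hypothetical placeholders.

def discriminant(hyperscore, delta_score, w0=-4.0, w1=0.15, w2=2.0):
    """Weighted combination of X!Tandem-style scores (log-scaled delta)."""
    return w0 + w1 * hyperscore + w2 * math.log1p(delta_score)

print(round(discriminant(40.0, 5.0), 2))  # 5.58
```

The resulting single value can then replace the raw main score inside the sibling features without changing their definitions.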

From peptide to protein inference. Peptide inference is a critical step in a shotgun proteomics pipeline. In most cases, however, one is ultimately interested in protein inference, and peptide inference is more of a necessity imposed by the shotgun approach. In a typical mass spectrometric workflow, protein inference is based on the unique set of peptides identified from experimental spectra. In agreement with analysis results for iProphet [], sibling features failed to increase the number of correctly identified unique peptides. Nevertheless, even though it is not the common use case, there are scenarios in which one is interested in the peptides themselves, e.g. for the identification of expressed sequence tags to correct genome annotations [], or for the collection of proteomic data in repositories such as the PeptideAtlas database [].

Extension of the proposed framework to other highly overlapping datasets generated by approaches other than simple experiment replication. Essentially, simultaneously analysed MS datasets need to fulfil only a single condition: they must refer to the same protein sample, i.e. they are assumed to be observed evidence regarding the same (unknown) truth. This holds not only for technical replicates of an experiment, as used in this study, but also for datasets generated by, for example, analysing a sample multiple times on a range of mass spectrometers, or by searching a set of spectra observed in the same experiment against a common peptide database using several peptide search engines. Regardless of the specific technique used, the resulting datasets tend to be highly overlapping while still offering some individual information [], and as such are suited for the proposed analysis leveraging information between sibling datasets of PSMs. In the case of replicates generated by different machines, Percolator extended with sibling features can be applied directly. However, when using different peptide search engines, one needs to be careful about combining scores retrieved from different engines due to their heterogeneity [].
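One simple, assumption-level way to mitigate this score heterogeneity is rank-based normalization: mapping each score to its within-engine quantile before computing cross-engine sibling features. The sketch below shows the idea (ties are not handled here, which a real implementation would need to address):

```python
# Rank-based score normalization: map each engine's raw scores to
# quantiles in (0, 1) so that scores from different search engines
# become comparable on a common scale. Ties are ignored in this sketch.

def rank_normalize(scores):
    """Map scores to (rank + 1) / (n + 1) quantiles in (0, 1)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    quantiles = [0.0] * len(scores)
    for rank, i in enumerate(order):
        quantiles[i] = (rank + 1) / (len(scores) + 1)
    return quantiles

print(rank_normalize([10.0, 2.5, 7.1]))  # [0.75, 0.25, 0.5]
```

After normalization, each engine's score distribution is uniform on the same interval, so sibling features computed across engines no longer mix incomparable scales, though calibrated probabilities would still be the more principled common currency.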