Streaming FDR Calculation for Protein Identification

  • Conference paper
New Trends in Databases and Information Systems (ADBIS 2018)

Abstract

Identification of proteins is a key step in metaproteomics research. Migrating this protein identification task to a fast data streaming architecture would increase horizontal scalability and performance. A protein database search involves two steps: the pairwise matching of experimental spectra against protein sequences, which creates peptide-spectrum matches (PSMs), and the statistical validation of those PSMs. Peptide-spectrum matching is inherently parallelizable because each match is independent. However, false positive matches are inherent to this method due to measurement errors and artifacts, so statistical validation is required. State-of-the-art validation uses the target-decoy method, which estimates the false discovery rate (FDR) by searching against a shuffled version of the original protein database. In contrast to the protein database search, validation by target-decoy is not parallelizable, because the FDR approximation requires all experimental data at once. In short, the target-decoy approach is no longer feasible in a fast data architecture. Hence, a novel approach is required to avoid false discovery of PSMs on streaming, single-pass experimental data. The recently proposed nokoi classifier seems promising for this purpose. In this paper, we present a general nokoi pipeline to create such a decoy-free classifier, which reaches over 95% accuracy on general metaproteomics data.
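To illustrate why target-decoy validation needs the complete data set, the following minimal Python sketch estimates the FDR at a score threshold and then derives a score cutoff for a desired FDR. It is not the authors' implementation. It assumes one common estimator (number of decoy hits divided by number of target hits at or above the threshold), and the function names and the `(score, is_decoy)` tuple layout are hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

# A PSM is represented here as (score, is_decoy); this layout is an
# assumption made for illustration only.
PSM = Tuple[float, bool]


def target_decoy_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    Uses the simple estimator decoys / targets (an assumption; other
    variants of the target-decoy estimator exist).
    """
    targets = sum(1 for score, is_decoy in psms
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def score_cutoff_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr.

    Every candidate threshold is evaluated against the full PSM list,
    which is why this step cannot run on a single-pass stream.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if target_decoy_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    example = [(90.0, False), (80.0, False), (70.0, True),
               (60.0, False), (50.0, True)]
    print(score_cutoff_for_fdr(example, 0.05))
```

Because the cutoff depends on the counts of targets and decoys over all scores, adding one more PSM can change the cutoff for every PSM seen earlier. A decoy-free classifier avoids this by deciding on each PSM independently.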

Supported by de.NBI.

Notes

  1. Data sizes: BIOGAS1 (5984 PSMs), BIOGAS2 (8367 PSMs), BIOGAS3 (8921 PSMs).

  2. Data sizes: GUT1 (4819 PSMs), GUT2 (2317 PSMs), GUT3 (2685 PSMs).


Acknowledgment

The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the European Regional Development Fund (grant no.: 11.000sz00.00.0 17 114347 0), the DFG (grant no.: SA 465/50-1), and the German Federal Ministry of Food and Agriculture (grant no.: 22404015). It is dedicated to the memory of Mikhail Zoun.

Author information

Corresponding author

Correspondence to Roman Zoun.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zoun, R. et al. (2018). Streaming FDR Calculation for Protein Identification. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_10

  • DOI: https://doi.org/10.1007/978-3-030-00063-9_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00062-2

  • Online ISBN: 978-3-030-00063-9

  • eBook Packages: Computer Science (R0)
