Streaming FDR Calculation for Protein Identification

Zoun, Roman; Schallert, Kay; Janki, Atin; Ravindran, Rohith; Campero Durand, Gabriel; Fenske, Wolfram; Broneske, David; Heyer, Robert; Benndorf, Dirk; Saake, Gunter

doi:10.1007/978-3-030-00063-9_10

Part of the book series: Communications in Computer and Information Science (CCIS, volume 909)

Included in the following conference series:

European Conference on Advances in Databases and Information Systems

Abstract

Identification of proteins is a key step in metaproteomics research. This protein identification task should be migrated to a fast data streaming architecture to increase horizontal scalability and performance. A protein database search involves two steps: the pairwise matching of experimental spectra against protein sequences, which creates peptide-spectrum matches (PSMs), and the statistical validation of those PSMs. Peptide-spectrum matching is inherently parallelizable because each match is independent. However, measurement errors and artifacts make false positive matches inherent to this method, so statistical validation is required. State-of-the-art validation uses the target-decoy method, which estimates the false discovery rate (FDR) by searching against a shuffled version of the original protein database. In contrast to the protein database search, validation by target-decoy is not parallelizable, because the FDR approximation requires all experimental data at once. In short, when a fast data architecture is used for the workflow, the target-decoy approach is no longer feasible. Hence, a novel approach is required to avoid false discovery of PSMs on streaming, single-pass experimental data. The recently proposed nokoi classifier is a promising candidate for solving these problems. In this paper, we present a general nokoi pipeline to create such a decoy-free classifier, which reaches over 95% accuracy on general metaproteomics data.
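To illustrate why target-decoy validation needs the complete result set, the following is a minimal sketch, not the authors' implementation. It assumes each PSM is reduced to a hypothetical `(score, is_decoy)` pair and uses one common estimator, the number of decoy hits divided by the number of target hits above a score threshold. The function names `target_decoy_fdr` and `score_cutoff` and the toy scores are illustrative assumptions only.

```python
from typing import List, Optional, Tuple

# Hypothetical representation: (search-engine score, True if the hit is a decoy)
PSM = Tuple[float, bool]


def target_decoy_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate the FDR among PSMs scoring at or above `threshold`.

    Assumed estimator: decoy hits / target hits (one common convention).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


def score_cutoff(psms: List[PSM], max_fdr: float = 0.01) -> Optional[float]:
    """Return the lowest score threshold whose estimated FDR is <= max_fdr.

    Every candidate threshold is evaluated against the full list of PSMs,
    which is why this step cannot be applied to one PSM at a time.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if target_decoy_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


# Toy data, invented for illustration only
psms = [(90.0, False), (80.0, False), (70.0, True), (60.0, False), (50.0, False)]
cutoff = score_cutoff(psms, max_fdr=0.01)
accepted = [p for p in psms if cutoff is not None and p[0] >= cutoff and not p[1]]
```

Because the cutoff is derived from the score distribution of all target and decoy hits, a PSM arriving in a stream cannot be accepted or rejected until the whole experiment has been searched. A decoy-free classifier instead decides on each PSM from its own features.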

Supported by de.NBI.

Notes

1. Data sizes: BIOGAS1 (5984 PSMs), BIOGAS2 (8367 PSMs), BIOGAS3 (8921 PSMs).
2. Data sizes: GUT1 (4819 PSMs), GUT2 (2317 PSMs), GUT3 (2685 PSMs).

Acknowledgment

The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the European Regional Development Fund (grant no.: 11.000sz00.00.0 17 114347 0), the DFG (grant no.: SA 465/50-1), and the German Federal Ministry of Food and Agriculture (grant no.: 22404015), and is dedicated to the memory of Mikhail Zoun.

Author information

Authors and Affiliations

Working Group Databases and Software Engineering, University of Magdeburg, Magdeburg, Germany
Roman Zoun, Atin Janki, Rohith Ravindran, Gabriel Campero Durand, Wolfram Fenske, David Broneske & Gunter Saake
Chair of Bioprocess Engineering, University of Magdeburg, Magdeburg, Germany
Kay Schallert, Robert Heyer & Dirk Benndorf

Corresponding author

Correspondence to Roman Zoun.

Editor information

Editors and Affiliations

Eötvös Loránd University, Budapest, Hungary
András Benczúr
Abt. Informatik, Universität Kiel, Kiel, Germany
Bernhard Thalheim
Eötvös Loránd University, Budapest, Hungary
Tomáš Horváth
Politecnico di Torino, Turin, Italy
Silvia Chiusano
Polytechnic University of Turin, Turin, Italy
Tania Cerquitelli
Hungarian Academy of Sciences, Budapest, Hungary
Csaba Sidló
University of Nebraska–Lincoln, Lincoln, NE, USA
Peter Z. Revesz

Cite this paper

Zoun, R. et al. (2018). Streaming FDR Calculation for Protein Identification. In: Benczúr, A., et al. New Trends in Databases and Information Systems. ADBIS 2018. Communications in Computer and Information Science, vol 909. Springer, Cham. https://doi.org/10.1007/978-3-030-00063-9_10

DOI: https://doi.org/10.1007/978-3-030-00063-9_10
Published: 31 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00062-2
Online ISBN: 978-3-030-00063-9
eBook Packages: Computer Science, Computer Science (R0)