Protein Identification as a Suitable Application for Fast Data Architecture

Zoun, Roman; Durand, Gabriel Campero; Schallert, Kay; Patrikar, Apoorva; Broneske, David; Fenske, Wolfram; Heyer, Robert; Benndorf, Dirk; Saake, Gunter

doi:10.1007/978-3-319-99133-7_14


Conference paper
First Online: 07 August 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 903)

Abstract

Metaproteomics is a field of biological research that relies on mass spectrometry to characterize the protein complement of microbiological communities. Since only identified data can be analyzed, identification algorithms such as X!Tandem, OMSSA and Mascot are essential for gaining insight into the biological experimental data. However, protein identification software was developed for proteomics. Metaproteomics, in contrast, involves large biological communities, gigabytes of experimental data per sample, and a greater number of comparisons, given the mixed culture of species in the protein database. Furthermore, the file-based nature of current protein identification tools makes them ill-suited for future metaproteomics research. In addition, possible medical use cases of metaproteomics require near real-time identification. From a technology perspective, Fast Data seems promising for increasing the throughput and performance of protein identification in a metaproteomics workflow. In this paper, we analyze the core functions of the established protein identification engine X!Tandem and show that streaming Fast Data architectures are suitable for protein identification. Furthermore, we point out the bottlenecks of the current algorithms and show how our approach removes them.

Supported by de.NBI.

Notes

1. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information [2].
2. A mutation in the protein sequence changes an amino acid.
3. Spark, Mesos, Akka, Cassandra and Kafka as a streaming pipeline for real-time data processing [13].
4. CPU: 48x Intel Xeon E5-2650 v4; 512 GB RAM.

References

Ahmad, Y., Çetintemel, U.: Streaming applications. In: Liu, L., Tamer Özsu, M. (eds.) Encyclopedia of Database Systems, pp. 2847–2848. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-39940-9_374
Apweiler, R., et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, 115–119 (2004)
Balgley, B.M., Laudeman, T., Yang, L., Song, T., Lee, C.S.: Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy. Mol. Cell. Proteomics 6(9), 1599–1608 (2007)
Banerjee, S., Mazumdar, S.: Electrospray ionization mass spectrometry: a technique to access the information beyond the molecular weight of the analyte. Int. J. Anal. Chem. 2012 (2012). https://doi.org/10.1155/2012/282574
Baumgardner, L., Shanmugam, A., Lam, H., Eng, J., Martin, D.: Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J. Proteome Res. (2011). https://doi.org/10.1021/pr200074h
National Center for Biotechnology Information: Fasta format, November 2002. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
Cottrell, J.S., London, U.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999)
Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17(20), 2310–2316 (2003). https://doi.org/10.1002/rcm.1198
Deutsch, E.W.: File formats commonly used in mass spectrometry proteomics. Mol. Cell. Proteomics 11(12), 1612–1621 (2012)
Duncan, M.W., Aebersold, R., Caprioli, R.M.: The pros and cons of peptide-centric proteomics. Nat. Biotechnol. (2010). https://doi.org/10.1038/nbt0710-659
Elias, J., Gygi, S.: Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 604, 55–71 (2010). https://doi.org/10.1007/978-1-60761-444-9_5
Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)
Estrada, R.: Fast Data Processing Systems with SMACK Stack. Packt Publishing, Birmingham (2016)
Griss, J., et al.: Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods (2016). https://doi.org/10.1038/nmeth.3902
Heyer, R., Kohrs, F., Reichl, U., Benndorf, D.: Metaproteomics of complex microbial communities in biogas plants. Microb. Biotechnol. 8 (2015). https://doi.org/10.1111/1751-7915.12276
Seidler, J., Zinn, N., Boehm, M.E., Lehmann, W.D.: De novo sequencing of peptides by MS/MS. Proteomics (2009). https://doi.org/10.1002/pmic.200900459
Kipf, A., Pandey, V., Boettcher, J., Braun, L., Neumann, T., Kemper, A.: Analytics on fast data: main-memory database systems versus modern streaming systems. In: 20th International Conference on Extending Database Technology (2017)
Kokaly, R., et al.: USGS spectral library version 7. Technical report, U.S. Geological Survey Data Series 1035 (2017). https://doi.org/10.3133/ds1035
Lubeck, M., et al.: PASEF™ on a timsTOF Pro defines new performance standards for shotgun proteomics with dramatic improvements in MS/MS data acquisition rates and sensitivity. Technical report, Bruker Daltonik GmbH (2017)
Maron, P.A., Ranjard, L., Mougel, C., Lemanceau, P.: Metaproteomics: a new approach for studying functional microbial ecology. Microb. Ecol. 53, 486–493 (2007)
McDonald, W.H., et al.: MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun. Mass Spectrom. 18(18), 2162–2168 (2004). https://doi.org/10.1002/rcm.1603
Millioni, R., Franchin, C., Tessari, P., Polati, R., Cecconi, D., Arrigoni, G.: Pros and cons of peptide isolectric focusing in shotgun proteomics. J. Chromatogr. A 1293, 1–9 (2013). https://doi.org/10.1016/j.chroma.2013.03.073
Ondov, B.D., Bergman, N.H., Phillippy, A.M.: Interactive metagenomic visualization in a web browser. BMC Bioinform. 12(1), 385 (2011). https://doi.org/10.1186/1471-2105-12-385
Petriz, B.A., Franco, O.L.: Metaproteomics as a complementary approach to gut microbiota in health and disease. Front. Chem. (2017). https://doi.org/10.3389/fchem.2017.00004
Pratt, B., Howbert, J.J., Tasman, N.I., Nilsson, E.J.: MR-Tandem: parallel X!Tandem using hadoop mapreduce on Amazon web services. Bioinformatics (2012). https://doi.org/10.1093/bioinformatics/btr615
Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17, 2310–2316 (2003)
Matrix Science: Data file format (2016). http://www.matrixscience.com/help/data_file_help.html
Wampler, D.: Fast data: big data evolved. White Paper (2015)
Wampler, D.: Fast Data Architectures for Streaming Applications, 1st edn. O’Reilly Media, Sebastopol (2016)
Zhang, J., Liang, Y., Yau, P., Pandey, R., Harpalani, S.: A metaproteomic approach for identifying proteins in anaerobic bioreactors converting coal to methane. Int. J. Coal Geol. 146, 91–103 (2015)
Zoun, R., Schallert, K., Broneske, D., Heyer, R., Benndorf, D., Saake, G.: Interactive chord visualization for metaproteomics. In: 28th International Workshop on Database and Expert Systems Applications (DEXA), pp. 79–83, August 2017. https://doi.org/10.1109/DEXA.2017.32

Acknowledgment

The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the DFG (grant no. SA 465/50-1), the European Regional Development Fund (grant no. 11.000sz00.00.017 114347 0), and the German Federal Ministry of Food and Agriculture (grant no. 22404015). It is dedicated to the memory of Mikhail Zoun.

Author information

Authors and Affiliations

Working Group Databases and Software Engineering, University of Magdeburg, 39104, Magdeburg, Germany
Roman Zoun, Gabriel Campero Durand, David Broneske, Wolfram Fenske & Gunter Saake
Chair of Bioprocess Engineering, University of Magdeburg, 39104, Magdeburg, Germany
Kay Schallert, Robert Heyer & Dirk Benndorf
Accenture GmbH, Kronberg im Taunus, Germany
Apoorva Patrikar

Corresponding author

Correspondence to Roman Zoun.

Editor information

Editors and Affiliations

University of Tunis, Tunis, Tunisia
Mourad Elloumi
MiCS, Media Computer Science, University of Passau, Passau, Bayern, Germany
Michael Granitzer
IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
University of Twente, Enschede, Overijssel, The Netherlands
Christin Seifert
Fak. Medien, Bauhaus Universität Weimar, Weimar, Thüringen, Germany
Benno Stein
Inst. für Softwaretechnik, Vienna University of Technology, Vienna, Austria
A Min Tjoa
FAW, Johannes Kepler University of Linz, Linz, Austria
Roland Wagner

Cite this paper

Zoun, R. et al. (2018). Protein Identification as a Suitable Application for Fast Data Architecture. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_14

DOI: https://doi.org/10.1007/978-3-319-99133-7_14
Published: 07 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)
