Abstract
Metaproteomics is a field of biological research that relies on mass spectrometry to characterize the protein complement of microbial communities. Since only identified data can be analyzed, identification algorithms such as X!Tandem, OMSSA and Mascot are essential for gaining insight into the experimental data. However, protein identification software was developed for proteomics. Metaproteomics, in contrast, involves large biological communities, gigabytes of experimental data per sample, and a greater number of comparisons, because the protein database covers a mixed culture of species. Furthermore, the file-based nature of current protein identification tools makes them ill-suited for future metaproteomics research. In addition, possible medical use cases of metaproteomics require near real-time identification. From a technology perspective, Fast Data seems promising for increasing the throughput and performance of protein identification in a metaproteomics workflow. In this paper, we analyze the core functions of the established protein identification engine X!Tandem and show that streaming Fast Data architectures are suitable for protein identification. Furthermore, we point out the bottlenecks of the current algorithms and how our approach removes them.
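The following minimal Python sketch illustrates why database-search identification lends itself to streaming: each spectrum can be matched against the peptide index independently of all others, so results can be emitted as spectra arrive instead of after a whole file has been read. It is not X!Tandem's algorithm and not the architecture of this paper. The function names (`digest`, `peptide_mass`, `identify_stream`), the input shapes, and the 0.02 Da tolerance are assumptions made for illustration, and the sketch only filters candidates by precursor mass without scoring fragment ions.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Monoisotopic residue masses in Dalton for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the summed residue masses


def digest(protein: str) -> List[str]:
    """Simplified tryptic digest: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        is_last = i == len(protein) - 1
        if is_last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def identify_stream(
    spectra: Iterable[Tuple[str, float]],
    proteins: Dict[str, str],
    tolerance: float = 0.02,  # hypothetical precursor tolerance in Da
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (spectrum id, candidate peptides) one spectrum at a time.

    `spectra` is any iterable of (id, neutral precursor mass) pairs, for
    example a generator fed by a message queue (assumed input shape).
    """
    # Build the peptide index once; it is shared by all incoming spectra.
    index = [
        (peptide_mass(pep), pep)
        for sequence in proteins.values()
        for pep in digest(sequence)
    ]
    for spectrum_id, precursor_mass in spectra:
        # Each spectrum is handled independently, so this loop body could
        # run on any worker of a streaming system.
        hits = [pep for mass, pep in index
                if abs(mass - precursor_mass) <= tolerance]
        yield spectrum_id, hits


if __name__ == "__main__":
    toy_db = {"P1": "GASK", "P2": "AKRPGK"}  # made-up sequences
    incoming = iter([("s1", 361.196), ("s2", 500.0)])
    for spectrum_id, candidates in identify_stream(incoming, toy_db):
        print(spectrum_id, candidates)
```

Because `identify_stream` is a generator, the first result is available as soon as the first spectrum has been consumed. A real engine would additionally score fragment ions and handle modifications and charge states.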
Supported by de.NBI.
Notes
1. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information [2].
2. A mutation in the protein sequence changes an amino acid.
3. Spark, Mesos, Akka, Cassandra and Kafka (SMACK) as a streaming pipeline for real-time data processing [13].
4. CPU: 48x Intel Xeon E5-2650 v4; 512 GB RAM.
References
Ahmad, Y., Çetintemel, U.: Streaming applications. In: Liu, L., Tamer Özsu, M. (eds.) Encyclopedia of Database Systems, pp. 2847–2848. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-39940-9_374
Apweiler, R., et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, 115–119 (2004)
Balgley, B.M., Laudeman, T., Yang, L., Song, T., Lee, C.S.: Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy. Mol. Cell. Proteomics 6(9), 1599–1608 (2007)
Banerjee, S., Mazumdar, S.: Electrospray ionization mass spectrometry: a technique to access the information beyond the molecular weight of the analyte. Int. J. Anal. Chem. 2012 (2012). https://doi.org/10.1155/2012/282574
Baumgardner, L., Shanmugam, A., Lam, H., Eng, J., Martin, D.: Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J. Proteome Res. (2011). https://doi.org/10.1021/pr200074h
National Center for Biotechnology Information: Fasta format, November 2002. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp
Cottrell, J.S., London, U.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999)
Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17(20), 2310–2316 (2003). https://doi.org/10.1002/rcm.1198
Deutsch, E.W.: File formats commonly used in mass spectrometry proteomics. Mol. Cell. Proteomics 11(12), 1612–1621 (2012)
Duncan, M.W., Aebersold, R., Caprioli, R.M.: The pros and cons of peptide-centric proteomics. Nat. Biotechnol. (2010). https://doi.org/10.1038/nbt0710-659
Elias, J., Gygi, S.: Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 604, 55–71 (2010). https://doi.org/10.1007/978-1-60761-444-9_5
Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)
Estrada, R.: Fast Data Processing Systems with SMACK Stack. Packt Publishing, Birmingham (2016)
Griss, J., et al.: Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods (2016). https://doi.org/10.1038/nmeth.3902
Heyer, R., Kohrs, F., Reichl, U., Benndorf, D.: Metaproteomics of complex microbial communities in biogas plants. Microb. Biotechnol. 8 (2015). https://doi.org/10.1111/1751-7915.12276
Seidler, J., Zinn, N., Boehm, M.E., Lehmann, W.D.: De novo sequencing of peptides by MS/MS. Proteomics (2009). https://doi.org/10.1002/pmic.200900459
Kipf, A., Pandey, V., Boettcher, J., Braun, L., Neumann, T., Kemper, A.: Analytics on fast data: main-memory database systems versus modern streaming systems. In: 20th International Conference on Extending Database Technology (2017)
Kokaly, R., et al.: USGS spectral library version 7. Technical report, U.S. Geological Survey Data Series 1035 (2017). https://doi.org/10.3133/ds1035
Lubeck, M., et al.: PASEF™ on a timsTOF Pro defines new performance standards for shotgun proteomics with dramatic improvements in MS/MS data acquisition rates and sensitivity. Technical report, Bruker Daltonik GmbH (2017)
Maron, P.A., Ranjard, L., Mougel, C., Lemanceau, P.: Metaproteomics: a new approach for studying functional microbial ecology. Microb. Ecol. 53, 486–493 (2007)
McDonald, W.H., et al.: MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun. Mass Spectrom. 18(18), 2162–2168 (2004). https://doi.org/10.1002/rcm.1603
Millioni, R., Franchin, C., Tessari, P., Polati, R., Cecconi, D., Arrigoni, G.: Pros and cons of peptide isolectric focusing in shotgun proteomics. J. Chromatogr. A 1293, 1–9 (2013). https://doi.org/10.1016/j.chroma.2013.03.073
Ondov, B.D., Bergman, N.H., Phillippy, A.M.: Interactive metagenomic visualization in a web browser. BMC Bioinform. 12(1), 385 (2011). https://doi.org/10.1186/1471-2105-12-385
Petriz, B.A., Franco, O.L.: Metaproteomics as a complementary approach to gut microbiota in health and disease. Front. Chem. (2017). https://doi.org/10.3389/fchem.2017.00004
Pratt, B., Howbert, J.J., Tasman, N.I., Nilsson, E.J.: MR-Tandem: parallel X!Tandem using hadoop mapreduce on Amazon web services. Bioinformatics (2012). https://doi.org/10.1093/bioinformatics/btr615
Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17, 2310–2316 (2003)
Matrix Science: Data file format (2016). http://www.matrixscience.com/help/data_file_help.html
Wampler, D.: Fast data: big data evolved. White Paper (2015)
Wampler, D.: Fast Data Architectures for Streaming Applications, 1st edn. O’Reilly Media, Sebastopol (2016)
Zhang, J., Liang, Y., Yau, P., Pandey, R., Harpalani, S.: A metaproteomic approach for identifying proteins in anaerobic bioreactors converting coal to methane. Int. J. Coal Geol. 146, 91–103 (2015)
Zoun, R., Schallert, K., Broneske, D., Heyer, R., Benndorf, D., Saake, G.: Interactive chord visualization for metaproteomics. In: 28th International Workshop on Database and Expert Systems Applications (DEXA), pp. 79–83, August 2017. https://doi.org/10.1109/DEXA.2017.32
Acknowledgment
The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the DFG (grant no. SA 465/50-1), the European Regional Development Fund (grant no. 11.000sz00.00.017 114347 0) and the German Federal Ministry of Food and Agriculture (grant no. 22404015), and is dedicated to the memory of Mikhail Zoun.
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
Zoun, R. et al. (2018). Protein Identification as a Suitable Application for Fast Data Architecture. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_14
DOI: https://doi.org/10.1007/978-3-319-99133-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99132-0
Online ISBN: 978-3-319-99133-7
eBook Packages: Computer Science, Computer Science (R0)