Protein Identification as a Suitable Application for Fast Data Architecture

  • Conference paper

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 903))

Abstract

Metaproteomics is a field of biology research that relies on mass spectrometry to characterize the protein complement of microbiological communities. Since only identified data can be analyzed, identification algorithms such as X!Tandem, OMSSA and Mascot are essential in the domain for gaining insight into the biological experimental data. However, protein identification software has been developed for proteomics. Metaproteomics, in contrast, involves large biological communities, gigabytes of experimental data per sample, and a greater number of comparisons, given the mixed culture of species in the protein database. Furthermore, the file-based nature of current protein identification tools makes them ill-suited for future metaproteomics research. In addition, possible medical use cases of metaproteomics require near real-time identification. From a technology perspective, Fast Data seems promising for increasing the throughput and performance of protein identification in a metaproteomics workflow. In this paper we analyze the core functions of the established protein identification engine X!Tandem and show that streaming Fast Data architectures are suitable for protein identification. Furthermore, we point out the bottlenecks of the current algorithms and how to remove them with our approach.
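To illustrate why identification lends itself to a streaming model, the following minimal sketch treats each spectrum as an independent event that is matched against an in-memory peptide index. It is not X!Tandem's algorithm and not the architecture proposed in the paper: it only filters candidate peptides by precursor mass and omits fragment-ion scoring entirely. The function names (`digest`, `peptide_mass`, `build_index`, `stream_identify`), the tolerance value and the simplified trypsin rule are assumptions made for illustration.

```python
import re
from bisect import bisect_left, bisect_right
from typing import Iterable, Iterator, List, Tuple

# Standard monoisotopic residue masses in Dalton.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # mass of H2O added to the residue sum


def digest(protein: str) -> List[str]:
    """Simplified in-silico tryptic digest (assumed rule):
    cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_index(peptides: Iterable[str]) -> List[Tuple[float, str]]:
    """Mass-sorted peptide index, built once and kept in memory."""
    return sorted((peptide_mass(p), p) for p in set(peptides))


def stream_identify(
    spectra: Iterable[Tuple[str, float]],
    index: List[Tuple[float, str]],
    tol: float = 0.01,  # hypothetical precursor tolerance in Dalton
) -> Iterator[Tuple[str, List[str]]]:
    """Yield (spectrum_id, candidate peptides) one spectrum at a time.

    Each spectrum is handled independently of all others, so `spectra`
    can be an unbounded stream rather than a complete file.
    """
    masses = [m for m, _ in index]
    for spectrum_id, precursor_mass in spectra:
        lo = bisect_left(masses, precursor_mass - tol)
        hi = bisect_right(masses, precursor_mass + tol)
        yield spectrum_id, [p for _, p in index[lo:hi]]
```

Because `stream_identify` is a generator that never looks at more than one spectrum, the same per-event logic could in principle be placed behind a message queue or a stream processor; the choice of framework is outside the scope of this sketch.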

Supported by de.NBI.

Notes

  1. The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information [2].

  2. A mutation in the protein sequence changes an amino acid.

  3. Spark, Mesos, Akka, Cassandra and Kafka (the SMACK stack) as a streaming pipeline for real-time data processing [13].

  4. CPU: 48x Intel Xeon E5-2650 v4; 512 GB RAM.

References

  1. Ahmad, Y., Çetintemel, U.: Streaming applications. In: Liu, L., Tamer Özsu, M. (eds.) Encyclopedia of Database Systems, pp. 2847–2848. Springer, Heidelberg (2009). https://doi.org/10.1007/978-0-387-39940-9_374

  2. Apweiler, R., et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, 115–119 (2004)

  3. Balgley, B.M., Laudeman, T., Yang, L., Song, T., Lee, C.S.: Comparative evaluation of tandem MS search algorithms using a target-decoy search strategy. Mol. Cell. Proteomics 6(9), 1599–1608 (2007)

  4. Banerjee, S., Mazumdar, S.: Electrospray ionization mass spectrometry: a technique to access the information beyond the molecular weight of the analyte. Int. J. Anal. Chem. 2012 (2012). https://doi.org/10.1155/2012/282574

  5. Baumgardner, L., Shanmugam, A., Lam, H., Eng, J., Martin, D.: Fast parallel tandem mass spectral library searching using GPU hardware acceleration. J. Proteome Res. (2011). https://doi.org/10.1021/pr200074h

  6. National Center for Biotechnology Information: Fasta format, November 2002. https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp

  7. Cottrell, J.S., London, U.: Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20(18), 3551–3567 (1999)

  8. Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17(20), 2310–2316 (2003). https://doi.org/10.1002/rcm.1198

  9. Deutsch, E.W.: File formats commonly used in mass spectrometry proteomics. Mol. Cell. Proteomics 11(12), 1612–1621 (2012)

  10. Duncan, M.W., Aebersold, R., Caprioli, R.M.: The pros and cons of peptide-centric proteomics. Nat. Biotechnol. (2010). https://doi.org/10.1038/nbt0710-659

  11. Elias, J., Gygi, S.: Target-decoy search strategy for mass spectrometry-based proteomics. Methods Mol. Biol. 604, 55–71 (2010). https://doi.org/10.1007/978-1-60761-444-9_5

  12. Eng, J.K., McCormack, A.L., Yates, J.R.: An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5(11), 976–989 (1994)

  13. Estrada, R.: Fast Data Processing Systems with SMACK Stack. Packt Publishing, Birmingham (2016)

  14. Griss, J., et al.: Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods (2016). https://doi.org/10.1038/nmeth.3902

  15. Heyer, R., Kohrs, F., Reichl, U., Benndorf, D.: Metaproteomics of complex microbial communities in biogas plants. Microb. Biotechnol. 8 (2015). https://doi.org/10.1111/1751-7915.12276

  16. Seidler, J., Zinn, N., Boehm, M.E., Lehmann, W.D.: De novo sequencing of peptides by MS/MS. Proteomics (2009). https://doi.org/10.1002/pmic.200900459

  17. Kipf, A., Pandey, V., Boettcher, J., Braun, L., Neumann, T., Kemper, A.: Analytics on fast data: main-memory database systems versus modern streaming systems. In: 20th International Conference on Extending Database Technology (2017)

  18. Kokaly, R., et al.: USGS spectral library version 7. Technical report, U.S. Geological Survey Data Series 1035 (2017). https://doi.org/10.3133/ds1035

  19. Lubeck, M., et al.: PASEF™ on a timsTOF Pro defines new performance standards for shotgun proteomics with dramatic improvements in MS/MS data acquisition rates and sensitivity. Technical report, Bruker Daltonik GmbH (2017)

  20. Maron, P.A., Ranjard, L., Mougel, C., Lemanceau, P.: Metaproteomics: a new approach for studying functional microbial ecology. Microb. Ecol. 53, 486–493 (2007)

  21. McDonald, W.H., et al.: MS1, MS2, and SQT-three unified, compact, and easily parsed file formats for the storage of shotgun proteomic spectra and identifications. Rapid Commun. Mass Spectrom. 18(18), 2162–2168 (2004). https://doi.org/10.1002/rcm.1603

  22. Millioni, R., Franchin, C., Tessari, P., Polati, R., Cecconi, D., Arrigoni, G.: Pros and cons of peptide isolectric focusing in shotgun proteomics. J. Chromatogr. A 1293, 1–9 (2013). https://doi.org/10.1016/j.chroma.2013.03.073

  23. Ondov, B.D., Bergman, N.H., Phillippy, A.M.: Interactive metagenomic visualization in a web browser. BMC Bioinform. 12(1), 385 (2011). https://doi.org/10.1186/1471-2105-12-385

  24. Petriz, B.A., Franco, O.L.: Metaproteomics as a complementary approach to gut microbiota in health and disease. Front. Chem. (2017). https://doi.org/10.3389/fchem.2017.00004

  25. Pratt, B., Howbert, J.J., Tasman, N.I., Nilsson, E.J.: MR-Tandem: parallel X!Tandem using hadoop mapreduce on Amazon web services. Bioinformatics (2012). https://doi.org/10.1093/bioinformatics/btr615

  26. Craig, R., Beavis, R.C.: A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spectrom. 17, 2310–2316 (2003)

  27. Matrix Science: Data file format (2016). http://www.matrixscience.com/help/data_file_help.html

  28. Wampler, D.: Fast data: big data evolved. White Paper (2015)

  29. Wampler, D.: Fast Data Architectures for Streaming Applications, 1st edn. O’Reilly Media, Sebastopol (2016)

  30. Zhang, J., Liang, Y., Yau, P., Pandey, R., Harpalani, S.: A metaproteomic approach for identifying proteins in anaerobic bioreactors converting coal to methane. Int. J. Coal Geol. 146, 91–103 (2015)

  31. Zoun, R., Schallert, K., Broneske, D., Heyer, R., Benndorf, D., Saake, G.: Interactive chord visualization for metaproteomics. In: 28th International Workshop on Database and Expert Systems Applications (DEXA), pp. 79–83, August 2017. https://doi.org/10.1109/DEXA.2017.32

Acknowledgment

The authors sincerely thank Xiao Chen, Sebastian Krieter, Andreas Meister and Marcus Pinnecke for their support and advice. This work is partly funded by the de.NBI Network (031L0103), the DFG (grant no. SA 465/50-1), the European Regional Development Fund (grant no. 11.000sz00.00.017 114347 0) and the German Federal Ministry of Food and Agriculture (grant no. 22404015). It is dedicated to the memory of Mikhail Zoun.

Author information

Corresponding author

Correspondence to Roman Zoun.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zoun, R. et al. (2018). Protein Identification as a Suitable Application for Fast Data Architecture. In: Elloumi, M., et al. Database and Expert Systems Applications. DEXA 2018. Communications in Computer and Information Science, vol 903. Springer, Cham. https://doi.org/10.1007/978-3-319-99133-7_14

  • DOI: https://doi.org/10.1007/978-3-319-99133-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99132-0

  • Online ISBN: 978-3-319-99133-7

  • eBook Packages: Computer Science, Computer Science (R0)
