DOI: 10.1145/1899503.1899515
research-article

PH2: an hadoop-based framework for mining structural properties from the PDB database

Published: 11 October 2010

ABSTRACT

PH2 is a Hadoop- and SQL-based tool for quickly extracting information from the Protein Data Bank (PDB). The PDB is stored as a set of replicated Hadoop sequence files on the Hadoop Distributed File System (HDFS). A user submits queries about 3D structures (and other properties) in SQL, and PH2 runs them in a highly parallel manner using the Hadoop framework. The PDB is an important source of information about structural and other properties of proteins, and it currently contains about 65,000 protein structures. Determining which proteins have particular shapes is an important bioinformatics application. PH2 parses each PDB file, creates an SQL database for it, and then runs the appropriate queries against that database. Experiments on a small local cluster and a large shared cluster show that the application is highly scalable. On the large cluster, a complex real query takes less than 4 minutes to search the whole of the PDB.
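
The per-file step described in the abstract (parse one PDB entry, build an SQL database for it, run the user's query) might look roughly like the sketch below. It is illustrative only and is not the authors' implementation: the abstract does not name the SQL engine or the schema, so Python's built-in `sqlite3` module, the `atom` table, and the function names `parse_atoms` and `map_entry` are all assumptions. The Hadoop side (reading entries from sequence files and collecting results) is left out; `map_entry` stands in for the body of a map task.

```python
import sqlite3

# Hypothetical schema: one row per ATOM record of a single PDB entry.
SCHEMA = (
    "CREATE TABLE atom (serial INTEGER, name TEXT, res_name TEXT, "
    "chain TEXT, res_seq INTEGER, x REAL, y REAL, z REAL)"
)


def parse_atoms(pdb_text):
    """Yield one tuple per ATOM record, using the PDB fixed-column layout."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM  "):
            yield (
                int(line[6:11]),        # atom serial number
                line[12:16].strip(),    # atom name
                line[17:20].strip(),    # residue name
                line[21],               # chain identifier
                int(line[22:26]),       # residue sequence number
                float(line[30:38]),     # x coordinate
                float(line[38:46]),     # y coordinate
                float(line[46:54]),     # z coordinate
            )


def map_entry(pdb_id, pdb_text, query):
    """Build an in-memory database for one entry and run the query on it.

    Returns (pdb_id, row) pairs, the shape a map task could emit.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        conn.executemany(
            "INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            parse_atoms(pdb_text),
        )
        return [(pdb_id, row) for row in conn.execute(query)]
    finally:
        conn.close()
```

Because each entry gets its own small, throwaway database, entries can be processed independently of one another, which is the property that lets the work be spread across many map tasks.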

          • Published in

            SAICSIT '10: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
            October 2010
            447 pages
            ISBN: 9781605589503
            DOI: 10.1145/1899503

            Copyright © 2010 ACM


            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 187 of 439 submissions, 43%
