ABSTRACT
PH2 is a Hadoop- and SQL-based tool for quickly extracting information from the Protein Data Bank (PDB). The PDB is stored as a set of Hadoop sequence files, replicated across the Hadoop Distributed File System. PH2 lets a user express queries about 3D structures (and other properties) in SQL and runs these queries in a highly parallel manner using the Hadoop framework. The PDB is an important source of information about structural and other properties of proteins, and it currently contains about 65,000 protein structures. Determining which proteins have particular shapes is an important bioinformatics application. PH2 parses each PDB file, creates a SQL database for it, and then performs the appropriate queries. Experiments performed on a small local cluster and a large shared cluster show that the application is highly scalable. On the large cluster, a complex real query takes less than 4 minutes to search the whole of the PDB.
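The parse, load, and query steps described above can be illustrated with a minimal Python sketch. It is not PH2's code: the abstract does not give PH2's schema, SQL engine, or Hadoop job layout, so the `atom` table, the use of SQLite, the helper names (`atom_line`, `parse_atoms`, `query_structure`, `run_query`), and the two toy structures are all assumptions made for illustration. The plain loop in `run_query` stands in for the map phase, where each Hadoop task would handle one structure independently.

```python
import sqlite3

# Hypothetical schema: PH2's real tables are not described in the abstract.
SCHEMA = """CREATE TABLE atom (
    serial INTEGER, name TEXT, res_name TEXT, chain TEXT,
    res_seq INTEGER, x REAL, y REAL, z REAL)"""

def atom_line(serial, name, res, chain, res_seq, x, y, z):
    """Format one fixed-column PDB ATOM record (used only to build toy input)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atoms(pdb_text):
    """Yield one tuple per ATOM record, using the PDB format's fixed columns."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            yield (int(line[6:11]), line[12:16].strip(), line[17:20].strip(),
                   line[21], int(line[22:26]),
                   float(line[30:38]), float(line[38:46]), float(line[46:54]))

def query_structure(pdb_id, pdb_text, sql):
    """Build a throwaway in-memory database for one structure and query it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO atom VALUES (?,?,?,?,?,?,?,?)",
                     parse_atoms(pdb_text))
    rows = conn.execute(sql).fetchall()
    conn.close()
    # Tag each result row with the structure it came from.
    return [(pdb_id,) + tuple(row) for row in rows]

def run_query(structures, sql):
    """Sequential stand-in for the map phase: one independent task per structure."""
    results = []
    for pdb_id, pdb_text in structures.items():
        results.extend(query_structure(pdb_id, pdb_text, sql))
    return results

# Example query: pairs of alpha-carbon (CA) atoms within 4 angstroms of each other.
CA_CONTACT_SQL = """
SELECT a.res_seq, b.res_seq
FROM atom a JOIN atom b ON a.serial < b.serial
WHERE a.name = 'CA' AND b.name = 'CA'
  AND (a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z) <= 16.0
ORDER BY a.res_seq, b.res_seq
"""

# Toy structures (made up, not real PDB entries).
STRUCTURES = {
    "demo1": "\n".join([
        atom_line(1, "N", "ALA", "A", 1, -1.2, 0.5, 0.0),
        atom_line(2, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
        atom_line(3, "CA", "GLY", "A", 2, 3.8, 0.0, 0.0),
        atom_line(4, "CA", "SER", "A", 3, 7.6, 0.0, 0.0),
    ]),
    "demo2": "\n".join([
        atom_line(1, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
        atom_line(2, "CA", "GLY", "A", 2, 10.0, 0.0, 0.0),
    ]),
}

if __name__ == "__main__":
    for row in run_query(STRUCTURES, CA_CONTACT_SQL):
        print(row)
```

Because every structure gets its own small database, no task depends on another, which is the property that lets this kind of query be spread across a cluster.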
Index Terms
- PH2: an hadoop-based framework for mining structural properties from the PDB database