PH2: an hadoop-based framework for mining structural properties from the PDB database

Author: Scott Hazelhurst, University of the Witwatersrand, Johannesburg, Wits, South Africa

SAICSIT '10: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists, October 2010, Pages 104–112. https://doi.org/10.1145/1899503.1899515

Published: 11 October 2010

ABSTRACT

PH2 is a Hadoop- and SQL-based tool for quickly extracting information from the Protein Data Bank (PDB). The PDB is stored as a set of Hadoop sequence files, replicated on the Hadoop Distributed File System. A user expresses queries about 3D structures (and other properties) in SQL, and PH2 runs these queries in a highly parallel manner using the Hadoop framework. The PDB is an important source of information about structural and other properties of proteins, and it currently contains about 65,000 protein structures. Determining which proteins have particular shapes is an important bioinformatics application. PH2 parses each PDB file, creates an SQL database for it, and then performs the appropriate queries. Experiments performed on a small local cluster and a large shared cluster show that the application is highly scalable. On the large cluster, a complex real query takes less than 4 minutes to search the whole of the PDB.
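
The parse, load, and query step described in the abstract can be illustrated with a minimal sketch. This is not PH2's implementation: the abstract does not give its schema or code, so the `atom` table layout, the function names `parse_atoms` and `query_structure`, the helper `atom_line`, and the sample coordinates are all assumptions made for illustration. The sketch uses Python's standard `sqlite3` module and the fixed-column layout of PDB `ATOM` records. In a Hadoop setting, a function like `query_structure` would presumably be applied to each PDB entry independently, which is what makes the workload easy to parallelise.

```python
import sqlite3


def parse_atoms(pdb_text):
    """Yield one tuple per ATOM record, using the fixed PDB column layout."""
    for line in pdb_text.splitlines():
        if line[:6] != "ATOM  ":
            continue  # skip HEADER, HETATM, and every other record type
        yield (
            int(line[6:11]),        # atom serial number (columns 7-11)
            line[12:16].strip(),    # atom name (columns 13-16)
            line[17:20].strip(),    # residue name (columns 18-20)
            line[21],               # chain identifier (column 22)
            int(line[22:26]),       # residue sequence number (columns 23-26)
            float(line[30:38]),     # x coordinate in angstroms (columns 31-38)
            float(line[38:46]),     # y coordinate (columns 39-46)
            float(line[46:54]),     # z coordinate (columns 47-54)
        )


def query_structure(pdb_text, sql):
    """Load one structure into a throwaway in-memory database and run sql on it."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, chosen for this sketch only.
        conn.execute(
            "CREATE TABLE atom (serial INTEGER, name TEXT, residue TEXT, "
            "chain TEXT, res_seq INTEGER, x REAL, y REAL, z REAL)"
        )
        conn.executemany(
            "INSERT INTO atom VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            parse_atoms(pdb_text),
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


def atom_line(serial, name, residue, chain, res_seq, x, y, z):
    """Hypothetical helper: format a made-up ATOM record with correct columns."""
    return (
        f"ATOM  {serial:5d} {name:<4s} {residue:>3s} {chain}{res_seq:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}"
    )


# Made-up sample data: one N atom and three alpha-carbons spaced 3.8 angstroms apart.
SAMPLE = "\n".join([
    "HEADER    ILLUSTRATIVE SAMPLE, NOT A REAL PDB ENTRY",
    atom_line(1, "N", "ALA", "A", 1, 1.0, 1.0, 0.0),
    atom_line(2, "CA", "ALA", "A", 1, 0.0, 0.0, 0.0),
    atom_line(3, "CA", "GLY", "A", 2, 3.8, 0.0, 0.0),
    atom_line(4, "CA", "SER", "A", 3, 7.6, 0.0, 0.0),
])

# Structural query: residue pairs whose alpha-carbons lie within 4 angstroms.
CLOSE_CA_PAIRS = """
    SELECT a.res_seq, b.res_seq
    FROM atom AS a JOIN atom AS b ON a.serial < b.serial
    WHERE a.name = 'CA' AND b.name = 'CA'
      AND (a.x - b.x) * (a.x - b.x)
        + (a.y - b.y) * (a.y - b.y)
        + (a.z - b.z) * (a.z - b.z) < 16.0
    ORDER BY a.res_seq, b.res_seq
"""

if __name__ == "__main__":
    print(query_structure(SAMPLE, CLOSE_CA_PAIRS))  # [(1, 2), (2, 3)]
```

Building a fresh database per structure keeps each query self-contained, so no state is shared between entries. The cost is that every structure is re-parsed for every query.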

Published in
SAICSIT '10: Proceedings of the 2010 Annual Research Conference of the South African Institute of Computer Scientists and Information Technologists
October 2010
447 pages
ISBN: 9781605589503
DOI: 10.1145/1899503
Conference Chair: Paula Kotzé, CSIR Meraka Institute, Pretoria, South Africa
Program Chairs: Alta van der Merwe and Aurona Gerber, CSIR Meraka Institute, Pretoria, South Africa
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PDB
hadoop
parallel computing
structural information
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 187 of 439 submissions, 43%