Abstract
Remote protein homology detection (RPHD) is the problem of detecting evolutionary relationships between proteins at low levels of sequence similarity. Open problems in remote protein homology detection include determining which combination of multiple alignment and classification techniques performs best, and the misalignment of protein sequences during the alignment process. This paper therefore addresses remote protein homology detection by assessing the impact of using structural information, rather than sequence information, in protein multiple alignments. It further identifies the best combination of multiple alignment and classification programs, and improves the quality of the multiple alignments by integrating a refinement algorithm. The framework began with dataset preparation from SCOP version 1.73, followed by multiple alignment of the protein sequences using CLUSTALW, MAFFT, ProbCons and T-Coffee for sequence-based multiple alignment and 3DCoffee, MAMMOTH-mult, MUSTANG and PROMALS3D for structure-based multiple alignment. Next, a refinement algorithm was applied to the protein sequences to reduce misalignments. Lastly, the aligned protein sequences were classified using generative profile hidden Markov model (pHMM) classifiers such as HMMER and SAM, and discriminative support vector machine (SVM) classifiers such as SVM-Fold and SVM-Struct. The performance of the assessed programs was evaluated using ROC, precision and recall tests. The results show that the combination of refined SVM-Struct and PROMALS3D performs best among the programs assessed, which suggests that this combination is the best for RPHD. The results also show that the refinement algorithm increases the performance of the multiple alignment programs by at least 4%.
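As a rough illustration of the evaluation step, the sketch below computes the area under the ROC curve, precision and recall from classifier scores. It is a minimal sketch and not the authors' evaluation code: the function names `roc_auc` and `precision_recall`, the decision threshold, and the toy scores and labels are all assumptions made for this example. In practice the scores would come from a classifier such as a pHMM or an SVM applied to held-out sequences.

```python
from typing import List, Tuple


def roc_auc(scores: List[float], labels: List[int]) -> float:
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive scores higher.
    Ties count as half a correctly ranked pair."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when every score >= threshold is called a homolog."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (hypothetical): label 1 = true homolog, 0 = non-homolog.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

auc = roc_auc(scores, labels)
precision, recall = precision_recall(scores, labels, threshold=0.5)
print(f"AUC={auc:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The pairwise formulation of the AUC is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be the usual choice for large benchmark sets.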
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Abdullah, F.M., Othman, R.M., Kasim, S., Hashim, R. (2011). An Optimal Mesh Algorithm for Remote Protein Homology Detection. In: Kim, Th., Adeli, H., Robles, R.J., Balitanas, M. (eds) Ubiquitous Computing and Multimedia Applications. UCMA 2011. Communications in Computer and Information Science, vol 151. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20998-7_57
DOI: https://doi.org/10.1007/978-3-642-20998-7_57
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20997-0
Online ISBN: 978-3-642-20998-7
eBook Packages: Computer Science (R0)