Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints

Hsu, Chen-Ming; Chen, Chien-Yu; Hsu, Ching-Chi; Liu, Baw-Jhiune

doi:10.1007/11731139_62

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3918)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Discovering protein structural signatures directly from primary sequence information is challenging because the residues associated with a functional motif are not necessarily clustered in one region of the sequence. This work proposes an algorithm that discovers conserved sequential blocks interleaved by large irregular gaps from a set of unaligned biological sequences. Unlike previous works, which employ only one type of constraint on gap flexibility, we propose combining intra- and inter-block gap constraints to discover longer patterns with larger irregular gaps. The smaller, flexible intra-block gap constraint relaxes the restriction within local motif blocks while keeping them compact, and the larger, flexible inter-block gap constraint allows longer irregular gaps between compact motif blocks. Using two types of gap constraints for different purposes improves the efficiency of the mining process while preserving the high accuracy of the mining results. This efficiency also helps to identify functional motifs that are conserved in only a small subset of the input sequences.
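
To make the two constraint types concrete, the following minimal Python sketch checks whether a multi-block pattern occurs in a sequence and computes its support over a set of sequences. It is an illustration only, not the mining algorithm proposed in the paper: the function names (`flatten`, `occurs`, `support`), the parameters (`max_intra`, `min_inter`, `max_inter`), the choice to measure a gap as the number of skipped residues, and the toy sequences are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: a pattern is a list of blocks, and each
# block is a string of residues. A "gap" is the number of residues
# skipped between two consecutive matched residues (an assumption).


def flatten(blocks: Sequence[str], max_intra: int,
            min_inter: int, max_inter: int) -> List[Tuple[str, int, int]]:
    """Turn blocks into (residue, min_gap, max_gap) steps."""
    steps = []
    for b_idx, block in enumerate(blocks):
        for r_idx, residue in enumerate(block):
            if b_idx == 0 and r_idx == 0:
                steps.append((residue, 0, 0))  # bounds unused for first residue
            elif r_idx == 0:
                steps.append((residue, min_inter, max_inter))  # between blocks
            else:
                steps.append((residue, 0, max_intra))  # inside a block
    return steps


def occurs(seq: str, blocks: Sequence[str], max_intra: int,
           min_inter: int, max_inter: int) -> bool:
    """Return True if the pattern matches seq under both gap constraints."""
    steps = flatten(blocks, max_intra, min_inter, max_inter)
    if not steps:
        return True
    # Positions where the partial match built so far can end.
    ends = {i for i, c in enumerate(seq) if c == steps[0][0]}
    for residue, lo, hi in steps[1:]:
        ends = {
            j
            for p in ends
            for j in range(p + 1 + lo, min(len(seq), p + 2 + hi))
            if seq[j] == residue
        }
        if not ends:
            return False
    return bool(ends)


def support(seqs: Sequence[str], blocks: Sequence[str], max_intra: int,
            min_inter: int, max_inter: int) -> float:
    """Fraction of sequences that contain the pattern."""
    hits = sum(occurs(s, blocks, max_intra, min_inter, max_inter) for s in seqs)
    return hits / len(seqs)


# Toy data (made up for illustration, not taken from the paper).
toy_seqs = ["ACGCAAAAAHAH", "CGACKKKKHH", "CGCHH", "CGAACAAAAHH"]
toy_blocks = ["CGC", "HH"]
print(support(toy_seqs, toy_blocks, max_intra=1, min_inter=3, max_inter=8))
```

In this toy run the first two sequences match, the third fails because its blocks are closer together than `min_inter`, and the fourth fails because a gap inside the first block exceeds `max_intra`. Keeping `max_intra` small limits how many candidate positions survive each step inside a block, which is one intuitive reading of why separating the two constraints could help efficiency.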

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 320, Taiwan, R.O.C.
Chen-Ming Hsu & Baw-Jhiune Liu
Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taipei, 106, Taiwan, R.O.C.
Chien-Yu Chen
Institute for Information Industry, Taipei, 106, Taiwan, R.O.C.
Ching-Chi Hsu

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Wee-Keong Ng
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
School of Computer Science and Technology, Heilongjiang University, China
Jianzhong Li
School of Computer Engineering, Nanyang Technological University, 639798, Singapore, Singapore
Kuiyu Chang

About this paper

Cite this paper

Hsu, CM., Chen, CY., Hsu, CC., Liu, BJ. (2006). Efficient Discovery of Structural Motifs from Protein Sequences with Combination of Flexible Intra- and Inter-block Gap Constraints. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_62

DOI: https://doi.org/10.1007/11731139_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science (R0)
