Efficient Algorithm for Mining Correlated Protein-DNA Binding Cores

Wong, Po-Yuen; Chan, Tak-Ming; Wong, Man-Hon; Leung, Kwong-Sak

doi:10.1007/978-3-642-29038-1_34

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7238)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Correlated protein-DNA interactions (binding cores) between transcription factors (TFs) and transcription factor binding sites (TFBSs) are usually identified by costly 3D structural experiments. To avoid numerous unsuccessful trials, we develop a cheap and efficient sequence-based computational method that proposes novel, testable binding cores with high confidence and thereby accelerates the experiments. Although sequence-based motif discovery algorithms are abundant, few directly associate TF core motifs with TFBS core motifs in a way that can be verified on 3D structures. In this paper, we formally define the problem of discovering correlated TF-TFBS binding cores and apply association rule mining techniques to existing real sequence data (TRANSFAC). The proposed algorithm first builds two frequent sequence tree (FS-Tree) structures that store condensed information for association rule mining. Association rules are then generated by depth-first traversal of these structures. FS-Trees offer several advantages for further applications: efficient calculation of support and confidence, simple generation of candidate rules, and compatibility with effective pruning techniques. As a result, the FS-Trees serve as a useful basis for more general extensions related to biological binding core identification. We tested our algorithm on real sequence data from the biological database TRANSFAC, focusing on efficiency comparisons with recent work that employs association rule mining. The discovered rules reveal real TF-TFBS binding cores in independent 3D verification against the Protein Data Bank (PDB).
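
The support and confidence measures mentioned in the abstract can be illustrated with a minimal sketch. It is not the FS-Tree algorithm of the paper: it swaps in a plain dictionary from each fixed-length substring to the set of records containing it, and pairs frequent substrings by brute force. The function names, the fixed core lengths, the thresholds, and the toy sequences are all assumptions made for illustration; record i is assumed to pair the i-th TF sequence with the i-th TFBS sequence.

```python
from collections import defaultdict
from itertools import product


def kmer_index(sequences, k):
    """Map each length-k substring to the set of record ids that contain it."""
    index = defaultdict(set)
    for rid, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(rid)
    return index


def mine_rules(tf_seqs, tfbs_seqs, k_tf, k_dna, min_support, min_confidence):
    """Return (tf_core, tfbs_core, support, confidence) for rules TF core -> TFBS core.

    support    = number of records containing both cores
    confidence = support / number of records containing the TF core
    """
    tf_index = kmer_index(tf_seqs, k_tf)
    dna_index = kmer_index(tfbs_seqs, k_dna)
    # A pair can never occur in more records than either of its parts,
    # so infrequent cores are discarded before pairing.
    tf_frequent = {m: ids for m, ids in tf_index.items() if len(ids) >= min_support}
    dna_frequent = {m: ids for m, ids in dna_index.items() if len(ids) >= min_support}

    rules = []
    for (tf_core, tf_ids), (dna_core, dna_ids) in product(
        tf_frequent.items(), dna_frequent.items()
    ):
        support = len(tf_ids & dna_ids)
        if support < min_support:
            continue
        confidence = support / len(tf_ids)
        if confidence >= min_confidence:
            rules.append((tf_core, dna_core, support, confidence))
    # Highest confidence first, then highest support, then alphabetical.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Toy data invented for this example (not taken from TRANSFAC).
tf_seqs = ["MKRAR", "AKRAW", "GGKRA", "MLLLW"]
tfbs_seqs = ["TTGACG", "CGACGT", "AGACGA", "TTTTTT"]

rules = mine_rules(tf_seqs, tfbs_seqs, k_tf=3, k_dna=4,
                   min_support=3, min_confidence=0.9)
print(rules)  # [('KRA', 'GACG', 3, 1.0)]
```

In this toy input the TF core KRA and the TFBS core GACG each occur in records 0 to 2 and always together, so the rule KRA -> GACG has support 3 and confidence 1.0. Storing record-id sets makes both measures a set intersection; a tree-based index such as the one the abstract describes would serve the same purpose while sharing prefixes between patterns.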

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
Po-Yuen Wong, Tak-Ming Chan, Man-Hon Wong & Kwong-Sak Leung

Editor information

Editors and Affiliations

School of Computer Science and Engineering, Seoul National University, Gwanak-ro, Gwanak-gu, 151747, Seoul, South Korea
Sang-goo Lee
Computer School, Wuhan University, Luo-jia-shan, Wuchang, 430081, Wuhan, Hubei Province, China
Zhiyong Peng
School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Brisbane, Australia
Xiaofang Zhou
Department of Computer Science, Kangwon National University, 192-1, Hyoja2-Dong, Chuncheon, 200701, Kangwon, South Korea
Yang-Sae Moon
Institute for Computer Science and Business Information, University of Duisburg-Essen, Schützenbahn 70, 45117, Essen, Germany
Rainer Unland
School of Information and Communication Engineering, Chungbuk National University, 52 Naesudong-ro, Heungdeok-gu, Cheongju, 4072, Chungbuk, South Korea
Jaesoo Yoo

Cite this paper

Wong, PY., Chan, TM., Wong, MH., Leung, KS. (2012). Efficient Algorithm for Mining Correlated Protein-DNA Binding Cores. In: Lee, Sg., Peng, Z., Zhou, X., Moon, YS., Unland, R., Yoo, J. (eds) Database Systems for Advanced Applications. DASFAA 2012. Lecture Notes in Computer Science, vol 7238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29038-1_34

DOI: https://doi.org/10.1007/978-3-642-29038-1_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29037-4
Online ISBN: 978-3-642-29038-1
eBook Packages: Computer Science, Computer Science (R0)
