Skip to main content

Trie-based Apriori Motif Discovery Approach

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 7292))

Abstract

One of the hardest and long-standing problems in Bioinformatics is the problem of motif discovery in biological sequences. It is the problem of finding recurring patterns in these sequences. Apriori is a well-known data mining algorithm. It is used to mine frequent patterns in large datasets. In this paper, we would like to apply Apriori to the common motifs discovery problem. We propose three modifications so that we can adapt the classic Apriori to our problem. First, the Trie data structure is used to store all biological sequences under examination. Second, both of the frequent pattern extraction and the candidate generation steps are done using the same data structure, the Trie. The Trie allows to simultaneously search all possible starting points in the sequence for any occurrence of the given pattern. Third, instead of using only the support as a measure to assess frequent patterns, a new measure, the normalized information content (normIC), is proposed which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa’s benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for average sensitivity.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 487–499 (1994)

    Google Scholar 

  2. Badr, G.: Tries in information retrieval and syntactic pattern recognition. Ph.D. Thesis, School of Computer Science, Carleton University (June 2006)

    Google Scholar 

  3. Badr, G., Turcotte, M.: Component-Based Matching for Multiple Interacting RNA Sequences. In: Chen, J., Wang, J., Zelikovsky, A. (eds.) ISBRA 2011. LNCS, vol. 6674, pp. 73–86. Springer, Heidelberg (2011)

    Chapter  Google Scholar 

  4. Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, ISMB, vol. 2, pp. 28–36 (1994)

    Google Scholar 

  5. Bajcsy, P., Han, J., Liu, L., Yang, J.: Survey of biodata analysis from a data mining perspective. In: Wu, X., Jain, L., Wang, J.T., Zaki, M.J., Toivonen, H.T., Shasha, D. (eds.) Data Mining in Bioinformatics, pp. 9–39. Springer, London (2005)

    Chapter  Google Scholar 

  6. Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(2), 126–140 (2006)

    Article  Google Scholar 

  7. Goos, G., Hartmanis, J., Leeuwen, J., Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.: Longest Repeats with a Block of Don’t Cares. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 271–278. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  8. Jiang, H., Zhao, Y., Chen, W., Zheng, W.: Searching maximal degenerate motifs guided by a compact suffix tree. Advances in Experimental Medicine and Biology 680, 19–26 (2010)

    Article  Google Scholar 

  9. Li, M., Ma, B., Wang, L.: Finding similar regions in many sequences. Journal of Computer and System Sciences 65, 73–96 (2002)

    Article  MathSciNet  Google Scholar 

  10. Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34(Database issue), D108–D110 (2006)

    Article  Google Scholar 

  11. Ozer, H.G., Ray, W.C.: Informative Motifs in Protein Family Alignments. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 161–170. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  12. Pavesi, G., Mauri, G., Pesole, G.: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17(suppl. 1), S207–S214 (2001)

    Article  Google Scholar 

  13. Pavesi, G., Mauri, G., Pesole, G.: In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics 5(3), 217–236 (2004)

    Article  Google Scholar 

  14. Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 269–278. AAAI Press (2000)

    Google Scholar 

  15. Sagot, M.-F.: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In: Lucchesi, C.L., Moura, A.V. (eds.) LATIN 1998. LNCS, vol. 1380, pp. 374–390. Springer, Heidelberg (1998)

    Chapter  Google Scholar 

  16. Sagot, M.F., Escalier, V., Viari, A., Soldano, H.: Searching for repeated words in a text allowing for mismatches and gaps. In: Baeza-Yates, R., Manber, U. (eds.) Second South American Workshop on String Processing, pp. 87–100. University of Chile (1995)

    Google Scholar 

  17. Sandelin, A., Alkema, W., Engstrm, P., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32(Database issue), D91–D94 (2004)

    Article  Google Scholar 

  18. Schneider, T.D., Stormo, G.D., Gold, L.: Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188(3), 415–431 (1986)

    Article  Google Scholar 

  19. Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Rgnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23(1), 137–144 (2005)

    Article  Google Scholar 

  20. Ukkonen, E.: On-Line construction of suffix trees. Algorithmica 14, 249–260 (1995)

    Article  MathSciNet  MATH  Google Scholar 

  21. Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 2008, pp. 1–11. IEEE (October 1973)

    Google Scholar 

  22. Wijaya, E., Rajaraman, K., Yiu, S., Sung, W.: Detection of generic spaced motifs using submotif pattern mining. Bioinformatics 23(12), 1476–1485 (2007)

    Article  Google Scholar 

  23. Jia, C., Lu, R., Chen, L.: A Frequent Pattern Mining Method for Finding Planted (l, d)-motifs of Unknown Length. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS, vol. 6401, pp. 240–248. Springer, Heidelberg (2010)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Al-Turaiki, I., Badr, G., Mathkour, H. (2012). Trie-based Apriori Motif Discovery Approach. In: Bleris, L., Măndoiu, I., Schwartz, R., Wang, J. (eds) Bioinformatics Research and Applications. ISBRA 2012. Lecture Notes in Computer Science(), vol 7292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30191-9_1

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-30191-9_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30190-2

  • Online ISBN: 978-3-642-30191-9

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics