Skip to main content

The q-Gram Distance for Ordered Unlabeled Trees

  • Conference paper
Discovery Science (DS 2005)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 3735))

Included in the following conference series:

Abstract

In this paper, we investigate the q -gram distance for ordered unlabeled trees (trees, for short). First, we formulate a q-gram as simply a tree with q nodes isomorphic to a line graph, and the q-gram distance between two trees as similar as one between two strings. Then, by using the depth sequence based on postorder, we design the algorithm EnumGram to enumerate all q-grams in a tree T with n nodes which runs in O(n 2) time and in O(q) space. Furthermore, we improve it to the algorithm LinearEnumGram which runs in O(qn) time and in O(qd) space, where d is the depth of T. Hence, we can evaluate the q-gram distance D q (T 1,T 2) between T 1 and T 2 in O(q maxn 1, n 2) time and in O(q maxd 1, d 2) space, where n i and d i are the number of nodes in T i and the depth of T i , respectively. Finally, we show the relationship between the q-gram distance D q (T 1,T 2) and the edit distanceE(T 1,T 2) that D q (T 1,T 2)≤ (gl+1)E(T 1,T 2), where g = max{g 1, g 2}, l = max{l 1, l 2}, g i is the degree of T i and l i is the number of leaves in T i . In particular, for the top-down tree edit distanceF(T 1,T 2), this result implies that \(D_{q}(T_{1}, T_{2}) \leq {\rm min}\{g^{q-2}, l - 1\}\{F(T_{1}, T_{2})\}\).

This work is partially supported by Grand-in-Aid for Scientific Research 15700137, 16016275 and 17700138 from the Ministry of Education, Culture, Sports, Science and Technology, Japan.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Asai, T., Abe, K., Kawazoe, S., Arimura, H., Sakamoto, H., Arikawa, S.: Efficient substructure discovery from large semi-structured data. In: Proc. SDM 2002 (2002)

    Google Scholar 

  2. Asai, T., Arimura, H., Nakano, S., Uno, T.: Discovering frequent substructures in large unordered trees. In: Grieser, G., Tanaka, Y., Yamamoto, A. (eds.) DS 2003. LNCS (LNAI), vol. 2843, pp. 47–61. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  3. Bille, P.: A survey on tree edit distance and related problems. Theor. Comput. Sci. 337, 217–239 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  4. Burkhardt, S., Karkkainen, J.: Better filtering with gapped q-grams. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 73–85. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  5. Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y.: Extracting characteristic structures among words in semistructured documents. In: Chen, M.-S., Yu, P.S., Liu, B. (eds.) PAKDD 2002. LNCS (LNAI), vol. 2336, pp. 351–360. Springer, Heidelberg (2002)

    Google Scholar 

  6. Garofalakis, M., Kumar, A.: Correlating XML data streams using tree-edit distance embeddings. In: Proc. PODS 2003, pp. 143–154 (2003)

    Google Scholar 

  7. Ikeda, D., Yamada, Y., Hirokawa, S.: Eliminating useless parts in semi-structured documents using alternation counts. In: Jantke, K.P., Shinohara, A. (eds.) DS 2001. LNCS (LNAI), vol. 2226, pp. 113–127. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  8. Jokinen, P., Ukkonen, E.: Two algorithms for approximate string matching in static texts. In: Tarlecki, A. (ed.) MFCS 1991. LNCS, vol. 520, pp. 240–248. Springer, Heidelberg (1991)

    Google Scholar 

  9. Kuboyama, T., Shin, K., Miyahara, T., Yasuda, H.: A theoretical analysis of alignment and edit problems for trees. In: Coppo, M., Lodi, E., Pinna, G.M. (eds.) ICTCS 2005. LNCS, vol. 3701, pp. 323–337. Springer, Heidelberg (2005)

    Chapter  Google Scholar 

  10. Nakano, S., Uno, T.: Efficient generation of rooted trees. National Institute of Informatics Technical Report NII-2003-005E (2003)

    Google Scholar 

  11. Nakano, S., Uno, T.: Constant time generation of trees with specified diameter. In: Hromkovič, J., Nagl, M., Westfechtel, B. (eds.) WG 2004. LNCS, vol. 3353, pp. 33–45. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  12. Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM 2000. LNCS, vol. 1848, pp. 350–363. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  13. Selkow, S.M.: The tree-to-tree editing problem. Inform. Proc. Let. 6, 184–186 (1997)

    Article  MathSciNet  Google Scholar 

  14. Shasha, D., Zhang, K.: Fast algorithms for the unit cost edit distance between trees. J. Algo. 11, 581–621 (1990)

    Article  MathSciNet  MATH  Google Scholar 

  15. Uchida, T., Mogawa, T., Nakamura, Y.: Finding frequent structural features among words in tree-structured documents. In: Dai, H., Srikant, R., Zhang, C. (eds.) PAKDD 2004. LNCS (LNAI), vol. 3056, pp. 351–360. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  16. Ukkonen, E.: Approximate string-matching with q-grams and maximal matches. Theor. Comput. Sci. 92, 191–211 (1993)

    Article  MathSciNet  Google Scholar 

  17. Yang, W.: Identifying syntactic differences between two programs. Software–Practice and Experience 21, 739–755 (1991)

    Article  Google Scholar 

  18. Zaki, M.J.: Efficiently mining frequent trees in a forest. In: Proc. SIGKDD 2002, pp. 71–80 (2002)

    Google Scholar 

  19. Zhang, K., Shasha, D.: Tree pattern matching. In: Apostolico, A., Galil, Z. (eds.) Pattern matching algorithms, pp. 341–371 (1997)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ohkura, N., Hirata, K., Kuboyama, T., Harao, M. (2005). The q-Gram Distance for Ordered Unlabeled Trees. In: Hoffmann, A., Motoda, H., Scheffer, T. (eds) Discovery Science. DS 2005. Lecture Notes in Computer Science(), vol 3735. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11563983_17

Download citation

  • DOI: https://doi.org/10.1007/11563983_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29230-2

  • Online ISBN: 978-3-540-31698-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics