Finding Frequent Structural Features among Words in Tree-Structured Documents

Uchida, Tomoyuki; Mogawa, Tomonori; Nakamura, Yasuaki

doi:10.1007/978-3-540-24775-3_43

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3056)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Many electronic documents such as SGML/HTML/XML files and LaTeX files have tree structures. Such documents are called tree-structured documents. Many tree-structured documents contain large plain texts. In order to extract structural features among words from tree-structured documents, we consider the problem of finding frequent structured patterns among words in tree-structured documents. Let $k \geq 2$ be an integer and $(W_1, W_2, \ldots, W_k)$ a list of words which are sorted in lexicographical order. A consecutive path pattern on $(W_1, W_2, \ldots, W_k)$ is a sequence $\langle t_1; t_2; \ldots; t_{k-1} \rangle$ of labeled rooted ordered trees such that, for $i = 1, 2, \ldots, k-1$, (1) $t_i$ consists of only one node having the pair $(W_i, W_{i+1})$ as its label, or (2) $t_i$ has just two nodes whose degrees are one and which are labeled with $W_i$ and $W_{i+1}$, respectively. We present a data mining algorithm for finding all frequent consecutive path patterns in tree-structured documents. Then, by reporting experimental results on our algorithm, we show that our algorithm is efficient for extracting structural features from tree-structured documents.
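
To make the definition above concrete, the following minimal Python sketch enumerates the shapes a consecutive path pattern can take for a given word list. It illustrates the definition only and is not the authors' mining algorithm, which the abstract does not describe. The function name `candidate_patterns` and the tags `"one_node"` and `"two_nodes"` are hypothetical names chosen for this illustration, and the sketch does not model which of the two nodes in case (2) is the root.

```python
from itertools import product


def candidate_patterns(words):
    """Enumerate the possible shapes of a consecutive path pattern.

    Each pattern is a tuple of k-1 entries, one per adjacent word pair
    (W_i, W_{i+1}). An entry is (kind, W_i, W_{i+1}), where kind is
    "one_node" for case (1) of the definition (a single node labeled
    with the pair) or "two_nodes" for case (2) (two nodes labeled
    W_i and W_{i+1}). The tags are illustrative, not from the paper.
    """
    if len(words) < 2:
        raise ValueError("the definition requires k >= 2 words")
    # The definition assumes the word list is in lexicographical order.
    ordered = sorted(words)
    adjacent = list(zip(ordered, ordered[1:]))
    patterns = []
    # Each of the k-1 trees independently takes one of the two forms.
    for kinds in product(("one_node", "two_nodes"), repeat=len(adjacent)):
        patterns.append(
            tuple((kind, a, b) for kind, (a, b) in zip(kinds, adjacent))
        )
    return patterns


if __name__ == "__main__":
    for pattern in candidate_patterns(["mining", "data", "tree"]):
        print(pattern)
```

Under this reading, a list of k words yields 2^(k-1) candidate shapes, since each of the k-1 trees is chosen independently; a mining algorithm would then have to determine which of these occur frequently in the documents.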

Author information

Authors and Affiliations

Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
Tomoyuki Uchida & Yasuaki Nakamura
Department of Computer and Media Technologies, Hiroshima City University, Hiroshima, 731-3194, Japan
Tomonori Mogawa

Editor information

Editors and Affiliations

School of Engineering and Information Technology, Deakin University, VIC 3125, Australia
Honghua Dai
University of Illinois at Urbana-Champaign, 61801, Urbana, IL, USA
Ramakrishnan Srikant
Faculty of Engineering and Information Technology, Centre for Quantum Computation and Intelligent Systems, and Australian ACS National Committee for Artificial Intelligence, University of Technology, Sydney, Australia
Chengqi Zhang

About this paper

Cite this paper

Uchida, T., Mogawa, T., Nakamura, Y. (2004). Finding Frequent Structural Features among Words in Tree-Structured Documents. In: Dai, H., Srikant, R., Zhang, C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2004. Lecture Notes in Computer Science, vol 3056. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24775-3_43

DOI: https://doi.org/10.1007/978-3-540-24775-3_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22064-0
Online ISBN: 978-3-540-24775-3
eBook Packages: Springer Book Archive
