Extracting Characteristic Structures among Words in Semistructured Documents

Furukawa, Kazuyoshi; Uchida, Tomoyuki; Yamada, Kazuya; Miyahara, Tetsuhiro; Shoudai, Takayoshi; Nakamura, Yasuaki

doi:10.1007/3-540-47887-6_36

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Abstract

The number of electronic documents such as SGML/HTML/XML files and LaTeX files has been increasing rapidly with the progress of network and storage technologies. Many electronic documents have no rigid structure and are called semistructured documents. Since many semistructured documents contain large amounts of plain text, we focus on the structural characteristics among words in semistructured documents. The aim of this paper is to present a text mining technique for semistructured documents. We consider the problem of finding all frequent structured patterns among words in semistructured documents. Let k ≥ 2 be an integer and let (W₁, W₂, ..., W_k) be a list of words sorted in lexicographical order. First, we define a tree-association pattern on (W₁, W₂, ..., W_k): it is a sequence 〈t₁; t₂; ...; t_{k-1}〉 of labeled rooted trees such that, for i = 1, 2, ..., k-1, (1) t_i consists of only one node having the pair of words W_i and W_{i+1} as its label, or (2) t_i is a labeled rooted tree which has exactly two leaves, labeled with W_i and W_{i+1}, respectively. Next, we present a text mining algorithm for finding all frequent tree-association patterns in semistructured documents. Finally, we report experimental results showing that the algorithm is effective for extracting structural characteristics in semistructured documents.
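The definition of a tree-association pattern can be made concrete with a small sketch. The code below is not the authors' algorithm; it only checks whether a sequence of trees satisfies conditions (1) and (2) for a given word list. The names `Node`, `is_component`, and `is_tree_association_pattern` are hypothetical, as are the assumptions that internal nodes carry tag names as labels, that the two leaves in case (2) may appear in either order, and that lexicographical order means Python's default string ordering.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple, Union

# A label is either a single string (a word or, by assumption, a tag name)
# or a pair of words, as used in case (1) of the definition.
Label = Union[str, Tuple[str, str]]


@dataclass
class Node:
    """A node of a labeled rooted tree (hypothetical representation)."""
    label: Label
    children: List["Node"] = field(default_factory=list)


def leaf_labels(tree: Node) -> List[Label]:
    """Return the labels of all leaves of the tree, left to right."""
    if not tree.children:
        return [tree.label]
    labels: List[Label] = []
    for child in tree.children:
        labels.extend(leaf_labels(child))
    return labels


def is_component(tree: Node, w_i: str, w_next: str) -> bool:
    """Check conditions (1) and (2) for one tree t_i."""
    # Case (1): a single node labeled with the pair (W_i, W_{i+1}).
    if not tree.children:
        return tree.label == (w_i, w_next)
    # Case (2): exactly two leaves, labeled W_i and W_{i+1}.
    # Accepting either leaf order is an assumption of this sketch.
    labels = leaf_labels(tree)
    return labels == [w_i, w_next] or labels == [w_next, w_i]


def is_tree_association_pattern(words: Sequence[str], trees: Sequence[Node]) -> bool:
    """Check that trees is a tree-association pattern on the word list."""
    k = len(words)
    if k < 2 or len(trees) != k - 1:
        return False
    if list(words) != sorted(words):  # words must be in lexicographical order
        return False
    return all(
        is_component(tree, words[i], words[i + 1])
        for i, tree in enumerate(trees)
    )


# Usage: k = 3 words, so the pattern has k - 1 = 2 trees.
words = ["data", "mining", "text"]
t1 = Node(("data", "mining"))                               # case (1)
t2 = Node("p", [Node("mining"), Node("b", [Node("text")])])  # case (2)
pattern_ok = is_tree_association_pattern(words, [t1, t2])
```

Here `t1` is the single-node form and `t2` is a tree whose two leaves are the adjacent words `mining` and `text`; the tag names `p` and `b` are illustrative only. Counting how often such a pattern occurs in a document collection, which the paper's mining algorithm addresses, is outside this sketch.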

Author information

Authors and Affiliations

Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
Kazuyoshi Furukawa, Tomoyuki Uchida, Kazuya Yamada, Tetsuhiro Miyahara & Yasuaki Nakamura
Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
Takayoshi Shoudai

Editor information

Editors and Affiliations

EE Department, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC
Ming-Syan Chen
IBM Thomas J. Watson Research Center, 30 Sawmill River Road, Hawthorne, NY, 10532, USA
Philip S. Yu
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Bing Liu

About this paper

Cite this paper

Furukawa, K., Uchida, T., Yamada, K., Miyahara, T., Shoudai, T., Nakamura, Y. (2002). Extracting Characteristic Structures among Words in Semistructured Documents. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_36

DOI: https://doi.org/10.1007/3-540-47887-6_36
Published: 29 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive
