Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Miyahara, Tetsuhiro; Suzuki, Yusuke; Shoudai, Takayoshi; Uchida, Tomoyuki; Takahashi, Kenichi; Ueda, Hiroaki

doi:10.1007/3-540-47887-6_35

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Many Web documents, such as HTML and XML files, have no rigid structure and are called semistructured data. Such semistructured Web documents are generally represented by rooted trees with ordered children. We propose a new method for discovering frequent tree-structured patterns in semistructured Web documents, using a tag tree pattern as a hypothesis. A tag tree pattern is an edge-labeled tree with ordered children that has structured variables. An edge label is a tag or a keyword appearing in such Web documents, and a variable can be substituted by an arbitrary tree, so a tag tree pattern is well suited to representing tree-structured patterns in these documents. First, we show that it is hard to compute the optimum frequent tag tree pattern. We therefore present an algorithm for generating all maximally frequent tag tree patterns and show its correctness. Finally, we report experimental results for our algorithm. Although the algorithm is not efficient, the experiments show that it can extract characteristic tree-structured patterns from those data.
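To make the idea of a pattern with variables more concrete, the following is a minimal Python sketch under strong simplifying assumptions. It is not the definition or the algorithm from the paper. Here a tree is an ordered list of (edge label, subtree) pairs, a variable is modeled by the hypothetical sentinel `VAR` standing in for an arbitrary subtree, and the names `matches` and `frequency` are illustrative only.

```python
# Simplified illustration only: the paper's structured variables and its
# matching semantics are richer than what is modeled here.

# Hypothetical sentinel: a variable that any subtree may be substituted for.
VAR = "VAR"

# A tree is an ordered list of (edge_label, subtree) pairs; [] is a leaf.
# A pattern has the same shape, except a subtree position may hold VAR.


def matches(pattern, tree):
    """Return True if the pattern matches the tree under the simplified rules."""
    if isinstance(pattern, str) and pattern == VAR:
        # A variable can be replaced by an arbitrary subtree.
        return True
    if len(pattern) != len(tree):
        # Children are ordered, so the edge lists must line up one to one.
        return False
    return all(
        p_label == t_label and matches(p_child, t_child)
        for (p_label, p_child), (t_label, t_child) in zip(pattern, tree)
    )


def frequency(pattern, documents):
    """Fraction of the documents that the pattern matches."""
    if not documents:
        return 0.0
    return sum(matches(pattern, doc) for doc in documents) / len(documents)


# Three toy documents with tags as edge labels.
doc1 = [("html", [("title", []), ("body", [("p", [])])])]
doc2 = [("html", [("title", []), ("body", [("ul", [("li", [])]), ("p", [])])])]
doc3 = [("html", [("body", [])])]

# The pattern fixes the html/title/body skeleton and leaves the body open.
pattern = [("html", [("title", []), ("body", VAR)])]

share = frequency(pattern, [doc1, doc2, doc3])
```

In this toy setting the pattern matches `doc1` and `doc2` but not `doc3`, which has no `title` edge. A pattern would then count as frequent when its share of matched documents reaches a chosen threshold; the threshold itself is an assumption of this sketch.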

Author information

Authors and Affiliations

Faculty of Information Sciences, Hiroshima City University, Hiroshima, 731-3194, Japan
Tetsuhiro Miyahara, Tomoyuki Uchida, Kenichi Takahashi & Hiroaki Ueda
Department of Informatics, Kyushu University, Kasuga, 816-8580, Japan
Yusuke Suzuki & Takayoshi Shoudai

Editor information

Editors and Affiliations

EE Department, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC
Ming-Syan Chen
IBM Thomas J. Watson Research Center, 30 Sawmill River Road, Hawthorne, NY, 10532, USA
Philip S. Yu
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Bing Liu

About this paper

Cite this paper

Miyahara, T., Suzuki, Y., Shoudai, T., Uchida, T., Takahashi, K., Ueda, H. (2002). Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_35

DOI: https://doi.org/10.1007/3-540-47887-6_35
Published: 29 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive
