An Integrated System of Mining HTML Texts and Filtering Structured Documents

Yun, Bo-Hyun; Lim, Myung-Eun; Park, Soo-Hyun

doi:10.1007/3-540-36175-8_34

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

This paper presents a method for mining HTML documents into structured documents and for filtering those structured documents using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. To express user interests in structured document filtering, both slots and tokens are considered. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling in mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.
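
The idea of combining slot weights with token-level vector similarity can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not give their formulas, so the scoring rule (a slot-weighted average of per-slot cosine similarities), the function names `cosine` and `preference`, the movie-style slots, and all weights below are assumptions made for illustration. The Bayesian component mentioned in the abstract is not modelled here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse token-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def preference(profile, document, slot_weights):
    """Hypothetical preference score: slot-weighted average of the
    token similarity between the user profile and each document slot."""
    total = sum(slot_weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for slot, weight in slot_weights.items():
        # Token frequencies of the slot's text, split on whitespace.
        doc_tokens = Counter(document.get(slot, "").lower().split())
        score += weight * cosine(profile.get(slot, {}), doc_tokens)
    return score / total


# Assumed example data: token weights per slot express the user's interests,
# and slot weights express how much each slot matters to that user.
profile = {
    "genre": {"comedy": 1.0, "drama": 0.5},
    "director": {"allen": 1.0},
}
slot_weights = {"genre": 0.7, "director": 0.3}

doc_a = {"genre": "comedy", "director": "woody allen"}
doc_b = {"genre": "horror", "director": "john carpenter"}

score_a = preference(profile, doc_a, slot_weights)
score_b = preference(profile, doc_b, slot_weights)
print(f"doc_a: {score_a:.3f}, doc_b: {score_b:.3f}")
```

Under these assumed weights, a document that matches the profile in a heavily weighted slot scores higher than one that matches nowhere, so a filter could keep documents whose score exceeds a chosen threshold.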

Author information

Authors and Affiliations

Dept. of Human Information Processing, Electronics and Telecommunications Research Institute, 161, Kajong-Dong, Yusong-Gu, Daejon, 305-350, Korea
Bo-Hyun Yun & Myung-Eun Lim
School of Business IT, Kookmin University, 861-1, Cheongrung-dong, Sungbuk-ku, Seoul, 136-702, Korea
Soo-Hyun Park

Editor information

Editors and Affiliations

Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
Kyu-Young Whang
Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
Jongwoo Jeon
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim
Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
Jaideep Srivastava

About this paper

Cite this paper

Yun, BH., Lim, ME., Park, SH. (2003). An Integrated System of Mining HTML Texts and Filtering Structured Documents. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_34

DOI: https://doi.org/10.1007/3-540-36175-8_34
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive
