Abstract
This paper presents a method for mining HTML documents into structured documents and for filtering those structured documents using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. To express user interests in structured document filtering, both slots and tokens are considered. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling when mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.
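As a rough illustration of how slot weighting and token weighting might combine in a vector-similarity preference score, the following minimal sketch treats a structured document as a mapping from slot names to token-weight vectors. The abstract does not give the paper's formulas, so the function names (`cosine`, `preference`), the weighted-average combination, and the sample slots and weights are all assumptions made for illustration. The Bayesian component is not shown.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse token-weight vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def preference(profile, document, slot_weights):
    """Slot-weighted average of per-slot cosine similarities.

    profile, document: {slot name: {token: token weight}}
    slot_weights: {slot name: slot weight}
    The weighted-average form is an assumption, not the paper's formula.
    """
    total = sum(slot_weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for slot, weight in slot_weights.items():
        score += weight * cosine(profile.get(slot, {}), document.get(slot, {}))
    return score / total

# Hypothetical user profile and document with two slots.
profile = {"genre": {"comedy": 1.0, "drama": 0.5}, "actor": {"hanks": 1.0}}
document = {"genre": {"comedy": 1.0}, "actor": {"hanks": 1.0}}
slot_weights = {"genre": 2.0, "actor": 1.0}

print(preference(profile, document, slot_weights))
```

In this sketch a document would pass the filter when its score exceeds some chosen threshold. A slot with a larger weight contributes more to the final score, and a token with a larger weight contributes more within its slot.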
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yun, BH., Lim, ME., Park, SH. (2003). An Integrated System of Mining HTML Texts and Filtering Structured Documents. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_34
DOI: https://doi.org/10.1007/3-540-36175-8_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive