An Integrated System of Mining HTML Texts and Filtering Structured Documents

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Abstract

This paper presents a method of mining HTML documents into structured documents and of filtering structured documents by using both slot weighting and token weighting. The goal of the mining algorithm is to find slot-token patterns in HTML documents. To express user interests in structured document filtering, both slots and tokens are considered. Our preference computation algorithm applies vector similarity and Bayesian probability to filter structured documents. The experimental results show that it is important to consider hyperlinking and unlabelling in mining HTML texts, and that slot and token weighting can enhance the performance of structured document filtering.
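
As a rough illustration of how slot weighting and token weighting might combine in a vector-similarity preference score, the following sketch may help. It is not the authors' algorithm: the function names `cosine` and `preference`, the dictionary layout of documents and profiles, and the weighted-average combination are all assumptions made for this example, and the Bayesian component mentioned in the abstract is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse token-weight mappings."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def preference(document, profile, slot_weights):
    """Hypothetical preference score for one structured document.

    document:     slot name -> list of tokens extracted for that slot
    profile:      slot name -> {token: weight} describing user interest
    slot_weights: slot name -> importance of that slot to the user
    Returns the slot-weighted average of per-slot cosine similarities.
    """
    total_weight = sum(slot_weights.values())
    if total_weight == 0:
        return 0.0
    score = 0.0
    for slot, weight in slot_weights.items():
        doc_tokens = Counter(document.get(slot, []))  # token frequencies
        score += weight * cosine(doc_tokens, profile.get(slot, {}))
    return score / total_weight


# Example: the "title" slot counts twice as much as the "genre" slot.
document = {"title": ["data", "mining"], "genre": ["drama"]}
profile = {"title": {"mining": 1.0}, "genre": {"comedy": 1.0}}
slot_weights = {"title": 2.0, "genre": 1.0}
print(preference(document, profile, slot_weights))
```

In this sketch a document whose highly weighted slots share tokens with the user profile scores higher than one that matches only in low-weight slots, which is the intuition behind weighting slots and tokens separately.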

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yun, BH., Lim, ME., Park, SH. (2003). An Integrated System of Mining HTML Texts and Filtering Structured Documents. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_34

  • DOI: https://doi.org/10.1007/3-540-36175-8_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-04760-5

  • Online ISBN: 978-3-540-36175-6

  • eBook Packages: Springer Book Archive
