
Bottom-Up Discovery of Clusters of Maximal Ranges in HTML Trees for Search Engines Results Extraction

  • Conference paper
Business Information Systems (BIS 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4439)


Abstract

Unsupervised detection of HTML records is an important step in many Web content mining applications.

In this paper we propose a method for bottom-up discovery of clusters of maximal, non-agglomerative, similar HTML ranges in a nested-set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: one containing results pages from almost 60 global and country-specific search engines, and one containing almost 100 methodically generated complex HTML trees with pre-set properties.

The empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
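
For readers unfamiliar with the nested-set tree representation mentioned above, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the common left/right numbering scheme for nested sets and a toy tree of three list items; the names `Node`, `number_nested_set`, `is_descendant` and `range_bounds` are hypothetical. It shows why the representation is convenient for working with ranges: a run of consecutive sibling subtrees maps to one contiguous interval.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)
    left: int = 0   # nested-set left bound, assigned on entering the node
    right: int = 0  # nested-set right bound, assigned on leaving the node


def number_nested_set(node: Node, counter: int = 1) -> int:
    """Assign (left, right) bounds depth-first; return the next free number."""
    node.left = counter
    counter += 1
    for child in node.children:
        counter = number_nested_set(child, counter)
    node.right = counter
    return counter + 1


def is_descendant(inner: Node, outer: Node) -> bool:
    """A node lies inside another iff its interval is strictly nested."""
    return outer.left < inner.left and inner.right < outer.right


def range_bounds(siblings: List[Node]) -> Tuple[int, int]:
    """A range of consecutive siblings covers one contiguous interval."""
    return siblings[0].left, siblings[-1].right


# Toy tree (an assumption for illustration): <ul> with three <li><a/></li>.
ul = Node("ul", [Node("li", [Node("a")]) for _ in range(3)])
number_nested_set(ul)

print((ul.left, ul.right))             # bounds of the whole list
print(range_bounds(ul.children[:2]))   # bounds of the first two items
```

With this numbering, ancestry tests and range extents reduce to integer comparisons, which is one plausible reason to choose the representation when comparing many candidate ranges.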



Editor information

Witold Abramowicz

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Flejter, D., Hryniewiecki, R. (2007). Bottom-Up Discovery of Clusters of Maximal Ranges in HTML Trees for Search Engines Results Extraction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_31

  • DOI: https://doi.org/10.1007/978-3-540-72035-5_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-72034-8

  • Online ISBN: 978-3-540-72035-5

  • eBook Packages: Computer Science (R0)
