Bottom-Up Discovery of Clusters of Maximal Ranges in HTML Trees for Search Engines Results Extraction

Flejter, Dominik; Hryniewiecki, Roman

doi:10.1007/978-3-540-72035-5_31

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4439)

Included in the following conference series:

International Conference on Business Information Systems

Abstract

Unsupervised detection of HTML records is an important step in many Web content mining applications.

In this paper we propose a method for the bottom-up discovery of clusters of maximal, non-agglomerative, similar HTML ranges in a nested set representation of the HTML tree. We then demonstrate its applicability to record detection in search engine results. To measure performance, we evaluated several distance assessment strategies and prepared two test collections: result pages from almost 60 global and country-specific search engines, and almost 100 methodically generated complex HTML trees with pre-set properties.

An empirical study shows that our method performs well and successfully detects most clusters of search result ranges.
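
To make the nested set representation mentioned in the abstract more concrete, the following is a minimal Python sketch of the general nested set model applied to an HTML tree. It is not the authors' implementation: the names `NestedSetBuilder`, `nested_set` and `is_descendant` are hypothetical, the input is assumed to be well-formed markup, and only element nodes are numbered (text nodes are ignored). Each element receives a `left` number when it opens and a `right` number when it closes, so containment in the tree can be checked by comparing intervals.

```python
from html.parser import HTMLParser

# Void elements have no closing tag, so they are closed as soon as they open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class NestedSetBuilder(HTMLParser):
    """Assigns (left, right) nested-set numbers to elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.counter = 0
        self.stack = []  # indices into self.nodes of currently open elements
        self.nodes = []  # one [tag, left, right] entry per element

    def _open(self, tag):
        self.counter += 1
        self.nodes.append([tag, self.counter, None])
        self.stack.append(len(self.nodes) - 1)

    def _close(self):
        self.counter += 1
        self.nodes[self.stack.pop()][2] = self.counter

    def handle_starttag(self, tag, attrs):
        self._open(tag)
        if tag in VOID:
            self._close()

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>.
        self._open(tag)
        self._close()

    def handle_endtag(self, tag):
        # Assumption: well-formed input, so the end tag matches the open element.
        if self.stack and self.nodes[self.stack[-1]][0] == tag:
            self._close()


def nested_set(html):
    """Return a list of (tag, left, right) tuples in document order."""
    builder = NestedSetBuilder()
    builder.feed(html)
    builder.close()
    return [tuple(node) for node in builder.nodes]


def is_descendant(node, ancestor):
    """True if node's interval lies strictly inside ancestor's interval."""
    return ancestor[1] < node[1] and node[2] < ancestor[2]


if __name__ == "__main__":
    for row in nested_set("<ul><li><a>x</a></li><li><a>y</a></li></ul>"):
        print(row)
```

In this encoding a range of sibling subtrees corresponds to a contiguous interval of numbers, which is one reason the nested set model is convenient for comparing candidate record ranges; how the paper itself measures similarity between ranges is not specified in the abstract and is not modelled here.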

Author information

Authors and Affiliations

Poznan University of Economics, Department of Information Systems, al. Niepodleglosci, 10, 60-967, Poznan, Poland
Dominik Flejter & Roman Hryniewiecki

Editor information

Witold Abramowicz

About this paper

Cite this paper

Flejter, D., Hryniewiecki, R. (2007). Bottom-Up Discovery of Clusters of Maximal Ranges in HTML Trees for Search Engines Results Extraction. In: Abramowicz, W. (eds) Business Information Systems. BIS 2007. Lecture Notes in Computer Science, vol 4439. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72035-5_31

DOI: https://doi.org/10.1007/978-3-540-72035-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72034-8
Online ISBN: 978-3-540-72035-5
eBook Packages: Computer Science, Computer Science (R0)