Abstract
This paper proposes a website-level data extraction approach that identifies object-relevant information distributed across multiple web pages. Page-level data extraction has been widely studied under the assumption that each input web page contains multiple data records of the objects of interest. In many web mining scenarios, however, the pages describing a single object are sparsely distributed across a website, which makes page-level solutions inapplicable. We exploit the hierarchical model that websites use to organize their pages to solve the problem of website-level data extraction. A new resource, the Hierarchical Navigation Path (HNP), which can be discovered from the website structure, is introduced for filtering object-relevant web pages. The pages found are clustered using URL and semantic hyperlink analysis, and the entry page and detailed profile pages of each object are then identified. Empirical experiments show the effectiveness of the proposed approach.
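The abstract does not spell out how the URL-based clustering or the entry-page identification works, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that pages about the same object share a URL path prefix and that the entry page is the shallowest URL in its cluster. The function names `cluster_by_url_prefix` and `pick_entry_page`, the `depth` parameter, and the example URLs are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_by_url_prefix(urls, depth=2):
    """Group URLs by host plus the first `depth` path segments.

    Assumption (not from the paper): pages describing one object
    live under a common URL path prefix.
    """
    clusters = defaultdict(list)
    for url in urls:
        parsed = urlparse(url)
        segments = [s for s in parsed.path.split("/") if s]
        key = (parsed.netloc, tuple(segments[:depth]))
        clusters[key].append(url)
    return dict(clusters)


def pick_entry_page(cluster_urls):
    """Heuristic: the URL with the fewest path segments is the entry page.

    Ties are broken by the shorter URL string.
    """
    def path_depth(url):
        return len([s for s in urlparse(url).path.split("/") if s])

    return min(cluster_urls, key=lambda u: (path_depth(u), len(u)))


# Hypothetical pages from a department website.
pages = [
    "http://example.edu/people/alice/",
    "http://example.edu/people/alice/publications.html",
    "http://example.edu/people/alice/teaching.html",
    "http://example.edu/people/bob/",
    "http://example.edu/people/bob/cv.html",
]

clusters = cluster_by_url_prefix(pages, depth=2)
entry_pages = {key: pick_entry_page(urls) for key, urls in clusters.items()}
```

Here each cluster stands for one object, and the remaining URLs in a cluster play the role of detailed profile pages. The semantic hyperlink analysis mentioned in the abstract would need anchor text and link structure, which this sketch leaves out.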
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, J., Zhao, Y. (2010). Website-Level Data Extraction. In: Cordeiro, J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2009. Lecture Notes in Business Information Processing, vol 45. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12436-5_18
DOI: https://doi.org/10.1007/978-3-642-12436-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12435-8
Online ISBN: 978-3-642-12436-5
eBook Packages: Computer Science, Computer Science (R0)