Website-Level Data Extraction

Li, Jianqiang; Zhao, Yu

doi:10.1007/978-3-642-12436-5_18

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 45)

Included in the following conference series:

International Conference on Web Information Systems and Technologies

Abstract

This paper proposes a website-level data extraction approach that identifies object-relevant information distributed across multiple web pages. Page-level data extraction has been widely studied under the assumption that each input web page contains multiple data records of the objects of interest. However, in many web mining scenarios, the pages describing a single object are sparsely distributed across a website, which makes page-level solutions inapplicable. We exploit the hierarchical model that websites use to organize their pages to solve the problem of website-level data extraction. A new resource, the Hierarchical Navigation Path (HNP), which can be discovered from the website structure, is introduced for filtering object-relevant web pages. The pages found are clustered using URL and semantic hyperlink analysis, and then the entry page and the detailed profile pages of each object are identified. Experiments show the effectiveness of the proposed approach.
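To make the clustering step more concrete, the following is a minimal illustrative sketch and not the authors' algorithm. It swaps in a much simpler technique: grouping pages by a shared URL path prefix and picking the page with the shortest path as the entry page. The function names (`cluster_by_url_prefix`, `pick_entry_page`), the `depth` parameter, and the example.org URLs are all hypothetical. HNP discovery and semantic hyperlink analysis are not modeled.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cluster_by_url_prefix(urls, depth=2):
    """Group URLs that share their first `depth` path segments.

    Assumption (not from the paper): pages describing one object
    live under a common directory-like URL prefix.
    """
    clusters = defaultdict(list)
    for url in urls:
        # Split the path into non-empty segments, e.g. ["people", "alice"].
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = "/".join(segments[:depth])
        clusters[key].append(url)
    return dict(clusters)


def pick_entry_page(cluster_urls):
    """Heuristic: treat the URL with the shortest path as the entry page.

    Ties are broken alphabetically so the result is deterministic.
    """
    return min(
        cluster_urls,
        key=lambda u: (len(urlparse(u).path.rstrip("/")), u),
    )


# Hypothetical crawl result for a small site.
urls = [
    "http://example.org/people/alice/",
    "http://example.org/people/alice/publications.html",
    "http://example.org/people/alice/projects.html",
    "http://example.org/people/bob/",
    "http://example.org/people/bob/cv.html",
    "http://example.org/news/2009.html",
]

clusters = cluster_by_url_prefix(urls, depth=2)
entry_pages = {key: pick_entry_page(pages) for key, pages in clusters.items()}
```

In this toy input, the three pages under `people/alice` form one cluster and the directory URL is chosen as its entry page, with the remaining pages playing the role of detailed profile pages. A real site would need the link-based evidence the abstract describes, since URL prefixes alone do not always reflect how pages are organized.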


Author information

Authors and Affiliations

NEC Laboratories China, 11F, Bldg.A, Innovation Plaza, Tsinghua Science Park, Beijing, 100084, China
Jianqiang Li & Yu Zhao

Editor information

Editors and Affiliations

Department of Systems and Informatics, Institute for Systems and Technologies of Information, Control and Communication (INSTICC) and Instituto Politécnico de Setúbal (IPS), Rua do Vale de Chaves, Estefanilha, 2910-761, Setúbal, Portugal
José Cordeiro & Joaquim Filipe

About this paper

Cite this paper

Li, J., Zhao, Y. (2010). Website-Level Data Extraction. In: Cordeiro, J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2009. Lecture Notes in Business Information Processing, vol 45. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12436-5_18

DOI: https://doi.org/10.1007/978-3-642-12436-5_18
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12435-8
Online ISBN: 978-3-642-12436-5
eBook Packages: Computer Science, Computer Science (R0)
