
Website-Level Data Extraction

  • Conference paper
Web Information Systems and Technologies (WEBIST 2009)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 45)


Abstract

This paper proposes a website-level data extraction approach that identifies object-relevant information distributed across multiple web pages. Page-level data extraction has been widely studied under the assumption that each input web page contains multiple data records of the objects of interest. In many web mining cases, however, the pages describing a single object are sparsely distributed across a website, which makes page-level solutions inapplicable. We exploit the hierarchical model that websites use to organize their pages to solve the problem of website-level data extraction. A new resource, the Hierarchical Navigation Path (HNP), which can be discovered from the website structure, is introduced for filtering object-relevant web pages. The pages found are clustered using URL and semantic hyperlink analysis, and then the entry page and the detailed profile pages of each object are identified. Empirical experiments show the effectiveness of the proposed approach.
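The clustering and entry-page steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify how URLs are compared or how the entry page is chosen, so the prefix-based grouping, the `depth` parameter, the "shallowest URL is the entry page" rule, and the function names `cluster_by_url_prefix` and `pick_entry_page` are all assumptions made for illustration. The semantic hyperlink analysis and HNP discovery are not modeled here.

```python
from collections import defaultdict
from urllib.parse import urlparse


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [s for s in urlparse(url).path.split("/") if s]


def cluster_by_url_prefix(urls, depth=2):
    """Group URLs that share their first `depth` path segments.

    Assumption: pages about one object live under a common URL prefix.
    """
    clusters = defaultdict(list)
    for url in urls:
        key = "/".join(path_segments(url)[:depth])
        clusters[key].append(url)
    return dict(clusters)


def pick_entry_page(cluster):
    """Guess the entry page as the shallowest URL in the cluster.

    Assumption: ties are broken by the shorter URL string.
    """
    return min(cluster, key=lambda u: (len(path_segments(u)), len(u)))


# Hypothetical pages for two objects on one site.
pages = [
    "http://example.org/people/alice/",
    "http://example.org/people/alice/publications.html",
    "http://example.org/people/alice/teaching.html",
    "http://example.org/people/bob/",
    "http://example.org/people/bob/cv.html",
]
clusters = cluster_by_url_prefix(pages, depth=2)
entries = {key: pick_entry_page(group) for key, group in clusters.items()}
```

Under these assumptions, each object's cluster holds its entry page plus its detailed profile pages. A real site would need the hyperlink-based signals the abstract mentions, since URL prefixes alone do not always reflect how pages are organized.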




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, J., Zhao, Y. (2010). Website-Level Data Extraction. In: Cordeiro, J., Filipe, J. (eds) Web Information Systems and Technologies. WEBIST 2009. Lecture Notes in Business Information Processing, vol 45. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12436-5_18


  • DOI: https://doi.org/10.1007/978-3-642-12436-5_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-12435-8

  • Online ISBN: 978-3-642-12436-5

  • eBook Packages: Computer Science, Computer Science (R0)
