Extracting General Lists from Web Documents: A Hybrid Approach

  • Conference paper
Modern Approaches in Applied Intelligence (IEA/AIE 2011)

Abstract

The problem of extracting structured data (e.g. lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that methods which consider the HTML structure and the visual cues of a Web page independently do not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.
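
The abstract does not spell out the algorithm, so the following is only a toy sketch of the general idea of combining a visual cue with a structural cue. It is not the authors' method. The `Box` type, the `candidate_lists` function, the pixel coordinates, and the use of an exact tag-path match as the structural test are all assumptions made for illustration. Elements are first grouped by a shared left edge (a vertical list) or a shared top edge (a horizontal list), and a group is kept only if its items also share the same tag path.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for a rendered element (not from the paper)."""
    text: str
    x: int         # left edge in pixels (assumed to come from a renderer)
    y: int         # top edge in pixels
    tag_path: str  # path from the document root, e.g. "html/body/div/a"


def candidate_lists(boxes, min_items=2):
    """Return groups of texts that are visually aligned AND structurally alike.

    Visual cue: items share a left edge (vertical list) or a top edge
    (horizontal list). Structural cue: items share the same tag path.
    Both tests are simplifying assumptions for this sketch.
    """
    groups = defaultdict(list)
    for b in boxes:
        groups[("x", b.x, b.tag_path)].append(b)  # vertical alignment
        groups[("y", b.y, b.tag_path)].append(b)  # horizontal alignment

    result = []
    for (axis, _, _), items in groups.items():
        if len(items) >= min_items:
            # Order a vertical list top to bottom, a horizontal one left to right.
            order = (lambda b: b.y) if axis == "x" else (lambda b: b.x)
            result.append([b.text for b in sorted(items, key=order)])
    return result


boxes = [
    Box("Home", 10, 0, "html/body/div/a"),
    Box("News", 60, 0, "html/body/div/a"),
    Box("About", 110, 0, "html/body/div/a"),
    Box("Title", 10, 40, "html/body/h1"),
    Box("Item A", 10, 80, "html/body/p"),
    Box("Item B", 10, 100, "html/body/p"),
]

print(candidate_lists(boxes))
```

In this toy input, "Title" shares a left edge with "Home", "Item A", and "Item B", so alignment alone would group all four. The tag-path check separates them, which illustrates why the two cues might be combined rather than used independently.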

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fumarola, F., Weninger, T., Barber, R., Malerba, D., Han, J. (2011). Extracting General Lists from Web Documents: A Hybrid Approach. In: Mehrotra, K.G., Mohan, C.K., Oh, J.C., Varshney, P.K., Ali, M. (eds) Modern Approaches in Applied Intelligence. IEA/AIE 2011. Lecture Notes in Computer Science, vol 6703. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21822-4_29

  • DOI: https://doi.org/10.1007/978-3-642-21822-4_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21821-7

  • Online ISBN: 978-3-642-21822-4

  • eBook Packages: Computer Science (R0)
