Automatic Extraction of Logical Web Lists

  • Conference paper
Foundations of Intelligent Systems (ISMIS 2014)

Abstract

Recently, there has been increased interest in the extraction of structured data from the Web (both the “Surface” Web and the “Hidden” Web). In this paper we focus on the automatic extraction of Web lists. Although this task has been studied extensively, existing approaches assume that a list is wholly contained in a single Web page. They do not consider that many websites spread a listing across several Web pages, each of which shows only a partial view. Similar to databases, where a view can represent a subset of the data contained in a table, these websites split a logical list into multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle this issue, we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life, data-intensive websites confirm the effectiveness of our approach.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Lanotte, P.F., Fumarola, F., Ceci, M., Scarpino, A., Torelli, M.D., Malerba, D. (2014). Automatic Extraction of Logical Web Lists. In: Andreasen, T., Christiansen, H., Cubero, JC., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2014. Lecture Notes in Computer Science, vol 8502. Springer, Cham. https://doi.org/10.1007/978-3-319-08326-1_37

  • DOI: https://doi.org/10.1007/978-3-319-08326-1_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08325-4

  • Online ISBN: 978-3-319-08326-1

  • eBook Packages: Computer Science, Computer Science (R0)
