Automatic Extraction of Logical Web Lists

Lanotte, Pasqua Fabiana; Fumarola, Fabio; Ceci, Michelangelo; Scarpino, Andrea; Torelli, Michele Damiano; Malerba, Donato

doi:10.1007/978-3-319-08326-1_37

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8502)

Included in the following conference series:

International Symposium on Methodologies for Intelligent Systems

Abstract

Recently, there has been increased interest in the extraction of structured data from the Web (both the “Surface” Web and the “Hidden” Web). In this paper we focus on the automatic extraction of Web lists. Although this task has been studied extensively, existing approaches assume that a list is wholly contained in a single Web page. They do not consider that many websites spread their listings across several Web pages and show only a partial view on each page. Similar to databases, where a view can represent a subset of the data contained in a table, these websites split a logical list into multiple views (view lists). Automatic extraction of logical lists is an open problem. To tackle it, we propose an unsupervised and domain-independent algorithm for logical list extraction. Experimental results on real-life, data-intensive websites confirm the effectiveness of our approach.
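
The relationship between view lists and a logical list can be illustrated with a minimal Python sketch. This is not the algorithm proposed in the paper, which the abstract does not detail. It assumes, purely for illustration, that the view lists have already been extracted from each page and are given in page order; the function name `merge_view_lists` and the sample items are hypothetical.

```python
from typing import List


def merge_view_lists(view_lists: List[List[str]]) -> List[str]:
    """Combine per-page view lists into one logical list.

    Assumption (not from the paper): the view lists are already
    extracted and ordered by page, and items are comparable strings.
    Items repeated across pages are kept only once, in first-seen order.
    """
    logical: List[str] = []
    seen = set()
    for view in view_lists:
        for item in view:
            if item not in seen:
                seen.add(item)
                logical.append(item)
    return logical


# Hypothetical view lists, one per page of a paginated listing.
views = [
    ["Laptop A", "Laptop B"],  # page 1
    ["Laptop C", "Laptop D"],  # page 2
    ["Laptop D", "Laptop E"],  # page 3 repeats one item from page 2
]
logical_list = merge_view_lists(views)
print(logical_list)
```

The hard part that the paper addresses, identifying which lists on different pages are views of the same logical list, is deliberately left out of this sketch.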

Author information

Authors and Affiliations

Dipartimento di Informatica, Università degli Studi di Bari “Aldo Moro”, via Orabona, 4, 70125, Bari, Italy
Pasqua Fabiana Lanotte, Fabio Fumarola, Michelangelo Ceci, Andrea Scarpino, Michele Damiano Torelli & Donato Malerba

Editor information

Editors and Affiliations

Research Group PLIS: Programming, Logic and Intelligent Systems, Dept. of Communication, Business and Information Technologies, Roskilde University, Denmark
Troels Andreasen & Henning Christiansen
Department of Computer Science and Artificial Intelligence, CITIC, University of Granada, 18071, Granada, Spain
Juan-Carlos Cubero
University of North Carolina, 9201 University City Blvd, Charlotte, NC 28223, USA, and Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland
Zbigniew W. Raś

Cite this paper

Lanotte, P.F., Fumarola, F., Ceci, M., Scarpino, A., Torelli, M.D., Malerba, D. (2014). Automatic Extraction of Logical Web Lists. In: Andreasen, T., Christiansen, H., Cubero, JC., Raś, Z.W. (eds) Foundations of Intelligent Systems. ISMIS 2014. Lecture Notes in Computer Science, vol 8502. Springer, Cham. https://doi.org/10.1007/978-3-319-08326-1_37

DOI: https://doi.org/10.1007/978-3-319-08326-1_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08325-4
Online ISBN: 978-3-319-08326-1
eBook Packages: Computer Science, Computer Science (R0)
