research-article

Combining URL and HTML Features for Entity Discovery in the Web

Published: 04 December 2019

Abstract

The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on a website. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of entity-page discovery. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset of real-world websites covering a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
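To make the idea of weighting URL terms by their discriminative capacity more concrete, the following is a minimal sketch, not SSUP itself. The abstract does not specify the weighting formula, so the weight used here (the share of sample entity-page URLs containing a term minus the share of other URLs containing it) is an assumption chosen for illustration. The function names, the `example.com` URLs, and the use of labeled sample URLs are likewise hypothetical.

```python
import re
from collections import Counter


def url_terms(url):
    """Return the set of lowercase alphabetic terms in the URL path.

    The scheme and host are stripped; digits and punctuation act as separators.
    """
    path = re.sub(r"^[a-z]+://[^/]+", "", url.lower())
    return set(re.findall(r"[a-z]+", path))


def term_weights(entity_urls, other_urls):
    """Assumed weighting (not taken from the article): for each term, the
    fraction of entity-page URLs containing it minus the fraction of other
    URLs containing it. Positive weights point toward entity-pages."""
    in_entity = Counter(t for u in entity_urls for t in url_terms(u))
    in_other = Counter(t for u in other_urls for t in url_terms(u))
    terms = set(in_entity) | set(in_other)
    return {
        t: in_entity[t] / len(entity_urls) - in_other[t] / len(other_urls)
        for t in terms
    }


def score(url, weights):
    """Sum the weights of the URL's terms; unseen terms contribute 0."""
    return sum(weights.get(t, 0.0) for t in url_terms(url))


# Hypothetical sample URLs from a car racing website.
entity_urls = [
    "http://example.com/drivers/lewis-hamilton",
    "http://example.com/drivers/max-verstappen",
]
other_urls = [
    "http://example.com/news/race-report",
    "http://example.com/teams",
]

weights = term_weights(entity_urls, other_urls)
driver_score = score("http://example.com/drivers/sebastian-vettel", weights)
news_score = score("http://example.com/news/interview", weights)
```

In this toy data, the term `drivers` appears in every sample entity-page URL and in no other URL, so it receives the highest weight, and an unseen driver URL scores higher than an unseen news URL. A real system would also need HTML features and automatically chosen thresholds, as the abstract describes.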


        • Published in

          ACM Transactions on the Web, Volume 13, Issue 4
          November 2019
          139 pages
          ISSN:1559-1131
          EISSN:1559-114X
          DOI:10.1145/3372405
          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 December 2019
          • Accepted: 1 September 2019
          • Revised: 1 February 2019
          • Received: 1 February 2017

          Qualifiers

          • research-article
          • Research
          • Refereed
