Combining URL and HTML Features for Entity Discovery in the Web

Authors:

Carina Friedrich Dorneles,

Renata Galante

ACM Transactions on the Web (TWEB), Volume 13, Issue 4

Article No.: 20, Pages 1 - 27

https://doi.org/10.1145/3365574

Published: 04 December 2019

Abstract

The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds for each website without human intervention. We carried out experiments on a dataset of real-world websites covering a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
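
As a reading aid for the precision and recall figures reported above, the following minimal sketch shows how the two metrics are conventionally computed for an entity-page discovery task. It is not taken from the article: the function name `precision_recall` and the example URLs are hypothetical, and the article's own evaluation protocol may differ in detail.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for a set of predicted entity-page URLs.

    precision = correctly predicted entity-pages / all predicted pages
    recall    = correctly predicted entity-pages / all true entity-pages
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical URLs from a made-up racing website (not the article's dataset).
actual_entity_pages = {
    "http://example.com/drivers/1",
    "http://example.com/drivers/2",
    "http://example.com/drivers/3",
    "http://example.com/drivers/4",
    "http://example.com/drivers/5",
}
predicted_entity_pages = {
    "http://example.com/drivers/1",
    "http://example.com/drivers/2",
    "http://example.com/drivers/3",
    "http://example.com/news/today",  # a false positive
}

precision, recall = precision_recall(predicted_entity_pages, actual_entity_pages)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case three of the four predicted pages are true entity-pages (precision 0.75), and three of the five true entity-pages are found (recall 0.60).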

Cited By

Uzun E. (2023). Scraping Relevant Images from Web Pages without Download. ACM Transactions on the Web 18:1 (1-27). DOI: 10.1145/3616849. Online publication date: 11-Oct-2023
https://dl.acm.org/doi/10.1145/3616849
Hegade P., Lingadhal N., Jain S., Khan U., Vijeth K. (2021). Crawler by Contextual Inference. SN Computer Science 2:3. DOI: 10.1007/s42979-021-00574-z. Online publication date: 16-Apr-2021
https://dl.acm.org/doi/10.1007/s42979-021-00574-z

Index Terms

Combining URL and HTML Features for Entity Discovery in the Web
1. Information systems
  1. Data management systems
    1. Information integration
      1. Wrappers (data mining)
  2. World Wide Web
    1. Web mining
      1. Site wrapping
    2. Web searching and information discovery
      1. Web search engines
        1. Web crawling

Information

Published In

ACM Transactions on the Web Volume 13, Issue 4

November 2019

139 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/3372405

Editor:
Brian D. Davison
Lehigh University, USA

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 December 2019

Accepted: 01 September 2019

Revised: 01 February 2019

Received: 01 February 2017

Published in TWEB Volume 13, Issue 4

Qualifiers

Research-article
Research
Refereed

Bibliometrics

Article Metrics

Total Citations: 2
Total Downloads: 289
Downloads (Last 12 months): 6
Downloads (Last 6 weeks): 0

Reflects downloads up to 07 Mar 2025
