Abstract
The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
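The idea of giving URL terms different weights can be illustrated with a small sketch. This is not SSUP's actual formula, which the abstract does not give. It assumes a hypothetical scheme in which a term's weight is how much more often it appears in a few known entity-page URLs than in other URLs of the same site. It leaves out HTML features and threshold selection entirely. The function names and the `example.org` URLs are invented for illustration.

```python
import re
from collections import Counter


def url_terms(url):
    """Split a URL into lowercase alphabetic and numeric terms, ignoring the scheme."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    return set(re.findall(r"[a-z]+|\d+", without_scheme))


def term_weights(sample_urls, other_urls):
    """Hypothetical weighting: share of sample URLs containing a term
    minus the share of other URLs containing it, floored at zero."""
    sample_df = Counter(t for u in sample_urls for t in url_terms(u))
    other_df = Counter(t for u in other_urls for t in url_terms(u))
    weights = {}
    for term, count in sample_df.items():
        in_sample = count / len(sample_urls)
        in_other = other_df[term] / len(other_urls) if other_urls else 0.0
        weights[term] = max(0.0, in_sample - in_other)
    return weights


def url_score(url, weights):
    """Weighted fraction of the discriminative terms that occur in the URL."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    terms = url_terms(url)
    return sum(w for t, w in weights.items() if t in terms) / total


# Assumed toy data: two known entity-page URLs and two other pages.
samples = [
    "http://example.org/drivers/profile/101",
    "http://example.org/drivers/profile/102",
]
others = [
    "http://example.org/news/2014/01",
    "http://example.org/teams/list",
]
weights = term_weights(samples, others)
candidate_score = url_score("http://example.org/drivers/profile/205", weights)
news_score = url_score("http://example.org/news/2014/02", weights)
```

In this toy setting, terms shared by every page on the site (such as the host name) get zero weight, while terms specific to the entity-page URL pattern dominate the score, so the unseen driver URL scores higher than the news URL.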
Index Terms
- Combining URL and HTML Features for Entity Discovery in the Web