The parallel path framework for entity discovery on the web

Published: 30 September 2013

Abstract

It has been a dream of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content on some sites, the schemas of these databases are rarely consistent across a domain. This makes the comparison and aggregation of information from different domains difficult. We aim to make an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., a university department site and a faculty member's home page), we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, and office numbers). To do this, we propose a Web structure mining method that grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
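
The idea of following parallel paths from a seed entity-page can be illustrated with a small sketch. This is not the method of the article, only a simplified toy under stated assumptions: the site is a hypothetical in-memory link graph (`SITE`), every link is annotated with the DOM path of its anchor element, and candidate entity-pages are taken to be the pages that the seed's parent page links to through the same DOM path as the seed. The names `SITE`, `find_dom_path_sequence`, and `parallel_candidates` are invented for illustration.

```python
from collections import deque

# Hypothetical toy site: page -> list of (dom_path_of_anchor, target_page).
SITE = {
    "/": [("html/body/nav/a", "/people"), ("html/body/nav/a", "/news")],
    "/people": [
        ("html/body/ul/li/a", "/people/alice"),
        ("html/body/ul/li/a", "/people/bob"),
        ("html/body/div/a", "/"),
    ],
    "/news": [("html/body/ul/li/a", "/news/1")],
}


def find_dom_path_sequence(site, start, seed):
    """Breadth-first search for a shortest link path from start to seed.

    Returns (pages on the path, DOM path of each link followed),
    or None if the seed is not reachable.
    """
    queue = deque([(start, [start], [])])
    seen = {start}
    while queue:
        page, pages, dom_paths = queue.popleft()
        if page == seed:
            return pages, dom_paths
        for dom_path, target in site.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, pages + [target], dom_paths + [dom_path]))
    return None


def parallel_candidates(site, start, seed):
    """Pages linked from the seed's parent via the same DOM path as the seed."""
    found = find_dom_path_sequence(site, start, seed)
    if found is None:
        return []
    pages, dom_paths = found
    if not dom_paths:  # the seed is the start page itself
        return [seed]
    parent, last_dom_path = pages[-2], dom_paths[-1]
    candidates = []
    for dom_path, target in site.get(parent, []):
        if dom_path == last_dom_path and target not in candidates:
            candidates.append(target)
    return candidates


print(parallel_candidates(SITE, "/", "/people/alice"))
# prints ['/people/alice', '/people/bob']
```

In this toy graph the link back to the home page is excluded because its anchor sits at a different DOM path, which is the intuition behind treating links at the same structural position as parallel.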

      Published In

      ACM Transactions on the Web  Volume 7, Issue 3
      September 2013
      149 pages
      ISSN:1559-1131
      EISSN:1559-114X
      DOI:10.1145/2516633
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 30 September 2013
      Accepted: 01 March 2013
      Revised: 01 July 2012
      Received: 01 February 2012
      Published in TWEB Volume 7, Issue 3

      Author Tags

      1. Parallel paths
      2. entity pages
      3. semi-structured data
      4. web structure mining

      Qualifiers

      • Research-article
      • Research
      • Refereed
