Tabular Web Data: Schema Discovery and Integration

  • Conference paper
Data Warehousing and Knowledge Discovery (DaWaK 2013)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8057)

Abstract

Web data such as web tables, lists, and data records from a wide variety of domains can be combined for different purposes such as querying for information and creating example data sets. Tabular web data location, extraction, and schema discovery and integration are important for effectively combining, querying, and presenting it in a uniform format. We focus on schema generation and integration for both a static and a dynamic framework. We contribute algorithms for generating individual schemas from extracted tabular web data and integrating the generated schemas. Our approach is novel because it contributes functionality not previously addressed; it accommodates both the static and dynamic frameworks, different kinds of web data types, schema discovery and unification, and table integration.
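
The two steps named in the abstract, generating a schema per extracted table and then integrating the generated schemas, can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not describe; it assumes each extracted table arrives as a header row plus string-valued data rows, matches columns only by normalized name, and resolves type conflicts by widening. The function names (`infer_type`, `generate_schema`, `integrate_schemas`) and the sample values are hypothetical.

```python
from typing import Dict, List


def _parses_as(value: str, caster) -> bool:
    """Return True if caster(value) succeeds (caster is int or float)."""
    try:
        caster(value)
        return True
    except ValueError:
        return False


def infer_type(values: List[str]) -> str:
    """Guess a column type from its string values (assumed heuristic)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if non_empty and all(_parses_as(v, int) for v in non_empty):
        return "integer"
    if non_empty and all(_parses_as(v, float) for v in non_empty):
        return "float"
    return "string"


def normalize(name: str) -> str:
    """Lowercase a column name and collapse whitespace so variants match."""
    return " ".join(name.lower().split())


def generate_schema(header: List[str], rows: List[List[str]]) -> Dict[str, str]:
    """Build one schema (column name -> type) for a single extracted table."""
    schema = {}
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]  # tolerate short rows
        schema[normalize(name)] = infer_type(column)
    return schema


def integrate_schemas(schemas: List[Dict[str, str]]) -> Dict[str, str]:
    """Union the columns; on a type conflict keep the more general type."""
    rank = {"integer": 0, "float": 1, "string": 2}
    unified: Dict[str, str] = {}
    for schema in schemas:
        for name, typ in schema.items():
            if name not in unified or rank[typ] > rank[unified[name]]:
                unified[name] = typ
    return unified


# Illustrative input only: two small tables with overlapping columns.
s1 = generate_schema(["Item", "Count"], [["a", "10"], ["b", "20"]])
s2 = generate_schema(["item", "Count ", "Weight"], [["c", "1.5", "2.0"]])
unified = integrate_schemas([s1, s2])
```

Matching on normalized names alone is the simplest possible choice; it will miss synonymous columns (for example "Count" versus "Quantity"), which is the kind of case a real schema integration approach has to handle.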




Copyright information

© 2013 Springer-Verlag GmbH Berlin Heidelberg

About this paper

Cite this paper

Janga, P., Davis, K.C. (2013). Tabular Web Data: Schema Discovery and Integration. In: Bellatreche, L., Mohania, M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2013. Lecture Notes in Computer Science, vol 8057. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40131-2_3

  • DOI: https://doi.org/10.1007/978-3-642-40131-2_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40130-5

  • Online ISBN: 978-3-642-40131-2

  • eBook Packages: Computer Science, Computer Science (R0)
