Abstract
Web data such as web tables, lists, and data records from a wide variety of domains can be combined for purposes such as querying for information and creating example data sets. Locating and extracting tabular web data, and discovering and integrating its schemas, are important for combining, querying, and presenting the data effectively in a uniform format. We focus on schema generation and integration in both a static and a dynamic framework. We contribute algorithms for generating individual schemas from extracted tabular web data and for integrating the generated schemas. Our approach is novel because it provides functionality not previously addressed: it accommodates both the static and dynamic frameworks, different kinds of web data, schema discovery and unification, and table integration.
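To make the two steps named in the abstract concrete, the following is a minimal sketch of generating a schema per extracted table and then unifying the schemas. It is not the authors' algorithm, which the abstract does not describe. The function names (`normalize`, `infer_type`, `generate_schema`, `integrate_schemas`), the two-valued type heuristic, the name-based column matching, and the sample rows are all assumptions made for illustration.

```python
from typing import Dict, List


def normalize(name: str) -> str:
    """Lower-case a header and collapse whitespace so similar labels match."""
    return " ".join(name.lower().split())


def infer_type(values: List[str]) -> str:
    """Guess a column type from its cell strings (illustrative heuristic)."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "unknown"
    try:
        for v in non_empty:
            float(v.replace(",", ""))  # tolerate thousands separators
    except ValueError:
        return "text"
    return "number"


def generate_schema(header: List[str], rows: List[List[str]]) -> Dict[str, str]:
    """Build one schema (column name -> type) for a single extracted table."""
    schema = {}
    for i, name in enumerate(header):
        column = [row[i] for row in rows if i < len(row)]
        schema[normalize(name)] = infer_type(column)
    return schema


def integrate_schemas(schemas: List[Dict[str, str]]) -> Dict[str, str]:
    """Union the schemas; columns match by normalized name (an assumption)."""
    unified: Dict[str, str] = {}
    for schema in schemas:
        for name, typ in schema.items():
            if name not in unified or unified[name] == "unknown":
                unified[name] = typ
            elif typ != "unknown" and unified[name] != typ:
                unified[name] = "text"  # conflicting types fall back to text
    return unified


# Made-up sample tables, standing in for data extracted from two web pages.
table_a = generate_schema(["City", "Population"],
                          [["Springfield", "1,200"], ["Shelbyville", "950"]])
table_b = generate_schema(["city ", "Country"],
                          [["Springfield", "Freedonia"]])
unified = integrate_schemas([table_a, table_b])
```

Matching columns by normalized name alone is the simplest possible choice; it misses synonyms such as "town" and "city", which is the kind of case a real schema integration method has to handle.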
Copyright information
© 2013 Springer-Verlag GmbH Berlin Heidelberg
About this paper
Cite this paper
Janga, P., Davis, K.C. (2013). Tabular Web Data: Schema Discovery and Integration. In: Bellatreche, L., Mohania, M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2013. Lecture Notes in Computer Science, vol 8057. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40131-2_3
DOI: https://doi.org/10.1007/978-3-642-40131-2_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40130-5
Online ISBN: 978-3-642-40131-2
eBook Packages: Computer Science (R0)