Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

In this chapter we investigate the problem that underlies the concept of dataspaces: the need for human intervention in organizing unstructured data, that is, in recovering its structure. We survey the existing techniques behind dataspaces that aim to overcome that need, exploring the structure of a dataspace along three dimensions: dataspace profiling, querying and searching, and application domain. We then review existing projects that focus on dataspaces, on the induction of data structure from documents, and on data models in which data schema and document structure overlap, such as Apache Hadoop, Cassandra on Amazon Dynamo, the Google BigTable model and other DHT-based flexible data structures, Google Fusion Tables, iMeMex, U-DID, WebTables and Yahoo! SearchMonkey.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Atzori, M., Dessì, N. (2011). Dataspaces: Where Structure and Schema Meet. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_5

  • DOI: https://doi.org/10.1007/978-3-642-22913-8_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-22912-1

  • Online ISBN: 978-3-642-22913-8

  • eBook Packages: Engineering, Engineering (R0)
