Dataspaces: Where Structure and Schema Meet

Atzori, Maurizio; Dessì, Nicoletta

doi:10.1007/978-3-642-22913-8_5

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

In this chapter we investigate the crucial problem that underlies the concept of dataspaces: the need for human intervention in organizing unstructured data, that is, in obtaining its structure. We survey the existing techniques that dataspaces use to overcome that need, exploring the structure of a dataspace along three dimensions: dataspace profiling, querying and searching, and application domain. We further review existing projects that focus on dataspaces, on the induction of data structure from documents, and on data models in which data schema and document structure overlap, such as Apache Hadoop, Cassandra on Amazon Dynamo, the Google BigTable model and other DHT-based flexible data structures, Google Fusion Tables, iMeMex, U-DID, WebTables and Yahoo! SearchMonkey.

Author information

Authors and Affiliations

University of Cagliari, Italy
Maurizio Atzori & Nicoletta Dessì

Editor information

Editors and Affiliations

University of New York Tirana, Rr. Komuna E Parisit, Tirana, Albania
Marenglen Biba
Technical University of Catalonia, Campus Nord, Ed. Omega, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Fatos Xhafa

About this chapter

Cite this chapter

Atzori, M., Dessì, N. (2011). Dataspaces: Where Structure and Schema Meet. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_5

DOI: https://doi.org/10.1007/978-3-642-22913-8_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)
