From Metadata Catalogs to Distributed Data Processing for Smart City Platforms and Services: A Study on the Interplay of CKAN and Hadoop

  • Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 864)

Abstract

Smart Cities are emerging around the idea of provisioning and processing large amounts of urban data for a variety of use cases. Urban Data Platforms are typically employed to accumulate and expose governmental (i.e. public sector), sensor, static and real-time data, so that the community can create valuable applications and services for future Smart Cities. Until now, the Open Data initiative has been seen as the key driver for providing large amounts of data within a city. Open Data platforms employ so-called data registries to keep track of the datasets available at various sources spread throughout the city, with CKAN currently among the most popular data catalog software worldwide. With the emergence of frameworks for large-scale distributed computing and storage, such as Hadoop and its associated distributed file system (HDFS), there is an inherent need to bridge the worlds of metadata catalogs and distributed data processing in order to provide sophisticated urban ICT services. This paper is a first attempt in this new field: it prototypes and evaluates components that enable the interplay between CKAN and Hadoop/HDFS. The interplay is realized through extensions to CKAN and its harvesting process, and its benefits are demonstrated in accompanying case studies.
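
To make the idea of coupling a metadata catalog with HDFS more concrete, the following is a minimal sketch and not the plugin described in the paper. It assumes a dataset dictionary shaped like the result of CKAN's `package_show` action (a `name` plus a list of `resources`, each with `id` and `url`) and builds, for each resource, the WebHDFS `CREATE` URL to which the file could be uploaded. The function names, the `/ckan` base directory, the `ckan` user and the NameNode address are all hypothetical placeholders; no network request is made.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(namenode, hdfs_path, user, overwrite=True):
    """Build a WebHDFS CREATE URL for hdfs_path (no request is sent)."""
    query = urlencode({
        "op": "CREATE",
        "overwrite": "true" if overwrite else "false",
        "user.name": user,
    })
    return f"{namenode.rstrip('/')}/webhdfs/v1{quote(hdfs_path)}?{query}"


def plan_hdfs_imports(dataset, namenode, base_dir="/ckan", user="ckan"):
    """Map each resource of a CKAN dataset dict to (source_url, webhdfs_url).

    base_dir and user are illustrative defaults, not values from the paper.
    """
    plan = []
    for resource in dataset.get("resources", []):
        source = resource.get("url")
        if not source:
            continue  # skip resources that carry no downloadable URL
        target = f"{base_dir}/{dataset['name']}/{resource['id']}"
        plan.append((source, webhdfs_create_url(namenode, target, user)))
    return plan


if __name__ == "__main__":
    # Hypothetical dataset; in practice this would come from the catalog.
    example = {
        "name": "air-quality",
        "resources": [{"id": "r1", "url": "http://example.org/a.csv"}],
    }
    for source, target in plan_hdfs_imports(example, "http://namenode.example:50070"):
        print(source, "->", target)
```

An actual upload would additionally have to follow WebHDFS's two-step protocol, in which the NameNode answers the initial `PUT` with a redirect to a DataNode that then receives the file content.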

Author information

Corresponding author

Correspondence to Robert Scholz.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Scholz, R., Tcholtchev, N., Lämmel, P., Schieferdecker, I. (2018). From Metadata Catalogs to Distributed Data Processing for Smart City Platforms and Services: A Study on the Interplay of CKAN and Hadoop. In: Ferguson, D., Muñoz, V., Cardoso, J., Helfert, M., Pahl, C. (eds) Cloud Computing and Service Science. CLOSER 2017. Communications in Computer and Information Science, vol 864. Springer, Cham. https://doi.org/10.1007/978-3-319-94959-8_7

  • DOI: https://doi.org/10.1007/978-3-319-94959-8_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94958-1

  • Online ISBN: 978-3-319-94959-8

  • eBook Packages: Computer Science, Computer Science (R0)
