From Metadata Catalogs to Distributed Data Processing for Smart City Platforms and Services: A Study on the Interplay of CKAN and Hadoop

Scholz, Robert; Tcholtchev, Nikolay; Lämmel, Philipp; Schieferdecker, Ina

doi:10.1007/978-3-319-94959-8_7

Conference paper
First Online: 14 July 2018

Part of the book series: Communications in Computer and Information Science (CCIS, volume 864)

Abstract

Smart Cities are emerging around the idea of provisioning and processing large amounts of urban data for various use cases. Urban Data Platforms are typically employed to accumulate and expose large amounts of governmental (i.e. public sector), sensor, static and real-time data, enabling the community to create valuable applications and services for future Smart Cities. Until now, the Open Data initiative has been seen as the key driver for providing large amounts of data within a city. Open Data platforms employ so-called data registries to keep track of the datasets available at various sources spread throughout the city, with CKAN currently among the most popular data catalog software worldwide. With the emergence of frameworks for large-scale distributed computing and storage, such as Hadoop and its associated distributed file system (HDFS), there is an inherent need to bridge the worlds of metadata catalogs and distributed data processing in order to provide sophisticated urban ICT services. This paper constitutes a first attempt in this new field, prototyping and evaluating components that enable the interplay between CKAN and Hadoop/HDFS. This interplay is realized through extensions to CKAN and its harvesting process, and its benefits are demonstrated through accompanying case studies.

Author information

Authors and Affiliations

Fraunhofer Institute for Open Communication Systems (FOKUS), Berlin, Germany
Robert Scholz, Nikolay Tcholtchev, Philipp Lämmel & Ina Schieferdecker

Corresponding author

Correspondence to Robert Scholz.

Editor information

Editors and Affiliations

Columbia University, New York, New York, USA
Donald Ferguson
Escola d’Enginyeria, Barcelona, Spain
Víctor Méndez Muñoz
Departamento de Engenharia Informatica, Universidade da Coimbra, Coimbra, Portugal
Jorge Cardoso
Dublin City University, Dublin 9, Ireland
Markus Helfert
Free University of Bozen-Bolzano, Bolzano, Italy
Claus Pahl

About this paper

Cite this paper

Scholz, R., Tcholtchev, N., Lämmel, P., Schieferdecker, I. (2018). From Metadata Catalogs to Distributed Data Processing for Smart City Platforms and Services: A Study on the Interplay of CKAN and Hadoop. In: Ferguson, D., Muñoz, V., Cardoso, J., Helfert, M., Pahl, C. (eds) Cloud Computing and Service Science. CLOSER 2017. Communications in Computer and Information Science, vol 864. Springer, Cham. https://doi.org/10.1007/978-3-319-94959-8_7

DOI: https://doi.org/10.1007/978-3-319-94959-8_7
Published: 14 July 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94958-1
Online ISBN: 978-3-319-94959-8
eBook Packages: Computer Science, Computer Science (R0)