Getting to the Bottom of Data Lakes

Challenges and Research Gaps in Industrial Practice

  • Technical article
  • Published in: Datenbank-Spektrum

Abstract

Companies increasingly face the challenge of managing large, heterogeneous data and extracting the value it contains. In recent years, the data lake has therefore emerged as a novel concept for managing and using such complex data. However, companies that try to implement a data lake in practice run into a variety of challenges, such as contradictory definitions or vague and missing concepts. This article uses concrete projects of a globally operating industrial company to identify existing challenges and to derive requirements for data lakes. These requirements are then compared with the available literature on data lakes and with existing research approaches. The comparison reveals five major research gaps: (1) unclear data modeling methods, (2) a missing data lake reference architecture, (3) an incomplete metadata management concept, (4) an incomplete data lake governance concept, and (5) a missing holistic realization strategy.


Notes

  1. http://hadoop.apache.org/.


Author information

Corresponding author

Correspondence to Corinna Giebler.

About this article

Cite this article

Giebler, C., Gröger, C., Hoos, E. et al. Data Lakes auf den Grund gegangen. Datenbank Spektrum 20, 57–69 (2020). https://doi.org/10.1007/s13222-020-00332-0


  • DOI: https://doi.org/10.1007/s13222-020-00332-0
