Facilitating and Managing Machine Learning and Data Analysis Tasks in Big Data Environments Using Web and Microservice Technologies

Transactions on Large-Scale Data- and Knowledge-Centered Systems XLV

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 12390)

Abstract

Driven by recent advances in machine learning across a wide range of application areas, the need for easy-to-use frameworks that let novices and non-experts in data analytics apply machine learning effectively has increased dramatically. Furthermore, building machine learning models in Big Data environments remains a great challenge. The present article addresses these challenges by introducing a new generic framework that facilitates the training, testing, managing, storing and retrieving of machine learning models in the context of Big Data. The framework combines a powerful Big Data software stack, web technologies and a microservice architecture into a fully manageable and highly scalable solution. A highly configurable user interface hides platform details from the user and makes it easy to train, test and manage machine learning models. Moreover, the framework automatically indexes and characterizes models and allows flexible exploration of them in the visual interface. The performance and usability of the new framework are evaluated on state-of-the-art machine learning algorithms: executing, storing and retrieving machine learning models via the framework is shown to incur an acceptably low overhead, demonstrating that the framework can provide an efficient approach for facilitating machine learning in Big Data environments. The evaluation also examines how configuration options (e.g. caching of RDDs in Apache Spark) affect runtime performance. Finally, it provides indicators for when distributed computing (i.e. parallel computation) based on Apache Spark on a cluster outperforms execution of a machine learning model on a single computer.

Notes

  1. https://jupyter.org/
  2. https://www.automl.org/
  3. https://reactjs.org/
  4. https://www.npmjs.com/
  5. https://redux.js.org/
  6. https://redux-saga.js.org/
  7. https://react-bootstrap.github.io/
  8. https://hadoop.apache.org/docs/stable/
  9. https://spark.apache.org/docs/latest/rdd-programming-guide.html

Author information

Corresponding author

Correspondence to Shadi Shahoud.

Copyright information

© 2020 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Shahoud, S., Gunnarsdottir, S., Khalloof, H., Duepmeier, C., Hagenmeyer, V. (2020). Facilitating and Managing Machine Learning and Data Analysis Tasks in Big Data Environments Using Web and Microservice Technologies. In: Hameurlain, A., et al. Transactions on Large-Scale Data- and Knowledge-Centered Systems XLV. Lecture Notes in Computer Science, vol 12390. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-62308-4_6

  • DOI: https://doi.org/10.1007/978-3-662-62308-4_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-62307-7

  • Online ISBN: 978-3-662-62308-4

  • eBook Packages: Computer Science (R0)
