The Emerging Challenges of Big Data Lakes, and a Real-Life Framework for Representing, Managing and Supporting Machine Learning on Big Arctic Data

  • Conference paper
  • In: Advances in Intelligent Networking and Collaborative Systems (INCoS 2022)

Abstract

Given the evolving character of Big Data, new ways of managing data have become a requirement. The domain has attracted growing interest in recent years and has therefore been investigated for handling these massively generated data. In this respect, the concept of Data Lakes has proven promising. Any kind of data (structured, semi-structured or unstructured) can be ingested into a data lake, where processing is performed on a “lazy” basis: it is executed at the time of use, according to the actual needs of the user, and follows a schema-on-read approach. One pertinent application of data lakes concerns data collected during Arctic expeditions. These data vary widely in nature and in volume and are therefore well suited to data lakes. In this paper, we detail the challenges that arise from using Big Data Lakes together with machine learning to manage, on demand, the collected Big Arctic Data samples.
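As an illustration of the schema-on-read idea mentioned in the abstract, the following is a minimal Python sketch and not the framework presented in the paper. The names `RAW_ZONE` and `read_with_schema`, the station identifiers, and the temperature values are all hypothetical and chosen only for this example. Raw payloads are stored untouched, and a schema is applied only when a consumer iterates over the data.

```python
import csv
import io
import json

# Hypothetical raw zone of a data lake: payloads are stored as-is, with only
# a format tag; no schema is enforced at write time. All values are made up.
RAW_ZONE = [
    ("csv", "station,temp_c\nA1,-12.5\nB2,-30.1\n"),
    ("json", '{"station": "C3", "temp_c": -8.0, "note": "drifting ice"}'),
    ("text", "free-form field log with no tabular structure"),
]


def read_with_schema(raw_zone, schema):
    """Lazily yield records projected onto `schema` (field name -> type).

    Parsing and type casting happen only when the caller iterates over the
    generator, which is the essence of schema-on-read.
    """
    for fmt, payload in raw_zone:
        if fmt == "csv":
            rows = csv.DictReader(io.StringIO(payload))
        elif fmt == "json":
            rows = [json.loads(payload)]
        else:
            continue  # unstructured payloads stay in the lake untouched
        for row in rows:
            # Keep only records that contain every field the reader asked for.
            if all(name in row for name in schema):
                yield {name: cast(row[name]) for name, cast in schema.items()}


# The consumer decides the schema at read time, depending on its needs.
readings = read_with_schema(RAW_ZONE, {"station": str, "temp_c": float})
coldest = min(readings, key=lambda r: r["temp_c"])
print(coldest)  # {'station': 'B2', 'temp_c': -30.1}
```

A different consumer could pass another schema, for instance `{"note": str}`, over the same raw payloads without any change to how they were stored.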

A. Cuzzocrea: This research was carried out in the context of the Excellence Chair in Computer Engineering at LORIA, University of Lorraine, Nancy, France.

Acknowledgements

This research has been partially supported by the Arctic Research Foundation (ARF), Mitacs Inc., NSERC (Canada) and the University of Manitoba, and by the French PIA project “Lorraine Université d’Excellence”, reference ANR-15-IDEX-04-LUE.

Author information

Corresponding author

Correspondence to Alfredo Cuzzocrea.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Cuzzocrea, A., Leung, C.K., Soufargi, S., Olawoyin, A.M. (2022). The Emerging Challenges of Big Data Lakes, and a Real-Life Framework for Representing, Managing and Supporting Machine Learning on Big Arctic Data. In: Barolli, L., Miwa, H. (eds) Advances in Intelligent Networking and Collaborative Systems. INCoS 2022. Lecture Notes in Networks and Systems, vol 527. Springer, Cham. https://doi.org/10.1007/978-3-031-14627-5_16
