Abstract
As companies store, process, and analyse ever-larger volumes of highly heterogeneous data, novel research and technological challenges are emerging. Traditional, rigid data integration and processing techniques are inadequate for a new class of data-intensive applications. New architectural, software, and hardware solutions are needed that provide dynamic data integration, assure high data quality, and offer safety and security mechanisms, while facilitating online data analysis. In this context, we propose moduli, a novel disaggregated data management reference architecture for data-intensive applications that organizes data processing into zones. Working on moduli also allowed us to identify open research and technological challenges.
Index Terms
- moduli: A Disaggregated Data Management Architecture for Data-Intensive Workflows