Preface
Guest Editorial – DaWaK 2019 Special Issue – Evolving Big Data Analytics Towards Data Science

https://doi.org/10.1016/j.datak.2020.101838

Introduction

Data warehouses evolved through a disruptive transformation to become Big Data repositories (so-called Data Lakes), where data exploded in Volume, Velocity and Variety. This evolution brought faster approaches, tools and algorithms to load data, query beyond SQL with semantics, mix text with tables, process streams and compute machine learning models. A later revolution brought Veracity, in the presence of contradictory information, and even Value, given the many options and the investment required to exploit big data. Nevertheless, having so much information in a central repository enabled more sophisticated exploratory analysis, beyond multivariate statistics and queries. Over time, people realized that managing so much diverse data required not only database technology, but also a more principled approach, with foundations on one hand in mathematics (probability, machine learning, numerical optimization) and on the other hand in more abstract, highly analytic programming (combining multiple languages, pre-processing data, integrating diverse data sources). This new trend is not another fad: data science is now considered a discipline competing with computer science and applied mathematics.

The Data Warehousing and Knowledge Discovery (DaWaK) conference is the “child” of the marriage between data warehousing and knowledge discovery. DaWaK was launched in 1999 with the aim of bringing together researchers, analysts and developers to discuss research issues and experience in developing and deploying data warehousing and knowledge discovery systems, applications, and tools. From 1999 until 2014, the DaWaK conference series received and accepted papers on the topics covered by these two technologies. In 2015, the first part of the name was replaced by Big Data Analytics, and the conference became Big Data Analytics and Knowledge Discovery, while keeping the now well-established DaWaK acronym. With this change, starting in 2015 the scope was expanded to accept big data papers, a trend which is now morphing again given the explosion of data, but also the evolution of software and hardware. Towards 2020, DaWaK made another major leap forward to become a data science conference, heeding not only the big-volume aspect, but also looking back at its data warehousing and data mining roots.

This special issue contains selected papers from the 21st International Conference on Big Data Analytics and Knowledge Discovery (DaWaK 2019). These papers reflect an expanded scope truly focusing on big data analytics and large-scale data science, rather than the ultra-popular trend of today: machine learning on benchmark (but generally small) data sets. DaWaK 2019 attracted 61 submissions, from which 22 papers were accepted. After their presentation at DaWaK in Linz, Austria, August 26–29, 2019, and further discussion among the PC Chairs, we invited 4 of those 22 papers to this special issue, with a strict requirement to extend each paper with at least 40% new content and to carefully consider the conference reviewers' feedback. Following our initiative from DaWaK 2018, we made no distinction between full and short papers in order to select novel, but still good, promising papers. This initiative put all papers on level ground. Our goal was basically to avoid republishing full papers with minor technical extensions, which does not help advance the field. Our paper selection was based mainly on the presentation at the conference (clear contribution), the reviews (especially at least one strong accept), and the authors' response to reviewers during the conference (an extra slide). That is, we gave authors an opportunity to improve their work considering the reviewers' suggestions in order to write a novel, high-quality journal article. After the second round of reviews, only two papers made the final cut. These papers provide a glimpse of important research issues in Big Data Analytics and Data Science today: a new architecture to store big data, and process modeling.

Here we provide a summary of the two selected papers:

  • The first article [1], titled “Design and Implementation of ETL Processes using BPMN and Relational Algebra”, formalizes and extends ETL workflows with elegant BPMN diagrams, using the relational database model as a theoretical foundation.

  • The second article [2], titled “Mo.Re.Farming: A Hybrid Architecture for Tactical and Strategic Precision Agriculture”, presents an interesting big data architecture to capture spatial data sets and provide decision support for farmers in Italy.

Conclusions

Big data has brought a new research angle, including doing without a database model [3], innovative storage beyond rows (e.g. columns, arrays [4]), and scale-out parallel processing [5]. Many assumptions based on a centralized data warehouse or a rigid database have been weakened or have even disappeared. It is fair to say that big data analytics has evolved, leaving data warehousing and data mining research as history and paving the way for data science. The papers included in this special issue show this trend.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Carlos Ordonez received a Ph.D. degree in Computer Science from the Georgia Institute of Technology, USA, in 2000. Dr. Ordonez joined the University of Houston in 2006, where he conducts research on scalable data science systems and parallel data processing. He worked for 8 years extending the Teradata DBMS with machine learning and cube techniques. He was a visiting researcher at MIT from 2014 to 2016, working on array and columnar parallel database systems, and worked as a research scientist at AT&T Labs from 2014 to 2015, focusing on the R language. His research is centered on large-scale data science, parallel database systems and big data. Dr. Ordonez's research has produced over 120 papers with over 3500 citations, and has been funded by NSF grants.

Il-Yeol Song is a professor in the College of Computing and Informatics at Drexel University. He served as Deputy Director of the NSF-sponsored research center on Visual and Decision Informatics (CVDI) between 2012 and 2014. His research topics include conceptual modeling, data warehousing, big data management and analytics, and smart aging. He is an ACM Distinguished Scientist and an ER Fellow, and the recipient of the 2015 Peter P. Chen Award in Conceptual Modeling. Dr. Song has published over 200 peer-reviewed papers in data management areas. He is a co-Editor-in-Chief of the Journal of Computing Science and Engineering (JCSE) and a Consulting Editor for Data and Knowledge Engineering. He won the Best Paper Award at IEEE CIBCB 2004, as well as four teaching awards from Drexel, including the most prestigious, the Lindback Distinguished Teaching Award. Dr. Song served as Steering Committee chair of the ER conference between 2010 and 2012. He delivered keynote speeches on big data at the First Asia-Pacific iSchool Conference in 2014, the ACM SAC 2015 conference, the ER 2015 conference, EDB 2016, and the A-LIEP 2016 conference.