Abstract
In recent years, the amount of data has grown extensively. In companies, spreadsheets are a common tool for data processing and statistical analysis. However, spreadsheet applications reach their limits, especially when working with massive amounts of data. To cope with this issue, we introduce a human-in-the-loop approach for scalable data preprocessing using sampling. In contrast to state-of-the-art approaches, we also consider conflict resolution and recommendations based on data not contained in the sample itself. We implemented a fully functional prototype and conducted a user study with 12 participants. We show that our approach achieves significantly higher error correction than comparable approaches that consider only the sample dataset.
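The general idea of sample-based preprocessing with feedback from the full dataset can be illustrated with a minimal sketch. This is not the authors' prototype: the helper names (`draw_sample`, `unseen_values`, `apply_corrections`), the row format, and the example data are all assumptions made for illustration. A domain expert defines corrections on a small sample, the corrections are applied to the full dataset, and values that never appeared in the sample are reported so they could be offered as recommendations.

```python
import random
from collections import Counter

def draw_sample(rows, k, seed=0):
    """Simple random sample without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))

def unseen_values(rows, sample, column):
    """Count values of `column` that occur in the full data but not in the sample."""
    seen = {row[column] for row in sample}
    return Counter(row[column] for row in rows if row[column] not in seen)

def apply_corrections(rows, column, corrections):
    """Apply a value -> corrected-value mapping to every row, returning new rows."""
    return [{**row, column: corrections.get(row[column], row[column])} for row in rows]

# Illustrative data only: one column with two different misspellings.
rows = [{"city": "Stuttgart"}] * 6 + [{"city": "Stuttgrat"}] * 3 + [{"city": "Stutgart"}]
sample = draw_sample(rows, k=4)

# Corrections the expert defined while looking only at the sample (assumed).
corrections = {"Stuttgrat": "Stuttgart"}
cleaned = apply_corrections(rows, "city", corrections)

# Values still present in the full data that the corrected sample never showed.
cleaned_sample = apply_corrections(sample, "city", corrections)
print(unseen_values(cleaned, cleaned_sample, "city"))
```

Which values end up being reported depends on the rows that happen to fall into the sample, which is exactly why an approach restricted to the sample can miss errors that exist only in the remaining data.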
Notes
1. OpenRefine: https://openrefine.org
2. KNIME: https://www.knime.com
3. RapidMiner: https://rapidminer.com
4. Trifacta Wrangler: https://www.trifacta.com
Acknowledgements
This research was performed in the project ‘IMPORT’ as part of the Software Campus program, which is funded by the German Federal Ministry of Education and Research (BMBF) under Grant No.: 01IS17051.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Behringer, M., Hirmer, P., Fritz, M., Mitschang, B. (2020). Empowering Domain Experts to Preprocess Massive Distributed Datasets. In: Abramowicz, W., Klein, G. (eds) Business Information Systems. BIS 2020. Lecture Notes in Business Information Processing, vol 389. Springer, Cham. https://doi.org/10.1007/978-3-030-53337-3_5
DOI: https://doi.org/10.1007/978-3-030-53337-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-53336-6
Online ISBN: 978-3-030-53337-3
eBook Packages: Computer Science, Computer Science (R0)