Abstract
Data curation processes comprise a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects, because datasets are often external, re-purposed and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and allows errors to propagate into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically those related to data quality detection, and how a robust and effective data curation process can be built by learning from the wisdom of the crowd. Using a purpose-designed data curation platform based on iPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behaviour data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.
Notes
1. Trifacta, https://www.trifacta.com/.
2. Tamr Agile Data Unification and Management Systems, https://www.tamr.com/.
3. Talend Open Studio, https://www.talend.com/products/talend-open-studio/.
4. Parallel Data Generation Framework (PDGF), https://www.bankmark.de/products-and-services/.
5. See http://130.102.97.188/caise20/goldenNotebook/ for our “golden notebook”.
6. P2 got 1.0 for recall while P14 got 0.5 in this simulation.
7. In this case, we should choose the code that achieves 1.0 recall, so that the coverage of the errors can be guaranteed.
8. The four participants with 1.0 recall in this task (i.e., P3, P6, P50 and P58) have provided the code performing exactly the same functionality.
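The recall-based selection described in notes 6 and 7 can be illustrated with a minimal sketch. It assumes that each participant's detection code outputs a set of flagged row identifiers and that a ground-truth set of erroneous rows is available; the names `recall`, `true_errors` and `detected_by`, and all values below, are hypothetical and are not taken from the experiment.

```python
def recall(detected, true_errors):
    """Fraction of the known erroneous rows that a detection script flagged."""
    true_errors = set(true_errors)
    if not true_errors:
        raise ValueError("recall is undefined without any known errors")
    return len(set(detected) & true_errors) / len(true_errors)


# Hypothetical ground truth: row ids known to contain a data quality problem.
true_errors = {3, 7, 12, 18}

# Hypothetical outputs of two participants' detection code (not study data).
detected_by = {
    "A": {3, 7, 12, 18, 21},  # flags every known error plus one extra row
    "B": {3, 12},             # flags only half of the known errors
}

# Score each participant and keep the one whose code covers the most errors.
scores = {name: recall(rows, true_errors) for name, rows in detected_by.items()}
best = max(scores, key=scores.get)
```

Selecting on recall alone favours coverage of errors, as note 7 argues, and ignores extra rows flagged by mistake; a precision check would be needed to account for those.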
Acknowledgement
This work is partly supported by ARC Discovery Project DP190102141 on Building Crowd Sourced Data Curation Processes.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, T., Han, L., Demartini, G., Indulska, M., Sadiq, S. (2020). Building Data Curation Processes with Crowd Intelligence. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_3
DOI: https://doi.org/10.1007/978-3-030-58135-0_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58134-3
Online ISBN: 978-3-030-58135-0
eBook Packages: Computer Science, Computer Science (R0)