Building Data Curation Processes with Crowd Intelligence

Chen, Tianwa; Han, Lei; Demartini, Gianluca; Indulska, Marta; Sadiq, Shazia

doi:10.1007/978-3-030-58135-0_3

Conference paper
First Online: 28 August 2020

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 386)

Abstract

Data curation processes comprise a number of activities, such as transforming, filtering, or de-duplicating data. These processes consume an excessive amount of time in data science projects because datasets are often external, re-purposed, and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and in potential errors propagating into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically those related to data quality detection, and how to build a robust and effective data curation process by learning from the wisdom of the crowd. With the help of a purpose-designed data curation platform based on iPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behaviour data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.

Notes

1. Trifacta, https://www.trifacta.com/.
2. Tamr Agile Data Unification and Management Systems, https://www.tamr.com/.
3. Talend Open Studio, https://www.talend.com/products/talend-open-studio/.
4. Parallel Data Generation Framework (PDGF), https://www.bankmark.de/products-and-services/.
5. See http://130.102.97.188/caise20/goldenNotebook/ for our “golden notebook”.
6. P2 got 1.0 for recall while P14 got 0.5 in this simulation.
7. In this case, we should choose the code that achieves 1.0 recall, so that the coverage of the errors can be guaranteed.
8. The four participants with 1.0 recall in this task (i.e., P3, P6, P50 and P58) have provided the code performing exactly the same functionality.

Acknowledgement

This work is partly supported by ARC Discovery Project DP190102141 on Building Crowd Sourced Data Curation Processes.

Author information

Authors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia
Tianwa Chen, Lei Han, Gianluca Demartini & Shazia Sadiq
Business School, The University of Queensland, Brisbane, Australia
Marta Indulska

Corresponding author

Correspondence to Tianwa Chen.

Editor information

Editors and Affiliations

Université Paris1 Panthéon-Sorbonne, Paris, France
Nicolas Herbaut
University of Melbourne, Melbourne, VIC, Australia
Marcello La Rosa

About this paper

Cite this paper

Chen, T., Han, L., Demartini, G., Indulska, M., Sadiq, S. (2020). Building Data Curation Processes with Crowd Intelligence. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_3

DOI: https://doi.org/10.1007/978-3-030-58135-0_3
Published: 28 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58134-3
Online ISBN: 978-3-030-58135-0
eBook Packages: Computer Science, Computer Science (R0)
