Building Data Curation Processes with Crowd Intelligence

  • Conference paper

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 386)

Abstract

Data curation processes comprise a number of activities, such as transforming, filtering or de-duplicating data. These processes consume an excessive amount of time in data science projects, because datasets are often external, re-purposed and generally not ready for analytics. Overall, data curation processes are difficult to automate and require human input, which results in a lack of repeatability and potential errors propagating into analytical results. In this paper, we explore a crowd intelligence-based approach to building robust data curation processes. We study how data workers engage with data curation activities, specifically related to data quality detection, and how to build a robust and effective data curation process by learning from the wisdom of the crowd. With the help of a purpose-designed data curation platform based on iPython Notebook, we conducted a lab experiment with data workers and collected a multi-modal dataset that includes measures of task performance and behaviour data. Our findings identify avenues by which effective data curation processes can be built through crowd intelligence.

Notes

  1. Trifacta, https://www.trifacta.com/.

  2. Tamr Agile Data Unification and Management Systems, https://www.tamr.com/.

  3. Talend Open Studio, https://www.talend.com/products/talend-open-studio/.

  4. Parallel Data Generation Framework (PDGF), https://www.bankmark.de/products-and-services/.

  5. See http://130.102.97.188/caise20/goldenNotebook/ for our “golden notebook”.

  6. P2 got 1.0 for recall while P14 got 0.5 in this simulation.

  7. In this case, we should choose the code that achieves 1.0 recall, so that the coverage of the errors can be guaranteed.

  8. The four participants with 1.0 recall in this task (i.e., P3, P6, P50 and P58) have provided the code performing exactly the same functionality.
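
The recall-based selection mentioned in notes 6 and 7 can be illustrated with a minimal sketch. It is not the authors' implementation: the row identifiers, the worker names and the tie-breaking rule are all hypothetical, and it assumes that each worker's detection code yields a set of flagged row IDs that can be compared against a known set of injected errors.

```python
from typing import Dict, Set


def recall(flagged: Set[int], true_errors: Set[int]) -> float:
    """Fraction of the known erroneous rows that the detection code flagged."""
    if not true_errors:
        # Nothing to find; treated as full coverage by convention in this sketch.
        return 1.0
    return len(flagged & true_errors) / len(true_errors)


def best_by_recall(candidates: Dict[str, Set[int]], true_errors: Set[int]) -> str:
    """Pick the candidate with the highest recall.

    Ties are broken in favour of the candidate that flags fewer rows
    (an assumption of this sketch, not a rule stated in the paper).
    """
    return max(
        candidates,
        key=lambda name: (recall(candidates[name], true_errors), -len(candidates[name])),
    )


# Hypothetical data: rows 3 and 7 contain injected errors.
true_errors = {3, 7}
candidates = {
    "worker_a": {3, 7, 9},  # finds both errors, plus one false positive
    "worker_b": {3},        # finds only one of the two errors
}

scores = {name: recall(rows, true_errors) for name, rows in candidates.items()}
chosen = best_by_recall(candidates, true_errors)
print(scores, chosen)
```

Ranking on recall first reflects the reasoning in note 7: a candidate that misses errors cannot guarantee coverage, whereas false positives can still be reviewed afterwards.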

Acknowledgement

This work is partly supported by ARC Discovery Project DP190102141 on Building Crowd Sourced Data Curation Processes.

Author information

Corresponding author

Correspondence to Tianwa Chen.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, T., Han, L., Demartini, G., Indulska, M., Sadiq, S. (2020). Building Data Curation Processes with Crowd Intelligence. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_3

  • DOI: https://doi.org/10.1007/978-3-030-58135-0_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58134-3

  • Online ISBN: 978-3-030-58135-0

  • eBook Packages: Computer Science, Computer Science (R0)
