Skip to main content

Advertisement

Log in

A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis

  • Published:
Mobile Networks and Applications Aims and scope Submit manuscript

Abstract

Supervised Machine Learning (ML) requires that smart algorithms scrutinize a very large number of labeled samples before they can make right predictions. And this is not always true either. In our experience, in fact, a neural network trained with a huge database comprised of over fifteen million water meter readings had essentially failed to predict when a meter would malfunction/need disassembly based on a history of water consumption measurements. With a second step, we developed a methodology, based on the enforcement of a specialized data semantics, that allowed us to extract only those samples for training that were not noised by data impurities. With this methodology, we re-trained the neural network up to a prediction accuracy of over 80%. Yet, we simultaneously realized that the new training dataset was significantly different from the initial one in statistical terms, and much smaller, as well. We had reached a sort of paradox: We had alleviated the initial problem with a better interpretable model, but we had changed the replicated form of the initial data. To reconcile that paradox, we further enhanced our data semantics with the contribution of field experts. This has finally led to the extrapolation of a training dataset truly representative of regular/defective water meters and able to describe the underlying statistical phenomenon, while still providing an excellent prediction accuracy of the resulting classifier. At the end of this path, the lesson we have learnt is that a human-in-the-loop approach may significantly help to clean and re-organize noised datasets for an empowered ML design experience.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Figure 4

Similar content being viewed by others

References

  1. Pettersen L (2018) Why artificial intelligence will not outsmart complex knowledge work. Work, Employment and Society. Sage. To appear

  2. Jordan MI, Mitchell TM (2015) Machine learning: trends, perspectives, and prospects. Science 349(6245):255–260

    Article  MathSciNet  Google Scholar 

  3. Delnevo G, Roccetti M, Mirri S (2019) Intelligent and good machines? The role of domain and context codification, Mobile networks and applications, Elsevier. To appear

  4. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann

  5. Alkowaileet W, Alsubaiee S, Carey M, Li C, Ramampiaro H, Sinthong P, Wang X (2018) Enhancing big data with semantics: the AsterixDB approach. In Proc. of 12th IEEE international conference on semantic computing, 314-315. IEEE

  6. Emani CK, Cullot N, Nicolle C (2015) Understandable big data: a survey. Comput Sci Rev 17:70–81

    Article  MathSciNet  Google Scholar 

  7. Casini L, Delnevo G, Roccetti M, Zagni N, Cappiello G (2019, August) Deep water: predicting water meter failures through a human-machine intelligence collaboration. In international conference on human interaction and emerging technologies (pp. 688-694). Springer, Cham

  8. Roccetti M, Delnevo G, Casini L, Zagni N, Cappiello G (2019, September). A paradox in ML design: less data for a smarter water metering cognification experience. In proceedings of the 5th EAI international conference on smart objects and Technologies for Social Good (pp. 201-206). ACM

  9. Roccetti M, Delnevo G, Casini L, Cappiello G (2019) Is bigger always better? A controversial journey to the center of machine learning design, with uses and misuses of big data for predicting water meter failures. J Big Data 6(1):70

    Article  Google Scholar 

  10. Wang RY, Storey VC, Firth CP (1995) A framework for analysis of data quality research. IEEE Trans Knowl Data Eng 4:623–640

    Article  Google Scholar 

  11. ISO 8000-8:2015, https://www.iso.org/obp/ui/#iso:std:iso:8000:-8:ed-1:v1:en

  12. Juran J, Godfrey AB (1999) Quality handbook. Republished McGraw-Hill, 173-178

  13. Kodra Y, De La Paz MP, Coi A, Santoro M, Bianchi F, Ahmed F, ... Taruscio D (2017) Data quality in rare diseases registries. In rare diseases epidemiology: update and overview (pp. 149–164). Springer, Cham

  14. Scannapieco M, Missier P, Batini C (2005) Data quality at a glance. Datenbank-Spektrum, 14(January), 6–14

  15. Sidi F, Panahy PHS, Affendey LS, Jabar MA, Ibrahim H, Mustapha A (2012, March). Data quality: a survey of data quality dimensions. In 2012 international conference on Information Retrieval & Knowledge Management (pp. 300-304). IEEE

  16. Pipino LL, Lee YW, Wang RY (2002) Data quality assessment. Commun ACM 45(4):211–218

    Article  Google Scholar 

  17. Cai L, Zhu Y (2015) The challenges of data quality and data quality assessment in the big data era. Data Sci J 14

  18. Chen H, Hailey D, Wang N, Yu P (2014) A review of data quality assessment methods for public health information systems. Int J Environ Res Public Health 14;11(5):5170–5207. https://doi.org/10.3390/ijerph110505170

  19. Chen JV, Su BC, Widjaja AE (2016) Facebook C2C social commerce: a study of online impulse buying. Decis Support Syst 83:57–69

    Article  Google Scholar 

  20. Von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P (2002) Comparative assessment of large-scale data sets of protein–protein interactions. Nature 417(6887):399

    Article  Google Scholar 

  21. Burggräf P, Dannapfel M, Förstmann R, Adlon T, Fölling C (2018, January). Data quality-based process enabling: application to logistics supply processes in low-volume ramp-up context. In 2018 international conference on information management and processing (ICIMP) (pp. 36-41). IEEE

  22. Breck E, Polyzotis N, Roy S, Whang SE, Zinkevich M (2018, January). Data Infrastructure for Machine Learning. In SysML Conference

  23. Sessions V, Valtorta M (2006) The effects of data quality on machine learning algorithms. ICIQ

  24. Foidl H, Felderer M (2019, August). Risk-based data validation in machine learning-based software systems. In proceedings of the 3rd ACM SIGSOFT international workshop on machine learning techniques for software quality evaluation (pp. 13-18). ACM

  25. Wang RY, Strong DM (1996) Beyond accuracy: what data quality means to data consumers. J Manag Inf Syst 12(4):5–33

    Article  Google Scholar 

Download references

Acknowledgements

We are indebted towards the company that has provided us with the data of interest. Our thanks also go to Giuseppe Cappiello and Nicolò Zagni (University of Bologna) for their participation to a previous phase of this research activity.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marco Roccetti.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Roccetti, M., Delnevo, G., Casini, L. et al. A Cautionary Tale for Machine Learning Design: why we Still Need Human-Assisted Big Data Analysis. Mobile Netw Appl 25, 1075–1083 (2020). https://doi.org/10.1007/s11036-020-01530-6

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11036-020-01530-6

Keywords

Navigation