ML-PipeDebugger: A Debugging Tool for Data Processing Pipelines

  • Conference paper
Database and Expert Systems Applications (DEXA 2019)

Abstract

Data pre-processing for data analysis usually requires a considerable number of interdependent steps, many of which are prone to errors or may introduce unwanted biases. Such errors can lead to cases where predictions for similar data instances differ by an unexpectedly large margin. An important question is then where in the data processing pipeline the deviation originated. We present a tool that helps identify critical data processing steps, making it possible to “debug” or improve data pre-processing and model generation. More generally, the tool gives a view of how different data instances behave in relation to each other throughout a pipeline. The task of identifying critical steps turns out to be rather complex, mostly because features of different types and ranges have to be compared, because the required statistical measures must often be obtained from small samples, and because time series can be involved.
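The abstract does not specify how the tool locates a critical step, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes purely numeric features, a pipeline given as a list of named functions, and a small reference sample used to scale each feature by its range after every step. All names (`scaled_distance`, `trace_divergence`, `most_critical_step`) and the toy pipeline are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Step = Tuple[str, Callable[[Vector], Vector]]


def feature_ranges(rows: Sequence[Vector]) -> List[float]:
    """Range (max - min) of each feature column in the reference sample."""
    return [max(col) - min(col) for col in zip(*rows)]


def scaled_distance(a: Vector, b: Vector, ranges: Sequence[float]) -> float:
    """Mean absolute difference, with each feature divided by its range."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        total += abs(x - y) / r if r > 0 else 0.0
    return total / len(a)


def trace_divergence(
    steps: Sequence[Step], a: Vector, b: Vector, sample: Sequence[Vector]
) -> List[Tuple[str, float]]:
    """Distance between two instances at the input and after every step."""
    trace = [("input", scaled_distance(a, b, feature_ranges(sample)))]
    for name, fn in steps:
        a, b = fn(a), fn(b)
        sample = [fn(row) for row in sample]  # keep ranges in step with the data
        trace.append((name, scaled_distance(a, b, feature_ranges(sample))))
    return trace


def most_critical_step(trace: Sequence[Tuple[str, float]]) -> str:
    """Name of the step after which the distance grew the most."""
    jumps = [(trace[i][1] - trace[i - 1][1], trace[i][0])
             for i in range(1, len(trace))]
    return max(jumps, key=lambda jump: jump[0])[1]


# Toy pipeline (assumption): a rescaling step followed by a hard threshold.
steps: List[Step] = [
    ("scale", lambda v: [2.0 * x for x in v]),
    ("threshold", lambda v: [1.0 if v[0] > 1.0 else 0.0] + v[1:]),
]
sample = [[0.0, 10.0], [1.0, 20.0], [0.4, 15.0]]
a, b = [0.49, 12.0], [0.51, 12.0]  # two nearly identical instances

trace = trace_divergence(steps, a, b, sample)
critical = most_critical_step(trace)
```

In this toy setting the rescaling step leaves the scaled distance unchanged, while the threshold pushes the two instances onto opposite sides of a cut-off, so `critical` names the threshold step. Real pipelines with categorical features, small samples, or time series would need more careful distance and range estimates, as the abstract points out.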

The research reported in this paper has been supported by the Austrian Ministry for Transport, Innovation and Technology, the Federal Ministry for Digital and Economic Affairs, and the Province of Upper Austria in the frame of the COMET center SCCH.

Author information

Corresponding author

Correspondence to Felix Kossak.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kossak, F., Zwick, M. (2019). ML-PipeDebugger: A Debugging Tool for Data Processing Pipelines. In: Hartmann, S., Küng, J., Chakravarthy, S., Anderst-Kotsis, G., Tjoa, A., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2019. Lecture Notes in Computer Science, vol 11707. Springer, Cham. https://doi.org/10.1007/978-3-030-27618-8_20

  • DOI: https://doi.org/10.1007/978-3-030-27618-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-27617-1

  • Online ISBN: 978-3-030-27618-8

  • eBook Packages: Computer Science, Computer Science (R0)
